Groq Blog · 02:03, two days ago
Groq Lowers GPT-OSS Model Prices and Rolls Out Prompt Caching

Groq has announced two updates to its GPT-OSS models aimed at improving the price-performance of AI inference. First, prices for GPT-OSS models on GroqCloud have been lowered to improve cost efficiency, applying retroactively to unpaid invoices for October 2025. Second, Groq is rolling out prompt caching for GPT-OSS models, which offers up to a 50% discount on tokens in identical input prefixes, significantly reduces latency, and makes rate limits stretch further. The feature requires no configuration, applies automatically to API requests, is already live on GPT-OSS-20B, and is coming soon to GPT-OSS-120B. These updates benefit use cases such as RAG platforms, agentic applications, eval pipelines, and chatbots, further empowering developers.

💰 **Lower GPT-OSS model prices**: Groq has significantly reduced prices for its GPT-OSS models on GroqCloud, giving developers a more cost-effective AI inference option. The new prices take effect immediately and apply retroactively to unpaid invoices for October 2025, so all users benefit from the savings.

🚀 **Prompt caching launch**: Groq is introducing prompt caching for its GPT-OSS models. By recognizing and caching identical input token prefixes, the feature offers up to a 50% discount on those tokens, significantly cutting inference cost while also sharply reducing request latency.

⚡ **Better developer efficiency and availability**: Cached tokens do not count toward GroqCloud rate limits, so developers' rate-limit quotas stretch further. The feature requires no code changes and applies automatically to API requests, greatly simplifying integration and letting developers focus on building.

💡 **Broad applicability**: Prompt caching is especially suited to workloads with stable, reusable prompt components, such as RAG platforms, agentic applications, eval pipelines, and chatbots. Their repetitive prompt structures (system prompts, tool definitions, few-shot examples, and so on) can exploit the cache for faster responses at lower cost.

At Groq, we’re relentless about fueling developers with the best price‑performance for AI inference. Today we’re starting to roll out two updates for GPT-OSS models that make building at scale faster, cheaper, and simpler.

New, Lower Prices for GPT‑OSS Models

We’ve reduced the price for gpt-oss models on GroqCloud to ensure all developers can ignite their applications with increased cost efficiency. These new prices are effective today and will apply retroactively to all unpaid invoices for the month of October 2025.

Prompt Caching on GPT-OSS Models

We’re rolling out prompt caching on our GPT-OSS models. Last month we quietly rolled out prompt caching on GPT-OSS-20B, and we’ll be rolling it out on GPT-OSS-120B over the next few weeks. What this means for developers using these models:

- **Up to 50% discount on cached tokens** – All input tokens in the identical prefix get a 50% discount; tokens after the first difference between prompts are charged at full price.
- **Significantly lower latency** – Latency drops for any request that shares an identical token prefix with a recent request.
- **Rate limits go farther** – Cached tokens don't count toward GroqCloud rate limits.
- **Zero configuration** – Pricing updates and prompt caching work automatically on every API request; no code changes are required (see the sketch below).
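As a concrete illustration of the zero-configuration point, here is a minimal sketch using the official `groq` Python SDK. The system prompt, questions, and support-assistant framing are placeholder assumptions; the point is simply that two requests sharing an identical leading message sequence form a cacheable prefix with no extra parameters.

```python
# Minimal sketch: prompt caching requires no client-side configuration.
# Assumes the official `groq` Python SDK; prompt contents are hypothetical.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

# A long, stable system prompt like this is a typical cacheable prefix.
SYSTEM_PROMPT = "You are a support assistant for ExampleCorp. ..."

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[
            # Identical leading messages form the cacheable prefix.
            {"role": "system", "content": SYSTEM_PROMPT},
            # Only the user turn varies between requests.
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# The first call primes the cache; the second shares the system-prompt
# prefix, so its cached input tokens are discounted automatically.
print(ask("How do I reset my password?"))
print(ask("How do I export my data?"))
```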

The 50% discount on cached input tokens, alongside the reduced pricing for GPT-OSS models, improves cost efficiency for inference workloads.

Note: there is no extra fee for the caching feature itself; the discount applies only when a cache hit occurs.
| Model | Input Price for Uncached Tokens (per 1M tokens) | Input Price for Cached Tokens (per 1M tokens) |
| --- | --- | --- |
| openai/gpt-oss-120b | $0.15 | $0.075 |
| openai/gpt-oss-20b | $0.075 | $0.0375 |
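To make the table concrete, here is a quick back-of-the-envelope check. The token counts are hypothetical; only the per-million prices come from the table above.

```python
# Worked cost example using the gpt-oss-120b prices from the table.
UNCACHED_PER_M = 0.15   # $ per 1M uncached input tokens
CACHED_PER_M = 0.075    # $ per 1M cached input tokens (50% off)

prompt_tokens = 10_000  # hypothetical request size
cached_tokens = 8_000   # hypothetical shared-prefix cache hit

cost = (cached_tokens * CACHED_PER_M
        + (prompt_tokens - cached_tokens) * UNCACHED_PER_M) / 1_000_000
print(f"${cost:.6f} per request")  # $0.000900 vs $0.001500 fully uncached
```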

At Cluely, we specialize in real-time AI, where latency is critical. We already leverage Groq for our most time-sensitive generations, and implementing prompt caching will not only accelerate our product but also enable entirely new use cases. With an average of 92% prompt reuse across our generations, prompt caching will be game-changing for both speed and quality.

— Guilherme Garibaldi, Founding Engineer, Cluely

Why Prompt Caching Matters on GPT-OSS Models

Prompt caching makes AI workflows faster and lower cost. It’s ideal for any workflow with stable, reusable prompt components and works automatically on every API request. Here are a few example use cases well suited for GPT-OSS models that can benefit from prompt caching:

- **RAG platforms & data apps**: Long system prompts + retrieval templates are cached and reused across queries (see the prompt-layout sketch below).
- **Agentic applications**: Repetitive tool/function calls and few-shot examples are reused across calls.
- **Eval pipelines**: Identical prompts across large datasets are served from cache.
- **Chatbots**: Persistent brand/style and policy preambles are cached and reused across conversations.
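One way a RAG-style app can lean into this is simply to order prompt components so the stable pieces come first and the per-request pieces come last. The sketch below is illustrative; the helper name, system prompt, and message layout are our own assumptions, not a Groq API.

```python
# Sketch: put stable prompt components first so they form a reusable prefix.
STATIC_SYSTEM = "You answer questions using only the provided context."
FEW_SHOT = [
    {"role": "user", "content": "Example question?"},
    {"role": "assistant", "content": "Example grounded answer."},
]

def build_messages(retrieved_chunks: list[str], question: str) -> list[dict]:
    # Stable prefix first: the system prompt and few-shot examples are
    # identical across requests, so they can be served from cache.
    messages = [{"role": "system", "content": STATIC_SYSTEM}, *FEW_SHOT]
    # Variable content last: retrieved context and the user question differ
    # per request, so they fall after the first difference and are billed
    # at the full rate.
    context = "\n\n".join(retrieved_chunks)
    messages.append(
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    )
    return messages
```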

How Prompt Caching Works

- **Prefix Matching**: The system identifies matching prefixes from recent requests. Prefixes can include system prompts, tool definitions, few-shot examples, and more. Note: prefixes only match up to the first difference, even if later parts of the prompt are the same (illustrated in the sketch below)!
- **Cache Hit**: If a matching prefix is found, cached computation is reused, dramatically reducing latency and cutting token costs by 50% for the cached portion.
- **Cache Miss**: If no match exists, your prompt is processed normally, and the prefix is temporarily cached for potential future matches.
- **Automatic Expiration**: All cached data automatically expires within a few hours.
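The "first difference" rule is worth internalizing: a later match does not resurrect caching. Here is a toy conceptual model of that rule, not Groq's actual cache implementation.

```python
# Toy illustration: only tokens before the first difference are cache-eligible.
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break  # matching stops at the first difference
        n += 1
    return n

cached = ["sys", "tool", "A", "shot", "Q1"]
new =    ["sys", "tool", "B", "shot", "Q2"]  # "shot" matches again, but too late
print(shared_prefix_len(cached, new))  # 2 -> only the first two tokens can hit
```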

Ready to take your GPT-OSS build to the next level?

With built-in tools, Responses API, and instant cloud availability in four global regions, Groq offers the most robust feature support for gpt-oss models. This, alongside lower token costs and prompt caching support, means developers now have even more fuel to power their applications.

Start experimenting with GPT-OSS models on GroqCloud today. To learn more about prompt caching and best practices, check out our developer documentation.
