VentureBeat, October 3
DeepSeek-V3.2-Exp cuts LLM costs

DeepSeek has released its latest experimental large language model, DeepSeek-V3.2-Exp, which matches V3.1-Terminus in performance while cutting API costs by 50%. The model uses DeepSeek Sparse Attention (DSA) to reduce computational load, and lifts performance through specialist distillation and reinforcement learning. API pricing falls to $0.028 per million input tokens (for cache hits), and costs stay low even near the 128,000-token context limit. The model is open-sourced on Hugging Face and GitHub and supports multiple deployment options, giving enterprises an AI solution that combines cost-effectiveness with flexibility.

🔍 DeepSeek-V3.2-Exp adopts DeepSeek Sparse Attention (DSA), which reduces computational load by selecting only the most relevant tokens, significantly cutting memory and compute requirements; the cost advantage is most pronounced in long-context processing.

📉 API prices have dropped sharply, to just $0.028 per million input tokens for cache hits, and remain low even near the 128,000-token context limit, making this one of the most cost-effective LLM options available.

🛠️ The model is open-sourced under the MIT License, so users are free to download, modify, and deploy it; open-source TileLang and CUDA/FlashMLA kernels plus Docker images make local deployment straightforward.

🧠 Through specialist distillation and reinforcement learning, the model performs consistently across reasoning, coding, and question-answering tasks, significantly narrowing the gap to domain-specialist models while balancing generality and expertise.

🔒 Enterprise users must weigh data security and compliance: using the API means transmitting data to servers in Hong Kong, while self-hosting the open-source model reduces data-privacy risk at the cost of greater operational responsibility.

DeepSeek continues to push the frontier of generative AI...in this case, in terms of affordability.

The company has unveiled its latest experimental large language model (LLM), DeepSeek-V3.2-Exp, which mostly matches or slightly improves on the benchmarks of its predecessor, DeepSeek-V3.1-Terminus. More importantly, it arrives at a 50 percent reduced cost through DeepSeek's application programming interface (API), down to just $0.028 per million input tokens for cache hits, and it keeps costs down even when approaching the context limit of 128,000 tokens (roughly 300-400 pages of text).

It's available through DeepSeek's first-party API, and the model weights and code are downloadable under an open-source, enterprise-friendly MIT License on Hugging Face and GitHub.
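
For developers who want to try it immediately: DeepSeek's API is OpenAI-compatible, so an existing OpenAI SDK client can be pointed at it by swapping the base URL. Here is a minimal sketch; the "deepseek-chat" model identifier routes to the platform's current default chat model, so verify against DeepSeek's API docs that it resolves to V3.2-Exp.

```python
# Minimal sketch of calling DeepSeek's OpenAI-compatible API.
# "deepseek-chat" is the platform's default chat model identifier;
# confirm in DeepSeek's API docs that it points to V3.2-Exp.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # issued at platform.deepseek.com
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize sparse attention in one line."}],
)
print(response.choices[0].message.content)
```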

How did the company do it? Read on to find out.

API Costs Reduced

As previously mentioned, DeepSeek announced significant reductions in API pricing. For one million tokens, input cache hits now cost $0.028, cache misses $0.28, and outputs $0.42.

This compares to $0.07, $0.56, and $1.68, respectively, under the earlier V3.1-Terminus pricing.
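
To make the savings concrete, here is a back-of-envelope comparison using the published rates; the monthly workload mix below is hypothetical, chosen only to illustrate the arithmetic.

```python
# Illustrative monthly cost comparison at the published per-1M-token rates.
# Workload assumptions (hypothetical): 200M cached input tokens,
# 50M uncached input tokens, and 40M output tokens per month.
PRICING = {
    "V3.1-Terminus": {"cache_hit": 0.07,  "cache_miss": 0.56, "output": 1.68},
    "V3.2-Exp":      {"cache_hit": 0.028, "cache_miss": 0.28, "output": 0.42},
}
WORKLOAD_M_TOKENS = {"cache_hit": 200, "cache_miss": 50, "output": 40}

for model, rates in PRICING.items():
    total = sum(rates[kind] * WORKLOAD_M_TOKENS[kind] for kind in rates)
    print(f"{model}: ${total:,.2f}/month")
# V3.1-Terminus: $109.20/month; V3.2-Exp: $36.40/month (about 67% less on this mix)
```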

DeepSeek has kept Terminus temporarily available via a separate API until October 15, allowing developers to directly compare the two models, but Terminus will be deprecated after that, making it a short-lived model: it was released just one week ago.

Still, DeepSeek V3.2-Exp appears to be among the cheapest options for developers through the API, though OpenAI's GPT-5 Nano still easily takes the crown for most affordable. Here's how it compares to other leading models:

| Provider | Model (cheap/entry) | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Notes / caveats |
|----------|---------------------|-----------------------------|------------------------------|-----------------|
| DeepSeek | V3.2-Exp | $0.28 / $0.028 cached input | $0.42 | |
| OpenAI | GPT-5 Nano | $0.05 / $0.005 cached input | $0.40 | |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.40 | No cached input price available |
| Anthropic | Claude Haiku 3.5 | $0.80 / $0.08 cached input | $4.00 | |
| xAI | Grok-4 Fast Non-Reasoning | $0.20 / $0.05 cached input | $0.50 | |

New Sparse Attention Design

At the heart of V3.2-Exp is DeepSeek Sparse Attention, or DSA, described in a technical report the company also released today on GitHub.

Most large language models use a traditional "dense" self-attention mechanism, which calculates interactions between every token and every other token in a sequence, so compute scales quadratically with sequence length. As the number of tokens grows, memory use and compute requirements climb rapidly, leading to high costs and slow inference.

In practice, if a prompt doubles in length, the model does far more than double the work to handle all those cross-token interactions, driving up GPU time and energy cost, which is reflected in per-million-token API pricing. During prefill, computation grows roughly with the square of the context length, and at least linearly during decoding. As a result, longer sequences of tens of thousands or even over 100,000 tokens cause costs to rise much faster than the token count alone would suggest.
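
A quick way to see the quadratic blow-up is to count pairwise attention scores, a crude proxy for prefill compute (real costs also depend on model width, hardware, and kernel efficiency):

```python
# Crude proxy for dense-attention prefill cost: one score per
# (query, key) pair, so work grows with the square of context length.
def attention_pairs(context_len: int) -> int:
    return context_len * context_len

for tokens in (8_000, 32_000, 128_000):
    print(f"{tokens:>7} tokens -> {attention_pairs(tokens):.2e} pairwise scores")
# 16x more tokens (8k -> 128k) means 256x more pairwise scores.
```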

DSA addresses this by using a “lightning indexer” to select only the most relevant tokens for attention.

This reduces the computational load while preserving nearly the same quality of responses.

By reducing the compute burden per token at large context lengths, V3.2-Exp keeps the cost curve flatter and much lower.

This makes it far more practical and affordable to run long-context workloads such as document-scale summarization, multi-turn chat with long histories, or code analysis without facing a runaway increase in inference costs.
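
In spirit, the approach looks like the following toy sketch: a small, cheap indexer scores how relevant each key is to each query, and full attention is computed only over the top-k winners. This is a minimal illustration of top-k sparse attention under assumed names and shapes, not DeepSeek's actual DSA kernels, and causal masking is omitted for brevity.

```python
import torch

def sparse_attention(q, k, v, idx_q, idx_k, top_k=64):
    """Toy top-k sparse attention (not DeepSeek's DSA implementation).
    A lightweight indexer scores key relevance; full attention runs
    only over the top-k keys per query.
    q: [T_q, d], k/v: [T_k, d], idx_q: [T_q, d_i], idx_k: [T_k, d_i]."""
    # Cheap relevance scores from small indexer projections.
    index_scores = idx_q @ idx_k.T                        # [T_q, T_k]
    top = index_scores.topk(min(top_k, k.shape[0]), dim=-1).indices

    k_sel, v_sel = k[top], v[top]                         # [T_q, top_k, d]

    # Standard scaled dot-product attention over the reduced key set.
    scores = (q.unsqueeze(1) @ k_sel.transpose(1, 2)).squeeze(1) / k.shape[1] ** 0.5
    weights = torch.softmax(scores, dim=-1)               # [T_q, top_k]
    return (weights.unsqueeze(1) @ v_sel).squeeze(1)      # [T_q, d]

# Demo: 1,024 keys, but each query attends to only 64 of them.
T, d, d_i = 1024, 64, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = sparse_attention(q, k, v, torch.randn(T, d_i), torch.randn(T, d_i))
print(out.shape)  # torch.Size([1024, 64])
```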

Post-Training and Reinforcement Learning Advances

Beyond its architectural changes, DeepSeek-V3.2-Exp introduces refinements in the post-training process. The company employs a two-step approach: specialist distillation and reinforcement learning.

Specialist distillation begins with training separate models for mathematics, competitive programming, logical reasoning, agentic coding, and agentic search. These specialists, fine-tuned from the same base checkpoint, are reinforced with large-scale training to generate domain-specific data. That data is then distilled back into the final checkpoint, ensuring the consolidated model benefits from specialist knowledge while remaining general-purpose.

The reinforcement learning phase marks a significant shift. Instead of the multi-stage approach used in previous DeepSeek models, reasoning, agent, and human alignment training are merged into a single RL stage using Group Relative Policy Optimization (GRPO). This unified process balances performance across domains while avoiding the “catastrophic forgetting” issues often associated with multi-stage pipelines.

The reward design blends rule-based outcome signals, length penalties, and language consistency checks with a generative reward model guided by task-specific rubrics. Experimental results show that the distilled and reinforced model performs nearly on par with domain-specific specialists, with the gap effectively closed after RL training.
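
For readers unfamiliar with GRPO, its central trick is to compute each sampled response's advantage relative to the other samples drawn for the same prompt, which removes the need for a separate value (critic) model. A minimal sketch of that group-relative normalization, illustrative only and not DeepSeek's training code:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: each sampled response is scored against
    the mean and spread of the other samples for the same prompt, so no
    separate value model is needed.
    rewards: [num_prompts, group_size] scalar rewards per response."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.2, 0.8, 0.8, 0.2]])
print(grpo_advantages(rewards))  # positive = better than its group's average
```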

Benchmarks Steady

Benchmarking confirms the trade-off works as intended. On widely used public evaluations, V3.2-Exp performs on par with V3.1-Terminus, showing negligible differences in areas such as reasoning, coding, and question answering.

While scores dipped slightly in some reasoning-heavy tasks such as GPQA-Diamond and Humanity’s Last Exam, the model’s efficiency gains and consistent performance elsewhere suggest the sparse approach does not substantially compromise capability.

MMLU-Pro holds steady at 85.0, AIME 2025 slightly improves to 89.3, while GPQA-Diamond dips from 80.7 to 79.9. Coding and agent benchmarks tell a similar story, with Codeforces ratings rising from 2046 to 2121 and BrowseComp improving from 38.5 to 40.1.

This balance reflects the design trade-off. By selecting only a fraction of possible tokens for attention, DSA reduces computational costs significantly. Inference cost comparisons show V3.2-Exp requires less than half the cost per million tokens of V3.1-Terminus when running on long contexts.

Open-Source Access and Deployment Options

In keeping with the company’s open approach, DeepSeek has released the V3.2-Exp model weights on Hugging Face under the MIT License. Researchers and enterprises can freely download, modify, and deploy the model for commercial use.

The release is accompanied by open-source kernels: TileLang for research prototyping and CUDA/FlashMLA kernels for high-performance inference. LMSYS Org, the team behind SGLang, also announced that its framework now officially supports V3.2 with optimized sparse attention kernels, dynamic key-value caching, and scaling to 128,000 tokens. vLLM provides day-one support as well.

For local deployment, DeepSeek has provided updated demo code, along with Docker images compatible with NVIDIA H200s, AMD MI350s, and NPUs. The model, at 685 billion parameters, supports multiple tensor types including BF16, FP8, and FP32.
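
Fetching the open weights is a one-liner with the huggingface_hub client. The repo id below is inferred from the release naming and should be checked against the actual model card; note that a 685-billion-parameter checkpoint runs to hundreds of gigabytes.

```python
# Minimal sketch: download the open V3.2-Exp weights for local deployment.
# Repo id inferred from DeepSeek's naming; confirm on huggingface.co first.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3.2-Exp",
    local_dir="./deepseek-v3.2-exp",
)
print(f"Weights saved to {local_dir}")
```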

Background: DeepSeek’s Iterative Push

The launch of V3.2-Exp comes just one week after DeepSeek released V3.1-Terminus, a refinement of its V3.1 model. Terminus was designed to address user feedback, improving tool-based reasoning and reducing language-mixing errors, such as inserting Chinese words into English responses.

According to reporting from VentureBeat, Terminus builds on the V3 family introduced in December 2024, which positioned DeepSeek’s models as versatile, cost-efficient alternatives to its more reasoning-heavy R1 series. While R1 excels in structured logic, math, and multi-step reasoning, it is slower and more expensive. V3 models, by contrast, are built for general-purpose applications such as writing, summarization, customer-facing chat, and basic coding.

With V3.2-Exp, DeepSeek is layering in architectural innovation through sparse attention while keeping the MIT License and open-source release model intact.

Considerations for Enterprise Decision-Makers

For enterprises, especially those in the U.S., the cost savings offered by DeepSeek's API are compelling, but there are additional considerations before adoption. As noted above, using the first-party API means transmitting data to DeepSeek's servers in Hong Kong, which raises data-security and compliance questions for regulated industries; self-hosting the MIT-licensed weights avoids that exposure, but shifts the operational and infrastructure burden in-house.

For U.S. decision-makers evaluating DeepSeek, the calculus isn't just about API pricing. It's about aligning affordability with risk tolerance, regulatory requirements, and infrastructure strategy.

What's Next for DeepSeek?

DeepSeek-V3.2-Exp demonstrates how an open-source player can push frontier-scale models while also addressing the practical challenges of cost and deployment.

By introducing sparse attention, cutting API prices, merging reinforcement learning into a unified stage, and maintaining full transparency through Hugging Face and GitHub releases, DeepSeek is offering both a research testbed and a viable enterprise option.

The addition of frameworks like SGLang and vLLM in the official release ecosystem reinforces that DeepSeek is cultivating broad community integration rather than locking down distribution.

At the same time, the experimental nature of V3.2-Exp leaves room for iteration. Internal evaluations show promising results, but DeepSeek acknowledges it is actively testing the architecture in real-world scenarios to uncover any limitations.

Whether this experimental architecture becomes the foundation for a broader V3.3 or V4 release remains to be seen. But for now, the launch of V3.2-Exp signals DeepSeek’s determination to stay visible and competitive in the global AI landscape.
