VentureBeat, October 3
DeepSeek-V3.2-Exp cuts LLM costs

DeepSeek has released its latest experimental large language model, DeepSeek-V3.2-Exp, which matches V3.1-Terminus in performance while cutting API costs by 50%. The model uses DeepSeek Sparse Attention (DSA) to reduce computational load, and lifts performance through specialist distillation and reinforcement learning. API pricing falls to $0.028 per million input tokens (for cache hits), and costs stay low even near the 128,000-token context limit. The model is open-sourced on Hugging Face and GitHub and supports multiple deployment options, giving enterprises an AI solution that combines cost-effectiveness with flexibility.

🔍 DeepSeek-V3.2-Exp adopts DeepSeek Sparse Attention (DSA), which reduces computational load by selecting only the most relevant tokens, significantly cutting memory and compute requirements; the cost advantage is most pronounced in long-context processing.

📉 API prices have dropped sharply, to just $0.028 per million input tokens for cache hits, and remain low even near the 128,000-token context limit, making this one of the most cost-effective LLM options available.

🛠️ The model is open-sourced under the MIT License, so users are free to download, modify, and deploy it; open-source TileLang and CUDA/FlashMLA kernels plus Docker images make local deployment straightforward.

🧠 Through specialist distillation and reinforcement learning, the model performs consistently across reasoning, coding, and question-answering tasks, significantly narrowing the gap to domain-specialist models while balancing generality and expertise.

🔒 Enterprise users must weigh data security and compliance: using the API means transmitting data to servers in Hong Kong, while self-hosting the open-source model reduces data-privacy risk at the cost of greater operational responsibility.

DeepSeek continues to push the frontier of generative AI...in this case, in terms of affordability.

The company has unveiled its latest experimental large language model (LLM), DeepSeek-V3.2-Exp, which mostly matches or slightly improves on the benchmarks of its predecessor, DeepSeek-V3.1-Terminus. More importantly, it arrives at a 50 percent reduced cost through DeepSeek's application programming interface (API), down to just $0.028 per million input tokens for cache hits, and it keeps costs down even when approaching the context limit of 128,000 tokens (roughly 300-400 pages of text).

It's available through DeepSeek's first-party API, and the model weights and code are downloadable under an open-source, enterprise-friendly MIT License on Hugging Face and GitHub.
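
For developers who want to try it immediately: DeepSeek's API is OpenAI-compatible, so an existing OpenAI SDK client can be pointed at it by swapping the base URL. Here is a minimal sketch; the "deepseek-chat" model identifier routes to the platform's current default chat model, so verify against DeepSeek's API docs that it resolves to V3.2-Exp.

```python
# Minimal sketch of calling DeepSeek's OpenAI-compatible API.
# "deepseek-chat" is the platform's default chat model identifier;
# confirm in DeepSeek's API docs that it points to V3.2-Exp.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # issued at platform.deepseek.com
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize sparse attention in one line."}],
)
print(response.choices[0].message.content)
```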

How did the company do it? Read on to find out.

API Costs Reduced

As previously mentioned, DeepSeek announced significant reductions in API pricing. For one million tokens, input cache hits now cost $0.028, cache misses $0.28, and outputs $0.42.

This compares to $0.07, $0.56, and $1.68, respectively, under the earlier V3.1-Terminus pricing.
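
To make the savings concrete, here is a back-of-envelope comparison using the published rates; the monthly workload mix below is hypothetical, chosen only to illustrate the arithmetic.

```python
# Illustrative monthly cost comparison at the published per-1M-token rates.
# Workload assumptions (hypothetical): 200M cached input tokens,
# 50M uncached input tokens, and 40M output tokens per month.
PRICING = {
    "V3.1-Terminus": {"cache_hit": 0.07,  "cache_miss": 0.56, "output": 1.68},
    "V3.2-Exp":      {"cache_hit": 0.028, "cache_miss": 0.28, "output": 0.42},
}
WORKLOAD_M_TOKENS = {"cache_hit": 200, "cache_miss": 50, "output": 40}

for model, rates in PRICING.items():
    total = sum(rates[kind] * WORKLOAD_M_TOKENS[kind] for kind in rates)
    print(f"{model}: ${total:,.2f}/month")
# V3.1-Terminus: $109.20/month; V3.2-Exp: $36.40/month (about 67% less on this mix)
```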

DeepSeek has kept Terminus temporarily available via a separate API until October 15, allowing developers to directly compare the two models, but Terminus will be deprecated after that, making it a short-lived model: it was released just one week ago.

Still, DeepSeek V3.2-Exp appears to be among the cheapest options for developers through the API, though OpenAI's GPT-5 Nano still easily takes the crown for most affordable. Here's how it compares to other leading models:

| Provider | Model (cheap/entry) | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Notes / caveats |
|----------|---------------------|-----------------------------|------------------------------|-----------------|
| DeepSeek | V3.2-Exp | $0.28 / $0.028 cached input | $0.42 | |
| OpenAI | GPT-5 Nano | $0.05 / $0.005 cached input | $0.40 | |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.40 | No cached input price available |
| Anthropic | Claude Haiku 3.5 | $0.80 / $0.08 cached input | $4.00 | |
| xAI | Grok-4 Fast Non-Reasoning | $0.20 / $0.05 cached input | $0.50 | |

New Sparse Attention Design

At the heart of V3.2-Exp is DeepSeek Sparse Attention, or DSA, described in a technical report the company also released today on GitHub.

Most large language models use a traditional "dense" self-attention mechanism, which calculates interactions between every token and every other token in a sequence, so compute scales quadratically with sequence length. As the number of tokens grows, memory use and compute requirements climb rapidly, leading to high costs and slow inference.

In practice, if a prompt doubles in length, the model does far more than double the work to handle all those cross-token interactions, driving up GPU time and energy cost, which is reflected in per-million-token API pricing. During prefill, computation grows roughly with the square of the context length, and at least linearly during decoding. As a result, longer sequences of tens of thousands or even over 100,000 tokens cause costs to rise much faster than the token count alone would suggest.
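
A quick way to see the quadratic blow-up is to count pairwise attention scores, a crude proxy for prefill compute (real costs also depend on model width, hardware, and kernel efficiency):

```python
# Crude proxy for dense-attention prefill cost: one score per
# (query, key) pair, so work grows with the square of context length.
def attention_pairs(context_len: int) -> int:
    return context_len * context_len

for tokens in (8_000, 32_000, 128_000):
    print(f"{tokens:>7} tokens -> {attention_pairs(tokens):.2e} pairwise scores")
# 16x more tokens (8k -> 128k) means 256x more pairwise scores.
```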

DSA addresses this by using a “lightning indexer” to select only the most relevant tokens for attention.

This reduces the computational load while preserving nearly the same quality of responses.

By reducing the compute burden per token at large context lengths, V3.2-Exp keeps the cost curve flatter and much lower.

This makes it far more practical and affordable to run long-context workloads such as document-scale summarization, multi-turn chat with long histories, or code analysis without facing a runaway increase in inference costs.
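
In spirit, the approach looks like the following toy sketch: a small, cheap indexer scores how relevant each key is to each query, and full attention is computed only over the top-k winners. This is a minimal illustration of top-k sparse attention under assumed names and shapes, not DeepSeek's actual DSA kernels, and causal masking is omitted for brevity.

```python
import torch

def sparse_attention(q, k, v, idx_q, idx_k, top_k=64):
    """Toy top-k sparse attention (not DeepSeek's DSA implementation).
    A lightweight indexer scores key relevance; full attention runs
    only over the top-k keys per query.
    q: [T_q, d], k/v: [T_k, d], idx_q: [T_q, d_i], idx_k: [T_k, d_i]."""
    # Cheap relevance scores from small indexer projections.
    index_scores = idx_q @ idx_k.T                        # [T_q, T_k]
    top = index_scores.topk(min(top_k, k.shape[0]), dim=-1).indices

    k_sel, v_sel = k[top], v[top]                         # [T_q, top_k, d]

    # Standard scaled dot-product attention over the reduced key set.
    scores = (q.unsqueeze(1) @ k_sel.transpose(1, 2)).squeeze(1) / k.shape[1] ** 0.5
    weights = torch.softmax(scores, dim=-1)               # [T_q, top_k]
    return (weights.unsqueeze(1) @ v_sel).squeeze(1)      # [T_q, d]

# Demo: 1,024 keys, but each query attends to only 64 of them.
T, d, d_i = 1024, 64, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = sparse_attention(q, k, v, torch.randn(T, d_i), torch.randn(T, d_i))
print(out.shape)  # torch.Size([1024, 64])
```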

Post-Training and Reinforcement Learning Advances

Beyond its architectural changes, DeepSeek-V3.2-Exp introduces refinements in the post-training process. The company employs a two-step approach: specialist distillation and reinforcement learning.

Specialist distillation begins with training separate models for mathematics, competitive programming, logical reasoning, agentic coding, and agentic search. These specialists, fine-tuned from the same base checkpoint, are reinforced with large-scale training to generate domain-specific data. That data is then distilled back into the final checkpoint, ensuring the consolidated model benefits from specialist knowledge while remaining general-purpose.

The reinforcement learning phase marks a significant shift. Instead of the multi-stage approach used in previous DeepSeek models, reasoning, agent, and human alignment training are merged into a single RL stage using Group Relative Policy Optimization (GRPO). This unified process balances performance across domains while avoiding the “catastrophic forgetting” issues often associated with multi-stage pipelines.

The reward design blends rule-based outcome signals, length penalties, and language consistency checks with a generative reward model guided by task-specific rubrics. Experimental results show that the distilled and reinforced model performs nearly on par with domain-specific specialists, with the gap effectively closed after RL training.
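
For readers unfamiliar with GRPO, its central trick is to compute each sampled response's advantage relative to the other samples drawn for the same prompt, which removes the need for a separate value (critic) model. A minimal sketch of that group-relative normalization, illustrative only and not DeepSeek's training code:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: each sampled response is scored against
    the mean and spread of the other samples for the same prompt, so no
    separate value model is needed.
    rewards: [num_prompts, group_size] scalar rewards per response."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.2, 0.8, 0.8, 0.2]])
print(grpo_advantages(rewards))  # positive = better than its group's average
```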

Benchmarks Steady

Benchmarking confirms the trade-off works as intended. On widely used public evaluations, V3.2-Exp performs on par with V3.1-Terminus, showing negligible differences in areas such as reasoning, coding, and question answering.

While scores dipped slightly in some reasoning-heavy tasks such as GPQA-Diamond and Humanity’s Last Exam, the model’s efficiency gains and consistent performance elsewhere suggest the sparse approach does not substantially compromise capability.

MMLU-Pro holds steady at 85.0, AIME 2025 slightly improves to 89.3, while GPQA-Diamond dips from 80.7 to 79.9. Coding and agent benchmarks tell a similar story, with Codeforces ratings rising from 2046 to 2121 and BrowseComp improving from 38.5 to 40.1.

This balance reflects the design trade-off. By selecting only a fraction of possible tokens for attention, DSA reduces computational costs significantly. Inference cost comparisons show V3.2-Exp requires less than half the cost per million tokens of V3.1-Terminus when running on long contexts.

Open-Source Access and Deployment Options

In keeping with the company’s open approach, DeepSeek has released the V3.2-Exp model weights on Hugging Face under the MIT License. Researchers and enterprises can freely download, modify, and deploy the model for commercial use.

The release is accompanied by open-source kernels: TileLang for research prototyping and CUDA/FlashMLA kernels for high-performance inference. LMSYS Org, the team behind SGLang, also announced that its framework now officially supports V3.2 with optimized sparse attention kernels, dynamic key-value caching, and scaling to 128,000 tokens. vLLM provides day-one support as well.

For local deployment, DeepSeek has provided updated demo code, along with Docker images compatible with NVIDIA H200s, AMD MI350s, and NPUs. The model, at 685 billion parameters, supports multiple tensor types including BF16, FP8, and FP32.
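
Fetching the open weights is a one-liner with the huggingface_hub client. The repo id below is inferred from the release naming and should be checked against the actual model card; note that a 685-billion-parameter checkpoint runs to hundreds of gigabytes.

```python
# Minimal sketch: download the open V3.2-Exp weights for local deployment.
# Repo id inferred from DeepSeek's naming; confirm on huggingface.co first.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3.2-Exp",
    local_dir="./deepseek-v3.2-exp",
)
print(f"Weights saved to {local_dir}")
```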

Background: DeepSeek’s Iterative Push

The launch of V3.2-Exp comes just one week after DeepSeek released V3.1-Terminus, a refinement of its V3.1 model. Terminus was designed to address user feedback, improving tool-based reasoning and reducing language-mixing errors, such as inserting Chinese words into English responses.

According to reporting from VentureBeat, Terminus builds on the V3 family introduced in December 2024, which positioned DeepSeek’s models as versatile, cost-efficient alternatives to its more reasoning-heavy R1 series. While R1 excels in structured logic, math, and multi-step reasoning, it is slower and more expensive. V3 models, by contrast, are built for general-purpose applications such as writing, summarization, customer-facing chat, and basic coding.

With V3.2-Exp, DeepSeek is layering in architectural innovation through sparse attention while keeping the MIT License and open-source release model intact.

Considerations for Enterprise Decision-Makers

For enterprises, especially those in the U.S., the cost savings offered by DeepSeek's API are compelling, but there are additional considerations before adoption. As noted above, using the first-party API means transmitting data to DeepSeek's servers in Hong Kong, which raises data-security and compliance questions for regulated industries; self-hosting the MIT-licensed weights avoids that exposure, but shifts the operational and infrastructure burden in-house.

For U.S. decision-makers evaluating DeepSeek, the calculus isn't just about API pricing. It's about aligning affordability with risk tolerance, regulatory requirements, and infrastructure strategy.

What's Next for DeepSeek?

DeepSeek-V3.2-Exp demonstrates how an open-source player can push frontier-scale models while also addressing the practical challenges of cost and deployment.

By introducing sparse attention, cutting API prices, merging reinforcement learning into a unified stage, and maintaining full transparency through Hugging Face and GitHub releases, DeepSeek is offering both a research testbed and a viable enterprise option.

The addition of frameworks like SGLang and vLLM in the official release ecosystem reinforces that DeepSeek is cultivating broad community integration rather than locking down distribution.

At the same time, the experimental nature of V3.2-Exp leaves room for iteration. Internal evaluations show promising results, but DeepSeek acknowledges it is actively testing the architecture in real-world scenarios to uncover any limitations.

Whether this experimental architecture becomes the foundation for a broader V3.3 or V4 release remains to be seen. But for now, the launch of V3.2-Exp signals DeepSeek’s determination to stay visible and competitive in the global AI landscape.
