MarkTechPost@AI · September 30, 17:30
DeepSeek V3.2-Exp Introduces Sparse Attention to Cut Long-Context Model Costs

DeepSeek has released the V3.2-Exp model, a notable update to V3.1 that introduces trainable DeepSeek Sparse Attention (DSA) aimed at more efficient long-context processing. The release comes with a substantial API price cut (over 50%) that reflects the efficiency gains. DSA uses a lightweight “indexer” to score context tokens and then runs sparse attention over only the selected subset, reducing computational complexity from O(L^2) to O(Lk), which markedly lowers long-context inference costs while preserving performance across benchmarks. The update already has support in platforms such as SGLang and vLLM, signaling its readiness for real-world use.

💡 **DeepSeek V3.2-Exp introduces trainable sparse attention (DSA)**: The model builds on V3.1 and centers on integrating DeepSeek Sparse Attention (DSA), a trainable sparsification path designed to make long-context processing more efficient. A lightweight “indexer” scores context tokens, and attention is computed over only the selected subset, cutting the complexity from quadratic O(L^2) to linear O(Lk), where k is far smaller than L (a back-of-envelope comparison follows these highlights).

💰 **Large API price cut alongside the efficiency gains**: The DSA release is accompanied by a significant reduction in DeepSeek API prices (over 50%). The move is tied directly to the model's efficiency improvements and aims to give users a more cost-effective option for long-context inference. Official figures show a sharp drop in decoding cost at 128k context length, and community discussion even cites savings of up to 6x.

📊 **Balancing performance and cost**: While introducing DSA, DeepSeek V3.2-Exp strives to keep performance on par with V3.1. Minor shifts may appear on specific tasks, but overall benchmark results (such as MMLU-Pro) are unchanged or slightly improved. The article notes that alongside the substantial gains in inference efficiency, performance on agentic and search tasks actually improves, underscoring the model's practical value.

🚀 **Broad integration and application potential**: The update has received day-0 support from mainstream inference frameworks such as SGLang and vLLM, indicating the technique is ready for production. DeepSeek also mentions open-source kernels such as TileLang, DeepGEMM, and FlashMLA that work alongside it, further demonstrating its low-level optimization work and contribution to the open-source community. These signals position V3.2-Exp as a strong tool for long-document and RAG (Retrieval-Augmented Generation) workloads.
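To put the quadratic-to-linear claim from the first highlight in rough numbers, here is a back-of-envelope comparison at the 128k context length and k = 2048 selection size cited in this article. It counts only query-key pairs in the core attention and ignores the indexer's own (small) cost and all constant factors, so it is an upper-bound intuition rather than a measured speedup.

```python
# Rough arithmetic for the O(L^2) vs. O(L*k) attention-score term.
# Ignores the lightning indexer's own cost, head counts, and kernel constants.
L = 131_072   # 128k context length (2**17 tokens; substitute your own figure)
k = 2_048     # top-k tokens kept per query, as described for DSA

dense_pairs = L * L      # every query attends to every key
sparse_pairs = L * k     # every query attends to only k selected keys

print(f"dense  : {dense_pairs:,} query-key pairs")
print(f"sparse : {sparse_pairs:,} query-key pairs")
print(f"ratio  : {dense_pairs / sparse_pairs:.0f}x fewer pairs in the core attention")
# ratio = L / k = 64 -> ~64x fewer score computations at 128k (core attention only)
```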

DeepSeek released DeepSeek-V3.2-Exp, an “intermediate” update to V3.1 that adds DeepSeek Sparse Attention (DSA)—a trainable sparsification path aimed at long-context efficiency. DeepSeek also reduced API prices by 50%+, consistent with the stated efficiency gains.

DeepSeek-V3.2-Exp keeps the V3/V3.1 stack (MoE + MLA) and inserts a two-stage attention path: (i) a lightweight “indexer” that scores context tokens; (ii) sparse attention over the selected subset.

https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf

FP8 index → top-k selection → sparse core attention

DeepSeek Sparse Attention (DSA) splits the attention path into two compute tiers:

(1) Lightning indexer (FP8, few heads): For each query token h_t ∈ R^d, a lightweight scoring function computes index logits I_{t,s} against preceding tokens h_s. It uses small indexer heads with a ReLU nonlinearity for throughput. Because this stage runs in FP8 and with few heads, its wall-time and FLOP cost are minor relative to dense attention.
(2) Fine-grained token selection (top-k): The system selects only the top-k = 2048 key-value entries for each query and then performs standard attention only over that subset. This changes the dominant term from O(L^2) to O(Lk) with k ≪ L, while preserving the ability to attend to arbitrarily distant tokens when needed (a minimal code sketch of this path follows below).
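The following PyTorch sketch illustrates the two-tier path just described: a cheap ReLU indexer scores all positions, a top-k pick keeps 2048 of them, and standard attention runs only over that subset. Tensor shapes, the single-sequence layout, and the exact scoring function are simplifying assumptions for illustration; DeepSeek's actual implementation uses FP8 indexer kernels, learned per-head indexer weights, and MLA-specific layouts not reproduced here.

```python
# Minimal sketch of the DSA path (assumed shapes; not DeepSeek's FP8/MLA kernels).
import torch
import torch.nn.functional as F

def dsa_attention(q, k, v, idx_q, idx_k, top_k=2048):
    """
    q, k, v:      [L, n_heads, d_head]      main attention tensors for one sequence
    idx_q, idx_k: [L, n_idx_heads, d_idx]   lightweight indexer projections
    top_k:        number of key/value positions kept per query
    """
    L = q.shape[0]

    # (1) Lightning indexer: per-head dot products, ReLU, then summed over the few indexer heads.
    index_logits = F.relu(torch.einsum("thd,shd->tsh", idx_q, idx_k)).sum(dim=-1)   # [L, L]
    causal = torch.ones(L, L).tril().bool()
    index_logits = index_logits.masked_fill(~causal, float("-inf"))

    # (2) Fine-grained token selection: keep only the top-k scored positions per query.
    k_eff = min(top_k, L)
    top_idx = index_logits.topk(k_eff, dim=-1).indices                               # [L, k_eff]

    # (3) Core attention over the selected subset only: O(L*k) instead of O(L^2).
    k_sel, v_sel = k[top_idx], v[top_idx]                                            # [L, k_eff, H, d]
    scores = torch.einsum("thd,tkhd->thk", q, k_sel) / q.shape[-1] ** 0.5
    valid = top_idx <= torch.arange(L).unsqueeze(-1)      # early queries have fewer than k valid keys
    scores = scores.masked_fill(~valid.unsqueeze(1), float("-inf"))
    weights = scores.softmax(dim=-1)
    return torch.einsum("thk,tkhd->thd", weights, v_sel)                             # [L, n_heads, d_head]
```

In the released model the selected entries are MLA latent KV slots shared across query heads (see the instantiation note below), which is what makes the top-k gather cheap at decode time.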

Training signal: The indexer is trained to imitate the dense model’s head-summed attention distribution via KL-divergence, first under a short dense warm-up (indexer learns targets while the main model is frozen), then during sparse training where gradients for the indexer remain separate from the main model’s language loss. Warm-up uses ~2.1B tokens; sparse stage uses ~943.7B tokens with top-k=2048, LR ~7.3e-6 for the main model.
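As a rough illustration of that training signal, the snippet below computes a KL loss pulling the indexer's per-query distribution toward the dense model's head-summed attention. Variable names and the single-sequence layout are placeholders; the FP8 details, warm-up/sparse schedules, and token budgets from the paragraph above are not reproduced.

```python
# Hedged sketch of the indexer's KL objective; gradients flow only into indexer parameters.
import torch
import torch.nn.functional as F

def indexer_kl_loss(index_logits, dense_attn_probs, causal_mask):
    """
    index_logits:     [L, L]          lightning-indexer scores I[t, s]
    dense_attn_probs: [n_heads, L, L] attention probabilities from the (frozen) dense path
    causal_mask:      [L, L] bool     True where s <= t
    """
    # Target: dense attention summed over heads, renormalized per query token.
    target = dense_attn_probs.sum(dim=0)
    target = target / target.sum(dim=-1, keepdim=True).clamp_min(1e-9)

    # Indexer distribution over preceding tokens (large negative fill keeps values finite).
    log_pred = index_logits.masked_fill(~causal_mask, -1e9).log_softmax(dim=-1)

    # KL(target || pred); the dense targets are detached so the main model is untouched.
    return F.kl_div(log_pred, target.detach(), reduction="batchmean")
```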

Instantiation: DSA is implemented under MLA (Multi-head Latent Attention) in MQA mode for decoding so each latent KV entry is shared across query heads, aligning with the kernel-level requirement that KV entries be reused across queries for throughput.
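The MQA-style decode step this implies can be pictured as below: one shared latent KV entry per past position is gathered once for the indexer-selected set and reused by every query head. This is a deliberately simplified view with made-up dimensions; real MLA projects the latent back up into keys and values, which is omitted here.

```python
# Illustrative decode step: selected latent KV entries are fetched once and shared across heads.
import torch

def mqa_decode_step(q_heads, latent_kv_cache, selected_idx):
    """
    q_heads:         [n_heads, d_latent]  per-head queries for the current token
    latent_kv_cache: [L, d_latent]        one shared latent KV entry per past position
    selected_idx:    [k]                  positions chosen by the DSA indexer
    """
    kv = latent_kv_cache[selected_idx]                    # [k, d_latent], gathered once
    scores = q_heads @ kv.T / kv.shape[-1] ** 0.5         # [n_heads, k], same KV reused per head
    weights = scores.softmax(dim=-1)
    return weights @ kv                                   # [n_heads, d_latent]
```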


Let's Talk About Its Efficiency and Accuracy

Summary

DeepSeek V3.2-Exp shows that trainable sparsity (DSA) can hold benchmark parity while materially improving long-context economics: official docs commit to 50%+ API price cuts, with day-0 runtime support already available, and community threads claim larger decode-time gains at 128k that warrant independent replication under matched batching and cache policies. The near-term takeaway for teams is simple: treat V3.2-Exp as a drop-in A/B for RAG and long-document pipelines where O(L^2) attention dominates costs, and validate end-to-end throughput/quality on your stack.
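For teams acting on that takeaway, a hedged sketch of such an A/B probe is below, assuming an OpenAI-compatible endpoint (DeepSeek's API and vLLM/SGLang servers both expose one). The base URL, model names, and the simple latency timer are placeholders to adapt to your own deployment and evaluation harness.

```python
# Minimal A/B probe against an OpenAI-compatible endpoint (placeholder model/base_url values).
import time
from openai import OpenAI

def run_long_context_probe(base_url, api_key, model, long_document, question):
    client = OpenAI(base_url=base_url, api_key=api_key)
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": f"{long_document}\n\nQuestion: {question}"},
        ],
    )
    latency = time.perf_counter() - start
    return resp.choices[0].message.content, latency, resp.usage
```

Running the same long-document prompts against a V3.1-class deployment and a V3.2-Exp deployment, then comparing answers, latency, and reported token usage, is the matched-conditions validation the summary calls for.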


FAQs

1) What exactly is DeepSeek V3.2-Exp?
V3.2-Exp is an experimental, intermediate update to V3.1-Terminus that introduces DeepSeek Sparse Attention (DSA) to improve long-context efficiency.

2) Is it truly open source, and under what license?
Yes. The repository and model weights are licensed under MIT, per the official Hugging Face model card (License section).

3) What is DeepSeek Sparse Attention (DSA) in practice?
DSA adds a lightweight indexing stage to score/select a small set of relevant tokens, then runs attention only over that subset—yielding “fine-grained sparse attention” and reported long-context training/inference efficiency gains while keeping output quality on par with V3.1.



