MarkTechPost@AI · October 31, 04:52
Ant Group Releases the Ling 2.0 Language Model Series for Efficient Reasoning

The Inclusion AI team at Ant Group has released the Ling 2.0 series of language models, which grow model capacity substantially while keeping the compute per token almost unchanged. Built on the idea of sparse large models, the series uses a Mixture of Experts (MoE) architecture with a carefully designed routing mechanism that activates only a small number of experts at a time, enabling efficient training and inference. The family includes Ling mini 2.0 (16B parameters, 1.4B activated), Ling flash 2.0 (100B class, 6.1B activated), and Ling 1T (1T parameters, about 50B activated). With the Ling Scaling Laws, optimized pre-training and post-training pipelines, and FP8 infrastructure, the Ling 2.0 series delivers strong reasoning capability and efficiency at a low activation ratio.

💡 **Sparse MoE architecture with a low activation ratio**: At the core of the Ling 2.0 series are sparse Mixture of Experts (MoE) layers, each with 256 routed experts and one shared expert. The router activates only about 9 experts per token (roughly a 3.5% activation rate), cutting compute cost significantly, delivering roughly 7x the efficiency of an equivalent dense model, and allowing the parameter count to scale from 16B to 1T.

⚖️ **Ling Scaling Laws and model design**: The architecture was not designed by guesswork. It follows the Ling Scaling Laws, using "Ling Wind Tunnel" experiments to predict loss, activation, and expert balance at larger scales at low cost. This let the team fix the optimal configuration of a 1/32 activation ratio, 256 routed experts, and 1 shared expert in advance, and keep Ling mini 2.0, Ling flash 2.0, and Ling 1T architecturally consistent across sizes.

📚 **Multi-stage pre-training and long-context reasoning**: The Ling 2.0 series is trained on more than 20T tokens, with the share of reasoning-intensive data (such as math and code) gradually increasing. During pre-training, the context length is extended step by step from 4K to 128K, combined with the injection of high-quality Chain-of-Thought (CoT) data, so the models acquire strong long-context understanding and reasoning early on.

🧠 **Staged post-training and alignment**: Alignment proceeds in two steps, capability shaping followed by preference optimization. First, Decoupled Fine-Tuning uses different system prompts to teach the model to switch between quick responses and deep reasoning; an evolutionary CoT stage then expands and diversifies the reasoning chains; finally, sentence-level policy optimization with a Group Arena Reward aligns outputs with human judgment at fine granularity, achieving strong math, code, and instruction-following performance without inflating every answer.

🚀 **FP8 and heterogeneous parallel infrastructure**: The Ling 2.0 series trains natively in FP8 with a loss curve close to BF16 and roughly 15% higher hardware utilization. A larger speedup of about 40% comes from heterogeneous pipeline parallelism, interleaved one-forward-one-backward execution, and partitioning that is aware of the MTP block. Combined with Warmup Stable Merge (which replaces learning-rate decay with checkpoint merging), this systems stack makes 1T-scale training feasible on existing clusters.

How do you build a language model that grows in capacity but keeps the computation for each token almost unchanged? The Inclusion AI team at Ant Group is pushing sparse large models in a methodical way with the release of Ling 2.0, a reasoning-based language model family built on the idea that each activation should translate directly into stronger reasoning behavior. It is one of the latest approaches to show how to keep activation small while moving from 16B to 1T parameters without rewriting the recipe. The series has three versions: Ling mini 2.0, with 16B total parameters and 1.4B activated; Ling flash 2.0, in the 100B class with 6.1B activated; and Ling 1T, with 1T total parameters and about 50B active per token.

Sparse MoE as the central design

Every Ling 2.0 model uses the same sparse Mixture of Experts layer. Each layer has 256 routed experts and one shared expert. The router picks 8 routed experts for every token, and the shared expert is always on, so about 9 of the 257 experts are used per token. That is roughly 3.5 percent activation, which matches the 1/32 activation ratio at the expert level. The research team reports about 7 times the efficiency of an equivalent dense model, because only a small part of the network is trained and served per token while the parameter pool stays very large.
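The routing pattern described above maps naturally to code. The sketch below is a minimal, illustrative PyTorch layer, not the actual Ling 2.0 implementation: hidden sizes are shrunk for readability, the sigmoid-scored router is simplified, and load-balancing details are omitted.

```python
# Minimal sketch of top-8-of-256 routing with one always-on shared expert.
# Dimensions and expert MLPs are toy-sized; this is for illustration only.
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_routed=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.routed_experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_routed)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x))   # sigmoid scoring, no softmax
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = self.shared_expert(x)               # shared expert is always active
        for k in range(self.top_k):                # only the selected experts run per token
            for e in idx[:, k].unique():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.routed_experts[int(e)](x[mask])
        return out


layer = SparseMoELayer()
print(layer(torch.randn(4, 64)).shape)  # ~9 of 257 experts touched per token (~3.5%)
```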

https://arxiv.org/abs/2510.22115

Ling 2.0 brings coordinated advances across four layers of the stack (model architecture, pre-training, post-training, and the underlying FP8 infrastructure):

Model architecture: The architecture is chosen using the Ling Scaling Laws, not by trial and error. To support these laws, the team runs what they call the Ling Wind Tunnel, a fixed set of small MoE runs trained under the same data and routing rules and then fitted to power laws that predict loss, activation, and expert balance at much larger sizes. This gives them a low-cost way to choose 1/32 activation, 256 routed experts, and 1 shared expert before committing GPUs to 1T scale. Routing is aux-loss-free with sigmoid scoring, and the stack uses QK Norm, MTP loss, and partial RoPE to keep training stable at depth. Because the same law picked the shape, Ling mini 2.0, Ling flash 2.0, and Ling 1T remain architecturally consistent across sizes.
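To make the wind-tunnel idea concrete, here is a minimal sketch of fitting a power law to small pilot runs and extrapolating to a larger compute budget. The data points, constants, and the 1e25 FLOPs target are invented for demonstration; only the general recipe (fit cheap runs, extrapolate the curve) follows the description above.

```python
# Illustrative power-law extrapolation in the spirit of the Ling Wind Tunnel.
# All numbers below are made up for demonstration.
import numpy as np
from scipy.optimize import curve_fit

compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])  # FLOPs of small pilot runs
loss = np.array([2.95, 2.81, 2.68, 2.57, 2.47])      # measured validation losses


def power_law(c, a, b, irreducible):
    # Normalize compute so the fit is numerically well behaved.
    return a * (c / 1e19) ** (-b) + irreducible


params, _ = curve_fit(power_law, compute, loss, p0=[1.5, 0.1, 1.5], maxfev=10000)
target = 1e25                                        # hypothetical large-scale budget
print("predicted loss at 1e25 FLOPs:", power_law(target, *params))
```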

Pre-training: The series is trained on more than 20T tokens, starting with 4K context and a mix in which reasoning-heavy sources such as math and code gradually increase to almost half of the corpus. A later mid-training stage extends context to about 32K on a selected 150B-token slice, then injects another 600B tokens of high-quality chain-of-thought data, before finally stretching to 128K with YaRN while preserving short-context quality. This pipeline ensures that long context and reasoning are introduced early, not just bolted on at the SFT step.
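A loose schematic of that curriculum is sketched below. The token counts and context lengths come from the description above, but the stage names, field layout, and how the numbers map onto discrete stages are assumptions for illustration.

```python
# Schematic rendering of the staged pre-training curriculum (illustrative only).
PRETRAIN_SCHEDULE = [
    {"stage": "main pretrain", "tokens": "20T+", "context": 4_096,
     "note": "reasoning-heavy sources (math, code) ramp toward ~half the mix"},
    {"stage": "mid-train context extension", "tokens": "150B", "context": 32_768,
     "note": "selected slice used to extend context to ~32K"},
    {"stage": "high-quality CoT injection", "tokens": "600B", "context": 32_768,
     "note": "chain-of-thought data added before the final extension"},
    {"stage": "long-context extension", "tokens": "-", "context": 131_072,
     "note": "stretch to 128K with YaRN while preserving short-context quality"},
]

for s in PRETRAIN_SCHEDULE:
    print(f"{s['stage']:<28} ctx={s['context']:>7,}  tokens={s['tokens']:<5} {s['note']}")
```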

Post-training: Alignment is separated into a capability pass and a preference pass. First, Decoupled Fine-Tuning teaches the model to switch between quick responses and deep reasoning through different system prompts; then an evolutionary CoT stage expands and diversifies reasoning chains; finally, sentence-level policy optimization with a Group Arena Reward aligns outputs with human judgment at fine granularity. This staged alignment is what lets a non-thinking base reach strong math, code, and instruction-following performance without inflating every answer.
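A hedged sketch of the decoupled fine-tuning idea: the same question is paired with two system prompts, one for quick answers and one for explicit reasoning. The prompt wording, message schema, and the <think> tags are invented for illustration and are not Ling 2.0's actual prompts or data format.

```python
# Sketch of "decoupled" SFT data: two response modes keyed by the system prompt.
QUICK_SYSTEM = "Answer directly and concisely. Do not show intermediate reasoning."
DEEP_SYSTEM = "Think step by step inside <think>...</think>, then give the final answer."


def build_sft_examples(question, quick_answer, reasoned_answer):
    """Return two training samples for one question, one per response mode."""
    return [
        {"messages": [{"role": "system", "content": QUICK_SYSTEM},
                      {"role": "user", "content": question},
                      {"role": "assistant", "content": quick_answer}]},
        {"messages": [{"role": "system", "content": DEEP_SYSTEM},
                      {"role": "user", "content": question},
                      {"role": "assistant", "content": reasoned_answer}]},
    ]


samples = build_sft_examples(
    "What is 17 * 24?",
    "408",
    "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think> 408",
)
print(len(samples), "samples, one per response mode")
```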

Infrastructure: Ling 2.0 trains natively in FP8 with safeguards, keeping the loss curve within a small gap of BF16 while gaining about 15 percent utilization on the reported hardware. The larger speedup, around 40 percent, comes from heterogeneous pipeline parallelism, interleaved one-forward-one-backward execution, and partitioning that is aware of the MTP block, not from precision alone. Together with Warmup Stable Merge, which replaces learning-rate decay by merging checkpoints, this systems stack makes 1T-scale runs practical on existing clusters.
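To illustrate the checkpoint-merging idea behind Warmup Stable Merge, here is a minimal sketch that uniformly averages a few checkpoints in place of a learning-rate decay phase. The uniform average and the toy model are assumptions for illustration; the actual WSM procedure may select and weight checkpoints differently.

```python
# Minimal checkpoint-merging sketch: average state_dicts instead of decaying the LR.
import torch


def merge_checkpoints(state_dicts):
    """Uniformly average a list of model state_dicts with identical keys."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged


# Toy usage: three "checkpoints" of a tiny linear model.
model = torch.nn.Linear(4, 2)
ckpts = []
for _ in range(3):
    with torch.no_grad():
        for p in model.parameters():
            p.add_(0.01 * torch.randn_like(p))   # pretend training moved the weights
    ckpts.append({k: v.clone() for k, v in model.state_dict().items()})

model.load_state_dict(merge_checkpoints(ckpts))
print("merged", len(ckpts), "checkpoints")
```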

Understanding the Results

Evaluations follow a consistent pattern: small-activation MoE models deliver competitive quality while keeping per-token compute low. Ling mini 2.0 has 16B total parameters, activates 1.4B per token, and is reported to perform in the 7B to 8B dense band. (Reddit) Ling flash 2.0 keeps the same 1/32 activation recipe, with 100B total parameters and 6.1B activated per token. Ling 1T is the flagship non-thinking model, with 1T total parameters and about 50B active per token, preserving the 1/32 sparsity and extending the same Ling Scaling Laws to trillion scale.
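The sparsity figures are easy to sanity-check from the reported sizes. The short snippet below reproduces the arithmetic: the 1/32 ratio refers to routed-expert activation, while the whole-model active fraction also counts attention, embeddings, and the shared expert.

```python
# Arithmetic behind the reported sparsity and activation numbers.
routed, picked, shared = 256, 8, 1
print(f"routed activation: {picked}/{routed} = 1/{routed // picked}")              # 1/32
print(f"experts used per token: {picked + shared}/{routed + shared} "
      f"= {100 * (picked + shared) / (routed + shared):.1f}%")                     # ~3.5%

sizes_b = {"Ling mini 2.0": (16, 1.4), "Ling flash 2.0": (100, 6.1), "Ling 1T": (1000, 50)}
for name, (total, active) in sizes_b.items():
    print(f"{name:<15} active params per token: {active}B of {total}B "
          f"({100 * active / total:.1f}%)")
```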


Key Takeaways

- Ling 2.0 is built around a 1/32 activation MoE architecture, selected using the Ling Scaling Laws so that 256 routed experts plus 1 shared expert stay optimal from 16B up to 1T.
- Ling mini 2.0 has 16B total parameters with 1.4B activated per token and is reported to match 7B to 8B dense models while generating at more than 300 tokens per second in simple QA on H20.
- Ling flash 2.0 keeps the same recipe, has 6.1B active parameters, and sits in the 100B range, giving a higher-capacity option without increasing per-token compute.
- Ling 1T exposes the full design: 1T total parameters with about 50B active per token, 128K context, and an Evo CoT plus LPO style post-training stack to push efficient reasoning.
- Across all sizes, efficiency gains of more than 7 times over dense baselines come from the combination of sparse activation, FP8 training, and a shared training schedule, so quality scales predictably without re-tuning compute.

Editorial Comments

This release demonstrates a complete sparse MoE stack. Ling Scaling Laws identify a 1/32 activation as optimal, the architecture locks in 256 routed experts plus 1 shared expert, and the same shape is used from 16B to 1T. Training, context extension and preference optimization are all aligned to that choice, so small activation does not block math, code or long context, and FP8 plus heterogeneous pipelines keep cost in a practical range. It is a clear signal that trillion scale reasoning can be organized around fixed sparsity instead of growing dense compute.



The post Ant Group Releases Ling 2.0: A Reasoning-First MoE Language Model Series Built on the Principle that Each Activation Enhances Reasoning Capability appeared first on MarkTechPost.
