NVIDIA NeMo Automodel: Simplifying Large-Scale MoE Model Training

NVIDIA NeMo Automodel is an open-source library designed to lower the barrier to training large-scale mixture-of-experts (MoE) models. It lets developers train models efficiently on hundreds or even thousands of GPUs directly in PyTorch, with tools they already know, and without deep distributed-systems expertise or costly specialized infrastructure. By combining PyTorch's distributed parallelism with NVIDIA hardware-accelerated optimizations, NeMo Automodel significantly improves training efficiency and accessibility, making it easier and more cost-effective for researchers, startups, and large enterprises to explore and apply advanced MoE architectures.

🚀 **Simplified MoE training workflow**: NVIDIA NeMo Automodel lets developers train large-scale MoE models directly in PyTorch without managing complex parallelism strategies or specialized infrastructure, turning a task that once required expert knowledge into something approachable and lowering the barrier to entry for large-scale MoE training.

💡 **Greater training efficiency and accessibility**: By combining PyTorch-native distributed parallelism with NVIDIA performance optimizations, the library scales efficiently from 8 to more than 1,000 GPUs while maintaining strong performance and cost-effectiveness, enabling researchers, startups, and enterprises to experiment with and apply MoE architectures in ways previously out of reach.

⚙️ **Integrated advanced optimizations**: NeMo Automodel incorporates techniques such as the Megatron-Core DeepEP token dispatcher and GroupedGEMM to optimize token routing and expert computation, reducing communication overhead and improving GPU utilization. On H100-class hardware this delivers more than 200 TFLOPs/GPU, reaching 250 TFLOPs/sec/GPU on certain models.

Training massive mixture-of-experts (MoE) models has long been the domain of a few advanced users with deep infrastructure and distributed-systems expertise. For most developers, the challenge wasn’t building smarter models—it was scaling them efficiently across hundreds or even thousands of GPUs without breaking the bank.

With NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, developers can now train large-scale MoE models directly in PyTorch—using the same familiar tools they already know. Built on accelerated PyTorch distributed with NVIDIA performance optimizations, NeMo Automodel democratizes large-scale MoE training—making it:

- Simple – Train billion-parameter models directly in PyTorch without managing complex parallelism or specialized infrastructure.
- Accessible – Empower researchers, startups, and enterprises to experiment with MoE architectures previously out of reach.
- Efficient – Scale from eight to over 1,000 GPUs while maintaining strong performance and cost-effectiveness through built-in optimizations.

In this post, you’ll see how NeMo Automodel combines PyTorch-native distributed parallelism with NVIDIA acceleration to make large-scale MoE training easier, faster, and more accessible than ever. You’ll also find a detailed quick-start guide to reproduce benchmark results, run your own experiments, and explore configuration options—so you can experience the benefits firsthand.

Why training large MoEs is hard

Training MoEs efficiently at scale requires solving several interconnected challenges:

- Expert parallelism: Distribute hundreds of experts across GPUs without overwhelming communication bandwidth.
- Token routing overhead: Move tokens quickly and efficiently to the correct experts (see the routing sketch after this list).
- Memory management: Shard massive parameter sets to fit within GPU memory constraints.
- Communication-computation fusion: Minimize latency from all-to-all communication and token permutation operations.
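To make the token-routing cost concrete, here is a minimal, single-GPU PyTorch sketch of top-k routing. The function name, shapes, and router weights are illustrative assumptions; NeMo Automodel and Megatron-Core implement this step with fused kernels and distribute it across expert-parallel ranks rather than with a Python loop.

```python
# Minimal top-k MoE routing sketch (illustrative; not NeMo Automodel's implementation).
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, num_experts, top_k=2):
    """hidden: [num_tokens, hidden_dim] -> per-expert token buckets."""
    logits = hidden @ router_weight                   # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)  # each token picks k experts
    # Grouping (permuting) tokens by destination expert is what becomes an
    # all-to-all exchange plus permute/unpermute at expert-parallel scale.
    buckets = [hidden[(topk_idx == e).any(dim=-1)] for e in range(num_experts)]
    return buckets, topk_probs, topk_idx

tokens = torch.randn(16, 64)
router = torch.randn(64, 8)
buckets, _, _ = route_tokens(tokens, router, num_experts=8)
print([b.shape[0] for b in buckets])  # tokens per expert; often imbalanced
```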

As a result of these system challenges, achieving more than 150 TFLOPs/GPU on H100 systems at BF16 precision has historically been difficult—leaving performance untapped.

NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, removes these barriers by building on top of native PyTorch parallelisms. It incorporates advanced infrastructure optimizations—previously reserved for expert ML engineers—directly into the PyTorch ecosystem. 

Developers can now use PyTorch APIs while achieving over 200 TFLOPs per GPU on H100s with BF16 precision for a variety of popular 100B+ MoE architectures. For instance, DeepSeek V3 reached 250 TFLOPs/sec/GPU on 256 GPUs.

This makes large-scale MoE training accessible—empowering the broader community to research, experiment, and innovate with billion-parameter models. 

Inside NeMo Automodel: architecture and optimizations

NeMo Automodel bridges PyTorch-native distributed parallelisms with NVIDIA acceleration technologies, creating a unified, high-performance training stack for MoEs.

Scaling efficiently via PyTorch distributed parallelisms

Built on PyTorch distributed, NeMo Automodel seamlessly scales models using the native parallelism dimensions listed in Table 1, including Fully Sharded Data Parallel (FSDP), tensor (TP), pipeline (PP, with virtual pipeline stages), context (CP), and expert (EP) parallelism.
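As a rough illustration of how these native PyTorch parallelisms are expressed in code, the sketch below builds a 2D DeviceMesh and applies FSDP2 sharding with fully_shard. The mesh sizes, dimension names, and helper function are assumptions for illustration only, not NeMo Automodel's actual setup code.

```python
# Illustrative composition of native PyTorch parallelisms via DeviceMesh + FSDP2.
# Launch with torchrun so the process-group environment variables are set.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2; older PyTorch versions expose it
                                                # under torch.distributed._composable.fsdp

def setup_parallelism(model: nn.Module, dp_size: int, ep_size: int):
    # One mesh axis for data-parallel/FSDP sharding, one for expert parallelism.
    mesh = init_device_mesh("cuda", (dp_size, ep_size), mesh_dim_names=("dp", "ep"))
    # Shard all parameters over the "dp" axis; an MoE layer would place its
    # experts along the "ep" axis of the same mesh.
    fully_shard(model, mesh=mesh["dp"])
    return model, mesh
```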

Accelerating training with NVIDIA Transformer Engine

Using NVIDIA Transformer Engine kernels—including CUDNN RMSNorm, CUDNN Linear, and DotProductAttention—NeMo Automodel accelerates transformer blocks and supports different attention mechanisms such as multi-head latent attention (MLA), grouped-query attention (GQA), and sliding-window attention (SWA).
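As a rough sketch of what these kernels look like when used directly, the snippet below builds a Transformer Engine TransformerLayer with RMSNorm and grouped-query attention in BF16. The hyperparameters are illustrative assumptions; NeMo Automodel configures and swaps in these layers for you.

```python
# Transformer Engine transformer block in BF16 (illustrative hyperparameters).
import torch
import transformer_engine.pytorch as te

layer = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
    num_gqa_groups=4,             # grouped-query attention (GQA)
    normalization="RMSNorm",      # cuDNN-backed RMSNorm
    self_attn_mask_type="causal",
).to(device="cuda", dtype=torch.bfloat16)

# Default input layout is [sequence, batch, hidden].
x = torch.randn(2048, 2, 1024, device="cuda", dtype=torch.bfloat16)
y = layer(x)
```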

Smarter expert routing and computation with Megatron-Core DeepEP and GroupedGEMM

To achieve high efficiency at massive scale, NeMo Automodel integrates advanced token routing and expert computation components from Megatron-Core, designed specifically for MoE training.

- DeepEP token dispatcher (experimental): Scales token routing to 64+ expert-parallelism degrees with highly efficient all-to-all communication and optional permute/unpermute fusion. By leveraging DeepSeek’s DeepEP optimization, NeMo Automodel minimizes communication overhead and maintains balanced expert utilization, enabling smoother scaling across hundreds of GPUs.
- GroupedGEMM for MoE experts: Aggregates multiple local expert computations into a single batched GEMM operation. This reduces kernel launch overhead, increases GPU occupancy, and significantly improves throughput and hardware utilization, especially when multiple experts share the same device (a toy illustration follows this list).
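The snippet below is a toy, pure-PyTorch illustration of the GroupedGEMM idea: replacing a Python loop of per-expert matmuls with one batched call when the local experts share a shape. It is not the Megatron-Core kernel, which additionally handles a variable number of tokens per expert and fuses surrounding operations.

```python
# GroupedGEMM idea in pure PyTorch: batch all local expert GEMMs into one call.
# Requires a CUDA GPU; shapes are illustrative.
import torch

num_experts, tokens_per_expert, hidden, ffn = 8, 512, 1024, 4096
x = torch.randn(num_experts, tokens_per_expert, hidden, device="cuda", dtype=torch.bfloat16)
w = torch.randn(num_experts, hidden, ffn, device="cuda", dtype=torch.bfloat16)

# Naive: one kernel launch per expert, poor occupancy when experts are small.
out_loop = torch.stack([x[e] @ w[e] for e in range(num_experts)])

# Grouped/batched: a single kernel covering every local expert.
out_batched = torch.bmm(x, w)

torch.testing.assert_close(out_loop, out_batched)
```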

Breakthrough performance: cost-effective MoE training for everyone

The table below shows pre-training benchmarks on DGX H100 systems with BF16 precision across major MoE architectures:

| Model | # GPUs | GBS (Global Batch Size) | Parallelism [TP, PP, CP, EP, VP, FSDP] | Optimizations | TFLOPs/sec/GPU | Tokens/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek V3 671B | 256 | 512 | 1, 4, 1, 64, 8, 64 | TE + DeepEP | 250 | 1,002 |
| DeepSeek V3 671B | 1,024 | 8,192 | 1, 4, 1, 64, 8, 256 | TE + DeepEP | 216 | 865 |
| Kimi K2 | 256 | 512 | 1, 8, 1, 32, 4, 32 | TE + DeepEP | 189 | 924 |
| Qwen3 MoE 30B | 8 | 512 | 1, 1, 1, 8, -, 8 | TE + DeepEP | 277 | 12,040 |
| GPT-OSS 20B | 8 | 256 | 1, 1, 1, -, -, 8 | TE + DeepEP + FlexAttn | 279 | 13,058 |
| GPT-OSS 120B | 64 | 512 | 1, 1, 1, -, -, 64 | TE + DeepEP + FlexAttn | 231 | 7,626 |
Table 1. Pre-training performance of representative mixture-of-experts (MoE) architectures on DGX H100 systems (BF16 precision). Note: All benchmarks use consistent measurement methodology with mock data, for a sequence length of 4096, and balanced expert routing. Peak H100 BF16 performance is 989 TFLOPs.

NeMo Automodel delivers industry-leading efficiency and scalability across diverse MoE architectures and GPU counts. Models sustain roughly 190 to 280 TFLOPs/sec per GPU and process up to 13,000 tokens/sec per GPU, demonstrating near-linear scaling from eight to 1,024 GPUs, with the DeepSeek V3 671B model reaching 250 TFLOPs/sec per GPU on 256 GPUs. All of this is achieved with native PyTorch parallelisms coupled with NVIDIA optimizations, unlocking peak hardware utilization and cost-effective large-scale MoE training for everyone in the PyTorch community.
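For context, here is a quick back-of-envelope model FLOPs utilization (MFU) calculation against the 989 TFLOPs BF16 peak quoted in the table caption, using the Table 1 numbers:

```python
# Back-of-envelope MFU from Table 1 (peak H100 BF16 = 989 TFLOPs/sec per GPU).
PEAK_H100_BF16 = 989.0

results = {
    "DeepSeek V3 671B (256 GPUs)": 250,
    "Qwen3 MoE 30B (8 GPUs)": 277,
    "GPT-OSS 120B (64 GPUs)": 231,
}
for name, tflops in results.items():
    print(f"{name}: {tflops / PEAK_H100_BF16:.1%} MFU")
# DeepSeek V3 671B (256 GPUs): 25.3% MFU
# Qwen3 MoE 30B (8 GPUs): 28.0% MFU
# GPT-OSS 120B (64 GPUs): 23.4% MFU
```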

Empowering developers through native PyTorch distributed training

By leveraging native PyTorch distributed parallelisms, NeMo Automodel brings high-performance large-scale MoE training directly into the PyTorch ecosystem. This approach eliminates dependency on external or proprietary model-parallel libraries, giving developers full flexibility to scale using tools and APIs they already know.

Most importantly, it reflects NVIDIA's commitment to strengthening PyTorch and the broader open-source AI ecosystem—making large-model training not just faster, but more open, interoperable, and accessible to the entire developer community.

Key benefits for developers:

- Faster iteration cycles: Achieve higher throughput for quicker experimentation and model development.
- Lower training costs: Better GPU utilization means fewer GPU-hours per training run.
- Scalable performance: Consistent, near-linear scaling from eight GPUs to over 1,000 GPUs enables flexible infrastructure planning.
- Native PyTorch integration: Leverages PyTorch distributed to remove reliance on external model-parallel frameworks, keeping everything within the PyTorch workflow.
- Ecosystem commitment: Demonstrates NVIDIA's long-term investment in advancing PyTorch, ensuring future innovations are directly integrated into the core framework.
- Production-ready: Includes proven, battle-tested configurations for leading open-source MoE architectures.

Quick start: train and benchmark large MoE models

Getting started with NeMo Automodel is fast and familiar for any PyTorch developer.

You can use the provided benchmark scripts and configuration files to reproduce results or train your own large-scale MoE models with NVIDIA-optimized performance.

Minimum requirements

At least eight GPUs (80 GB memory each) are recommended to reproduce the benchmark results and run fine-tuning experiments efficiently.

Follow these simple steps to run a benchmark or fine-tuning experiment: 

# 1. Pull the NeMo docker image and start a container
docker pull nvcr.io/nvidia/nemo:25.09
docker run -it -d --ulimit memlock=-1 --ulimit stack=67108864 --gpus all nvcr.io/nvidia/nemo:25.09 bash

# 2. Once inside the container, clone the repo and navigate to Automodel
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel

Run a benchmark

Example: Benchmark Qwen3 MoE 30B on eight GPUs

torchrun --nproc-per-node 8 nemo_automodel/recipes/llm/benchmark.py \
  --config examples/benchmark/configs/qwen3_moe_30b_te_deepep.yaml

Run fine-tuning

Example: Fine-tune Qwen3 MoE 30B

Note: 

- You’ll need to download the model checkpoint from Hugging Face first: hf download Qwen/Qwen3-30B-A3B
- If you encounter a dataset instantiation error, upgrade the datasets library: pip install --upgrade datasets

torchrun --nproc-per-node 8 examples/llm_finetune/finetune.py --config examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml

Available configuration files:

- deepseek_v3_te_deepep.yaml – DeepSeek V3 (671B parameters)
- kimi_k2_te_deepep.yaml – Optimized configuration for Kimi K2
- qwen3_moe_30b_te_deepep.yaml – Qwen3 MoE 30B with full NVIDIA optimizations
- gptoss_20b_te_deepep.yaml – GPT-OSS 20B with FlexAttention
- gptoss_120b_te_deepep.yaml – GPT-OSS 120B production configuration

Check out the docs for complete performance documentation and implementation details.

Looking ahead: Join us in advancing open MoE training

This release marks a major milestone in democratizing large-scale mixture-of-experts (MoE) training with accelerated PyTorch. But it’s only the beginning.

We’re actively working on:

- Expanding model support: Adding new MoE and hybrid architectures.
- Deeper optimizations: Further kernel-level and communication improvements for even higher efficiency.
- Technical deep dives: Detailed explainers of NeMo Automodel MoE design and performance techniques.
- Broader benchmarking: Extending performance validation across diverse hardware and cluster configurations.

We’d love for you to get started with NeMo Automodel and be part of this journey—try the configurations, share your results, and contribute feedback through GitHub Issues. Your insights help shape the next generation of scalable, open AI training tools.
