NVIDIA NeMo Automodel: Simplifying Large-Scale MoE Model Training

NVIDIA NeMo Automodel is an open-source library designed to lower the barrier to training large-scale mixture-of-experts (MoE) models. It lets developers train models efficiently on hundreds or even thousands of GPUs directly in PyTorch, with tools they already know, and without deep distributed-systems expertise or costly specialized infrastructure. By combining PyTorch's distributed parallelism with NVIDIA hardware-accelerated optimizations, NeMo Automodel significantly improves training efficiency and accessibility, making it easier and more cost-effective for researchers, startups, and large enterprises to explore and apply advanced MoE architectures.

🚀 **Simplified MoE training workflow**: NVIDIA NeMo Automodel lets developers train large-scale MoE models directly in PyTorch without managing complex parallelism strategies or specialized infrastructure, turning a task that once required expert knowledge into something approachable and lowering the barrier to entry for large-scale MoE training.

💡 **Greater training efficiency and accessibility**: By combining PyTorch-native distributed parallelism with NVIDIA performance optimizations, the library scales efficiently from 8 to more than 1,000 GPUs while maintaining strong performance and cost-effectiveness, enabling researchers, startups, and enterprises to experiment with and apply MoE architectures in ways previously out of reach.

⚙️ **Integrated advanced optimizations**: NeMo Automodel incorporates techniques such as the Megatron-Core DeepEP token dispatcher and GroupedGEMM to optimize token routing and expert computation, reducing communication overhead and improving GPU utilization. On H100-class hardware this delivers more than 200 TFLOPs/GPU, reaching 250 TFLOPs/sec/GPU on certain models.

Training massive mixture-of-experts (MoE) models has long been the domain of a few advanced users with deep infrastructure and distributed-systems expertise. For most developers, the challenge wasn’t building smarter models—it was scaling them efficiently across hundreds or even thousands of GPUs without breaking the bank.

With NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, developers can now train large-scale MoE models directly in PyTorch—using the same familiar tools they already know. Built on accelerated PyTorch distributed with NVIDIA performance optimizations, NeMo Automodel democratizes large-scale MoE training—making it:

- Simple – Train billion-parameter models directly in PyTorch without managing complex parallelism or specialized infrastructure.
- Accessible – Empower researchers, startups, and enterprises to experiment with MoE architectures previously out of reach.
- Efficient – Scale from eight to over 1,000 GPUs while maintaining strong performance and cost-effectiveness through built-in optimizations.

In this post, you’ll see how NeMo Automodel combines PyTorch-native distributed parallelism with NVIDIA acceleration to make large-scale MoE training easier, faster, and more accessible than ever. You’ll also find a detailed quick-start guide to reproduce benchmark results, run your own experiments, and explore configuration options—so you can experience the benefits firsthand.

Why training large MoEs is hard

Training MoEs efficiently at scale requires solving several interconnected challenges:

- Expert parallelism: Distribute hundreds of experts across GPUs without overwhelming communication bandwidth.
- Token routing overhead: Move tokens quickly and efficiently to the correct experts (see the routing sketch after this list).
- Memory management: Shard massive parameter sets to fit within GPU memory constraints.
- Communication-computation fusion: Minimize latency from all-to-all communication and token permutation operations.
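To make the token-routing cost concrete, here is a minimal, single-GPU PyTorch sketch of top-k routing. The function name, shapes, and router weights are illustrative assumptions; NeMo Automodel and Megatron-Core implement this step with fused kernels and distribute it across expert-parallel ranks rather than with a Python loop.

```python
# Minimal top-k MoE routing sketch (illustrative; not NeMo Automodel's implementation).
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, num_experts, top_k=2):
    """hidden: [num_tokens, hidden_dim] -> per-expert token buckets."""
    logits = hidden @ router_weight                   # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)  # each token picks k experts
    # Grouping (permuting) tokens by destination expert is what becomes an
    # all-to-all exchange plus permute/unpermute at expert-parallel scale.
    buckets = [hidden[(topk_idx == e).any(dim=-1)] for e in range(num_experts)]
    return buckets, topk_probs, topk_idx

tokens = torch.randn(16, 64)
router = torch.randn(64, 8)
buckets, _, _ = route_tokens(tokens, router, num_experts=8)
print([b.shape[0] for b in buckets])  # tokens per expert; often imbalanced
```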

As a result of these system challenges, achieving more than 150 TFLOPs/GPU on H100 systems at BF16 precision has historically been difficult—leaving performance untapped.

NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, removes these barriers by building on top of native PyTorch parallelisms. It incorporates advanced infrastructure optimizations—previously reserved for expert ML engineers—directly into the PyTorch ecosystem. 

Developers can now use PyTorch APIs while achieving over 200 TFLOPs per GPU on H100s with BF16 precision for a variety of popular 100B+ MoE architectures. For instance, DeepSeek V3 reached 250 TFLOPs/sec/GPU on 256 GPUs.

This makes large-scale MoE training accessible—empowering the broader community to research, experiment, and innovate with billion-parameter models. 

Inside NeMo Automodel: architecture and optimizations

NeMo Automodel bridges PyTorch-native distributed parallelisms with NVIDIA acceleration technologies, creating a unified, high-performance training stack for MoEs.

Scaling efficiently via PyTorch distributed parallelisms

Built on PyTorch distributed, NeMo Automodel seamlessly scales models using the native parallelism dimensions listed in Table 1, including Fully Sharded Data Parallel (FSDP), tensor (TP), pipeline (PP, with virtual pipeline stages), context (CP), and expert (EP) parallelism.
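As a rough illustration of how these native PyTorch parallelisms are expressed in code, the sketch below builds a 2D DeviceMesh and applies FSDP2 sharding with fully_shard. The mesh sizes, dimension names, and helper function are assumptions for illustration only, not NeMo Automodel's actual setup code.

```python
# Illustrative composition of native PyTorch parallelisms via DeviceMesh + FSDP2.
# Launch with torchrun so the process-group environment variables are set.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2; older PyTorch versions expose it
                                                # under torch.distributed._composable.fsdp

def setup_parallelism(model: nn.Module, dp_size: int, ep_size: int):
    # One mesh axis for data-parallel/FSDP sharding, one for expert parallelism.
    mesh = init_device_mesh("cuda", (dp_size, ep_size), mesh_dim_names=("dp", "ep"))
    # Shard all parameters over the "dp" axis; an MoE layer would place its
    # experts along the "ep" axis of the same mesh.
    fully_shard(model, mesh=mesh["dp"])
    return model, mesh
```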

Accelerating training with NVIDIA Transformer Engine

Using NVIDIA Transformer Engine kernels—including CUDNN RMSNorm, CUDNN Linear, and DotProductAttention—NeMo Automodel accelerates transformer blocks and supports different attention mechanisms such as multi-head latent attention (MLA), grouped-query attention (GQA), and sliding-window attention (SWA).
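As a rough sketch of what these kernels look like when used directly, the snippet below builds a Transformer Engine TransformerLayer with RMSNorm and grouped-query attention in BF16. The hyperparameters are illustrative assumptions; NeMo Automodel configures and swaps in these layers for you.

```python
# Transformer Engine transformer block in BF16 (illustrative hyperparameters).
import torch
import transformer_engine.pytorch as te

layer = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
    num_gqa_groups=4,             # grouped-query attention (GQA)
    normalization="RMSNorm",      # cuDNN-backed RMSNorm
    self_attn_mask_type="causal",
).to(device="cuda", dtype=torch.bfloat16)

# Default input layout is [sequence, batch, hidden].
x = torch.randn(2048, 2, 1024, device="cuda", dtype=torch.bfloat16)
y = layer(x)
```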

Smarter expert routing and computation with Megatron-Core DeepEP and GroupedGEMM

To achieve high efficiency at massive scale, NeMo Automodel integrates advanced token routing and expert computation components from Megatron-Core, designed specifically for MoE training.

- DeepEP token dispatcher (experimental): Scales token routing to 64+ expert-parallelism degrees with highly efficient all-to-all communication and optional permute/unpermute fusion. By leveraging DeepSeek’s DeepEP optimization, NeMo Automodel minimizes communication overhead and maintains balanced expert utilization, enabling smoother scaling across hundreds of GPUs.
- GroupedGEMM for MoE experts: Aggregates multiple local expert computations into a single batched GEMM operation. This reduces kernel launch overhead, increases GPU occupancy, and significantly improves throughput and hardware utilization, especially when multiple experts share the same device (a toy illustration follows this list).
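The snippet below is a toy, pure-PyTorch illustration of the GroupedGEMM idea: replacing a Python loop of per-expert matmuls with one batched call when the local experts share a shape. It is not the Megatron-Core kernel, which additionally handles a variable number of tokens per expert and fuses surrounding operations.

```python
# GroupedGEMM idea in pure PyTorch: batch all local expert GEMMs into one call.
# Requires a CUDA GPU; shapes are illustrative.
import torch

num_experts, tokens_per_expert, hidden, ffn = 8, 512, 1024, 4096
x = torch.randn(num_experts, tokens_per_expert, hidden, device="cuda", dtype=torch.bfloat16)
w = torch.randn(num_experts, hidden, ffn, device="cuda", dtype=torch.bfloat16)

# Naive: one kernel launch per expert, poor occupancy when experts are small.
out_loop = torch.stack([x[e] @ w[e] for e in range(num_experts)])

# Grouped/batched: a single kernel covering every local expert.
out_batched = torch.bmm(x, w)

torch.testing.assert_close(out_loop, out_batched)
```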

Breakthrough performance: cost-effective MoE training for everyone

The table below shows pre-training benchmarks on DGX H100 systems with BF16 precision across major MoE architectures:

| Model | # GPUs | GBS (Global Batch Size) | Parallelism [TP, PP, CP, EP, VP, FSDP] | Optimizations | TFLOPs/sec/GPU | Tokens/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek V3 671B | 256 | 512 | 1, 4, 1, 64, 8, 64 | TE + DeepEP | 250 | 1,002 |
| DeepSeek V3 671B | 1,024 | 8,192 | 1, 4, 1, 64, 8, 256 | TE + DeepEP | 216 | 865 |
| Kimi K2 | 256 | 512 | 1, 8, 1, 32, 4, 32 | TE + DeepEP | 189 | 924 |
| Qwen3 MoE 30B | 8 | 512 | 1, 1, 1, 8, -, 8 | TE + DeepEP | 277 | 12,040 |
| GPT-OSS 20B | 8 | 256 | 1, 1, 1, -, -, 8 | TE + DeepEP + FlexAttn | 279 | 13,058 |
| GPT-OSS 120B | 64 | 512 | 1, 1, 1, -, -, 64 | TE + DeepEP + FlexAttn | 231 | 7,626 |
Table 1. Pre-training performance of representative mixture-of-experts (MoE) architectures on DGX H100 systems (BF16 precision). Note: All benchmarks use consistent measurement methodology with mock data, for a sequence length of 4096, and balanced expert routing. Peak H100 BF16 performance is 989 TFLOPs.

NeMo Automodel delivers industry-leading efficiency and scalability across diverse MoE architectures and GPU counts. Models sustain roughly 190 to 280 TFLOPs/sec per GPU and process up to 13,000 tokens/sec per GPU, demonstrating near-linear scaling from eight to 1,024 GPUs, with the DeepSeek V3 671B model reaching 250 TFLOPs/sec per GPU on 256 GPUs. All of this is achieved with native PyTorch parallelisms coupled with NVIDIA optimizations, unlocking peak hardware utilization and cost-effective large-scale MoE training for everyone in the PyTorch community.
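For context, here is a quick back-of-envelope model FLOPs utilization (MFU) calculation against the 989 TFLOPs BF16 peak quoted in the table caption, using the Table 1 numbers:

```python
# Back-of-envelope MFU from Table 1 (peak H100 BF16 = 989 TFLOPs/sec per GPU).
PEAK_H100_BF16 = 989.0

results = {
    "DeepSeek V3 671B (256 GPUs)": 250,
    "Qwen3 MoE 30B (8 GPUs)": 277,
    "GPT-OSS 120B (64 GPUs)": 231,
}
for name, tflops in results.items():
    print(f"{name}: {tflops / PEAK_H100_BF16:.1%} MFU")
# DeepSeek V3 671B (256 GPUs): 25.3% MFU
# Qwen3 MoE 30B (8 GPUs): 28.0% MFU
# GPT-OSS 120B (64 GPUs): 23.4% MFU
```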

Empowering developers through native PyTorch distributed training

By leveraging native PyTorch distributed parallelisms, NeMo Automodel brings high-performance large-scale MoE training directly into the PyTorch ecosystem. This approach eliminates dependency on external or proprietary model-parallel libraries, giving developers full flexibility to scale using tools and APIs they already know.

Most importantly, it reflects NVIDIA's commitment to strengthening PyTorch and the broader open-source AI ecosystem—making large-model training not just faster, but more open, interoperable, and accessible to the entire developer community.

Key benefits for developers:

- Faster iteration cycles: Achieve higher throughput for quicker experimentation and model development.
- Lower training costs: Better GPU utilization means fewer GPU-hours per training run.
- Scalable performance: Consistent, near-linear scaling from eight GPUs to over 1,000 GPUs enables flexible infrastructure planning.
- Native PyTorch integration: Leverages PyTorch distributed to remove reliance on external model-parallel frameworks, keeping everything within the PyTorch workflow.
- Ecosystem commitment: Demonstrates NVIDIA's long-term investment in advancing PyTorch, ensuring future innovations are directly integrated into the core framework.
- Production-ready: Includes proven, battle-tested configurations for leading open-source MoE architectures.

Quick start: train and benchmark large MoE models

Getting started with NeMo Automodel is fast and familiar for any PyTorch developer.

You can use the provided benchmark scripts and configuration files to reproduce results or train your own large-scale MoE models with NVIDIA-optimized performance.

Minimum requirements

At least eight GPUs (80 GB memory each) are recommended to reproduce the benchmark results and run fine-tuning experiments efficiently.

Follow these simple steps to run a benchmark or fine-tuning experiment: 

# 1. Pull the NeMo docker image and start a container
docker pull nvcr.io/nvidia/nemo:25.09
docker run -it -d --ulimit memlock=-1 --ulimit stack=67108864 --gpus all nvcr.io/nvidia/nemo:25.09 bash

# 2. Once inside the container, clone the repo and navigate to Automodel
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel

Run a benchmark

Example: Benchmark Qwen3 MoE 30B on eight GPUs

torchrun --nproc-per-node 8 nemo_automodel/recipes/llm/benchmark.py \
  --config examples/benchmark/configs/qwen3_moe_30b_te_deepep.yaml

Run fine-tuning

Example: Fine-tune Qwen3 MoE 30B

Note: 

- You’ll need to download the model checkpoint from Hugging Face first: hf download Qwen/Qwen3-30B-A3B
- If you encounter a dataset instantiation error, upgrade the datasets library: pip install --upgrade datasets

torchrun --nproc-per-node 8 examples/llm_finetune/finetune.py --config examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml

Available configuration files:

- deepseek_v3_te_deepep.yaml – DeepSeek V3 (671B parameters)
- kimi_k2_te_deepep.yaml – Optimized configuration for Kimi K2
- qwen3_moe_30b_te_deepep.yaml – Qwen3 MoE 30B with full NVIDIA optimizations
- gptoss_20b_te_deepep.yaml – GPT-OSS 20B with FlexAttention
- gptoss_120b_te_deepep.yaml – GPT-OSS 120B production configuration

Check out the docs for complete performance documentation and implementation details.

Looking ahead: Join us in advancing open MoE training

This release marks a major milestone in democratizing large-scale mixture-of-experts (MoE) training with accelerated PyTorch. But it’s only the beginning.

We’re actively working on:

- Expanding model support: Adding new MoE and hybrid architectures.
- Deeper optimizations: Further kernel-level and communication improvements for even higher efficiency.
- Technical deep dives: Detailed explainers of NeMo Automodel MoE design and performance techniques.
- Broader benchmarking: Extending performance validation across diverse hardware and cluster configurations.

We’d love for you to get started with NeMo Automodel and be part of this journey—try the configurations, share your results, and contribute feedback through GitHub Issues. Your insights help shape the next generation of scalable, open AI training tools.
