NVIDIA Developer | September 3
NeMo-RL v0.3 introduces Megatron-Core, substantially improving large-model training efficiency

NVIDIA NeMo-RL v0.3 introduces a backend based on Megatron-Core that significantly improves the efficiency of reinforcement learning (RL) post-training for large language models. Compared with the previous PyTorch DTensor (FSDP2) backend, Megatron-Core delivers faster training and higher throughput on large models (such as Llama 70B) through GPU-optimized kernels, 6D parallelism, and techniques such as sequence packing and importance sampling, while maintaining the same convergence. The new release also simplifies Megatron-Core configuration, making it easier for developers to take advantage of these optimizations. In addition, NeMo-RL v0.3 supports long-context training (for example, 16k sequence lengths) and new features such as async rollouts, further extending its reach in model post-training.

🚀 **Megatron-Core backend boosts training efficiency**: NVIDIA NeMo-RL v0.3 integrates Megatron-Core as a backend for RL post-training, delivering significant performance gains over PyTorch DTensor on large models. Megatron-Core uses GPU-optimized kernels, 6D parallelism (including tensor, pipeline, data, and sequence parallelism), and memory optimizations to overcome the activation-memory overhead and compute-efficiency bottlenecks DTensor hits on models with hundreds of billions of parameters, achieving faster training and higher throughput.

💡 **Key optimizations enable high performance**: The Megatron-Core backend further improves training efficiency and stability through key optimizations such as sequence packing and importance sampling. Sequence packing concatenates multiple sequences to make full use of the maximum sequence length, reducing padding tokens; it is especially effective when sequence lengths vary widely. Importance sampling weights each sample to compensate for differences between training and inference probabilities, reducing training variance and ensuring convergence comparable to the DTensor policy.

🔧 **Simplified configuration, easy to adopt**: Although Megatron-Core exposes many low-level options, NeMo-RL v0.3 makes it easy to enable: add a `policy.megatron_cfg` section to the YAML configuration and set `enabled: true`. NeMo-RL handles much of the complex tuning automatically behind the scenes and presents a simpler, more intuitive configuration interface, lowering the barrier to entry so developers can focus on model training itself.

🌟 **Long-context support and other new features**: Beyond optimizing standard training, NeMo-RL v0.3 extends support to long-context training, for example training Llama 3.3 70B effectively at a 16k sequence length. The release also adds async rollouts, which use the vLLM async engine to speed up multi-turn RL training by 2-3x, and supports non-colocated generation, which allows the training and generation backends to run on different sets of GPUs, providing greater flexibility for complex scenarios.

📈 **Performance comparisons validate the gains**: Benchmarks on models such as Llama 3.1 8B and 70B show that training with the Megatron-Core backend outperforms PyTorch DTensor in total step time, policy training, refit, and generation. For example, Llama 3.1 70B shows a markedly shorter step time with Megatron-Core, and the GRPO reward curves for Megatron-Core and DTensor converge similarly, demonstrating its advantages for RL post-training of large models.

The initial release of NVIDIA NeMo-RL included training support through PyTorch DTensor (otherwise known as FSDP2). This backend enables native integration with the HuggingFace ecosystem, quick experimentation, and scaling with PyTorch native parallelisms (FSDP2, tensor parallel, sequence parallel, and context parallel). 

However, when model sizes approach hundreds of billions of parameters, the DTensor path becomes insufficient. Activation memory from large models introduces significant recompute overhead, resulting in infeasibly slow step times. The DTensor path also lacks optimized NVIDIA CUDA kernels and other performance enhancements necessary for optimal throughput. These challenges highlight the need for a more efficient solution, which is exactly what the NVIDIA Megatron-Core library is designed to provide.

Explore the latest NeMo-RL v0.3 release, where you’ll find detailed documentation, example scripts, and configuration files to efficiently post-train large models with Megatron-Core backend support.

Reinforcement learning with the Megatron backend

Built with GPU-optimized techniques and high-throughput performance enhancements, Megatron-Core enables seamless training of massive language models. The library’s 6D parallelism strategy optimizes communication and computation patterns and supports a diverse range of model architectures.

NeMo-RL has added support for Megatron-Core, enabling developers to use these optimizations during post-training. While Megatron-Core offers many low-level settings, configuring them can be overwhelming for those new to the library. NeMo-RL streamlines this process by automatically handling much of the complex tuning behind the scenes and instead presenting users with a simpler, more intuitive set of configuration options.

Getting started with Megatron training

Enabling Megatron-based training is straightforward. Add the policy.megatron_cfg section to your YAML configuration:

policy:
  ...
  megatron_cfg:
    enabled: true
    activation_checkpointing: false
    tensor_model_parallel_size: 1
    pipeline_model_parallel_size: 1
    ...

    optimizer:
      ...

    scheduler:
      ...

    distributed_data_parallel_config:
      grad_reduce_in_fp32: false
      overlap_grad_reduce: true
      overlap_param_gather: true
      average_in_collective: true
      use_custom_fsdp: false
      data_parallel_sharding_strategy: "optim_grads_params"

See a complete working example.

All arguments within the config are forwarded to Megatron during training. After adding the megatron_cfg section to your config and setting enabled: true, you're ready to train a model. Launching training works the same way as with DTensor, as described in the README or in our guide on reproducing DeepScaleR.
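For example, a minimal single-node launch for the 8B recipe looks like the following. This is a sketch using the example GRPO math script and the grpo_math_8B_megatron.yaml config that also appear in the Results section below; paths are relative to the NeMo-RL repository root.

## Minimal sketch: single-node GRPO run with the Megatron-Core backend.
## grpo_math_8B_megatron.yaml already contains the megatron_cfg section with
## enabled: true, so no extra overrides are required.
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B_megatron.yaml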

Results

Megatron-based training supports both dense and Mixture of Experts (MoE) models. The following shows a step time breakdown for Group Relative Policy Optimization (GRPO) on a few commonly used models. The timing reported in the table is an average over steps 22-29 of each training run.

| Model | Backend | Nodes | GPUs per node | Total step time (s) | Policy training (s) | Refit (s) | Generation (s) | Get logprobs (s) | Avg. generated tokens per sample |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1-8B Instruct | Megatron | 1 | 8 | 112 | 28 | 5 | 58 | 18 | 795 |
| Llama 3.1-8B Instruct | PyT DTensor | 1 | 8 | 122 | 38 | 4 | 57 | 19 | 777 |
| Llama 3.1-70B Base | Megatron | 8 | 8 | 147 | 28 | 14 | 84 | 18 | 398 |
| Llama 3.1-70B Base | PyT DTensor* | 8 | 8 | 230 | 97 | 15 | 82 | 28 | 395 |
| Qwen3 32B** | Megatron | 8 | 8 | 213 | 68 | 7 | 96 | 40 | 3283 |
| Qwen3 30B-A3B** | Megatron | 8 | 8 | 167 | 50 | 12 | 78 | 23 | 3251 |
Table 1. Performance comparison of the Megatron-Core and PyTorch DTensor backends across training configurations

All runs were conducted with the following settings: max sequence length 4096, rollout batch size 2048, global batch size 512, and sequence packing enabled (see the next section for details on sequence packing). For the Megatron-Core runs, Llama 3.1 8B was run with data parallelism only, Llama 3.1 70B with 4-way tensor and 4-way pipeline parallelism, Qwen3 32B with 4-way tensor and 2-way pipeline parallelism, and Qwen3 30B-A3B with 8-way expert and 2-way tensor parallelism.
*Llama 70B DTensor results were gathered using dynamic batching rather than sequence packing because of a known out-of-memory issue with sequence packing.
**Qwen3 32B and 30B-A3B DTensor fail due to a known assertion error. See the issue.
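For reference, the parallelism layouts listed above map onto fields in the megatron_cfg block shown earlier. The following is a minimal sketch for the Llama 3.1 70B run, assuming megatron_cfg fields accept the same dotted-path command-line overrides used for dtensor_cfg in the commands later in this post; setting the values directly in the YAML works as well.

## Sketch: 4-way tensor and 4-way pipeline parallelism for the 70B Megatron run,
## with sequence packing enabled as in Table 1. Override names mirror the
## megatron_cfg YAML fields shown above.
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_70B_megatron.yaml \
    policy.megatron_cfg.tensor_model_parallel_size=4 \
    policy.megatron_cfg.pipeline_model_parallel_size=4 \
    policy.sequence_packing.enabled=True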
Figure 1. Total step time comparison for Llama 3.1 8B instruct model using Megatron-core and PyTorch DTensor backends

By using performance optimizations provided by Megatron-Core, we achieved superior training performance relative to DTensor with the same convergence properties, as shown.

Figure 3. Llama 8B GRPO Megatron-Core vs PyT DTensor reward curves
Figure 4. 70B GRPO Megatron-Core vs DTensor reward curves

The following commands were used to generate these reward curves:

## 8B -- requires a single node
## dtensor
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml \
    loss_fn.use_importance_sampling_correction=True

## megatron
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B_megatron.yaml \
    policy.sequence_packing.enabled=True loss_fn.use_importance_sampling_correction=True

## 70B -- requires 8 nodes
## dtensor
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml \
    policy.model_name=meta-llama/Llama-3.1-70B policy.tokenizer.name=meta-llama/Llama-3.1-70B-Instruct \
    policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=4096 \
    cluster.num_nodes=8 policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 \
    policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=False \
    loss_fn.use_importance_sampling_correction=True

## megatron
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_70B_megatron.yaml \
    policy.model_name=meta-llama/Llama-3.1-70B policy.tokenizer.name=meta-llama/Llama-3.3-70B-Instruct \
    policy.sequence_packing.enabled=True loss_fn.use_importance_sampling_correction=True

These runs use the following performance and convergence enhancements to achieve both optimal throughput and stable convergence:

- **Sequence packing**: Multiple sequences are packed up to max_total_sequence_length. Sequence packing reduces the number of padding tokens and is particularly useful when sequence lengths vary widely. For Llama 70B, enabling sequence packing yields an approximate 1x reduction in overall step time with no impact on convergence. This enhancement is supported for both the Megatron-Core and DTensor backends. For more details on sequence packing in NeMo-RL, refer to our documentation.
- **Importance sampling**: NeMo-RL uses different frameworks for inference and training to achieve the best performance; however, there may be small differences in token probabilities between training and inference. One way to mitigate this issue is importance sampling, which assigns each sample a weight that is a function of the inference and training probabilities. Enabling importance sampling reduces the variance between runs and yields closer convergence between the Megatron-Core and DTensor policies (both settings are shown in the sketch after this list). For more information on importance sampling in NeMo-RL, refer to our documentation.
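Both enhancements are switched on through the same overrides used in the reward-curve commands above; here they are shown in isolation as a minimal sketch. The comment describing the importance weight paraphrases the explanation above (commonly the weight is the ratio of the training to the inference probability); consult the NeMo-RL documentation for the exact formulation.

## Sketch: enable sequence packing and importance-sampling correction.
## The correction weights each token's contribution to the loss by a function
## of the training and inference probabilities (commonly their ratio), so small
## train/inference mismatches do not bias the policy gradient.
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B_megatron.yaml \
    policy.sequence_packing.enabled=True \
    loss_fn.use_importance_sampling_correction=True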

Long sequence support

We can also use context parallelism with Megatron-Core and DTensor for long-context training. For example, the following shows current performance results for Llama 3.3 70B at 16k sequence length using the Megatron backend. Even longer sequence lengths are supported, and performance optimizations for long context training are ongoing.

| Model | Max sequence length | Nodes | GPUs per node | Context parallel size | Total step time (s) | Policy training (s) | Refit (s) | Generation (s) | Get logprobs (s) | Avg. generated tokens per sample |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.3-70B Instruct | 16,384 | 16 | 8 | 4 | 445 | 64 | 17 | 287 | 75 | 749 |

Table 2. Performance of Llama 3.3-70B Instruct with a 16K context window using the Megatron backend
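A configuration sketch for this long-context run is shown below. It assumes context parallelism is exposed as a context_parallel_size field under megatron_cfg (the field name follows Megatron-Core naming and is an assumption here); the remaining overrides reuse options that appear in the earlier commands, with values taken from Table 2.

## Sketch: 16k-context GRPO for Llama 3.3 70B with 4-way context parallelism
## on 16 nodes, matching Table 2. context_parallel_size is assumed to be a
## megatron_cfg field; check the NeMo-RL docs for the exact option name.
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_70B_megatron.yaml \
    policy.model_name=meta-llama/Llama-3.3-70B-Instruct \
    policy.max_total_sequence_length=16384 \
    policy.megatron_cfg.context_parallel_size=4 \
    cluster.num_nodes=16 \
    policy.sequence_packing.enabled=True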

Other notable features

In addition to the Megatron training backend, NeMo-RL v0.3 introduces several exciting features that help democratize efficient post-training on a wide range of models:

- **Async rollouts**: Users can now switch on the vLLM async engine by setting policy.generation.async_engine=True, which speeds up multi-turn RL by 2-3x (see the sketch after this list).
- **Non-colocated generation (DTensor backend)**: Users now have the option to place the training and generation backends on different sets of GPUs. This can be useful if training and generation have incompatible parallelisms or world sizes, or if memory after offloading for training or generation is not low enough with colocation. See the 0.3.0 release notes for more details.
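As a minimal sketch, the async engine can be enabled as a command-line override on the example GRPO recipe; the policy.generation.async_engine=True setting is quoted from the feature description above, while the surrounding command simply mirrors the earlier examples.

## Sketch: turn on the vLLM async engine for faster multi-turn RL rollouts.
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml \
    policy.generation.async_engine=True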

Coming soon

Stay on the lookout for the following features coming very soon:

- Efficient support for larger MoE models using the Megatron backend, to run models on the order of hundreds of billions of parameters, including DeepSeek-V3 and Qwen3-235B-A22B.
- Highly optimized refit.
- FP8 generation support.
- Megatron and DTensor VLM support.
- Non-colocated generation with the Megatron-Core backend.

Conclusion

In this post, we showed how NeMo-RL v0.3 with the Megatron-Core backend significantly improves reinforcement learning training throughput compared to PyTorch DTensor, especially for large models like Llama 70B. With GPU-optimized kernels, 6D parallelism, and features like sequence packing and importance sampling, NeMo-RL ensures both efficiency and convergence across model scales. We also showed how long-context training is supported, delivering strong performance even at 16k sequence lengths.

Explore the NVIDIA NeMo RL documentation, example configs, and scripts to start post-training your large models with Megatron-Core optimizations.
