MarkTechPost@AI · October 16, 12:32
QeRL: 4-Bit Quantization-Enhanced RL Training Brings 32B LLMs to a Single H100

NVIDIA researchers, in collaboration with MIT and others, have open-sourced QeRL (Quantization-enhanced Reinforcement Learning), a framework that pushes reinforcement learning (RL) post-training into 4-bit FP4 (NVFP4) quantization while keeping gradient computation in higher precision via LoRA. QeRL enables RL training of a 32B LLM on a single H100 GPU and delivers substantial speedups. Its core innovation is to move the policy's weight path to NVFP4 and accelerate sampling with efficient FP4×BF16 kernels (Marlin), while LoRA keeps gradient computation stable. In addition, QeRL introduces Adaptive Quantization Noise (AQN), which treats quantization as a schedulable exploration mechanism: Gaussian noise raises policy entropy early in training to encourage exploration and is gradually annealed toward exploitation, yielding faster convergence and higher accuracy on tasks such as mathematical reasoning.

💡 The QeRL framework pushes reinforcement learning (RL) post-training of large language models (LLMs) into 4-bit FP4 (NVFP4) quantization while performing gradient computation in higher precision via LoRA. This makes it possible to train a 32B-parameter LLM on a single H100 GPU, with rollout-phase speedups of more than 1.5× and roughly 1.8× faster end-to-end training than QLoRA.

🚀 QeRL moves the policy's weight path to NVFP4 and integrates hardware-efficient FP4×BF16 kernels such as Marlin, significantly accelerating prefill and decoding during the rollout (sampling) phase. Meanwhile, LoRA limits the number of trainable parameters, keeps backpropagation stable, and removes the need to maintain a separate full-precision policy model, reducing memory footprint and compute cost.

📈 QeRL treats deterministic FP4 quantization as an effective exploration mechanism. The study finds that FP4 quantization raises policy entropy, flattening token distributions early in training and strengthening exploration. To control this effect over the course of training, QeRL introduces Adaptive Quantization Noise (AQN), which injects channel-wise Gaussian perturbations through LayerNorm scale parameters and anneals them with an exponential schedule, enabling a smooth transition from exploration to exploitation.

🎯 On mathematical-reasoning tasks, QeRL outperforms 16-bit LoRA and QLoRA. For example, with Qwen2.5 models, a 7B model reaches 90.8% accuracy on GSM8K and 77.4% on MATH500, matching the full-parameter fine-tuning baseline. Thanks to its improved exploration, QeRL stays competitive or ahead on broader math benchmarks and converges faster.

What would you build if you could run Reinforcement Learning (RL) post-training on a 32B LLM in 4-bit NVFP4—on a single H100—with BF16-level accuracy and 1.2–1.5× step speedups? NVIDIA researchers (with collaborators from MIT, HKU, and Tsinghua) have open-sourced QeRL (Quantization-enhanced Reinforcement Learning), a training framework that pushes Reinforcement Learning (RL) post-training into 4-bit FP4 (NVFP4) while keeping gradient math in higher precision via LoRA. The research team reports >1.5× speedups in the rollout phase, ~1.8× end-to-end vs QLoRA in one setting, and the first demonstration of RL training for a 32B policy on a single H100-80GB GPU.

https://arxiv.org/pdf/2510.11696

What does QeRL change in the Reinforcement Learning (RL) loop?

Most RLHF/GRPO/DAPO pipelines spend the bulk of wall-clock time in rollouts (token generation). QeRL shifts the policy’s weight path to NVFP4 (FP4) with dual-level scaling and keeps logits/gradients in higher precision via LoRA, so backprop remains stable while the sampling path hits hardware-efficient FP4×BF16 kernels (Marlin). The result is faster prefill/decoding during rollouts without maintaining a separate full-precision policy.

Mechanically, the research team integrates Marlin-based FP4 kernels in both rollout and prefill, while LoRA limits trainable parameters. This directly targets the stage that dominates RL cost and latency for long reasoning traces.
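To make the division of labor concrete, below is a minimal PyTorch sketch of the pattern described above: a frozen, (fake-)quantized base weight serving the forward path, with a small trainable LoRA adapter carrying the gradients. The `fake_quant_fp4` helper, the block size, and the single-level scale are illustrative assumptions; NVFP4 uses two-level scaling, and QeRL's actual sampling path dispatches to fused FP4×BF16 kernels (Marlin) rather than simulated rounding.

```python
import torch
import torch.nn as nn

# Representable magnitudes of an E2M1 (FP4) value. NVFP4 additionally uses a
# two-level scale (per-block scale plus a per-tensor scale), which this sketch
# collapses into a single per-block float scale for simplicity.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Simulate weight-only FP4 quantization with per-block absmax scaling."""
    flat = w.reshape(-1, block)                     # assumes numel % block == 0
    scale = flat.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    normed = flat / scale
    idx = (normed.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)  # nearest grid point
    return (FP4_GRID[idx] * normed.sign() * scale).reshape(w.shape)

class QuantLoRALinear(nn.Module):
    """Frozen (simulated) 4-bit base weight plus a trainable low-rank LoRA update."""
    def __init__(self, d_in: int, d_out: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        base = torch.randn(d_out, d_in) * 0.02
        self.register_buffer("w_q", fake_quant_fp4(base))    # frozen buffer, not a Parameter
        self.lora_a = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(d_out, rank))  # zero-init: adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rollout/prefill go through the quantized weight path; gradients flow only through LoRA.
        return x @ self.w_q.T + (x @ self.lora_a.T) @ self.lora_b.T * self.scaling

layer = QuantLoRALinear(1024, 1024)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable LoRA params only
```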


Quantization as exploration, made schedulable

A core empirical finding: deterministic FP4 quantization raises policy entropy, flattening token distributions early in training and improving exploration versus 16-bit LoRA and NF4-based QLoRA baselines. To control that effect over time, QeRL introduces Adaptive Quantization Noise (AQN): channel-wise Gaussian perturbations mapped into LayerNorm scale parameters and annealed with an exponential schedule. This keeps kernel fusion intact (no extra weight tensors) while transitioning from exploration to exploitation.
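A rough sketch of the AQN idea, under stated assumptions: channel-wise Gaussian noise is folded into a LayerNorm's scale vector so that no extra weight tensor is introduced, and its magnitude follows an exponential decay. The `sigma_start`/`sigma_end` values, the multiplicative merge rule, and the apply/remove pattern are illustrative guesses, not QeRL's actual hyperparameters or implementation.

```python
import torch
import torch.nn as nn

def aqn_sigma(step: int, total_steps: int,
              sigma_start: float = 5e-2, sigma_end: float = 5e-4) -> float:
    """Exponentially decay the noise scale from sigma_start to sigma_end."""
    return sigma_start * (sigma_end / sigma_start) ** (step / max(total_steps, 1))

@torch.no_grad()
def apply_aqn(layernorm: nn.LayerNorm, step: int, total_steps: int) -> torch.Tensor:
    """Fold (1 + channel-wise Gaussian noise) into the LayerNorm scale.

    Merging the perturbation into the norm's scale leaves the quantized weight
    tensors and their fused kernels untouched. Returns the noise so it can be
    undone before the gradient update.
    """
    sigma = aqn_sigma(step, total_steps)
    noise = torch.randn_like(layernorm.weight) * sigma   # one sample per channel
    layernorm.weight.mul_(1.0 + noise)
    return noise

@torch.no_grad()
def remove_aqn(layernorm: nn.LayerNorm, noise: torch.Tensor) -> None:
    layernorm.weight.div_(1.0 + noise)

# Usage: perturb before a rollout, restore before updating the LoRA weights.
ln = nn.LayerNorm(4096)
noise = apply_aqn(ln, step=100, total_steps=10_000)
remove_aqn(ln, noise)
```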

In ablations, QeRL shows faster reward growth and higher final scores on math-reasoning tasks under both GRPO and DAPO, aligning with the hypothesis that structured noise in parameter space can be a useful exploration driver in RL, even though such noise is typically detrimental in supervised fine-tuning.
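For reference, since the ablations are run under GRPO and DAPO, here is a minimal sketch of the group-relative advantage that GRPO is built on: sample a group of rollouts per prompt, score them, and normalize each reward by the group's mean and standard deviation, with no learned critic. The reward values and group size below are placeholders, not the paper's setup.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """group_rewards: (num_prompts, group_size) scalar rewards for sampled rollouts."""
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True)
    return (group_rewards - mean) / (std + eps)   # per-rollout advantage, critic-free

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],     # e.g. 1 = correct final answer
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```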

Reported results

On Qwen2.5 backbone models, the research team shows that NVFP4+LoRA outperforms vanilla LoRA and QLoRA in rollout throughput and overall training time, with >2× rollout throughput on 14B/32B models against QLoRA and ~1.8× end-to-end speedup vs QLoRA in a representative setup. They also demonstrate training a 32B policy with GRPO on a single H100-80GB, enabled by the lower memory footprint of weight-only FP4.
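The single-H100 claim is easy to sanity-check with back-of-the-envelope arithmetic: 32B parameters at 4 bits per weight take roughly 16 GB, versus roughly 64 GB in BF16, leaving headroom on an 80 GB card. The sketch below ignores activations, KV cache, LoRA/optimizer state, and quantization-scale overhead, so it is a rough bound rather than a memory plan.

```python
# Rough weight-memory comparison for a 32B-parameter policy.
params = 32e9
bf16_gb = params * 2 / 1e9     # 2 bytes per weight  -> ~64 GB
fp4_gb  = params * 0.5 / 1e9   # 0.5 bytes per weight -> ~16 GB
print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP4 weights: ~{fp4_gb:.0f} GB")
```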

Accuracy is competitive with higher-precision baselines. For a 7B model, the research team reports GSM8K = 90.8% and MATH500 = 77.4%, surpassing 16-bit LoRA and QLoRA under their setup and matching full-parameter fine-tuning. Across broader math benchmarks (e.g., BigMath), QeRL maintains parity or advantage, while converging faster due to improved exploration.


What this is, and what it isn't

QeRL is weight-only FP4 with LoRA updates; it does not claim FP4 precision for logits/gradients. The benefits concentrate in rollout/prefill throughput and memory footprint, with empirical evidence that quantization-induced entropy aids RL exploration when AQN modulates it over training. Generalization to modalities beyond math-reasoning tasks or to safety/tool-use RL depends on reward design and sequence lengths.

Key Takeaways

Editorial Comments

QeRL speeds up the RL rollout stage. It quantizes weights to NVFP4 and keeps updates and logits in higher precision using LoRA. It reports >1.5× rollout speedups and can train a 32B policy on a single H100-80GB GPU. It adds Adaptive Quantization Noise to make exploration a controlled signal during training. Results are shown mainly on math-reasoning tasks using GRPO and DAPO. The gains rely on NVFP4 kernel support such as Marlin.



