MarkTechPost@AI · 4 hours ago
Predictable RL Post-Training: Meta and Collaborators Propose ScaleRL

A research team from Meta and other institutions has proposed a novel compute-performance framework for predicting how language models (LLMs) behave during reinforcement learning (RL) post-training. Unlike pre-training, RL post-training has lacked predictable scaling rules, making it hard for researchers to estimate returns before committing large amounts of compute. The new framework models the progress of RL training by fitting a sigmoidal curve, and supplies an optimized recipe called ScaleRL that follows the predicted curve at up to 100,000 GPU-hours of compute. The study validates the framework with more than 400,000 GPU-hours of experiments, letting researchers judge early on whether continued compute investment is worthwhile, and turning RL post-training from a trial-and-error process into predictable engineering practice.

📈 **Sigmoidal curves predict RL training progress**: The study finds that, unlike the power laws commonly used for pre-training, sigmoidal curves fit the relationship between RL post-training objectives (e.g., pass rate / mean reward) and compute more stably and accurately, especially when extrapolating from small experiments to large compute budgets. The fit predicts the asymptotic performance ceiling, the efficiency, and the midpoint where gains are fastest, providing predictability early in training.

🛠️ **The ScaleRL recipe**: ScaleRL is a validated combination of optimization techniques designed for stable, extrapolatable scaling. It combines asynchronous pipeline RL (for off-policy throughput), the CISPO loss, FP32 precision (to avoid numerical mismatches), prompt-level loss averaging, batch-level advantage normalization, forced length interruptions, zero-variance filtering, and no-positive resampling, which together keep training stable and efficient.

📊 **Separating the "ceiling" from "efficiency"**: The framework distinguishes how different design choices affect model performance. Factors such as model scale (e.g., MoE) and longer generation lengths raise the final asymptotic ceiling, while loss aggregation, advantage normalization, data curriculum, and the off-policy pipeline mainly affect how quickly that ceiling is reached, i.e., compute efficiency. This distinction helps researchers prioritize raising the ceiling before optimizing efficiency.

🚀 **Early forecasting and downstream transfer**: Through more than 400,000 GPU-hours of experiments, the study shows that after roughly 1,000-2,000 GPU-hours one can fit the sigmoidal curve and forecast whether scaling to 100,000 GPU-hours is worthwhile. Pass-rate gains on an i.i.d. validation set also track downstream evaluation metrics (e.g., AIME-24), demonstrating the external validity of the compute-performance curves.

💡 **Predictive engineering instead of trial and error**: The core contribution of this work is turning RL post-training from a process driven by intuition and repeated trial into a predictable engineering discipline. With sigmoidal curve fitting and the ScaleRL recipe, the team could accurately predict and match the results of scaled-up training runs, providing a solid methodological foundation for RL post-training of LLMs.

Reinforcement learning (RL) post-training is now a major lever for reasoning-centric LLMs, but unlike pre-training it has not had predictive scaling rules. Teams pour tens of thousands of GPU-hours into runs without a principled way to estimate whether a recipe will keep improving with more compute. New research from Meta, UT Austin, UCL, Berkeley, Harvard, and Periodic Labs provides a compute-performance framework, validated over more than 400,000 GPU-hours, that models RL progress with a sigmoidal curve and supplies a tested recipe, ScaleRL, that follows those predicted curves up to 100,000 GPU-hours.

Fit a sigmoid, not a power law

Pre-training often fits power laws (loss vs. compute). RL fine-tuning, by contrast, targets bounded metrics (e.g., pass rate / mean reward). The research team shows that sigmoidal fits of pass rate vs. training compute are empirically more robust and stable than power-law fits, especially when extrapolating from smaller runs to larger budgets. They exclude the very early, noisy regime (roughly the first 1.5k GPU-hours) and fit the predictable portion that follows. The sigmoidal parameters have intuitive roles: one sets the asymptotic performance (the ceiling), another the efficiency/exponent, and another the midpoint where gains are fastest.

https://arxiv.org/pdf/2510.13786
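To make the curve concrete, here is a minimal sketch of fitting a sigmoidal form to early pass-rate measurements. The parameterization `R(C) = ceiling / (1 + (midpoint / C) ** efficiency)`, the synthetic data, and the coarse grid search are all illustrative assumptions, not the paper's exact fitting procedure (which would use a proper nonlinear least-squares solver):

```python
def sigmoid_curve(compute, ceiling, efficiency, midpoint):
    # Saturating curve in compute: pass rate approaches `ceiling` as
    # compute grows; `midpoint` is where gains are fastest and
    # `efficiency` controls how sharp the transition is.
    return ceiling / (1.0 + (midpoint / compute) ** efficiency)

# Synthetic "pass rate vs GPU-hours" observations (hypothetical values),
# skipping the noisy first ~1.5k GPU-hours as the paper recommends.
TRUE_PARAMS = (0.6, 1.2, 4000.0)
data = [(c, sigmoid_curve(c, *TRUE_PARAMS))
        for c in (1500, 2000, 3000, 4000, 6000, 8000)]

def fit(observations):
    # Coarse grid search over (ceiling, efficiency, midpoint); a sketch
    # standing in for real nonlinear least-squares fitting.
    best, best_err = None, float("inf")
    for a in [x / 100 for x in range(30, 91, 5)]:
        for b in [x / 10 for x in range(5, 31)]:
            for m in range(1000, 10001, 500):
                err = sum((sigmoid_curve(c, a, b, m) - y) ** 2
                          for c, y in observations)
                if err < best_err:
                    best, best_err = (a, b, m), err
    return best

ceiling, efficiency, midpoint = fit(data)
```

With clean synthetic data the search recovers the generating parameters; on real measurements the quality of the extrapolated ceiling depends on how far past the noisy early regime the fit window extends.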

Why that matters: after roughly 1-2k GPU-hours, you can fit the curve and forecast whether pushing on to 10k-100k GPU-hours is worth it, before you burn the budget. The research also shows that power-law fits can produce misleading ceilings unless you fit only at very high compute, which defeats the purpose of early forecasting.
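Once fitted, the curve answers the budgeting question directly: how much additional pass rate would extending the run buy? A toy forecast, restating the illustrative sigmoid with assumed fitted parameters (the 0.05 decision threshold is likewise an arbitrary, project-specific assumption):

```python
def pass_rate(compute, ceiling=0.6, efficiency=1.2, midpoint=4000.0):
    # Sigmoid assumed to be already fitted on an early run
    # (illustrative parameter values, not results from the paper).
    return ceiling / (1.0 + (midpoint / compute) ** efficiency)

# Fitted at ~2k GPU-hours; is pushing to 100k GPU-hours worth it?
gain = pass_rate(100_000) - pass_rate(2_000)
worth_it = gain > 0.05  # project-specific threshold (an assumption)
```

The same arithmetic run on a pessimistic fit (low ceiling, small remaining gain) is what lets a team stop early instead of discovering the plateau at full budget.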

ScaleRL: a recipe that scales predictably

ScaleRL is not just a new algorithm; it is a composition of choices that produced stable, extrapolatable scaling in the study:

- Asynchronous pipeline RL (PipelineRL-style) for off-policy throughput
- The CISPO loss
- FP32 precision at the logits, avoiding numerical mismatches
- Prompt-level loss averaging
- Batch-level advantage normalization
- Forced length interruptions
- Zero-variance filtering
- No-positive resampling

The research team validated each component with leave-one-out (LOO) ablations at 16k GPU-hours and showed that ScaleRL's fitted curves reliably extrapolate from 8k to 16k GPU-hours, then hold at much larger scales, including a single run extended to 100k GPU-hours.
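Two of the recipe's data-side components, zero-variance filtering and batch-level advantage normalization, can be sketched in a few lines. This illustrates the ideas as described above, not the paper's implementation; the function name and the per-prompt grouping convention are hypothetical:

```python
def prepare_advantages(reward_groups):
    # reward_groups: one list of scalar rewards per prompt, with one
    # reward per sampled completion for that prompt.
    # Zero-variance filtering: drop prompts whose completions all got
    # the same reward -- they carry no learning signal.
    kept = [g for g in reward_groups if max(g) != min(g)]
    if not kept:
        return []
    # Subtract each prompt's mean reward as a baseline...
    centered = [[r - sum(g) / len(g) for r in g] for g in kept]
    # ...then scale by a single std computed over the whole batch
    # (batch-level advantage normalization), not per prompt.
    flat = [a for g in centered for a in g]
    mean = sum(flat) / len(flat)
    std = (sum((a - mean) ** 2 for a in flat) / len(flat)) ** 0.5 or 1.0
    return [[a / std for a in g] for g in centered]
```

Normalizing at the batch level rather than per prompt avoids amplifying noise from prompts where the reward spread happens to be tiny.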


Results and generalization

Two key demonstrations:

1. Predictability at scale: for an 8B dense model and a Llama-4 17B×16 MoE (“Scout”), extended training closely followed the sigmoid extrapolations derived from smaller-compute segments.
2. Downstream transfer: pass-rate improvements on an i.i.d. validation set track downstream evaluation (e.g., AIME-24), suggesting the compute-performance curve isn’t a dataset artifact.

The research also compares fitted curves for prevalent recipes (e.g., DeepSeek (GRPO), Qwen-2.5 (DAPO), Magistral, MiniMax-M1) and reports higher asymptotic performance and better compute efficiency for ScaleRL in their setup.


Which knobs move the ceiling vs the efficiency?

The framework lets you classify design choices:

- Ceiling movers: model scale (e.g., MoE) and longer generation lengths raise the asymptotic performance ceiling.
- Efficiency movers: loss aggregation, advantage normalization, data curriculum, and the off-policy pipeline mainly change how fast the run approaches that ceiling.

Operationally, the research team advises fitting curves early and prioritizing interventions that raise the ceiling, then tuning the efficiency knobs to reach it faster at fixed compute.
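The split maps cleanly onto the sigmoid's parameters: ceiling movers change the asymptote, efficiency movers change the exponent. A small numeric illustration, with all parameter values assumed for demonstration:

```python
def pass_rate(compute, ceiling, efficiency, midpoint=4000.0):
    # Illustrative sigmoid in compute (assumed parameter values).
    return ceiling / (1.0 + (midpoint / compute) ** efficiency)

BUDGET = 16_000  # GPU-hours
base           = pass_rate(BUDGET, 0.50, 1.0)
raised_ceiling = pass_rate(BUDGET, 0.60, 1.0)  # e.g. bigger model, longer generations
raised_eff     = pass_rate(BUDGET, 0.50, 1.6)  # e.g. better loss aggregation

# At very large compute, only the ceiling change still matters:
LARGE = 10_000_000
gap_ceiling = pass_rate(LARGE, 0.60, 1.0) - pass_rate(LARGE, 0.50, 1.0)
gap_eff     = pass_rate(LARGE, 0.50, 1.6) - pass_rate(LARGE, 0.50, 1.0)
```

Both interventions help at a fixed 16k GPU-hour budget, but only the ceiling change survives in the limit, which is why the advice is to fix the asymptote first.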

Key Takeaways

- RL post-training progress is well described by sigmoidal compute-performance curves (pass rate vs. log compute), which extrapolate more reliably than power laws.
- Curves fitted after roughly 1-2k GPU-hours can forecast whether scaling to 10k-100k GPU-hours is worthwhile.
- ScaleRL combines asynchronous pipeline RL, the CISPO loss, FP32 logits, prompt-level loss averaging, batch-level advantage normalization, forced length interruptions, zero-variance filtering, and no-positive resampling.
- Design choices split into those that raise the asymptotic ceiling (model scale, longer generations) and those that improve compute efficiency (loss aggregation, normalization, curriculum, off-policy pipeline).
- The framework was validated over more than 400,000 GPU-hours, including a single run extended to 100,000 GPU-hours.

Editorial Comments

This work turns RL post-training from trial-and-error into forecastable engineering. It fits sigmoidal compute-performance curves (pass-rate vs. log compute) to predict returns and decide when to stop or scale. It also provides a concrete recipe, ScaleRL, that uses PipelineRL-style asynchronous generation/training, the CISPO loss, and FP32 logits for stability. The study reports >400,000 GPU-hours of experiments and a single-run extension to 100,000 GPU-hours. Results support a clean split: some choices raise the asymptote; others mainly improve compute efficiency. That separation helps teams prioritize ceiling-moving changes before tuning throughput knobs.



The post Sigmoidal Scaling Curves Make Reinforcement Learning RL Post-Training Predictable for LLMs appeared first on MarkTechPost.

