MarkTechPost@AI · 4 hours ago
Predictable RL Post-Training: Meta and Collaborators Propose ScaleRL

A research team from Meta and other institutions has proposed a novel compute-performance framework for predicting how language models (LLMs) behave during reinforcement learning (RL) post-training. Unlike pre-training, RL post-training has lacked predictable scaling rules, making it hard for researchers to estimate returns before committing large amounts of compute. The new framework models the progress of RL training by fitting a sigmoidal curve, and supplies an optimized recipe called ScaleRL that follows the predicted curve at up to 100,000 GPU-hours of compute. The study validates the framework with more than 400,000 GPU-hours of experiments, letting researchers judge early on whether continued compute investment is worthwhile, and turning RL post-training from a trial-and-error process into predictable engineering practice.

📈 **Sigmoidal curves predict RL training progress**: The study finds that, unlike the power laws commonly used for pre-training, sigmoidal curves fit the relationship between RL post-training objectives (e.g., pass rate / mean reward) and compute more stably and accurately, especially when extrapolating from small experiments to large compute budgets. The fit predicts the asymptotic performance ceiling, the efficiency, and the midpoint where gains are fastest, providing predictability early in training.

🛠️ **The ScaleRL recipe**: ScaleRL is a validated combination of optimization techniques designed for stable, extrapolatable scaling. It combines asynchronous pipeline RL (for off-policy throughput), the CISPO loss, FP32 precision (to avoid numerical mismatches), prompt-level loss averaging, batch-level advantage normalization, forced length interruptions, zero-variance filtering, and no-positive resampling, which together keep training stable and efficient.

📊 **Separating the "ceiling" from "efficiency"**: The framework distinguishes how different design choices affect model performance. Factors such as model scale (e.g., MoE) and longer generation lengths raise the final asymptotic ceiling, while loss aggregation, advantage normalization, data curriculum, and the off-policy pipeline mainly affect how quickly that ceiling is reached, i.e., compute efficiency. This distinction helps researchers prioritize raising the ceiling before optimizing efficiency.

🚀 **Early forecasting and downstream transfer**: Through more than 400,000 GPU-hours of experiments, the study shows that after roughly 1,000-2,000 GPU-hours one can fit the sigmoidal curve and forecast whether scaling to 100,000 GPU-hours is worthwhile. Pass-rate gains on an i.i.d. validation set also track downstream evaluation metrics (e.g., AIME-24), demonstrating the external validity of the compute-performance curves.

💡 **Predictive engineering instead of trial and error**: The core contribution of this work is turning RL post-training from a process driven by intuition and repeated trial into a predictable engineering discipline. With sigmoidal curve fitting and the ScaleRL recipe, the team could accurately predict and match the results of scaled-up training runs, providing a solid methodological foundation for RL post-training of LLMs.

Reinforcement learning (RL) post-training is now a major lever for reasoning-centric LLMs, but unlike pre-training it has not had predictive scaling rules. Teams pour tens of thousands of GPU-hours into runs without a principled way to estimate whether a recipe will keep improving with more compute. New research from Meta, UT Austin, UCL, Berkeley, Harvard, and Periodic Labs provides a compute-performance framework, validated over more than 400,000 GPU-hours, that models RL progress with a sigmoidal curve and supplies a tested recipe, ScaleRL, that follows those predicted curves up to 100,000 GPU-hours.

Fit a sigmoid, not a power law

Pre-training often fits power laws (loss vs. compute). RL fine-tuning, by contrast, targets bounded metrics (e.g., pass rate / mean reward). The research team shows that sigmoidal fits of pass rate vs. training compute are empirically more robust and stable than power-law fits, especially when extrapolating from smaller runs to larger budgets. They exclude the very early, noisy regime (roughly the first 1.5k GPU-hours) and fit the predictable portion that follows. The sigmoidal parameters have intuitive roles: one sets the asymptotic performance (the ceiling), another the efficiency/exponent, and another the midpoint where gains are fastest.

https://arxiv.org/pdf/2510.13786
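To make the curve concrete, here is a minimal sketch of fitting a sigmoidal form to early pass-rate measurements. The parameterization `R(C) = ceiling / (1 + (midpoint / C) ** efficiency)`, the synthetic data, and the coarse grid search are all illustrative assumptions, not the paper's exact fitting procedure (which would use a proper nonlinear least-squares solver):

```python
def sigmoid_curve(compute, ceiling, efficiency, midpoint):
    # Saturating curve in compute: pass rate approaches `ceiling` as
    # compute grows; `midpoint` is where gains are fastest and
    # `efficiency` controls how sharp the transition is.
    return ceiling / (1.0 + (midpoint / compute) ** efficiency)

# Synthetic "pass rate vs GPU-hours" observations (hypothetical values),
# skipping the noisy first ~1.5k GPU-hours as the paper recommends.
TRUE_PARAMS = (0.6, 1.2, 4000.0)
data = [(c, sigmoid_curve(c, *TRUE_PARAMS))
        for c in (1500, 2000, 3000, 4000, 6000, 8000)]

def fit(observations):
    # Coarse grid search over (ceiling, efficiency, midpoint); a sketch
    # standing in for real nonlinear least-squares fitting.
    best, best_err = None, float("inf")
    for a in [x / 100 for x in range(30, 91, 5)]:
        for b in [x / 10 for x in range(5, 31)]:
            for m in range(1000, 10001, 500):
                err = sum((sigmoid_curve(c, a, b, m) - y) ** 2
                          for c, y in observations)
                if err < best_err:
                    best, best_err = (a, b, m), err
    return best

ceiling, efficiency, midpoint = fit(data)
```

With clean synthetic data the search recovers the generating parameters; on real measurements the quality of the extrapolated ceiling depends on how far past the noisy early regime the fit window extends.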

Why that matters: after roughly 1-2k GPU-hours, you can fit the curve and forecast whether pushing on to 10k-100k GPU-hours is worth it, before you burn the budget. The research also shows that power-law fits can produce misleading ceilings unless you fit only at very high compute, which defeats the purpose of early forecasting.
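Once fitted, the curve answers the budgeting question directly: how much additional pass rate would extending the run buy? A toy forecast, restating the illustrative sigmoid with assumed fitted parameters (the 0.05 decision threshold is likewise an arbitrary, project-specific assumption):

```python
def pass_rate(compute, ceiling=0.6, efficiency=1.2, midpoint=4000.0):
    # Sigmoid assumed to be already fitted on an early run
    # (illustrative parameter values, not results from the paper).
    return ceiling / (1.0 + (midpoint / compute) ** efficiency)

# Fitted at ~2k GPU-hours; is pushing to 100k GPU-hours worth it?
gain = pass_rate(100_000) - pass_rate(2_000)
worth_it = gain > 0.05  # project-specific threshold (an assumption)
```

The same arithmetic run on a pessimistic fit (low ceiling, small remaining gain) is what lets a team stop early instead of discovering the plateau at full budget.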

ScaleRL: a recipe that scales predictably

ScaleRL is not just a new algorithm; it is a composition of choices that produced stable, extrapolatable scaling in the study:

- Asynchronous pipeline RL (PipelineRL-style) for off-policy throughput
- The CISPO loss
- FP32 precision at the logits, avoiding numerical mismatches
- Prompt-level loss averaging
- Batch-level advantage normalization
- Forced length interruptions
- Zero-variance filtering
- No-positive resampling

The research team validated each component with leave-one-out (LOO) ablations at 16k GPU-hours and showed that ScaleRL's fitted curves reliably extrapolate from 8k to 16k GPU-hours, then hold at much larger scales, including a single run extended to 100k GPU-hours.
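Two of the recipe's data-side components, zero-variance filtering and batch-level advantage normalization, can be sketched in a few lines. This illustrates the ideas as described above, not the paper's implementation; the function name and the per-prompt grouping convention are hypothetical:

```python
def prepare_advantages(reward_groups):
    # reward_groups: one list of scalar rewards per prompt, with one
    # reward per sampled completion for that prompt.
    # Zero-variance filtering: drop prompts whose completions all got
    # the same reward -- they carry no learning signal.
    kept = [g for g in reward_groups if max(g) != min(g)]
    if not kept:
        return []
    # Subtract each prompt's mean reward as a baseline...
    centered = [[r - sum(g) / len(g) for r in g] for g in kept]
    # ...then scale by a single std computed over the whole batch
    # (batch-level advantage normalization), not per prompt.
    flat = [a for g in centered for a in g]
    mean = sum(flat) / len(flat)
    std = (sum((a - mean) ** 2 for a in flat) / len(flat)) ** 0.5 or 1.0
    return [[a / std for a in g] for g in centered]
```

Normalizing at the batch level rather than per prompt avoids amplifying noise from prompts where the reward spread happens to be tiny.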


Results and generalization

Two key demonstrations:

1. Predictability at scale: for an 8B dense model and a Llama-4 17B×16 MoE (“Scout”), extended training closely followed the sigmoid extrapolations derived from smaller-compute segments.
2. Downstream transfer: pass-rate improvements on an i.i.d. validation set track downstream evaluation (e.g., AIME-24), suggesting the compute-performance curve isn’t a dataset artifact.

The research also compares fitted curves for prevalent recipes (e.g., DeepSeek (GRPO), Qwen-2.5 (DAPO), Magistral, MiniMax-M1) and reports higher asymptotic performance and better compute efficiency for ScaleRL in their setup.


Which knobs move the ceiling vs the efficiency?

The framework lets you classify design choices:

- Ceiling movers: model scale (e.g., MoE) and longer generation lengths raise the asymptotic performance ceiling.
- Efficiency movers: loss aggregation, advantage normalization, data curriculum, and the off-policy pipeline mainly change how fast the run approaches that ceiling.

Operationally, the research team advises fitting curves early and prioritizing interventions that raise the ceiling, then tuning the efficiency knobs to reach it faster at fixed compute.
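The split maps cleanly onto the sigmoid's parameters: ceiling movers change the asymptote, efficiency movers change the exponent. A small numeric illustration, with all parameter values assumed for demonstration:

```python
def pass_rate(compute, ceiling, efficiency, midpoint=4000.0):
    # Illustrative sigmoid in compute (assumed parameter values).
    return ceiling / (1.0 + (midpoint / compute) ** efficiency)

BUDGET = 16_000  # GPU-hours
base           = pass_rate(BUDGET, 0.50, 1.0)
raised_ceiling = pass_rate(BUDGET, 0.60, 1.0)  # e.g. bigger model, longer generations
raised_eff     = pass_rate(BUDGET, 0.50, 1.6)  # e.g. better loss aggregation

# At very large compute, only the ceiling change still matters:
LARGE = 10_000_000
gap_ceiling = pass_rate(LARGE, 0.60, 1.0) - pass_rate(LARGE, 0.50, 1.0)
gap_eff     = pass_rate(LARGE, 0.50, 1.6) - pass_rate(LARGE, 0.50, 1.0)
```

Both interventions help at a fixed 16k GPU-hour budget, but only the ceiling change survives in the limit, which is why the advice is to fix the asymptote first.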

Key Takeaways

- RL post-training progress is well described by sigmoidal compute-performance curves (pass rate vs. log compute), which extrapolate more reliably than power laws.
- Curves fitted after roughly 1-2k GPU-hours can forecast whether scaling to 10k-100k GPU-hours is worthwhile.
- ScaleRL combines asynchronous pipeline RL, the CISPO loss, FP32 logits, prompt-level loss averaging, batch-level advantage normalization, forced length interruptions, zero-variance filtering, and no-positive resampling.
- Design choices split into those that raise the asymptotic ceiling (model scale, longer generations) and those that improve compute efficiency (loss aggregation, normalization, curriculum, off-policy pipeline).
- The framework was validated over more than 400,000 GPU-hours, including a single run extended to 100,000 GPU-hours.

Editorial Comments

This work turns RL post-training from trial-and-error into forecastable engineering. It fits sigmoidal compute-performance curves (pass-rate vs. log compute) to predict returns and decide when to stop or scale. It also provides a concrete recipe, ScaleRL, that uses PipelineRL-style asynchronous generation/training, the CISPO loss, and FP32 logits for stability. The study reports >400,000 GPU-hours of experiments and a single-run extension to 100,000 GPU-hours. Results support a clean split: some choices raise the asymptote; others mainly improve compute efficiency. That separation helps teams prioritize ceiling-moving changes before tuning throughput knobs.



The post Sigmoidal Scaling Curves Make Reinforcement Learning RL Post-Training Predictable for LLMs appeared first on MarkTechPost.

