cs.AI updates on arXiv.org, October 3, 12:16
Do High SFT Scores Predict RL Training Outcomes? A Study of Their Relationship

This work questions current post-training practice for reasoning-oriented large language models (LLMs), challenging whether high supervised fine-tuning (SFT) scores necessarily translate into performance gains after reinforcement learning (RL). The study finds that high SFT scores can be biased toward simpler or more homogeneous data and are not reliable predictors of RL gains or of scaled-up post-training effectiveness. In some cases, models with improved SFT performance end up worse after RL than the base model trained with RL alone. To address this, the authors propose alternative metrics based on generalization loss and Pass@large k performance, which predict RL outcomes more accurately. They trained a large number of models (up to 12B parameters) with SFT and RLVR via GRPO, ran extensive evaluations on multiple math benchmarks, and spent more than 1M GPU hours. The results show that predictions based on generalization loss and Pass@large k are substantially more precise than predictions made directly from pre-RL performance, improving the R² coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x), which offers useful guidance for practice. For example, in most experiments, SFT on unique examples for one epoch underperforms SFT on half as many examples for two epochs, whether measured after SFT or after SFT-then-RL. Under the same SFT budget, training only on short examples may yield better SFT performance, but its post-RL results are often worse than those of models trained on examples of varying lengths.
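As context for the RLVR-via-GRPO setup mentioned above, the following is a minimal sketch of the group-relative advantage computation that characterizes GRPO. The reward values and group size are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages as used by GRPO: for a group of G
    completions sampled from the same prompt, each completion's advantage
    is its reward standardized against the group's mean and standard
    deviation, so no learned value/critic network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical verifiable rewards for G = 8 completions of one math prompt
# (1.0 if the final answer verifies, 0.0 otherwise).
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))
```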

📊 **SFT scores are weakly linked to RL outcomes**: The study finds that high scores obtained by large language models (LLMs) during supervised fine-tuning (SFT) do not reliably predict performance gains in the subsequent reinforcement learning (RL) stage. High SFT scores may stem from a bias toward simpler or more homogeneous data rather than genuinely stronger reasoning, and can even precede worse RL training results.

🔍 **Better predictive proxies identified**: To overcome the limits of SFT scores as predictors of RL outcomes, the study proposes "generalization loss" and "Pass@large k performance" as more reliable proxy metrics. These metrics predict post-RL performance more accurately, markedly improving prediction precision and giving practical guidance for model selection and training strategy (a minimal Pass@k sketch follows this list).

⏳ **Training strategy affects RL outcomes**: The experiments show that, under the same SFT budget, different training strategies produce markedly different results. For example, SFT on unique examples for a single epoch typically underperforms SFT on half as many examples for two epochs. Likewise, training only on short examples may look good at the SFT stage, yet can perform worse after RL than training on examples of varying lengths.
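The Pass@large k proxy can be estimated without enumerating all k-subsets by using the standard unbiased Pass@k estimator (Chen et al., 2021). The sketch below assumes n generations per problem (e.g., up to the 256 repetitions mentioned in the abstract) with c of them verified correct; the concrete numbers in the usage example are illustrative, not taken from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of Pass@k: the probability that at least one of
    k samples drawn (without replacement) from n generations, c of which
    are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect generations: any k-subset contains a correct one
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 256 generations per problem, hypothetical per-problem
# correct counts; Pass@64 averaged over problems.
correct_counts = [3, 0, 41, 7]
n, k = 256, 64
score = np.mean([pass_at_k(n, c, k) for c in correct_counts])
print(f"Pass@{k} = {score:.3f}")
```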

arXiv:2510.01624v1 Announce Type: cross Abstract: In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as "RL" below). In this work, we challenge whether high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance could lead to substantially worse outcomes compared to RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance as strong proxies for the RL outcome. We trained hundreds of models up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending >1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL performance, prediction based on generalization loss and Pass@large k achieves substantially higher precision, improving the R² coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find SFT training on unique examples for one epoch underperforms training on half the examples for two epochs, either after SFT or SFT-then-RL. With the same SFT budget, training only on short examples may lead to better SFT performance, though it often leads to worse outcomes after RL compared to training on examples with varying lengths. The evaluation tool will be open-sourced.
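To make the reported R² and Spearman's rank correlation figures concrete, the sketch below shows one way to compute both statistics for a proxy metric (e.g., Pass@large k or negative held-out generalization loss measured on SFT checkpoints) against post-RL benchmark scores. The arrays are hypothetical placeholders, not the paper's data; SciPy's linregress and spearmanr do the actual computation.

```python
import numpy as np
from scipy import stats

# Hypothetical values: a proxy metric measured on several SFT checkpoints
# and the benchmark score each checkpoint reaches after RL.
proxy = np.array([0.42, 0.55, 0.61, 0.48, 0.70, 0.66])
post_rl = np.array([0.51, 0.60, 0.68, 0.54, 0.75, 0.71])

# R^2 of a linear fit: how much of the variance in post-RL scores the proxy explains.
fit = stats.linregress(proxy, post_rl)
r_squared = fit.rvalue ** 2

# Spearman's rank correlation: how well the proxy preserves the ranking of
# checkpoints by post-RL performance.
rho, _ = stats.spearmanr(proxy, post_rl)

print(f"R^2 = {r_squared:.3f}, Spearman rho = {rho:.3f}")
```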

Related tags

Large Language Models (LLMs) · Supervised Fine-Tuning (SFT) · Reinforcement Learning (RL) · Reasoning Capability · Model Training · Generalization Loss · Pass@large k