MarkTechPost@AI · September 8
MIT Study: Reinforcement Learning Reduces Catastrophic Forgetting Better Than Supervised Fine-Tuning

A new MIT study examines how large foundation models can learn new tasks after deployment without forgetting their prior capabilities. The study finds that online reinforcement learning (RL) preserves a model's existing knowledge markedly better than conventional supervised fine-tuning (SFT). By introducing an empirical forgetting law based on KL divergence, the researchers quantify the degree of forgetting and show that RL's on-policy updates bias it toward solutions close to the base model's distribution, substantially reducing catastrophic forgetting. The work both explains RL's robustness and points toward AI agents that can keep learning without losing old skills.

🧠 **The challenge of catastrophic forgetting**: Foundation models are largely static once deployed; fine-tuning them on new tasks often erases previously learned capabilities, blocking long-term improvement and continual learning for AI agents.

💡 **RL's knowledge-retention advantage**: The study compares reinforcement learning (RL) and supervised fine-tuning (SFT) on new tasks. Both reach high new-task accuracy, but SFT tends to overwrite prior knowledge while RL preserves it. The key lies in how each method shifts the model's output distribution relative to the base policy.

📏 **A KL-divergence law for quantifying forgetting**: The team proposes an empirical forgetting law: the degree of forgetting is proportional to the KL divergence KL(π₀ ∥ π) between the base model π₀ and the fine-tuned model π. This makes forgetting measurable without access to the original task data.

🚀 **Validation across domains**: In experiments on large language models (math reasoning, science question answering, and tool use) and on robotic control tasks, RL improved new-task performance while keeping prior-task accuracy intact, whereas SFT consistently sacrificed part of its old knowledge.

⚖️ **RL's "Razor"**: RL's on-policy updates sample from the model's own outputs and incrementally reweight them by reward, constraining learning to distributions close to the base model. Theoretical analysis shows that policy gradients converge to the KL-minimal optimal solution, formalizing RL's advantage in reducing forgetting.

What is catastrophic forgetting in foundation models?

Foundation models excel in diverse domains but are largely static once deployed. Fine-tuning on new tasks often introduces catastrophic forgetting—the loss of previously learned capabilities. This limitation poses a barrier for building long-lived, continually improving AI agents.

Why does online reinforcement learning forget less than supervised fine-tuning?

A new MIT study compares reinforcement learning (RL) and supervised fine-tuning (SFT). Both can achieve high performance on new tasks, but SFT tends to overwrite prior abilities. RL, by contrast, preserves them. The key lies in how each method shifts the model’s output distribution relative to the base policy.

https://arxiv.org/pdf/2509.04259

How can forgetting be measured?

The research team proposes an empirical forgetting law:

Forgetting ∝ KL(π₀ ∥ π)

where π₀ is the base model and π is the fine-tuned model. The forward KL divergence, measured on the new task, strongly predicts the extent of forgetting. This makes forgetting quantifiable without needing data from prior tasks.
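
For concreteness, here is a minimal Python sketch (not from the paper) of how such a forward KL could be estimated for causal language models: sample completions on new-task prompts from the base policy π₀ and average log π₀(y|x) − log π(y|x). The Hugging Face model names, the fine-tuned checkpoint path, and the sampling budget are illustrative assumptions.

```python
# Sketch only: Monte-Carlo estimate of the forward KL, KL(pi_0 || pi),
# on new-task prompts, using two Hugging Face causal LMs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-3B-Instruct"          # base policy pi_0
TUNED = "path/to/fine-tuned-checkpoint"    # fine-tuned policy pi (hypothetical path)

tok = AutoTokenizer.from_pretrained(BASE)
pi0 = AutoModelForCausalLM.from_pretrained(BASE).eval()
pi = AutoModelForCausalLM.from_pretrained(TUNED).eval()

def sequence_logprob(model, input_ids, prompt_len):
    """Sum of log-probs the model assigns to the completion tokens."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1]          # position t predicts token t+1
    logp = torch.log_softmax(logits, dim=-1)
    targets = input_ids[:, 1:]
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logp[:, prompt_len - 1:].sum(dim=-1)     # completion tokens only

@torch.no_grad()
def forward_kl_estimate(prompts, n_samples=4, max_new_tokens=128):
    """E_{y ~ pi_0}[ log pi_0(y|x) - log pi(y|x) ], averaged over prompts."""
    total, count = 0.0, 0
    for prompt in prompts:
        enc = tok(prompt, return_tensors="pt")
        prompt_len = enc.input_ids.shape[1]
        for _ in range(n_samples):
            out = pi0.generate(**enc, do_sample=True, max_new_tokens=max_new_tokens)
            total += (sequence_logprob(pi0, out, prompt_len)
                      - sequence_logprob(pi, out, prompt_len)).item()
            count += 1
    return total / count   # larger value -> more forgetting is predicted

# Example with a hypothetical new-task prompt:
# print(forward_kl_estimate(["Solve: 12 * (7 + 5) = ?"]))
```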

What do experiments on large language models reveal?

Using Qwen 2.5 3B-Instruct as the base model, fine-tuning was performed on new tasks spanning math reasoning, science question answering, and tool use.

Performance was evaluated on prior benchmarks such as HellaSwag, MMLU, TruthfulQA, and HumanEval. Results showed that RL improved new-task accuracy while keeping prior-task accuracy stable, whereas SFT consistently sacrificed prior knowledge.

How does RL compare to SFT in robotics tasks?

In robotic control experiments with OpenVLA-7B fine-tuned in SimplerEnv pick-and-place scenarios, RL adaptation maintained general manipulation skills across tasks. SFT, while successful on the new task, degraded prior manipulation abilities—again illustrating RL’s conservatism in preserving knowledge.

What insights come from the ParityMNIST study?

To isolate mechanisms, the research team introduced a toy problem, ParityMNIST. Here, RL and SFT both reached high new-task accuracy, but SFT induced sharper declines on the FashionMNIST auxiliary benchmark. Crucially, plotting forgetting against KL divergence revealed a single predictive curve, validating KL as the governing factor.
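
The exact ParityMNIST construction is not spelled out here; one plausible reading, shown in the sketch below purely for illustration, is relabeling MNIST digits by parity (even vs. odd) as the new task, with FashionMNIST accuracy serving as the prior-knowledge probe. Treat every detail of this setup as an assumption rather than the paper's recipe.

```python
# Illustrative sketch only: a plausible "ParityMNIST" task that relabels
# MNIST digits by parity. The study's actual construction may differ.
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())

class ParityMNIST(torch.utils.data.Dataset):
    """Wraps MNIST, replacing the 10-way digit label with a binary parity label."""
    def __init__(self, base):
        self.base = base
    def __len__(self):
        return len(self.base)
    def __getitem__(self, idx):
        image, digit = self.base[idx]
        return image, digit % 2   # 0 = even, 1 = odd

parity_loader = torch.utils.data.DataLoader(ParityMNIST(mnist),
                                            batch_size=128, shuffle=True)
# A model with prior FashionMNIST ability (the auxiliary benchmark) would then
# be adapted on this loader with SFT or an RL-style objective, and forgetting
# measured as the drop in FashionMNIST accuracy.
```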

Why do on-policy updates matter?

On-policy RL samples from the model’s own outputs, incrementally reweighting them by reward. This process constrains learning to distributions already close to the base model. SFT, in contrast, optimizes against fixed labels that may be arbitrarily distant. Theoretical analysis shows policy gradients converge to KL-minimal optimal solutions, formalizing RL’s advantage.
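
The mechanism can be made concrete with a toy contrast, a sketch rather than the study's training code: an SFT step pulls the policy toward a fixed label however unlikely that label is under the current policy, while a REINFORCE-style on-policy step only reweights samples the policy itself already produces.

```python
# Toy sketch contrasting the two update rules on a categorical policy.
import torch
import torch.nn.functional as F

def sft_step(policy, optimizer, x, y_label):
    """Supervised fine-tuning: cross-entropy toward a fixed label,
    regardless of how improbable that label is under the current policy."""
    loss = F.cross_entropy(policy(x), y_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def on_policy_rl_step(policy, optimizer, x, reward_fn):
    """REINFORCE-style update: sample from the policy itself and reweight
    the sampled actions by reward, so probability mass only shifts among
    outputs the current policy already generates."""
    dist = torch.distributions.Categorical(logits=policy(x))
    actions = dist.sample()                 # on-policy samples
    rewards = reward_fn(x, actions)         # e.g. 1.0 if correct else 0.0
    loss = -(dist.log_prob(actions) * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```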

Are other explanations sufficient?

The research team tested alternatives: weight-space changes, hidden representation drift, sparsity of updates, and alternative distributional metrics (reverse KL, total variation, L2 distance). None matched the predictive strength of forward KL divergence, reinforcing that distributional closeness is the critical factor.
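
As a small illustration (not the paper's evaluation code), the candidate distributional metrics can be computed side by side on a pair of next-token distributions; per the study, only the forward KL reliably tracked forgetting. The toy tensors in the usage comment are assumptions.

```python
# Sketch: candidate distributional distances between a base distribution p0
# and a fine-tuned distribution p over the same support.
import torch

def distribution_metrics(p0: torch.Tensor, p: torch.Tensor, eps: float = 1e-12):
    p0, p = p0.clamp_min(eps), p.clamp_min(eps)
    return {
        "forward_kl": (p0 * (p0.log() - p.log())).sum(-1),   # KL(p0 || p)
        "reverse_kl": (p * (p.log() - p0.log())).sum(-1),    # KL(p || p0)
        "total_variation": 0.5 * (p0 - p).abs().sum(-1),
        "l2": (p0 - p).pow(2).sum(-1).sqrt(),
    }

# Example with toy next-token distributions:
# print(distribution_metrics(torch.tensor([0.7, 0.2, 0.1]),
#                            torch.tensor([0.5, 0.3, 0.2])))
```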

What are the broader implications?

The MIT research reframes catastrophic forgetting as a distributional problem governed by forward KL divergence. Reinforcement learning forgets less because its on-policy updates naturally bias toward KL-minimal solutions. This principle—RL’s Razor—provides both an explanation for RL’s robustness and a roadmap for developing post-training methods that support lifelong learning in foundation models.



