cs.AI updates on arXiv.org, October 27, 14:24
RePULSe: A Training Method for Improving Reward Learning in Language Models

This paper proposes RePULSe, a training method that adds an extra loss term to reduce the probability of low-reward outputs, improving reward learning in language models. Experiments show it achieves a better tradeoff between average reward and the probability of undesired outputs than standard approaches.

arXiv:2510.21184v1 Announce Type: cross Abstract: Reinforcement learning (RL) has become a predominant technique to align language models (LMs) with human preferences or promote outputs which are deemed to be desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling low-reward outputs, and then reduces those outputs' probability. We run experiments demonstrating that RePULSe produces a better tradeoff of expected reward versus the probability of undesired outputs and is more adversarially robust, compared to standard RL alignment approaches and alternatives.
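The abstract does not include an implementation, but its core idea, augmenting the standard RL objective with a proposal-guided term that lowers the probability of low-reward outputs, can be illustrated on a toy problem. The PyTorch sketch below is a loose illustration under stated assumptions, not the paper's actual method: the hard-coded proposal, the REINFORCE estimator, the 0.5 mixing weight, and the reward table are all inventions for the example.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for a language model: a categorical policy over 10 possible
# outputs with a fixed reward per output. Outputs 8 and 9 are rare but harmful.
logits = torch.zeros(10, requires_grad=True)
rewards = torch.tensor([1.0] * 8 + [-5.0, -5.0])

# Stand-in for the learned proposal: RePULSe trains it to surface low-reward
# outputs; here it is simply hard-coded to concentrate on them (assumption).
proposal_probs = torch.tensor([0.005] * 8 + [0.48, 0.48])

optimizer = torch.optim.Adam([logits], lr=0.1)

for step in range(300):
    probs = torch.softmax(logits, dim=0)

    # Standard RL term: REINFORCE estimate of the gradient of expected reward.
    samples = torch.multinomial(probs.detach(), num_samples=32, replacement=True)
    rl_loss = -(torch.log(probs[samples]) * rewards[samples]).mean()

    # RePULSe-style extra term: draw candidates from the proposal and push
    # down the policy's log-probability of the low-reward ones it finds.
    adv = torch.multinomial(proposal_probs, num_samples=8, replacement=True)
    bad = rewards[adv] < 0.0
    push_down = torch.log(probs[adv])[bad].mean() if bad.any() else torch.tensor(0.0)

    loss = rl_loss + 0.5 * push_down  # 0.5 is an arbitrary mixing weight
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Probability mass on the harmful outputs 8-9 should now be near zero.
print(torch.softmax(logits, dim=0))
```

In RePULSe proper the proposal is itself learned alongside the policy; it is frozen here purely to keep the sketch short.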


Related tags

Reinforcement Learning · Language Models · Reward Learning · RePULSe · Training Method