cs.AI updates on arXiv.org, August 21
PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning

This article introduces PersRM-R1, a reasoning-based reward-modeling framework designed to identify and represent personal factors from only a few personal exemplars. It combines synthetic data generation with a two-stage training pipeline to improve the accuracy and generalization of personalized LLMs.

arXiv:2508.14076v1 Announce Type: cross Abstract: Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. Thus, we introduce PersRM-R1, the first reasoning-based reward modeling framework specifically designed to identify and represent personal factors from only one or a few personal exemplars. To address challenges including limited data availability and the requirement for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.
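
The abstract names the pipeline shape (synthetic preference data, then supervised fine-tuning, then reinforcement fine-tuning) but gives no implementation details. Below is a minimal, self-contained PyTorch sketch of that two-stage shape under stated assumptions: a toy feature scorer stands in for the LLM-based reward model, synthetic preference pairs encode a latent "personal factor" direction, stage 1 uses a Bradley-Terry pairwise loss, and stage 2 uses a REINFORCE-style update on sampled preference judgments. All names (`ToyRewardModel`, `synthetic_pairs`) and the reward scheme are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of a two-stage RM pipeline: supervised fine-tuning
# (SFT) on synthetic preference pairs, then reinforcement fine-tuning (RFT).
# The toy scorer, data shapes, and reward scheme are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Scores a (prompt, response) feature vector; stands in for an LLM head."""
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.score(x).squeeze(-1)

def synthetic_pairs(n=256, dim=16, seed=0):
    """Synthetic preference data: chosen responses carry a latent
    'personal factor' direction; rejected ones do not."""
    g = torch.Generator().manual_seed(seed)
    persona = torch.randn(dim, generator=g)
    chosen = torch.randn(n, dim, generator=g) + 0.5 * persona
    rejected = torch.randn(n, dim, generator=g)
    return chosen, rejected

model = ToyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = synthetic_pairs()

# Stage 1: supervised fine-tuning with a Bradley-Terry pairwise loss,
# pushing the score of the chosen response above the rejected one.
for _ in range(200):
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: reinforcement fine-tuning. Sample a stochastic preference
# judgment per pair and reinforce correct judgments (REINFORCE-style).
for _ in range(200):
    p_correct = torch.sigmoid(model(chosen) - model(rejected))
    action = torch.bernoulli(p_correct).detach()  # 1 = judged chosen > rejected
    reward = 2.0 * action - 1.0                   # +1 if judgment is correct
    log_prob = action * torch.log(p_correct + 1e-8) \
        + (1 - action) * torch.log(1 - p_correct + 1e-8)
    loss = -(reward * log_prob).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

acc = (model(chosen) > model(rejected)).float().mean()
print(f"pairwise accuracy after two-stage training: {acc:.2f}")
```

In the actual system the scorer would be an LLM-based reward model and the reinforcement stage would presumably reward correct preference judgments produced through explicit reasoning; the sketch only illustrates the SFT-then-RFT structure the abstract describes.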


Related tags

Reward models, personalized LLMs, reasoning frameworks, synthetic data, training methods