Process Reward Models in LLMs: Applications and Challenges

cs.AI updates on arXiv.org, October 10, 12:15

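This paper presents a systematic survey of process reward models (PRMs), covering how to generate process data, how to build PRMs, and how to apply PRMs to test-time scaling and reinforcement learning. It summarizes applications of PRMs across math, code, text, multimodal reasoning, robotics, and agents, and reviews emerging benchmarks. The goal is to clarify the design space, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.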

arXiv:2510.08049v1 Announce Type: cross Abstract: Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
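
To make the ORM/PRM contrast in the abstract concrete, here is a minimal Python sketch of best-of-N selection at test time. It is an illustration under stated assumptions, not the survey's method: the `Candidate` class, the hard-coded scores, and the choice of `min` or `prod` to aggregate step scores are all hypothetical stand-ins for what a trained reward model would produce.

```python
from dataclasses import dataclass
from math import prod
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """One sampled solution (hypothetical structure for illustration)."""
    answer: str                # final answer produced by the LLM
    step_scores: List[float]   # assumed per-step PRM scores in [0, 1]
    outcome_score: float       # assumed ORM score for the final answer in [0, 1]


def orm_select(cands: Sequence[Candidate]) -> Candidate:
    """Outcome-level selection: rank candidates by the final-answer score only."""
    return max(cands, key=lambda c: c.outcome_score)


def prm_select(
    cands: Sequence[Candidate],
    aggregate: Callable[[List[float]], float] = min,
) -> Candidate:
    """Process-level selection: aggregate step scores, then rank.

    `min` penalizes the weakest step; `prod` penalizes every weak step.
    Both aggregators are assumptions made for this sketch.
    """
    return max(cands, key=lambda c: aggregate(c.step_scores))


# Toy data: candidate "A" has a confident-looking final answer but one
# poorly rated intermediate step; candidate "B" is solid at every step.
candidates = [
    Candidate(answer="A", step_scores=[0.9, 0.2, 0.8], outcome_score=0.9),
    Candidate(answer="B", step_scores=[0.9, 0.85, 0.9], outcome_score=0.8),
]

print("ORM pick:", orm_select(candidates).answer)
print("PRM pick (min):", prm_select(candidates).answer)
print("PRM pick (prod):", prm_select(candidates, aggregate=prod).answer)
```

In this toy setup the outcome-only ranking prefers the candidate with the flawed intermediate step, while either step-level aggregation prefers the candidate whose steps are all rated well. With a real PRM, the step scores would come from a model call rather than fixed numbers.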
