A Hybrid Framework for More Efficient Preference Learning

cs.AI updates on arXiv.org, November 7, 13:49

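This paper proposes a hybrid framework that combines reinforcement learning from human feedback (RLHF) with preferential Bayesian optimization (PBO). It integrates an acquisition-driven module into the RLHF pipeline to make preference learning more efficient, and it validates the approach's effectiveness and sample efficiency in two domains: high-dimensional preference optimization and LLM fine-tuning.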

arXiv:2511.04286v1 Announce Type: cross

Abstract: Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet, collecting such preference data is often costly and time-consuming, motivating the need for more efficient learning paradigms. Two established approaches offer complementary advantages: RLHF scales effectively to high-dimensional tasks such as LLM fine-tuning, while PBO achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF's scalability with PBO's query efficiency by integrating an acquisition-driven module into the RLHF pipeline, thereby enabling active and sample-efficient preference gathering. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.
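To make the idea of an acquisition-driven module more concrete, here is a minimal sketch of active preference querying around a reward model. It is not the paper's method: the abstract does not specify the acquisition function or the model, so this sketch assumes a small bootstrap ensemble of linear Bradley-Terry reward models and uses ensemble disagreement as a stand-in acquisition score. All names (`fit_member`, `acquisition`, `select_query`, `run_active_loop`) and the simulated annotator are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def pref_prob(w, a, b):
    # Bradley-Terry probability that item a is preferred over item b
    # under an assumed linear reward r(x) = w . x.
    return sigmoid(dot(w, a) - dot(w, b))

def fit_member(data, dim, rng, steps=200, lr=0.5):
    # Fit one ensemble member by gradient ascent on the Bradley-Terry
    # log-likelihood over a bootstrap resample of the labeled pairs.
    w = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    if not data:
        return w
    sample = [rng.choice(data) for _ in data]
    for _ in range(steps):
        grad = [0.0] * dim
        for a, b, y in sample:  # y = 1 if a was preferred, else 0
            err = y - pref_prob(w, a, b)
            for i in range(dim):
                grad[i] += err * (a[i] - b[i])
        w = [wi + lr * g / len(sample) for wi, g in zip(w, grad)]
    return w

def acquisition(ensemble, a, b):
    # Assumed acquisition score: variance of the predicted preference
    # probability across ensemble members (disagreement).
    ps = [pref_prob(w, a, b) for w in ensemble]
    mean = sum(ps) / len(ps)
    return sum((p - mean) ** 2 for p in ps) / len(ps)

def select_query(ensemble, candidates):
    # Pick the candidate pair the ensemble disagrees about most.
    return max(candidates, key=lambda pair: acquisition(ensemble, *pair))

def run_active_loop(true_w, n_queries=15, dim=4, n_members=5, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n_queries):
        ensemble = [fit_member(data, dim, rng) for _ in range(n_members)]
        candidates = [
            ([rng.uniform(-1, 1) for _ in range(dim)],
             [rng.uniform(-1, 1) for _ in range(dim)])
            for _ in range(20)
        ]
        a, b = select_query(ensemble, candidates)
        # Simulated annotator (hypothetical): a noiseless label
        # derived from a hidden reward vector.
        y = 1 if dot(true_w, a) > dot(true_w, b) else 0
        data.append((a, b, y))
    return data, [fit_member(data, dim, rng) for _ in range(n_members)]
```

Calling `run_active_loop([1.0, -0.5, 0.25, 0.0])` returns the queried pairs with their labels and a refit ensemble. In a real RLHF pipeline the linear model would be replaced by a learned reward model and the candidates by pairs of model responses; the sketch only shows where a query-selection step could sit between reward-model fitting and annotation.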


