cs.AI updates on arXiv.org, October 20, 12:14
Learning Question-Answer Generation from Demonstrations

This paper studies the problem of generating an answer from demonstrations when a question may have multiple correct answers, and proposes a novel algorithm that, by assuming only that the reward model lies in a low-cardinality class, learns with sample complexity logarithmic in the size of that class.

arXiv:2510.15464v1 Announce Type: cross

Abstract: We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine-Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead devise an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from correct demonstrations.
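To make the setting concrete, the following is a minimal formalization consistent with the abstract; the notation is illustrative and is not taken from the paper. Prompts $x$ are drawn from an unknown distribution $D$ over contexts $\mathcal{X}$, answers $y$ live in a space $\mathcal{Y}$, and correctness is encoded by an unobserved binary reward $r^* : \mathcal{X} \times \mathcal{Y} \to \{0,1\}$ that is assumed to lie in a known finite class $\mathcal{R}$. The demonstrator is some policy $\pi^*$ that is optimal for $r^*$, i.e. it only produces correct answers:

$$\operatorname{supp} \pi^*(\cdot \mid x) \subseteq \{\, y : r^*(x, y) = 1 \,\} \quad \text{for all } x.$$

Given demonstrations $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \sim D$ and $y_i \sim \pi^*(\cdot \mid x_i)$, the learner must output a policy $\hat{\pi}$ satisfying

$$\mathbb{E}_{x \sim D,\; y \sim \hat{\pi}(\cdot \mid x)}\big[\, r^*(x, y) \,\big] \;\ge\; 1 - \varepsilon.$$

Under this reading, the abstract's claim is that such a $\hat{\pi}$ can be found from $n = O(\log \lvert \mathcal{R} \rvert)$ demonstrations (up to the dependence on $\varepsilon$), whereas maximum likelihood (log-loss minimization, as in SFT) can fail unless one additionally assumes the demonstrator policy $\pi^*$ itself belongs to a low-complexity class.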

Related tags

Question-answer generation, learning from demonstrations, multiple answers