Trending
Articles related to "reward overoptimization"
Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment
cs.AI updates on arXiv.org 2025-10-08T04:12:10.000000Z
Similar in technical approach to OpenAI o1, the TDPO-R algorithm effectively mitigates reward overoptimization
机器之心 (Synced) 2024-10-26T10:21:57.000000Z