Trending
Articles related to "verifiable rewards"
Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration
cs.AI updates on arXiv.org 2025-11-05T05:26:58.000000Z
SimKO: Simple Pass@K Policy Optimization
cs.AI updates on arXiv.org 2025-10-17T04:10:24.000000Z
Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
cs.AI updates on arXiv.org 2025-10-14T04:19:29.000000Z
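Rejecting "entropy collapse" and "entropy explosion": this research teaches large models "precise exploration," and reasoning scores soar
36氪 (36Kr) - Tech Channel 2025-10-13T11:41:45.000000Z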
On the optimization dynamics of RLVR: Gradient gap and step size thresholds
cs.AI updates on arXiv.org 2025-10-10T04:19:30.000000Z
The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View
cs.AI updates on arXiv.org 2025-10-07T04:16:13.000000Z
Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs
cs.AI updates on arXiv.org 2025-10-07T04:07:57.000000Z
The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
cs.AI updates on arXiv.org 2025-10-03T04:08:50.000000Z
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
cs.AI updates on arXiv.org 2025-10-02T04:18:38.000000Z
Quantile Advantage Estimation for Entropy-Safe Reasoning
cs.AI updates on arXiv.org 2025-09-29T04:16:38.000000Z
Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning
cs.AI updates on arXiv.org 2025-09-23T05:28:10.000000Z
Tackling AI overthinking: new Meituan research uses "verifiable" process rewards to activate efficient reasoning in LRMs
智源社区 (BAAI Community) 2025-09-12T13:23:03.000000Z
DCPO: Dynamic Clipping Policy Optimization
cs.AI updates on arXiv.org 2025-09-03T04:17:49.000000Z
Writing-Zero: Breaking Through the Ceiling of AI Writing, Making AI Writing More "Human"
夕小瑶科技说 2025-08-11T08:59:57.000000Z
Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning
cs.AI updates on arXiv.org 2025-08-05T11:29:05.000000Z
CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment
cs.AI updates on arXiv.org 2025-08-05T11:10:02.000000Z