Verifiable Rewards_Fishai

Trending

Articles related to "Verifiable Rewards"

Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration

cs.AI updates on arXiv.org 2025-11-05T05:26:58.000000Z

SimKO: Simple Pass@K Policy Optimization

cs.AI updates on arXiv.org 2025-10-17T04:10:24.000000Z

Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

cs.AI updates on arXiv.org 2025-10-14T04:19:29.000000Z

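Rejecting "entropy collapse" and "entropy explosion": this research teaches large models "precise exploration," sending reasoning scores soaring

36Kr - Technology Channel 2025-10-13T11:41:45.000000Z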

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

cs.AI updates on arXiv.org 2025-10-10T04:19:30.000000Z

The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View

cs.AI updates on arXiv.org 2025-10-07T04:16:13.000000Z

Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs

cs.AI updates on arXiv.org 2025-10-07T04:07:57.000000Z

The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models

cs.AI updates on arXiv.org 2025-10-03T04:08:50.000000Z

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

cs.AI updates on arXiv.org 2025-10-02T04:18:38.000000Z

Quantile Advantage Estimation for Entropy-Safe Reasoning

cs.AI updates on arXiv.org 2025-09-29T04:16:38.000000Z

Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning

cs.AI updates on arXiv.org 2025-09-23T05:28:10.000000Z

Tackling AI overthinking! New Meituan research uses "verifiable" process rewards to activate efficient reasoning in LRMs

BAAI Community (智源社区) 2025-09-12T13:23:03.000000Z

DCPO: Dynamic Clipping Policy Optimization

cs.AI updates on arXiv.org 2025-09-03T04:17:49.000000Z

Writing-Zero: Breaking through the ceiling of AI writing and making AI writing more "human"

Xi Xiaoyao Tech Talk (夕小瑶科技说) 2025-08-11T08:59:57.000000Z

Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning

cs.AI updates on arXiv.org 2025-08-05T11:29:05.000000Z

CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment

cs.AI updates on arXiv.org 2025-08-05T11:10:02.000000Z

Copyright © 2019 FISHAI. All Rights Reserved