Articles related to "off-policy reinforcement learning"
RL without TD learning
The Berkeley Artificial Intelligence Research Blog 2025-11-07T07:20:30.000000Z
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
cs.AI updates on arXiv.org 2025-09-30T04:06:21.000000Z
Functional Critic Modeling for Provably Convergent Off-Policy Actor-Critic
cs.AI updates on arXiv.org 2025-09-30T04:03:46.000000Z