MarkTechPost@AI · October 9
RA3: Optimizing Mid-Training with Temporal Action Abstractions to Accelerate RL Post-Training for Code LLMs

A new study from Apple gives the first formal definition of what "mid-training" should do before reinforcement learning (RL) post-training, and proposes the RA3 (Reasoning as Action Abstractions) algorithm. RA3 uses an EM-like procedure to learn temporally consistent latent actions from expert traces, then fine-tunes on those bootstrapped traces. The study argues that mid-training should (1) prune to a compact near-optimal action subspace and (2) shorten the effective planning horizon, thereby improving RL convergence. Empirically, RA3 improves HumanEval/MBPP by roughly 8 and 4 points over the base/NTP models, and accelerates RLVR training on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

✨ **Formalizing the role of mid-training:** The study gives the first formal treatment of what "mid-training" should do before reinforcement learning (RL) post-training, decomposing its effect into two determinants: pruning efficiency and RL convergence. It argues that mid-training is most effective when the decision space is compact and the effective planning horizon is short, which favors temporal abstractions over primitive token-by-token actions.

💡 **The RA3 algorithm:** RA3 optimizes a sequential variational lower bound (a temporal ELBO) with an EM-like iterative procedure. The E-step uses RL to infer temporally consistent latent structures (action abstractions) that are aligned with the expert sequences; the M-step performs next-token prediction on the bootstrapped, latent-annotated traces, folding those abstractions into the model's policy.

🚀 **Code generation and RLVR gains:** On Python code tasks, RA3 improves average pass@k on HumanEval and MBPP by roughly 8 and 4 points over the base model and an NTP mid-training baseline, respectively. Moreover, when initialized from RA3, RLVR converges faster and reaches higher final performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

TL;DR: New research from Apple formalizes what “mid-training” should do before reinforcement learning (RL) post-training and introduces RA3 (Reasoning as Action Abstractions)—an EM-style procedure that learns temporally consistent latent actions from expert traces, then fine-tunes on those bootstrapped traces. It shows that mid-training should (1) prune to a compact near-optimal action subspace and (2) shorten the effective planning horizon, improving RL convergence. Empirically, RA3 improves HumanEval/MBPP by ~8/4 points over base/NTP and accelerates RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

What does the research present?

The research team presents the first formal treatment of how mid-training shapes post-training reinforcement learning (RL): they break down outcomes into (i) pruning efficiency—how well mid-training selects a compact near-optimal action subset that shapes the initial policy prior—and (ii) RL convergence—how quickly post-training improves within that restricted set. The analysis argues mid-training is most effective when the decision space is compact and the effective horizon is short, favoring temporal abstractions over primitive next-token actions.
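One way to picture this split (an illustrative formulation in our own notation, not the paper's): if mid-training restricts the policy to a pruned action subset, the end-to-end suboptimality after RL post-training separates into a pruning term and a convergence term.

```latex
% Illustrative only -- notation is ours, not the paper's.
% J(\pi): expected return; \pi^*: optimal policy over the full action space;
% \pi^*_{\mathcal{A}'}: best policy restricted to the pruned subset \mathcal{A}';
% \pi_T: policy after T steps of RL post-training within \mathcal{A}'.
J(\pi^*) - J(\pi_T)
  \;=\; \underbrace{J(\pi^*) - J(\pi^*_{\mathcal{A}'})}_{\text{pruning efficiency}}
  \;+\; \underbrace{J(\pi^*_{\mathcal{A}'}) - J(\pi_T)}_{\text{RL convergence}}
```

Under this reading, better abstractions shrink the first term without inflating it, while a shorter effective horizon makes the second term decay faster.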

https://arxiv.org/pdf/2509.25810

Algorithm: RA3 in one pass

RA3 derives a sequential variational lower bound (a temporal ELBO) and optimizes it with an EM-like loop: the E-step uses RL to infer temporally consistent latent structures (action abstractions) aligned with the expert sequences, and the M-step performs next-token prediction on the bootstrapped, latent-annotated traces, folding those abstractions into the model's policy.
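Below is a minimal, hypothetical Python sketch of that alternation. All names, signatures, and placeholder bodies are our illustrative assumptions, not the paper's implementation: `e_step_infer_abstractions` stands in for the RL-based inference of latent abstractions, and `m_step_finetune` stands in for next-token-prediction fine-tuning on the bootstrapped traces.

```python
# Hypothetical sketch of RA3's EM-style mid-training loop (illustrative only).
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LabeledTrace:
    tokens: Sequence[str]      # expert token sequence
    abstractions: List[int]    # inferred latent action label per token


def e_step_infer_abstractions(model, expert_traces: List[Sequence[str]]) -> List[LabeledTrace]:
    """E-step (placeholder): use RL to infer temporally consistent latent
    abstractions that explain each expert trace under the current model."""
    labeled = []
    for trace in expert_traces:
        # A real implementation would run RL / posterior inference here to
        # segment the trace into persistent latent actions.
        labeled.append(LabeledTrace(tokens=trace, abstractions=[0] * len(trace)))
    return labeled


def m_step_finetune(model, labeled_traces: List[LabeledTrace]):
    """M-step (placeholder): next-token-prediction fine-tuning on the
    bootstrapped, abstraction-annotated traces."""
    for _trace in labeled_traces:
        pass  # placeholder for the NTP fine-tuning update
    return model


def ra3_mid_training(model, expert_traces: List[Sequence[str]], num_rounds: int = 3):
    """Alternate E- and M-steps, approximately optimizing the temporal ELBO."""
    for _ in range(num_rounds):
        labeled = e_step_infer_abstractions(model, expert_traces)
        model = m_step_finetune(model, labeled)
    return model


if __name__ == "__main__":
    toy_traces = [["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]]
    ra3_mid_training(model=None, expert_traces=toy_traces)
```

The skeleton fixes only the control flow; in a real pipeline the E-step would score candidate segmentations against an RL objective under the current model, and the M-step would run standard NTP fine-tuning on the annotated traces.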

Results: code generation and RLVR

On Python code tasks, the research team reports that across multiple base models, RA3 improves average pass@k on HumanEval and MBPP by ~8 and ~4 points over the base model and an NTP mid-training baseline. In post-training, RLVR converges faster and to higher final performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces when initialized from RA3. These are mid- and post-training effects respectively; the evaluation scope is code generation.

Key Takeaways

- The research team formalizes mid-training via two determinants—pruning efficiency and impact on RL convergence—arguing effectiveness rises when the decision space is compact and the effective horizon is short.
- RA3 optimizes a sequential variational lower bound by iteratively discovering temporally consistent latent structures with RL and then fine-tuning on bootstrapped traces (EM-style).
- On code generation, RA3 reports ~+8 (HumanEval) and ~+4 (MBPP) average pass@k gains over base/NTP mid-training baselines across several model scales.
- Initializing post-training with RA3 accelerates RLVR convergence and improves asymptotic performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

Editorial Comments

RA3’s contribution is concrete and narrow: it formalizes mid-training around two determinants—pruning efficiency and RL convergence—and operationalizes them via a temporal ELBO optimized in an EM loop to learn persistent action abstractions before RLVR. The researchers report ~+8 (HumanEval) and ~+4 (MBPP) average pass@k gains over base/NTP and faster RLVR convergence on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

