MarkTechPost@AI October 15, 17:51
Meta AI's New Method: Reward-Free Training for Language Agents That Outperforms Imitation Learning

Meta Superintelligence Labs has proposed a new training method called "Early Experience" that lets language agents learn solely from feedback on their own actions, with no reward signals and no human demonstrations, and it outperforms imitation learning across eight benchmarks. The method has the agent branch from expert states, take its own actions, and collect the resulting future states, converting those consequences into supervision signals. Its two concrete strategies are Implicit World Modeling (IWM) and Self-Reflection (SR). Under an optimization budget comparable to imitation learning, the approach substantially improves agents' success rates and generalization, and it is especially efficient when expert data is limited.

🌟 **"Early Experience": an innovative reward-free training paradigm**: The method changes how language agents are trained, removing the dependence on reward functions and large volumes of human demonstrations. By exploring autonomously and learning from the consequences of its own actions, the agent shows strong learning ability across complex environments and provides a better initialization for a subsequent reinforcement learning stage.

🌐 **Implicit World Modeling (IWM) and Self-Reflection (SR) work in tandem**: IWM predicts the next state from a given state and action, helping the agent build a more accurate model of environment dynamics and reducing policy drift. SR contrasts the actual outcomes of expert and alternative actions, asks the model to explain why the expert action is better, and extracts supervision from that explanation to further refine the policy.

🚀 **Significant performance gains and efficiency improvements**: Across eight language-agent environments such as WebShop, TravelPlanner, and ScienceWorld, Early Experience improves success rates by an average of 9.6 points over traditional imitation learning and out-of-domain (OOD) performance by 9.4 points. More importantly, under a fixed optimization budget it matches or exceeds imitation learning trained on the full demonstration set while using only a small fraction of the expert data (e.g., 1/8 on WebShop), greatly improving data efficiency.

🤝 **A strong pre-training foundation for reinforcement learning**: When Early Experience is used to initialize reinforcement learning (e.g., GRPO), the subsequent RL stage becomes markedly stronger, with final success rates improving by up to 6.4 points. This shows that Early Experience gives later RL training a more solid and effective starting point, which is especially valuable in real-world settings where reward signals are scarce.

How would your agent stack change if a policy could train purely from its own outcome-grounded rollouts, with no rewards and no demos, yet beat imitation learning across eight benchmarks? Meta Superintelligence Labs proposes 'Early Experience', a reward-free training approach that improves policy learning in language agents without large human demonstration sets and without reinforcement learning (RL) in the main loop. The core idea is simple: let the agent branch from expert states, take its own actions, collect the resulting future states, and convert those consequences into supervision. The research team instantiates this with two concrete strategies, Implicit World Modeling (IWM) and Self-Reflection (SR), and reports consistent gains across eight environments and multiple base models.

Paper: https://arxiv.org/pdf/2510.08558

What Early Experience changes

Traditional pipelines lean on imitation learning (IL) over expert trajectories, which is cheap to optimize but hard to scale and brittle out of distribution; reinforcement learning (RL) promises learning from experience but needs verifiable rewards and stable infrastructure, both often missing in web and multi-tool settings. Early Experience sits between them: it is reward-free like IL, but the supervision is grounded in the consequences of the agent's own actions, not just expert actions. In short, the agent proposes, acts, and learns from what actually happens next, with no reward function required.
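As a concrete illustration, the following is a minimal sketch, in our own framing rather than the authors' code, of how a collected branch (state, agent action, observed next state) could be converted into the two kinds of supervision. The record schema, prompt templates, and function names are assumptions for illustration; the paper's exact formats may differ.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Branch:
    state: str       # observation at an expert-visited state
    action: str      # an action the agent itself proposed and executed
    next_state: str  # the observed consequence of that action

def build_iwm_example(b: Branch) -> Dict[str, str]:
    """Implicit World Modeling: predict the next observation from (state, action)."""
    return {
        "prompt": f"State:\n{b.state}\nAction:\n{b.action}\nPredict the next state:",
        "target": b.next_state,
    }

def build_sr_example(state: str, expert_action: str,
                     alternatives: List[Branch], rationale: str) -> Dict[str, str]:
    """Self-Reflection: contrast the expert action with the agent's own alternatives
    and their observed outcomes; the target is a rationale plus the expert action."""
    contrast = "\n".join(
        f"- Tried: {b.action} -> Observed: {b.next_state}" for b in alternatives
    )
    return {
        "prompt": (f"State:\n{state}\nAlternatives the agent tried:\n{contrast}\n"
                   f"Explain why the expert action is preferable, then state it:"),
        "target": f"{rationale}\nExpert action: {expert_action}",
    }
```

Both builders emit plain (prompt, target) pairs, which is why the training recipe itself does not have to change, as noted next.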

Both strategies use the same budgets and decoding settings as IL; only the data source differs (agent-generated branches rather than more expert trajectories).
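To make that point explicit, here is a tiny sketch, assuming the (prompt, target) schema above and treating the fine-tuning routine as a black box supplied by the caller; `sft_train` and both example lists are placeholders, not the paper's API.

```python
from typing import Callable, Dict, List

SFTExample = Dict[str, str]  # {"prompt": ..., "target": ...}; same schema for IL and Early Experience

def train_policy(sft_train: Callable[[List[SFTExample], int], object],
                 il_examples: List[SFTExample],
                 early_experience_examples: List[SFTExample],
                 use_early_experience: bool,
                 budget_steps: int = 10_000):
    """Hold the optimizer, budget, and decoding settings fixed; swap only the data."""
    examples = early_experience_examples if use_early_experience else il_examples
    return sft_train(examples, budget_steps)
```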


Understanding the Benchmarks

The research team evaluates on eight language-agent environments spanning web navigation, long-horizon planning, scientific/embodied tasks, and multi-domain API workflows, for example WebShop (transactional browsing), TravelPlanner (constraint-rich planning), ScienceWorld, ALFWorld, Tau-Bench, and others. Early Experience yields average absolute gains of +9.6 points in success rate and +9.4 points out-of-domain (OOD) over IL across the full matrix of tasks and models. These gains persist when the same checkpoints are used to initialize RL (GRPO), improving post-RL ceilings by up to +6.4 points compared to RL started from IL.

Efficiency: less expert data, same optimization budget

A key practical win is demo efficiency. With a fixed optimization budget, Early Experience matches or beats IL using a fraction of expert data. On WebShop, 1/8 of the demonstrations with Early Experience already exceeds IL trained on the full demo set; on ALFWorld, parity is hit at 1/2 the demos. The advantage grows with more demonstrations, indicating the agent-generated future states provide supervision signals that demonstrations alone do not capture.

How the data is built

The pipeline seeds from a limited set of expert rollouts to obtain representative states. At selected states, the agent proposes alternative actions, executes them, and records the next observations.
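A hedged sketch of that collection loop, under two assumptions that are ours rather than the paper's: the environment can be restored to a previously visited state (`reset_to` below), and the current policy exposes a `propose` method for sampling an alternative action. The branching parameters are illustrative.

```python
import random

def collect_early_experience(env, policy, expert_rollouts,
                             k_alternatives: int = 3, branch_prob: float = 0.25):
    """Branch from expert-visited states, execute the agent's own actions,
    and record the observed next states as raw material for IWM/SR examples."""
    records = []
    for rollout in expert_rollouts:                  # each rollout: list of (state, expert_action)
        for state, expert_action in rollout:
            if random.random() > branch_prob:        # branch only at a subset of states
                continue
            for _ in range(k_alternatives):
                alt_action = policy.propose(state)   # the agent's own proposal, not the expert's
                env.reset_to(state)                  # assumed ability to restore the state
                next_state = env.step(alt_action)    # execute and observe the consequence
                records.append({
                    "state": state,
                    "expert_action": expert_action,
                    "action": alt_action,
                    "next_state": next_state,
                })
    return records
```

Each record can then be passed through IWM/SR example builders like the ones sketched earlier.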

Where reinforcement learning (RL) fits

Early Experience is not "RL without rewards." It is a supervised recipe that uses agent-experienced outcomes as labels. In environments with verifiable rewards, the research team simply adds RL after Early Experience. Because the initialization is better than IL, the same RL schedule climbs higher and faster, with up to +6.4 points of final success over IL-initialized RL across tested domains. This positions Early Experience as a bridge: reward-free pre-training from consequences, followed (where possible) by standard reinforcement learning (RL).
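As a sketch of that two-stage recipe, with every function name a placeholder for whatever SFT and GRPO tooling is already in use: Early Experience supplies the supervised initialization, and RL is layered on only where a verifiable reward exists.

```python
def train_agent(base_model, env, sft_train, run_grpo, early_experience_examples):
    # Stage 1: reward-free supervision on agent-experienced outcomes.
    policy = sft_train(base_model, early_experience_examples)

    # Stage 2 (optional): if the environment exposes a verifiable reward, run the
    # same RL schedule that would otherwise have started from an IL checkpoint.
    reward_fn = getattr(env, "reward_fn", None)
    if reward_fn is not None:
        policy = run_grpo(policy, env, reward_fn)
    return policy
```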

Editorial Comments

Early Experience is a pragmatic contribution: it replaces brittle rationale-only augmentation with outcome-grounded supervision that an agent can generate at scale, without reward functions. The two variants—Implicit World Modeling (next-observation prediction to anchor environment dynamics) and Self-Reflection (contrastive, outcome-verified rationales against expert actions)—directly attack off-policy drift and long-horizon error accumulation, explaining the consistent gains over imitation learning across eight environments and the stronger RL ceilings when used as an initializer for GRPO. In web and tool-use settings where verifiable rewards are scarce, this reward-free supervision is the missing middle between IL and RL and is immediately actionable for production agent stacks.



