MarkTechPost@AI October 15, 17:51
Meta AI's New Method: Reward-Free Training for Language Agents That Outperforms Imitation Learning

Meta Superintelligence Labs has proposed a new training method called "Early Experience" that lets language agents learn solely from feedback on their own actions, with no reward signals and no human demonstrations, and it outperforms imitation learning across eight benchmarks. The method has the agent branch from expert states, take its own actions, and collect the resulting future states, converting those consequences into supervision signals. Its two concrete strategies are Implicit World Modeling (IWM) and Self-Reflection (SR). Under an optimization budget comparable to imitation learning, the approach substantially improves agents' success rates and generalization, and it is especially efficient when expert data is limited.

🌟 **"Early Experience": an innovative reward-free training paradigm**: The method changes how language agents are trained, removing the dependence on reward functions and large volumes of human demonstrations. By exploring autonomously and learning from the consequences of its own actions, the agent shows strong learning ability across complex environments and provides a better initialization for a subsequent reinforcement learning stage.

🌐 **Implicit World Modeling (IWM) and Self-Reflection (SR) work in tandem**: IWM predicts the next state from a given state and action, helping the agent build a more accurate model of environment dynamics and reducing policy drift. SR contrasts the actual outcomes of expert and alternative actions, asks the model to explain why the expert action is better, and extracts supervision from that explanation to further refine the policy.

🚀 **Significant performance gains and efficiency improvements**: Across eight language-agent environments such as WebShop, TravelPlanner, and ScienceWorld, Early Experience improves success rates by an average of 9.6 points over traditional imitation learning and out-of-domain (OOD) performance by 9.4 points. More importantly, under a fixed optimization budget it matches or exceeds imitation learning trained on the full demonstration set while using only a small fraction of the expert data (e.g., 1/8 on WebShop), greatly improving data efficiency.

🤝 **A strong pre-training foundation for reinforcement learning**: When Early Experience is used to initialize reinforcement learning (e.g., GRPO), the subsequent RL stage becomes markedly stronger, with final success rates improving by up to 6.4 points. This shows that Early Experience gives later RL training a more solid and effective starting point, which is especially valuable in real-world settings where reward signals are scarce.

How would your agent stack change if a policy could train purely from its own outcome-grounded rollouts, with no rewards and no demos, yet beat imitation learning across eight benchmarks? Meta Superintelligence Labs proposes 'Early Experience', a reward-free training approach that improves policy learning in language agents without large human demonstration sets and without reinforcement learning (RL) in the main loop. The core idea is simple: let the agent branch from expert states, take its own actions, collect the resulting future states, and convert those consequences into supervision. The research team instantiates this with two concrete strategies, Implicit World Modeling (IWM) and Self-Reflection (SR), and reports consistent gains across eight environments and multiple base models.

Paper: https://arxiv.org/pdf/2510.08558

What Early Experience changes

Traditional pipelines lean on imitation learning (IL) over expert trajectories, which is cheap to optimize but hard to scale and brittle out of distribution; reinforcement learning (RL) promises learning from experience but needs verifiable rewards and stable infrastructure, both often missing in web and multi-tool settings. Early Experience sits between them: it is reward-free like IL, but the supervision is grounded in the consequences of the agent's own actions, not just expert actions. In short, the agent proposes, acts, and learns from what actually happens next, with no reward function required.
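As a concrete illustration, the following is a minimal sketch, in our own framing rather than the authors' code, of how a collected branch (state, agent action, observed next state) could be converted into the two kinds of supervision. The record schema, prompt templates, and function names are assumptions for illustration; the paper's exact formats may differ.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Branch:
    state: str       # observation at an expert-visited state
    action: str      # an action the agent itself proposed and executed
    next_state: str  # the observed consequence of that action

def build_iwm_example(b: Branch) -> Dict[str, str]:
    """Implicit World Modeling: predict the next observation from (state, action)."""
    return {
        "prompt": f"State:\n{b.state}\nAction:\n{b.action}\nPredict the next state:",
        "target": b.next_state,
    }

def build_sr_example(state: str, expert_action: str,
                     alternatives: List[Branch], rationale: str) -> Dict[str, str]:
    """Self-Reflection: contrast the expert action with the agent's own alternatives
    and their observed outcomes; the target is a rationale plus the expert action."""
    contrast = "\n".join(
        f"- Tried: {b.action} -> Observed: {b.next_state}" for b in alternatives
    )
    return {
        "prompt": (f"State:\n{state}\nAlternatives the agent tried:\n{contrast}\n"
                   f"Explain why the expert action is preferable, then state it:"),
        "target": f"{rationale}\nExpert action: {expert_action}",
    }
```

Both builders emit plain (prompt, target) pairs, which is why the training recipe itself does not have to change, as noted next.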

Both strategies use the same budgets and decoding settings as IL; only the data source differs (agent-generated branches rather than more expert trajectories).
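To make that point explicit, here is a tiny sketch, assuming the (prompt, target) schema above and treating the fine-tuning routine as a black box supplied by the caller; `sft_train` and both example lists are placeholders, not the paper's API.

```python
from typing import Callable, Dict, List

SFTExample = Dict[str, str]  # {"prompt": ..., "target": ...}; same schema for IL and Early Experience

def train_policy(sft_train: Callable[[List[SFTExample], int], object],
                 il_examples: List[SFTExample],
                 early_experience_examples: List[SFTExample],
                 use_early_experience: bool,
                 budget_steps: int = 10_000):
    """Hold the optimizer, budget, and decoding settings fixed; swap only the data."""
    examples = early_experience_examples if use_early_experience else il_examples
    return sft_train(examples, budget_steps)
```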


Understanding the Benchmarks

The research team evaluates on eight language-agent environments spanning web navigation, long-horizon planning, scientific/embodied tasks, and multi-domain API workflows, for example WebShop (transactional browsing), TravelPlanner (constraint-rich planning), ScienceWorld, ALFWorld, Tau-Bench, and others. Early Experience yields average absolute gains of +9.6 points in success rate and +9.4 points out-of-domain (OOD) over IL across the full matrix of tasks and models. These gains persist when the same checkpoints are used to initialize RL (GRPO), improving post-RL ceilings by up to +6.4 points compared to RL started from IL.

Efficiency: less expert data, same optimization budget

A key practical win is demo efficiency. With a fixed optimization budget, Early Experience matches or beats IL using a fraction of expert data. On WebShop, 1/8 of the demonstrations with Early Experience already exceeds IL trained on the full demo set; on ALFWorld, parity is hit at 1/2 the demos. The advantage grows with more demonstrations, indicating the agent-generated future states provide supervision signals that demonstrations alone do not capture.

How the data is built

The pipeline seeds from a limited set of expert rollouts to obtain representative states. At selected states, the agent proposes alternative actions, executes them, and records the next observations.
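A hedged sketch of that collection loop, under two assumptions that are ours rather than the paper's: the environment can be restored to a previously visited state (`reset_to` below), and the current policy exposes a `propose` method for sampling an alternative action. The branching parameters are illustrative.

```python
import random

def collect_early_experience(env, policy, expert_rollouts,
                             k_alternatives: int = 3, branch_prob: float = 0.25):
    """Branch from expert-visited states, execute the agent's own actions,
    and record the observed next states as raw material for IWM/SR examples."""
    records = []
    for rollout in expert_rollouts:                  # each rollout: list of (state, expert_action)
        for state, expert_action in rollout:
            if random.random() > branch_prob:        # branch only at a subset of states
                continue
            for _ in range(k_alternatives):
                alt_action = policy.propose(state)   # the agent's own proposal, not the expert's
                env.reset_to(state)                  # assumed ability to restore the state
                next_state = env.step(alt_action)    # execute and observe the consequence
                records.append({
                    "state": state,
                    "expert_action": expert_action,
                    "action": alt_action,
                    "next_state": next_state,
                })
    return records
```

Each record can then be passed through IWM/SR example builders like the ones sketched earlier.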

Where reinforcement learning (RL) fits

Early Experience is not "RL without rewards." It is a supervised recipe that uses agent-experienced outcomes as labels. In environments with verifiable rewards, the research team simply adds RL after Early Experience. Because the initialization is better than IL, the same RL schedule climbs higher and faster, with up to +6.4 points of final success over IL-initialized RL across tested domains. This positions Early Experience as a bridge: reward-free pre-training from consequences, followed (where possible) by standard reinforcement learning (RL).
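As a sketch of that two-stage recipe, with every function name a placeholder for whatever SFT and GRPO tooling is already in use: Early Experience supplies the supervised initialization, and RL is layered on only where a verifiable reward exists.

```python
def train_agent(base_model, env, sft_train, run_grpo, early_experience_examples):
    # Stage 1: reward-free supervision on agent-experienced outcomes.
    policy = sft_train(base_model, early_experience_examples)

    # Stage 2 (optional): if the environment exposes a verifiable reward, run the
    # same RL schedule that would otherwise have started from an IL checkpoint.
    reward_fn = getattr(env, "reward_fn", None)
    if reward_fn is not None:
        policy = run_grpo(policy, env, reward_fn)
    return policy
```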

Editorial Comments

Early Experience is a pragmatic contribution: it replaces brittle rationale-only augmentation with outcome-grounded supervision that an agent can generate at scale, without reward functions. The two variants—Implicit World Modeling (next-observation prediction to anchor environment dynamics) and Self-Reflection (contrastive, outcome-verified rationales against expert actions)—directly attack off-policy drift and long-horizon error accumulation, explaining the consistent gains over imitation learning across eight environments and the stronger RL ceilings when used as an initializer for GRPO. In web and tool-use settings where verifiable rewards are scarce, this reward-free supervision is the missing middle between IL and RL and is immediately actionable for production agent stacks.



