EAGLET Framework Boosts AI Agent Performance on Long-Horizon Tasks

2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and others in the AI industry. In many ways it has been: leading AI model providers such as OpenAI, Google, and even Chinese competitors like Alibaba have released fine-tuned AI models or applications designed to focus on a narrow set of tasks, such as web search and report writing.

But one big hurdle to a future of highly performant, reliable AI agents remains: getting them to stay on task when the task extends over many steps. Third-party benchmark tests show that even the most powerful AI models fail more often the more steps a task takes and the longer they spend on it, notably on tasks that run for hours.

A new academic framework called EAGLET proposes a practical and efficient method to improve long-horizon task performance in LLM-based agents, without manual data labeling and without retraining the agent that executes the task.

Developed by researchers from Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.

The EAGLET planner is a fine-tuned language model that reads the task instructions, typically provided as prompts by the user or the agent's operating environment, and generates a high-level plan for the executor agent, which runs on its own LLM. The planner does not intervene during execution, but its up-front guidance helps reduce planning errors and improve task completion rates.

Addressing the Planning Problem in Long-Horizon Agents

Many LLM-based agents struggle with long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, planning hallucinations, and inefficient trajectories.

EAGLET tackles this limitation by introducing a global planning module that works alongside the executor agent.

Instead of blending planning and action generation in a single model, EAGLET separates them, enabling more coherent, task-level strategies.
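To make the separation concrete, here is a minimal Python sketch of a plan-then-execute loop. It is an illustration under stated assumptions, not the authors' implementation: the `run_episode` function, the `Planner`, `Executor`, and `EnvStep` interfaces, and the toy stand-ins are all hypothetical names chosen for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces for illustration only; not EAGLET's actual API.
Planner = Callable[[str], str]                    # task instruction -> high-level plan
Executor = Callable[[str, str, List[str]], str]   # (task, plan, history) -> next action
EnvStep = Callable[[str], Tuple[str, bool]]       # action -> (observation, done)


def run_episode(task: str, planner: Planner, executor: Executor,
                env_step: EnvStep, max_steps: int = 20) -> List[str]:
    """Generate one global plan up front, then let the executor act step by step."""
    plan = planner(task)  # the planner is called once and never again during execution
    history: List[str] = []
    for _ in range(max_steps):
        action = executor(task, plan, history)
        observation, done = env_step(action)
        history.append(f"{action} -> {observation}")
        if done:
            break
    return history


# Toy stand-ins so the sketch runs end to end.
def toy_planner(task: str) -> str:
    return "1. find mug; 2. put mug on desk"


def toy_executor(task: str, plan: str, history: List[str]) -> str:
    # Follow the plan in order: strip the "N. " prefix from each plan item.
    steps = [item.split(". ", 1)[1] for item in plan.split("; ")]
    return steps[min(len(history), len(steps) - 1)]


def toy_env_step(action: str) -> Tuple[str, bool]:
    # The toy episode ends once the final plan step has been taken.
    return "ok", action == "put mug on desk"


print(run_episode("put a mug on the desk", toy_planner, toy_executor, toy_env_step))
```

The point of the design is that the planner and executor can be swapped independently: the executor only ever sees the plan as one more piece of text in its context.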

A Two-Stage Training Pipeline with No Human Annotations

EAGLET’s planner is trained using a two-stage process that requires no human-written plans or annotations.

The first stage involves generating synthetic plans with high-capability LLMs, such as GPT-5 and DeepSeek-V3.1-Think.

These plans are then filtered using a novel strategy called homologous consensus filtering, which retains only those that improve task performance for both expert and novice executor agents.
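The filtering idea can be sketched in a few lines of Python. This is a simplified reading of the description above, not the paper's exact procedure: the `PlanCandidate` fields, the score scale, and the strict "better than without the plan" test are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanCandidate:
    """A synthetic plan plus hypothetical executor scores measured with and without it."""
    task_id: str
    plan: str
    expert_with: float
    expert_without: float
    novice_with: float
    novice_without: float


def consensus_filter(candidates: List[PlanCandidate]) -> List[PlanCandidate]:
    """Keep only plans that improve results for BOTH the expert and the novice executor.

    Assumption: "improves" means a strictly higher score than the no-plan run.
    """
    return [
        c for c in candidates
        if c.expert_with > c.expert_without and c.novice_with > c.novice_without
    ]


candidates = [
    PlanCandidate("t1", "plan A", 0.9, 0.7, 0.6, 0.3),  # helps both: kept
    PlanCandidate("t2", "plan B", 0.9, 0.7, 0.2, 0.3),  # hurts the novice: dropped
    PlanCandidate("t3", "plan C", 0.6, 0.7, 0.5, 0.3),  # hurts the expert: dropped
]
kept = consensus_filter(candidates)
print([c.task_id for c in kept])
```

Requiring agreement from both executors is what screens out plans that only a strong model can exploit, or that only patch a weak model's specific mistakes.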

In the second stage, a rule-based reinforcement learning process further refines the planner, using a custom-designed reward function to assess how much each plan helps multiple agents succeed.

Introducing the Executor Capability Gain Reward (ECGR)

One of EAGLET’s key innovations is the Executor Capability Gain Reward (ECGR).

This reward measures the value of a generated plan by checking whether it helps both high- and low-capability agents complete tasks more successfully and with fewer steps.

It also includes a decay factor to favor shorter, more efficient task trajectories. This approach avoids over-rewarding plans that are only useful to already-competent agents and promotes more generalizable planning guidance.
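A rough Python sketch shows how a reward of this shape could be computed. The exact ECGR formula is defined in the paper and is not reproduced here; the version below is an assumption that scores each rollout as `gamma ** steps` on success (zero on failure) and averages the with-plan minus without-plan difference across executors. The function names and the default `gamma` of 0.95 are hypothetical.

```python
from typing import Iterable, Tuple

# One entry per executor:
# (success_with_plan, steps_with_plan, success_without_plan, steps_without_plan)
Rollout = Tuple[bool, int, bool, int]


def decayed_return(success: bool, steps: int, gamma: float = 0.95) -> float:
    """Score a rollout: failures get 0, successes decay with the number of steps."""
    return gamma ** steps if success else 0.0


def capability_gain_reward(rollouts: Iterable[Rollout], gamma: float = 0.95) -> float:
    """Average gain from using the plan across executors (assumed form, not the paper's)."""
    rollouts = list(rollouts)
    if not rollouts:
        raise ValueError("need at least one executor rollout")
    gains = [
        decayed_return(s_with, n_with, gamma) - decayed_return(s_without, n_without, gamma)
        for s_with, n_with, s_without, n_without in rollouts
    ]
    return sum(gains) / len(gains)


# Plan helps both the expert (fewer steps) and the novice (failure -> success).
helps_both = [(True, 8, True, 12), (True, 10, False, 20)]
# Plan helps only the expert; the novice still fails.
helps_expert_only = [(True, 8, True, 12), (False, 20, False, 20)]

print(capability_gain_reward(helps_both))
print(capability_gain_reward(helps_expert_only))
```

Under these assumptions the plan that helps both executors earns the larger reward, which matches the stated goal of not over-rewarding plans that only benefit already-competent agents.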

Compatible with Existing Agents and Models

The EAGLET planner is designed to be modular and "plug-and-play," meaning it can be inserted into existing agent pipelines without requiring executor retraining.

In evaluations, the planner boosted performance across a variety of foundation models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.

It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches like Reflexion.

State-of-the-Art Performance Across Benchmarks

EAGLET was tested on three widely used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based lab environment; ALFWorld, which tasks agents with completing household activities through natural language in a simulated home setting; and WebShop, which evaluates goal-driven behavior in a realistic online shopping interface.

Across all three, executor agents equipped with EAGLET outperformed their non-planning counterparts and other planning baselines, including MPO and KnowAgent.

In experiments with the open source Llama-3.1-8B-Instruct model, EAGLET boosted average performance from 39.5 to 59.4, a +19.9 point gain across tasks.

On ScienceWorld unseen scenarios, it raised performance from 42.2 to 61.6.

In ALFWorld seen scenarios, EAGLET improved outcomes from 22.9 to 54.3, a more than 2.3× increase in performance.

Gains were smaller in absolute terms, but still present, with more capable models.

For instance, GPT-4.1 improved its average score from 75.5 to 82.2 with EAGLET (+6.7 points), and GPT-5 rose from 84.5 to 88.1 (+3.6 points), even though both were already strong performers.

In some benchmarks, performance gains were as high as +11.8 points, such as when combining EAGLET with the ETO executor method on ALFWorld unseen tasks.

Compared to other planning baselines like MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld unseen tasks with GPT-4.1, MPO achieved 79.1, while EAGLET scored 83.6—a +4.5 point advantage.

Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as executor, average step count dropped from 13.0 (no planner) to 11.1 (EAGLET). With GPT-5, it dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.
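For readers who want the step counts as relative savings, the short Python snippet below works through the arithmetic using only the figures quoted above. The helper name `relative_reduction` is just a label for this example.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Average step counts reported above: (executor, no planner, with EAGLET).
reported = [("GPT-4.1", 13.0, 11.1), ("GPT-5", 11.4, 9.4)]

for name, before, after in reported:
    saved = before - after
    pct = relative_reduction(before, after)
    print(f"{name}: {saved:.1f} fewer steps on average ({pct:.1f}% reduction)")
```

That works out to roughly a 14.6% reduction for GPT-4.1 and 17.5% for GPT-5.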

Efficiency Gains in Training and Execution

Compared to RL-based methods like GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with roughly one-eighth the training effort.

This efficiency also carries over into execution: agents using EAGLET typically needed fewer steps to complete tasks. This translates into reduced inference time and compute cost in production scenarios.

No Public Code—Yet

As of the version submitted to arXiv, the authors have not released an open-source implementation of EAGLET. It is unclear if or when the code will be released, under what license, or how it will be maintained, which may limit the near-term utility of the framework for enterprise deployment.

VentureBeat has reached out to the authors to clarify these points and will update this piece when we hear back.

Enterprise Deployment Questions Remain

While the planner is described as plug-and-play, it remains unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks such as LangChain or AutoGen, or if it requires a custom stack to support plan-execute separation.

Similarly, the training setup leverages multiple executor agents, which may be difficult to replicate in enterprise environments with limited model access. VentureBeat has asked the researchers whether the homologous consensus filtering method can be adapted for teams that only have access to one executor model or limited compute resources.

EAGLET’s authors report success across model types and sizes, but it is not yet known what the minimal viable model scale is for practical deployment. For example, can enterprise teams use the planner effectively with sub-10B parameter open models in latency-sensitive environments? Additionally, the framework may offer industry-specific value in domains like customer support or IT automation, but it remains to be seen how easily the planner can be fine-tuned or customized for such verticals.

Real-Time vs. Pre-Generated Planning

Another open question is how EAGLET is best deployed in practice. Should the planner operate in real-time alongside executors within a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat has posed this question to the authors and will report any insights that emerge.
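One way to frame the tradeoff is a hybrid: serve a pre-generated plan when one exists for a known task type, and fall back to a live planner call otherwise. The Python sketch below is a hypothetical design, not something described by the authors; the `PlanProvider` class, its methods, and the assumption that a plan can be reused across instructions of the same task type are all illustrative.

```python
from typing import Callable, Dict, Optional


class PlanProvider:
    """Serve pre-generated plans for known task types; call the planner for the rest."""

    def __init__(self, planner: Callable[[str], str],
                 pregenerated: Optional[Dict[str, str]] = None) -> None:
        self._planner = planner
        self._cache: Dict[str, str] = dict(pregenerated or {})
        self.live_calls = 0  # how many times the planner model was actually invoked

    def get_plan(self, task_type: str, instruction: str) -> str:
        # Offline path: reuse a stored plan (assumes plans transfer within a task type).
        if task_type in self._cache:
            return self._cache[task_type]
        # Real-time path: pay planner latency and cost once, then remember the result.
        self.live_calls += 1
        plan = self._planner(instruction)
        self._cache[task_type] = plan
        return plan


def stub_planner(instruction: str) -> str:
    # Stand-in for a call to a planner model.
    return f"plan for: {instruction}"


provider = PlanProvider(stub_planner, {"password_reset": "1. verify identity; 2. send reset link"})
print(provider.get_plan("password_reset", "Reset the password for user 42"))
print(provider.get_plan("vpn_setup", "Set up VPN access for a new laptop"))
print(provider.live_calls)
```

Tracking `live_calls` gives a simple handle on how much of the planner's latency and cost is actually incurred at run time versus absorbed offline.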

Strategic Tradeoffs for Enterprise Teams

For technical leaders at medium-to-large enterprises, EAGLET represents a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without public tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Enterprises must weigh the potential gains in task performance and efficiency against the costs of reproducing or approximating the training process in-house.

Potential Use Cases in Enterprise Settings

For enterprises developing agentic AI systems—especially in environments requiring stepwise planning, such as IT automation, customer support, or online interactions—EAGLET offers a template for how to incorporate planning without retraining. Its ability to guide both open- and closed-source models, along with its efficient training method, may make it an appealing starting point for teams seeking to improve agent performance with minimal overhead.