VentureBeat, October 15, 06:41
EAGLET Framework Improves AI Agent Performance on Long-Horizon Tasks

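2025 has been called the year of "AI agents," yet agents still face efficiency and reliability problems on long-horizon tasks. Researchers from Tsinghua University and other institutions recently proposed EAGLET, a framework for improving the long-horizon performance of agents built on large language models (LLMs). EAGLET adds a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency, without manual data labeling or retraining of the executor. The planner is trained in two stages, using synthetic data and reinforcement learning, and an Executor Capability Gain Reward (ECGR) is used to assess how useful each plan is. EAGLET is compatible with a range of existing models and agents and shows performance gains on several benchmarks, including higher task completion rates and fewer execution steps, giving enterprises a new option to consider when deploying AI agents.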

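💡 **EAGLET targets the long-horizon bottleneck for AI agents**: Agent performance drops markedly on tasks that take many steps to complete. EAGLET adds a separate "global planner" that splits planning from execution, which improves agent reliability and efficiency on complex, time-consuming tasks.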

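🚀 **Annotation-free training and a new reward mechanism**: EAGLET's training requires no human-labeled data; it uses two stages, synthetic plan generation followed by rule-based reinforcement learning. A core innovation is the Executor Capability Gain Reward (ECGR), which scores a plan by how much it helps executor agents of different capability levels, so that the resulting plans are broadly applicable and efficient.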

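🔌 **Broad compatibility and performance gains**: EAGLET is designed to be modular and plug-and-play, so it can be integrated into existing agent pipelines without retraining the underlying models. Across several benchmarks, agents equipped with EAGLET improved in both task completion rate and execution efficiency; even a strong model such as GPT-5 improved further.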

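🤔 **Enterprise deployment considerations and outlook**: Despite its promise, EAGLET's code has not been released, and open questions remain about integration with enterprise agent frameworks and about reproducing the multi-model training setup. Enterprises weighing adoption need to balance the cost of reproducing it in-house against waiting for an official release; either way, it offers a useful reference for making AI agents more reliable and efficient.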

2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and other AI industry figures. In many ways it has been: leading AI model providers such as OpenAI, Google, and even Chinese competitors like Alibaba have released fine-tuned AI models or applications designed to focus on a narrow set of tasks, such as web search and report writing.

But one big hurdle to a future of highly performant, reliable AI agents remains: getting them to stay on task when the task extends over many steps. Third-party benchmark tests show that even the most powerful AI models fail more often the more steps a task takes and the longer they spend on it (in some cases, hours).

A new academic framework called EAGLET proposes a practical and efficient method to improve long-horizon task performance in LLM-based agents — without the need for manual data labeling or retraining.

Developed by researchers from Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.

EAGLET is a fine-tuned language model that interprets task instructions, typically provided as prompts by the user or the agent's operating environment, and generates a high-level plan for the executor agent, which runs on its own LLM. It does not intervene during execution, but its up-front guidance helps reduce planning errors and improve task completion rates.

Addressing the Planning Problem in Long-Horizon Agents

Many LLM-based agents struggle with long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, planning hallucinations, and inefficient trajectories.

EAGLET tackles this limitation by introducing a global planning module that works alongside the executor agent.

Instead of blending planning and action generation in a single model, EAGLET separates them, enabling more coherent, task-level strategies.
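The separation can be pictured with a short sketch. It assumes a planner that is called once before execution and an executor that receives the plan as fixed extra context; the names `run_episode`, `toy_planner`, `toy_executor`, and `toy_env_step` are hypothetical stand-ins for illustration, not the paper's interface.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures (assumptions, not from the paper):
#   planner:  task instruction -> high-level plan (text)
#   executor: (task, plan, history) -> next action (text)
#   env_step: action -> (observation, done)
Planner = Callable[[str], str]
Executor = Callable[[str, str, List[Tuple[str, str]]], str]
EnvStep = Callable[[str], Tuple[str, bool]]


def run_episode(task: str, planner: Planner, executor: Executor,
                env_step: EnvStep, max_steps: int = 20) -> List[Tuple[str, str]]:
    """Call the planner once up front, then let the executor act step by step."""
    plan = planner(task)  # global plan, generated before any action is taken
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        action = executor(task, plan, history)  # the plan stays fixed during execution
        observation, done = env_step(action)
        history.append((action, observation))
        if done:
            break
    return history


# Toy stand-ins so the sketch runs without any model or simulator.
def toy_planner(task: str) -> str:
    return "1. find the mug 2. heat the mug 3. put the mug on the table"


def toy_executor(task: str, plan: str, history: List[Tuple[str, str]]) -> str:
    steps = ["find mug", "heat mug", "put mug on table"]
    return steps[len(history)]


def toy_env_step(action: str) -> Tuple[str, bool]:
    return f"ok: {action}", action == "put mug on table"


trajectory = run_episode("heat a mug and place it on the table",
                         toy_planner, toy_executor, toy_env_step)
print(len(trajectory))  # number of executor steps taken
```

In a real setup the planner and executor would each wrap an LLM call; the point of the sketch is only that the planner runs once and never sees intermediate observations.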

A Two-Stage Training Pipeline with No Human Annotations

EAGLET’s planner is trained using a two-stage process that requires no human-written plans or annotations.

The first stage involves generating synthetic plans with high-capability LLMs, such as GPT-5 and DeepSeek-V3.1-Think.

These plans are then filtered using a novel strategy called homologous consensus filtering, which retains only those that improve task performance for both expert and novice executor agents.
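As a rough illustration of the filtering idea, the sketch below keeps a candidate plan only when it raises the score of both an expert and a novice executor. The `PlanScores` fields, the strict-improvement test, and the example numbers are assumptions made for illustration; the paper's exact criterion may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanScores:
    """Scores for one candidate plan (all field names are hypothetical)."""
    plan: str
    expert_with: float     # expert executor's score when given the plan
    expert_without: float  # expert executor's score with no plan
    novice_with: float     # novice executor's score when given the plan
    novice_without: float  # novice executor's score with no plan


def consensus_filter(candidates: List[PlanScores]) -> List[str]:
    """Keep a plan only if it improves BOTH the expert and the novice executor."""
    kept = []
    for c in candidates:
        helps_expert = c.expert_with > c.expert_without
        helps_novice = c.novice_with > c.novice_without
        if helps_expert and helps_novice:
            kept.append(c.plan)
    return kept


# Made-up scores, for illustration only.
candidates = [
    PlanScores("plan A", 0.9, 0.8, 0.6, 0.3),  # helps both executors: kept
    PlanScores("plan B", 0.9, 0.8, 0.2, 0.3),  # hurts the novice: dropped
    PlanScores("plan C", 0.7, 0.8, 0.6, 0.3),  # hurts the expert: dropped
]
kept_plans = consensus_filter(candidates)
print(kept_plans)
```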

In the second stage, a rule-based reinforcement learning process further refines the planner, using a custom-designed reward function to assess how much each plan helps multiple agents succeed.

Introducing the Executor Capability Gain Reward (ECGR)

One of EAGLET’s key innovations is the Executor Capability Gain Reward (ECGR).

This reward measures the value of a generated plan by checking whether it helps both high- and low-capability agents complete tasks more successfully and with fewer steps.

It also includes a decay factor to favor shorter, more efficient task trajectories. This approach avoids over-rewarding plans that are only useful to already-competent agents and promotes more generalizable planning guidance.
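The article does not give the reward's formula, so the following is only one plausible shape, under stated assumptions: each rollout earns `gamma ** steps` on success and 0 on failure, and the reward is the average per-executor difference between rollouts with and without the plan. The function names, the value of `gamma`, and the rollout numbers are all hypothetical.

```python
from typing import List, Tuple

# One rollout is (succeeded, number_of_steps).
Rollout = Tuple[bool, int]


def decayed_score(rollout: Rollout, gamma: float = 0.9) -> float:
    """Success discounted by trajectory length: shorter successful runs score higher."""
    succeeded, steps = rollout
    return gamma ** steps if succeeded else 0.0


def capability_gain_reward(with_plan: List[Rollout],
                           without_plan: List[Rollout],
                           gamma: float = 0.9) -> float:
    """Average per-executor gain from using the plan (assumed form, not the paper's).

    with_plan[i] and without_plan[i] are rollouts of the same executor,
    e.g. index 0 = expert, index 1 = novice.
    """
    gains = [decayed_score(w, gamma) - decayed_score(wo, gamma)
             for w, wo in zip(with_plan, without_plan)]
    return sum(gains) / len(gains)


# Plan X helps both executors; plan Y only shortens the expert's run.
reward_x = capability_gain_reward(with_plan=[(True, 8), (True, 12)],
                                  without_plan=[(True, 10), (False, 20)])
reward_y = capability_gain_reward(with_plan=[(True, 8), (False, 20)],
                                  without_plan=[(True, 10), (False, 20)])
print(reward_x > reward_y)
```

Under this assumed form, a plan that only benefits the already-competent executor earns less than one that also rescues the weaker executor, which matches the behavior the article describes.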

Compatible with Existing Agents and Models

The EAGLET planner is designed to be modular and "plug-and-play," meaning it can be inserted into existing agent pipelines without requiring executor retraining.

In evaluations, the planner boosted performance across a variety of foundational models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.

It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches like Reflexion.

State-of-the-Art Performance Across Benchmarks

EAGLET was tested on three widely used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based lab environment; ALFWorld, which tasks agents with completing household activities through natural language in a simulated home setting; and WebShop, which evaluates goal-driven behavior in a realistic online shopping interface.

Across all three, executor agents equipped with EAGLET outperformed their non-planning counterparts and other planning baselines, including MPO and KnowAgent.

In experiments with the open source Llama-3.1-8B-Instruct model, EAGLET boosted average performance from 39.5 to 59.4, a +19.9 point gain across tasks.

On ScienceWorld unseen scenarios, it raised performance from 42.2 to 61.6.

In ALFWorld seen scenarios, EAGLET improved outcomes from 22.9 to 54.3, a more than 2.3× increase in performance.

More capable models also improved, though by smaller margins.

For instance, GPT-4.1 improved from 75.5 to 82.2 average score with EAGLET, and GPT-5 rose from 84.5 to 88.1, despite already being strong performers.

Gains also held when EAGLET was paired with other executor methods: combined with the ETO executor method on ALFWorld unseen tasks, it added as much as +11.8 points.

Compared to other planning baselines like MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld unseen tasks with GPT-4.1, MPO achieved 79.1, while EAGLET scored 83.6—a +4.5 point advantage.

Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as executor, average step count dropped from 13.0 (no planner) to 11.1 (EAGLET). With GPT-5, it dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.
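To put these figures side by side, the short calculation below derives the point gains and the relative step reductions from the numbers quoted in this article. Only the reported values are used; the variable names are illustrative.

```python
# (baseline, with EAGLET) pairs as reported in the article.
scores = {
    "Llama-3.1-8B-Instruct avg": (39.5, 59.4),
    "GPT-4.1 avg": (75.5, 82.2),
    "GPT-5 avg": (84.5, 88.1),
}
steps = {
    "GPT-4.1": (13.0, 11.1),
    "GPT-5": (11.4, 9.4),
}

score_gain = {}
for name, (base, new) in scores.items():
    score_gain[name] = new - base  # absolute gain in points
    print(f"{name}: {score_gain[name]:+.1f} points")

step_cut = {}
for name, (base, new) in steps.items():
    step_cut[name] = 100 * (base - new) / base  # relative reduction in percent
    print(f"{name}: {step_cut[name]:.1f}% fewer steps")
```

The point gains come to +19.9, +6.7, and +3.6, and the step counts fall by roughly 14.6% with GPT-4.1 and 17.5% with GPT-5.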

Efficiency Gains in Training and Execution

Compared to RL-based methods like GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with roughly one-eighth the training effort.

This efficiency also carries over into execution: agents using EAGLET typically needed fewer steps to complete tasks. This translates into reduced inference time and compute cost in production scenarios.

No Public Code—Yet

As of the version submitted to arXiv, the authors have not released an open-source implementation of EAGLET. It is unclear if or when the code will be released, under what license, or how it will be maintained, which may limit the near-term utility of the framework for enterprise deployment.

VentureBeat has reached out to the authors to clarify these points and will update this piece when we hear back.

Enterprise Deployment Questions Remain

While the planner is described as plug-and-play, it remains unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks such as LangChain or AutoGen, or if it requires a custom stack to support plan-execute separation.

Similarly, the training setup leverages multiple executor agents, which may be difficult to replicate in enterprise environments with limited model access. VentureBeat has asked the researchers whether the homologous consensus filtering method can be adapted for teams that only have access to one executor model or limited compute resources.

EAGLET’s authors report success across model types and sizes, but it is not yet known what the minimal viable model scale is for practical deployment. For example, can enterprise teams use the planner effectively with sub-10B parameter open models in latency-sensitive environments? Additionally, the framework may offer industry-specific value in domains like customer support or IT automation, but it remains to be seen how easily the planner can be fine-tuned or customized for such verticals.

Real-Time vs. Pre-Generated Planning

Another open question is how EAGLET is best deployed in practice. Should the planner operate in real-time alongside executors within a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat has posed this question to the authors and will report any insights that emerge.
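The two deployment modes can be contrasted with a small sketch: plans for known task types are generated offline and cached, and anything else falls back to a live planner call. `PlanCache`, its methods, and the toy planner are hypothetical illustrations of the tradeoff, not something described in the paper.

```python
from typing import Callable, Dict


class PlanCache:
    """Serve pre-generated plans for known task types; fall back to a live planner."""

    def __init__(self, planner: Callable[[str], str]):
        self._planner = planner
        self._plans: Dict[str, str] = {}
        self.live_calls = 0  # counts planner invocations made at request time

    def pregenerate(self, task_type: str, instruction: str) -> None:
        """Offline path: compute and store a plan ahead of time."""
        self._plans[task_type] = self._planner(instruction)

    def get_plan(self, task_type: str, instruction: str) -> str:
        """Online path: reuse a stored plan, otherwise plan in real time."""
        if task_type in self._plans:
            return self._plans[task_type]
        self.live_calls += 1
        return self._planner(instruction)


def toy_planner(instruction: str) -> str:
    # Stand-in for a call to the planner model.
    return f"plan for: {instruction}"


cache = PlanCache(toy_planner)
cache.pregenerate("password_reset", "reset a user's password")

cached = cache.get_plan("password_reset", "reset password for alice")
live = cache.get_plan("vpn_setup", "set up VPN for bob")
print(cache.live_calls)  # only the unknown task type triggered a live call
```

The cached path avoids planner latency and cost at request time, but the stored plan is generic to the task type and ignores instance-specific details, which is the tradeoff the question above points at.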

Strategic Tradeoffs for Enterprise Teams

For technical leaders at medium-to-large enterprises, EAGLET represents a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without public tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Enterprises must weigh the potential gains in task performance and efficiency against the costs of reproducing or approximating the training process in-house.

Potential Use Cases in Enterprise Settings

For enterprises developing agentic AI systems—especially in environments requiring stepwise planning, such as IT automation, customer support, or online interactions—EAGLET offers a template for how to incorporate planning without retraining. Its ability to guide both open- and closed-source models, along with its efficient training method, may make it an appealing starting point for teams seeking to improve agent performance with minimal overhead.

