MarkTechPost@AI, October 9
AgentFlow: A Novel Trainable Agent Framework

AgentFlow is a novel trainable agent framework that decomposes complex tasks across four core modules: a planner, an executor, a verifier, and a generator, coordinated through an explicit memory and a toolset. Its training method, Flow-GRPO, turns long-horizon, sparse-reward reinforcement learning into single-turn updates by broadcasting the trajectory-level reward signal to every turn and combining it with token-level PPO-style updates and KL regularization, enabling efficient optimization of the planner. Across ten benchmarks, AgentFlow with a 7B model fine-tuned via Flow-GRPO achieves substantial gains on search, agentic reasoning, math, and science tasks, surpassing strong baselines and even GPT-4o.

🗂️ **Modular agent architecture**: AgentFlow designs the agent as four separate but cooperating modules, a planner, an executor, a verifier, and a generator, supported by an explicit memory and a toolset. This modular design keeps training focused: only the planner is iteratively optimized, while the other modules can serve as fixed engines, improving the system's flexibility and maintainability.

🚀 **Flow-GRPO training method**: The framework introduces Flow-GRPO (Flow-based Group Refined Policy Optimization), a reinforcement learning method designed for long-horizon, sparse-reward optimization. It broadcasts the final trajectory-level reward signal to every turn so that local planning steps stay aligned with overall task success, and it uses a token-level PPO-style objective with group-normalized advantages to stabilize training and reduce variance.

📈 **Significant performance gains**: Across ten benchmarks spanning knowledge-intensive search, agentic reasoning, math, and science, AgentFlow delivers strong results. With a 7B-parameter model trained via Flow-GRPO, it achieves average improvements of 14.9% on search, 14.0% on agentic reasoning, 14.5% on math, and 4.1% on science, and the research team reports that the 7B system surpasses GPT-4o on the evaluated suite.

🛠️ **More reliable tool use**: Beyond overall task completion, AgentFlow also improves tool use. The study shows the framework substantially reduces tool-calling errors, for example by up to 28.4% on the GAIA benchmark, and tool-calling accuracy and planning quality both trend upward with more turns and larger model scale.

TL;DR: AgentFlow is a trainable agent framework with four modules—Planner, Executor, Verifier, Generator—coordinated by an explicit memory and toolset. The planner is optimized in the loop with a new on-policy method, Flow-GRPO, which broadcasts a trajectory-level outcome reward to every turn and applies token-level PPO-style updates with KL regularization and group-normalized advantages. On ten benchmarks, a 7B backbone tuned with Flow-GRPO reports +14.9% (search), +14.0% (agentic), +14.5% (math), and +4.1% (science) over strong baselines.

What is AgentFlow?

AgentFlow formalizes multi-turn, tool-integrated reasoning as a Markov Decision Process (MDP). At each turn, the Planner proposes a sub-goal and selects a tool plus context; the Executor calls the tool; the Verifier signals whether to continue; the Generator emits the final answer on termination. A structured, evolving memory records states, tool calls, and verification signals, constraining context growth and making trajectories auditable. Only the planner is trained; the other modules can be fixed engines.
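The loop described above can be summarized in a few lines of Python. This is a minimal sketch of the four-module control flow with an explicit memory; module interfaces, names, and the memory format are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of an AgentFlow-style loop. Interfaces are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Explicit, evolving record of states, tool calls, and verification signals."""
    records: list = field(default_factory=list)

    def add(self, **entry):
        self.records.append(entry)

def run_agentflow(task, planner, executor, verifier, generator, tools, max_turns=10):
    memory = Memory()
    for turn in range(max_turns):
        # Planner (the only trained module): propose a sub-goal, pick a tool and its context.
        sub_goal, tool_name, tool_input = planner(task, memory)
        # Executor: call the selected tool.
        observation = executor(tools[tool_name], tool_input)
        memory.add(turn=turn, sub_goal=sub_goal, tool=tool_name,
                   tool_input=tool_input, observation=observation)
        # Verifier: decide whether enough evidence has been gathered to stop.
        if verifier(task, memory):
            break
    # Generator: emit the final answer from the accumulated memory.
    return generator(task, memory)
```

Because only `planner` is optimized, the executor, verifier, and generator can remain frozen engines, which keeps the RL problem focused on planning decisions.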

The public implementation showcases a modular toolkit (e.g., base_generator, python_coder, google_search, wikipedia_search, web_search) and ships quick-start scripts for inference, training, and benchmarking. The repository is MIT-licensed.
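For illustration only, the named tools could be exposed to the planner through a plain registry like the one below; the actual repository wraps tools in its own classes and signatures, so treat these stand-in implementations as hypothetical.

```python
# Illustrative registry mapping toolkit names mentioned above to callables.
import contextlib
import io

def python_coder(code: str) -> str:
    """Hypothetical stand-in: run a short Python snippet and return its stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # no sandboxing; for illustration only
    return buf.getvalue()

def web_search(query: str) -> str:
    """Hypothetical stand-in: a real implementation would call a search API."""
    return f"[search results for: {query}]"

tools = {
    "python_coder": python_coder,
    "web_search": web_search,
    # base_generator, google_search, wikipedia_search would be registered similarly.
}
```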

https://arxiv.org/pdf/2510.05592

Training method: Flow-GRPO

Flow-GRPO (Flow-based Group Refined Policy Optimization) converts long-horizon, sparse-reward optimization into tractable single-turn updates. The final trajectory-level outcome reward is broadcast to every turn, so each local planning step is credited against overall task success, and the planner is updated with a token-level PPO-style clipped objective, group-normalized advantages to reduce variance, and KL regularization to keep updates stable:

https://arxiv.org/pdf/2510.05592
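The sketch below translates that description into code: one outcome reward per trajectory, group-normalized advantages broadcast to every turn/token, a PPO-style clipped token objective, and a KL penalty toward a reference policy. Tensor shapes, the KL estimator, and hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
# Flow-GRPO-style update sketch (hypothetical shapes; padding masks omitted).
import torch

def flow_grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, kl_coef=0.01):
    """
    logp_new, logp_old, logp_ref: [G, T] per-token log-probs of the planner's
        actions under the current, behavior, and reference policies
        (G = trajectories in the group, T = tokens).
    rewards: [G] final outcome reward of each trajectory.
    """
    # Group-normalized advantage: one scalar per trajectory ...
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # [G]
    # ... broadcast to every turn/token of that trajectory.
    adv = adv.unsqueeze(1).expand_as(logp_new)                   # [G, T]

    # Token-level PPO-style clipped objective.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()

    # Simple KL estimate toward the reference policy for regularization.
    kl = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl
```

In practice the per-token log-probs would come from the 7B planner model, and each group would consist of multiple rollouts of the same task, so the normalization compares trajectories that attempted the same problem.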

Understanding the results and benchmarks

Benchmarks. The research team evaluates four task types: knowledge-intensive search (Bamboogle, 2Wiki, HotpotQA, Musique), agentic reasoning (GAIA textual split), math (AIME-24, AMC-23, Game of 24), and science (GPQA, MedQA). GAIA is a tooling-oriented benchmark for general assistants; the textual split excludes multimodal requirements.

Main numbers (7B backbone after Flow-GRPO). Average gains over strong baselines: +14.9% (search), +14.0% (agentic), +14.5% (math), +4.1% (science). The research team states that its 7B system surpasses GPT-4o on the reported suite. The project page also reports training effects such as improved planning quality, reduced tool-calling errors (up to 28.4% on GAIA), and positive trends with larger turn budgets and model scale.

Ablations. Online Flow-GRPO improves performance by +17.2% vs. a frozen-planner baseline, while offline supervised fine-tuning of the planner degrades performance by −19.0% on their composite metric.

https://arxiv.org/pdf/2510.05592

Key Takeaways

- AgentFlow separates an agent into four modules (planner, executor, verifier, generator) coordinated by an explicit memory and toolset; only the planner is trained.
- Flow-GRPO turns long-horizon, sparse-reward RL into single-turn updates by broadcasting the trajectory-level reward to every turn, with token-level PPO-style updates, group-normalized advantages, and KL regularization.
- With a 7B backbone, reported average gains are +14.9% (search), +14.0% (agentic), +14.5% (math), and +4.1% (science), and the team states the system surpasses GPT-4o on this suite.
- Tool-calling errors drop by up to 28.4% on GAIA, and planning quality and tool reliability improve with larger turn budgets and model scale.

Editorial Comments

AgentFlow formalizes tool-using agents into four modules (planner, executor, verifier, generator) and trains only the planner in-loop via Flow-GRPO, which broadcasts a single trajectory-level reward to every turn with token-level PPO-style updates and KL control. Reported results on ten benchmarks show average gains of +14.9% (search), +14.0% (agentic/GAIA textual split), +14.5% (math), and +4.1% (science); the research team additionally states that the 7B system surpasses GPT-4o on this suite. Implementation, tools, and quick-start scripts are MIT-licensed in the GitHub repo.


Check out the Technical Paper, GitHub Page, and Project Page.

The post Stanford Researchers Released AgentFlow: In-the-Flow Reinforcement Learning (RL) for Modular, Tool-Using AI Agents appeared first on MarkTechPost.

