MarkTechPost@AI · October 19, 14:55
W4S: Using a Weak Model to Design Code Workflows that Call a Strong Executor

Researchers from Stanford and other institutions have proposed a new reinforcement learning framework called Weak-for-Strong Harnessing (W4S). Its core idea is to train a small meta-agent to design and optimize code workflows that call a more powerful executor model. W4S treats workflow design as a multi-turn Markov decision process and trains the meta-agent with a method called RLAO (Reinforcement Learning for Agentic Workflow Optimization). The research team reports that, across 11 benchmarks, a 7B-parameter meta-agent trained for roughly 1 GPU hour delivers consistent performance gains, with no fine-tuning of the strong model.

💡 The core innovation of W4S is that, instead of fine-tuning the powerful executor model directly, it trains a small, low-cost "weak" meta-agent to learn how to intelligently call and orchestrate the stronger model to complete tasks. This decouples model capability from workflow-design capability, improving efficiency and controllability.

⚙️ W4S formalizes workflow design as a multi-turn Markov decision process (MDP). In each turn, the meta-agent receives the task instructions, the current workflow code, and feedback from previous executions as the state, and outputs an action consisting of an analysis of the workflow plus new Python code. After the environment executes the code, it returns accuracy and failure cases, which form the state for the next turn.

🚀 Using RLAO (Reinforcement Learning for Agentic Workflow Optimization), W4S trains the meta-agent to keep improving the workflow over multiple turns of interaction. RLAO is an offline reinforcement learning procedure that optimizes the policy with reward-weighted regression, placing particular emphasis on steady progress in validation accuracy, which improves sample efficiency.

📊 W4S shows strong performance across multiple benchmarks. For example, on HumanEval with GPT-4o-mini as the executor, W4S reaches 95.4% Pass@1 accuracy at a total cost of only about $0.9. Across 11 benchmarks, W4S improves over the strongest automated baseline by an average of 2.9% to 24.6%, demonstrating its effectiveness on general tasks.

🧠 Another important advantage of W4S is that the learned workflow-design ability generalizes. In a math transfer setting, a meta-agent trained on GSM Plus and MGSM outperforms automated baselines on related tasks such as GSM8K and GSM Hard without any additional training, indicating that the learned "orchestration" skill transfers to similar tasks.

Researchers from Stanford, EPFL, and UNC introduce Weak-for-Strong Harnessing (W4S), a new reinforcement learning (RL) framework that trains a small meta-agent to design and refine code workflows that call a stronger executor model. The meta-agent does not fine-tune the strong model; it learns to orchestrate it. W4S formalizes workflow design as a multi-turn Markov decision process and trains the meta-agent with a method called Reinforcement Learning for Agentic Workflow Optimization (RLAO). The research team reports consistent gains across 11 benchmarks with a 7B meta-agent trained for about 1 GPU hour.
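To make "code workflows that call a stronger executor" concrete, here is a hypothetical example of the kind of Python program the meta-agent might emit. The `call_executor` helper and the draft-critique-revise structure are illustrative assumptions, not the paper's released code or a workflow it reports.

```python
def call_executor(prompt: str) -> str:
    """Placeholder for a call to the strong executor model (e.g., GPT-4o-mini).
    In practice this would wrap whatever API the executor is served through."""
    raise NotImplementedError

def workflow(task: str) -> str:
    # Step 1: ask the executor to draft a solution.
    draft = call_executor(f"Solve the following task:\n{task}")
    # Step 2: ask the executor to critique its own draft.
    critique = call_executor(
        f"Task:\n{task}\n\nDraft answer:\n{draft}\n\nList any mistakes in the draft."
    )
    # Step 3: ask for a corrected final answer conditioned on the critique.
    return call_executor(
        f"Task:\n{task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Give a corrected final answer."
    )
```

The meta-agent's job is to write and iteratively rewrite programs of this shape; the executor's weights are never touched.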

https://arxiv.org/pdf/2504.04785

W4S operates in turns. The state contains the task instructions, the current workflow program, and feedback from prior executions. An action has two components: an analysis of what to change, and new Python workflow code that implements those changes. The environment executes the code on validation items, returns accuracy and failure cases, and provides a new state for the next turn. The meta-agent can run a quick self-check on one sample; if errors arise, it attempts up to 3 repairs, and if errors persist, the action is skipped. This loop gives learning signal without touching the weights of the strong executor.
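The turn structure above can be sketched as a small driver loop. This is a minimal sketch under stated assumptions: `meta_agent.propose`, `meta_agent.repair`, `run_self_check`, and `execute_on_validation` are hypothetical helper names standing in for the paper's actual components.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    task_instructions: str
    workflow_code: str                       # current Python workflow program
    feedback: dict = field(default_factory=dict)  # accuracy + failure cases from the last run

def w4s_loop(meta_agent, executor, task, validation_set, turns=10, max_repairs=3):
    state = State(task_instructions=task, workflow_code="")
    best = {"accuracy": 0.0, "code": None}
    for _ in range(turns):
        # Action = (analysis of what to change, new workflow code).
        analysis, code = meta_agent.propose(state)

        # Quick self-check on one sample, with up to `max_repairs` repair attempts.
        error = run_self_check(code, executor, validation_set[0])
        repairs = 0
        while error is not None and repairs < max_repairs:
            code = meta_agent.repair(code, error)
            error = run_self_check(code, executor, validation_set[0])
            repairs += 1
        if error is not None:
            continue  # repairs exhausted, skip this action

        # Environment step: run the workflow on validation items, collect feedback.
        accuracy, failures = execute_on_validation(code, executor, validation_set)
        if accuracy > best["accuracy"]:
            best = {"accuracy": accuracy, "code": code}

        # Feedback becomes part of the next state; the strong executor's weights never change.
        state = State(task, code, {"accuracy": accuracy, "failures": failures, "analysis": analysis})
    return best
```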

https://arxiv.org/pdf/2504.04785


W4S runs as an iterative loop

Reinforcement Learning for Agentic Workflow Optimization (RLAO)

RLAO is an offline reinforcement learning procedure over multi-turn trajectories. At each iteration, the system samples multiple candidate actions, keeps the best-performing action to advance the state, and stores the others for training. The policy is optimized with reward-weighted regression. The reward is sparse and compares current validation accuracy to history: a higher weight is given when the new result beats the previous best, and a smaller weight when it only beats the last iteration. This objective favors steady progress while controlling exploration cost.
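A minimal sketch of the sparse reward and the reward-weighted regression objective described above. The weight values `w_best` and `w_last` and the `policy.log_prob` interface are illustrative assumptions, not the paper's exact coefficients or API.

```python
def rlao_reward(accuracy, best_so_far, last_accuracy, w_best=1.0, w_last=0.5):
    """Sparse reward: a higher weight when the new workflow beats the best validation
    accuracy seen so far, a smaller weight when it only beats the last iteration."""
    if accuracy > best_so_far:
        return w_best
    if accuracy > last_accuracy:
        return w_last
    return 0.0

def reward_weighted_regression_loss(policy, offline_trajectories):
    """Reward-weighted regression: minimize the negative reward-weighted log-likelihood
    of stored (state, action) pairs, so higher-reward actions are imitated more strongly."""
    total, n = 0.0, 0
    for state, action, reward in offline_trajectories:
        total = total - reward * policy.log_prob(action, state)  # -r * log p_theta(a | s)
        n += 1
    return total / max(n, 1)
```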

https://arxiv.org/pdf/2504.04785

Understanding the Results

On HumanEval with GPT-4o-mini as the executor, W4S achieves 95.4% Pass@1, with about 33 minutes of workflow optimization, zero meta-agent API cost, an optimization execution cost of about $0.4, and about 2.7 minutes to execute the test set at about $0.5, for a total of about $0.9. Under the same executor, AFlow and ADAS trail this number. The reported average gains over the strongest automated baseline range from 2.9% to 24.6% across the 11 benchmarks.

On math transfer, the meta-agent is trained on GSM Plus and MGSM with GPT-3.5-Turbo as the executor, then evaluated on GSM8K, GSM Hard, and SVAMP. The paper reports 86.5 on GSM8K and 61.8 on GSM Hard, both above the automated baselines. This indicates that the learned orchestration transfers to related tasks without retraining the executor.

Across seen tasks with GPT-4o-mini as the executor, W4S surpasses training-free automated methods that do not learn a planner. The study also runs ablations in which the meta-agent is trained with supervised fine-tuning rather than RLAO; the RLAO-trained agent yields better accuracy under the same compute budget. The research team also includes a GRPO baseline on a 7B weak model for GSM Hard, and W4S outperforms it under limited compute.

Iteration budgets matter. The research team sets W4S to about 10 optimization turns in the main tables, while AFlow runs about 20 turns and ADAS runs about 30 turns. Despite fewer turns, W4S achieves higher accuracy. This suggests that learned planning over code, combined with validation feedback, makes the search more sample efficient.

https://arxiv.org/pdf/2504.04785


Editorial Comments

W4S targets orchestration, not model weights, training a 7B meta-agent to program workflows that call stronger executors. It formalizes workflow design as a multi-turn MDP and optimizes the planner with RLAO, using offline trajectories and reward-weighted regression. Reported results show 95.4% Pass@1 on HumanEval with GPT-4o-mini, average gains of 2.9% to 24.6% across 11 benchmarks, and about 1 GPU hour of training for the meta-agent. The framing compares cleanly with ADAS and AFlow, which search over agent designs or code graphs, whereas W4S fixes the executor and learns the planner.


Check out the Technical Paper and GitHub Repo.

