MarkTechPost@AI · October 7
The LIMI Method: Training Strong Software Agents with a Small Amount of High-Quality Data

A study from Shanghai Jiao Tong University and the SII Generative AI Research Lab proposes LIMI ("Less Is More for Agency"), a supervised fine-tuning technique that turns a base model into a strong software/research agent using just 78 carefully curated samples. LIMI scores a 73.5% average on the AgencyBench benchmark, clearly beating baseline models trained on thousands or even tens of thousands of samples and showing that data quality and structure matter more than data volume. The method builds its training set from complete, multi-turn tool-use trajectories covering software-development and research workflows, which proves effective at raising agent capability.

✨ At the core of LIMI is a "less is more" principle of agency: high-quality, structured data is what matters most when training capable software agents. Fine-tuning on 78 carefully designed tool-use trajectories, each capturing a complete multi-turn workflow, markedly improved the model's performance on complex tasks, evidence that data quality drives agent capability far more than data quantity does.
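The training recipe itself is standard supervised fine-tuning; the leverage is in the 78 trajectories. As a minimal sketch of the training step, assuming the trajectories have been flattened into plain-text transcripts and substituting a small generic causal LM (the model name, file layout, and hyperparameters here are illustrative, not the paper's):

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; LIMI fine-tunes GLM-4.5 (355B) and GLM-4.5-Air (106B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assume each file holds one flattened multi-turn transcript (78 in total).
transcripts = [open(f"trajectories/{i}.txt").read() for i in range(78)]

def collate(batch):
    enc = tokenizer(batch, truncation=True, max_length=1024,
                    padding=True, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()  # next-token prediction over the transcript
    return enc

loader = DataLoader(transcripts, batch_size=1, shuffle=True, collate_fn=collate)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for _ in range(3):  # a few passes over the tiny dataset
    for batch in loader:
        loss = model(**batch).loss  # cross-entropy against the labels above
        loss.backward()
        optim.step()
        optim.zero_grad()
```

A real run would differ in the details (the paper's trajectories are far longer than 1,024 tokens, and SFT pipelines typically mask the loss on non-assistant tokens), but the objective is this ordinary one; the result hinges on what the 78 transcripts contain.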

🛠️ The study used the SII-CLI execution environment to collect real and synthetic data covering software development (e.g., "vibe coding") and research workflows (e.g., search, analysis, and experiment design). Each trajectory records the model's reasoning, its tool calls, and the environment's feedback in detail, making the training data rich and practical and grounding the agent in real-world practice.
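The paper does not publish its data schema, but a trajectory of the kind described above, interleaving reasoning, tool calls, and environment feedback across turns, could plausibly be serialized along these lines (all field names and values here are hypothetical):

```python
# Hypothetical shape of one multi-turn trajectory record; the schema is
# illustrative, not the paper's actual format.
trajectory = {
    "task": "Add retry logic to the HTTP client and verify it with tests",
    "turns": [
        {
            "reasoning": "The client has no retries; inspect it before editing.",
            "tool_call": {"name": "read_file", "args": {"path": "client.py"}},
            "environment_feedback": "<contents of client.py>",
        },
        {
            "reasoning": "Patch the request wrapper, then run the test suite.",
            "tool_call": {"name": "run_shell", "args": {"cmd": "pytest -q"}},
            "environment_feedback": "14 passed in 2.31s",
        },
    ],
    "outcome": "success",
}
```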

🚀 On the AgencyBench benchmark, LIMI scores a 73.5% average, reaching 71.7% on FTFC (first-turn functional completeness) and 74.6% on SR@3 (success rate within three rounds). More strikingly, with only 78 samples LIMI outperforms baselines trained on 10,000 samples, a 128× gain in data efficiency.
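Assuming SR@3 counts a task as solved if the agent succeeds within three rounds of interaction (the paper defines the exact AgencyBench protocol), the metric reduces to a simple count, as in this sketch:

```python
# Success rate within k rounds. `first_success` maps each task to the
# round in which the agent first succeeded, or None if it never did.
# The exact AgencyBench scoring may differ; this is an assumption.
def success_rate_at_k(first_success: dict[str, int | None], k: int = 3) -> float:
    solved = sum(1 for r in first_success.values() if r is not None and r <= k)
    return solved / len(first_success)

first_success = {"task_a": 1, "task_b": 3, "task_c": None, "task_d": 2}
print(f"SR@3 = {success_rate_at_k(first_success):.1%}")  # SR@3 = 75.0%
```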

💡 LIMI's gains are not confined to its target tasks; they generalize well. Across general-purpose evaluation suites including TAU2-bench, EvalPlus-HE/MBPP, DS-1000, and SciCode, LIMI averages about 57%, and even without tool access it scores slightly above the base model, suggesting the method confers an intrinsic capability gain.

Do curated, tool-grounded demonstrations build stronger software agents than broad piles of generic instruction data? A team of researchers from Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) proposes LIMI (“Less Is More for Agency”), a supervised fine-tuning method that turns a base model into a capable software/research agent using 78 samples. LIMI scores 73.5% average on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), beating strong baselines (GLM-4.5 45.1, Qwen3-235B-A22B 27.5, Kimi-K2 24.1, DeepSeek-V3.1 11.9) and even surpassing variants trained on 10,000 samples—with 128× less data.

Paper: https://arxiv.org/pdf/2509.17567


Key Takeaways

- Data efficiency dominates scale. LIMI reaches a 73.5% average on AgencyBench using curated trajectories, surpassing GLM-4.5 (45.1%) and showing a +53.7-point advantage over a 10k-sample SFT baseline, with 128× fewer samples (10,000 / 78 ≈ 128).
- Trajectory quality, not bulk. Training data are long-horizon, tool-grounded workflows in collaborative software development and scientific research, collected via the SII-CLI execution stack referenced by the paper.
- Across-metric gains. On AgencyBench, LIMI reports FTFC 71.7%, SR@3 74.6%, and strong RC@3, with detailed tables showing large margins over baselines; generalization suites (TAU2-bench, EvalPlus-HE/MBPP, DS-1000, SciCode) average 57.2%.
- Works across scales. Fine-tuning both GLM-4.5 (355B) and GLM-4.5-Air (106B) yields large deltas over their base models, indicating the method is robust to model size.

Our Comments

The research team fine-tunes GLM-4.5 variants on 78 curated, long-horizon, tool-grounded trajectories captured in a CLI environment spanning software-engineering and research tasks. The paper reports a 73.5% average on AgencyBench across the FTFC, RC@3, and SR@3 metrics, versus 45.1% for the baseline GLM-4.5. A comparison against a 10,000-sample AFM-CodeAgent SFT baseline shows 73.5% vs 47.8%, and tool-free evaluation indicates intrinsic gains (≈50.0% for LIMI vs 48.7% for GLM-4.5). Trajectories are multi-turn and token-dense, emphasizing planning, tool orchestration, and verification.
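To make "planning, tool orchestration, and verification" concrete, here is a toy version of the multi-turn loop such trajectories record; the `llm` callable, the JSON action format, and the tool registry are all assumptions, not the paper's or SII-CLI's actual interface:

```python
import json
import subprocess

def run_shell(cmd: str) -> str:
    """Run a command and return its combined output (sandbox this in practice)."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return out.stdout + out.stderr

TOOLS = {"run_shell": run_shell}

def agent_loop(llm, task: str, max_turns: int = 10) -> list[dict]:
    """Plan -> act -> observe loop; the returned history is the trajectory."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = llm(history)            # assumed to return a JSON action string
        history.append({"role": "assistant", "content": reply})
        action = json.loads(reply)      # e.g. {"tool": "run_shell", "args": {...}}
        if action["tool"] == "finish":  # model declares the task done and verified
            break
        feedback = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": feedback})
    return history
```

Logging `history` verbatim at every turn, reasoning, tool calls, and environment feedback alike, is what makes each of the 78 samples long and token-dense.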


Check out the Paper, GitHub page, and model card on Hugging Face.


Related tags

LIMI, Software Agents, AI Efficiency, Supervised Fine-Tuning, Data Quality, Tool Use, AGI