MarkTechPost@AI · October 7
The LIMI Method: Training Strong Software Agents with a Small Amount of High-Quality Data

A study from Shanghai Jiao Tong University and the SII Generative AI Research Lab proposes LIMI ("Less Is More for Agency"), a supervised fine-tuning technique that turns a base model into a strong software/research agent using just 78 carefully curated samples. LIMI scores a 73.5% average on the AgencyBench benchmark, clearly beating baseline models trained on thousands or even tens of thousands of samples and showing that data quality and structure matter more than data volume. The method builds its training set from complete, multi-turn tool-use trajectories covering software-development and research workflows, which proves effective at raising agent capability.

✨ At the core of LIMI is a "less is more" principle of agency: high-quality, structured data is what matters most when training capable software agents. Fine-tuning on 78 carefully designed tool-use trajectories, each capturing a complete multi-turn workflow, markedly improved the model's performance on complex tasks, evidence that data quality drives agent capability far more than data quantity does.
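The training recipe itself is standard supervised fine-tuning; the leverage is in the 78 trajectories. As a minimal sketch of the training step, assuming the trajectories have been flattened into plain-text transcripts and substituting a small generic causal LM (the model name, file layout, and hyperparameters here are illustrative, not the paper's):

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; LIMI fine-tunes GLM-4.5 (355B) and GLM-4.5-Air (106B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assume each file holds one flattened multi-turn transcript (78 in total).
transcripts = [open(f"trajectories/{i}.txt").read() for i in range(78)]

def collate(batch):
    enc = tokenizer(batch, truncation=True, max_length=1024,
                    padding=True, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()  # next-token prediction over the transcript
    return enc

loader = DataLoader(transcripts, batch_size=1, shuffle=True, collate_fn=collate)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for _ in range(3):  # a few passes over the tiny dataset
    for batch in loader:
        loss = model(**batch).loss  # cross-entropy against the labels above
        loss.backward()
        optim.step()
        optim.zero_grad()
```

A real run would differ in the details (the paper's trajectories are far longer than 1,024 tokens, and SFT pipelines typically mask the loss on non-assistant tokens), but the objective is this ordinary one; the result hinges on what the 78 transcripts contain.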

🛠️ The study used the SII-CLI execution environment to collect real and synthetic data covering software development (e.g., "vibe coding") and research workflows (e.g., search, analysis, and experiment design). Each trajectory records the model's reasoning, its tool calls, and the environment's feedback in detail, making the training data rich and practical and grounding the agent in real-world practice.
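The paper does not publish its data schema, but a trajectory of the kind described above, interleaving reasoning, tool calls, and environment feedback across turns, could plausibly be serialized along these lines (all field names and values here are hypothetical):

```python
# Hypothetical shape of one multi-turn trajectory record; the schema is
# illustrative, not the paper's actual format.
trajectory = {
    "task": "Add retry logic to the HTTP client and verify it with tests",
    "turns": [
        {
            "reasoning": "The client has no retries; inspect it before editing.",
            "tool_call": {"name": "read_file", "args": {"path": "client.py"}},
            "environment_feedback": "<contents of client.py>",
        },
        {
            "reasoning": "Patch the request wrapper, then run the test suite.",
            "tool_call": {"name": "run_shell", "args": {"cmd": "pytest -q"}},
            "environment_feedback": "14 passed in 2.31s",
        },
    ],
    "outcome": "success",
}
```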

🚀 On the AgencyBench benchmark, LIMI scores a 73.5% average, reaching 71.7% on FTFC (first-turn functional completeness) and 74.6% on SR@3 (success rate within three rounds). More strikingly, with only 78 samples LIMI outperforms baselines trained on 10,000 samples, a 128× gain in data efficiency.
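Assuming SR@3 counts a task as solved if the agent succeeds within three rounds of interaction (the paper defines the exact AgencyBench protocol), the metric reduces to a simple count, as in this sketch:

```python
# Success rate within k rounds. `first_success` maps each task to the
# round in which the agent first succeeded, or None if it never did.
# The exact AgencyBench scoring may differ; this is an assumption.
def success_rate_at_k(first_success: dict[str, int | None], k: int = 3) -> float:
    solved = sum(1 for r in first_success.values() if r is not None and r <= k)
    return solved / len(first_success)

first_success = {"task_a": 1, "task_b": 3, "task_c": None, "task_d": 2}
print(f"SR@3 = {success_rate_at_k(first_success):.1%}")  # SR@3 = 75.0%
```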

💡 LIMI's gains are not confined to its target tasks; they generalize well. Across general-purpose evaluation suites including TAU2-bench, EvalPlus-HE/MBPP, DS-1000, and SciCode, LIMI averages about 57%, and even without tool access it scores slightly above the base model, suggesting the method confers an intrinsic capability gain.

Do curated, tool-grounded demonstrations build stronger software agents than broad piles of generic instruction data? A team of researchers from Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) proposes LIMI (“Less Is More for Agency”), a supervised fine-tuning method that turns a base model into a capable software/research agent using 78 samples. LIMI scores 73.5% average on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), beating strong baselines (GLM-4.5 45.1, Qwen3-235B-A22B 27.5, Kimi-K2 24.1, DeepSeek-V3.1 11.9) and even surpassing variants trained on 10,000 samples—with 128× less data.

Paper: https://arxiv.org/pdf/2509.17567


Key Takeaways

- Data efficiency dominates scale. LIMI reaches a 73.5% average on AgencyBench using curated trajectories, surpassing GLM-4.5 (45.1%) and showing a +53.7-point advantage over a 10k-sample SFT baseline, with 128× fewer samples (10,000 / 78 ≈ 128).
- Trajectory quality, not bulk. Training data are long-horizon, tool-grounded workflows in collaborative software development and scientific research, collected via the SII-CLI execution stack referenced by the paper.
- Across-metric gains. On AgencyBench, LIMI reports FTFC 71.7%, SR@3 74.6%, and strong RC@3, with detailed tables showing large margins over baselines; generalization suites (TAU2-bench, EvalPlus-HE/MBPP, DS-1000, SciCode) average 57.2%.
- Works across scales. Fine-tuning both GLM-4.5 (355B) and GLM-4.5-Air (106B) yields large deltas over their base models, indicating the method is robust to model size.

Our Comments

The research team fine-tunes GLM-4.5 variants on 78 curated, long-horizon, tool-grounded trajectories captured in a CLI environment spanning software-engineering and research tasks. The paper reports a 73.5% average on AgencyBench across the FTFC, RC@3, and SR@3 metrics, versus 45.1% for the baseline GLM-4.5. A comparison against a 10,000-sample AFM-CodeAgent SFT baseline shows 73.5% vs 47.8%, and tool-free evaluation indicates intrinsic gains (≈50.0% for LIMI vs 48.7% for GLM-4.5). Trajectories are multi-turn and token-dense, emphasizing planning, tool orchestration, and verification.
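To make "planning, tool orchestration, and verification" concrete, here is a toy version of the multi-turn loop such trajectories record; the `llm` callable, the JSON action format, and the tool registry are all assumptions, not the paper's or SII-CLI's actual interface:

```python
import json
import subprocess

def run_shell(cmd: str) -> str:
    """Run a command and return its combined output (sandbox this in practice)."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return out.stdout + out.stderr

TOOLS = {"run_shell": run_shell}

def agent_loop(llm, task: str, max_turns: int = 10) -> list[dict]:
    """Plan -> act -> observe loop; the returned history is the trajectory."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = llm(history)            # assumed to return a JSON action string
        history.append({"role": "assistant", "content": reply})
        action = json.loads(reply)      # e.g. {"tool": "run_shell", "args": {...}}
        if action["tool"] == "finish":  # model declares the task done and verified
            break
        feedback = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": feedback})
    return history
```

Logging `history` verbatim at every turn, reasoning, tool calls, and environment feedback alike, is what makes each of the 78 samples long and token-dense.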


Check out the Paper, GitHub page, and model card on Hugging Face.


Related tags

LIMI, Software Agents, AI Efficiency, Supervised Fine-Tuning, Data Quality, Tool Use, AGI