MarkTechPost@AI · October 10, 02:12
Tiny Recursive Model Outperforms Large Language Models on ARC-AGI Reasoning Tasks

Samsung SAIT has released the Tiny Recursive Model (TRM), a new two-layer recursive reasoning model with roughly 7M parameters. On the ARC-AGI benchmarks it reaches 44.6–45% accuracy on ARC-AGI-1 and 7.8–8% on ARC-AGI-2, outperforming many far larger language models such as DeepSeek-R1, o3-mini-high, and Gemini 2.5 Pro. Rather than generating a solution token by token, TRM refines it by iteratively updating a latent "scratchpad", and it concentrates compute on test-time reasoning. The model also makes notable gains on puzzle benchmarks such as Sudoku-Extreme and Maze-Hard, demonstrating that a compact architecture with recursive refinement can be highly effective on these specific tasks.

💡 **Novel architecture and training recipe**: TRM drops the conventional multi-module hierarchy in favor of a lean two-layer neural network that runs a "think, then act" loop by iteratively updating a latent "scratchpad" (z) and the current solution embedding (y). This recursion, combined with deep supervision and full backpropagation, lets the model be trained from scratch and achieve strong reasoning with very few parameters.

🚀 **Performance beyond much larger models**: With only about 7M parameters, TRM scores 44.6–45% on ARC-AGI-1 and 7.8–8% on ARC-AGI-2, clearly ahead of far larger models such as DeepSeek-R1 (671B), o3-mini-high, and Gemini 2.5 Pro. This suggests that efficient architecture and training strategy can matter more than sheer parameter count on these reasoning tasks.

🧩 **Multi-task generality and efficiency**: Beyond ARC-AGI, TRM also surpasses the larger prior model HRM on puzzle-solving benchmarks such as Sudoku-Extreme (87.4%) and Maze-Hard (85.3%), showing that the architecture handles different kinds of symbolic-geometric reasoning problems flexibly and efficiently, especially under tight resource constraints.

🧠 **Insight into the reasoning mechanism**: TRM's key ingredient is a "decide, then revise" style of reasoning: internal iterative consistency checks refine an initial solution, reducing the bias that autoregressive decoding introduces on structured outputs. In addition, compute is spent on test-time recursive refinement rather than on parameter count, which improves generalization.

Can an iterative draft–revise solver that repeatedly updates a latent scratchpad outperform far larger autoregressive LLMs on ARC-AGI? Samsung SAIT (Montreal) has released Tiny Recursive Model (TRM)—a two-layer, ~7M-parameter recursive reasoner that reports 44.6–45% test accuracy on ARC-AGI-1 and 7.8–8% on ARC-AGI-2, surpassing results reported for substantially larger language models such as DeepSeek-R1, o3-mini-high, and Gemini 2.5 Pro on the same public evaluations. TRM also improves on the puzzle benchmarks Sudoku-Extreme (87.4%) and Maze-Hard (85.3%) relative to the prior Hierarchical Reasoning Model (HRM, 27M parameters), while using far fewer parameters and a simpler training recipe.

What exactly is new?

TRM removes HRM’s two-module hierarchy and fixed-point gradient approximation in favor of a single tiny network that recurses on a latent “scratchpad” (z) and a current solution embedding (y):

(Figure from the paper: https://arxiv.org/pdf/2510.04871v1)
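To make that loop concrete, here is a minimal sketch of a draft–revise recursion of this kind. It is illustrative only: the module layout, dimensions, and exact update order are assumptions for the sketch, not the released TRM implementation.

```python
import torch
import torch.nn as nn

class TinyRecursiveSolver(nn.Module):
    """Illustrative draft-revise recursion with one small shared network.

    The same network both refines the latent scratchpad z ("think") and
    revises the solution embedding y ("act"). Shapes and update order are
    assumptions, not the paper's exact formulation.
    """

    def __init__(self, d_model=512, n_latent_updates=6, n_cycles=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.n_latent_updates = n_latent_updates  # "n" inner scratchpad updates
        self.n_cycles = n_cycles                  # "T" outer draft-revise cycles

    def forward(self, x_emb, y, z):
        # x_emb: embedded puzzle input; y: current solution embedding;
        # z: latent scratchpad. All shaped (batch, seq, d_model).
        for _ in range(self.n_cycles):
            for _ in range(self.n_latent_updates):
                # "Think": refresh the scratchpad from input, solution, and itself.
                z = z + self.net(torch.cat([x_emb, y, z], dim=-1))
            # "Act": revise the full candidate solution once per cycle.
            y = y + self.net(torch.cat([x_emb, y, z], dim=-1))
        return y, z
```

The point of the sketch is the control flow: the scratchpad is updated several times before the solution is revised once, and the whole cycle repeats, so reasoning depth comes from recursion rather than from stacking layers.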

Architecturally, the best-performing setup for ARC/Maze retains self-attention; for Sudoku’s small fixed grids, the research team swaps self-attention for an MLP-Mixer-style token mixer. A small EMA (exponential moving average) over the weights stabilizes training on limited data. Network depth is effectively created by recursion (e.g., T = 3, n = 6) rather than by stacking layers; in ablations, two layers generalize better than deeper variants at the same effective compute.
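The EMA mentioned above is a standard weight-averaging trick; a sketch of how it is typically wired into a training loop is below. The decay value and update cadence here are assumptions, not numbers from the paper.

```python
import copy
import torch

def init_ema(model):
    # Keep a frozen copy of the model whose weights track a running average.
    ema_model = copy.deepcopy(model)
    for p in ema_model.parameters():
        p.requires_grad_(False)
    return ema_model

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    # ema_w <- decay * ema_w + (1 - decay) * w, applied after each optimizer step.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```

At evaluation time one would typically score with the EMA copy, which tends to be smoother than the raw weights when the training set is small and heavily augmented.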

Understanding the Results

(Results figures from the paper: https://arxiv.org/pdf/2510.04871v1)

These are direct-prediction models trained from scratch on small, heavily augmented datasets—not few-shot prompting. ARC remains the canonical target; broader leaderboard context and rules (e.g., ARC-AGI-2 grand-prize threshold at 85% private set) are tracked by the ARC Prize Foundation.

Why can a 7M model beat much larger LLMs on these tasks?

- **Decision-then-revision instead of token-by-token:** TRM drafts a full candidate solution, then improves it via latent iterative consistency checks against the input, reducing the exposure bias that autoregressive decoding incurs on structured outputs.
- **Compute spent on test-time reasoning, not parameter count:** Effective depth arises from recursion (emulated depth ≈ T·(n+1)·layers), which the researchers show yields better generalization at constant compute than adding layers (see the worked example after this list).
- **Tighter inductive bias for grid reasoning:** For small fixed grids (e.g., Sudoku), attention-free mixing reduces overcapacity and improves the bias/variance trade-off; self-attention is kept for the larger 30×30 grids.
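As a worked example of that emulated-depth formula, plugging in the recursion schedule quoted earlier (T = 3, n = 6, two physical layers) gives:

```python
# Emulated depth ≈ T * (n + 1) * layers for the schedule cited above.
T, n, layers = 3, 6, 2
print(T * (n + 1) * layers)  # 42 layer applications per forward pass
```

So a two-layer network applied recursively behaves, compute-wise, like a forty-odd-layer stack, without the extra parameters.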


Editorial Comments

This research demonstrates a ~7M-parameter, two-layer recursive solver that unrolls up to 16 draft-revise cycles with ~6 latent updates per cycle and reports ~45% on ARC-AGI-1 and ~8% (two-try) on ARC-AGI-2. The research team released code on GitHub. ARC-AGI remains unsolved at scale (target 85% on ARC-AGI-2), so the contribution is an architectural efficiency result rather than a general reasoning breakthrough.


Check out the Technical Paper and GitHub Page.

