What If Reasoning Doesn’t Need Billion-Parameter Models?
- Juan Manuel Ortiz de Zarate

In recent years, large language models (LLMs) have dominated the landscape of AI reasoning. With trillions of parameters and vast training corpora, they solve tasks that once appeared out of reach—yet they continue to struggle in certain domains, particularly structured reasoning problems such as Sudoku, mazes, and the ARC-AGI benchmarks. These puzzles embody a form of non-linguistic, algorithmic intelligence that does not yield easily to next-token prediction or chain-of-thought sampling.
A surprising contender has emerged from this landscape: Tiny Recursive Models (TRMs), introduced in 2025 [1]. These are deliberately small neural networks—just 7 million parameters—that outperform advanced LLMs such as DeepSeek R1, o3-mini, and Gemini 2.5 Pro on some of the hardest public benchmarks in reasoning. Even more striking, they do so with just 0.01% of the computational footprint of large models.
This article explores the motivations, mechanics, and implications of TRMs. We situate them in the context of the earlier Hierarchical Reasoning Model (HRM)[7], analyze why recursion and depth matter more than size, and look at how TRMs challenge common assumptions about scaling, memory, and reasoning architectures.
The underlying message is bold yet simple: when it comes to certain kinds of reasoning, small may not just be beautiful—it may be superior.

1. Why Large Models Struggle with Hard Reasoning
LLMs excel at generative tasks where patterns in language correlate with answers. But structured puzzles like Sudoku or ARC-AGI do not reward linguistic fluency. They require consistent multi-step reasoning, constraint satisfaction, and the ability to refine an internal hypothesis until it converges to a valid solution.
Large models attempt to approximate this through techniques such as:
- Chain-of-thought (CoT) prompting [2], in which the model verbalizes intermediate reasoning steps before answering
- Test-time compute (TTC) scaling [3], in which many candidate solutions are sampled and the best one is selected
While these methods improve performance, they come with limitations. CoT requires correct reasoning traces—if the model’s “thoughts” derail, so will the answer. TTC mitigates errors by repetition but becomes expensive as difficulty scales. And even with these boosts, state-of-the-art LLMs plateau far below human performance on ARC-AGI[4,5].
The key limitation is structural: LLMs generate outputs autoregressively. One wrong token and the entire reasoning chain collapses.
This fragility inspired researchers to search for alternatives—intuition-style solvers that update a full state representation iteratively instead of predicting one token at a time.
2. HRM: A Step Forward, But Not the Final Answer
In early 2025, the Hierarchical Reasoning Model (HRM) rekindled interest in recursive neural reasoning. HRM used:
- Two networks: a low-frequency and a high-frequency module
- Two latent states: zL (latent reasoning) and zH (latent solution)
- Deep supervision: improvement cycles repeated up to 16 times
- Adaptive computation time (ACT) to stop early when confident
Its performance shocked the field. On problems where LLMs barely register progress, HRM reached:
- 55% accuracy on Sudoku-Extreme
- 75% accuracy on Maze-Hard
- ~40% on ARC-AGI-1
The model was tiny—just 27 million parameters—but extraordinarily sample-efficient, often trained with as few as 1000 examples.
Yet HRM had problems:
- Its theoretical justification was shaky. It invoked fixed-point iteration and the Implicit Function Theorem to explain why it back-propagated only through the last recursion, but empirical evidence suggested it wasn’t actually reaching fixed points.
- It was more complex than necessary. Dual networks, two latent states, and specialized supervision loops made it difficult to interpret and extend.
- ACT required two forward passes per training step, doubling compute cost.
- Accuracy gains seemed to come mostly from deep supervision, not hierarchy.
These limitations motivated a radical simplification.

3. Tiny Recursive Models: A Simpler, Stronger Approach
Tiny Recursive Models (TRMs) represent a conceptual reset. Rather than incrementally refining the Hierarchical Reasoning Model (HRM), they strip it down to its functional core and rebuild it around a far simpler—and ultimately more powerful—set of principles. The central insight is that most of HRM’s empirical success does not come from hierarchy, biological inspiration, or fixed-point theory, but from something much more basic: iterative refinement of an internal solution state.
TRM abandons the idea that reasoning must be split across different hierarchical modules operating at different frequencies. It also discards the assumption that internal representations converge to a mathematical fixed point that justifies truncated backpropagation. Instead, it embraces a pragmatic and transparent view of reasoning as a process of repeated improvement, where a model is allowed to revise its own tentative answers multiple times before committing to a final output.
At its heart, TRM is built around just three ideas.
First, there is only a single, tiny neural network. This network is deliberately small—typically two Transformer layers—yet it is reused many times through recursion. By reapplying the same parameters across multiple reasoning steps, TRM achieves large effective depth without increasing model size. This parameter sharing is not a limitation but a strength: it forces the model to learn general update rules for reasoning rather than memorizing task-specific shortcuts.
Second, TRM maintains two latent states, each with a clear and intuitive role. The first state, denoted y, represents the current candidate solution. Importantly, y is not an abstract hidden vector but an embedded answer: when decoded, it corresponds directly to a concrete output such as a completed Sudoku grid or an ARC-AGI image. The second state, z, is a latent reasoning representation. It cannot be decoded into a valid answer on its own, but it captures the intermediate structure of the reasoning process—the constraints, partial patterns, and implicit logic that explain why a particular solution might be correct or incorrect.
This separation turns out to be crucial. If the model were to carry only y, it would lack a memory of how it arrived at that solution and would struggle to improve it. If it carried only z, it would be forced to encode the entire solution implicitly, blurring the line between reasoning and answering. By keeping y and z distinct, TRM mirrors a very natural cognitive process: holding onto a tentative answer while separately tracking the reasoning that led there.

Third, TRM relies on recursive refinement. Each reasoning cycle consists of two phases. In the first phase, the model repeatedly updates z, refining its internal reasoning state while conditioning on the input and the current solution y. In the second phase, the model uses the updated z to revise y, producing a new candidate solution. This updated solution is then fed back into the next recursion step. Over time, errors can be corrected, inconsistencies resolved, and partial structures completed. Each recursion does not aim to solve the problem from scratch, but to make the solution slightly better than before.
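To make these three ideas concrete, here is a minimal PyTorch-style sketch of a TRM-like core. It is an illustration under stated assumptions, not the authors' implementation: the class name, the summation-based conditioning on (x, y, z), and the hyperparameter defaults (`dim`, `n`, two encoder layers) are hypothetical choices made for readability.

```python
import torch
import torch.nn as nn

class TinyRecursiveModel(nn.Module):
    """Illustrative sketch: one tiny network reused for every update."""

    def __init__(self, dim: int = 128, vocab_size: int = 10, n_heads: int = 4):
        super().__init__()
        # A deliberately small trunk: a 2-layer Transformer encoder shared by all updates.
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=2)
        # Decodes the embedded answer y into concrete symbols (e.g. Sudoku digits).
        self.out_head = nn.Linear(dim, vocab_size)

    def recursion(self, x, y, z, n: int = 6):
        """One reasoning cycle: refine the latent reasoning z n times, then revise y once."""
        for _ in range(n):
            # Update the reasoning state, conditioned on the input and the current answer.
            z = self.net(x + y + z)
        # Use the refined reasoning state to propose a better candidate solution.
        y = self.net(y + z)
        return y, z
```

At inference time, the same `recursion` method would simply be applied several times in a row, with each pass handing its improved (y, z) pair to the next.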
What makes TRM especially elegant is not only what it introduces, but what it intentionally removes.
There are no separate low- and high-frequency modules, eliminating the need to coordinate multiple networks with different update schedules. There is no reliance on fixed-point approximations or the Implicit Function Theorem, since TRM simply backpropagates through the full recursive computation that actually occurs. There is no need for multiple forward passes to implement adaptive computation time, because halting can be learned directly from whether the current solution is correct. And finally, there is no dependence on elaborate biological analogies to justify architectural decisions—the model is motivated entirely by computational clarity and empirical performance.
The result is a system that is smaller, easier to train, easier to reason about, and—most importantly—significantly better at generalizing. Despite having fewer parameters than HRM and far fewer than modern LLMs, TRM consistently achieves higher accuracy on difficult reasoning benchmarks. Its strength does not come from scale, but from the disciplined reuse of a simple computation applied recursively.
In this sense, TRM embodies a provocative idea: reasoning may not require large models, but rather the ability to revise one’s own thoughts.
4. How TRM Works: Latent Updates and Deep Supervision
4.1 The Core Mechanism
Each TRM reasoning cycle (called a recursion) consists of:
- n updates to z: the model refines its internal reasoning chain.
- One update to y: using the improved reasoning state, it proposes a better solution.
Unlike HRM, TRM back-propagates through all updates in the final recursion step, not just the last two. Earlier recursions (T−1 of them) run without gradients to avoid excessive memory cost.
This pattern—multiple free recursions plus one gradient recursion—makes TRM:
- deep in effective reasoning
- cheap in training memory
- capable of error correction across iterations (see the training-loop sketch below)
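Below is a hedged sketch of how deep supervision and truncated backpropagation could be wired together, reusing the hypothetical `recursion` and `out_head` names from the earlier sketch. The loss, optimizer handling, and tensor shapes are illustrative assumptions, not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def deep_supervision_step(model, x, y, z, target, optimizer,
                          n: int = 6, T: int = 3, max_steps: int = 16):
    """Sketch of one training pass: free recursions without gradients,
    one back-propagated recursion, repeated over several supervision steps."""
    for _ in range(max_steps):
        # T-1 "free" recursions: refine (y, z) cheaply, with no gradient tracking.
        with torch.no_grad():
            for _ in range(T - 1):
                y, z = model.recursion(x, y, z, n)

        # The final recursion is back-propagated through all of its n+1 updates.
        y, z = model.recursion(x, y, z, n)
        loss = F.cross_entropy(model.out_head(y).flatten(0, 1), target.flatten())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Carry the improved (detached) solution and reasoning state into the
        # next supervision step, so later steps refine earlier work.
        y, z = y.detach(), z.detach()
    return y, z
```

For clarity, the learned halting mechanism discussed in Section 7 is omitted from this sketch.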
4.2 Why Two Latent States?
The paper’s reinterpretation is remarkably intuitive:
- y = the current answer
- z = the chain-of-thought-like reasoning trace
Both must be remembered across supervision steps.
If z were removed, the model would lose reasoning continuity. If y were removed, z would be forced to encode the entire solution.
Two states turn out to be the sweet spot: extra states hurt performance, and a single state underfits.
5. Less Is More: Why Tinier Networks Generalize Better
One of the most counterintuitive findings of the paper is that smaller networks outperform larger ones.
A 2-layer Transformer with recursion beats:
- 4-layer versions
- architectures with self-attention replaced by MoEs
- much deeper models with similar compute budgets
This suggests a fascinating hypothesis:
Recursion acts as an implicit depth multiplier.
It gives a tiny network the representational power of a deep one, but without overfitting.
For example, HRM effectively emulates a network of 384 layers, yet TRM still wins.
When data is extremely scarce—e.g., 1000 training Sudokus—the regularization effect of being small becomes an advantage.
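As a back-of-the-envelope illustration of this depth-multiplier effect, consider how often the two physical layers get applied across a full training pass. The hyperparameter values below are hypothetical, chosen only to show the arithmetic, not the paper's exact settings.

```python
# Illustrative only: hypothetical hyperparameters, not the paper's exact settings.
layers = 2               # physical depth of the tiny network
z_updates = 6            # n: latent reasoning refinements per cycle
recursions = 3           # T: reasoning cycles per supervision step
supervision_steps = 16   # deep-supervision improvement rounds

# Each cycle applies the network (z_updates + 1) times: n times for z, once for y.
effective_layer_applications = layers * (z_updates + 1) * recursions * supervision_steps
print(effective_layer_applications)  # 2 * 7 * 3 * 16 = 672
```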
6. Performance: Tiny Networks Beating Giants
Across four benchmarks, TRM sets new state-of-the-art results for small networks.
6.1 Sudoku-Extreme (423K test samples)
- HRM: 55%
- TRM (MLP version): 87.4%
This is the most dramatic leap, shown clearly in Tables 1 and 4 of the paper [1].
6.2 Maze-Hard
Here the Transformer-attention version of TRM performs better than the MLP variant, thanks to the larger grid.
- HRM: 74.5%
- TRM-Att: 85.3%
6.3 ARC-AGI-1 and ARC-AGI-2
These are the hardest reasoning benchmarks available. The figure below shows TRM's results compared against larger models.
![Test accuracy on ARC-AGI and ARC-AGI-2[6] Benchmarks](https://static.wixstatic.com/media/ce3ed3_17b6071692a044b3891ccd80b055db6e~mv2.png/v1/fill/w_606,h_447,al_c,q_85,enc_avif,quality_auto/ce3ed3_17b6071692a044b3891ccd80b055db6e~mv2.png)
TRM achieves competitive small-model performance—exceptional considering its training data is tiny, and no external knowledge is used.
7. Simplification and Ablation Insights
A defining feature of Tiny Recursive Models is how much performance they gain by removing complexity rather than adding it. This is most clearly illustrated in how TRM simplifies adaptive computation time (ACT). In HRM, ACT relies on a Q-learning–style formulation with both halting and continuation losses, requiring an additional forward pass at every training step. TRM eliminates this extra machinery and learns only a single halting probability through a simple binary objective: whether the current solution is already correct. This preserves accuracy while roughly halving the computational cost, in line with TRM’s minimalist design philosophy.
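A minimal sketch of this binary halting objective might look as follows. The head name, tensor shapes, and the 0.5 threshold are illustrative assumptions, not the paper's exact interface.

```python
import torch
import torch.nn.functional as F

def halting_loss(halt_logit, logits, target):
    """Binary objective: the halting head predicts whether the current
    decoded answer already matches the target exactly."""
    with torch.no_grad():
        pred = logits.argmax(dim=-1)                       # (batch, cells)
        is_correct = (pred == target).all(dim=-1).float()  # (batch,)
    return F.binary_cross_entropy_with_logits(halt_logit, is_correct)

# At inference, the same head decides when to stop refining:
# halt if torch.sigmoid(halt_logit) > 0.5, otherwise run another supervision step.
```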
The ablation studies reinforce the same message. A single network consistently outperforms dual-network designs, two-layer models generalize better than deeper ones, and backpropagating through the full recursive computation yields stronger results than fixed-point–based approximations. Training stability improves substantially with exponential moving average (EMA), and architectural choices such as self-attention only help when the input structure truly requires it.
Across all experiments, the pattern is striking: every simplification improves or preserves generalization, while every attempt to increase capacity—through more layers or richer modules—hurts performance. Unlike the scaling behavior of large language models, TRM operates in a regime where smaller, simpler architectures benefit most from deep recursive reasoning.
8. Why Recursion Helps: An Open Scientific Question
Perhaps the most intriguing part of the paper is what it does not explain: why recursion is so effective.
A TRM with 2 layers outperforms a deep Transformer with many more layers, even when compute is held constant.
One hypothesis emphasized by the authors:
Recursion enables continuous error correction.
Unlike autoregressive LLMs that commit to their early mistakes, TRMs refine a full-state representation repeatedly. They can:
- fix their own wrong guesses
- incorporate new insights
- gradually converge to consistency
In other words, recursion gives the model something like iterative thinking.
This aligns with cognitive science: humans often solve puzzles by repeatedly adjusting an internal hypothesis rather than predicting a sequence of tokens.
But this remains speculative. As the paper notes, TRM has no formal theory yet. It works astonishingly well, but we do not fully know why.
9. Broader Implications: A New Paradigm in Reasoning?
LLMs dominate general AI tasks, but TRM shows that domain-specific reasoning models can outperform them while being orders of magnitude more efficient.
This raises big questions:
- Are LLMs fundamentally misaligned with algorithmic reasoning? Their token-prediction architecture may inherently limit their ability to perform stable multi-step logical refinements.
- Will future AI systems combine generative language with recursive solvers? LLMs could propose hypotheses; TRMs could validate or refine them.
- Can recursion be incorporated directly into the next generation of Transformers? Deep equilibrium models attempted this, but TRM's success suggests lighter-weight variants may be possible.
- Is reasoning a function of depth rather than width? If so, architectural recursion may eventually yield more powerful reasoning than parameter scaling alone.
This echoes a theme increasingly discussed in AI research: thinking is not predicting the next word—it’s revising a belief state.
TRM is an early but compelling embodiment of that idea.
Conclusion
The Tiny Recursive Model is an elegant demonstration that bigger is not always better. On problems where reasoning—not knowledge—is the bottleneck, TRMs show that small networks with recursive internal updates and a minimalistic design can outperform large-scale LLMs trained on astronomical datasets.
Their success rests on:
- full-state iterative refinement
- deep supervision
- compact architectures that avoid overfitting
The contrast between LLMs and TRMs highlights a crucial insight: scaling laws may govern language modeling, but reasoning may follow a different law entirely.
As research continues, TRMs could inspire new architectures that blend recursion, depth, and symbolic-like manipulation—all within tiny computational footprints. They hint at a future where the most sophisticated reasoning systems are not necessarily the largest ones.
And in doing so, they reopen a timeless scientific question: What is the minimal machinery needed for robust intelligence?
For now, the answer appears surprisingly small.
References
[1] Jolicoeur-Martineau, A. (2025). Less is more: Recursive reasoning with tiny networks. arXiv preprint arXiv:2510.04871.
[2] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824-24837.
[3] Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
[4] Chollet, F. (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547.
[5] Measuring Intelligence: Key Benchmarks and Metrics for LLMs, Transcendent AI
[6] Chollet, F., Knoop, M., Kamradt, G., Landers, B., & Pinkard, H. (2025). ARC-AGI-2: A new challenge for frontier AI reasoning systems. arXiv preprint arXiv:2505.11831.
[7] Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., ... & Yadkori, Y. A. (2025). Hierarchical Reasoning Model. arXiv preprint arXiv:2506.21734.



