
Do They Really Think?

In recent years, the development of frontier language models has advanced rapidly, culminating in a new class of systems known as Large Reasoning Models (LRMs). These models—such as OpenAI’s o1/o3, Claude 3.7 Sonnet Thinking, DeepSeek-R1, and Gemini Thinking—promise more than mere fluency; they claim to "think". Through mechanisms like chain-of-thought (CoT) reasoning [2] and self-reflection, LRMs aim to mirror cognitive processes, improving performance on complex tasks. But how deeply do these models truly reason? And what happens when problem complexity increases beyond familiar thresholds?


In their paper The Illusion of Thinking [1], Shojaee et al. challenge prevailing assumptions about the reasoning capabilities of LRMs by examining their behavior in a controlled experimental environment. Using a series of compositional puzzles whose complexity can be precisely tuned, the authors explore how these models solve problems, how their internal reasoning processes evolve, and where they ultimately break down.

This article unpacks the paper’s methodology, findings, and implications for the future of AI reasoning, highlighting what current LRMs can and cannot do when faced with increasing cognitive demands.


Rethinking Evaluation: Beyond Final Answers


The standard method for evaluating reasoning in LLMs focuses almost exclusively on final answer accuracy. Benchmarks (https://www.transcendent-ai.com/post/measuring-intelligence-key-benchmarks-and-metrics-for-llms6) like MATH500 or AIME24 gauge how well a model solves math problems, but they are riddled with issues—data contamination, fixed formats, and limited interpretability of intermediate reasoning steps.


Shojaee et al. propose a radical shift in evaluation: instead of relying on static datasets, they design controllable puzzle environments. These puzzles—Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World—allow for fine-grained manipulation of complexity while maintaining constant logical structure. Unlike math problems mined from the web, these puzzles avoid contamination and require explicit algorithmic reasoning, making them ideal testbeds for probing true reasoning capabilities.

Overview of the four puzzle environments used for controlled reasoning evaluation. Each column shows a puzzle's progression from initial state (top), through an intermediate configuration (middle), to the target goal state (bottom). The environments—Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World—allow systematic control of problem complexity while maintaining logical consistency, enabling fine-grained reasoning analysis.

Methodology: Matching Models, Varying Complexity


To rigorously examine the reasoning abilities of Large Reasoning Models (LRMs), Shojaee et al. adopt a novel and meticulous experimental design that controls for confounding factors often present in traditional benchmark evaluations. Their methodology centers on two key pillars: model matching and problem complexity scaling.


Controlling for Compute: Matching Reasoning and Non-Reasoning Models


One of the main challenges in evaluating reasoning models is disentangling their architectural advantages from mere computational advantages. A model that spends more inference tokens may outperform another simply because it has more opportunity to think—not necessarily because it reasons better. To address this, the authors compare matched model pairs: a reasoning-augmented LRM and its non-reasoning counterpart with the same model backbone and inference compute.


For example:


  • Claude 3.7 Sonnet (Thinking) vs Claude 3.7 Sonnet (Standard)

  • DeepSeek-R1 vs DeepSeek-V3


These pairs are carefully chosen because they differ only in their reasoning strategies—typically, the thinking model uses long CoT generation, self-reflection, or planning tokens, while the standard model generates a direct answer without intermediate reasoning traces. Crucially, both models are allowed the same inference token budget during evaluation, ensuring that any performance differences arise from the quality of the reasoning process rather than raw compute.


In doing so, the authors move beyond superficial comparisons and instead focus on how models think, not just how much they think.
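
As a rough illustration of this matched-pair setup, the sketch below queries a thinking variant and its standard counterpart on the same puzzle prompt under an identical token budget. The model identifiers, the `query_model` wrapper, and the budget value are placeholders for whatever inference harness is actually used, not the paper's code.

```python
# Sketch of a matched-pair evaluation: same prompt, same token budget,
# only the reasoning strategy differs. `query_model` is a hypothetical
# wrapper around an inference API, not part of the paper's tooling.

TOKEN_BUDGET = 64_000  # identical inference budget for both variants (illustrative value)

MATCHED_PAIRS = [
    ("claude-3-7-sonnet-thinking", "claude-3-7-sonnet"),
    ("deepseek-r1", "deepseek-v3"),
]

def query_model(model_name: str, prompt: str, max_tokens: int) -> str:
    """Placeholder for an API call; returns the raw model output."""
    raise NotImplementedError

def evaluate_pair(thinking_model: str, standard_model: str, prompt: str) -> dict:
    """Run both variants of a matched pair under the same token budget."""
    return {
        "thinking": query_model(thinking_model, prompt, max_tokens=TOKEN_BUDGET),
        "standard": query_model(standard_model, prompt, max_tokens=TOKEN_BUDGET),
    }
```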


Beyond Benchmarks: Puzzle-Based Evaluation Environments


Traditional reasoning benchmarks—like MATH500, GSM8K, or AIME—have become the gold standard for testing LLMs, but they have limitations:

  • They often contain memorized examples or structural templates that may be overrepresented in training data.

  • They rarely allow for controlled variation in difficulty across instances.

  • They do not provide access to the step-by-step reasoning path of the model for fine-grained analysis.

To overcome these limitations, Shojaee et al. design four controlled puzzle environments that support precise manipulation of complexity while preserving consistent logic:


  1. Tower of Hanoi – A recursive disk transfer problem with exponential complexity scaling.

  2. Checker Jumping – A spatial reasoning puzzle involving sequential swaps under movement constraints.

  3. River Crossing – A constraint satisfaction planning task inspired by classical logic puzzles.

  4. Blocks World – A block stacking and rearrangement problem used extensively in planning literature.

Each environment allows the authors to scale the compositional depth of the task by varying an integer parameter N: the number of disks, checkers, people, or blocks. This creates a continuous axis of complexity that reveals how model performance changes as reasoning demands increase. Importantly, these puzzles are designed to avoid contamination from public training corpora and are evaluated using deterministic simulators that rigorously check both the final answer and each intermediate step.
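
To make the role of the deterministic simulator concrete, here is a minimal Tower of Hanoi checker, written as an illustrative reimplementation of the idea rather than the authors' code: it scales with the disk count N and rejects any move that lifts from an empty peg or places a larger disk on a smaller one. Since the optimal solution for N disks needs 2^N − 1 moves, this single parameter already gives exponential growth in required reasoning depth.

```python
# Minimal Tower of Hanoi simulator sketch: replays a proposed move sequence
# step by step and checks every intermediate state, in the spirit of the
# paper's deterministic checkers. Illustrative reimplementation only.

def simulate_hanoi(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Return True iff `moves` legally transfers all disks from peg 0 to peg 2."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, smallest on top
    for src, dst in moves:
        if not pegs[src]:
            return False                          # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                          # illegal: larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # goal: full tower rebuilt on peg 2
```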


Multi-Faceted Evaluation Strategy


To probe reasoning capabilities, the authors employ three levels of evaluation:

  1. Final Answer Accuracy – Does the model reach the correct goal state, regardless of how?

  2. Intermediate Reasoning Trace Analysis – What steps did the model take, and when did it find (or lose) the correct solution?

  3. Token Budget Utilization – How much of the allocated inference budget does the model use as complexity increases?

This layered approach enables the authors to uncover subtle phenomena—such as early overthinking, delayed solution discovery, and token usage collapse—while avoiding oversimplified “pass/fail” judgments. Moreover, the reasoning traces generated by thinking models (e.g., Claude 3.7 Sonnet Thinking) can be parsed and replayed in simulators, allowing the researchers to visualize and validate each step taken.
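
As a hedged sketch of what "parsing and replaying" might look like in practice, the snippet below pulls candidate moves out of a raw trace and scores it at the three levels described above. The bracketed `[src, dst]` move format, the regex, and the whitespace token count are assumptions made for illustration; it reuses the `simulate_hanoi` checker sketched earlier.

```python
import re

# Sketch: extract candidate moves from a reasoning trace and replay them in a
# simulator such as `simulate_hanoi` above. The trace format and regex are
# assumptions for illustration, not the paper's exact parsing rules.

MOVE_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]")

def extract_moves(trace: str) -> list[tuple[int, int]]:
    """Pull every `[src, dst]` pair out of a raw thinking trace."""
    return [(int(a), int(b)) for a, b in MOVE_PATTERN.findall(trace)]

def evaluate_trace(trace: str, n_disks: int) -> dict:
    """Score a trace on the three levels: final correctness, step count, token usage."""
    moves = extract_moves(trace)
    return {
        "final_correct": simulate_hanoi(n_disks, moves),
        "n_moves_proposed": len(moves),
        "tokens_used": len(trace.split()),  # crude whitespace proxy for token count
    }
```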

Top: Example of the Tower of Hanoi puzzle with internal reasoning traces generated by a Large Reasoning Model (LRM), showing intermediate states and final move sequence. Bottom Left: Accuracy of reasoning and non-reasoning models declines with increasing task complexity. Bottom Center: Reasoning token usage initially rises, then paradoxically drops beyond a complexity threshold. Bottom Right: Position within the trace where correct or incorrect solutions appear, showing early overthinking and late failures as complexity grows.

Key Findings: Three Regimes of Reasoning


The study identifies three distinct performance regimes:

  1. Low Complexity: Non-reasoning LLMs outperform LRMs in both accuracy and efficiency. Reasoning models often "overthink"—generating unnecessarily verbose thought chains even when the correct answer is apparent early on.

  2. Medium Complexity: LRMs start to shine. Their extended reasoning helps navigate increasingly entangled decision paths, resulting in higher accuracy than non-reasoning counterparts.

  3. High Complexity: Both LRMs and standard LLMs collapse. Surprisingly, LRMs begin to decrease their reasoning effort—measured in tokens spent on thinking—as complexity increases, despite still having computational budget remaining. This counterintuitive behavior suggests intrinsic limitations in how current models manage their reasoning processes.


Internal Reasoning: The Anatomy of “Thinking Traces”


One of the most distinctive contributions of Shojaee et al.’s work is their deep dive into the internal reasoning processes of LRMs—what they refer to as “thinking traces.” Rather than treating the model as a black box that simply outputs an answer, the authors analyze the full chain of thought that reasoning models generate before reaching a conclusion. This offers unprecedented insight into how models attempt to solve problems, where they go astray, and how their thinking evolves with complexity.


What Are Thinking Traces?


In LRMs, thinking traces are the intermediate sequences—often tens of thousands of tokens long—that represent the model's reasoning process. These traces typically include step-by-step attempts to plan, verify, and adjust potential solutions before generating a final answer. For example, in the Tower of Hanoi puzzle, a thinking trace might contain dozens of full move sequences the model considers, revises, or abandons.


The authors extract these traces systematically and analyze not only their content but also their structure—how reasoning unfolds across the token sequence, and how early or late correct solutions appear. To make this analysis rigorous, each trace is evaluated against custom-built simulators that can verify the validity of every intermediate step. This allows for detailed failure analysis and the identification of qualitative reasoning patterns.


Three Distinct Reasoning Patterns Across Complexity Regimes

By examining thousands of thinking traces across puzzles and complexity levels, the authors identify three recurring patterns that align with the low-, medium-, and high-complexity regimes discussed earlier:

  1. Low Complexity – Early Success, Followed by Overthinking. For simpler tasks (e.g., 2–3 disks in Tower of Hanoi), reasoning models often arrive at a correct solution early in the trace [5]. However, they rarely stop there. Instead, they continue generating alternative solutions, many of which are incorrect. This behavior, known as the overthinking phenomenon [3, 4], leads to unnecessary computational overhead and an increased risk of overwriting the correct solution. The trace effectively becomes less accurate over time, as shown in Figure 7b of the paper, where solution accuracy decreases in later segments of the trace for small N.

  2. Medium Complexity – Exploratory Reasoning and Delayed Success. In puzzles of moderate complexity, the opposite pattern emerges. The model initially explores incorrect paths, often through trial and error, and only discovers the correct solution near the end of the trace. Here, thinking traces become more valuable—the model's extended reasoning time increases the likelihood of course correction. The distribution of correct solutions shifts toward the tail of the trace, indicating that deeper exploration is necessary for success.

  3. High Complexity – Collapse of Reasoning. As problem complexity continues to rise (e.g., 8+ disks in Hanoi or 3+ agents in River Crossing), models fail to produce any valid solution within their thinking traces. Correct answers vanish entirely, and the trace becomes a chaotic or repetitive sequence of invalid or partially correct moves. In this regime, not only does performance collapse, but so does reasoning effort—the models paradoxically generate shorter thinking traces even though the task demands more deliberation. This collapse is not due to token limits (which are far from being reached), but seems to reflect an internal failure to organize coherent thought when faced with deeply compositional tasks.

Quantifying Position and Accuracy in Traces


Shojaee et al. go beyond qualitative insights by introducing metrics to quantify where in the trace correct solutions emerge. Each intermediate solution is assigned a relative position (e.g., early, middle, late), and its correctness is validated using a deterministic simulator. This allows for the creation of position-accuracy curves—plots that reveal how accuracy varies throughout the trace for different levels of complexity.


Key findings include:

  • For easy problems, correct solutions cluster near the beginning, with accuracy decreasing as the trace continues—evidence of unnecessary exploration.

  • For moderate problems, the trend reverses: accuracy increases toward the end, as models refine and converge on valid paths.

  • For hard problems, accuracy remains flat at zero throughout the trace, signaling total failure.

These dynamics suggest that LRMs are capable of some self-correction but lack robust mechanisms to detect when a correct solution has been found and to halt further exploration.
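
A minimal sketch of how a position-accuracy curve like this could be computed, assuming each trace has already been reduced to a list of (character offset, correctness) pairs for its candidate solutions; the ten-bucket scheme is an illustrative choice, not the paper's exact metric:

```python
from collections import defaultdict

# Sketch: bucket every candidate solution by its relative position within the
# trace and track how often candidates in each bucket are correct.
# `traces` holds, per trace, a list of (char_offset, is_correct) pairs.

def position_accuracy(traces: list[list[tuple[int, bool]]], n_buckets: int = 10) -> dict[int, float]:
    """Return mean correctness per relative-position bucket, aggregated over traces."""
    hits, totals = defaultdict(int), defaultdict(int)
    for candidates in traces:
        if not candidates:
            continue
        trace_len = max(offset for offset, _ in candidates) + 1
        for offset, is_correct in candidates:
            bucket = min(int(n_buckets * offset / trace_len), n_buckets - 1)
            totals[bucket] += 1
            hits[bucket] += int(is_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```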


Failure Types in Reasoning Traces


The analysis also reveals several recurring failure modes in thinking traces:

  • Looping Behavior: Models repeatedly cycle through the same incorrect partial solutions.

  • Premature Fixation: A wrong path is selected early and pursued for the rest of the trace.

  • Syntax Errors: Even in structured tasks like Tower of Hanoi, models sometimes produce syntactically malformed move sequences.

  • Semantically Invalid Moves: Moves that violate the task’s core constraints—like placing a large disk on top of a smaller one—are surprisingly common, even late in the trace.

These failures underscore the gap between natural language fluency and algorithmic reliability. While models may “sound” thoughtful, their actual reasoning often lacks internal consistency and symbolic rigor.


Implications for Model Design and Evaluation


The thinking trace analysis makes a compelling case for rethinking how we evaluate reasoning in LLMs:

  • Final answer accuracy is insufficient. A model that gives the right answer after 1,000 incorrect steps should not be treated the same as one that solves it in 10.

  • Token allocation strategies matter. LRMs appear unable to dynamically adjust their reasoning length based on problem complexity.

  • Self-awareness is lacking. Models rarely signal confidence or uncertainty within their reasoning, making it hard to determine whether they believe they’ve found a solution.

If future models are to reason effectively, they must not only generate longer traces but also use them meaningfully—detecting success, pruning error paths, and stopping when appropriate.


Cracks in the Illusion: Generalization, Scaling Limits, and the Road Ahead


While Large Reasoning Models (LRMs) present an image of intelligent, structured thinking, Shojaee et al. uncover major gaps between that illusion and actual reasoning performance. Their findings challenge the assumption that current models generalize effectively or scale reliably under complexity.


The Limits of Generalization


A core claim of LRMs is that they can reason abstractly, beyond memorized patterns. Yet when the authors explicitly provide the algorithm for solving the Tower of Hanoi puzzle, models still fail at higher complexity levels. This suggests that LRMs struggle not only to discover solutions, but even to execute known procedures—raising doubts about their capacity for generalizable logical reasoning.
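
For context, the "known procedure" in question is short: the textbook recursive Tower of Hanoi algorithm fits in a few lines, as in the sketch below (an illustration of the procedure itself, not the paper's exact prompt). Even with such a recipe spelled out, models failed to execute it faithfully at higher disk counts.

```python
# Textbook recursive Tower of Hanoi: transferring n disks takes exactly 2**n - 1 moves.
# This is the kind of explicit procedure that can be handed to the model in the prompt.

def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the optimal move list that transfers n disks from `src` to `dst`."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park the n-1 smaller disks on the spare peg
        + [(src, dst)]                       # move the largest disk to its target peg
        + hanoi_moves(n - 1, aux, src, dst)  # restack the smaller disks on top of it
    )

# For example, hanoi_moves(3) yields 7 moves and hanoi_moves(10) yields 1023.
```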


When More Thinking Becomes Less


An especially striking discovery is that LRMs reduce their reasoning effort—measured in tokens—precisely when problem complexity increases. Initially, they scale up their thinking with harder tasks, but beyond a threshold, they start thinking less, despite having remaining token budget. This behavior signals a deeper limitation: current models lack the capacity to dynamically adjust effort based on difficulty.


Diagnostic Insights from Puzzle Environments


The four puzzles studied reveal different stress points:

  • Tower of Hanoi: Strong early performance collapses around 6+ disks, even with the solution provided.

  • Checker Jumping: Models overthink easy versions, then fail to find solutions as N grows.

  • River Crossing: Collapse occurs earliest—often after just a few moves—possibly due to lack of training exposure.

  • Blocks World: Shows gradual degradation, highlighting planning weaknesses.

These controlled environments help isolate where and how reasoning fails—whether due to memory limits, constraint violations, or premature convergence.

Accuracy and reasoning token usage of LRMs under increasing problem complexity. As task complexity increases, reasoning models (DeepSeek-R1, Claude 3.7 Sonnet, o3-mini) initially allocate more tokens to thinking. However, beyond a model-specific threshold, both accuracy and reasoning effort collapse—even when inference budgets remain underutilized. This behavior suggests a fundamental limitation in reasoning scalability.

These insights expose the fragility of reasoning in today’s LRMs. More tokens or longer traces do not guarantee better reasoning, and surface-level coherence may mask deep inconsistencies. Going forward, progress will likely depend on:


  • Architectures that adapt compute based on task difficulty

  • Meta-reasoning abilities to self-monitor and halt faulty thinking

  • Hybrid models that integrate symbolic execution or algorithmic scaffolding

  • New benchmarks that explicitly test scaling and compositional reasoning

Pass@k accuracy comparison between reasoning and non-reasoning versions of Claude 3.7 Sonnet and DeepSeek on MATH-500, AIME24, and AIME25 datasets under equivalent inference budgets. Results show little or no benefit from reasoning on MATH-500, modest gains on AIME24, and larger gaps on AIME25—suggesting that benchmark contamination and dataset differences confound evaluation of true reasoning capabilities.

Toward More Robust Reasoning


The study leaves us with crucial questions: Can LRMs learn to reason procedurally, not just narratively? What determines the collapse point in reasoning effort? And how can we build systems that know when they know?


Until such questions are answered, the current generation of reasoning models, while impressive, still operates within narrow boundaries—sophisticated in appearance, but often brittle in depth.


Conclusion: Reasoning Requires More Than Tokens


The Illusion of Thinking delivers a compelling critique of modern reasoning models. While LRMs show undeniable progress, their performance is fragile, bounded by narrow complexity bands and prone to collapse in harder scenarios. Their "thinking" is often verbose but brittle, effective only within a comfort zone carefully padded by training data.


True reasoning—generalizable, abstract, and robust—remains elusive. Bridging the gap between pattern completion and cognitive competence will require more than token budgets or reinforcement tuning. It will require rethinking how we train, evaluate, and interpret the "intelligence" of machines.


Until then, we must recognize the illusion—and keep thinking critically about thinking machines.


References


[1] Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity.


[2] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.


[3] Chen, X., Xu, J., Liang, T., He, Z., Pang, J., Yu, D., ... & Yu, D. (2024). Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187.


[4] Ballon, M., Algaba, A., & Ginis, V. (2025). The Relationship Between Reasoning and Performance in Large Language Models--o3 (mini) Thinks Harder, Not Longer. arXiv preprint arXiv:2502.15631.


[5] Zelikman, E., Wu, Y., Mu, J., & Goodman, N. (2022). STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35, 15476-15488.

