
How Language Models Learned to Reason

For many years, large language models impressed researchers with their fluency but frustrated them with their reasoning. Models could write poetry, summarize documents, and answer trivia, yet routinely failed at tasks that humans solve by breaking problems into steps: multi-step arithmetic, commonsense inference, or symbolic manipulation. The 2022 NeurIPS paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” [1] marked a turning point. It demonstrated that reasoning was not absent from large language models; it was latent, waiting to be elicited.


The central idea of the paper is deceptively simple: if a prompt shows not only the question and the answer, but also the intermediate reasoning steps that lead to the answer, sufficiently large language models will imitate this pattern and produce their own multi-step reasoning. This method, called chain-of-thought (CoT) prompting, unlocked dramatic performance gains across arithmetic, commonsense, and symbolic reasoning tasks, without any finetuning or architectural changes.


This article explains what chain-of-thought prompting is, why it works, what the authors empirically demonstrated, and why this paper reshaped how we think about prompting, scale, and reasoning in language models.


The Problem: Fluent Models, Shallow Reasoning


Before chain-of-thought prompting, the dominant paradigm for using large language models [2] was standard few-shot prompting. A prompt would include a handful of examples formatted as input–output pairs:


Q: Problem 

A: Answer


This worked surprisingly well for many tasks, but it struggled with problems that require intermediate reasoning. Arithmetic word problems are a canonical example. Even when models “knew” the math, they often jumped directly to an incorrect final answer because the reasoning path was never made explicit.
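The standard few-shot format above can be sketched as a simple prompt builder. The exemplar problems and the function name below are illustrative, not taken from the paper or any benchmark:

```python
# Sketch of standard few-shot prompting: exemplars map questions directly
# to final answers, with no intermediate reasoning shown.
# The exemplar problems here are made up for illustration.

EXEMPLARS = [
    ("If there are 3 cars and each car has 4 wheels, how many wheels are there?",
     "12"),
    ("A book costs $7. How much do 5 books cost?",
     "$35"),
]

def build_standard_prompt(question: str) -> str:
    """Format exemplars as bare Q/A pairs, then append the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in EXEMPLARS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_standard_prompt(
    "If a train has 6 cars with 8 seats each, how many seats in total?"))
```

The model is expected to continue after the trailing "A:", and with this format it must produce the final answer in a single leap, with no visible intermediate work.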


Previous research attempted to address this limitation through finetuning with rationales or symbolic reasoning modules, but these approaches were expensive, task-specific, and incompatible with the flexible in-context learning paradigm that made large language models so powerful. The key insight of Wei et al. was that reasoning could be induced purely through prompting.


What Is Chain-of-Thought Prompting?


At its core, chain-of-thought prompting is a modification of the standard few-shot prompting paradigm that explicitly exposes intermediate reasoning steps within the prompt itself. Instead of presenting a language model with examples that map inputs directly to final answers, chain-of-thought prompting inserts a structured sequence of natural language steps that bridge the gap between the question and the conclusion.


In traditional few-shot prompting, exemplars typically follow a concise pattern:

Q: Question

A: Final answer


This format implicitly assumes that the model can internally infer the reasoning required to arrive at the answer. For many tasks, especially those involving surface-level pattern recognition, this assumption holds. However, for problems that require multi-step reasoning, such as arithmetic word problems, temporal inference, or symbolic manipulation, this format often fails. The model may possess the necessary knowledge but lacks an explicit scaffold to organize and apply it.


Chain-of-thought prompting alters the exemplar structure to the following:


Q: Question 

A: Step-by-step reasoning expressed in natural language, followed by the final answer.


The crucial difference is that the reasoning steps are not hidden inside the model’s latent computations but are made explicit in the prompt. When several such exemplars are provided, the model learns, purely through in-context learning, that producing intermediate reasoning steps is part of the task itself.
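A minimal sketch of the altered exemplar structure, using a worked exemplar in the style of the paper's running tennis-ball example (the helper name is an assumption):

```python
# Sketch of a chain-of-thought exemplar: the answer field now contains the
# reasoning steps followed by the final answer, so the model learns in
# context that step-by-step reasoning is part of the task.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)

def build_cot_prompt(question: str) -> str:
    """Prepend the chain-of-thought exemplar, then ask the new question."""
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA:"

print(build_cot_prompt(
    "A baker makes 4 trays of 6 muffins and sells 7. How many are left?"))
```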

Examples of <input, chain of thought, output> triples for arithmetic, commonsense, and symbolic reasoning benchmarks.

A defining characteristic of chain-of-thought prompting is that the reasoning is expressed in free-form natural language, not in equations, symbolic programs, or formal logic. This choice is deliberate. Large language models are trained primarily on natural language, and their internal representations are especially well-suited to capturing linguistic regularities, causal narratives, and sequential dependencies expressed in text.


By framing reasoning as a linguistic process, “first this happens, then that follows, therefore…”, chain-of-thought prompting aligns the reasoning task with the model’s strongest inductive biases. The model is not asked to translate a word problem directly into a mathematical formula; instead, it is encouraged to narrate the solution process in the same way a human might explain it step by step.


This linguistic framing turns reasoning into a token-by-token generative process, allowing the model to condition each step on the previous one. Errors, corrections, and dependencies become locally manageable, rather than being compressed into a single leap from question to answer.


An important aspect of chain-of-thought prompting is that models are not merely copying the surface structure of the exemplars. They are learning a procedural pattern: problems of a certain kind should be solved by decomposing them into intermediate subproblems, solving each in sequence, and then synthesizing the final result.


This distinction becomes evident when models generate chains of thought for novel problems that differ substantially from the exemplars. The generated reasoning steps are not memorized templates; they are newly constructed, context-sensitive sequences that adapt the demonstrated reasoning strategy to new inputs.


In this sense, chain-of-thought prompting teaches the model how to think about a class of problems, rather than what answer to give for a specific input. The exemplars function as demonstrations of a reasoning algorithm encoded in language.


Sequential Dependency and Causal Structure


Another key feature of chain-of-thought prompting is its sequential dependency structure. Each reasoning step conditions the next. This matters because many reasoning tasks are inherently sequential: later conclusions depend on earlier inferences.


Standard prompting collapses this sequence into a single prediction, forcing the model to implicitly represent the entire reasoning chain in its hidden states. Chain-of-thought prompting externalizes this process, allowing the model to “offload” intermediate results into text, which can then be reused explicitly.


This has two major effects:

  1. Reduced cognitive load per token: Each step only needs to reason about a small part of the problem.

  2. Error localization: When the model fails, the failure often occurs at a specific step, rather than as an opaque incorrect answer.
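The error-localization point can be made concrete: once the reasoning is externalized as text, a generated chain can be split into discrete steps and the final answer parsed out, so a failure can be traced to a specific step. The step-splitting heuristic and the "The answer is N" convention below are illustrative assumptions, not part of the paper's method:

```python
import re

# Sketch: externalized reasoning lets us inspect individual steps and
# extract the final answer, rather than receiving one opaque prediction.

def split_steps(chain: str) -> list[str]:
    """Split a chain of thought into sentence-level steps (crude heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chain.strip())
            if s.strip()]

def extract_answer(chain: str):
    """Pull the final answer from a 'The answer is N' sentence, if present."""
    m = re.search(r"The answer is ([-\d.,$]+)", chain)
    return m.group(1).rstrip(".") if m else None

chain = ("Roger started with 5 balls. 2 cans of 3 tennis balls each is "
         "6 tennis balls. 5 + 6 = 11. The answer is 11.")
print(split_steps(chain))     # four separate, inspectable steps
print(extract_answer(chain))  # '11'
```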


The authors emphasize that this does not imply that the model is reasoning in the same way humans do internally. However, the external behavior closely mirrors human step-by-step problem solving, making it both more effective and more interpretable.

Chain-of-thought prompting has several important conceptual implications.


First, it decouples reasoning from training. The model does not need to be retrained or finetuned on rationales; it only needs to be shown how reasoning looks in context.

Second, it introduces variable computation at inference time. Harder problems naturally lead to longer reasoning chains, allowing the model to allocate more “thinking” where needed.


Third, it provides a window into model behavior. While not a guarantee of faithfulness, chains of thought make it easier to inspect errors, diagnose failures, and understand what kinds of reasoning models can and cannot perform.


Finally, it suggests that reasoning is an emergent capability tied to scale rather than an explicit module baked into the architecture.


Arithmetic Reasoning: A Breakthrough Result

The strongest empirical evidence in the paper comes from arithmetic reasoning benchmarks [3], particularly GSM8K [4], a dataset of grade-school math word problems.

PaLM 540B uses chain-of-thought prompting to achieve a new state-of-the-art performance on the GSM8K benchmark.

Using standard prompting, even very large models performed poorly. With chain-of-thought prompting, performance increased dramatically, but only for sufficiently large models.


The standout result is that PaLM 540B [5], prompted with just eight chain-of-thought exemplars, achieved state-of-the-art accuracy on GSM8K, outperforming prior approaches that relied on supervised finetuning and external verifiers. This result is striking because it shows that prompting alone can rival or surpass heavily engineered pipelines.
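To make the evaluation concrete, a minimal scoring sketch: GSM8K gold solutions end with a "#### <answer>" line, and here we assume (as above) that model completions end with "The answer is N." The function names and normalization are illustrative, not the paper's exact evaluation code:

```python
import re

# Sketch of exact-match scoring for GSM8K-style problems, assuming the
# dataset's "#### <answer>" gold format and a "The answer is N" completion.

def gold_answer(solution: str) -> str:
    """Take the text after the final '####' marker as the gold answer."""
    return solution.split("####")[-1].strip()

def predicted_answer(generation: str):
    """Extract the model's final answer; None if no answer sentence found."""
    m = re.search(r"The answer is\s+\$?([\d,./-]+)", generation)
    return m.group(1).replace(",", "").rstrip(".") if m else None

def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answer."""
    correct = sum(p is not None and p == g for p, g in zip(preds, golds))
    return correct / len(golds)
```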


The authors emphasize three patterns:

  1. Scale matters: Small models generate fluent but illogical chains of thought. Performance gains appear only around the 100B-parameter regime.

  2. Hard problems benefit more: Datasets requiring multiple reasoning steps see much larger improvements than simple one-step problems.

  3. Prompting competes with finetuning: Few-shot CoT prompting can match or exceed task-specific supervised models.

These findings strongly support the claim that reasoning ability is not binary but emerges gradually with scale.


Ablation Studies: What Actually Helps?


To understand why chain-of-thought prompting works, the authors perform a series of careful ablations.


One hypothesis is that CoT simply helps by producing equations. To test this, they prompt the model to output only a mathematical equation before answering. This helps on very simple problems but fails on GSM8K, suggesting that semantic decomposition, not just equation extraction, is essential.


Another hypothesis is that CoT works because it allows more computation. The authors test this by forcing the model to output meaningless filler tokens before answering. This yields no improvement, ruling out “extra tokens” as the explanation.

Ablation study for different variations of prompting using LaMDA 137B and PaLM 540B.

Finally, they test placing the chain of thought after the answer. This also fails, showing that the reasoning must precede the answer and be used causally rather than decoratively.


Together, these ablations strongly suggest that explicit step-by-step reasoning in natural language is the key mechanism.
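The three ablation formats can be sketched side by side. The exact wording of each variant below is hypothetical; only the structural contrast (equation only, filler tokens, reasoning after the answer) follows the paper:

```python
# Illustrative sketches of the three ablation prompt formats compared
# against full chain-of-thought prompting. Wording is made up.

question = ("Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
            "How many balls does he have now?")

ablations = {
    # 1) Equation only: the rationale is compressed to a bare equation.
    "equation_only": f"Q: {question}\nA: 5 + 2 * 3 = 11. The answer is 11.",
    # 2) Variable compute: filler tokens add computation but no content.
    "filler_tokens": f"Q: {question}\nA: {'.' * 30} The answer is 11.",
    # 3) CoT after answer: reasoning is present but cannot causally
    #    inform the answer, since it is generated afterwards.
    "cot_after_answer": (
        f"Q: {question}\nA: The answer is 11. Roger started with 5 balls; "
        "2 cans of 3 balls each is 6 more; 5 + 6 = 11."
    ),
}

for name, prompt in ablations.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Only the full chain-of-thought format, where natural-language reasoning precedes the answer, produced the large gains; each of these three variants fails on GSM8K.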


Robustness: Does Style Matter?


Prompt sensitivity is a well-known issue in large language models. A natural concern is whether chain-of-thought prompting works only with carefully crafted examples.

The authors address this by testing chains of thought written by different annotators, in different styles, and with varying levels of verbosity. They also test exemplars sampled directly from datasets that already contain reasoning steps.

Chain-of-thought prompting enables large language models to solve challenging math problems.

While performance varies somewhat, as expected, all chain-of-thought variants consistently outperform standard prompting. This suggests that CoT prompting does not rely on a specific phrasing or stylistic trick, but on the presence of structured intermediate reasoning itself.


Although arithmetic provides the clearest evidence, the authors also evaluate chain-of-thought prompting on commonsense reasoning tasks, including CommonsenseQA [6], StrategyQA [7], and selected BIG-bench tasks.


These problems require background knowledge, multi-hop inference, and plausibility judgments rather than numerical computation. Here too, chain-of-thought prompting yields consistent improvements, especially for the largest models.


Notably, PaLM 540B with CoT prompting surpasses the prior state of the art on StrategyQA and performs competitively with human baselines on sports plausibility tasks. The gains are smaller than in math, but they demonstrate that CoT is not limited to arithmetic; it generalizes to reasoning about the world.


Symbolic Reasoning and Length Generalization


The final experimental section examines symbolic reasoning, using controlled toy tasks such as last-letter concatenation and coin-flip tracking.


These tasks are interesting because they allow precise control over problem length. Models are shown examples with a fixed number of steps and then tested on longer sequences at inference time.
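Ground-truth implementations of the two toy tasks make this setup concrete; prompts show short inputs (e.g. two-word names or three coin actions) and evaluation uses longer ones. The function names are my own:

```python
# Ground truth for the two symbolic tasks, to make the
# length-generalization setup concrete.

def last_letter_concat(words: list[str]) -> str:
    """Concatenate the last letter of each word ('Amy Brown' -> 'yn')."""
    return "".join(w[-1] for w in words)

def coin_state(flips: list[bool], heads_up: bool = True) -> bool:
    """Track whether a coin is still heads up after a sequence of
    flip / no-flip actions."""
    for flipped in flips:
        if flipped:
            heads_up = not heads_up
    return heads_up

# In-domain length (2 words) vs. out-of-domain length (4 words):
print(last_letter_concat(["Amy", "Brown"]))                   # 'yn'
print(last_letter_concat(["Elon", "Musk", "Bill", "Gates"]))  # 'nkls'
print(coin_state([True, False, True]))                        # True
```

Because the correct answer is computable, the benchmark can measure exactly how accuracy degrades as test inputs grow longer than any exemplar in the prompt.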


Standard prompting fails completely in these out-of-domain settings. Chain-of-thought prompting, however, enables length generalization: large models successfully apply the same reasoning pattern to longer sequences than those seen in the prompt.

This result is subtle but important. It shows that chain-of-thought prompting does not merely encourage memorization of exemplars; it induces abstract procedural reasoning that can be extended to novel inputs.


One of the paper’s most influential conclusions is that chain-of-thought reasoning is an emergent ability. Below a certain scale, models generate fluent but incorrect reasoning. Above that threshold, both reasoning quality and task performance improve rapidly.


This observation aligns with broader findings about emergent behaviors in large language models, where qualitative shifts occur once models reach sufficient capacity. Importantly, standard prompting often masks these abilities, making models appear less capable than they actually are.


In this sense, chain-of-thought prompting reveals that standard prompting provides a lower bound on model capability, not an upper bound.


Limitations and Open Questions


The authors are careful not to overclaim. They explicitly note several limitations.

First, generating a plausible chain of thought does not guarantee that the model is “really” reasoning in a human-like sense. The relationship between generated reasoning traces and internal neural computation remains unresolved.


Second, chains of thought are not always correct. Models can produce coherent but flawed reasoning that leads to the wrong answer, raising concerns about over-trusting explanations.


Third, the reliance on very large models makes CoT prompting expensive and limits its immediate applicability in resource-constrained settings.


Finally, there is the question of faithfulness: whether the generated chain truly reflects the model’s decision process or is a post-hoc rationalization. This paper does not resolve that debate, but it provides a concrete experimental foundation for future work.


Conclusion


The impact of “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” extends far beyond its immediate results.


It reframed prompting from a formatting trick into a mechanism for cognitive scaffolding. It showed that reasoning does not require new architectures, symbolic solvers, or specialized training; sometimes it just requires asking the model to “show its work.”


The paper also catalyzed a wave of follow-up research, including self-consistency decoding, zero-shot CoT prompting, and methods that selectively hide or compress reasoning traces for safety and efficiency.


Perhaps most importantly, it changed how practitioners interact with language models. Today, prompting models to “think step by step” is a default move in both research and industry, a direct legacy of this work.


Chain-of-thought prompting demonstrated that large language models are not merely pattern matchers that jump from question to answer. When given the right scaffolding, they can perform structured, multi-step reasoning across arithmetic, commonsense, and symbolic domains.


By revealing reasoning as an emergent, prompt-elicitable capability, Wei et al. fundamentally altered our understanding of what language models can do, and how much of that ability was hidden in plain sight.


The paper stands as a reminder that sometimes, the most powerful advances come not from building bigger machines, but from learning how to ask better questions.


References


[1] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824-24837.


[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.



[4] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., ... & Schulman, J. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.


[5] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., ... & Fiedel, N. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1-113.


[6] Talmor, A., Herzig, J., Lourie, N., & Berant, J. (2019, June). Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4149-4158).


[7] Geva, M., Khashabi, D., Segal, E., Khot, T., Roth, D., & Berant, J. (2021). Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9, 346-361.
