Benchmarks | Transcendent AI

What If Reasoning Doesn’t Need Billion-Parameter Models?

Large language models excel at language but often struggle with structured reasoning tasks. This article explores Tiny Recursive Models (TRMs), a radically simpler approach that uses small neural networks with recursive refinement to outperform massive LLMs on puzzles like Sudoku, mazes, and ARC-AGI. By prioritizing iterative reasoning over scale, TRMs show that deep thinking can emerge from minimal architectures, challenging prevailing assumptions about model size and intelligence.

Juan Manuel Ortiz de Zarate

Dec 18, 2025 · 10 min read

AI Can Code, But Can It Engineer?

SWE-Bench Pro marks a turning point in evaluating AI coding agents. Built from complex, real-world software repositories, it reveals that even frontier models like GPT-5 and Claude Opus solve fewer than 25% of tasks. The benchmark exposes the gap between coding fluency and true engineering ability, redefining how progress toward autonomous software development should be measured.

Juan Manuel Ortiz de Zarate

Nov 5, 2025 · 10 min read

The Checklist Shortcut to Smarter, Safer AI

This article explores Reinforcement Learning from Checklist Feedback (RLCF), a new approach that replaces reward models with checklists to align large language models. By breaking instructions into clear, verifiable steps, checklists provide richer, more interpretable feedback and consistently improve performance across benchmarks. The piece examines how this shift could make AI more reliable, transparent, and user-aligned.

Juan Manuel Ortiz de Zarate

Sep 4, 2025 · 12 min read

Adventuring with AI: What Classic Games Teach Us About Modern Models

TextQuests introduces a benchmark built on 25 Infocom text-based adventure games to evaluate LLMs in dynamic, exploratory environments. Unlike static benchmarks, it tests long-context reasoning, trial-and-error learning, and ethical decision-making without external tools. Results show that even advanced models like GPT-5 struggle with sustained strategy, highlighting current limits in autonomy, memory, and adaptive reasoning.

Juan Manuel Ortiz de Zarate

Aug 23, 2025 · 10 min read

The Illusion of Thinking: Understanding Reasoning Models in AI

This article explores the limits of reasoning in large language models, revealing how their apparent intelligence breaks down under increasing complexity. Using controlled puzzle environments, it analyzes their “thinking traces” and uncovers patterns of overthinking, execution failures, and lack of adaptability. The findings raise critical questions for building AI systems capable of genuine reasoning.

Juan Manuel Ortiz de Zarate

Jun 26, 2025 · 10 min read