AI Can Code, But Can It Engineer?
SWE-Bench Pro marks a turning point in evaluating AI coding agents. Built from complex, real-world software repositories, it reveals that even frontier models like GPT-5 and Claude Opus solve fewer than 25% of tasks. The benchmark exposes the gap between coding fluency and true engineering ability, redefining how progress toward autonomous software development should be measured.

Juan Manuel Ortiz de Zarate
Nov 5 · 10 min read


The Checklist Shortcut to Smarter, Safer AI
This article explores Reinforcement Learning from Checklist Feedback (RLCF), a new approach that replaces reward models with checklists to align large language models. By breaking instructions into clear, verifiable steps, checklists provide richer, more interpretable feedback and consistently improve performance across benchmarks. The piece examines how this shift could make AI more reliable, transparent, and user-aligned.

Juan Manuel Ortiz de Zarate
Sep 4 · 12 min read


Adventuring with AI: What Classic Games Teach Us About Modern Models
TextQuests introduces a benchmark built on 25 Infocom text-based adventure games to evaluate LLMs in dynamic, exploratory environments. Unlike static benchmarks, it tests long-context reasoning, trial-and-error learning, and ethical decision-making without external tools. Results show that even advanced models like GPT-5 struggle with sustained strategy, highlighting current limits in autonomy, memory, and adaptive reasoning.

Juan Manuel Ortiz de Zarate
Aug 22 · 10 min read


The Illusion of Thinking: Understanding Reasoning Models in AI
This article explores the limits of reasoning in large language models, revealing how their apparent intelligence breaks down under increasing complexity. Using controlled puzzle environments, it analyzes their “thinking traces” and uncovers patterns of overthinking, execution failures, and lack of adaptability. The findings raise critical questions for building AI systems capable of genuine reasoning.

Juan Manuel Ortiz de Zarate
Jun 26 · 10 min read