Adventuring with AI: What Classic Games Teach Us About Modern Models
- Juan Manuel Ortiz de Zarate
- Aug 23
- 10 min read
Large Language Models have transformed the landscape of artificial intelligence. From excelling in natural language understanding to demonstrating remarkable coding skills, these systems are now central to how we think about machine reasoning. Yet, despite their triumphs on standardized benchmarks and static datasets, a nagging question persists: can LLMs operate as autonomous agents in dynamic, exploratory environments that demand long-term planning and self-directed reasoning?
The paper TextQuests: How Good are LLMs at Text-Based Video Games?[1] introduces a novel benchmark[2] designed to probe exactly this. By leveraging the legendary Infocom interactive fiction games of the 1980s, the authors present a rigorous testbed for evaluating AI agents in worlds that mirror many of the challenges of real-life problem solving. The results are both sobering and illuminating: even the most advanced models, including GPT-5 and Gemini 2.5, struggle to achieve consistent progress in these rich, text-driven environments.

This article explores the motivations behind TextQuests, the design of the benchmark, the evaluation of state-of-the-art models, and what these findings reveal about the future of AI agents.
Why Interactive Fiction?
From Static Knowledge to Dynamic Worlds
Benchmarks such as MMLU[4] and GPQA[3] measure the factual recall and reasoning abilities of AI systems. While invaluable, these tests are inherently static: they evaluate responses to fixed prompts, often with clear right or wrong answers. Real-world reasoning, however, is rarely so neatly packaged. It unfolds dynamically, requiring memory, adaptation, and multi-step planning.
Interactive fiction (IF) games, such as Zork or The Hitchhiker’s Guide to the Galaxy, embody these qualities. Players must explore vast environments, solve puzzles, and make hundreds of interconnected decisions. Unlike static tests, IF demands that an agent:
Maintain a long and growing context – remembering past actions and observations over potentially hundreds of turns.
Engage in trial-and-error learning – adapting strategies when initial attempts fail.
Devise multi-step plans – breaking down long-term goals into executable actions.
In essence, IF is a laboratory for autonomous reasoning.
Why Infocom?
The Infocom library, spanning 25 titles, offers the perfect playground. These games are rich in narrative complexity, filled with puzzles, and often require over 30 hours of human playtime to complete. They also present a wide variety of challenges: spatial navigation, logic puzzles, resource management, and ethical decision-making. Crucially, they rely entirely on text input and output, aligning seamlessly with the capabilities of LLMs.

Designing the Benchmark
Creating TextQuests required more than simply repurposing classic games. The authors carefully engineered a framework that translates the messy, exploratory nature of interactive fiction into a systematic evaluation environment for LLM agents. The design choices balance authentic gameplay experience, experimental rigor, and comparability across models.
Preserving Authentic Complexity
A primary design principle was to preserve the richness of the original Infocom titles. Unlike simplified game environments often used in reinforcement learning, the Infocom games remain unmodified in terms of narrative scope, puzzle structure, and difficulty. Agents interact with the same text prompts and parser interface that challenged human players in the 1980s. This ensures that the benchmark measures capabilities in naturalistic conditions rather than sanitized, toy-like versions of tasks.
To make the games computationally accessible while still faithful to the original experience, the benchmark uses the Jericho–Frotz[5] pipeline, which runs the original compiled game files (written in ZIL and compiled to Z-machine bytecode) as environments that LLMs can query in real time. This preserves parser behavior, ambiguous descriptions, and even the occasional eccentricities of the original design, all of which add layers of complexity for the agent.
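The article does not show the harness itself, but for readers who want to poke at these environments, a minimal interaction through the Jericho library (which wraps a Frotz interpreter) looks roughly like the sketch below; the story-file path is a placeholder, not taken from the paper.

```python
from jericho import FrotzEnv  # pip install jericho

# Load a compiled Z-machine story file (path is a placeholder).
env = FrotzEnv("roms/zork1.z5")
observation, info = env.reset()
print(observation)            # opening room description, exactly as a 1980s player saw it

# Send a raw text command through the original parser.
observation, reward, done, info = env.step("open mailbox")
print(observation, info["score"])
```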
Standardized Interaction Protocol
To ensure fair comparisons across models, TextQuests imposes a strict interaction format. Each turn consists of:
A full game history (observations, actions, reasoning).
The model’s output in a structured JSON format, with two fields: reasoning and action.
A single executable command sent to the game environment.
This protocol guarantees that all models face the same input-output constraints, eliminating discrepancies that could arise from different prompting strategies. It also mirrors the human experience of formulating intent (“why do I want to do this?”) followed by execution (typing the command into the parser).
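For concreteness, a single turn's structured output might look like the following. The field names reasoning and action come from the benchmark description above; the surrounding parsing code is only an illustrative sketch, not the official harness.

```python
import json

# Hypothetical example of the structured turn the protocol expects.
raw_model_output = """
{
  "reasoning": "The leaflet mentioned a trap door; the rug in the living room may hide it.",
  "action": "move rug"
}
"""

turn = json.loads(raw_model_output)
command = turn["action"]       # the single executable command sent to the game
rationale = turn["reasoning"]  # free-text justification, appended to the history
print(f"> {command}    (because: {rationale})")
```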
Balancing Freedom with Boundaries
One of the difficulties in designing agent benchmarks is managing the trade-off between open-ended exploration and measurable progress. If the environment is too open, evaluations become noisy and incomparable; if too constrained, they no longer reflect real exploratory reasoning. TextQuests addresses this by:
Checkpoint labeling: Human annotators defined essential milestones (e.g., retrieving a key object, solving a pivotal puzzle). These form the backbone of the Game Progress metric, balancing freedom of play with an objective measure of advancement.
Autosave and Restore commands: By embedding a lightweight form of “counterfactual exploration,” the benchmark acknowledges that getting stuck is part of the game but prevents this from dominating the evaluation. It also enables the study of strategic backtracking, a skill critical in real-world planning.
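The paper's exact scoring code is not reproduced here, but the idea behind the Game Progress metric can be sketched as the fraction of annotated checkpoints an agent reaches during a run; the checkpoint labels below are invented.

```python
# Illustrative sketch only: Game Progress as the fraction of human-annotated
# checkpoints reached at any point during a run. Checkpoint names are invented.
CHECKPOINTS = {"got_lantern", "opened_trap_door", "solved_maze", "returned_treasure"}

def game_progress(events_reached: set) -> float:
    """Fraction of annotated milestones the agent hit, regardless of order."""
    return len(events_reached & CHECKPOINTS) / len(CHECKPOINTS)

print(game_progress({"got_lantern", "opened_trap_door"}))  # 0.5
```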
Clue Integration as Cognitive Scaffolding
A novel contribution is the WITH CLUES setting. Unlike walkthroughs that reduce the game to a scripted sequence, the InvisiClues booklets are riddled with tiered, cryptic hints. For LLMs, this creates a dual challenge: they must retrieve relevant hints from a long, structured document and then contextualize the hint within the current game state. This setting transforms TextQuests into a document-grounded reasoning task, testing not just memory of past actions but also the ability to integrate auxiliary knowledge sources.
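A rough way to picture the WITH CLUES setting: the agent receives a long, tiered hint document and must surface the entry relevant to its current predicament. The booklet contents and retrieval helper below are invented for illustration.

```python
# Invented data illustrating the tiered InvisiClues format: each question maps
# to hints ordered from vague to explicit. The retrieval helper is hypothetical.
INVISICLUES = {
    "How do I get past the troll?": [
        "Have you tried offering it something?",     # tier 0: gentle nudge
        "The troll respects a show of strength.",    # tier 1: more direct
        "Attack the troll with the elvish sword.",   # tier 2: explicit answer
    ],
}

def hints_up_to(question: str, tier: int) -> list:
    """Return all hints up to the requested tier, mimicking how a player reveals clues."""
    return INVISICLUES.get(question, [])[: tier + 1]

# Only the revealed tiers get appended to the agent's context.
clue_context = "\n".join(hints_up_to("How do I get past the troll?", tier=1))
print(clue_context)
```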

Ethical Dimension
Finally, the incorporation of harm annotations reflects an awareness that agent performance cannot be judged solely on technical efficiency. Infocom games frequently present ethically loaded choices, such as lying, stealing, or harming characters. By systematically coding such decisions and penalizing them in the Harm metric, TextQuests integrates value-sensitive evaluation directly into gameplay. This dual-axis approach, progress and harm, prevents the benchmark from rewarding raw success at the expense of ethical considerations.
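A minimal sketch of the dual-axis report, with both numbers tracked side by side rather than collapsed into one score; the data structure and values are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    progress: float        # fraction of checkpoints reached, in [0, 1]
    harmful_actions: int   # count of actions matching the harm annotations

def summarize(result: RunResult) -> str:
    return f"progress={result.progress:.0%}, harm={result.harmful_actions} flagged actions"

print(summarize(RunResult(progress=0.35, harmful_actions=2)))
```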
Evaluating State-of-the-Art LLMs
Designing a benchmark is only half the task; to be meaningful, it must be applied across a diverse set of models in a consistent way. The evaluation of TextQuests, therefore, serves not only as a test of agent performance but also as a comparative study of how different training philosophies, scales, and reasoning approaches manifest in exploratory environments.
Diversity of Model Families
The evaluation suite covers both frontier closed-source systems and emerging open-weight alternatives. By including GPT-5 and Claude Opus alongside Gemini 2.5, Grok, DeepSeek R1[6], and open-weight models such as GPT-OSS 120B, the benchmark spans a broad design space. This allows researchers to probe whether performance advantages stem primarily from scale, architectural innovations, reinforcement-style training, or alignment strategies. For example, Claude models emphasize extended reasoning traces, while Grok emphasizes high reasoning budgets with relatively lean scaffolding. TextQuests provides a common stage to observe how these different emphases play out under pressure.
Controlled Experimental Settings
To isolate the intrinsic reasoning capabilities of the models, all were tested under uniform conditions:
Fixed 500-turn caps, with supplementary runs extended to 800 turns for saturation analysis.
Standardized prompts requiring models to output reasoning plus a single command, ensuring comparability.
High reasoning budgets were enabled where possible, but without tool augmentation or external memory modules.
This last constraint is especially important. Many demonstrations of agentic LLMs rely on heavy scaffolding, such as summarization modules or retrieval systems. By design, TextQuests eliminates these crutches, providing a direct measure of the model itself rather than its ecosystem of supports.
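Put together, the uniform protocol amounts to a loop like the one below: a fixed turn budget, the entire accumulated history passed back each turn, and no tools or external memory. The names `query_model` and `env` are placeholders standing in for the model API and the game wrapper, not the benchmark's real interfaces.

```python
# Minimal sketch of the uniform evaluation loop described above.
MAX_TURNS = 500

def run_episode(query_model, env) -> list:
    observation = env.reset()
    history = [observation]                  # the context only ever grows
    for _ in range(MAX_TURNS):
        turn = query_model(history)          # expected: {"reasoning": ..., "action": ...}
        observation, done = env.step(turn["action"])
        history += [turn, observation]
        if done:
            break
    return history
```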
Comparing Scale and Efficiency
One of the most illuminating aspects of TextQuests is its ability to highlight the relationship between scale and sustained reasoning. Larger models consistently sustain progress further into a game session, but the benchmark also reveals where efficiency breaks down. Smaller “mini” versions of flagship models, though lighter and cheaper to run, often plateau early. This creates a nuanced picture: raw scale helps, but efficiency of reasoning, measured in tokens per step, determines whether that scale translates into practical agent performance.

The evaluation also draws attention to the token economy as a hidden dimension of capability. Models that manage to generate concise but purposeful reasoning traces often maintain momentum longer, suggesting that the quality of reasoning, not just its volume, plays a role in sustaining exploratory success.
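As a back-of-the-envelope illustration (all numbers invented), two runs reaching similar progress can differ sharply in reasoning tokens spent per turn:

```python
# Invented figures: (progress fraction, total reasoning tokens, turns taken).
runs = {"model_a": (0.42, 310_000, 500), "model_b": (0.40, 95_000, 500)}
for name, (progress, tokens, turns) in runs.items():
    print(f"{name}: {progress:.0%} progress at {tokens / turns:.0f} reasoning tokens per turn")
```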
Family-Specific Signatures
Beyond raw scores, TextQuests reveals distinctive “failure signatures” tied to different model families. For instance:
GPT-family models often excel at clue interpretation but sometimes overcommit to a mistaken inference, pushing forward rather than backtracking.
Claude models leverage their long-form reasoning style effectively, but at times generate unnecessarily verbose chains that slow down exploration.
Gemini models show strong bursts of progress but display more volatility, alternating between deep insight and repetitive stalling.
Open-weight models lag behind overall, yet their inclusion is vital: they serve as baselines for community-driven progress and illustrate the current gap between frontier systems and openly accessible research models.
Benchmark as a Comparative Lens
Perhaps the most important contribution of this evaluation is not the ranking of models but the comparative lens it provides. TextQuests creates a controlled environment where differences of architecture, training scale, and reasoning strategies become visible in ways that standard academic benchmarks cannot capture. Where multiple-choice tests compress reasoning into a single answer, TextQuests stretches it across hundreds of turns, revealing patterns of persistence, adaptability, and failure modes.

Where Models Struggle
The shortcomings revealed by TextQuests go beyond occasional missteps or memory slips; they expose systematic weaknesses in how today’s LLMs structure, sustain, and adapt their reasoning across extended interactions. These struggles, while varied, can be grouped into deeper categories that illuminate the underlying gaps between statistical language models and human players of interactive fiction.
Fragile Strategy Formation
One of the clearest distinctions between human and model gameplay lies in strategic persistence. Human players often pursue a long-term plan, such as collecting specific items before attempting a puzzle, while flexibly revising that plan when new information emerges. By contrast, LLMs tend to oscillate between short bursts of coherent strategy and sudden collapses into ad-hoc improvisation. Once a plan derails, they rarely re-establish a stable trajectory. This fragility suggests that current models lack the internal scaffolding to represent medium-term goals separate from immediate action selection.
Mismanagement of Resources and Trade-offs
Infocom games are rich in resource management puzzles: deciding which items to carry through a narrow passage, or when to conserve light sources. LLMs routinely mishandle such trade-offs. They may carry too many unnecessary objects, drop critical ones and forget where they left them, or fail to anticipate the consequences of resource depletion. Unlike humans, who instinctively weigh opportunity costs (“if I leave the lantern behind, I won’t be able to see later”), LLMs often treat these as isolated moves rather than interconnected decisions. This reveals a gap in prospective reasoning, the ability to foresee downstream implications of current actions.
Sensitivity to Ambiguity and Indirect Feedback
Interactive fiction thrives on descriptions that are deliberately vague or metaphorical. Humans often exploit world knowledge and genre conventions to infer what the game “really means.” LLMs, however, are unusually brittle when confronted with ambiguous cues. They may misinterpret a cryptic description as irrelevant, or pursue a literal but unproductive course of action. When the environment gives indirect feedback (e.g., “nothing happens” after trying a spell), humans recognize this as a nudge to explore alternatives. Models often interpret it as a signal to repeat the same action, highlighting a weakness in learning from negative evidence.
Over-Exploration Without Synthesis
Another recurrent struggle is a tendency toward breadth over depth. LLMs can generate a wide variety of exploratory actions, but they struggle to synthesize discoveries into coherent knowledge. A model might map many rooms but fail to remember where a key item was located, or repeatedly “discover” the same object without integrating it into its plan. This produces gameplay that looks busy and exploratory but is strategically hollow, a phenomenon akin to surface-level exploration without conceptual integration.
Limited Self-Diagnosis
Perhaps the most revealing struggle is the lack of self-diagnostic reasoning. Human players reflect on failures (“I must have missed something earlier” or “I should restore before entering the maze”). LLMs, when trapped, rarely articulate their own mistakes or propose systematic revisions of their approach. Instead, they continue forward with increasingly desperate commands. This inability to recognize and articulate failure conditions suggests that metacognition, the capacity to evaluate one’s own problem-solving process, remains an underdeveloped faculty in current models.
Broader Implications
A Different Kind of Benchmark
Traditional long-context evaluations, like needle-in-a-haystack tests, measure retrieval from static documents. While useful, they don’t capture the dynamic accumulation of context that occurs when an agent builds its own history through actions. TextQuests fills this gap, offering a rare test of iterative reasoning in growing contexts.
Ethics and Safety
By integrating harm scores, TextQuests extends beyond mere competence. It asks: can agents pursue goals without causing unnecessary harm? This dual framing is essential as AI agents move toward deployment in sensitive real-world domains.
Toward Truly Autonomous Agents
The difficulties exposed by TextQuests suggest that tool use and scaffolding have masked core limitations. In Pokémon experiments, for instance, models succeeded only with extensive external supports like pathfinding tools. TextQuests strips these away, forcing reliance on intrinsic reasoning. The struggles observed highlight how far we remain from robust, general-purpose AI agents.
Looking Forward
Research Directions
The authors highlight several avenues for improvement:
Better Memory Architectures: Models need mechanisms for more reliable long-term recall within extended contexts.
Spatial and Symbolic Reasoning: Enhancements in structured reasoning could help agents build and navigate internal maps.
Dynamic Compute Allocation: Smarter strategies for allocating reasoning effort may boost efficiency without sacrificing performance.
Ethical Scaffolding: Integrating safety layers that align actions with moral norms remains an open challenge.
The Role of Open Sourcing
By releasing TextQuests openly at textquests.ai[7], the team invites the broader community to experiment, benchmark, and iterate. This democratization is crucial for fostering rapid progress and shared standards.
Conclusion
TextQuests is more than a nostalgic return to Zork and its contemporaries. It is a stress test for the future of AI. By situating models in dynamic, puzzle-filled, ethically charged worlds, it exposes both their potential and their limitations. The benchmark makes clear that today’s LLMs, while impressive, remain brittle when faced with the sustained, self-directed reasoning tasks required for true autonomy.
As the field moves beyond static benchmarks, challenges like TextQuests will be central in charting the path forward. The lesson is unmistakable: if we want AI to reason like us, we must first teach it to survive in the dark, maze-like worlds of interactive fiction.
References
[1] Phan, L., Mazeika, M., Zou, A., & Hendrycks, D. (2025). TextQuests: How Good are LLMs at Text-Based Video Games?. arXiv preprint arXiv:2507.23701.
[2] Measuring Intelligence: Key Benchmarks and Metrics for LLMs, Transcendent AI
[3] Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., ... & Bowman, S. R. (2024, July). GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling.
[4] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
[5] Hausknecht, M., Ammanabrolu, P., Côté, M. A., & Yuan, X. (2020, April). Interactive fiction games: A colossal adventure. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 7903-7910).
[6] DeepSeek, the game-changing model, Transcendent AI
[7] TextQuests: How Good are LLMs at Text-Based Video Games?, TextQuests