AI Can Code, But Can It Engineer?
- Juan Manuel Ortiz de Zarate

For years, progress in coding-capable large language models (LLMs) has been measured through neatly packaged programming benchmarks: small, well-defined tasks that resemble textbook exercises more than the chaos of real development. In that constrained world, the leading models looked almost superhuman, with frontier systems such as GPT-5 and Claude Opus 4.1 automatically solving over 70 percent of problems on SWE-Bench Verified [5]. But as every software engineer knows, the real world is never that tidy. Enterprise codebases are sprawling, full of interdependencies, tests that fail for obscure reasons, and requirements written in human shorthand.
The new benchmark SWE-Bench Pro [1], introduced by Xiang Deng, Jeff Da and colleagues at Scale AI, is a direct response to that reality. It represents a philosophical shift: from evaluating LLMs as clever code autocompleters to treating them as apprentice engineers working inside complex projects. The numbers tell the story: where previous benchmarks were nearly saturated, SWE-Bench Pro drops even the most advanced systems back below 25 percent success. GPT-5 tops the chart at 23.3 percent, just ahead of Anthropic’s Claude Opus 4.1 at 22.7 percent. The message is clear: genuine software engineering autonomy remains far away.
This article examines what SWE-Bench Pro contributes, how it differs from earlier efforts, what it reveals about current AI agents’ strengths and weaknesses, and why it might reshape how we train and trust coding models in the years ahead.

From Function-Level Code to Enterprise-Scale Engineering
The evolution of coding benchmarks
Early code benchmarks [6], such as HumanEval [3] and MBPP [4], were designed for function-level reasoning: write a few lines of Python to pass a unit test. These datasets were essential for calibrating model progress, but they measured programming in the abstract, not engineering as practiced.
The turning point came with SWE-Bench [2], which reframed evaluation around issue resolution. Instead of isolated snippets, models received an entire GitHub repository and a natural-language bug report or feature request, and had to produce a patch that fixed the problem while keeping the test suite green. For the first time, AI had to navigate dependency chains, project structure, and commit histories, approximating the daily work of a real developer.
Yet SWE-Bench’s success created a paradox. Because it drew heavily from open-source Python projects under permissive licenses, many of its examples were likely included in the training data of the very models being tested. This data-contamination effect inflated scores and blurred the line between understanding and memorization. Moreover, many SWE-Bench tasks were too simple: 161 of the 500 verified cases involved one- or two-line fixes. The benchmark risked turning into a solved problem.
The Design of SWE-Bench Pro
1. Contamination resistance by design
The first innovation of SWE-Bench Pro is legal and methodological rather than technical. To prevent models from having seen the material during training, the authors restrict the public and held-out sets to GPL-licensed repositories, whose copyleft terms prohibit inclusion in proprietary datasets. In parallel, they assembled a commercial subset of proprietary codebases from partner startups, repositories never available on the public internet.
This three-tiered structure (public: 11 repos, 731 instances; held-out: 12 repos, 858 instances; commercial: 18 repos, 276 instances) creates what the authors call a contamination-resistant testbed. Only the public subset is released openly; the held-out set remains private for future overfitting checks, and the commercial set is used internally to measure performance on truly unseen enterprise-grade code.
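For readers who want to inspect the data, the public subset can be loaded with the Hugging Face datasets library. The dataset identifier, split, and field names in the sketch below are assumptions modeled on the SWE-Bench family’s conventions, not values confirmed by the paper; check the official release for the exact ones.

```python
# Minimal sketch: loading the public SWE-Bench Pro split for inspection.
# The dataset identifier, split, and field names are assumptions based on
# SWE-Bench conventions; consult the official release for the exact values.
from datasets import load_dataset

ds = load_dataset("ScaleAI/SWE-bench_Pro", split="test")  # hypothetical identifier

example = ds[0]
print(example["repo"])               # GPL-licensed repository the task comes from
print(example["problem_statement"])  # human-rewritten issue description
print(example["patch"])              # gold patch the tests were derived from
```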
2. Real-world scale and complexity
SWE-Bench Pro explicitly filters out trivial edits. Every task requires at least 10 changed lines of code and often spans multiple files; the average patch touches 107 lines across 4 files, and some exceed 100 lines, mirroring the scale of professional pull requests. Each repository contributes no more than 100 problems, ensuring diversity across domains: business-to-business platforms, developer tools, and consumer applications with heavy front-end logic.
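As a rough illustration of this kind of size filter, the sketch below counts changed lines and touched files in a unified diff and keeps only patches that meet the 10-line minimum. The threshold mirrors the paper; the diff parsing itself is deliberately simplified.

```python
# Sketch of a trivial-edit filter in the spirit of SWE-Bench Pro's criteria.
# The diff parsing is simplified (renames, binary files, etc. are ignored).
def patch_stats(unified_diff: str) -> tuple[int, int]:
    """Return (changed_lines, touched_files) for a unified diff string."""
    changed_lines = 0
    touched_files = 0
    for line in unified_diff.splitlines():
        if line.startswith("diff --git"):
            touched_files += 1
        elif line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            changed_lines += 1
    return changed_lines, touched_files

def is_substantial(unified_diff: str, min_lines: int = 10) -> bool:
    """Keep only patches with at least `min_lines` changed lines."""
    changed, _ = patch_stats(unified_diff)
    return changed >= min_lines
```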
3. Human-centered augmentation
Unlike earlier automated extractions, each problem in SWE-Bench Pro is human-augmented and verified through a three-stage process. Annotators rewrite vague commit messages into clear problem statements, add requirements that specify expected behavior, and, when necessary, explicitly define class or function interfaces. This prevents the “false negative” pattern where a model implements the right fix under a slightly different name and fails the test.
Every instance also includes two complementary test suites:
fail-to-pass tests, which fail before the patch and pass after it, confirming that the bug or feature was correctly addressed; and
pass-to-pass tests, ensuring the patch does not break existing functionality.
Each suite is run multiple times to filter out flaky tests, and tasks failing reproducibility checks are discarded.
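Conceptually, scoring a candidate patch then reduces to something like the sketch below, where `run_suite` is a hypothetical callable standing in for the benchmark’s real harness (it runs a list of test identifiers inside the task’s container and reports whether they all pass).

```python
from typing import Callable, Sequence

def patch_resolves_task(
    run_suite: Callable[[Sequence[str]], bool],  # hypothetical: runs test IDs, True if all pass
    fail_to_pass: Sequence[str],                 # tests that must flip from failing to passing
    pass_to_pass: Sequence[str],                 # tests that must keep passing
    repeats: int = 3,                            # repeated runs to surface flaky behavior
) -> bool:
    outcomes = []
    for _ in range(repeats):
        outcomes.append(run_suite(fail_to_pass) and run_suite(pass_to_pass))
    if len(set(outcomes)) > 1:
        return False  # non-deterministic result; the paper discards such tasks at build time
    return outcomes[0]
```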
4. Standardized execution environments
To tame the messiness of modern development, each repository runs inside a containerized environment tailored to its language ecosystem (Python virtualenvs, Node.js modules, or Go modules), captured in reproducible Docker images. This prevents subtle dependency errors and allows fair cross-model comparison.
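In practice, this amounts to executing each task’s commands inside a prebuilt image, for example via the Docker CLI as in the sketch below; the image tag and mount path are placeholders, not the benchmark’s actual names.

```python
# Sketch: running a task's test command inside its prebuilt Docker image.
# The image tag and mounted path are placeholders.
import subprocess

def run_in_container(image: str, checkout_dir: str, command: str) -> int:
    """Execute `command` inside `image`, mounting the repo checkout at /repo."""
    completed = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{checkout_dir}:/repo", "-w", "/repo",
         image, "bash", "-lc", command],
        capture_output=True, text=True,
    )
    return completed.returncode

# Example (placeholder image name):
# rc = run_in_container("swebench-pro/openlibrary:py311", "/tmp/checkout", "pytest -q")
```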
Results: Reality Check for Coding Agents
The evaluation used the SWE-Agent framework [7], a general scaffold that lets models interact with a virtual development environment (reading, editing, and running code, then submitting a patch) across up to 200 turns.
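The interaction pattern is essentially an observe-act loop capped by a turn budget. The sketch below is a schematic of that scaffold, not SWE-Agent’s actual code; `model.next_action` and `env.execute` are assumed interfaces.

```python
# Schematic of an SWE-Agent-style loop: the model proposes a shell or editor
# action, the environment returns (truncated) output, and the episode ends when
# the agent submits a patch or the turn budget runs out. The `model` and `env`
# interfaces are assumptions, not real APIs.
MAX_TURNS = 200

def run_episode(model, env, problem_statement: str):
    history = [("task", problem_statement)]
    for _ in range(MAX_TURNS):
        action = model.next_action(history)        # e.g. "open file", "edit", "run tests"
        if action.kind == "submit":
            return env.diff_against_base()         # final patch handed to the evaluator
        observation = env.execute(action)          # run the command in the container
        history.append((action.text, observation[:5000]))  # truncate to protect context
    return None  # budget exhausted without a submission
```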

On the public set, no model crossed 25 percent: GPT-5 resolved 23.3 percent of tasks and Claude Opus 4.1 22.7 percent. The commercial set proved even harsher: GPT-5 dropped to 14.9 percent and Opus 4.1 to 17.8 percent. No system exceeded 20 percent on proprietary enterprise codebases.
For perspective, the same GPT-5 scored above 70 percent on SWE-Bench Verified. In one leap, success rates fell to roughly a third of their former level, re-exposing fundamental weaknesses that easier benchmarks had hidden.
What the Benchmark Reveals
Language and repository variance
Performance correlates strongly with programming language. Python and Go repositories show the highest resolution rates, occasionally above 30 percent, while JavaScript and TypeScript lag behind, reflecting greater ecosystem complexity and noisier dependency graphs. At the repository level, success ranges from single digits to over 50 percent depending on codebase structure and documentation quality.
This unevenness suggests that model “competence” is still brittle and domain-specific. LLMs appear to memorize idioms for well-structured Python packages but falter in messy front-end logic, asynchronous APIs, or build systems with intricate configuration files.

Failure-mode taxonomy: how agents actually fail
To move beyond scores, the authors performed a failure-mode analysis using an LLM-as-a-judge approach. GPT-5 acted as the classifier, reading the final 20 steps of each failed trajectory and categorizing the cause. The findings read like a diagnostic map of machine fallibility:
Semantic or algorithmic misunderstanding dominates for frontier models such as Claude Opus 4.1 (36 percent of failures) and GPT-5 (52 percent). These models usually produce clean, syntactically valid code that simply implements the wrong logic.
Syntax errors remain common for smaller open-source models such as Qwen-3 32B, affecting nearly half of failures.
Context management failures, losing track of which file or function is being edited, cripple mid-tier models like Claude Sonnet 4, which shows 35 percent context overflows and 17 percent “endless file-reading” loops.
Tool-use errors (misusing bash or editor commands) plague open systems without integrated tool interfaces.
Infinite loops and stuck states appear when the agent exhausts context or runs repetitive searches without progress.
Collectively, these trajectories highlight that even top models lack sustained reasoning over long horizons, precisely the capacity real engineers rely on when debugging across modules.
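As a rough sketch of the judging step described above, the snippet below feeds the tail of a failed trajectory to a strong model and asks for a single failure category; the prompt wording, category list, and model name are illustrative rather than the paper’s exact setup.

```python
# Illustrative LLM-as-a-judge classifier for failed agent trajectories.
# Prompt wording, category names, and the model identifier are assumptions.
from openai import OpenAI

CATEGORIES = [
    "semantic_misunderstanding", "syntax_error", "context_overflow",
    "tool_use_error", "infinite_loop_or_stuck",
]

def classify_failure(trajectory_tail: str, client: OpenAI, model: str = "gpt-5") -> str:
    prompt = (
        "You are auditing a failed software-engineering agent run.\n"
        f"Possible failure modes: {', '.join(CATEGORIES)}.\n"
        "Read the final steps below and answer with exactly one category name.\n\n"
        + trajectory_tail
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```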
A Deeper Look: What Makes Long-Horizon Tasks Hard?
Traditional code-completion benchmarks compress reasoning into a few hundred tokens: read the prompt, emit a function. Long-horizon software engineering demands a completely different skill set.
State management. The agent must maintain a mental model of the repository (file hierarchy, class relationships, build dependencies) over hundreds of editing steps.
Context survival. Each command’s output can flood the model’s context window; an unfiltered grep or find may generate thousands of lines, causing context overflow (a minimal output-truncation sketch appears at the end of this section).
Iterative planning. A realistic fix often requires experimentation: inspect tests, hypothesize a cause, patch, re-run, and iterate. Today’s agents rarely plan beyond two or three steps.
Specification ambiguity. Even with human-augmented requirements, many issues hinge on implicit domain knowledge, naming conventions, architectural intent, or performance trade-offs.
Humans navigate these uncertainties through intuition, memory, and discussion. LLMs, operating as stateless text transformers, lack persistent working memory and epistemic humility: they cannot “know what they don’t know.” SWE-Bench Pro effectively exposes that missing meta-cognition.
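On the context-survival point, one minimal, framework-agnostic mitigation is to cap each tool observation before it re-enters the prompt, keeping the head and tail of long output; the limits below are arbitrary illustrative values.

```python
# Minimal mitigation for context overflow: truncate long tool output while
# preserving its beginning and end. The character limits are illustrative.
def truncate_observation(output: str, max_chars: int = 4000, keep_tail: int = 1000) -> str:
    if len(output) <= max_chars:
        return output
    head = output[: max_chars - keep_tail]
    tail = output[-keep_tail:]
    omitted = len(output) - len(head) - len(tail)
    return f"{head}\n... [{omitted} characters omitted] ...\n{tail}"
```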
Implications for the Future of Autonomous Developers
Benchmarks as governance tools
Benchmarks are not just scoreboards; they steer research priorities. SWE-Bench Pro sends a clear signal that engineering, not coding, is the next frontier. It redefines success from solving toy functions to sustaining coherent work across hundreds of interdependent files.
Because it is contamination-resistant, the dataset also serves as a cleaner yardstick for claims of progress. If a model improves on SWE-Bench Pro, we can trust the gain reflects reasoning ability, not memorized commits.
Re-training and architectural directions
The failure analysis hints at three promising research avenues:
Memory-augmented agents. Systems need external memory stores or vector-based retrieval mechanisms to preserve long-term context across edits (a minimal retrieval sketch follows this list).
Hierarchical planning. Instead of flat token-by-token generation, agents could learn multi-scale reasoning: plan at the file level, then implement at the function level.
Integrated toolchains. Tight coupling between model, IDE, and build tools can reduce “tool-use” failures and feedback latency.
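As a minimal realization of the memory-augmented idea, the sketch below indexes file contents with TF-IDF and retrieves the most relevant files for the current sub-task; a production agent would likely swap in learned embeddings and a vector store, but the role is the same.

```python
# Toy repository "memory": index file contents and retrieve the most relevant
# files for a query, so the agent need not hold the whole repo in context.
# TF-IDF stands in for learned embeddings / a vector database.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class RepoMemory:
    def __init__(self, files: dict[str, str]):  # path -> file contents
        self.paths = list(files)
        self.vectorizer = TfidfVectorizer(stop_words="english")
        self.matrix = self.vectorizer.fit_transform([files[p] for p in self.paths])

    def relevant_files(self, query: str, k: int = 5) -> list[str]:
        scores = cosine_similarity(self.vectorizer.transform([query]), self.matrix)[0]
        ranked = sorted(zip(self.paths, scores), key=lambda pair: pair[1], reverse=True)
        return [path for path, _ in ranked[:k]]
```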
Several recent prototypes (OpenHands, AgentCoder, and CodeAct) are exploring such ideas. SWE-Bench Pro offers the crucible to test whether these architectures truly scale.
The human parallel
Perhaps the most striking insight is sociological. Professional engineers rarely work alone; they collaborate, review each other’s code, and use project management frameworks. Current AI agents resemble talented but amnesic interns, capable of writing code but unable to coordinate or remember decisions. The authors of SWE-Bench Pro explicitly propose collaborative development scenarios for future iterations, where multiple agents or human-AI teams tackle shared tasks such as code reviews or merge-conflict resolution. Benchmarks of the future may judge not only correctness but team intelligence.
Limitations of SWE-Bench Pro Itself
No benchmark escapes its own constraints. The authors acknowledge several:
Language coverage. While SWE-Bench Pro includes Python, JavaScript, TypeScript, and Go, it omits major ecosystems such as Java, C++, Rust, and C#. That limits generality.
Narrow task type. All problems are framed as “issue resolution”: bug fixes or feature additions verified by tests. Real-world engineering also involves design decisions, code reviews, documentation, and architectural refactoring, which remain outside scope.
Dependence on tests. The benchmark assumes the existing test suite defines correctness. But software often admits multiple valid solutions; a patch that fixes the bug differently might fail the original tests.
Reduced ambiguity. Human augmentation clarifies tasks so models can focus on implementation, but this sanitization underplays the messy communication loops of real projects.
These caveats mean that SWE-Bench Pro should be viewed not as a final challenge but as a necessary transitional step between synthetic puzzles and live production environments.
Case Study Example: The Open Library Task
An example from the appendix illustrates the benchmark’s realism. The task asks the agent to add Google Books as a metadata fallback source for Open Library’s BookWorm importer. The problem description includes a natural-language motivation, measurable goals, and explicit success criteria, exactly as a real feature ticket would be written. Requirements specify expected behaviors, such as adding "google_books" to a configuration tuple and handling multi-match API responses gracefully.

To solve it, an agent must traverse multiple files (affiliate_server.py, imports.py, promise_batch_imports.py), integrate API calls, manage error handling, and update staging logic, all while keeping existing tests intact. It is the kind of multi-file orchestration that junior developers learn over months, not minutes. On SWE-Bench Pro, tasks of this kind are the norm rather than the exception.
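To give a flavor of the change (emphatically not the actual Open Library code), the configuration and multi-match requirements might reduce to edits of the following shape; every name here is hypothetical or taken only from the task description.

```python
# Hypothetical illustration of the required edit; names are invented or taken
# from the task description, not from the real Open Library codebase.

# Requirement: accept "google_books" as an additional staging source.
STAGED_SOURCES = ("amazon", "google_books")  # previously only ("amazon",)

def pick_google_books_match(matches: list[dict]) -> dict | None:
    """Handle multi-match API responses gracefully: no results yields None,
    several results yield a deterministic choice instead of an error."""
    if not matches:
        return None
    return matches[0]
```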
Toward a New Era of Evaluation
SWE-Bench Pro’s creators argue that as benchmarks like SWE-Bench Verified approach saturation, the field risks an illusion of competence: incremental model upgrades appear impressive on paper but yield little real-world improvement. By resetting the difficulty curve, they hope to “extend the runway” for meaningful progress.
In that sense, SWE-Bench Pro plays a role similar to ImageNet in computer vision circa 2012: a unifying, high-fidelity dataset likely to define the next generation of research agendas. But unlike ImageNet’s closed-world labels, SWE-Bench Pro sits atop living codebases, and it will evolve as those repositories change. The authors’ decision to maintain a private held-out set acknowledges this dynamic, allowing longitudinal measurement of overfitting as models gain exposure to public data.
Beyond Tests: The Road Ahead
The paper’s final section sketches what might follow:
Expanded language and framework coverage. Future versions should incorporate Java, Rust, Kotlin, and modern front-end stacks to broaden representativeness.
Alternative evaluation metrics. Instead of binary pass/fail outcomes, assess maintainability, readability, and architectural soundness: qualities that tests cannot capture.
Security and performance dimensions. Benchmarks could score whether patches introduce vulnerabilities or degrade runtime efficiency.
Collaborative and long-term tasks. Agents could be evaluated on multi-day projects requiring coordination, version-control interactions, and iterative refactoring.
In short, the community is shifting from code correctness to software competence, a subtler, richer concept.
Conclusion
SWE-Bench Pro redefines how we measure intelligence in code. By moving from sanitized tasks to messy, industrially grounded problems, it punctures the illusion that large models have already mastered software engineering. The benchmark’s three pillars (contamination resistance, human-verified complexity, and standardized execution) establish a new baseline for honest evaluation.
Its sobering results (no model above 25 percent success) remind us that coding fluency does not equal engineering ability. GPT-5 and Claude Opus 4.1 may dazzle in demos, but faced with the long horizons of enterprise code, they stumble like novices tracing their first bug through a labyrinth of dependencies.
That humility is precisely what the field needs. SWE-Bench Pro gives researchers a harder mountain to climb, one closer to the peaks where real engineers work. Progress here will not only produce better models but also deepen our understanding of reasoning itself, how systems plan, persist, and learn from failure. In the long run, the benchmark may be remembered less as a dataset and more as a mirror, showing both how far AI has come and how much further the craft of software engineering still extends beyond prediction into the terrain of thought.
References
[1] Deng, X., Da, J., Pan, E., He, Y. Y., Ide, C., Garg, K., ... & Kenstler, B. (2025). SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?. arXiv preprint arXiv:2509.16941.
[2] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. arXiv preprint arXiv:2310.06770.
[3] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., ... & Zaremba, W. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
[4] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., ... & Sutton, C. (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
[5] OpenAI (2024). Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/
[6] Measuring Intelligence: Key Benchmarks and Metrics for LLMs, Transcendent AI
[7] Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., & Press, O. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. Advances in Neural Information Processing Systems, 37, 50528-50652.