Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Analysis of 9,374 agent trajectories shows that failures stem not from patch complexity but from gaps in reasoning.
A new research paper from Tural Mehtiyev and Wesley Assunção provides the first large-scale behavioral analysis of why AI coding agents fail. The study examined 9,374 execution trajectories from 19 distinct agents, formed by pairing 8 coding agent frameworks with 14 different LLMs, across 500 programming tasks. While top-ranked LLM-based coding agents still fail on over 20% of benchmarked problems, the research shows that patch complexity alone doesn't explain difficulty. Strikingly, 12 never-solved tasks required only simple patches and were rated easy by human annotators, yet every agent failed on them, pointing to fundamental gaps in architectural reasoning and domain knowledge.
The study challenges conventional wisdom about agent behavior. The widely reported correlation between longer execution trajectories and failure actually reverses direction once task difficulty is controlled for, revealing task difficulty as a statistical confound: hard tasks produce both longer trajectories and more failures. Instead, successful agents consistently demonstrate specific behavioral patterns: they gather context before editing and invest more effort in validation. Most significantly, the research disentangles LLM capability from framework design, finding that the underlying LLM is the primary driver of both outcomes and behaviors. Agents sharing the same LLM agree on far more tasks than those sharing the same framework, and the performance gap between different frameworks shrinks with each generation of LLM improvement. Framework prompts do influence agent tactics, but that influence diminishes substantially with stronger foundation models.
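The correlation reversal described above follows the classic Simpson's-paradox pattern. A tiny synthetic sketch (all numbers invented for illustration, not taken from the paper) shows how pooling across difficulty levels can flip the sign of a length-failure correlation:

```python
# Synthetic illustration (not the paper's data): within each difficulty
# stratum, longer trajectories are associated with success, but hard tasks
# yield both longer trajectories and more failures, so the pooled
# correlation between length and failure comes out positive.
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# (trajectory_length, failed) pairs per stratum -- hypothetical values.
easy = [(10, 1), (20, 0), (30, 0), (40, 0)]
hard = [(100, 1), (110, 1), (120, 1), (130, 0)]

pooled = easy + hard
lengths = [l for l, _ in pooled]
fails = [f for _, f in pooled]

# Pooled correlation is positive; within-stratum correlations are negative.
print("pooled:", round(pearson(lengths, fails), 2))
for name, stratum in [("easy", easy), ("hard", hard)]:
    ls = [l for l, _ in stratum]
    fs = [f for _, f in stratum]
    print(name + ":", round(pearson(ls, fs), 2))
```

Conditioning on difficulty (running the correlation per stratum) is exactly the "controlled for" step that exposes the confound.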
- Top coding agents still fail on over 20% of tasks, with failures often stemming from gaps in architectural reasoning rather than patch complexity
- Successful agents follow specific behavioral patterns: gathering context before editing and investing in validation, with trajectory structure being more predictive than length
- The underlying LLM drives outcomes more than framework design, with framework performance gaps shrinking as LLMs improve
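The same-LLM vs. same-framework finding rests on a pairwise agreement measure. A minimal sketch of that idea (agent names and outcomes are invented, and the paper's exact metric may differ) could look like:

```python
# Hypothetical sketch: agreement = share of common tasks on which two
# agents produce the same outcome (both solve or both fail).
def agreement(a, b):
    """Fraction of shared tasks where agents a and b agree."""
    common = a.keys() & b.keys()
    return sum(a[t] == b[t] for t in common) / len(common)

# Made-up per-task outcomes (task -> solved?) for three agents:
# two share an LLM, two share a framework.
agent_llm1_fw1 = {"t1": True, "t2": True, "t3": False, "t4": False}
agent_llm1_fw2 = {"t1": True, "t2": True, "t3": False, "t4": True}
agent_llm2_fw1 = {"t1": True, "t2": False, "t3": True, "t4": True}

same_llm = agreement(agent_llm1_fw1, agent_llm1_fw2)        # shares LLM
same_framework = agreement(agent_llm1_fw1, agent_llm2_fw1)  # shares framework
print(same_llm, same_framework)
```

In the paper's data, the same-LLM pairs show markedly higher agreement than the same-framework pairs, which is the basis for attributing behavior primarily to the underlying model.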
Why It Matters
This research offers actionable guidance for developers building coding agents and helps teams decide where to focus improvement efforts: on the underlying model and on behaviors like context gathering and validation, rather than on framework scaffolding alone.