AgentLens finds 10.7% of passing SWE-agent trajectories are 'Lucky Passes'
New framework shows that about one in ten passing SWE-agent runs succeeds by luck, not skill.
A new paper from researchers including Priyam Sahoo introduces AgentLens, a framework that exposes a critical flaw in how we evaluate software engineering (SWE) agents: the 'Lucky Pass' problem. Current evaluation relies solely on whether a patch passes tests, so a chaotic trial-and-error run scores the same as a principled solution. The team analyzed 2,614 trajectories from OpenHands across eight model backends on 60 SWE-bench Verified tasks. Among the 1,815 passing trajectories, 10.7% exhibited Lucky Pass behavior, characterized by regression cycles, blind retries, missing verification, or exploration and implementation steps occurring out of order. The framework uses Prefix Tree Acceptor references and a context-sensitive intent labeler to classify each action as Exploration, Implementation, Verification, or Orchestration.
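To make the idea concrete, here is a minimal sketch of how three of the signals named above (missing verification, out-of-order phases, blind retries) could be flagged from a sequence of intent labels. It is an illustration only, not the paper's method: the `Intent` enum, the `lucky_signals` function, the signal names, and the `retry_threshold` parameter are all hypothetical, and the simple heuristics stand in for the Prefix Tree Acceptor comparison. Regression cycles are left out because detecting them would need test outcomes, which this sketch does not model.

```python
from __future__ import annotations

from enum import Enum


class Intent(Enum):
    """The four action classes named in the article."""
    EXPLORATION = "E"
    IMPLEMENTATION = "I"
    VERIFICATION = "V"
    ORCHESTRATION = "O"


def lucky_signals(intents: list[Intent], retry_threshold: int = 2) -> set[str]:
    """Return hypothetical Lucky Pass signals for one labeled trajectory."""
    signals: set[str] = set()

    # Missing verification: the agent never checked its own work.
    if Intent.VERIFICATION not in intents:
        signals.add("missing_verification")

    # Out-of-order phases: code was edited before any exploration happened.
    first_impl = next(
        (i for i, x in enumerate(intents) if x is Intent.IMPLEMENTATION), None
    )
    first_expl = next(
        (i for i, x in enumerate(intents) if x is Intent.EXPLORATION), None
    )
    if first_impl is not None and (first_expl is None or first_impl < first_expl):
        signals.add("disordered_phases")

    # Blind retries: repeated verify -> edit transitions with no exploration
    # in between. Orchestration steps are ignored for this check.
    core = [x for x in intents if x is not Intent.ORCHESTRATION]
    run = 0
    longest = 0
    for prev, cur in zip(core, core[1:]):
        if prev is Intent.VERIFICATION and cur is Intent.IMPLEMENTATION:
            run += 1
            longest = max(longest, run)
        elif cur is Intent.EXPLORATION:
            run = 0
    if longest >= retry_threshold:
        signals.add("blind_retries")

    return signals
```

Under these assumptions, a trajectory that returns an empty set would be a candidate for the higher quality tiers, while any non-empty set would mark a passing run for closer inspection.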
Across the eight models, Lucky Pass rates ranged from 0.5% to 23.2%, and ranking models by quality score instead of pass rate shifted positions by up to five ranks. This suggests that popular leaderboards may overrate agents that stumble into passing patches rather than build them reliably. AgentLens introduces three quality tiers (Lucky, Solid, and Ideal) and further decomposes Lucky Passes into five recurring failure mechanisms. The team has released AgentLens-Bench, a dataset of 1,815 annotated trajectories, along with an SDK for process-level evaluation. For organizations deploying AI coding agents, this work underscores that pass rate is an incomplete metric; process quality is essential for trust and reliability.
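The rank-shift comparison can be sketched in a few lines. The helper names `ranks` and `rank_shifts` are hypothetical, and the example scores further down are invented placeholders rather than figures from the paper; the article does not say how the quality score itself is computed, so it is treated here as a given number per model.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank (1 = best) under the given scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_shifts(
    pass_rate: dict[str, float], quality: dict[str, float]
) -> dict[str, int]:
    """Positive values mean the model moves up when ranked by quality."""
    by_pass = ranks(pass_rate)
    by_quality = ranks(quality)
    return {model: by_pass[model] - by_quality[model] for model in pass_rate}


# Placeholder numbers for three made-up models, for illustration only.
example_pass_rate = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70}
example_quality = {"model_a": 0.55, "model_b": 0.68, "model_c": 0.66}
example_shifts = rank_shifts(example_pass_rate, example_quality)
```

In this toy example the model with the highest pass rate drops two places once quality is taken into account, which is the kind of reordering the article describes.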
- 10.7% of passing SWE-agent trajectories are 'Lucky Passes' achieved through trial-and-error rather than principled reasoning
- Lucky rates vary from 0.5% to 23.2% across eight model backends, causing rank shifts of up to five positions when using quality scores
- AgentLens-Bench dataset includes 1,815 annotated trajectories with quality tiers (Lucky, Solid, Ideal) and five recurring Lucky Pass mechanisms
Why It Matters
Pass rates alone mislead; process quality is crucial for reliable AI coding agents.