
AgentLens finds 10.7% of passing SWE-agent trajectories are 'Lucky Passes'

arXiv cs.SE May 14, 2026

⚡ New framework shows that roughly one in ten passing SWE-agent runs succeeds by luck, not skill.

Deep Dive

A new paper from researchers including Priyam Sahoo introduces AgentLens, a framework that exposes a flaw in how software engineering (SWE) agents are evaluated: the 'Lucky Pass' problem. Current evaluation checks only whether a patch passes the tests, so a chaotic trial-and-error run scores the same as a principled solution. The team analyzed 2,614 trajectories from OpenHands across eight model backends on 60 SWE-bench Verified tasks. Of the 1,815 trajectories that passed (about 69% of the total), 10.7% exhibited Lucky Pass behavior, characterized by regression cycles, blind retries, missing verification, or exploration and implementation steps occurring out of order. The framework uses Prefix Tree Acceptor references and a context-sensitive intent labeler to classify each action as Exploration, Implementation, Verification, or Orchestration.
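
To make the idea of process-level signals concrete, here is a minimal sketch that flags three of the behaviors named above from a sequence of intent-labeled steps. It is an illustration only, not the AgentLens implementation: the `Step` type, the `lucky_signals` function, the signal names, and the specific heuristics are all assumptions made for this example, and it does not use Prefix Tree Acceptor references.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent action. Both fields are hypothetical, chosen for this sketch."""
    intent: str  # "Exploration", "Implementation", "Verification", or "Orchestration"
    action: str  # raw command or edit text


def lucky_signals(steps: List[Step]) -> List[str]:
    """Return heuristic warning signals for a single trajectory.

    These rules are simplified stand-ins for the behaviors the article
    describes; the paper's actual definitions may differ.
    """
    signals: List[str] = []
    intents = [s.intent for s in steps]

    # Missing verification: nothing is verified after the last code change.
    if "Implementation" in intents:
        last_impl = len(intents) - 1 - intents[::-1].index("Implementation")
        if "Verification" not in intents[last_impl + 1:]:
            signals.append("missing_verification")

    # Blind retry: the exact same action is issued twice in a row.
    if any(a.action == b.action for a, b in zip(steps, steps[1:])):
        signals.append("blind_retry")

    # Disordered process: code is changed before any exploration happens.
    if "Implementation" in intents:
        first_impl = intents.index("Implementation")
        if "Exploration" not in intents[:first_impl]:
            signals.append("implementation_before_exploration")

    return signals
```

A trajectory that explores, edits, and then runs the tests yields an empty list, while one that edits twice with the same command and never verifies triggers all three signals. A real classifier would need richer context than adjacent-step comparisons.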

Lucky Pass rates ranged from 0.5% to 23.2% across the eight models, and ranking models by quality score instead of pass rate moved some models by up to five positions. This suggests that popular leaderboards may overrate agents that stumble into passing patches rather than build them reliably. AgentLens introduces three quality tiers (Lucky, Solid, and Ideal) and further decomposes Lucky Passes into five recurring failure mechanisms. The team has released AgentLens-Bench, a dataset of 1,815 annotated trajectories, along with an SDK for process-level evaluation. For organizations deploying AI coding agents, the takeaway is that pass rate is an incomplete metric: process quality also matters for trust and reliability.
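
The rank-shift comparison can be sketched as follows. This is a hedged illustration of the general technique (rank by one metric, rank by another, take the difference), not the paper's scoring code. The function names `ranks` and `rank_shifts`, the model names, and all numbers in the usage note are invented for the example, and how AgentLens computes its quality score is not specified here.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its 1-based rank, highest score first.

    Ties are broken alphabetically by model name so the result is deterministic.
    """
    ordered = sorted(scores, key=lambda model: (-scores[model], model))
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shifts(pass_rate: Dict[str, float], quality: Dict[str, float]) -> Dict[str, int]:
    """Positions gained (positive) or lost (negative) when ranking by quality.

    Both dictionaries are assumed to cover the same set of models.
    """
    by_pass = ranks(pass_rate)
    by_quality = ranks(quality)
    return {model: by_pass[model] - by_quality[model] for model in pass_rate}
```

With made-up inputs such as pass rates `{"model_a": 0.80, "model_b": 0.75, "model_c": 0.70}` and quality scores `{"model_a": 0.60, "model_b": 0.72, "model_c": 0.65}`, `model_a` drops from first to third while the other two each move up one place. The largest absolute value in the result is the kind of maximum shift the article reports.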

Key Points

10.7% of passing SWE-agent trajectories are 'Lucky Passes' achieved through trial-and-error rather than principled reasoning
Lucky rates vary from 0.5% to 23.2% across eight model backends, causing rank shifts of up to five positions when using quality scores
AgentLens-Bench dataset includes 1,815 annotated trajectories with quality tiers (Lucky, Solid, Ideal) and five recurring Lucky Pass mechanisms

Why It Matters

Pass rates alone mislead; process quality is crucial for reliable AI coding agents.
