Developer Tools

SWE-Cycle benchmark reveals 3x drop in AI code agents' full-cycle solve rates

A new benchmark tests agents on bare repositories and finds that solve rates on isolated tasks and on the full cycle differ sharply.

Deep Dive

A team of researchers from Shanghai Jiao Tong University, Meituan, and other institutions released SWE-Cycle, a rigorous benchmark designed to measure how autonomous code agents handle the entire issue resolution lifecycle. Unlike existing benchmarks that test agents in pre-configured environments with static evaluation pipelines, SWE-Cycle forces agents to work from scratch on a bare repository. The benchmark includes 489 carefully filtered instances and evaluates agents across three isolated sub-tasks—environment reconstruction, code implementation, and verification test generation—plus a fourth FullCycle task that combines all three. To address the unreliability of static parsers on complex autonomous trajectories, the team also developed SWE-Judge, an execution-capable evaluation agent that merges static code review with dynamic testing to accurately verify functional correctness.
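As a rough illustration of the evaluation idea described above, a verdict that merges static review with dynamic testing might be sketched as follows. The names here (`Verdict`, `judge`, `static_review`, `run_dynamic_tests`) are hypothetical and do not come from SWE-Judge; the rule that both signals must pass is also a simplifying assumption, since the summary does not say how the two are weighed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    """Outcome of one evaluation (hypothetical structure, not SWE-Judge's)."""
    static_ok: bool   # did the code review find the change acceptable?
    dynamic_ok: bool  # did executing the tests confirm the behaviour?

    @property
    def resolved(self) -> bool:
        # Assumption: an instance counts as resolved only if both checks pass.
        return self.static_ok and self.dynamic_ok


def judge(
    patch: str,
    static_review: Callable[[str], bool],
    run_dynamic_tests: Callable[[str], bool],
) -> Verdict:
    """Run both checks on a patch and combine them into one verdict."""
    return Verdict(
        static_ok=static_review(patch),
        dynamic_ok=run_dynamic_tests(patch),
    )


# Usage with stand-in checkers: the review passes but the tests fail.
verdict = judge("example patch", lambda p: True, lambda p: False)
print(verdict.resolved)  # False
```

The point of the sketch is only that a patch which looks right on review can still be rejected once it is executed, which is the failure a purely static parser cannot catch.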

When the researchers tested code agents powered by six state-of-the-art LLMs, they observed a sharp drop in solve rates when moving from the isolated tasks to the FullCycle task. The results highlight critical bottlenecks: agents struggle with cross-phase dependencies (e.g., setting up an environment that works with later code), with maintaining code quality across the full pipeline, and with recovering from errors without human scaffolding. SWE-Cycle and SWE-Judge provide the first comprehensive framework that measures end-to-end autonomy, exposing gaps that standard benchmarks miss. The findings suggest that current AI coding assistants are far from reliable for unsupervised software maintenance and need significant improvement in planning, debugging, and integration skills.
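One way to build intuition for why an end-to-end rate can fall well below the isolated rates: if the FullCycle task requires every phase to succeed and phase outcomes were independent, the end-to-end rate would be the product of the per-phase rates. Independence is a simplifying assumption made here, not a claim from the benchmark, and the numbers below are invented for illustration; they are not SWE-Cycle results.

```python
from math import prod
from typing import Iterable


def full_cycle_rate(phase_rates: Iterable[float]) -> float:
    """Estimate the end-to-end solve rate as the product of per-phase rates.

    Assumes every phase must succeed and that phases fail independently.
    """
    return prod(phase_rates)


# Made-up per-phase solve rates, for illustration only.
hypothetical_rates = {
    "environment reconstruction": 0.7,
    "code implementation": 0.6,
    "test generation": 0.8,
}

estimate = full_cycle_rate(hypothetical_rates.values())
print(f"{estimate:.3f}")  # 0.336
```

Even with each phase succeeding more often than not, the combined estimate is lower than any single phase. Real agents may do better or worse than this estimate, because an early mistake such as a broken environment can make later phases fail together.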

Key Points
  • SWE-Cycle benchmark contains 489 instances across 3 isolated tasks (environment reconstruction, code implementation, verification test generation) plus a FullCycle end-to-end task on bare repos.
  • SWE-Judge combines static code review with dynamic testing to reduce the measurement errors that static parsers make on complex agent trajectories.
  • Six state-of-the-art LLMs tested: solve rates plummet from isolated tasks to FullCycle, revealing severe cross-phase dependency bottlenecks.

Why It Matters

Real-world software engineering requires more than isolated coding: agents must handle the full issue cycle autonomously, and this benchmark suggests they are not yet ready.