Models & Releases

Why we no longer evaluate SWE-bench Verified

Key AI coding benchmark found to suffer from training-data leakage and incorrect tests, prompting a shift to a new standard.

Deep Dive

The maintainers of SWE-bench, a widely used benchmark for evaluating AI coding capabilities, have declared the 'SWE-bench Verified' dataset fundamentally flawed and will stop evaluating models on it. The decision stems from a technical analysis that surfaced two major issues: widespread training-data contamination and a significant number of incorrect test cases. Training leakage occurs when models such as GPT-4o or Claude 3.5 Sonnet were exposed to the benchmark's problem-solution pairs during training, artificially inflating their scores. At the same time, the benchmark contained tests with incorrect expected outputs, meaning a model that produced the right code could still 'fail,' skewing results downward.
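
Leakage of this kind is typically screened for with simple overlap checks between benchmark items and training text. The sketch below is a minimal, hypothetical illustration of such a check; the 13-gram window, the 0.5 threshold, and the function names are assumptions for exposition, not SWE-bench's actual procedure.

```python
# Illustrative sketch only: a naive word-level n-gram overlap check of the
# kind used to flag potential training-data contamination. The window size
# and threshold below are assumptions, not SWE-bench tooling.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 13) -> float:
    """Fraction of the benchmark item's n-grams that also appear in a training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

def is_contaminated(benchmark_item: str, training_docs: list[str],
                    threshold: float = 0.5) -> bool:
    """Flag a task as potentially leaked if any training document shares
    a large fraction of its long n-grams."""
    return any(overlap_ratio(benchmark_item, doc) >= threshold for doc in training_docs)
```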

This contamination problem is particularly acute for frontier models, whose massive training datasets make it nearly impossible to guarantee they haven't seen the test problems. The team's findings suggest that published performance claims on SWE-bench Verified—often cited in model cards and research papers—may not reflect true reasoning ability, but rather memorization. This undermines the benchmark's core purpose: to measure an AI's ability to solve novel, real-world software issues pulled from GitHub.

In response, the team is pivoting the community to 'SWE-bench Pro,' a new, more rigorous evaluation designed to mitigate these flaws. SWE-bench Pro employs stricter contamination controls and more robust verification of test correctness. For AI developers and researchers, this means recalibrating how they measure progress. Models must now be assessed from a clean slate, which could reshuffle leaderboards and provide a truer picture of which AI agents are genuinely advancing in complex code generation and repair.
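
One common contamination control is to hold out tasks whose source issues were created after a model's training-data cutoff. The sketch below shows what such a filter can look like; the cutoff dates, task fields, and helper names are illustrative assumptions, not SWE-bench Pro's actual methodology.

```python
# Illustrative sketch: keep only benchmark tasks whose underlying GitHub issue
# postdates the evaluated model's training-data cutoff. Cutoff dates and the
# Task schema here are assumptions for illustration, not SWE-bench Pro's design.
from dataclasses import dataclass
from datetime import date

@dataclass
class Task:
    repo: str
    issue_id: int
    created: date  # when the underlying GitHub issue was opened

# Hypothetical per-model training cutoffs.
TRAINING_CUTOFFS = {
    "gpt-4o": date(2023, 10, 1),
    "claude-3-5-sonnet": date(2024, 4, 1),
}

def uncontaminated_tasks(tasks: list[Task], model: str) -> list[Task]:
    """Return only the tasks created after the given model's training cutoff."""
    cutoff = TRAINING_CUTOFFS[model]
    return [t for t in tasks if t.created > cutoff]
```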

Key Points
  • SWE-bench Verified contains training data leakage, allowing models to memorize solutions rather than demonstrate reasoning.
  • The benchmark includes incorrect test cases that penalize accurate code, further distorting performance measurements.
  • The maintainers recommend SWE-bench Pro as a new, more rigorous standard for evaluating AI coding agents.

Why It Matters

Accurate benchmarks are essential for tracking real AI progress in software engineering; flawed metrics mislead developers and investors.