AI Safety

SWE-Bench Pro is even worse

OpenAI's recommended coding benchmark fails basic audits with broken tests and inflated requirements.

Deep Dive

A new audit by Jonathan Gabor reveals that SWE-Bench Pro, the coding benchmark recently recommended by OpenAI to replace SWE-Bench Verified, contains severe methodological flaws that make it unreliable for evaluating AI coding assistants. The audit examined 100 randomly sampled problems and found widespread issues: test leniency, where core functionality is never verified; test cases that require incorrect implementations; and requirements inflation, where specifications include untested implementation details. OpenAI had justified switching benchmarks by noting that SWE-Bench Verified contained broken tasks, but this analysis shows the replacement is fundamentally compromised.

The most critical findings include the flipt-cd2f3b0 problem, where tests expected the wrong boolean output, meaning an AI that implemented the correct logic would fail. Another issue, in NodeBB-a91721, showed that tests did not verify the core requirement of registration without an email address. Gabor suggests these problems stem from SWE-Bench Pro automatically scraping GitHub commits that modify test cases, without verifying their correctness or completeness. The result is a benchmark where passing often requires implementing bugs or ignoring untested requirements, making it useless for measuring true coding capability, just as major AI companies are racing to improve their programming agents.
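The flipt flaw class described above can be sketched in a few lines. This is a hypothetical illustration, not flipt's actual code or the benchmark's real test: the function name and values are invented to show how a test encoding an inverted boolean expectation would fail a correct implementation.

```python
# Illustrative sketch of the flaw class reported in flipt-cd2f3b0.
# An "is not one of" operator should return True exactly when the
# value is absent from the given list. Names here are hypothetical,
# not flipt's real API.

def is_not_one_of(value, options):
    """Correct semantics: True iff value is NOT in options."""
    return value not in options

# A correct implementation behaves like this:
print(is_not_one_of("us", ["uk", "fr"]))  # value absent  -> True
print(is_not_one_of("uk", ["uk", "fr"]))  # value present -> False

# A benchmark test asserting the inverted boolean, e.g.
#   assert is_not_one_of("us", ["uk", "fr"]) is False
# would only pass if the model shipped the bug, not the correct logic.
```

In other words, the only way to satisfy such a test is to invert the operator, which is exactly the "tests that require incorrect implementations" category the audit identifies.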

Key Points
  • Audit of 100 SWE-Bench Pro problems found 3 major flaw categories: test leniency, incorrect tests, and requirements inflation
  • In flipt-cd2f3b0, tests required an incorrect implementation of the IsNotOneOf operator; correct AI solutions would fail
  • OpenAI recently recommended SWE-Bench Pro to replace SWE-Bench Verified, despite these fundamental reliability issues

Why It Matters

Flawed benchmarks mislead AI development and investment, potentially slowing progress in creating reliable coding assistants.