AI Safety

Toward a Better Evaluations Ecosystem

SWE-bench Verified results differ by company due to varying trial counts, tools, and subsets.

Deep Dive

Model evaluations are fundamentally broken, argues Benjamin Arnav on LessWrong. Numbers that appear side by side as evidence of progress are rarely comparable, because methodologies shift between companies and even between releases. For instance, Anthropic's SWE-bench Verified setup went from 489 tasks with three tools for Claude 3.7 Sonnet, to the full 500 tasks without reasoning for Claude 4, to reasoning enabled with 25 trials for Opus 4.6, and finally back to 5 trials for Opus 4.7. OpenAI, while more consistent, has used a 477-task subset since o3-mini, averaging pass@1 over 4 trials, and then retired the benchmark. Google ran Gemini 3 on the full dataset with 10 trials and a bash tool, a file-operations tool, and a submit tool. Similar inconsistencies plague other benchmarks: on GPQA, Anthropic switched between 5 and 10 trials; on AIME, OpenAI cut rollouts from 64 to 8 and later added tools. This lack of comparability undermines both safety decisions and progress tracking.
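To see why these knobs move the headline number, here is a minimal sketch of how a pass@1 score averaged over several trials is computed. The per-task solve probabilities, task counts, and trial counts below are hypothetical stand-ins rather than any lab's actual harness; a real evaluation runs the model on each task instead of drawing from a fixed probability.

```python
import random

def pass_at_1_averaged(per_task_solve_prob, n_trials, seed=0):
    """Mean single-attempt success rate over the task set, averaged across
    n_trials independent runs. Each attempt is modelled as a Bernoulli draw,
    a simplification of what a real evaluation harness does."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(n_trials):
        solved = sum(rng.random() < p for p in per_task_solve_prob)
        run_scores.append(solved / len(per_task_solve_prob))
    return sum(run_scores) / len(run_scores)

# Hypothetical per-task solve rates; none of these numbers are real results.
full_set = [0.70] * 500  # the full 500-task SWE-bench Verified set
subset = [0.72] * 477    # a 477-task subset that happens to skew slightly easier (illustrative only)

print(f"500 tasks, 10 trials: {pass_at_1_averaged(full_set, 10):.1%}")
print(f"500 tasks,  1 trial : {pass_at_1_averaged(full_set, 1):.1%}")
print(f"477 tasks,  4 trials: {pass_at_1_averaged(subset, 4):.1%}")
```

In this toy model the underlying capability is identical in the first two lines, yet the single-trial estimate is far noisier, and the easier subset reports a higher number for the same model: three figures that look comparable on a leaderboard were produced by three different procedures.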

Arnav argues that this situation is untenable for a high-stakes industry. Every other sector—from finance to aviation—solved this by removing measurements from the hands of the companies being measured, shifting to independent third-party auditors. While methodological changes may be justified by model-specific tooling, reporting standards must be transparent and consistent. Until the AI community adopts standardized, audited benchmarks, the public and regulators will continue to be misled by numbers that don't mean what they appear to mean. The solution is clear: create an evaluations ecosystem where independent auditors run all models under identical conditions, publish methodologies openly, and provide comparable results.

Key Points
  • Anthropic changed SWE-bench Verified methodology across 5 releases: different task subsets (489 vs 500), tools (bash, file editor, planning), trial counts (1, 5, 25), and reasoning on/off (see the uncertainty sketch after this list).
  • OpenAI has used a 477-task subset since o3-mini (pass@1 averaged over 4 trials), while Google ran Gemini 3 on the full dataset with 10 trials, making cross-company comparisons invalid.
  • GPQA and AIME suffer from the same drift: Anthropic switched trial counts (5 to 10), and OpenAI cut rollouts from 64 to 8 and added tools without adjusting its reporting.
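A rough sense of what those differing trial counts mean for the reported numbers comes from the sampling uncertainty of a pass@1 score. Treating every task attempt as an independent coin flip with a hypothetical 70% solve rate (a deliberate simplification) gives:

```python
import math

def score_stderr(p, n_tasks, n_trials):
    """Approximate standard error of a pass@1 score averaged over trials,
    modelling each task attempt as an independent Bernoulli(p) draw."""
    return math.sqrt(p * (1 - p) / (n_tasks * n_trials))

p = 0.70  # hypothetical per-task solve rate
for n_tasks, n_trials in [(489, 1), (500, 5), (477, 4), (500, 25)]:
    half_width = 1.96 * score_stderr(p, n_tasks, n_trials)
    print(f"{n_tasks} tasks x {n_trials:>2} trials: ±{half_width:.1%} (95% CI half-width)")
```

Under this toy model, a single-trial score carries roughly five times the uncertainty of a 25-trial average, so two headline figures can differ by a few points without reflecting any real capability gap.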

Why It Matters

Inconsistent evals distort safety decisions and progress claims; trustworthy benchmarks require independent auditors running models under identical, openly published conditions.