AI Safety

Are LLMs not getting better?

Models pass tests but struggle to produce mergeable code; merge rates have been flat since early 2025.

Deep Dive

A recent LessWrong post by kqr examines METR's study of LLM programming performance, focusing on the gap between code that passes tests and code that is actually mergeable. METR found that models succeed 50% of the time on test-passing tasks that take humans about 50 minutes, but when success means code a maintainer would merge, that horizon shrinks to about 8 minutes. The author reanalyzes the merge-rate data over time, fitting linear, piecewise constant, and constant models and comparing them by Brier score, where lower is better. The constant model scores best (0.0100), followed by the piecewise constant model (0.0117) and the linear trend (0.0129). This suggests merge rates have been flat since early 2025, contradicting claims of rapid advancement.
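For readers who want the mechanics, below is a minimal Python sketch of this kind of Brier-score model comparison. Everything in it is an illustrative assumption rather than the post's actual setup: the synthetic data with a flat ~10% merge rate, the six-month breakpoint for the piecewise model, and the half-and-half train/test split that keeps the more flexible models from winning by overfitting.

```python
# Hypothetical sketch of a Brier-score comparison between trend models for
# merge rates. Data and fitting choices are illustrative assumptions, not
# METR's or the post author's actual setup.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: release time in months since early 2025, plus a binary
# outcome per attempt (1 = code merged by a maintainer, 0 = not merged).
months = rng.uniform(0, 12, size=400)
merged = (rng.random(400) < 0.10).astype(float)  # flat ~10% merge rate

# Fit on one half, score on the other half (held out).
train, test = months[:200], months[200:]
y_train, y_test = merged[:200], merged[200:]

def brier(pred, obs):
    """Brier score: mean squared error between predicted probabilities
    and 0/1 outcomes. Lower is better."""
    return float(np.mean((pred - obs) ** 2))

# Candidate trend models for P(merge) over time.
constant = np.full_like(test, y_train.mean())

early = y_train[train < 6].mean()   # assumed six-month breakpoint
late = y_train[train >= 6].mean()
piecewise = np.where(test < 6, early, late)

slope, intercept = np.polyfit(train, y_train, 1)      # least-squares line
linear = np.clip(slope * test + intercept, 0.0, 1.0)  # keep probs in [0, 1]

for name, pred in [("constant", constant),
                   ("piecewise constant", piecewise),
                   ("linear", linear)]:
    print(f"{name:>18}: Brier = {brier(pred, y_test):.4f}")
```

On data with no real trend, the constant model typically posts the lowest out-of-sample Brier score, since the extra parameters of the other models only fit noise; that is the shape of the result the post reports.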

The post notes that METR's tested models (Sonnet, Opus, and GPT-5) may not reflect the latest flagship improvements. Commenters argue that Opus 4.5/4.6 and the shift toward agentic coding could change the slope, but no hard data exists yet. The author warns of a gap between buzz and measured performance, as seen throughout 2025, raising the question of whether LLMs are genuinely improving or whether the usual metrics are misleading. The analysis underscores the need for grounded benchmarks like merge rates to assess real-world AI capability.

Key Points
  • LLM merge rates (the rate at which model-written code is approved by maintainers) have not improved since early 2025, per a reanalysis of METR data.
  • A constant function model (Brier 0.0100) fits merge rate data better than linear growth (0.0129), indicating stagnation.
  • METR's 50%-success time horizon is about 50 minutes when success means passing tests, but only about 8 minutes when success means mergeable code, a roughly 6x gap.

Why It Matters

Flat merge rates challenge the narrative of rapid LLM coding progress, urging caution in AI adoption.