Developer Tools

LLMs Fail Self-Review: 39.7% Semantic Drift in Code Modernization

New study finds 31.7% of broken model outputs are silently endorsed by the same model.

Deep Dive

A new preprint (arXiv:2605.21537) by Reddy, Lolla, and Sanku documents a fundamental weakness in using large language models for legacy code modernization. The team ran 1,980 real modernization calls across 11 production LLMs from 7 distinct families on a balanced 60-snippet corpus of legacy Python 2 code. Each output was checked with a type-strict behavioral oracle to detect silent changes in observable behavior. The results are sobering: semantic drift (a change in observable behavior) occurred in 39.7% of attempts on deliberately tricky snippets, compared with just 7.0% on benign control code that required no real modernization, a gap of 32.7 percentage points.
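To make "type-strict behavioral oracle" concrete, here is a minimal Python 3 sketch of the idea. It is not the study's oracle: the helper names (type_strict_equal, drifted), the legacy snippet, and the recorded Python 2 outputs are all hypothetical, chosen only to show how a check on both value and type can catch a change that a plain equality check lets through.

```python
def type_strict_equal(expected, actual):
    """True only if values AND concrete types match, recursing into containers."""
    if type(expected) is not type(actual):
        return False
    if isinstance(expected, (list, tuple)):
        return len(expected) == len(actual) and all(
            type_strict_equal(e, a) for e, a in zip(expected, actual)
        )
    if isinstance(expected, dict):
        return expected.keys() == actual.keys() and all(
            type_strict_equal(expected[k], actual[k]) for k in expected
        )
    return expected == actual


# Hypothetical legacy snippet: `def half(n): return n / 2` under Python 2,
# where "/" on two ints is floor division. Inputs mapped to recorded outputs.
PY2_RECORDED = {10: 5, 7: 3}


def half_naive(n):
    # Drifted port: "/" is true division in Python 3 and always returns a float.
    return n / 2


def half_faithful(n):
    # Faithful port: "//" keeps the Python 2 integer floor-division behavior.
    return n // 2


def drifted(candidate, recorded):
    """Return the inputs on which the candidate's output differs from the record."""
    return [
        arg for arg, want in recorded.items()
        if not type_strict_equal(want, candidate(arg))
    ]


if __name__ == "__main__":
    print(drifted(half_naive, PY2_RECORDED))     # inputs where the naive port drifts
    print(drifted(half_faithful, PY2_RECORDED))  # empty list if behavior is preserved
    print(half_naive(10) == 5)                   # a loose value check hides the int -> float change
```

In this sketch the naive port is flagged on both inputs, even though half_naive(10) == 5 evaluates to True under ordinary equality, because the result is now a float rather than an int.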

More troubling is the finding that self-review by the producing model is not a reliable safety net. Across all semantic drift cases, 31.7% were silently endorsed by the same model that produced them, with per-model miss rates ranging from 0% on five models to 100% on one widely deployed model. Several models explicitly articulated the very Python 2 vs Python 3 semantic distinction that broke their output, and then still declared the behavior preserved. The study also found that drift rate is non-monotone in model capability and price, indicating that the failure is task-structural rather than driven by model scale. All code, prompts, and raw outputs have been released for community verification.
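The miss-rate figures above can be read as a simple ratio: of the outputs where the oracle detected drift, the fraction that the producing model's self-review declared behavior-preserving. The sketch below assumes that definition. The record layout, the function name, and the toy records are hypothetical and are not data from the study.

```python
from collections import defaultdict


def self_review_miss_rates(records):
    """Per-model share of drifted outputs that self-review endorsed.

    records: iterable of (model, drifted, endorsed) tuples, where `drifted`
    is the oracle's verdict and `endorsed` means the model's own review
    declared behavior preserved. Models with no drift cases are omitted.
    """
    drift = defaultdict(int)
    missed = defaultdict(int)
    for model, drifted, endorsed in records:
        if drifted:
            drift[model] += 1
            if endorsed:
                missed[model] += 1
    return {model: missed[model] / drift[model] for model in drift}


# Toy records for illustration only (NOT results from the paper).
TOY_RECORDS = [
    ("model_a", True, False),
    ("model_a", True, False),
    ("model_a", False, True),   # no drift, so it does not count toward the miss rate
    ("model_b", True, True),
    ("model_b", True, True),
    ("model_c", True, True),
    ("model_c", True, False),
]

if __name__ == "__main__":
    print(self_review_miss_rates(TOY_RECORDS))
```

Computing the ratio only over drifted outputs matters: a model that endorses everything looks fine on correct outputs and only shows a 100% miss rate once the denominator is restricted to the cases where behavior actually changed.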

Key Points
  • 39.7% semantic drift on trap snippets vs 7.0% on benign control code across 1,980 modernization calls
  • 31.7% of broken outputs were silently endorsed by the producing model during self-review
  • Drift rates per model ranged from 5.6% to 46.7% and were non-monotone in model capability and price

Why It Matters

For developers using LLMs to modernize code, self-review is unreliable. Manual validation remains essential, however confident the model's reasoning sounds.