Developer Tools

Examining LLMs' Ability to Summarize Code Through Mutation Analysis

New mutation-analysis method shows LLM code summaries still describe intent, not actual behavior: GPT-4 got only 49.3% of summaries right, and even GPT-5.2 reaches just 85.3%.

Deep Dive

A team from Carnegie Mellon University has published a groundbreaking study introducing a mutation-analysis methodology to rigorously evaluate the reliability of LLM-generated code summaries. The core problem identified is that models like GPT-4 and GPT-5.2 often produce confident summaries of what code *should* do (its intent), while failing to detect subtle logic changes, edge cases, or bugs that define its actual behavior.
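
To make the failure mode concrete, here is a minimal sketch of the mutation-analysis idea in Python. The free_shipping function and the single relational-operator mutation are illustrative assumptions, not the study's actual programs or operators; the point is that an intent-level summary ("returns whether an order qualifies for free shipping") fits both versions equally well, while executing the boundary input exposes the behavioral change.

    import ast

    # Toy subject program (hypothetical, not from the paper).
    ORIGINAL = (
        "def free_shipping(total):\n"
        "    return total >= 50\n"
    )

    class RelationalMutator(ast.NodeTransformer):
        # Textbook mutation operator: weaken '>=' to '>'.
        def visit_Compare(self, node):
            node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op
                        for op in node.ops]
            return node

    mutant_src = ast.unparse(RelationalMutator().visit(ast.parse(ORIGINAL)))

    ns_orig, ns_mut = {}, {}
    exec(ORIGINAL, ns_orig)
    exec(mutant_src, ns_mut)

    # The same intent-level summary describes both versions, yet their
    # behavior diverges on the boundary case -- the kind of discrepancy
    # the mutation-summary evaluations are designed to surface:
    print(ns_orig["free_shipping"](50))  # True  (original: 50 >= 50)
    print(ns_mut["free_shipping"](50))   # False (mutant:   50 > 50)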

The researchers validated their approach through three experiments totaling 624 mutation-summary evaluations. On 12 controlled synthetic programs with 324 mutations, summary accuracy plummeted from 76.5% for single functions to just 17.3% for complex multi-threaded systems. Testing 150 mutated samples from 50 human-written Python programs confirmed these failure patterns, with GPT-4 achieving an overall summary accuracy of only 49.3%. A direct comparison with the newer GPT-5.2 showed a substantial leap to 85.3% accuracy, with GPT-5.2 demonstrating an improved ability to identify mutations as 'bugs.' However, both models still struggled to distinguish critical implementation details from standard algorithmic patterns.

This work establishes mutation analysis as the first systematic framework for assessing whether AI-generated documentation truly matches code logic. For software engineering, it reveals a significant trust gap: developers relying on LLM summaries for documentation, testing, or code review may miss critical behavioral changes. The findings suggest that while model capabilities are improving rapidly, current systems cannot be fully trusted for safety-critical code analysis without human verification of behavioral accuracy.

Key Points
  • Mutation-analysis method tested 624 scenarios: on synthetic programs, accuracy fell from 76.5% (single functions) to 17.3% (multi-threaded systems).
  • GPT-5.2 showed major gains (85.3% accuracy vs. GPT-4's 49.3%) but still confuses intent with actual behavior.
  • Models describe algorithmic patterns rather than detecting specific logic changes, creating reliability risks for documentation and review.

Why It Matters

Developers relying on AI for code summaries may miss critical bugs, as models often describe intent instead of actual program behavior.