Developer Tools

A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

GPT-4o drops 4.1% and Llama-3.1 drops 16% on transformed benchmarks...

Deep Dive

A new study from researchers at TU Delft introduces a metamorphic testing (MT) method, combined with negative log-likelihood (NLL), to diagnose memorization in LLM-based automated program repair (APR). The team constructed variant benchmarks by applying semantics-preserving transformations to two widely used datasets, Defects4J and GitBug-Java, then evaluated seven LLMs on both the original and transformed versions. Patch-generation success dropped on the transformed benchmarks, in some cases substantially: GPT-4o fell 4.1%, while Llama-3.1 dropped 15.98%. The degradation correlates strongly with NLL on the original benchmarks, indicating that models perform better on instances they have likely memorized.
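
To make the NLL signal concrete, here is a minimal sketch of computing the mean per-token NLL of a code snippet under a HuggingFace causal LM. The model choice, helper name, and Java snippets are illustrative assumptions, not the paper's actual pipeline; the idea is simply that memorized text tends to score a lower NLL than a semantically equivalent variant.

```python
# Sketch only: mean per-token NLL as a memorization signal.
# "gpt2" is a stand-in; swap in the model under study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # illustrative placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_nll(code: str) -> float:
    """Mean per-token negative log-likelihood of `code`.

    Lower values suggest the model has seen very similar text
    during pretraining (a memorization hint, not proof)."""
    ids = tokenizer(code, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, HF shifts the targets internally and
        # returns the mean cross-entropy, i.e. the mean NLL.
        loss = model(ids, labels=ids).loss
    return loss.item()

original = 'if (index < 0) { throw new IllegalArgumentException("neg"); }'
variant = 'if (idx < 0) { throw new IllegalArgumentException("neg"); }'
print(mean_nll(original), mean_nll(variant))
```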

The findings strengthen the evidence of data leakage in APR evaluations: LLMs memorize bug fixes seen during pretraining rather than learning generalizable repair logic. The study suggests that metamorphic testing on its own can help mitigate these effects, offering a more reliable evaluation framework. This is critical as LLM-based APR tools gain adoption in software engineering, where inflated performance estimates could mislead developers and researchers about real-world capabilities.
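
The evaluation idea reduces to a simple comparison, sketched below under loose assumptions. `generate_patch` and `tests_pass` are hypothetical placeholders for a real APR pipeline and test runner; the point is only the shape of the metamorphic check, not the study's implementation.

```python
# Sketch: the gap in repair success between original bugs and their
# semantics-preserving variants as a rough memorization indicator.
from typing import Callable

def memorization_gap(
    bugs: list[str],
    transform: Callable[[str], str],          # semantics-preserving rewrite
    generate_patch: Callable[[str], str],     # hypothetical APR call
    tests_pass: Callable[[str], bool],        # hypothetical test oracle
) -> float:
    """Success rate on originals minus success rate on variants.

    A large positive gap hints that the model is reproducing memorized
    fixes rather than applying general repair logic."""
    orig_ok = sum(tests_pass(generate_patch(b)) for b in bugs)
    var_ok = sum(tests_pass(generate_patch(transform(b))) for b in bugs)
    return (orig_ok - var_ok) / len(bugs)
```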

Key Points
  • GPT-4o's repair success dropped 4.1% on transformed benchmarks; Llama-3.1's fell 15.98%
  • Method combines metamorphic testing with negative log-likelihood (NLL) to detect memorization
  • Tested on Defects4J and GitBug-Java datasets with semantics-preserving code transformations (a toy example follows this list)
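
For intuition, here is a toy semantics-preserving transformation in the spirit of the benchmark variants: consistently renaming an identifier. Real transformation tools rewrite the AST and skip strings and comments; this regex version is only a sketch, and the Java snippet is invented for illustration.

```python
# Toy semantics-preserving transformation: rename one known identifier.
import re

def rename_identifier(java_src: str, old: str, new: str) -> str:
    """Rename every standalone occurrence of `old` to `new`.

    \\b word boundaries avoid touching substrings of longer
    identifiers; an AST-based rewrite would be the robust approach."""
    return re.sub(rf"\b{re.escape(old)}\b", new, java_src)

buggy = """
int findMax(int[] values) {
    int max = values[0];
    for (int i = 1; i < values.length; i++) {
        if (values[i] > max) max = values[i];
    }
    return max;
}
"""
print(rename_identifier(buggy, "values", "elements"))
```

Because the rename changes surface form but not behavior, a model that truly learned repair logic should fix the variant about as often as the original.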

Why It Matters

Exposes inflated LLM repair benchmarks, forcing more rigorous evaluation before real-world deployment.