Developer Tools

TRACER framework detects code LLM contamination with 91% F1 score

New semantic-aware tool catches data leakage in code models at three levels.

Deep Dive

Data contamination, where training data overlaps with evaluation benchmarks, is a known threat to LLM reliability, but it remains underexplored for code models. Traditional detection methods often miss subtle, non-exact duplicates. To address this, researchers Yifeng Di, Xuliang Huang, and Tianyi Zhang developed TRACER, a semantic-aware framework that detects contamination at three levels of granularity: Functionally Identical, Nearly Identical, and Shared Logic. The pipeline works coarse to fine: it first filters candidate matches, then classifies the kind of overlap for each one. This lets evaluators distinguish harmless code reuse from genuine benchmark leakage.
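The coarse-to-fine idea can be illustrated with a minimal Python sketch. This is not TRACER's implementation: the article does not describe how its filter or classifier work, so the token-level Jaccard filter, the 0.5 threshold, and every name here (`coarse_filter`, `toy_classifier`, `detect`) are assumptions made for illustration. Only the three level names come from the article. The fine-grained step is passed in as a function so that a semantic classifier could be swapped in for the lexical stand-in.

```python
import re
from enum import Enum
from typing import Callable, List, Set, Tuple


class Level(Enum):
    # The three contamination levels named in the article.
    FUNCTIONALLY_IDENTICAL = "Functionally Identical"
    NEARLY_IDENTICAL = "Nearly Identical"
    SHARED_LOGIC = "Shared Logic"


def tokens(code: str) -> Set[str]:
    # Crude lexer: identifiers/keywords, integer literals, single punctuation marks.
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard(a: str, b: str) -> float:
    # Token-set overlap in [0, 1]; two empty snippets count as identical.
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def coarse_filter(item: str, corpus: List[str], threshold: float = 0.5) -> List[str]:
    # Coarse stage (assumed): cheaply discard training samples that are
    # lexically far from the benchmark item.
    return [sample for sample in corpus if jaccard(item, sample) >= threshold]


def toy_classifier(item: str, candidate: str) -> Level:
    # Fine stage, toy stand-in only. It is purely lexical, so it cannot
    # judge semantics; a real system would need a semantic classifier here.
    if item.split() == candidate.split():
        return Level.FUNCTIONALLY_IDENTICAL
    if jaccard(item, candidate) >= 0.8:
        return Level.NEARLY_IDENTICAL
    return Level.SHARED_LOGIC


def detect(
    item: str,
    corpus: List[str],
    classify: Callable[[str, str], Level] = toy_classifier,
    threshold: float = 0.5,
) -> List[Tuple[str, Level]]:
    # Run the coarse filter, then label each surviving candidate.
    return [(c, classify(item, c)) for c in coarse_filter(item, corpus, threshold)]


if __name__ == "__main__":
    benchmark_item = "def add(a, b):\n    return a + b"
    training_corpus = [
        "def add(a, b):\n    return a + b",
        "def greet(name):\n    print('hello', name)",
    ]
    for candidate, level in detect(benchmark_item, training_corpus):
        print(level.value, "<-", repr(candidate))
```

The toy classifier would label a copy with renamed variables as Shared Logic, even though it behaves identically to the original. Purely lexical checks miss exactly this kind of non-exact duplicate.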

TRACER was tested across multiple LLM backbones, with GPT-5 achieving an F1 score of 0.91 for fine-grained detection. In the binary setting (contaminated vs. not contaminated), it scored 0.92 F1, surpassing existing methods by 42% to 217%. The researchers also created the first benchmark for fine-grained code contamination, built from three widely used evaluation benchmarks and three representative post-training datasets. Ablation studies confirmed that each component contributes to the result. The tool gives AI practitioners a way to audit code LLMs, helping ensure that evaluations reflect true capability rather than data leakage.
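For readers less familiar with the metric, the sketch below shows how an F1 score is computed from detection counts and how a relative gain over a baseline is calculated. The counts and baseline values are hypothetical, not taken from the paper. The article also does not say whether the 42% to 217% figures are relative or absolute improvements, so `relative_gain` shows only the relative reading.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    # F1 is the harmonic mean of precision and recall.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, baseline: float) -> float:
    # Fractional improvement of `new` over `baseline` (0.5 means +50%).
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical confusion counts, chosen only to illustrate the formula.
    print(f"F1 = {f1_score(tp=92, fp=8, fn=8):.2f}")
    # Hypothetical scores, not figures reported for any real method.
    print(f"relative gain = {relative_gain(0.6, 0.4):.0%}")
```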

Key Points
  • TRACER models code contamination at three semantic levels: Functionally Identical, Nearly Identical, and Shared Logic.
  • Achieves F1 of 0.92 in binary detection, outperforming existing methods by 42%–217%.
  • Introduces the first dedicated benchmark for fine-grained code contamination detection; TRACER itself was evaluated with GPT-5 and other LLM backbones.

Why It Matters

Helps make code LLM evaluation more reliable by catching subtle data leakage that exact-match checks miss.