Developer Tools

New AI tool tracks code provenance: SourceTracker detects plagiarism in LLM outputs

LLMs may copy code without credit – SourceTracker's hybrid system catches it at scale.

Deep Dive

Large language models (LLMs) for code generation can reproduce training examples verbatim without attribution, raising serious legal and ethical concerns around plagiarism and license compliance. Traditional fingerprint-based detectors like Winnowing are effective but require linear-time searches over entire training sets, making them impractical for the billion-scale corpora used by modern code LLMs.
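To make the linear-time cost concrete, here is a minimal sketch of Winnowing-style fingerprinting. It is not the detector evaluated in the paper. The whitespace normalization, the CRC32 hash, and the `k` and `w` values are illustrative assumptions, and `linear_scan` is a hypothetical helper name.

```python
import zlib

def normalize(text: str) -> str:
    """Drop all whitespace so formatting changes do not affect the hashes (assumption)."""
    return "".join(text.split())

def winnow(text: str, k: int = 5, w: int = 4) -> set[int]:
    """Return the set of minimum k-gram hashes, one per window of w consecutive hashes."""
    data = normalize(text).encode("utf-8")
    # CRC32 stands in for the rolling hash a production implementation would use.
    hashes = [zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def containment(query_fp: set[int], doc_fp: set[int]) -> float:
    """Fraction of the query's fingerprints that also occur in the document."""
    return len(query_fp & doc_fp) / len(query_fp) if query_fp else 0.0

def linear_scan(query: str, corpus: list[str]) -> list[tuple[int, float]]:
    """Compare the query against every snippet: the cost grows linearly with the corpus."""
    q_fp = winnow(query)
    scores = [(i, containment(q_fp, winnow(doc))) for i, doc in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Every query touches every snippet in `linear_scan`, which is the step that stops scaling once the corpus reaches billions of entries.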

To address this, researchers from the University of Bologna and other institutions present SourceTracker, a 300M-parameter encoder optimized for code retrieval, paired with a hybrid two-stage pipeline called HybridSourceTracker (HST). HST first uses vector search to narrow the corpus down to a set of candidate snippets, then re-ranks those candidates with exact Winnowing fingerprints. Evaluated on a 10M-snippet subset of TheStackV2, the system matches Winnowing's mean reciprocal rank on 30-token fragments and surpasses it by 5.4% on fragments of 60 or more tokens, while keeping search complexity logarithmic rather than linear. That combination makes provenance tracking both scalable and precise enough to check the originality of LLM-generated code.
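The two-stage idea described above can be sketched as follows. This is an illustration under stated assumptions, not the HST implementation. `embed` is a toy hashed-trigram vector standing in for the SourceTracker encoder, stage 1 uses a brute-force similarity scan where a real system would need an index-based nearest-neighbor search to get sublinear lookups (the source does not say which), and all function names and parameters are hypothetical.

```python
import math
import zlib

DIM = 256  # toy embedding size (assumption)

def normalize(text: str) -> str:
    return "".join(text.split())

def embed(text: str) -> list[float]:
    """Toy stand-in for a learned encoder: hashed character-trigram counts, L2-normalized."""
    vec = [0.0] * DIM
    data = normalize(text).encode("utf-8")
    for i in range(len(data) - 2):
        vec[zlib.crc32(data[i:i + 3]) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def fingerprints(text: str, k: int = 5, w: int = 4) -> set[int]:
    """Winnowing-style fingerprints: the minimum k-gram hash of each window of w hashes."""
    data = normalize(text).encode("utf-8")
    hashes = [zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def hybrid_search(query: str, corpus: list[str], top_k: int = 3) -> list[int]:
    """Return corpus indices: vector search picks top_k, fingerprints re-rank them."""
    q_vec = embed(query)
    q_fp = fingerprints(query)

    # Stage 1: keep the top_k snippets by cosine similarity (vectors are unit length).
    def similarity(i: int) -> float:
        return sum(a * b for a, b in zip(q_vec, embed(corpus[i])))
    candidates = sorted(range(len(corpus)), key=similarity, reverse=True)[:top_k]

    # Stage 2: re-rank only the candidates by exact fingerprint overlap.
    def overlap(i: int) -> float:
        return len(q_fp & fingerprints(corpus[i])) / len(q_fp) if q_fp else 0.0
    return sorted(candidates, key=overlap, reverse=True)
```

Because the exact fingerprint comparison runs only on `top_k` candidates, its cost no longer depends on the size of the corpus. The sort in stage 2 is stable, so candidates with equal overlap keep their vector-search order.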

Key Points
  • SourceTracker is a 300M-parameter encoder tailored for efficient code retrieval from large corpora.
  • HybridSourceTracker (HST) achieves logarithmic-time search complexity, a major speedup over linear-time fingerprinting.
  • On code fragments of 60+ tokens, HST outperforms traditional Winnowing by up to 5.4% in mean reciprocal rank.
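The comparison with Winnowing is reported as mean reciprocal rank (MRR). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition. The function name and the input layout are assumptions, not the paper's evaluation code.

```python
def mean_reciprocal_rank(rankings: list[list[int]], targets: list[int]) -> float:
    """Average of 1/rank of the true source across queries (0 when it is not retrieved)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked_ids, target in zip(rankings, targets):
        for position, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == target:
                total += 1.0 / position
                break
    return total / len(rankings)

# Three queries: true source ranked 1st, ranked 2nd, and missing from the results.
print(mean_reciprocal_rank([[3, 1, 2], [5, 4], [7, 8]], [3, 4, 9]))  # 0.5
```

An MRR of 1.0 means the true source snippet is always returned first. Lower values mean it tends to appear further down the list or not at all.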

Why It Matters

Scalable provenance tracking helps ensure attribution and license compliance for AI-generated code, both of which matter for open-source integrity and legal safety.