New AI tool tracks code provenance: SourceTracker detects plagiarism in LLM outputs
LLMs may copy code without credit – SourceTracker's hybrid system catches it at scale.
Large language models (LLMs) for code generation can reproduce training examples verbatim without attribution, raising serious legal and ethical concerns around plagiarism and license compliance. Traditional fingerprint-based detectors like Winnowing are effective but require linear-time searches over entire training sets, making them impractical for the billion-scale corpora used by modern code LLMs.
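To make the scaling problem concrete, here is a minimal sketch of Winnowing-style fingerprinting followed by a linear scan. The k-gram length `k`, window size `w`, the MD5-based hash, the containment score, and the function names are all illustrative assumptions, not details taken from the paper.

```python
import hashlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens (hash choice is an assumption)."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def winnow(tokens, k=5, w=4):
    """Keep the minimum hash of every window of w consecutive k-gram hashes."""
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def containment(query_fp, doc_fp):
    """Fraction of the query's fingerprints that also occur in the document."""
    if not query_fp:
        return 0.0
    return len(query_fp & doc_fp) / len(query_fp)


def linear_scan(query_tokens, corpus, k=5, w=4):
    """Compare the query against every document: cost grows linearly with corpus size."""
    query_fp = winnow(query_tokens, k, w)
    scores = [containment(query_fp, winnow(doc, k, w)) for doc in corpus]
    return max(range(len(corpus)), key=lambda i: scores[i])


# Hypothetical toy corpus of pre-tokenized snippets.
corpus = [[f"{name}{i}" for i in range(40)] for name in ("a", "b", "c")]
fragment = corpus[1][10:25]
best = linear_scan(fragment, corpus)  # index of the best-matching snippet
```

Every query touches every document in `linear_scan`, which is the cost that becomes prohibitive at billion-snippet scale.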
To address this, researchers from the University of Bologna and others present SourceTracker, a 300M-parameter encoder optimized for code retrieval, paired with a hybrid two-stage pipeline called HybridSourceTracker (HST). HST first uses vector search to narrow the set of candidate snippets, then re-ranks those candidates using exact Winnowing fingerprints. Evaluated on a 10M-snippet subset of TheStackV2, the system achieves a mean reciprocal rank on par with Winnowing for 30-token fragments and surpasses it by 5.4% for fragments of 60 or more tokens, while keeping search complexity logarithmic. This enables the scalable, high-precision provenance tracking needed to verify the originality of LLM-generated code.
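The two-stage idea can be sketched as follows. This is an assumption-heavy illustration, not the authors' implementation: `embed` is a toy hashed bag-of-trigrams stand-in for the learned SourceTracker encoder, stage 1 uses brute-force cosine similarity where a real system would presumably use an approximate nearest-neighbor index to get sublinear search, and `top_k`, `k`, and `w` are arbitrary values.

```python
import hashlib
import math


def _hash(text):
    """64-bit hash of a string (MD5 is an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def embed(tokens, dim=1024, n=3):
    """Toy stand-in for a learned encoder: hashed token trigrams, L2-normalized."""
    vec = [0.0] * dim
    for i in range(max(len(tokens) - n + 1, 1)):
        vec[_hash(" ".join(tokens[i:i + n])) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def fingerprints(tokens, k=5, w=4):
    """Winnowing-style fingerprints: minimum k-gram hash in each window of w hashes."""
    hashes = [_hash(" ".join(tokens[i:i + k])) for i in range(len(tokens) - k + 1)]
    if len(hashes) <= w:
        return {min(hashes)} if hashes else set()
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def hybrid_search(query, corpus, top_k=3):
    """Stage 1: vector search for candidates. Stage 2: re-rank by fingerprint overlap."""
    q_vec = embed(query)
    # Brute-force cosine similarity; vectors are already unit length.
    sims = [sum(a * b for a, b in zip(q_vec, embed(doc))) for doc in corpus]
    candidates = sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:top_k]

    q_fp = fingerprints(query)

    def overlap(i):
        return len(q_fp & fingerprints(corpus[i])) / len(q_fp) if q_fp else 0.0

    # Exact fingerprint comparison runs only on the short candidate list.
    return sorted(candidates, key=overlap, reverse=True)


# Hypothetical toy corpus of pre-tokenized snippets.
corpus = [[f"{name}{i}" for i in range(40)] for name in "abcde"]
ranking = hybrid_search(corpus[3][5:25], corpus)
```

The design point is that the expensive exact comparison is applied to `top_k` candidates rather than the whole corpus, so overall cost is dominated by the vector search stage.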
- SourceTracker is a 300M-parameter encoder tailored for efficient code retrieval from large corpora.
- HybridSourceTracker (HST) achieves logarithmic-time search complexity, a major speedup over linear-time fingerprinting.
- On code fragments of 60+ tokens, HST outperforms traditional Winnowing by up to 5.4% in mean reciprocal rank.
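For readers unfamiliar with the metric behind these comparisons, mean reciprocal rank (MRR) averages 1/rank of the correct source snippet over all queries. The sketch below uses made-up rankings purely to show the arithmetic; the function name and the convention of scoring a missing target as 0 are assumptions.

```python
def mean_reciprocal_rank(rankings, targets):
    """Average of 1/rank of the correct item; a query whose target is missing scores 0."""
    if not targets:
        return 0.0
    total = 0.0
    for ranking, target in zip(rankings, targets):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(targets)


# Made-up example: the target is ranked 1st, 2nd, and not retrieved at all.
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
targets = [2, 2, 5]
mrr = mean_reciprocal_rank(rankings, targets)  # (1 + 0.5 + 0) / 3 = 0.5
```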
Why It Matters
Scalable provenance tracking helps ensure attribution and license compliance for AI-generated code, which is crucial for open-source integrity and legal safety.