Developer Tools

Code Fingerprints: Disentangled Attribution of LLM-Generated Code

A new system can trace code snippets back to their source LLM—DeepSeek, Claude, Qwen, or ChatGPT—with high reliability.

Deep Dive

A research team led by Jiaxun Guo has published a paper titled 'Code Fingerprints: Disentangled Attribution of LLM-Generated Code,' introducing a method for a critical problem in modern software engineering. As LLMs like DeepSeek, Claude, Qwen, and ChatGPT become ubiquitous for code generation, they create challenges for governance, accountability, and compliance. Current tools can only distinguish whether code is machine-generated, but practical scenarios—such as tracing a security vulnerability to a specific model or conducting a licensing audit—require knowing exactly which LLM was the source. This research addresses that gap by proposing model-level code attribution.

The team's solution is the Disentangled Code Attribution Network (DCAN), a system that uses contrastive learning to separate a code snippet's semantic meaning from its model-specific stylistic 'fingerprint.' These fingerprints arise from differences in training data, model architecture, and decoding strategies. To train and evaluate DCAN, the researchers built the first large-scale benchmark dataset containing code generated by the four aforementioned LLMs across Python, Java, C, and Go. The results demonstrate that DCAN achieves reliable multi-class attribution performance, proving the feasibility of tracing code provenance. The dataset and implementation are publicly available, paving the way for integrating this technology into developer tools and compliance workflows.
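The paper does not spell out DCAN's loss function here, but the contrastive idea it describes—pulling style embeddings of same-model snippets together while pushing different-model embeddings apart—can be sketched with a supervised contrastive (SupCon-style) objective. Everything below is illustrative: the embedding vectors, label ids, and temperature are assumptions, not the authors' implementation.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss over hypothetical per-snippet style embeddings.

    embeddings: (N, D) array of L2-normalized style vectors (assumed).
    labels:     (N,) array of source-model ids (e.g. 0=DeepSeek, 1=Claude, ...).
    Lower loss means same-model snippets cluster together in style space.
    """
    sim = embeddings @ embeddings.T / temperature            # pairwise cosine / temp
    np.fill_diagonal(sim, -np.inf)                           # exclude self-pairs
    # log-softmax over each row's similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # positives: other snippets from the same source model
    same = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    return -np.mean(log_prob[same])

# Toy check: two snippets per "model", clustered along two directions.
emb = np.array([[1.0, 0.05], [1.0, -0.05], [0.05, 1.0], [-0.05, 1.0]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
clustered = supervised_contrastive_loss(emb, np.array([0, 0, 1, 1]))
scrambled = supervised_contrastive_loss(emb, np.array([0, 1, 0, 1]))
```

With correctly grouped labels the loss is lower than with scrambled ones, which is the training signal that would shape a style "fingerprint" space while a separate branch (not shown) absorbs semantic content.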

Key Points
  • Introduces DCAN, a network that disentangles semantic meaning from model-specific stylistic fingerprints for accurate code attribution.
  • Built and tested on a novel benchmark with code from 4 LLMs (DeepSeek, Claude, Qwen, ChatGPT) across 4 programming languages.
  • Solves practical needs for software governance, including vulnerability triage, incident investigation, and licensing compliance audits.

Why It Matters

Enables accountability and traceability in software development, allowing teams to audit AI-generated code for security, licensing, and compliance.