Research & Papers

GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution

A new multimodal framework fuses code style and binary artifacts to identify which LLM wrote a piece of code.

Deep Dive

A team of researchers led by Nitin Choudhury has introduced GoCoMA, a multimodal framework designed to solve the growing forensic challenge of attributing code to its generative source, whether that is a specific Large Language Model (LLM) such as GPT-4 or Claude 3.5, or a human author. The system combines two distinct data streams: high-level code stylometry (capturing structural and stylistic patterns) and low-level image representations of binary pre-executable artifacts (BPEA), which encode execution-oriented byte semantics shaped by compilers.
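Turning a compiled artifact into an image typically means reinterpreting its raw bytes as grayscale pixels. The paper's exact BPEA pipeline is not detailed here, so the sketch below is a hypothetical, minimal version of that common preprocessing step (the `width` parameter and zero-padding choice are assumptions):

```python
import numpy as np

def bytes_to_image(blob: bytes, width: int = 64) -> np.ndarray:
    """Reshape a binary artifact's raw bytes into a fixed-width grayscale image.

    Each byte becomes one pixel intensity (0-255), so compiler-shaped byte
    patterns show up as visual texture a vision encoder can learn from.
    """
    arr = np.frombuffer(blob, dtype=np.uint8)
    pad = (-len(arr)) % width          # zero-pad so the bytes fill whole rows
    arr = np.pad(arr, (0, pad))
    return arr.reshape(-1, width)      # shape: (n_rows, width)
```

A 130-byte artifact at `width=16`, for example, would pad to 144 bytes and yield a 9×16 image.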

GoCoMA's technical innovation lies in its use of hyperbolic geometry. It projects embeddings from these two modalities into a hyperbolic Poincaré ball, fuses them using a novel geodesic-cosine similarity-based cross-modal attention (GCSA) mechanism, and then back-projects the fused representation to Euclidean space for final classification. Hyperbolic space is well suited to modeling hierarchical relationships, here between the abstract stylistic features and the concrete binary data. In experiments on the CoDET-M4 and LLMAuthorBench benchmarks, GoCoMA consistently outperformed existing unimodal and Euclidean multimodal baselines, demonstrating superior accuracy in identifying the LLM author of a given code snippet.
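The project-fuse-back-project pipeline can be sketched with the standard origin-based exponential/logarithmic maps of the Poincaré ball. Note this is a simplified stand-in: the softmax over negative geodesic distance below approximates the spirit of the paper's GCSA attention, not its exact geodesic-cosine formulation, and the tangent-space aggregation and residual combination are assumptions:

```python
import numpy as np

def exp_map_0(v, c=1.0, eps=1e-9):
    # Exponential map at the origin: projects a Euclidean (tangent-space)
    # vector into the Poincaré ball of curvature -c.
    n = np.linalg.norm(v, axis=-1, keepdims=True) + eps
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def log_map_0(x, c=1.0, eps=1e-9):
    # Logarithmic map at the origin: back-projects a ball point to Euclidean space.
    n = np.linalg.norm(x, axis=-1, keepdims=True) + eps
    return np.arctanh(np.clip(np.sqrt(c) * n, 0.0, 1.0 - eps)) * x / (np.sqrt(c) * n)

def poincare_dist(x, y, c=1.0, eps=1e-9):
    # Geodesic distance between points in the Poincaré ball.
    diff2 = np.sum((x - y) ** 2, axis=-1)
    denom = (1 - c * np.sum(x**2, -1)) * (1 - c * np.sum(y**2, -1)) + eps
    return np.arccosh(np.maximum(1 + 2 * c * diff2 / denom, 1.0)) / np.sqrt(c)

def fuse(style_emb, bpea_emb, c=1.0):
    # Hypothetical cross-modal fusion: map both modalities into the ball,
    # attend each style token over BPEA tokens using geodesic distance,
    # then log-map back to Euclidean space for the classifier head.
    xs = exp_map_0(style_emb, c)                      # (n, d) style tokens
    xb = exp_map_0(bpea_emb, c)                       # (m, d) BPEA tokens
    d = poincare_dist(xs[:, None, :], xb[None, :, :], c)  # (n, m)
    w = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)  # attention weights
    # Weighted tangent-space aggregation plus a residual style stream.
    return log_map_0(xs, c) + w @ log_map_0(xb, c)    # (n, d) Euclidean output
```

The key property illustrated: `exp_map_0` always lands strictly inside the unit ball, and `log_map_0` inverts it, so the fused features returned to Euclidean space remain compatible with an ordinary linear classifier.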

Key Points
  • Fuses code stylometry and binary artifact images using hyperbolic geometry (Poincaré ball).
  • Outperforms previous methods on CoDET-M4 and LLMAuthorBench benchmarks for LLM code attribution.
  • Addresses critical forensic needs for security audits and licensing clarity in AI-generated code.

Why It Matters

Provides a forensic tool to audit code for security vulnerabilities, licensing issues, and intellectual property disputes stemming from AI generation.