UCSC-NLP at SemEval-2026 Task 13: Multi-View Generalization and Diagnostic Analysis of Machine-Generated Code Detection
Multi-view training generalizes to unseen languages, but a 221:1 class imbalance cripples attribution
Researchers at UCSC-NLP submitted their system to SemEval-2026 Task 13, which covers multilingual machine-generated code detection across two subtasks. For Subtask A (binary detection of human- vs. AI-written code), they fine-tuned UniXcoder-base with a multi-view training framework that combines domain-specific structural prefixes, delexicalization with a symmetric KL consistency loss, token dropout, and mixed-content augmentation to learn generator-invariant representations. The system scored 0.993 macro F1 on validation and 0.845 macro F1 on a test set that includes code from unseen languages and domains, demonstrating strong generalization.
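The delexicalization-plus-consistency idea can be sketched as a symmetric KL term that pushes the model toward the same class distribution whether it sees the original code or a view with identifiers masked out. A minimal NumPy sketch follows; the function name and the plain-logits interface are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def symmetric_kl_consistency(logits_orig, logits_delex, eps=1e-12):
    """Symmetric KL divergence between the class distributions predicted
    from the original code view and its delexicalized view.

    Both arguments are (batch, n_classes) logit arrays. Returns the mean
    of 0.5 * (KL(p||q) + KL(q||p)) over the batch.
    """
    p = softmax(logits_orig)
    q = softmax(logits_delex)
    kl_pq = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    kl_qp = np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=-1)
    return 0.5 * float(np.mean(kl_pq + kl_qp))
```

In training, this term would be added to the classification loss so that renaming identifiers (a cheap evasion tactic) cannot move the model's prediction.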
Subtask B required attributing code to one of 10 specific LLM families or to a human author. Here the team uncovered a severe class imbalance: human code made up 88.4% of the data, a 221:1 majority-to-minority ratio. Standard fine-tuning reached 88.4% accuracy but collapsed to a macro F1 of just 0.086, meaning it effectively never identified any AI model. A class-weighted extension trained for only 3 epochs recovered macro F1 to 0.345, a 301% relative improvement. This stark result shows that accuracy alone is deceptive under imbalance and that attribution requires imbalance-aware training.
- Binary detection achieved 0.993 macro F1 on validation and 0.845 on test set across unseen languages/domains using multi-view training.
- Multi-class attribution faced 221:1 majority-to-minority ratio (88.4% human code), causing standard fine-tuning to collapse to 0.086 macro F1.
- Class-weighted training for 3 epochs recovered macro F1 to 0.345 (+301% relative), showing that imbalance-aware methods are essential.
- The system fine-tunes UniXcoder-base with structural prefixes, delexicalization, symmetric KL consistency loss, token dropout, and mixed-content augmentation.
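The class-weighted fix from the bullets above typically means scaling each example's loss by the inverse frequency of its class, so the 10 rare LLM classes contribute as much gradient as the dominant human class. A minimal sketch, assuming inverse-frequency weights normalized to mean 1; the paper may use a different weighting scheme:

```python
import numpy as np

def inverse_frequency_weights(y, n_classes):
    """Per-class weights inversely proportional to class frequency,
    normalized so the weights average to 1 across classes."""
    counts = np.bincount(y, minlength=n_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # guard against empty classes
    return counts.sum() / (n_classes * counts)

def weighted_cross_entropy(logits, y, weights):
    # Stable log-softmax, then weight each example's NLL by its class weight.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(y)), y]
    return float(np.mean(weights[y] * nll))
```

At a 221:1 ratio, the minority classes end up weighted a couple of hundred times more heavily than the human class, which is what lets a short 3-epoch run move macro F1 at all.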
Why It Matters
As LLM-generated code proliferates, detecting and attributing AI-written code is critical for academic integrity and software security, but class imbalance must be addressed before attribution can work reliably.