FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals
CPU-only training and near-instant inference rival large models for code origin detection.
A team from Sofia University, led by Elitsa Yotkova and Violeta Kastreva, has developed a computationally efficient method for detecting LLM-generated code as part of SemEval-2026 Task 13. The task challenges participants to identify machine-written code across multiple programming languages and unseen domains. Their approach, focused on Subtask A (binary classification), bypasses the need for large GPU clusters by relying on stylometric signals—ratio-based features that are robust to variable snippet lengths. The system uses parsing engines and a programming-language classifier to extract descriptiveness-related signals, plus a separate code-vs-text line classifier to filter natural language segments embedded in samples. A shallow decision tree combined with heuristic rules from data analysis produces final predictions, all on CPU-only hardware.
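To make the idea of length-insensitive, ratio-based stylometric features concrete, here is a minimal sketch. The specific features and their names are illustrative assumptions, not the authors' actual feature set; the point is that every value is a ratio or mean normalized by line or token counts, so it stays comparable across snippets of very different sizes.

```python
import re

def stylometric_ratios(code: str) -> dict:
    """Hypothetical ratio-based stylometric features.

    Each feature is normalized by line or token count, so the values are
    insensitive to the raw length of the snippet.
    """
    lines = code.splitlines()
    n_lines = max(len(lines), 1)
    # Crude identifier/keyword tokenization (illustrative, not a real parser).
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    n_tokens = max(len(tokens), 1)
    return {
        "comment_line_ratio": sum(1 for l in lines if l.lstrip().startswith("#")) / n_lines,
        "blank_line_ratio": sum(1 for l in lines if not l.strip()) / n_lines,
        "mean_identifier_length": sum(len(t) for t in tokens) / n_tokens,
        "snake_case_ratio": sum(1 for t in tokens if "_" in t) / n_tokens,
    }

sample = "def add_two(a, b):\n    # add numbers\n    return a + b\n"
feats = stylometric_ratios(sample)
```

Because every output is a ratio, doubling the snippet by repeating it leaves the features essentially unchanged, which is the property that makes such signals robust to variable snippet lengths.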
The key advantage is near-instant inference, making it suitable for real-time or resource-constrained environments. While many teams lean on massive pretrained models (like CodeBERT or GPT-based encoders), this work demonstrates that carefully engineered feature sets can achieve competitive performance without heavy computational costs. The methodology is especially relevant for organizations that need to audit code origins rapidly—such as detecting AI-generated submissions in plagiarism checks, verifying code provenance in CI/CD pipelines, or filtering synthetic code in open-source repositories. By focusing on interpretable stylometric patterns rather than black-box embeddings, the system also offers transparency into why a snippet is flagged as machine-generated, which is valuable for forensic analysis. The paper is available on arXiv (2605.04157) and represents a practical step toward democratizing AI-generated code detection.
- System uses ratio-based features that are length-insensitive, enabling robust detection across varying code snippet sizes.
- Training runs entirely on CPU and inference is near-instant, making the system suitable for low-resource deployments.
- Combines a decision tree with heuristic rules derived from data analysis, avoiding large pretrained models.
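The tree-plus-heuristics combination described above can be sketched as follows. The features, training points, thresholds, and the override rule are all hypothetical placeholders, not the authors' actual model; the sketch only shows the pattern of letting hand-derived rules take precedence over a shallow learned tree.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [comment_line_ratio, blank_line_ratio] per snippet
# (values and labels are invented for illustration).
X_train = [[0.30, 0.05], [0.02, 0.20], [0.40, 0.03], [0.01, 0.25]]
y_train = [1, 0, 1, 0]  # 1 = LLM-generated, 0 = human-written

# A shallow tree keeps the decision path short and human-readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

def predict(features: list[float]) -> int:
    comment_ratio, _blank_ratio = features
    # Hypothetical heuristic rule from data analysis: extremely dense
    # commenting is treated as a strong LLM signal and overrides the tree.
    if comment_ratio > 0.5:
        return 1
    return int(tree.predict([features])[0])
```

Because both the tree splits and the override rules operate on named, interpretable ratios, the path that led to a given prediction can be read off directly, which matches the transparency advantage the article describes.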
Why It Matters
Enables cost-effective, real-time detection of AI-written code without expensive hardware—ideal for security and quality assurance teams.