A Large-Scale Comprehensive Measurement of AI-Generated Code in Real-World Repositories
Researchers analyzed 1.2M commits and found that AI-generated code has higher complexity and defect rates than human-written code.
A team of researchers from Indiana University and the University of Notre Dame has published the first large-scale empirical study measuring AI-generated code in real-world software repositories. The study, titled "A Large-Scale Comprehensive Measurement of AI-Generated Code in Real-World Repositories," analyzed over 1.2 million commits across thousands of projects using a novel detection method combining heuristic filters with LLM classification. This approach allowed them to distinguish AI-generated code from human-written code at scale, moving beyond previous small-scale controlled evaluations.
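The paper's pipeline pairs cheap heuristics with an LLM judgment. Below is a minimal sketch of that two-stage idea in Python; the hint patterns and the `llm_classify` stub are illustrative assumptions, not the study's actual filters or model:

```python
import re

# Stage 1: cheap heuristic pre-filters. These hint patterns are
# illustrative assumptions, not the study's actual filter set.
AI_HINTS = [
    re.compile(r"copilot|codewhisperer|chatgpt", re.IGNORECASE),
    re.compile(r"generated (by|with) ai", re.IGNORECASE),
]

def passes_heuristics(commit_message: str, diff: str) -> bool:
    """Flag commits whose message or diff matches any AI hint."""
    text = commit_message + "\n" + diff
    return any(p.search(text) for p in AI_HINTS)

def llm_classify(diff: str) -> bool:
    """Stage 2 placeholder: the study applies an LLM judgment here.
    Swap in a real model call; this stand-in just lets the sketch run."""
    return False  # hypothetical stub, not the paper's classifier

def is_ai_generated(commit_message: str, diff: str) -> bool:
    # Only pay the LLM cost for commits that survive the cheap filters.
    return passes_heuristics(commit_message, diff) and llm_classify(diff)

# Usage:
print(is_ai_generated("Add parser (generated with AI)", "def parse(): ..."))
```

Gating the expensive LLM call behind cheap heuristic filters is what makes classification tractable at million-commit scale.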
The findings reveal significant differences in code quality and development patterns. AI-generated code is 2.5 times as likely to contain defects and shows 40% higher complexity scores than human-written code. At the commit level, developers using AI assistance tend to make smaller, more frequent commits, but the resulting code is less stable and requires more post-commit fixes. The study also found that AI-generated code often exhibits different structural patterns and documentation practices.
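The commit-level findings rest on metrics that can be mined directly from git history. Here is a rough, self-contained sketch of how one might measure commit churn and fix rate from a local clone; the keyword-based fix detection and metric definitions are assumptions for illustration, not the paper's exact methodology:

```python
import subprocess

def commit_stats(repo_path: str) -> dict:
    """Compute per-commit churn (lines added + deleted) and the share of
    commits whose message suggests a bug fix. Keyword-based fix detection
    is a common but crude convention, assumed here for illustration."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%H|%s", "--shortstat"],
        capture_output=True, text=True, check=True,
    ).stdout

    sizes, fixes, total = [], 0, 0
    for line in log.splitlines():
        if "|" in line and len(line.split("|")[0]) == 40:
            # Commit line: "<40-char hash>|<subject>"
            total += 1
            subject = line.split("|", 1)[1].lower()
            if any(k in subject for k in ("fix", "bug", "revert")):
                fixes += 1
        elif "changed" in line:
            # Shortstat line, e.g. " 3 files changed, 10 insertions(+), 2 deletions(-)"
            nums = [int(t) for t in line.replace(",", " ").split() if t.isdigit()]
            sizes.append(sum(nums[1:]))  # insertions + deletions

    return {
        "commits": total,
        "avg_churn": sum(sizes) / len(sizes) if sizes else 0.0,
        "fix_ratio": fixes / total if total else 0.0,
    }

# Usage: point at any local git checkout.
print(commit_stats("."))
```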
These results provide crucial insights for software engineering teams adopting AI tools like GitHub Copilot, CodeWhisperer, and ChatGPT. The research suggests that while AI accelerates code generation, it may introduce quality trade-offs that require new review processes and testing strategies. The dataset and methodology developed for this study will enable further research into optimizing AI-assisted development workflows and improving the reliability of generated code.
- AI-generated code has 2.5x higher defect likelihood than human-written code
- Study analyzed 1.2M commits using LLM classification + heuristic filters
- AI-assisted developers make smaller, more frequent commits with less stable code
Why It Matters
Provides empirical evidence for software teams to adjust code review and testing processes when using AI assistants.