Can LLMs be Effective Code Contributors? A Study on Open-source Projects
GPT-4o, Ministral3, and Qwen3-Coder achieved success rates from 0% to 60% on 212 real-world commits.
A new study put three major LLMs—GPT-4o, Ministral3, and Qwen3-Coder—to the test on 212 real-world commits across eight open-source projects, including FFmpeg and wolfSSL. Success rates varied wildly, from 0% to 60% depending on the project. The models failed in multiple ways: generating syntactically incorrect code, failing static verification, and not passing project test suites. They struggled in particular with creating new code (as opposed to fixing existing code) and with code contexts outside a certain size range, and they often succeeded only by parroting patterns from their training data.
The findings suggest current LLMs are far from reliable code contributors for production-grade open-source projects. The researchers developed an evaluation framework that applies verification (does the code compile and pass static checks?) and validation (does it pass the project's test suite?) to judge whether an LLM-generated change is suitable as a fix or feature. The results expose a gap between the popularity of LLM-generated code and its actual effectiveness in complex, real-world codebases. For developers, this means relying solely on LLMs for contributions could introduce significant risk, especially in large projects with strict quality standards.
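To make the staged evaluation concrete, here is a minimal sketch of such a verification-and-validation harness in Python. The `make` targets and the `evaluate_patch` helper are illustrative assumptions, not the paper's actual tooling; the point is only the ordering: a patch must apply, compile, pass static analysis, and pass the project's test suite before it counts as a success.

```python
import subprocess
from pathlib import Path

# Hypothetical staged pipeline mirroring the failure modes the study reports:
# syntax errors, static-verification failures, and test-suite failures.
# The build/analysis/test commands below are assumptions for illustration.
STAGES = [
    ("syntax", ["make"]),             # does the patched tree compile?
    ("static", ["make", "analyze"]),  # hypothetical static-analysis target
    ("tests",  ["make", "check"]),    # project's own test suite
]

def evaluate_patch(repo: Path, patch: Path) -> str:
    """Apply an LLM-generated patch, then return the first failing stage
    ("apply", "syntax", "static", or "tests"), or "pass" if all succeed."""
    if subprocess.run(["git", "apply", str(patch)], cwd=repo).returncode != 0:
        return "apply"  # patch is not even well-formed against this tree
    for name, cmd in STAGES:
        if subprocess.run(cmd, cwd=repo).returncode != 0:
            return name  # report the earliest failing stage
    return "pass"
```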
- GPT-4o, Ministral3, and Qwen3-Coder were tested on 212 commits across 8 open-source projects, including FFmpeg and wolfSSL.
- Success rates ranged from 0% to 60% depending on the project; failure modes included syntax errors, static-verification failures, and test-suite failures.
- LLMs struggled with generating new code and with code contexts outside a certain size range, often succeeding only by parroting training data.
Why It Matters
Highlights critical limitations of LLMs for production code, urging caution in developer workflows.