Developer Tools

Can LLMs be Effective Code Contributors? A Study on Open-source Projects

GPT-4o, Ministral3, and Qwen3-Coder achieved success rates of 0% to 60% on 212 commits.

Deep Dive

A new study tested three major LLMs—GPT-4o, Ministral3, and Qwen3-Coder—on 212 real-world commits across eight open-source projects, including FFmpeg and wolfSSL. Success rates varied wildly by project, from 0% to 60%. The models failed in multiple ways: generating syntactically incorrect code, failing static verification, and not passing project test suites. They particularly struggled with creating new code (rather than fixing existing code) and with handling code contexts outside a certain size range, and they often succeeded only by parroting patterns from their training data.

The findings suggest current LLMs are far from being reliable code contributors for production-grade open-source projects. The researchers developed an evaluation framework that uses verification and validation to judge whether an LLM-generated fix or feature is fit to merge. The results expose a gap between the popularity of LLM-generated code and its actual effectiveness in complex, real-world codebases. For developers, this means relying solely on LLMs for contributions could introduce significant risk, especially in large projects with strict quality standards.
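The staged filtering the study describes—syntax, static verification, then the project's test suite—can be sketched as a simple gate. This is an illustrative assumption of how such a pipeline might look, not the researchers' actual framework; the stage names and the toy static check are hypothetical.

```python
# Minimal sketch of a verify-then-validate gate for an LLM-generated patch,
# mirroring the failure modes reported in the study: syntax errors, static
# verification failures, and test-suite failures. All names are illustrative.
import ast


def check_syntax(source: str) -> bool:
    """Stage 1: reject patches that do not even parse."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def check_static(source: str) -> bool:
    """Stage 2: lightweight static verification. Here: forbid bare exec()
    calls, as a stand-in for a real analyzer such as clang-tidy or pylint."""
    tree = ast.parse(source)
    return not any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "exec"
        for node in ast.walk(tree)
    )


def check_tests(source: str, tests) -> bool:
    """Stage 3: validation -- load the patched code and run the tests."""
    namespace: dict = {}
    exec(compile(source, "<patch>", "exec"), namespace)
    return all(test(namespace) for test in tests)


def evaluate_patch(source: str, tests) -> str:
    """Return the first stage the patch fails, or 'accepted'."""
    if not check_syntax(source):
        return "syntax error"
    if not check_static(source):
        return "static verification failed"
    if not check_tests(source, tests):
        return "test suite failed"
    return "accepted"


# Example: one patch that passes all three gates, one that fails at stage 1.
patch = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
tests = [lambda ns: ns["clamp"](5, 0, 3) == 3]
print(evaluate_patch(patch, tests))      # accepted
print(evaluate_patch("def f(:", tests))  # syntax error
```

Ordering the gates from cheapest to most expensive means a patch that cannot parse never consumes a test-suite run, which matters when evaluating hundreds of commits.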

Key Points
  • GPT-4o, Ministral3, and Qwen3-Coder tested on 212 commits across 8 open-source projects like FFmpeg and wolfSSL.
  • Success rates ranged from 0% to 60% depending on the project, with failures including syntax errors and test suite failures.
  • LLMs struggled with generating new code and handling large contexts, often succeeding only by parroting training data.

Why It Matters

Highlights critical limitations of LLMs for production code, urging caution in developer workflows.