ProgramBench: Can Language Models Rebuild Programs From Scratch?
Even the best LM passes 95% of tests on only 3% of tasks – zero full rebuilds.
A new paper from John Yang and 12 co-authors at Stanford, Meta, and other institutions introduces ProgramBench, a benchmark designed to test whether language models can rebuild entire software projects from scratch. Unlike existing benchmarks that focus on bug fixes or single-feature development, ProgramBench measures holistic software engineering: given only a program's documentation and its executable behavior, an agent must architect and implement a codebase that passes end-to-end behavioral tests. Because the tests are generated via agent-driven fuzzing and check only observable behavior, models aren't constrained by any prescribed implementation structure.
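To make the evaluation setup concrete, here is a minimal, hypothetical sketch of a differential test harness in this spirit: it runs the reference program and the agent's rebuild on the same fuzzer-proposed inputs and compares their observable behavior. The binary paths, function names, and scoring below are illustrative assumptions, not the paper's actual harness.

```python
import subprocess

def observe(binary, args, stdin_data=b""):
    """Capture a program's externally observable behavior: exit code, stdout, stderr."""
    proc = subprocess.run(
        [binary, *args],
        input=stdin_data,
        capture_output=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout, proc.stderr

def behavioral_pass_rate(reference_bin, rebuilt_bin, fuzzed_cases):
    """Differential check: a case passes when the rebuild's behavior matches
    the reference program's on the same fuzzer-proposed input."""
    passed = sum(
        observe(reference_bin, args, stdin) == observe(rebuilt_bin, args, stdin)
        for args, stdin in fuzzed_cases
    )
    return passed / len(fuzzed_cases)

# Hypothetical usage: score a rebuilt word-count CLI against the original
# on a couple of fuzzer-proposed inputs.
cases = [(["-l"], b"one\ntwo\n"), (["-w"], b"hello world\n")]
print(f"pass rate: {behavioral_pass_rate('./reference/wc', './rebuild/wc', cases):.0%}")
```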
The benchmark comprises 200 tasks, ranging from compact CLI tools to major open-source projects like FFmpeg, SQLite, and the PHP interpreter. The team evaluated nine state-of-the-art LMs, including GPT-4o, Claude 3.5 Sonnet, and Llama 3 70B. The results are striking: not a single model fully resolved any task, and even the best performer passed 95% of the tests on only 3% of the tasks. Moreover, models consistently produced monolithic, single-file implementations that diverge sharply from the modular, multi-file structure of human-written code. This suggests that current LMs lack the ability to make the high-level architectural decisions that real-world software development requires.
The findings have significant implications for the growing trend of deploying AI agents to seed and maintain codebases with minimal human oversight. While agents can handle isolated tasks like writing a function or fixing a bug, ProgramBench reveals a critical gap in holistic software design. The paper proposes this as a new challenge for the community, and the benchmark is publicly available to spur further research. For now, fully autonomous software engineering from scratch remains out of reach.
- ProgramBench tests 9 LMs on 200 tasks, from CLI tools to complex codebases like FFmpeg, SQLite, and the PHP interpreter.
- Best model passes 95% of tests on only 3% of tasks; no model fully resolves any task.
- All models produce monolithic single-file implementations instead of modular, multi-file architectures typical of human-written code.
Why It Matters
AI agents can't yet architect complete software projects, limiting autonomous codebase development in production environments.