Developer Tools

RepoZero benchmark tests LLMs on building entire code repos from scratch

Top LLMs hit only 30-55% pass rates when generating full repositories from API specs...

Deep Dive

A team of researchers from Peking University has released RepoZero, a novel benchmark that rigorously tests whether Large Language Models can generate entire software repositories from scratch. Unlike existing code benchmarks that focus on patch-based editing or rely on subjective human/LLM judgments, RepoZero uses a fully automated, execution-based verification method. The core innovation is reformulating generation as reproduction: given only the API specifications of an existing open-source repository, an LLM agent must re-implement the entire codebase so that its output matches the original's behavior exactly. To prevent data leakage and shortcut solutions, the benchmark introduces cross-language constraints and a sandboxed evaluation protocol.
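The idea of execution-based verification can be illustrated with a toy sketch: run the original and the re-implementation on the same inputs and count how often their behavior agrees. This is not RepoZero's actual harness, which the summary above does not describe in detail. The names `behavior_matches`, `pass_rate`, and both `slugify` functions are hypothetical, and plain Python callables stand in for whole repositories.

```python
from typing import Any, Callable, Iterable, Tuple


def behavior_matches(reference: Callable[..., Any],
                     candidate: Callable[..., Any],
                     args: Tuple[Any, ...]) -> bool:
    """Return True if both callables return the same value, or raise the
    same exception type, when called with the same arguments."""
    def run(fn: Callable[..., Any]) -> Tuple[str, Any]:
        try:
            return ("ok", fn(*args))
        except Exception as exc:  # compare failures by exception type only
            return ("error", type(exc).__name__)
    return run(reference) == run(candidate)


def pass_rate(reference: Callable[..., Any],
              candidate: Callable[..., Any],
              cases: Iterable[Tuple[Any, ...]]) -> float:
    """Fraction of cases on which the candidate reproduces the reference."""
    cases = list(cases)
    if not cases:
        return 0.0
    passed = sum(behavior_matches(reference, candidate, c) for c in cases)
    return passed / len(cases)


# Hypothetical "original" implementation: collapses runs of whitespace.
def reference_slugify(text: str) -> str:
    return "-".join(text.lower().split())


# Hypothetical re-implementation with a subtle behavioral difference:
# it replaces each space individually instead of collapsing runs.
def candidate_slugify(text: str) -> str:
    return text.lower().replace(" ", "-")


CASES = [("Hello World",), ("a  b",), ("",), ("Already-slugged",)]
RATE = pass_rate(reference_slugify, candidate_slugify, CASES)
```

Here the candidate diverges only on the double-space input, so it matches the reference on three of the four cases. Because the check is a direct comparison of observed behavior, no human or LLM judge is needed to score the result.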

Experiments across multiple state-of-the-art LLMs and agent frameworks reveal that even the strongest models achieve only 30-55% pass rates, exposing a substantial gap between current capabilities and real-world requirements. The paper also proposes the Agentic Code-Test Evolution (ACE) framework, which iteratively generates test cases and refines code through error feedback, enabling effective test-time scaling for repository-level synthesis. The authors position RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation, and they highlight self-verification via test generation as a critical direction for advancing LLM-based coding agents.
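A generic generate-tests-then-refine loop, in the spirit of the description above, might look like the following sketch. It is not the paper's ACE implementation. The functions `run_tests` and `code_test_loop` are hypothetical, and the two stub functions stand in for what would be LLM calls in a real agent.

```python
from typing import Any, Callable, List, Tuple

Test = Tuple[Tuple[Any, ...], Any]  # (arguments, expected result)


def run_tests(func: Callable[..., Any], tests: List[Test]) -> List[str]:
    """Run every test and return one error message per failure."""
    errors = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:
            errors.append(f"{args}: raised {type(exc).__name__}")
            continue
        if got != expected:
            errors.append(f"{args}: expected {expected!r}, got {got!r}")
    return errors


def code_test_loop(initial: Callable[..., Any],
                   generate_tests: Callable[[int], List[Test]],
                   refine: Callable[[Callable[..., Any], List[str]],
                                    Callable[..., Any]],
                   rounds: int = 3):
    """Each round adds new tests, runs the whole suite, and refines the
    code whenever there are failures. More rounds means more test-time
    compute spent on self-verification."""
    code, tests = initial, []
    for round_no in range(1, rounds + 1):
        tests.extend(generate_tests(round_no))
        errors = run_tests(code, tests)
        if errors:
            code = refine(code, errors)  # error feedback drives the fix
    return code, run_tests(code, tests)


# Stubs below replace LLM calls so the sketch runs on its own.
def draft_abs(x: int) -> int:
    return x  # buggy first draft: wrong for negative inputs


def fixed_abs(x: int) -> int:
    return -x if x < 0 else x


def stub_generate_tests(round_no: int) -> List[Test]:
    # Round 1 only probes a positive input; later rounds probe negatives.
    if round_no == 1:
        return [((1,), 1)]
    return [((-round_no,), round_no)]


def stub_refine(code: Callable[..., Any],
                errors: List[str]) -> Callable[..., Any]:
    return fixed_abs  # a real agent would rewrite the code using `errors`


FINAL_CODE, REMAINING_ERRORS = code_test_loop(
    draft_abs, stub_generate_tests, stub_refine, rounds=3)
```

In this toy run the first round's test is too weak to expose the bug, the second round's new test fails and triggers a refinement, and the third round confirms the fix. That is the sense in which additional rounds can trade compute for correctness.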

Key Points
  • RepoZero uses API-spec-to-repo reproduction with execution-based verification, avoiding subjective human or LLM judgments
  • Cross-language constraints and sandboxed evaluation reduce data leakage and shortcut solutions
  • Top LLM agents achieve only 30-55% pass rates, showing a major gap between current capabilities and what real-world software development requires

Why It Matters

RepoZero shows that even the strongest LLMs fail to reproduce full repositories roughly half the time or more, pushing the field toward self-verifying coding agents.