Developer Tools

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

New benchmark tests 16 frontier models on building complete web apps from scratch, with the best model scoring only 58%.

Deep Dive

A research team including Hung Tran and Langston Nashold has introduced Vibe Code Bench, a benchmark designed to evaluate AI models on the complete process of building a working web application from a specification. Unlike existing benchmarks that measure isolated coding tasks, Vibe Code Bench simulates the real-world "zero-to-one" development process, using 100 detailed web application specifications (50 public, 50 held-out test), each evaluated by an autonomous browser agent against the deployed application. The approach, detailed in a new arXiv preprint, reveals a significant gap in current AI capabilities: the best of the 16 models tested achieves only 58.0% accuracy on the test split, underscoring that reliable end-to-end application generation remains a frontier problem.
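To make the protocol concrete, here is a minimal sketch of what substep-level scoring against a deployed app could look like. All names here (Substep, Workflow, evaluate_app, agent_attempt) are hypothetical illustrations rather than the paper's actual code, and the real benchmark drives an autonomous browser agent rather than a plug-in callable.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical data model for one Vibe-Code-Bench-style run;
# names are illustrative, not taken from the paper.

@dataclass
class Substep:
    description: str          # e.g. "click the 'Add task' button"
    passed: bool = False

@dataclass
class Workflow:
    name: str
    substeps: list[Substep]

def evaluate_app(app_url: str,
                 workflows: list[Workflow],
                 agent_attempt: Callable[[str, str], bool]) -> float:
    """Drive an agent through every workflow against the deployed app
    and return the fraction of substeps completed successfully."""
    total = passed = 0
    for wf in workflows:
        for step in wf.substeps:
            # agent_attempt stands in for the autonomous browser agent:
            # True means the step could be carried out on the live app.
            step.passed = agent_attempt(app_url, step.description)
            total += 1
            passed += step.passed
    return passed / total if total else 0.0
```

However the paper actually aggregates its accuracy figure, the key property this sketch captures is the same: a generated app earns credit only insofar as an agent can complete the specified workflows on the live deployment.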

The benchmark's evaluation pipeline comprises 964 browser-based workflows broken down into 10,131 individual substeps, giving a granular view of model performance. Key findings include a strong correlation (Pearson r = 0.72) between a model's tendency to self-test during code generation and its overall success rate. The authors also ran a human alignment analysis showing that evaluator selection dramatically affects outcomes, with pairwise step-level agreement ranging from 31.8% to 93.6%. The contributions are threefold: a novel dataset and evaluation system, a cost and latency analysis of 16 frontier models, and an evaluator alignment protocol. A live leaderboard tracks progress as models improve on this practical task.
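As a concrete reference for these two metrics, the sketch below computes a Pearson correlation and a pairwise step-level agreement score. Every number in it is an invented placeholder, not data from the paper, and statistics.correlation requires Python 3.10+.

```python
from statistics import correlation

# Placeholder values for illustration only, not the paper's data:
# per-model self-testing rate vs. overall substep success rate.
self_test_rate = [0.15, 0.40, 0.55, 0.70]
success_rate   = [0.20, 0.38, 0.49, 0.58]

# Pearson r between self-testing and success (the paper reports r = 0.72).
r = correlation(self_test_rate, success_rate)

def pairwise_agreement(a: list[bool], b: list[bool]) -> float:
    """Step-level agreement between two evaluators: the share of
    substeps on which both issue the same pass/fail verdict."""
    assert len(a) == len(b) and a, "need equal-length, non-empty verdict lists"
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two evaluators judging the same five substeps (again, invented data).
human = [True, True, False, True, False]
agent = [True, False, False, True, False]
print(f"r = {r:.2f}, agreement = {pairwise_agreement(human, agent):.1%}")
```

The 31.8% to 93.6% spread the authors report is exactly this kind of agreement score computed across evaluator pairs, which is why the evaluator alignment protocol is one of the paper's stated contributions.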

Key Points
  • Benchmark tests 16 frontier AI models on 100 full web app specs with 964 browser workflows.
  • Best model scores only 58.0% accuracy, showing that end-to-end app development remains a major challenge.
  • Study finds self-testing during generation is a strong performance predictor with a 0.72 correlation.

Why It Matters

Sets a new standard for measuring practical AI coding ability, moving beyond snippets to full application delivery.