SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
No platform scored above 60% on engineering quality, and concurrency handling fell as low as 6%
A new academic benchmark, SWE-WebDevBench, rigorously evaluates 'vibe coding' platforms: AI agents that generate full-stack software from natural-language prompts. Created by Siddhant Saxena, Nilesh Trivedi, and Vinayaka Jyothi, the framework uses 68 metrics (25 primary and 43 diagnostic), organized by interaction mode (App Creation vs. Modification), agency angle (Product Manager, Engineering, Ops), and complexity tier (up to Tier 5, AI-native SaaS).
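That three-dimensional organization lends itself to a simple tabular encoding. The sketch below is illustrative only: the dimension values mirror the summary above, but the identifiers, the Metric record, and the example entry are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

# Illustrative encoding of the benchmark's three organizing dimensions.
INTERACTION_MODES = ("app_creation", "app_modification")
AGENCY_ANGLES = ("product_manager", "engineering", "ops")
COMPLEXITY_TIERS = tuple(range(1, 6))  # Tier 5 = AI-native SaaS

@dataclass(frozen=True)
class Metric:
    """One of the 68 metrics (25 primary + 43 diagnostic); fields assumed."""
    name: str   # e.g. "security_score" (hypothetical identifier)
    kind: str   # "primary" or "diagnostic"
    angle: str  # agency angle the metric is reported under

# Example entry only; the actual metric list and definitions live in the paper.
example_metric = Metric(name="security_score", kind="primary", angle="engineering")
print(example_metric)
```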
Testing six major platforms across three domains and 18 evaluation cells uncovered four recurring weaknesses. First, a specification bottleneck: platforms compress rich business requirements into oversimplified technical plans. Second, pervasive frontend-backend decoupling: polished UIs mask broken or missing backend infrastructure. Third, a steep production-readiness cliff: no platform scored above 60% on engineering quality, with post-generation human effort varying widely.
Fourth, widespread security and infrastructure failures: no platform exceeded a 65% Security Score against a 90% target, and concurrency handling was as low as 6%. The authors release SWE-WebDevBench as a public benchmark to enable replication and help developers address these gaps, noting the findings are descriptive of their sample and need larger-scale validation.
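As a quick worked reading of those figures (the helper name and the use of the same 90% bar for concurrency are assumptions for illustration):

```python
def gap_to_target(measured: float, target: float) -> float:
    """Shortfall of a measured score relative to its target (0.0 if the target is met)."""
    return max(0.0, target - measured)

# Best observed Security Score (65%) against the stated 90% target: a 25-point gap.
print(f"{gap_to_target(0.65, 0.90):.0%}")  # 25%
# Worst-case concurrency handling (6%), measured against the same 90% bar (assumed).
print(f"{gap_to_target(0.06, 0.90):.0%}")  # 84%
```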
- SWE-WebDevBench uses 68 metrics across 3 dimensions to evaluate AI coding agents as virtual software agencies.
- No platform scored above 60% on engineering quality; security scores topped out at 65% against a 90% target.
- Concurrency handling was as low as 6%, and frontend-backend decoupling was pervasive across all tested platforms.
Why It Matters
For dev teams relying on AI code generators, this benchmark exposes critical gaps that could break production apps.