SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents
New benchmark reveals which AI models can build 10,000-line software from scratch.
Researchers introduced SWE-AGI, a new benchmark testing whether AI agents can autonomously build production-scale software from specifications. Tasks require implementing 1,000-10,000 lines of core logic from standards such as RFCs. GPT-5.3-Codex performed best, solving 19 of 22 tasks (86.4%), ahead of Claude Opus 4.6 (68.2%). The benchmark uses the MoonBit language to minimize training-data leakage, forcing models to rely on architectural reasoning rather than memorized code. Results show that performance degrades sharply on harder tasks and that code reading becomes the bottleneck as projects scale.
Why It Matters
This benchmark suggests AI agents are approaching the ability to autonomously build complex software from specifications, potentially automating weeks of developer work.