Developer Tools

SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

New benchmark reveals which AI models can build 10,000-line software from scratch.

Deep Dive

Researchers introduced SWE-AGI, a new benchmark testing whether AI agents can autonomously build production-scale software from specifications. Tasks require implementing 1,000 to 10,000 lines of core logic from standards such as RFCs. GPT-5.3-Codex performed best, solving 19 of 22 tasks (86.4%), ahead of Claude Opus 4.6 (68.2%). The benchmark uses the relatively new MoonBit language to minimize training-data leakage, forcing models to reason architecturally rather than recall memorized code. Results show performance degrading sharply on harder tasks, with code reading becoming the bottleneck as projects scale.
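To make the task style concrete: "implementing core logic from a standard" means turning an RFC's prose directly into working code. A minimal illustrative sketch (in Python, not MoonBit, and not an actual benchmark task) is Base64 encoding written straight from RFC 4648, Section 4:

```python
# Illustrative only: Base64 encoding implemented from the RFC 4648 spec,
# the kind of spec-to-code task the benchmark scales up to thousands of lines.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64encode(data: bytes) -> str:
    """Encode bytes per RFC 4648: pack each 3 octets into four 6-bit groups."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Left-align the chunk's bits in a 24-bit accumulator.
        n = int.from_bytes(chunk, "big") << (8 * (3 - len(chunk)))
        for j in range(4):
            out.append(ALPHABET[(n >> (18 - 6 * j)) & 0x3F])
        # Replace unused trailing groups with '=' padding (RFC 4648, Sec. 3.2).
        pad = 3 - len(chunk)
        if pad:
            out[-pad:] = "=" * pad
    return "".join(out)

print(b64encode(b"foobar"))  # RFC 4648 test vector: Zm9vYmFy
```

A real SWE-AGI task applies the same discipline across an entire standard, where correctness hinges on reading the specification rather than pattern-matching familiar code.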

Why It Matters

This benchmark suggests AI agents are approaching the ability to build complex software autonomously, which could automate weeks of developer work.