Developer Tools

SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?

A new benchmark reveals how much AI coding agents still struggle with real-world software engineering.

Deep Dive

A new benchmark, SWE-Bench Mobile, tests AI coding agents on realistic iOS app development tasks built from real product requirement documents (PRDs) and Figma designs. Across 22 configurations of four agents, the best achieved only a 12% task success rate. The study found that agent design matters as much as the underlying model: commercial agents outperformed open-source ones, and simple prompts beat complex ones by 7.4%, highlighting a major gap before such agents are ready for industrial use.
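To make the headline metric concrete, here is a minimal sketch of how a per-configuration task success rate could be computed from pass/fail outcomes. The data structure, configuration names, and fields are hypothetical illustrations; the paper's actual evaluation harness is not described here.

```python
from typing import Dict, List, Tuple

# Hypothetical per-task outcomes keyed by (agent, model, prompt style).
# True means the configuration completed that benchmark task end to end.
results: Dict[Tuple[str, str, str], List[bool]] = {
    ("commercial-agent-A", "model-X", "simple"): [True, False, False, True, False],
    ("open-source-agent-B", "model-X", "complex"): [False, False, False, True, False],
}

def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks a configuration solved; 0.0 if no tasks were run."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Rank configurations by task success rate, mirroring the paper's headline comparison.
ranked = sorted(results.items(), key=lambda kv: success_rate(kv[1]), reverse=True)
for (agent, model, prompt), outcomes in ranked:
    print(f"{agent} / {model} / {prompt}: {success_rate(outcomes):.0%}")
```

Under this framing, the reported 12% figure would correspond to the best configuration solving roughly one in eight benchmark tasks, and the 7.4% prompt effect is a difference in these rates between simple and complex prompt variants of the same agent.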

Why It Matters

This shows current AI agents are far from replacing human mobile developers, setting realistic expectations for the industry.