New 'open-world' AI eval puts agents to work building real iOS apps
An AI agent published an iOS app to the App Store with just one manual fix...
A team of 18 researchers led by Sayash Kapoor and Arvind Narayanan (Princeton) argues that current AI benchmarks both overstate and understate frontier capabilities. They propose 'open-world evaluations': tasks that are long-horizon, messy, and real-world, assessed via small-sample qualitative analysis rather than automated grading. To formalize this, they introduce CRUX (Collaborative Research for Updating AI eXpectations), a project that regularly runs such evaluations. The first CRUX instance tasked an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent succeeded with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread.
This approach directly challenges the dominance of benchmarks like MMLU or HumanEval, which reward narrow optimization and may miss messy, real-world competencies. The researchers highlight that existing benchmarks privilege tasks that are precisely specifiable, automatically gradable, easy to optimize for, and cheap to run. Open-world evaluations, by contrast, demand longer time horizons, require handling ambiguity, and produce qualitative evidence of genuine capability. The paper offers guidelines for designing and reporting such evaluations, aiming to give policymakers and the public a more accurate picture of what frontier AI can actually do before those abilities become widespread.
- CRUX is a new project for conducting open-world evaluations of frontier AI using long-horizon, real-world tasks.
- An AI agent built and published a simple iOS app to the Apple App Store with only one avoidable manual intervention.
- Researchers argue that current benchmarks both overstate and understate capability, while open-world evals can give early warning of abilities that may soon become widespread.
Why It Matters
This evaluation method could reveal what frontier AI can do in the real world months before conventional benchmarks catch up.