Research & Papers

New 'open-world' AI eval puts agents to work building real iOS apps

arXiv cs.AI May 22, 2026

⚡An AI agent published an iOS app to the App Store with just one manual fix...

Deep Dive

A team of 18 researchers led by Sayash Kapoor and Arvind Narayanan (Princeton) argues that current AI benchmarks both overstate and understate frontier capabilities. They propose 'open-world evaluations': long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than automated grading. To put this into practice, they introduce CRUX (Collaborative Research for Updating AI eXpectations), a project that regularly runs such evaluations. In the first CRUX evaluation, an AI agent was tasked with developing a simple iOS application and publishing it to the Apple App Store. The agent succeeded with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread.

This approach directly challenges the dominance of benchmarks like MMLU or HumanEval, which reward narrow optimization and may miss messy, real-world competencies. The researchers note that existing benchmarks favor tasks that are precisely specifiable, automatically gradable, easy to optimize for, and inexpensive to run. Open-world evaluations, by contrast, span longer time horizons, require handling ambiguity, and produce qualitative evidence of genuine capability. The paper offers guidelines for designing and reporting such evaluations, aiming to give policymakers and the public a more accurate picture of what frontier AI can actually do before those abilities become widespread.

Key Points

CRUX is a new project for conducting open-world evaluations of frontier AI using long-horizon, real-world tasks.
An AI agent built and published a simple iOS app to the Apple App Store with only one avoidable manual intervention.
Researchers argue that benchmarks both overstate and understate capability; open-world evals can provide early warning of abilities that may soon become widespread.

Why It Matters

This evaluation method could reveal frontier AI's real-world readiness months before existing benchmarks become obsolete.
