[R] ARC Round 3 - released + technical report
New technical report suggests top models likely trained on ARC-like data, yet still fail spectacularly.
The ARC Prize, a benchmark designed to measure true abstraction and reasoning capabilities akin to human fluid intelligence, has published its Round 3 results and a revealing technical report. The findings are stark: every frontier AI model tested—including offerings from OpenAI, Anthropic, and Google—failed to surpass a 1% score on the novel visual puzzle tasks. This indicates that despite massive scale and training data, current systems lack core, efficient reasoning abilities. The report further suggests, through analysis of model reasoning traces, that the few models performing relatively well likely had exposure to ARC-like data during training, rather than developing the skill from first principles.
The continued failure to claim the substantial prize money from Rounds 1 and 2 underscores the benchmark's difficulty and the field's current limitations. The ARC-AGI tasks require solving unique visual pattern problems that cannot be memorized, demanding genuine abstraction. The sub-1% scores across the board reveal vast room for improvement in developing AI that can reason efficiently and generalize to unseen problems. This result challenges the narrative of rapid, continuous improvement in AI reasoning and highlights a specific frontier where even the most advanced models fall short, pointing researchers toward the hard problem of building systems that learn to learn.
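To make the task structure concrete: public ARC-AGI tasks are JSON objects with "train" demonstration pairs and "test" inputs, where each grid is a list of rows of integer color codes (0-9). The toy task and the `mirror` rule below are invented for illustration only; the point is that a solver must infer the hidden rule from a handful of examples, which is what memorization-heavy models struggle to do.

```python
# Toy task in the public ARC-AGI JSON shape. The grids and the hidden
# rule here are made up for illustration; real tasks use the same
# structure but a novel, unmemorizable transformation per task.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 0, 0], [0, 0, 4]], "output": [[0, 0, 3], [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0], [0, 0]]}],
}

def mirror(grid):
    """Candidate rule: reflect each row left-to-right."""
    return [row[::-1] for row in grid]

# A candidate rule is only plausible if it reproduces every training output.
assert all(mirror(pair["input"]) == pair["output"] for pair in task["train"])

# Apply the inferred rule to the held-out test input.
print(mirror(task["test"][0]["input"]))  # -> [[0, 5], [0, 0]]
```

Because each task's rule is unique, a system cannot score well by pattern-matching against training data; it has to perform this infer-then-apply loop from scratch on every puzzle.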
- All tested frontier models (GPT-4, Claude, Gemini) scored below 1% on the ARC-AGI Round 3 benchmark.
- The technical report suggests the better-performing models may have had ARC-like data in their training sets, rather than genuine reasoning ability.
- Prize money for efficiency-focused Rounds 1 and 2 remains unclaimed, highlighting a critical gap in AI capabilities.
Why It Matters
This benchmark exposes a fundamental weakness in current AI: a lack of efficient, general reasoning ability that limits real-world problem-solving.