Human vs. AI performance on ARC-AGI-3 as a function of the number of actions (from the ARC-AGI website)
Given unlimited 'actions' in which to work, Claude 3.5 Sonnet solves 84% of ARC-AGI-3 puzzles, beating the estimated human average of 80%.
A viral analysis of the ARC-AGI-3 benchmark reveals a critical insight into AI reasoning capabilities. The benchmark, the latest iteration of the test François Chollet created to measure abstraction and reasoning, presents novel visual puzzles. The analysis plots performance against the number of 'actions' (in effect, the reasoning steps or computational budget) a model is allowed to take, and the trend is clear: as the action limit increases, AI performance climbs dramatically. Claude 3.5 Sonnet's accuracy, for instance, jumps from under 40% with a 10-action limit to a remarkable 84% when allowed unlimited actions.
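To make the budget effect concrete, below is a minimal, hypothetical sketch of an action-capped evaluation loop. The toy agent, its fixed 5% per-action solve rate, and the names `run_episode` and `accuracy` are illustrative stand-ins rather than the actual ARC-AGI-3 harness; only the budget-capping logic matters here.

```python
# Minimal sketch of capping an agent's action budget during evaluation.
# The "agent" is a toy with a fixed 5% chance of solving the puzzle on
# any given action; only the budget-capping logic is the point.
import random

def run_episode(step_solve_prob: float, max_actions: int | None) -> bool:
    """Act until the puzzle is solved or the action budget runs out."""
    actions_taken = 0
    while max_actions is None or actions_taken < max_actions:
        actions_taken += 1
        if random.random() < step_solve_prob:  # stand-in for one real action
            return True
    return False  # budget exhausted before a solution was found

def accuracy(max_actions: int | None, n_puzzles: int = 10_000) -> float:
    """Fraction of toy puzzles solved under a given per-puzzle action cap."""
    solved = sum(run_episode(0.05, max_actions) for _ in range(n_puzzles))
    return solved / n_puzzles

for budget in (10, 50, 100, None):  # None means unlimited actions
    print(f"budget={budget}: accuracy={accuracy(budget):.0%}")
```

Even this toy reproduces the qualitative shape of the reported curve: roughly 40% accuracy under a 10-action cap, climbing steadily as the cap is lifted.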
This 84% score surpasses the estimated human average of 80% on the same tasks. The finding challenges the notion that current models lack 'true' reasoning: it suggests their performance is heavily constrained by the computational budget, the 'thinking time' allocated during inference, rather than by a fundamental lack of understanding. The ARC-AGI benchmarks are specifically designed to resist memorization, meaning the model must solve genuinely new problems. The data implies that with sufficient iterative reasoning, a capability amplified by techniques such as chain-of-thought prompting and agentic workflows, today's best models can exhibit human-level or better abstract reasoning on novel challenges.
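One common form of iterative reasoning (a generic pattern, not one the analysis attributes to any specific model) is a propose-and-verify loop: sample a candidate transformation, check it against the puzzle's training pairs, and only answer once a candidate fits. The sketch below illustrates that pattern; the `Grid` alias, the `propose` parameter, and the toy identity proposer are all hypothetical.

```python
# Hypothetical propose-and-verify loop: sample candidate transformations
# until one reproduces every training pair, then apply it to the test input.
from typing import Callable

Grid = list[list[int]]              # an ARC-style grid of color codes
Transform = Callable[[Grid], Grid]  # a candidate transformation rule

def solve_with_budget(
    propose: Callable[[list[tuple[Grid, Grid]]], Transform],
    train_pairs: list[tuple[Grid, Grid]],
    test_input: Grid,
    max_attempts: int,
) -> Grid | None:
    """Sample candidates until one fits all training pairs or the budget ends."""
    for _ in range(max_attempts):
        candidate = propose(train_pairs)  # one "action": draft a rule
        if all(candidate(x) == y for x, y in train_pairs):
            return candidate(test_input)  # rule verified; apply it
    return None  # budget exhausted without a verified rule

# Toy usage: a proposer that always guesses the identity transformation.
guess_identity: Callable[[list[tuple[Grid, Grid]]], Transform] = lambda _: (lambda g: g)
print(solve_with_budget(guess_identity, [([[1]], [[1]])], [[2]], max_attempts=3))
```

Each additional attempt can only raise the chance that some candidate passes verification, which is exactly the budget-to-performance relationship the chart shows.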
- Claude 3.5 Sonnet scores 84% on ARC-AGI-3 with unlimited actions, beating the estimated 80% human average.
- AI performance on the benchmark is highly dependent on the number of allowed reasoning steps ('actions').
- The ARC-AGI-3 test measures abstract reasoning on novel puzzles, so memorized solutions do not help.
Why It Matters
It shows that with enough computational 'thinking time,' current AI models can match or exceed average human performance on abstract reasoning, a finding that should inform how future AI agents budget inference-time compute.