Arc AGI-3 Released
New benchmark measures 'fluid intelligence,' with current models failing spectacularly.
A new, significantly harder benchmark for evaluating artificial general intelligence has been released. Arc AGI-3 is the latest iteration of a test suite created to measure 'fluid intelligence': the ability to reason, solve novel problems, and adapt to new situations, as distinct from recalling memorized facts. Its predecessors, versions 1 and 2, were considered excellent tests but were quickly 'saturated,' meaning top AI models like OpenAI's GPT-4 and Anthropic's Claude 3 achieved high scores on them, which reduced the tests' usefulness for measuring cutting-edge progress.
The initial results from Arc AGI-3 are stark. The best-performing AI model currently scores only 0.3%, a dramatic drop from performance on earlier versions. This near-zero score reflects the benchmark's extreme difficulty and its deliberate design to push beyond current AI capabilities. The benchmark presents a suite of complex, multi-step reasoning problems that require genuine understanding and logical deduction, areas where even the most advanced LLMs still struggle. The benchmark's creator and the AI community view the release as an exciting development that provides a clear, challenging target for the next generation of AI systems.
For researchers and companies like OpenAI, Google DeepMind, and Anthropic, Arc AGI-3 sets a new, higher bar. It shifts the focus of evaluation from knowledge retrieval to sophisticated reasoning and cognitive flexibility. This shift is critical for guiding development toward more robust and generally intelligent systems, rather than models that are merely proficient at pattern matching on training data. The benchmark will likely become a standard metric cited in future model releases, similar to how MMLU or GPQA are used today.
- Arc AGI-3 is a new benchmark designed to test AI 'fluid intelligence' and reasoning, not factual recall.
- The current best AI model scores a near-zero 0.3%, showing the test's extreme difficulty.
- It succeeds versions 1 and 2, which were saturated by top models, making Arc AGI-3 a new target for AI development.
Why It Matters
It sets a new, brutally difficult standard for measuring true AI reasoning, guiding research beyond simple knowledge recall.