ARC-AGI-3 Update (GPT-5.5 High and Opus 4.7)
The latest AI models manage only 0.43% and 0.18% on a rigorous abstract-reasoning benchmark.
New results from the ARC-AGI-3 benchmark reveal that even the most advanced AI models still struggle with abstract reasoning. OpenAI's GPT-5.5 achieved a score of 0.43%, while Anthropic's Opus 4.7 managed only 0.18%. ARC-AGI-3 is a variant of the Abstraction and Reasoning Corpus, designed to measure a model's ability to generalize from minimal examples, a core component of human-like intelligence. Unlike standard benchmarks that can be gamed through pattern matching or memorization, ARC-AGI-3 presents novel visual puzzles that demand flexible problem-solving. These scores, while low, are unsurprising: earlier model generations have also failed to crack the benchmark.
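To make concrete why these puzzles resist memorization, here is a minimal sketch of what an ARC-style task looks like: a handful of demonstration input/output grid pairs from which a solver must induce the underlying transformation, scored by exact match. The grid format, the mirror-image task, and the function names below are hypothetical simplifications for illustration, not the actual ARC-AGI-3 task format or API.

```python
# Illustrative sketch of an ARC-style task (assumed, simplified format):
# a few demonstration input/output grid pairs plus a held-out test input.
# Grids are small 2-D arrays of color indices (0-9).

from typing import Callable, List

Grid = List[List[int]]

# Hypothetical task: the hidden rule mirrors each grid left-to-right.
demonstrations = [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]},
]
test_input: Grid = [[5, 0, 0], [0, 5, 0]]

def mirror_lr(grid: Grid) -> Grid:
    """A candidate program induced from the two demonstrations."""
    return [list(reversed(row)) for row in grid]

def solves_demos(program: Callable[[Grid], Grid], demos) -> bool:
    # ARC-style scoring is all-or-nothing: the predicted output must
    # match the target grid cell for cell, so near misses earn no credit.
    return all(program(d["input"]) == d["output"] for d in demos)

if solves_demos(mirror_lr, demonstrations):
    print(mirror_lr(test_input))  # [[0, 0, 5], [0, 5, 0]]
```

Because every task hides a different rule, a solver cannot reuse a memorized pattern; it has to re-derive the transformation from two or three examples each time, which is exactly the generalization step that trips up current models.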
These results underline the significant gap between current deep learning systems and true AGI. Both GPT-5.5 and Opus 4.7 represent the cutting edge of large language and multimodal models, yet they score barely above zero on ARC-AGI-3. The benchmark's designer, François Chollet, has argued that solving ARC tasks requires forming genuinely new concepts on the fly, an ability that remains a critical bottleneck for AI. The community will be watching closely to see whether future model iterations or emerging architectures such as deep reasoning transformers can make meaningful progress. For now, ARC-AGI-3 remains the ultimate litmus test for AGI readiness.
- OpenAI's GPT-5.5 scored 0.43% and Anthropic's Opus 4.7 scored 0.18% on ARC-AGI-3.
- ARC-AGI-3 is an abstract reasoning benchmark that requires solving novel visual puzzles, unlike standard NLU tasks.
- These scores highlight that current AI models still fail to generalize like humans, keeping AGI a distant goal.
Why It Matters
Demonstrates that even frontier AI models lack robust abstract reasoning, challenging AGI timelines and investment assumptions.