The ARC-AGI2 Illusion Of Progress: If Changing the Font Breaks the Model, It Doesn't Understand
New Claude and Gemini models set ARC-AGI2 records, but fail when question formatting changes.
The AI community's celebration of record-breaking benchmark scores has run into a credibility problem. Models including Anthropic's Claude Opus 4.6 (68%), Google's Gemini 3.1 Pro (77%), and Gemini 3 Pro Deepthink (84%) recently posted unprecedented scores on ARC-AGI2, a benchmark designed to measure abstract reasoning and fluid intelligence, and lab announcements featured these numbers prominently as evidence of revolutionary progress in core reasoning.
However, researchers including Melanie Mitchell identified a critical flaw: the impressive scores evaporate under trivial changes to how the test is presented. Altering simple elements such as the font encoding or the symbols used to represent the puzzles causes performance to plummet, a pattern experts call 'benchmark hacking': the model has learned the specific test format rather than developing genuine understanding. The result is a troubling discrepancy in which a model can score twice as high on ARC-AGI2 as its predecessors while performing worse on practical benchmarks like SWE-Bench (software engineering tasks).
The implications for how we measure AI progress are significant. When switching from red to black ink (in Mitchell's analogy) breaks a model, the system has not mastered the underlying concepts; it has memorized test-taking shortcuts. That challenges the narrative of imminent recursive self-improvement and AGI, and it suggests that headline-grabbing benchmark scores may overstate true competence. The episode underscores the need for evaluation methods that test generalization rather than format-specific performance as the field moves toward more capable AI systems.
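To make that kind of robustness test concrete, here is a minimal sketch of a symbol-remapping perturbation check in Python. It assumes an ARC-style task stored as a dict of integer grids ("train" input/output pairs, "test_input", "test_output") and a hypothetical `query_model` callable that turns a text prompt into a predicted grid; none of these names come from the researchers' setup, and the sketch is illustrative rather than a reproduction of their protocol.

```python
import random

def remap_symbols(grid, mapping):
    """Relabel every cell of a grid according to `mapping`, leaving shape intact."""
    return [[mapping[cell] for cell in row] for row in grid]

def render(grid):
    """Render a grid as whitespace-separated rows of symbols."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def format_prompt(task):
    """Turn the training pairs plus the test input into a plain-text prompt."""
    parts = []
    for x, y in task["train"]:
        parts.append("INPUT:\n" + render(x) + "\nOUTPUT:\n" + render(y))
    parts.append("INPUT:\n" + render(task["test_input"]) + "\nOUTPUT:")
    return "\n\n".join(parts)

def perturbation_check(task, query_model, alphabet="ABCDEFGHIJ", seed=0):
    """Pose the same task twice: once with the canonical 0-9 symbols and once
    with a random relabeling. A model that reasons over the abstract rule
    should succeed or fail on both; a large gap points to format-specific
    memorization rather than generalization."""
    rng = random.Random(seed)
    shuffled = rng.sample(list(alphabet), len(alphabet))
    mapping = {digit: symbol for digit, symbol in enumerate(shuffled)}
    inverse = {symbol: digit for digit, symbol in mapping.items()}

    # Query on the original presentation.
    original_answer = query_model(format_prompt(task))

    # Query on the relabeled presentation: same abstract task, different symbols.
    remapped = {
        "train": [(remap_symbols(x, mapping), remap_symbols(y, mapping))
                  for x, y in task["train"]],
        "test_input": remap_symbols(task["test_input"], mapping),
    }
    remapped_answer = query_model(format_prompt(remapped))
    # Map the relabeled answer back to digits before comparing with the target.
    recovered = [[inverse.get(cell, cell) for cell in row] for row in remapped_answer]

    target = task["test_output"]
    return {"original_correct": original_answer == target,
            "remapped_correct": recovered == target}
```

A model whose accuracy survives the relabeling is at least not leaning on the canonical 0-9 encoding; a large gap between the two conditions is exactly the kind of format sensitivity the critics describe.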
- Claude Opus 4.6 (68%) and Gemini 3.1 Pro (77%) set ARC-AGI2 records but fail with formatting changes
- Researchers found that performance drops when encodings or symbols are changed, revealing 'benchmark hacking' rather than true understanding
- The discrepancy shows models can score 2x higher on reasoning tests while performing worse on practical benchmarks like SWE-Bench
Why It Matters
The findings raise the question of whether billion-dollar AI investments are producing genuine intelligence or merely optimized test-takers, forcing a reevaluation of how progress is measured.