From 0% to 36% on Day 1 of ARC-AGI-3
An open-source AI agent achieved a 36% score on the notoriously difficult ARC-AGI-3 benchmark on its first public attempt.
A submission from Symbolica AI to the ARC-AGI-3 benchmark on GitHub has sparked significant discussion in the AI community. The agent, whose architecture is detailed in the repository, reportedly scored 36% on its first public evaluation day. The ARC-AGI benchmark, created by researcher François Chollet, is designed to measure an AI's capacity for abstract reasoning and core knowledge, focusing on tasks that require understanding and applying novel patterns rather than recalling memorized information. That design makes it notoriously difficult: even state-of-the-art models like GPT-4 and Claude 3 have historically scored poorly, often below 40%.
The immediate 36% score is notable because it represents a substantial leap from a zero baseline, achieved without the long, iterative fine-tuning cycles typical of large language models. The performance suggests Symbolica's agent may be using a fundamentally different approach, potentially based on program synthesis or symbolic reasoning, to tackle the core generalization problems ARC-AGI presents. This rapid success on a benchmark designed to be 'immune to scaling' challenges the prevailing narrative that brute-force model scaling is the only path to advanced reasoning. The open-source nature of the submission allows for immediate scrutiny and could accelerate research into alternative AI architectures focused on robust, human-like abstraction.
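To make the program-synthesis idea concrete: a minimal sketch of the style of approach the article speculates about is to search a small domain-specific language of grid transforms for a program consistent with a task's training pairs, then apply it to a test input. The DSL, primitives, and task below are invented for illustration; nothing here reflects the actual architecture in Symbolica's repository.

```python
# Toy program synthesis over ARC-style grid tasks (illustrative only).
# Grids are tuples of tuples of ints; primitives and the task are invented.
from itertools import product

PRIMITIVES = {
    "identity": lambda g: g,
    "flip_h": lambda g: tuple(row[::-1] for row in g),   # mirror left-right
    "flip_v": lambda g: g[::-1],                          # mirror top-bottom
    "rotate_90": lambda g: tuple(zip(*g[::-1])),          # rotate clockwise
}

def synthesize(train_pairs, depth=2):
    """Enumerate compositions of primitives up to `depth` and return the
    first program (a tuple of primitive names) that maps every training
    input to its training output, plus a callable that runs it."""
    names = list(PRIMITIVES)
    for d in range(1, depth + 1):
        for combo in product(names, repeat=d):
            def run(g, combo=combo):
                for name in combo:
                    g = PRIMITIVES[name](g)
                return g
            if all(run(x) == y for x, y in train_pairs):
                return combo, run
    return None, None

# Invented task: each output is the input flipped horizontally.
train = [
    (((1, 0), (2, 3)), ((0, 1), (3, 2))),
    (((5, 5, 0), (0, 1, 2)), ((0, 5, 5), (2, 1, 0))),
]
program, run = synthesize(train)
print(program)                  # -> ('flip_h',)
print(run(((7, 8), (9, 0))))    # -> ((8, 7), (0, 9))
```

Even this brute-force search generalizes to unseen grids once the right program is found, which is the property ARC-AGI is built to reward; real systems replace the tiny DSL and exhaustive enumeration with far richer primitives and guided search.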
- Symbolica AI's agent scored 36% on the ARC-AGI-3 abstract reasoning benchmark on its first evaluation day.
- The ARC-AGI benchmark is designed to test core knowledge and novel pattern application, where top LLMs like GPT-4 typically score below 40%.
- The rapid success suggests a potential breakthrough in non-neural or hybrid approaches to artificial general intelligence (AGI)-style reasoning.
Why It Matters
It challenges the dominance of pure LLM scaling by demonstrating rapid progress on fundamental reasoning tasks, potentially pointing to new AGI research paths.