ARC-AGI-3
New benchmark tests AI agents on dynamic environments and long-term planning, not just static puzzles.
The ARC Prize team has launched ARC-AGI-3, a groundbreaking interactive reasoning benchmark designed to measure human-like intelligence in AI agents. Unlike traditional static tests, ARC-AGI-3 challenges agents to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously from experience. A 100% score signifies that an AI agent can solve every game as efficiently as a human; the benchmark scores the process of learning rather than just the final answer. It is built on principles of being easy for humans, requiring no pre-loaded knowledge, and featuring novelty to prevent brute-force memorization.
ARC-AGI-3 measures intelligence across time, evaluating skill-acquisition efficiency, long-horizon planning under sparse feedback, and experience-driven adaptation over many steps. It aims to make the gap between AI and human learning measurable by capturing planning horizons, memory compression, and the ability to update beliefs in light of new evidence. The platform includes a full developer toolkit for agent integration, an interactive UI for testing, and replayable runs that let developers inspect agent behavior through a structured timeline of decisions and actions. This represents a significant shift from evaluating final outputs to assessing the entire reasoning and learning process, setting a new standard for what constitutes general intelligence in machines.
- First interactive benchmark testing AI's ability to learn dynamically in novel environments, not solve static puzzles.
- Measures skill-acquisition efficiency and long-horizon planning, requiring agents to adapt strategies without language instructions.
- Includes a full developer toolkit, replay system for behavior analysis, and UI for transparent agent evaluation and iteration.
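To make the evaluation loop concrete, here is a minimal sketch of what an interactive-benchmark agent harness could look like: an agent acts in a novel environment with sparse feedback while every decision is logged to a replayable timeline. All class and function names here are illustrative assumptions, not the actual ARC-AGI-3 developer API.

```python
# Hypothetical sketch of an interactive-benchmark agent loop.
# None of these names are the real ARC-AGI-3 API; the point is the shape:
# observe -> act -> sparse reward, with a structured replay log.
from dataclasses import dataclass, field


@dataclass
class ToyGridGame:
    """Stand-in environment: reach cell `goal` on a 1-D track.
    Feedback is sparse: reward arrives only on success."""
    goal: int = 5
    pos: int = 0

    def observe(self):
        return {"pos": self.pos}

    def act(self, action):  # action: -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0  # sparse feedback
        return self.observe(), reward, done


@dataclass
class ReplayLog:
    """Structured timeline of decisions, for post-hoc behavior analysis."""
    events: list = field(default_factory=list)

    def record(self, step, obs, action, reward):
        self.events.append(
            {"step": step, "obs": obs, "action": action, "reward": reward}
        )


def run_episode(game, policy, log, max_steps=50):
    """Run one episode; return True if the agent reached the goal."""
    obs = game.observe()
    for step in range(max_steps):
        action = policy(obs)
        obs, reward, done = game.act(action)
        log.record(step, obs, action, reward)
        if done:
            return True
    return False


# A trivially simple policy standing in for a learned agent.
always_right = lambda obs: 1

log = ReplayLog()
solved = run_episode(ToyGridGame(), always_right, log)
# Efficiency metric in the spirit of the benchmark: steps used to solve.
steps_taken = len(log.events)
```

Under this framing, skill-acquisition efficiency could be scored by comparing `steps_taken` against a human baseline for the same game, and the `ReplayLog` timeline is what a replay UI would render for inspection.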
Why It Matters
This benchmark sets a new, more rigorous standard for measuring progress toward true AGI by evaluating the learning process itself.