Media & Culture

LLMs do fine on ARC-AGI-3 if they are allowed to search over game logs

Structured search over logs cuts the actions needed from 80k to roughly 900, near human level.

Deep Dive

A new analysis from Reddit user ClarityInMadness challenges the belief that LLMs are fundamentally poor at the ARC-AGI-3 benchmark. Frontier models such as Opus 4.6 and GPT-5.2 do fail out of the box, unable to progress beyond Level 3 even after 1,000 actions, but the key differentiator is tool use. When the LLMs are allowed to search over game logs (recorded actions, board states, and scores) and run Python, the agents reach near-human efficiency, finishing the preview games in roughly 900 actions. By contrast, the non-LLM exploration-based agents that dominate the ARC 2025 leaderboard needed 80k–100k+ actions.
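The post does not specify the exact tool interface the agents used, but the core idea is that the model queries a structured log rather than receiving 100k lines in its context. A minimal sketch of what such a log-query tool could look like (the `LogEntry` fields, function names, and the score-delta heuristic are illustrative assumptions, not the author's actual code):

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    step: int     # index of the action in the episode
    action: str   # e.g. "up", "down", "click_3_4"
    score: int    # game score observed after the action
    board: tuple  # flattened board state after the action

def search_log(log, predicate, limit=20):
    """Return up to `limit` entries matching `predicate`.

    Exposed as a tool, this lets the LLM filter a very long log
    (e.g. 100,000+ lines) down to the few entries it asks about.
    """
    return [e for e in log if predicate(e)][:limit]

def score_deltas(log):
    """List steps where the score changed: candidate cause-effect
    events worth inspecting when reverse-engineering the game rules."""
    return [(b.step, b.action, b.score - a.score)
            for a, b in zip(log, log[1:]) if b.score != a.score]
```

With tools like these, the agent can ask targeted questions ("which actions ever changed the score?") instead of replaying the whole episode in context.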

The analysis found that additional hand-engineering (e.g., pre-built functions or memory abstraction) provided diminishing or even negative returns. Instead, structured search over raw logs, even logs exceeding 100,000 lines, remained tractable and effective. A standout example: on the last level of ft09, the agent recognized the Lights Out mechanic, constructed a linear system, and solved it via Gaussian elimination in just 11 clicks, a near-optimal analytic solution. This suggests that with appropriate search and tool access, LLMs can match human planning efficiency on abstract reasoning tasks without complex scaffolding.
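The Lights Out trick works because the game is linear over GF(2): each press toggles a fixed plus-shaped pattern, presses commute, and pressing twice cancels, so the presses needed to clear a board are the solution of A x = b (mod 2), where column j of A is press j's toggle pattern and b is the initial lit pattern. A sketch of that solve (this is an illustration of the technique, not the agent's actual code):

```python
import numpy as np

def lights_out_presses(initial, n=3):
    """Solve Lights Out on an n x n board over GF(2).

    Pressing cell (r, c) toggles it and its orthogonal neighbors.
    Returns an n x n 0/1 array of presses clearing the board,
    or None if the initial pattern is unreachable.
    """
    size = n * n
    A = np.zeros((size, size), dtype=np.uint8)
    for r in range(n):
        for c in range(n):
            j = r * n + c
            for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    A[rr * n + cc, j] = 1  # press j toggles cell (rr, cc)
    b = np.array(initial, dtype=np.uint8).reshape(size)

    # Gauss-Jordan elimination mod 2 on the augmented matrix [A | b];
    # XOR is addition in GF(2), so row operations are row ^= pivot_row.
    M = np.concatenate([A, b[:, None]], axis=1)
    pivot_cols, row = [], 0
    for col in range(size):
        pivot = next((r for r in range(row, size) if M[r, col]), None)
        if pivot is None:
            continue  # free column
        M[[row, pivot]] = M[[pivot, row]]
        for r in range(size):
            if r != row and M[r, col]:
                M[r] ^= M[row]
        pivot_cols.append(col)
        row += 1

    # A zero coefficient row with RHS 1 means the board is unsolvable.
    if any(M[r, size] and not M[r, :size].any() for r in range(row, size)):
        return None
    x = np.zeros(size, dtype=np.uint8)
    for r, col in enumerate(pivot_cols):
        x[col] = M[r, size]  # free variables stay 0
    return x.reshape(n, n)
```

The payoff is that the press count comes from algebra, not trial and error, which is how an agent can land on an 11-click solution instead of exploring thousands of actions.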

Key Points
  • LLMs with game log search match human efficiency (~900 actions) on ARC-AGI-3, outperforming non-LLM agents needing 80k–100k+ actions.
  • Frontier models (Opus 4.6, GPT-5.2) fail without tool support, unable to progress beyond Level 3 even after 1,000 actions.
  • Minimal tooling (Python, raw log search) beats elaborate hand-engineering; one agent solved Lights Out in 11 clicks via Gaussian elimination.

Why It Matters

Simple search over game logs can make LLMs as efficient as humans on reasoning tasks, reducing the need for complex agents.