AI Safety

Anthropic's Claude Opus 4.7 Finally Beats Pokémon Red a Year Late, But Gemini Got There First

Claude Opus 4.5 lost over 50k reasoning steps to an overlooked item on the ground and over 112k more to switch puzzles, but Opus 4.7 finally triumphed.

Deep Dive

Anthropic's Claude Opus 4.7 has finally beaten Pokémon Red, completing a challenge that went viral over a year ago. However, the victory is somewhat anticlimactic: Gemini 2.5 Pro already beat Pokémon Blue in May 2025 using a simpler harness, and Claude's win is attributed to accumulated incremental improvements rather than a major leap in intelligence. The harness (the code that interacts with the game) used by Claude is now comparable to what Gemini has employed in recent weeks. Still, the milestone highlights how far language-model-driven game agents have come.

The road to victory was painful. Opus 4.5 got stuck in Silph Co. for weeks (over 50k reasoning steps) because it kept ignoring a key item on the ground, the Card Key, mistaking it for an NPC. It then spent another 112k steps in Cinnabar Mansion, unable to correlate switch presses with barrier states. Opus 4.6 improved note-taking and memory but still struggled with visual switches in Victory Road. Opus 4.7, while only marginally smarter, combined better reasoning with refined harness design to finally conquer the game. The achievement underscores that progress in AI gaming is less about eureka moments and more about steady, iterative refinement.

Key Points
  • Claude Opus 4.7 beat Pokémon Red after a year-long effort, but Gemini 2.5 Pro had already beaten Pokémon Blue in May 2025 with a simpler harness.
  • Opus 4.5 got stuck in Silph Co. for over 50k reasoning steps because it ignored an item on the ground (the Card Key), and then spent 112k steps in Cinnabar Mansion failing to understand switch mechanics.
  • The victory is attributed to cumulative improvements in model reasoning and harness refinements, not a single breakthrough: Opus 4.7 is smarter than Opus 4.6, but not by a large leap.

Why It Matters

Incremental improvements in models and harnesses can still clear complex gaming benchmarks, but gaps in perception and memory remain significant hurdles.