Research & Papers

Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

New benchmark uses 62,231 pencil puzzles to test AI reasoning, showing GPT-5.2 improves 81x with effort scaling.

Deep Dive

Researcher Justin Waugh has introduced Pencil Puzzle Bench, a novel benchmark framework designed to rigorously evaluate the multi-step, verifiable reasoning capabilities of large language models. The benchmark leverages a massive database of 62,231 pencil puzzles across 94 distinct varieties—constraint-satisfaction problems closely related to NP-complete tasks—each with a verified unique solution. A curated set of 300 puzzles spanning 20 varieties was used to test 51 models from 11 major AI providers in two distinct modes: direct, single-shot questioning and complex, multi-turn agentic interactions where the AI can iteratively check its work. The key innovation is deterministic, step-level verification; every intermediate board state can be validated against specific puzzle rules, precisely localizing errors and providing dense, per-move reward signals crucial for training methods like process supervision and reinforcement learning.
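
To make the step-level verification concrete, the sketch below shows one way such a checker could work for a Sudoku-style variety: each proposed move is replayed against the puzzle's constraints and scored individually, so the first rule violation pinpoints exactly where the reasoning chain broke. The rule set, function names, and reward scheme here are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch only: the benchmark's real puzzle formats and verifier are not
# shown in this summary, so the simplified Sudoku rules and all names below are
# assumptions chosen for demonstration.
from typing import List, Optional, Tuple

Board = List[List[Optional[int]]]  # 9x9 grid, None marks an empty cell
Move = Tuple[int, int, int]        # (row, col, value)

def violates_rules(board: Board, row: int, col: int, value: int) -> bool:
    """True if placing `value` at (row, col) breaks a row, column, or 3x3-box constraint."""
    if any(board[row][c] == value for c in range(9) if c != col):
        return True
    if any(board[r][col] == value for r in range(9) if r != row):
        return True
    br, bc = 3 * (row // 3), 3 * (col // 3)
    return any(
        board[r][c] == value
        for r in range(br, br + 3)
        for c in range(bc, bc + 3)
        if (r, c) != (row, col)
    )

def score_moves(board: Board, moves: List[Move]) -> List[float]:
    """Deterministic step-level verification: replay each proposed move in order and
    emit a per-move reward (1.0 for a rule-consistent placement, 0.0 at the first
    violation), localizing the error to a specific step. Mutates `board` in place."""
    rewards: List[float] = []
    for row, col, value in moves:
        if board[row][col] is not None or violates_rules(board, row, col, value):
            rewards.append(0.0)
            break  # the error is localized to this exact move
        board[row][col] = value
        rewards.append(1.0)
    return rewards
```

Because every move earns its own deterministic score, a trace like this yields the dense, per-step reward signal that process supervision and reinforcement learning methods can train against.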

The evaluation results reveal two critical axes of model capability. First, 'reasoning effort scaling' measures how much a model improves when prompted to expend more cognitive effort; OpenAI's GPT-5.2 demonstrated an extraordinary 81x performance increase from minimal to maximum reasoning effort. Second, 'agentic iteration' tracks the gains from allowing models to self-correct through iterative verification: Anthropic's Claude Opus 4.6 rose from 0.3% to 30.0% accuracy with this method, while GPT-5.2 in its highest-effort mode improved from 20.2% to 56.0%. These agentic runs were computationally intensive, with a median of 29 turns over 17 minutes and the longest session exceeding 1,221 turns and 14.3 hours. The benchmark is therefore a demanding test of long-context utilization and sustained logical reasoning, not just final-answer correctness.
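
A similarly hedged sketch of the agentic mode follows: the model proposes one move per turn, a deterministic verifier checks it, and any violation is fed back so the model can correct itself on the next turn. It reuses the toy verifier above; the `propose_move` stub, the turn limit, and the feedback format are assumptions for illustration, not the benchmark's actual harness.

```python
# Hypothetical agentic-iteration loop, reusing Board, Move, and violates_rules from
# the verifier sketch above. The model call is a stub, and the turn/feedback protocol
# is an assumption made for illustration.
def propose_move(board: Board, feedback: Optional[str]) -> Move:
    """Stand-in for an LLM call that returns (row, col, value) given the current
    board and the verifier's last error message."""
    raise NotImplementedError("replace with a real model call")

def agentic_solve(board: Board, max_turns: int = 100) -> Board:
    """Iterate propose -> verify -> feed back until the board is full or turns run out."""
    feedback: Optional[str] = None
    for turn in range(max_turns):
        row, col, value = propose_move(board, feedback)
        if board[row][col] is not None or violates_rules(board, row, col, value):
            # The deterministic verifier localizes the error and reports it back,
            # letting the model self-correct on its next turn.
            feedback = f"Turn {turn}: placing {value} at ({row}, {col}) violates a constraint."
            continue
        board[row][col] = value
        feedback = None
        if all(cell is not None for line in board for cell in line):
            break  # every cell filled: a verified complete solution
    return board
```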

Key Points
  • Benchmark uses 62,231 pencil puzzles across 94 varieties for step-by-step, verifiable reasoning evaluation.
  • GPT-5.2 shows an 81x performance improvement when scaling from minimal to maximum reasoning effort.
  • Agentic iteration boosts Claude Opus 4.6 from 0.3% to 30.0% accuracy and GPT-5.2 from 20.2% to 56.0%.

Why It Matters

Provides a rigorous, transparent test of AI reasoning, which is crucial for developing reliable agents in finance, logistics, and scientific research.