Research & Papers

[R] Extreme Sudoku as a constraint-satisfaction benchmark, solved natively without tools or CoT or solution backtracking

New architecture scores 97.4% on a 250,000-puzzle benchmark where OpenAI's o3-mini, Anthropic's Claude 3.7 Sonnet, and DeepSeek's R1 all score 0%.

Deep Dive

Pathway, an AI infrastructure company, has introduced a new benchmark and architecture that highlights a critical weakness in today's leading large language models. Their 'Sudoku Extreme' benchmark consists of approximately 250,000 notoriously difficult Sudoku puzzles, framed as a pure constraint-satisfaction problem where solutions are easy to verify but hard to generate. In a striking result, top-tier LLMs including OpenAI's o3-mini, Anthropic's Claude 3.7 Sonnet, and DeepSeek's R1 all achieved 0% accuracy on this test. In contrast, Pathway's own BDH (Bidirectional Hierarchical) architecture achieved 97.4% accuracy, solving the puzzles natively without relying on chain-of-thought reasoning, Python code execution, or explicit solution backtracking.
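To make the "easy to verify but hard to generate" framing concrete, here is a minimal sketch of a Sudoku solution checker. This is an illustration of the verification asymmetry, not Pathway's actual evaluation harness; the function name and grid representation (a 9x9 list of lists of digits) are assumptions.

```python
def is_valid_solution(grid):
    """Verify a completed 9x9 Sudoku grid in O(81) work: every row,
    column, and 3x3 box must contain the digits 1-9 exactly once.
    Verification is trivial; *finding* the grid is the hard part."""
    digits = set(range(1, 10))
    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [[grid[3 * br + i][3 * bc + j] for i in range(3) for j in range(3)]
             for br in range(3) for bc in range(3)]
    return all(set(unit) == digits for unit in rows + cols + boxes)
```

Checking a candidate answer takes a few dozen set comparisons, while generating one from a sparse puzzle requires search over a combinatorial space, which is exactly the asymmetry the benchmark exploits.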

The core finding challenges a fundamental assumption in AI development: that transformer-based models, through techniques like extended chain-of-thought, can be scaled to solve complex, search-heavy reasoning problems. The researchers argue that transformers, which process information token-by-token with limited internal state, are poorly suited to tasks that require maintaining multiple candidate solutions and revising earlier assumptions. The benchmark suggests that current progress in LLM reasoning may amount to producing longer verbalizations of a search process rather than building architectures capable of performing that search internally.
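The bookkeeping the researchers describe, tracking candidate digits per cell and undoing earlier guesses on conflict, is what a classical constraint-satisfaction solver does explicitly. The sketch below shows that machinery for Sudoku; it illustrates the kind of internal search the article argues transformers struggle to perform, and is not a description of BDH's mechanism.

```python
def solve(grid):
    """Tiny backtracking Sudoku solver over a 9x9 grid (0 = empty).
    It maintains a candidate set per cell and revises earlier
    assignments when a branch dead-ends. Mutates grid in place and
    returns it on success, or None if the puzzle is unsolvable."""
    def candidates(r, c):
        # Digits already used in this cell's row, column, and 3x3 box.
        used = set(grid[r]) | {grid[i][c] for i in range(9)}
        br, bc = 3 * (r // 3), 3 * (c // 3)
        used |= {grid[br + i][bc + j] for i in range(3) for j in range(3)}
        return [d for d in range(1, 10) if d not in used]

    empties = [(r, c) for r in range(9) for c in range(9) if grid[r][c] == 0]
    if not empties:
        return grid  # every cell filled consistently
    # Branch on the most constrained cell (fewest remaining candidates).
    r, c = min(empties, key=lambda rc: len(candidates(*rc)))
    for d in candidates(r, c):
        grid[r][c] = d          # tentative assignment
        if solve(grid):
            return grid
        grid[r][c] = 0          # backtrack: revise the earlier guess
    return None                 # no candidate works; caller must backtrack
```

The two lines marked "tentative assignment" and "backtrack" are the crux: the solver's state lets it cheaply undo a commitment made many steps earlier, whereas a token-by-token generator has already emitted its guess into the output stream.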

This result raises significant questions for the future of AI reasoning. It implies that pushing transformer architectures with more data and longer contexts may have inherent limitations for certain classes of logical and constraint-based problems. The success of Pathway's BDH architecture, which employs a different internal reasoning substrate, points to a potential need for hybrid or novel architectures that incorporate stronger internal memory and continuous reasoning spaces to move beyond the constraints of pure language modeling.

Key Points
  • Pathway's BDH architecture solved 97.4% of 250,000 'Sudoku Extreme' puzzles, a pure constraint-satisfaction benchmark.
  • Leading LLMs (OpenAI o3-mini, Claude 3.7, DeepSeek R1) scored 0% on the same benchmark, failing without external tools.
  • The result raises the question of whether transformer-based models can perform search-heavy reasoning natively, or merely verbalize the search process.

Why It Matters

Exposes a fundamental limit in transformer reasoning, pushing the field toward new architectures for logic and search tasks.