AI Safety

InsanityBench: Cryptic Puzzles as a Probe for Lateral Thinking

New benchmark reveals AI's creative gap: models struggle with cryptic puzzles requiring 'insane' conceptual leaps.

Deep Dive

A new benchmark called InsanityBench is challenging AI models with cryptic puzzles designed to test lateral thinking and creative problem-solving abilities. Created by researcher Robin Ha, the benchmark consists of 10 handcrafted tasks that require the kind of conceptual leaps often seen in scientific breakthroughs—what Ha calls 'productive insanity.' Current state-of-the-art models like OpenAI's GPT-4 and Anthropic's Claude 3 score only 10-15% on these puzzles, revealing a significant gap in AI's creative reasoning capabilities.

The benchmark's design intentionally resists gaming through several mechanisms. Each puzzle uses a completely different format—switching among images, poems, short stories, Python code, and cryptic text—with no repeating patterns. Only one example task is publicly released; the rest of the dataset is kept private. This approach contrasts with benchmarks built from IMO and Codeforces problems, which Ha argues require 'low-dimensional creativity' that can be mastered through pattern recognition and extensive practice.

InsanityBench represents a shift toward evaluating AI's ability to make unexpected connections and engage with seemingly absurd ideas that later prove correct. Each puzzle is graded on a 0-10 scale, with 10 representing a fully correct answer and 5 representing significant progress. The benchmark arrives as researchers voice concern about saturation elsewhere: companies now hire mathematicians specifically to improve model performance on standardized tests, potentially decoupling benchmark scores from actual progress in AI reasoning.
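The 0-10 per-puzzle rubric maps naturally onto the reported 10-15% overall scores. As a minimal sketch (the aggregation rule and example scores here are assumptions for illustration, not details published by Ha):

```python
# Hypothetical aggregation of InsanityBench-style grades into a percentage.
# The rubric: each of the 10 puzzles is graded 0-10, where 10 is fully
# correct and 5 is significant progress. Scores below are illustrative.

def benchmark_percentage(task_scores, max_score=10):
    """Convert per-task 0-10 grades into an overall percentage score."""
    if not task_scores:
        raise ValueError("no tasks graded")
    return 100 * sum(task_scores) / (max_score * len(task_scores))

# A model that makes significant progress (5/10) on two of ten puzzles
# and fails the rest lands at 10% overall -- roughly the SOTA range cited.
scores = [5, 5, 0, 0, 0, 0, 0, 0, 0, 0]
print(benchmark_percentage(scores))  # → 10.0
```

Under this rule, the 10-15% range reported for frontier models corresponds to partial credit on only one or two puzzles out of ten.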

Key Points
  • Current SOTA models score only 10-15% on lateral thinking puzzles requiring creative leaps
  • Benchmark uses 10 diverse, handcrafted tasks with constantly changing formats to prevent gaming
  • Designed to measure 'productive insanity'—the ability to engage with seemingly absurd ideas that later prove correct

Why It Matters

Reveals AI's creative reasoning gap and provides a harder-to-game benchmark for measuring true progress in lateral thinking.