Starburst: Unsaturated Since Summer 2024
A puzzle game created in 2024 still stumps GPT-4, Claude 3, and other top models, even with hand-holding prompts.
Starburst, a puzzle game created by Chapin Lenthall-Cleary in summer 2024, has unexpectedly become one of AI's most persistent reasoning benchmarks. Originally designed as a human intelligence test inspired by science fiction novels, the game presents players with celestial observations from a fictional universe and challenges them to deduce its underlying laws of physics. Players progress through 20 technological eras, each granting better observational tools; humans typically solve the puzzle by era 6 through reasoning and pattern recognition.
When tested against state-of-the-art LLMs in late 2024, including GPT-4, GPT-4o, and Claude models, the results were striking. Even with carefully designed prompting that removed strategic decision-making and presented information optimally, GPT-4 and GPT-4o could not solve Starburst until era 17 of 20, and OpenAI's o1 model only occasionally solved it as early as era 16. The researchers' benchmarking scheme arguably gave LLMs an unfair advantage by eliminating agency and irrelevant data, yet the models still fell far short of human performance. This suggests that current LLMs rely heavily on pattern recognition from training data rather than genuine reasoning, since Starburst's fictional physics would not appear in their training corpora.
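For intuition, here is a minimal sketch of what such an era-by-era evaluation loop could look like, assuming the harness reveals more observations at each era and checks the model's stated laws against the ground truth. Every name below (query_model, observations_for_era, check_laws) is a hypothetical stand-in, not the actual benchmark's code.

```python
# Hypothetical sketch of an era-by-era evaluation loop for a Starburst-style
# benchmark. All names here (query_model, observations_for_era, check_laws)
# are illustrative assumptions, not the real benchmark's implementation.

def evaluate_model(query_model, observations_for_era, check_laws, max_era=20):
    """Return the earliest era at which the model states the correct laws,
    or None if it never does within max_era eras (lower is better)."""
    observations = []
    for era in range(1, max_era + 1):
        # Each new era grants better observational tools, so more data
        # about the fictional universe becomes available.
        observations.append(observations_for_era(era))
        prompt = (
            "All celestial observations available so far:\n"
            + "\n".join(observations)
            + "\n\nState the laws of physics governing this universe."
        )
        answer = query_model(prompt)
        if check_laws(answer):  # Compare the answer to the ground-truth laws.
            return era
    return None
```

Under this framing, a human-level result would be a return value around 6, while the reported GPT-4 and GPT-4o results correspond to a return value of 17.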
The benchmark's longevity is particularly notable: nearly two years after its creation, no AI system has approached human performance. The game's design proved fortuitously suited to the benchmarking role it fell into by accident: its text-based format, lack of specialized knowledge requirements, and progressive difficulty make it an ideal testbed for evaluating reasoning. As AI companies continue developing more advanced models like GPT-5 and Claude 4, Starburst serves as a sobering reminder that scaling parameters and training data alone may not solve fundamental reasoning challenges.
- Starburst was created in 2024 as a human intelligence test but became an accidental AI benchmark that remains unsolved
- Even with optimized prompting, GPT-4 and GPT-4o solve it only at era 17 of 20, while humans typically solve it by era 6
- The benchmark reveals LLMs' reliance on pattern recognition over genuine reasoning, persisting nearly 2 years after creation
Why It Matters
Reveals fundamental reasoning gaps in current AI that won't be solved by simply scaling models, guiding future research directions.