CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
New benchmark shows frontier LLMs struggle with creative connections despite factual knowledge, exposing a critical weakness.
A research team from EPFL and the University of Geneva has introduced CresOWLve, a benchmark designed to evaluate creative problem-solving in AI systems through puzzles grounded in real-world knowledge. Unlike existing benchmarks that test isolated skills or rely on contrived scenarios, CresOWLve requires models to apply multiple creative-thinking strategies, such as lateral thinking and analogy-making, while retrieving and integrating facts from diverse domains. The benchmark aims to reflect how genuine creative insight arises in practical settings, moving beyond artificial brainteasers.
Testing several frontier large language models (LLMs), including standard and 'thinking' variants of models such as GPT-4 and Claude 3, the researchers found CresOWLve to be highly challenging. Their analysis revealed a consistent and significant performance gap: models scored up to 17% lower on creative questions than on straightforward factual ones. This indicates that while current LLMs are proficient at information retrieval, they struggle with the cognitive leap required to form novel, non-obvious connections between pieces of knowledge, which is the core of creative problem-solving.
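The gap the researchers report can be illustrated with a minimal sketch of how such a metric is typically computed: per-category accuracy, then the difference between factual and creative scores. The data and function names below are hypothetical placeholders for illustration; the paper's exact scoring procedure is not described here.

```python
# Illustrative sketch of a creative-vs-factual performance gap metric.
# All data below is made up for demonstration; it is not from the benchmark.

def accuracy(results):
    """Fraction of correct answers in a list of (category, correct) pairs."""
    return sum(correct for _, correct in results) / len(results)

def creative_gap(results):
    """Factual accuracy minus creative accuracy, as a fraction."""
    factual = [r for r in results if r[0] == "factual"]
    creative = [r for r in results if r[0] == "creative"]
    return accuracy(factual) - accuracy(creative)

# Hypothetical evaluation log: (question category, model answered correctly)
results = [
    ("factual", True), ("factual", True), ("factual", True), ("factual", False),
    ("creative", True), ("creative", False), ("creative", False), ("creative", True),
]
print(f"gap: {creative_gap(results):.0%}")  # factual 75% - creative 50% -> gap: 25%
```

A positive gap on a matched set of questions is what signals that retrieval succeeds where creative integration fails.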
The study's findings suggest that simply scaling up model size or training data may not be sufficient to bridge this creativity gap. The benchmark provides a crucial tool for the AI community to measure progress toward systems that can genuinely reason and innovate, not just recall. As AI moves toward more autonomous applications, this ability to creatively synthesize information will be essential for tackling complex, real-world challenges.
- CresOWLve benchmark tests creative problem-solving using real-world knowledge puzzles, not artificial scenarios.
- Evaluation of frontier LLMs shows a performance drop of up to 17% on creative vs. factual questions.
- Models can retrieve relevant knowledge but fail at the creative integration needed for novel solutions.
Why It Matters
Exposes a fundamental weakness in current AI: the inability to move from information recall to genuine creative insight for real-world tasks.