ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
New benchmark tests AI on spatial and verbal reasoning simultaneously, exposing a critical weakness.
A team of researchers including Tianlong Wang has published a new AI benchmark called ItinBench, designed to test large language models (LLMs) on complex, multi-dimensional planning. Unlike traditional benchmarks that isolate specific skills like math or coding, ItinBench integrates a spatial reasoning task—specifically route optimization—into a broader verbal reasoning challenge centered on planning a trip itinerary. This approach aims to mirror real-world scenarios where problem-solving requires juggling different types of cognitive tasks simultaneously.
The benchmark was used to evaluate leading models including GPT-4, Gemini 1.5 Pro, Mistral Large, and Llama 3.1 8B. The results were revealing: while LLMs can excel at individual reasoning tasks, their performance drops significantly and becomes inconsistent when they must manage multiple cognitive dimensions concurrently. This exposes a critical weakness in current AI agents, which are being positioned for autonomous planning and reasoning in complex environments. ItinBench provides a new, more holistic testbed for developers aiming to build AI that can handle the messy, integrated challenges of the real world.
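To make the dual-dimension idea concrete, here is a minimal sketch of how an itinerary could be scored on both axes at once: a spatial score comparing the model's visit order against the optimal route, and a verbal score counting how many stated preferences the plan satisfies. The attraction names, coordinates, constraints, and scoring functions below are illustrative assumptions, not ItinBench's actual evaluation code or data format.

```python
from itertools import permutations
from math import dist

# Hypothetical city coordinates; ItinBench's real data format is not
# described in this summary, so these are illustrative placeholders.
attractions = {
    "museum": (0.0, 0.0),
    "park": (3.0, 4.0),
    "harbor": (6.0, 0.0),
    "old_town": (3.0, 1.0),
}

def route_length(order):
    """Total Euclidean distance of visiting attractions in the given order."""
    points = [attractions[name] for name in order]
    return sum(dist(a, b) for a, b in zip(points, points[1:]))

def spatial_score(order):
    """Ratio of the optimal route length to this route's length (1.0 = optimal).
    Brute-force search over permutations; fine for toy-sized inputs only."""
    best = min(route_length(p) for p in permutations(attractions))
    return best / route_length(order)

def verbal_score(order, constraints):
    """Fraction of stated itinerary preferences the plan satisfies."""
    met = sum(1 for c in constraints if c(order))
    return met / len(constraints)

# Example preferences: visit the park before the harbor, start at the museum.
constraints = [
    lambda o: o.index("park") < o.index("harbor"),
    lambda o: o[0] == "museum",
]

# A model-generated plan that starts correctly but routes inefficiently
# and visits the harbor before the park.
plan = ("museum", "harbor", "park", "old_town")
print(round(spatial_score(plan), 3), verbal_score(plan, constraints))
```

A joint benchmark score would then have to aggregate both numbers, which is exactly where the paper's finding bites: a model can do well on either score in isolation while the combined task degrades both.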
- ItinBench integrates spatial reasoning (route optimization) with verbal reasoning in a single travel planning task.
- Tested models like GPT-4 and Gemini 1.5 Pro showed inconsistent performance when handling multiple cognitive domains at once.
- The benchmark provides a more comprehensive evaluation framework for developing AI agents capable of real-world planning.
Why It Matters
ItinBench exposes a fundamental limitation of current AI agents, a finding that matters for anyone building reliable autonomous systems for complex real-world planning tasks.