GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams
New 90K-question benchmark shows the top-performing model, GPT-5-nano, trailing humans by nearly 20 points on geometric reasoning.
A research team led by Yushun Zhang has released GeoChallenge, a large new benchmark designed to rigorously test the geometric reasoning capabilities of large language models (LLMs). The dataset contains 90,000 automatically generated multi-answer multiple-choice problems that require multi-step logical proofs, each pairing an aligned textual description with a diagram. Unlike previous benchmarks, GeoChallenge provides fine-grained complexity ratings and formal-language annotations, enabling controlled evaluation of how models handle symbolic reasoning grounded in visual information.
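The released schema isn't reproduced here, but based on the features described above, a single record plausibly bundles the question text, the aligned diagram, the formal-language annotation, a complexity rating, and a multi-answer option set. A minimal sketch, assuming hypothetical field names (`question_text`, `diagram_path`, `formal_annotation`, and so on), not the benchmark's actual layout:

```python
from dataclasses import dataclass, field

@dataclass
class GeoChallengeItem:
    """Hypothetical record layout for one GeoChallenge problem.

    Field names are illustrative assumptions inferred from the benchmark's
    described features, not the released schema.
    """
    question_text: str          # natural-language problem statement
    diagram_path: str           # aligned geometric diagram (image file)
    formal_annotation: str      # formal-language encoding of the problem
    complexity: int             # fine-grained complexity rating
    options: list[str] = field(default_factory=list)  # choice labels, e.g. ["A", "B", "C", "D"]
    correct: frozenset[str] = frozenset()              # multi-answer: one or more correct options
```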
Experiments on advanced LLMs, including OpenAI's GPT-5-nano, revealed a substantial gap between AI and human performance. The top-performing model achieved only a 75.89% exact match score, compared to 94.74% for humans. The analysis identified three common failure modes: difficulty producing exact answer-set matches in the multi-answer format, weak reliance on the provided diagrams, and overextended reasoning that fails to converge on a correct answer. The benchmark thus offers a new, scalable tool for diagnosing specific weaknesses in AI reasoning.
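Because the format is multi-answer, exact match is a strict criterion: the model's selected option set must equal the gold set exactly, so a partially correct selection earns no credit. A minimal sketch of that scoring rule, with function names that are illustrative assumptions rather than the paper's evaluation code:

```python
def exact_match(predicted: set[str], gold: set[str]) -> bool:
    """Strict multi-answer scoring: credit only for the exact option set."""
    return predicted == gold

# A partially correct selection scores zero under exact match.
assert exact_match({"A", "C"}, {"A", "C"})           # fully correct -> credit
assert not exact_match({"A"}, {"A", "C"})            # missing an option -> no credit
assert not exact_match({"A", "B", "C"}, {"A", "C"})  # extra option -> no credit

def benchmark_score(predictions: list[set[str]], golds: list[set[str]]) -> float:
    """Percentage of questions answered with an exact option-set match."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)
```

This strictness helps explain the first failure mode: a model that identifies some but not all correct options on a question still receives no credit for it.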
- Dataset contains 90,000 automatically generated geometry proof problems combining text and diagrams.
- Best AI model (GPT-5-nano) scored 75.89%, nearly 20 points behind human performance at 94.74%.
- Identified three key LLM failure patterns: exact match issues, weak visual reliance, and non-convergent reasoning.
Why It Matters
Provides a scalable diagnostic tool to measure and improve AI's multi-modal reasoning, a critical step toward more reliable and trustworthy systems.