Research & Papers

GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

New 90K-question benchmark reveals even GPT-5-nano trails humans by nearly 19 points on geometric reasoning.

Deep Dive

A research team led by Yushun Zhang has released GeoChallenge, a large new benchmark designed to rigorously test the geometric reasoning capabilities of large language models (LLMs). The dataset contains 90,000 automatically generated multiple-choice problems that require multi-step logical proofs, each pairing an aligned textual description with a diagram. Unlike previous benchmarks, GeoChallenge provides fine-grained complexity ratings and formal language annotations, enabling controlled evaluation of how models handle symbolic reasoning grounded in visual information.

Experiments on advanced LLMs, including OpenAI's GPT-5-nano, revealed a substantial performance gap between AI and human expertise. The top-performing model achieved only a 75.89% exact match score, compared to 94.74% for humans. The analysis identified three common failure modes: difficulty producing exact matches in the multi-answer multiple-choice format, weak reliance on the provided diagrams, and overextended reasoning that fails to converge on a correct answer. The benchmark thus offers a scalable tool for diagnosing specific weaknesses in AI reasoning.
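The headline metric here is exact match over answer sets: in a multi-answer multiple-choice format, a response counts as correct only if the selected options equal the gold set exactly, with no partial credit. A minimal sketch of that scoring rule (the function and variable names are illustrative, not the benchmark's actual evaluation code):

```python
def exact_match_score(predictions, gold):
    """Fraction of questions where the predicted option set
    exactly equals the gold option set (no partial credit)."""
    assert len(predictions) == len(gold), "mismatched lengths"
    hits = sum(set(p) == set(g) for p, g in zip(predictions, gold))
    return hits / len(gold)

# Illustrative run: 3 of 4 predicted answer sets match exactly
# (order of options does not matter, but missing/extra options do).
preds = [["A", "C"], ["B"], ["A", "D"], ["C"]]
golds = [["C", "A"], ["B"], ["A"], ["C"]]
print(exact_match_score(preds, golds))  # → 0.75
```

This strictness is why exact match punishes the first failure mode so hard: selecting one correct option out of two, or one extra option, scores zero for that question.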

Key Points
  • Dataset contains 90,000 automatically generated geometry proof problems combining text and diagrams.
  • Best AI model (GPT-5-nano) scored 75.89%, nearly 19 points behind human performance at 94.74%.
  • Identified three key LLM failure patterns: exact match issues, weak visual reliance, and non-convergent reasoning.

Why It Matters

Provides a scalable, diagnostic tool to measure and improve AI's multi-modal reasoning, a critical step toward more reliable and trustworthy systems.