Research & Papers

CurveBench tests AI's topological reasoning: the best model reaches 71.1%

Even top models struggle with simple nested-curve puzzles, and a fine-tuned 8B model beats GPT-5.4 on the easy tasks.

Deep Dive

CurveBench is a new benchmark designed to measure AI's ability to perform exact topological reasoning from visual input. It consists of 756 images of pairwise non-intersecting Jordan curves across five difficulty levels: easy, polygonal, topographic-inspired, maze-like, and dense counting. Each image is annotated with a rooted tree that encodes the containment relations between planar regions, and the task is to recover that tree from the image alone.
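To make the target structure concrete, here is a minimal sketch of how a containment tree could be derived when the curves are available as geometry rather than pixels. It is an illustration, not the benchmark's annotation pipeline: the function names, the polygon representation of each curve, and the parent-index encoding (with -1 standing for the unbounded outer region as the root) are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # assumption: each Jordan curve is approximated by a simple polygon


def polygon_area(poly: Polygon) -> float:
    """Absolute area via the shoelace formula."""
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1])
    )
    return abs(twice_area) / 2.0


def point_in_polygon(p: Point, poly: Polygon) -> bool:
    """Even-odd ray casting: count edge crossings of a ray going right from p."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def containment_parents(curves: List[Polygon]) -> List[int]:
    """Return, for each curve, the index of the curve that directly encloses it.

    -1 means the curve lies directly in the unbounded region (the tree root).
    Because the curves never intersect, a single vertex of a curve is enough
    to decide whether the whole curve lies inside another one.
    """
    areas = [polygon_area(c) for c in curves]
    parents = []
    for i, curve in enumerate(curves):
        probe = curve[0]
        enclosing = [
            j for j, other in enumerate(curves)
            if j != i and point_in_polygon(probe, other)
        ]
        # The direct parent is the smallest curve that encloses this one.
        parents.append(min(enclosing, key=lambda j: areas[j]) if enclosing else -1)
    return parents


# Hypothetical example: one outer square, two squares inside it,
# and one more square nested inside the first inner square.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner_a = [(1, 1), (4, 1), (4, 4), (1, 4)]
inner_b = [(6, 6), (9, 6), (9, 9), (6, 9)]
nested = [(2, 2), (3, 2), (3, 3), (2, 3)]

parents = containment_parents([outer, inner_a, inner_b, nested])
print(parents)  # [-1, 0, 0, 1]
```

With explicit geometry the tree falls out of a few containment tests. The benchmark's difficulty comes from having to infer the same structure from an image alone.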

The strongest evaluated model, Gemini 3.1 Pro, achieved only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard, showing how far even frontier models are from robust spatial reasoning. Training helps, though: fine-tuning the open-weight Qwen3-VL-8B model with RLVR-style training (reinforcement learning with verifiable rewards) raised its accuracy on easy tasks from 2.8% to 33.3%, surpassing GPT-5.4 and Claude Opus 4.5 on those tasks. Scores on hard tasks remain low, which suggests that exact topology-aware visual reasoning is far from solved.
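The summary does not define how tree-generation accuracy is computed. One plausible reading, assumed here purely for illustration, is exact match: a prediction counts only if its tree has the same shape as the annotated tree, regardless of the order in which sibling regions are listed. The sketch below encodes a tree as nested lists (each node is the list of its children) and compares canonical strings. The names `canonical` and `tree_accuracy` are hypothetical and do not come from the benchmark.

```python
from typing import List

Tree = list  # assumption: a node is represented as the list of its child subtrees


def canonical(tree: Tree) -> str:
    """Canonical string for an unordered rooted tree.

    Children are encoded recursively and sorted, so two trees that differ
    only in sibling order produce the same string.
    """
    return "(" + "".join(sorted(canonical(child) for child in tree)) + ")"


def tree_accuracy(predictions: List[Tree], references: List[Tree]) -> float:
    """Fraction of predictions whose tree shape exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        canonical(pred) == canonical(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Hypothetical example with two images.
# Image 1: two top-level curves, one of which contains a third curve.
# Image 2: three curves nested one inside another.
references = [[[[]], []], [[[[]]]]]
predictions = [
    [[], [[]]],    # same shape as reference 1, siblings listed in another order
    [[[]], []],    # wrong shape for reference 2
]

accuracy = tree_accuracy(predictions, references)
print(accuracy)  # 0.5
```

Under exact matching a single misplaced curve makes the whole prediction wrong, which is one reason scores can drop sharply as the number of curves grows.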

Key Points
  • CurveBench uses 756 images of nested Jordan curves across 5 difficulty levels to test hierarchical topological reasoning.
  • Best model Gemini 3.1 Pro scores 71.1% on easy tasks and only 19.1% on hard tasks.
  • Fine-tuned Qwen3-VL-8B jumps from 2.8% to 33.3% on easy tasks, beating GPT-5.4 and Claude Opus 4.5.

Why It Matters

Spatial reasoning remains a critical weakness for current AI models, and a benchmark whose answers can be checked exactly against an annotated tree gives researchers a concrete target for improving visual understanding.