CurveBench tests AI's topological reasoning, models hit 71% at best

arXiv cs.CV May 15, 2026

⚡ Even top models fail on simple nested-shape puzzles, and a fine-tuned 8B model beats GPT-5.4 on the easy split.

Deep Dive

CurveBench is a new benchmark designed to measure AI's ability to perform exact topological reasoning from visual input. It consists of 756 images of pairwise non-intersecting Jordan curves across five difficulty levels: easy, polygonal, topographic-inspired, maze-like, and dense counting. Each image is annotated with a rooted tree that encodes the containment relations between planar regions, and the task is to recover that tree from the image alone.
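To make the target structure concrete, here is a minimal sketch of how a containment tree could be derived when the curves are available as geometry rather than pixels. It is not the benchmark's annotation pipeline: the polygon representation, the function names, and the choice of node `-1` for the unbounded outer region are all assumptions made for illustration. Because the curves do not intersect, one curve lies inside another exactly when any one of its vertices does, so a single point-in-polygon test per pair is enough.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # hypothetical stand-in for a Jordan curve: a simple closed polygon


def point_in_polygon(p: Point, poly: Polygon) -> bool:
    """Even-odd ray-casting test; assumes p does not lie on the boundary."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def area(poly: Polygon) -> float:
    """Absolute polygon area via the shoelace formula."""
    n = len(poly)
    s = 0.0
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def containment_tree(curves: List[Polygon]) -> Dict[int, List[int]]:
    """Map each node to its children; node -1 is the unbounded outer region (assumed root)."""
    areas = [area(c) for c in curves]
    children: Dict[int, List[int]] = {-1: []}
    for i in range(len(curves)):
        children[i] = []
    for i, curve in enumerate(curves):
        # Curves are pairwise non-intersecting, so testing one vertex suffices.
        enclosing = [
            j for j, other in enumerate(curves)
            if j != i and point_in_polygon(curve[0], other)
        ]
        # The parent is the tightest (smallest-area) enclosing curve.
        parent = min(enclosing, key=lambda j: areas[j], default=-1)
        children[parent].append(i)
    return children


# Illustrative input: an outer square holding two squares (one of which holds
# a third), plus a separate square outside everything.
curves = [
    [(0, 0), (10, 0), (10, 10), (0, 10)],      # 0: outer
    [(1, 1), (4, 1), (4, 4), (1, 4)],          # 1: inside 0
    [(2, 2), (3, 2), (3, 3), (2, 3)],          # 2: inside 1
    [(6, 6), (8, 6), (8, 8), (6, 8)],          # 3: inside 0
    [(20, 20), (22, 20), (22, 22), (20, 22)],  # 4: outside everything
]
print(containment_tree(curves))
# {-1: [0, 4], 0: [1, 3], 1: [2], 2: [], 3: [], 4: []}
```

The benchmark asks models to produce this kind of tree from the image alone, without access to the underlying coordinates, which is what makes the task hard.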

The strongest evaluated model, Gemini 3.1 Pro, achieved only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard, showing how far even frontier models are from robust spatial reasoning. Targeted training helps: fine-tuning the open-weight Qwen3-VL-8B model with RLVR-style training raised its accuracy on CurveBench-Easy from 2.8% to 33.3%, surpassing GPT-5.4 and Claude Opus 4.5 on that split. The drop from easy to hard tasks remains stark (71.1% to 19.1% for the best model), which suggests that exact topology-aware visual reasoning is far from solved.
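The summary does not say how tree-generation accuracy is scored. One plausible reading, sketched below as an assumption rather than the benchmark's actual metric, is exact match up to relabeling: a prediction counts as correct only if it has the same shape as the annotated tree when both are treated as unordered rooted trees. The function names and the `root=-1` convention are hypothetical.

```python
from typing import Dict, List

Tree = Dict[int, List[int]]  # node -> list of children


def canonical(tree: Tree, node: int) -> str:
    """Canonical parenthesis string of the subtree at `node`, ignoring labels and child order."""
    parts = sorted(canonical(tree, child) for child in tree.get(node, []))
    return "(" + "".join(parts) + ")"


def same_tree(a: Tree, b: Tree, root: int = -1) -> bool:
    """True if a and b are isomorphic as unordered rooted trees."""
    return canonical(a, root) == canonical(b, root)


def exact_match_accuracy(preds: List[Tree], golds: List[Tree], root: int = -1) -> float:
    """Fraction of predictions whose tree shape exactly matches the reference."""
    if not golds:
        return 0.0
    hits = sum(same_tree(p, g, root) for p, g in zip(preds, golds))
    return hits / len(golds)


# Illustrative data: the first prediction matches the reference up to
# relabeling; the second flattens the nesting and is counted as wrong.
gold = {-1: [0, 4], 0: [1, 3], 1: [2]}
pred_ok = {-1: [7, 8], 8: [5, 6], 6: [9]}
pred_bad = {-1: [0], 0: [1, 2, 3, 4]}
print(exact_match_accuracy([pred_ok, pred_bad], [gold, gold]))  # 0.5
```

Under an all-or-nothing criterion like this, a single misplaced curve makes the whole prediction wrong, which would be consistent with accuracy falling sharply as images get denser.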

Key Points

CurveBench uses 756 images of nested Jordan curves across 5 difficulty levels to test hierarchical topological reasoning.
The best model, Gemini 3.1 Pro, scores 71.1% on easy tasks and only 19.1% on hard tasks.
Fine-tuned Qwen3-VL-8B jumps from 2.8% to 33.3% on easy tasks, beating GPT-5.4 and Claude Opus 4.5.

Why It Matters

Spatial reasoning remains a critical weakness for AI, and this benchmark could drive improvements in visual understanding.
