ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
New benchmark shows even SOTA models like GPT-4V and Claude 3.5 fail at basic physical and causal reasoning.
A research team from Tsinghua University and Tencent has published ViGoR-Bench (Vision-Generative Reasoning-centric Benchmark), a new evaluation framework designed to expose fundamental reasoning deficits in today's most advanced visual AI models. The paper argues that beneath the impressive image and video generation capabilities of systems like DALL-E 3, Stable Diffusion, and Midjourney lies a 'logical desert': a critical gap in physical, causal, and spatial reasoning that current benchmarks fail to measure. ViGoR-Bench addresses this with four key innovations: holistic cross-modal coverage spanning image and video tasks, a dual-track mechanism that evaluates both intermediate reasoning processes and final outputs, an evidence-grounded automated judge with high agreement with human raters, and granular diagnostic analysis across fine-grained cognitive dimensions.
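The dual-track idea, scoring the reasoning process alongside the final output rather than the output alone, can be illustrated with a minimal sketch. The function name, the [0, 1] score ranges, and the weighting scheme below are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical illustration of dual-track scoring: combine a score for the
# model's intermediate reasoning with a score for its final generated output.
# The equal default weighting (alpha = 0.5) is an assumption, not from the paper.

def dual_track_score(process_score: float, outcome_score: float,
                     alpha: float = 0.5) -> float:
    """Weighted combination of a reasoning-process score and a
    final-output score, both assumed to lie in [0, 1]."""
    if not (0.0 <= process_score <= 1.0 and 0.0 <= outcome_score <= 1.0):
        raise ValueError("scores must be in [0, 1]")
    return alpha * process_score + (1.0 - alpha) * outcome_score

# Example: a visually convincing output built on weak intermediate reasoning
# is pulled down by the process track (0.3 and 0.9 average to 0.6 here).
print(dual_track_score(0.3, 0.9))
```

Under a single-track, output-only metric this model would score 0.9; the process track is what surfaces the gap the benchmark calls a 'performance mirage'.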
In tests of more than 20 state-of-the-art models, ViGoR-Bench revealed that even the most capable systems, including multimodal giants like GPT-4V, Claude 3.5, and Gemini, struggle with tasks that require understanding object permanence, cause-and-effect relationships, or complex spatial arrangements. The benchmark argues that current evaluations create a 'performance mirage' by rewarding superficial metrics rather than the generative reasoning process itself. By providing this comprehensive stress test, ViGoR-Bench aims to guide the development of a next generation of vision models that move beyond visual fidelity to genuine visual understanding.
- Exposes a 'logical desert' where SOTA models fail at physical/causal reasoning despite high visual quality
- Tests over 20 leading models including GPT-4V, Claude 3.5, and DALL-E 3 with four evaluation innovations
- Provides granular diagnostics across cognitive dimensions to guide development of next-gen vision AI
Why It Matters
This benchmark will force AI developers to build models with genuine visual understanding, not just impressive outputs.