DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs
New 12,000-question benchmark exposes why AI can see molecules but can't reason about them.
A research team from IIT Delhi and IBM has published DISSECT, a groundbreaking diagnostic benchmark designed to systematically expose a fundamental weakness in scientific Vision-Language Models (VLMs). The core problem, termed the 'perception-integration gap,' occurs when a model like GPT-4V can correctly identify visual elements in a molecular diagram but then fails to reason about them, effectively seeing without thinking. DISSECT tackles this by evaluating models across 12,000 questions in Chemistry and Biology using five distinct input modes—including a novel 'Model Oracle' where the AI first describes an image in text and then reasons from its own description—to isolate where failures occur.
Evaluating 18 leading VLMs, the study produced critical findings. First, Chemistry questions proved far less susceptible to being answered from language priors alone, making them a harder test of genuine visual reasoning than Biology. Second, open-source models consistently performed better when reasoning from their own textual descriptions of images than from the raw images themselves, revealing a systematic bottleneck in integrating visual data. In stark contrast, closed-source models like GPT-4V showed no such gap, suggesting that effectively bridging perception and reasoning is the current frontier separating proprietary and open-source multimodal AI capabilities. The Model Oracle protocol is a model-agnostic tool that can be applied post-hoc to any VLM evaluation, providing a new standard for diagnosing the true source of AI reasoning failures.
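The two-stage Model Oracle protocol described above can be sketched as a simple control flow: the model first produces a textual description of the image, then answers the question from that description alone, with the image withheld at the second stage. This is a minimal illustration, not the authors' implementation; the function names (`model_oracle`, `stub_model`) and the prompt wording are assumptions, and the stub stands in for a real VLM API call.

```python
from typing import Callable, Optional

def model_oracle(
    ask: Callable[[str, Optional[bytes]], str],
    image: bytes,
    question: str,
) -> str:
    """Hypothetical sketch of the Model Oracle protocol: stage 1 elicits
    a textual description of the image (perception), stage 2 answers the
    question from that description with the image withheld (reasoning)."""
    # Stage 1: perception only -- the model describes the image in text.
    description = ask("Describe every element of this diagram in detail.", image)
    # Stage 2: reasoning only -- the model answers from its own description.
    prompt = f"Description: {description}\nQuestion: {question}\nAnswer:"
    return ask(prompt, None)

# A stub standing in for a real VLM, just to show the control flow.
def stub_model(prompt: str, image: Optional[bytes]) -> str:
    if image is not None:
        return "A benzene ring with one hydroxyl substituent."
    return "phenol"

answer = model_oracle(stub_model, b"<image bytes>", "Name this molecule.")
print(answer)
```

Because the second stage never sees the image, a model that answers correctly here but fails on the raw image localizes the failure to visual integration rather than reasoning, which is how the protocol isolates the perception-integration gap.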
- DISSECT is a 12,000-question benchmark (7K Chemistry, 5K Biology) that diagnoses the 'perception-integration gap' in VLMs.
- Open-source models score higher when reasoning from their own image descriptions than from raw images, exposing an integration bottleneck.
- Closed-source models like GPT-4V show no such gap, highlighting a key capability divide in current multimodal AI.
Why It Matters
DISSECT gives developers a crucial diagnostic tool for building better scientific AI and reveals a major weakness, visual integration, in current open-source multimodal models.