VT-Bench: First unified benchmark for visual-tabular AI tasks
Over 756K samples across 14 datasets in 9 domains show that current models struggle to fuse images with tables.
A team led by Zi-Yi Jia has released VT-Bench, the first comprehensive benchmark designed specifically for visual-tabular multi-modal learning. While text-and-vision models have advanced rapidly, combining images with structured tabular data—critical in fields like healthcare diagnostics and industrial inspection—has lagged behind. VT-Bench fills this gap by curating 14 distinct datasets spanning 9 domains, including medical imaging paired with patient records, pet species with breed tables, and transportation scenarios with metadata tables. The benchmark includes over 756,000 samples, providing a robust testbed for both discriminative prediction (e.g., diagnosis from X-ray + vitals) and generative reasoning (e.g., captioning with embedded table data).
To establish baselines, the authors evaluated 23 representative models, ranging from unimodal vision experts and specialized visual-tabular architectures to general-purpose vision-language models (VLMs) and tool-augmented systems. Results reveal that even state-of-the-art VLMs struggle to effectively fuse visual and tabular information, often underperforming simpler unimodal baselines. This underscores a critical bottleneck: models lack mechanisms to reason across structured numeric data and unstructured images. VT-Bench is publicly available, and the team hopes it will catalyze development of more powerful multi-modal foundation models capable of handling the real-world complexity of vision and tables together.
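To make "fusing visual and tabular information" concrete, here is a minimal sketch of the simplest possible fusion baseline: standardize the numeric table columns and concatenate them with an image embedding. It is an illustration only, not the VT-Bench data format or any method the authors evaluated. The names `VisualTabularSample`, `standardize_column`, and `fuse_by_concatenation`, and the toy values, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class VisualTabularSample:
    """One hypothetical sample: an image embedding plus a row of numeric table features."""
    image_embedding: List[float]   # assumed to come from some frozen vision encoder
    table_row: Dict[str, float]    # numeric columns, e.g. patient vitals
    label: int


def standardize_column(values: Sequence[float]) -> List[float]:
    """Z-score a column; a constant column maps to all zeros."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def fuse_by_concatenation(samples: Sequence[VisualTabularSample]) -> List[List[float]]:
    """Concatenate each image embedding with its standardized tabular features."""
    columns = sorted(samples[0].table_row)  # fixed column order across samples
    scaled = {c: standardize_column([s.table_row[c] for s in samples]) for c in columns}
    return [
        list(s.image_embedding) + [scaled[c][i] for c in columns]
        for i, s in enumerate(samples)
    ]


# Toy data: two samples with a 2-dimensional image embedding and two table columns.
samples = [
    VisualTabularSample([0.1, 0.9], {"age": 40.0, "heart_rate": 60.0}, label=0),
    VisualTabularSample([0.8, 0.2], {"age": 60.0, "heart_rate": 80.0}, label=1),
]
fused = fuse_by_concatenation(samples)
```

Each fused vector could then be passed to any standard classifier. Concatenation of this kind treats the table as just a few more feature dimensions, which is one plausible reason naive fusion may not beat a strong unimodal model.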
- Combines 14 datasets from 9 domains (including medical, pet, media, and transportation) into one benchmark
- Over 756K samples for both discriminative and generative tasks
- 23 models evaluated: even state-of-the-art VLMs often underperform simpler unimodal baselines
Why It Matters
Standardizes evaluation for visual-tabular AI, giving fields such as healthcare diagnostics and industrial inspection a common way to compare models.