VT-Bench: First unified benchmark for visual-tabular AI tasks

arXiv cs.CV May 12, 2026

⚡ Over 756K samples across 14 datasets in 9 domains reveal major gaps in how models fuse images and tables.

Deep Dive

A team led by Zi-Yi Jia has released VT-Bench, the first comprehensive benchmark designed specifically for visual-tabular multi-modal learning. Text-and-vision models have advanced rapidly, but combining images with structured tabular data, which is critical in fields like healthcare diagnostics and industrial inspection, has lagged behind. VT-Bench fills this gap by curating 14 datasets spanning 9 domains, including medical imaging paired with patient records, pet species paired with breed tables, and transportation scenarios paired with metadata tables. With over 756,000 samples, the benchmark covers both discriminative prediction (e.g., diagnosis from an X-ray plus vitals) and generative reasoning (e.g., captioning that incorporates table data).

To establish baselines, the authors evaluated 23 representative models, ranging from unimodal vision experts and specialized visual-tabular architectures to general-purpose vision-language models (VLMs) and tool-augmented systems. The results show that even state-of-the-art VLMs struggle to fuse visual and tabular information effectively and often underperform simpler unimodal baselines. This points to a critical bottleneck: current models lack mechanisms for reasoning jointly over structured numeric data and unstructured images. VT-Bench is publicly available, and the team hopes it will catalyze the development of multi-modal foundation models that can handle images and tables together in real-world settings.
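To make the "underperforming unimodal baselines" comparison concrete, here is a minimal sketch of how such a gap could be measured on a classification task. It is not VT-Bench's evaluation code: the function names (`accuracy`, `fusion_gap`), the model keys, and the toy predictions are all hypothetical and invented purely for illustration.

```python
from typing import Dict, List, Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def fusion_gap(
    results: Dict[str, List[int]],
    labels: List[int],
    fused_key: str = "fused",
) -> float:
    """Fused accuracy minus the best unimodal accuracy.

    A negative value means the fused model does worse than at least
    one model that sees only a single modality.
    """
    scores = {name: accuracy(preds, labels) for name, preds in results.items()}
    best_unimodal = max(
        score for name, score in scores.items() if name != fused_key
    )
    return scores[fused_key] - best_unimodal


# Toy, made-up predictions (NOT results from VT-Bench).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
results = {
    "image_only": [1, 0, 1, 0, 0, 0, 1, 0],  # 7 of 8 correct
    "table_only": [1, 1, 1, 1, 0, 1, 1, 0],  # 6 of 8 correct
    "fused":      [1, 0, 0, 0, 0, 1, 1, 0],  # 5 of 8 correct
}

gap = fusion_gap(results, labels)
print(f"fusion gap: {gap:+.3f}")
```

In this toy setup the gap is negative, which is the pattern the article describes: adding the second modality hurts rather than helps. A real evaluation would use each dataset's own metric and held-out split rather than plain accuracy on eight examples.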

Key Points

Combines 14 datasets from 9 domains (including medical, pet, media, and transportation) into one benchmark
Over 756K samples for both discriminative and generative tasks
23 models evaluated; even state-of-the-art VLMs often underperform simpler unimodal baselines

Why It Matters

Standardizes evaluation for visual-tabular AI, which matters for accuracy-critical applications such as healthcare diagnostics and industrial inspection.
