Stanford's NeuroQA benchmark tests AI on 3D brain MRI with 56K questions

AI models score below a text-only baseline on new 3D brain MRI benchmark

Deep Dive

Stanford University researchers have unveiled NeuroQA, a benchmark for visual question answering (VQA) on 3D brain magnetic resonance imaging (MRI). The dataset comprises 56,953 question-answer pairs from 12,977 subjects across 12 datasets, spanning ages 5 to 104 and five clinical domains: Alzheimer's disease, Parkinson's disease, brain tumors, white matter disease, and neurodevelopment. Unlike prior medical VQA efforts limited to 2D slices or narrow diagnostic labels, every item in NeuroQA is paired with a full 3D volume. The benchmark evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats, using 203 templates: 131 image-grounded (answerable from a 3-plane viewer) and 72 image-informed (based on quantitative volumetry or clinical instruments). To prevent models from exploiting text-only shortcuts, the team applied answer-distribution refinement, which cut closed-format text-only accuracy from over 80% to 44.6%. A 38-rule deterministic pipeline and two rounds of expert review verify each QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions reported.
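The article does not describe the 38 verification rules, so the following is only a minimal sketch of what one deterministic consistency rule could look like. It assumes a same-subject contradiction means the same question receiving different answers for one subject. The field names (`subject_id`, `question`, `answer`) and the function name are hypothetical, not taken from NeuroQA.

```python
from collections import defaultdict


def find_same_subject_contradictions(qa_pairs):
    """Return sorted (subject_id, question) keys that have conflicting answers.

    Hypothetical rule: within one subject, the same question must always
    map to the same answer (compared case-insensitively).
    """
    answers = defaultdict(set)
    for item in qa_pairs:
        key = (item["subject_id"], item["question"])
        answers[key].add(item["answer"].strip().lower())
    return sorted(key for key, seen in answers.items() if len(seen) > 1)


# Illustrative records only; these are not real NeuroQA items.
qa_pairs = [
    {"subject_id": "sub-001", "question": "Is the left hippocampus atrophied?", "answer": "Yes"},
    {"subject_id": "sub-001", "question": "Is the left hippocampus atrophied?", "answer": "no"},
    {"subject_id": "sub-002", "question": "Is the left hippocampus atrophied?", "answer": "No"},
]

conflicts = find_same_subject_contradictions(qa_pairs)
print(conflicts)
```

A real pipeline would presumably also compare answers across related questions and against the underlying measurements, which this sketch does not attempt.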

In a clinician evaluation, two experts independently assessed 100 frozen test items using a three-plane viewer, establishing a human baseline. On closed-format (Yes/No and multiple-choice) test-public items, the best-performing zero-shot vision-language model reached only 47.5% accuracy, while a supervised 3D CNN baseline achieved 43.7%. Both fall below the 49.4% text-only majority-template floor, meaning neither model beats a baseline that never looks at the image. NeuroQA adopts a two-tier release: public QA pairs for open-access datasets and reproducible generation scripts for restricted datasets. It also provides subject-level splits, a held-out private test set, and an online leaderboard. The results suggest that visual reasoning over 3D volumes remains an unsolved challenge for current models.
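To make the comparison against the floor concrete, here is a minimal sketch of a majority-template baseline. It assumes the floor is obtained by always predicting each template's most frequent training answer; the article does not spell out the exact procedure, and the field names (`template_id`, `answer`) and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_template_floor(train_items, test_items):
    """Accuracy of predicting each template's most common training answer.

    The baseline uses only the template identity, never the image.
    Templates unseen in training count as incorrect.
    """
    if not test_items:
        raise ValueError("test_items must not be empty")

    counts = defaultdict(Counter)
    for item in train_items:
        counts[item["template_id"]][item["answer"]] += 1

    # Most frequent answer per template.
    majority = {tid: c.most_common(1)[0][0] for tid, c in counts.items()}

    correct = sum(
        1 for item in test_items
        if majority.get(item["template_id"]) == item["answer"]
    )
    return correct / len(test_items)


# Toy data for illustration only; not drawn from NeuroQA.
train = [
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "no"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "A"},
]
test = [
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "no"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "A"},
]

floor = majority_template_floor(train, test)
print(f"majority-template floor: {floor:.1%}")
```

Under this reading, a model adds value from the image only if its closed-format accuracy exceeds the floor, which neither reported model does (47.5% and 43.7% versus 49.4%).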

Key Points
  • 56,953 QA pairs from 12,977 subjects across 12 datasets spanning ages 5-104 and five clinical domains (Alzheimer's, Parkinson's, tumors, white matter disease, neurodevelopment).
  • Answer-distribution refinement reduced text-only shortcut accuracy from >80% to 44.6%; 131 of 203 templates are image-grounded (3-plane viewer), 72 are image-informed.
  • Best zero-shot vision-language model scored 47.5% on closed-format test items, below the 49.4% text-only majority-template floor; supervised 3D CNN baseline achieved 43.7%.

Why It Matters

NeuroQA forces models to reason over full 3D medical images rather than text shortcuts, exposing current limitations and guiding future model development in medical VQA.