Stanford's NeuroQA benchmark tests AI on 3D brain MRI with 56K questions
AI models score below a text-only baseline on new 3D brain MRI benchmark
Stanford University researchers have unveiled NeuroQA, a benchmark for visual question answering (VQA) on 3D brain magnetic resonance imaging (MRI). The dataset comprises 56,953 question-answer pairs from 12,977 subjects across 12 datasets, spanning ages 5 to 104 and five clinical domains: Alzheimer's disease, Parkinson's disease, brain tumors, white matter disease, and neurodevelopment. Unlike prior medical VQA efforts limited to 2D slices or narrow diagnostic labels, every item in NeuroQA is paired with a full 3D volume. The benchmark evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats, using 203 templates: 131 image-grounded (answerable from a 3-plane viewer) and 72 image-informed (based on quantitative volumetry or clinical instruments). To prevent models from exploiting text-only shortcuts, the team applied answer-distribution refinement, which reduced closed-format text-only accuracy from over 80% to 44.6%. A 38-rule deterministic pipeline and two rounds of expert review verify each QA pair against FreeSurfer measurements, metadata, or radiology report fields, ensuring zero same-subject contradictions.
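As a rough illustration of what a "zero same-subject contradictions" check could look like, the sketch below flags any subject whose identical question carries two different answers. It is a minimal sketch under stated assumptions, not the NeuroQA pipeline: the field names (`subject_id`, `question`, `answer`), the sample records, and the exact-match normalization are all hypothetical, and a real pipeline would likely also catch contradictions between differently worded questions.

```python
from collections import defaultdict


def find_contradictions(qa_pairs):
    """Return (subject_id, question) keys that carry more than one distinct answer.

    Hypothetical record layout: each pair is a dict with the keys
    "subject_id", "question", and "answer". Questions and answers are
    compared after trimming whitespace and lowercasing.
    """
    answers = defaultdict(set)
    for pair in qa_pairs:
        key = (pair["subject_id"], pair["question"].strip().lower())
        answers[key].add(pair["answer"].strip().lower())
    # A key is contradictory when the same subject/question has 2+ answers.
    return sorted(key for key, seen in answers.items() if len(seen) > 1)


# Made-up records for illustration only (not taken from the benchmark).
QUESTION = "Is the left hippocampus smaller than the right?"
example_pairs = [
    {"subject_id": "sub-001", "question": QUESTION, "answer": "Yes"},
    {"subject_id": "sub-001", "question": QUESTION, "answer": "yes"},
    {"subject_id": "sub-002", "question": QUESTION, "answer": "Yes"},
    {"subject_id": "sub-002", "question": QUESTION, "answer": "No"},
]

contradictions = find_contradictions(example_pairs)
print(contradictions)
```

In this toy data only `sub-002` is flagged, because the two `sub-001` answers differ only in capitalization.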
In a clinician evaluation, two experts independently assessed 100 frozen test items using a three-plane viewer, establishing a human baseline. On closed-format (Yes/No and multiple-choice) test-public items, the best-performing zero-shot vision-language model reached only 47.5% accuracy, and a supervised 3D CNN baseline achieved 43.7%. Both fall below the 49.4% text-only majority-template floor, meaning neither model beat a baseline that never looks at the image. The result underscores how difficult genuine 3D medical image understanding remains. NeuroQA adopts a two-tier release: public QA pairs for open-access datasets and reproducible generation scripts for restricted datasets, plus subject-level splits, a held-out private test set, and an online leaderboard. The benchmark gives researchers a concrete measure of progress in medical imaging AI and shows that visual reasoning over 3D volumes remains an unsolved challenge.
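To make the "text-only majority-template floor" concrete, here is one plausible way such a baseline could be computed: for each question template, always predict the answer seen most often for that template, without ever looking at the image. This is a sketch under assumptions rather than the authors' evaluation code. The field names (`template_id`, `answer`), the toy data, and the choice to fit the majority answers on a training split are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_template_accuracy(train_items, test_items):
    """Accuracy of always predicting each template's most common training answer.

    Hypothetical record layout: each item is a dict with the keys
    "template_id" and "answer". Templates unseen in training count as wrong.
    """
    if not test_items:
        raise ValueError("test_items must not be empty")

    # Count how often each answer occurs per template in the training split.
    counts = defaultdict(Counter)
    for item in train_items:
        counts[item["template_id"]][item["answer"]] += 1

    # The "model" is just a lookup table: template -> most frequent answer.
    majority = {tid: c.most_common(1)[0][0] for tid, c in counts.items()}

    correct = sum(
        1 for item in test_items
        if majority.get(item["template_id"]) == item["answer"]
    )
    return correct / len(test_items)


# Toy data for illustration only (not taken from the benchmark).
train = [
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "no"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "A"},
]
test = [
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "no"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "A"},
]

floor = majority_template_accuracy(train, test)
print(f"majority-template floor: {floor:.1%}")
```

A model that scores below this kind of floor is not extracting more usable signal from the image than the question wording alone provides, which is why the article treats 49.4% as the bar to beat.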
- 56,953 QA pairs from 12,977 subjects across 12 datasets spanning ages 5-104 and five clinical domains (Alzheimer's, Parkinson's, tumors, white matter disease, neurodevelopment).
- Answer-distribution refinement reduced text-only shortcut accuracy from >80% to 44.6%; 131 of 203 templates are image-grounded (3-plane viewer), 72 are image-informed.
- Best zero-shot vision-language model scored 47.5% on closed-format test items, below the 49.4% text-only majority-template floor; supervised 3D CNN baseline achieved 43.7%.
Why It Matters
NeuroQA requires models to reason over full 3D medical images rather than text shortcuts, exposing current limitations and guiding future model development in medical VQA.