Research & Papers

Brain-IT-VQA decodes fMRI to answer visual questions, backed by a benchmark with about 20 questions per image

New AI reads brain signals to answer questions about images you've seen

Deep Dive

A team led by Roman Beliy at the Weizmann Institute of Science has developed Brain-IT-VQA, a framework that translates fMRI brain signals into answers about visual content. Building on the Brain Interaction Transformer (Brain-IT), the system decodes language tokens directly from neural activity and feeds them into a language model, which then responds to questions about images a person was shown. The system significantly outperforms prior fMRI-based captioning and visual question answering (VQA) methods at decoding visual-semantic information from brain scans.

To support more rigorous evaluation, the researchers created NSD-VQA, a new benchmark dataset derived from the Natural Scenes Dataset. Unlike existing fMRI-VQA datasets that offer only a few broad questions per image, NSD-VQA provides an average of 20 question-answer pairs per image across 20 controlled categories, ranging from object presence to scene layout and color. This controlled design allows for interpretable, fine-grained analysis of what visual information can be reliably decoded from different brain regions, including early visual cortex and higher-level associative areas.
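A minimal sketch of how such a benchmark could be organized in code, assuming one flat record per question-answer pair. The field names, category labels, and toy records below are hypothetical and invented for illustration; they are not taken from the NSD-VQA release.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    image_id: str   # hypothetical identifier for the viewed image
    category: str   # hypothetical category label, e.g. "object_presence"
    question: str
    answer: str


def pairs_per_image(pairs):
    """Return the mean number of question-answer pairs per image."""
    counts = defaultdict(int)
    for pair in pairs:
        counts[pair.image_id] += 1
    return sum(counts.values()) / len(counts)


def group_by_category(pairs):
    """Map each category label to the pairs that belong to it."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair.category].append(pair)
    return dict(grouped)


# Toy records, invented for illustration only.
toy_pairs = [
    QAPair("img_001", "object_presence", "Is there a dog?", "yes"),
    QAPair("img_001", "color", "What color is the car?", "red"),
    QAPair("img_002", "object_presence", "Is there a person?", "no"),
    QAPair("img_002", "scene_layout", "Is the scene indoors?", "yes"),
]

mean_pairs = pairs_per_image(toy_pairs)
by_category = group_by_category(toy_pairs)
```

Keeping the category on every record is what makes the later per-category breakdown straightforward: any evaluation can be sliced by question type without changing the data layout.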

Beyond its predictive performance, Brain-IT-VQA serves as a scientific tool to probe the structure of visual representations in the brain. By testing which question types the model answers accurately, researchers can map how information is distributed across neural substrates. This dual-purpose approach advances both AI-driven brain decoding and neuroscience, with potential applications in brain-computer interfaces and assistive communication for individuals with locked-in syndrome.
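The kind of analysis described here can be sketched as a per-category accuracy table, computed separately for each brain region used as decoder input. This is an illustrative sketch, not the authors' evaluation code: the region names, category labels, and correctness values are all made up and do not reflect any reported result.

```python
from collections import defaultdict


def accuracy_by_group(results):
    """Compute accuracy for each (region, category) pair.

    `results` is an iterable of (region, category, is_correct) tuples,
    one per answered question.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, category, is_correct in results:
        key = (region, category)
        total[key] += 1
        correct[key] += int(is_correct)
    return {key: correct[key] / total[key] for key in total}


# Invented outcomes for illustration; not data from the paper.
toy_results = [
    ("early_visual", "color", True),
    ("early_visual", "color", True),
    ("early_visual", "object_presence", False),
    ("early_visual", "object_presence", True),
    ("higher_level", "color", False),
    ("higher_level", "color", True),
    ("higher_level", "object_presence", True),
    ("higher_level", "object_presence", True),
]

scores = accuracy_by_group(toy_results)
for (region, category), acc in sorted(scores.items()):
    print(f"{region:>12}  {category:<16} {acc:.2f}")
```

Comparing rows across regions for the same category is what would let a researcher ask which question types are answerable from which part of the visual system.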

Key Points
  • Brain-IT-VQA decodes language tokens from fMRI signals and feeds them into a language model to produce text answers.
  • The NSD-VQA dataset offers ~20 Q&A pairs per image across 20 controlled categories, where earlier fMRI-VQA datasets offer only a few broad questions per image.
  • Brain-IT-VQA outperforms prior fMRI captioning and VQA methods, and the controlled question categories enable fine-grained analysis of which visual information can be decoded from which brain regions.

Why It Matters

Decoding answers about viewed images from fMRI could inform brain-computer interfaces and assistive communication for people with locked-in syndrome.