MindAlign reads brain waves with 54% accuracy in zero-shot visual decoding
EEG + AI = 83% top-5 accuracy decoding what you're seeing
Visual decoding from brain signals has long been a challenge at the intersection of computer vision and neuroscience. A new framework called MindAlign, developed by researchers from multiple institutions, tackles this with a tri-modal contrastive approach that aligns EEG, visual, and textual representations in a unified latent space. The two-stage design first pre-trains an EEG encoder via masked reconstruction on unlabeled trials to learn spatio-temporal regularities, then jointly aligns EEG, images, and LLM-generated text descriptions through contrastive learning. Text acts as a semantic regularizer, injecting linguistic structure without overwhelming the primary EEG-image signal. The encoder incorporates subject-specific adaptation, graph-attention over channels, and temporal-spatial convolutional embeddings.
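To make the alignment stage concrete, here is a minimal sketch of a tri-modal contrastive objective in plain Python. It is not the authors' implementation: the summary above does not give the exact loss, so the symmetric InfoNCE form, the `temperature` value, and the `text_weight` factor (a stand-in for text acting as a down-weighted regularizer) are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, targets, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `anchors` is paired with row i of `targets`."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    n = len(a)
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

def tri_modal_loss(eeg, image, text, text_weight=0.3, temperature=0.07):
    # Primary EEG-image term plus a down-weighted EEG-text term.
    # `text_weight` is a hypothetical hyperparameter, not a value from the paper.
    return (info_nce(eeg, image, temperature)
            + text_weight * info_nce(eeg, text, temperature))
```

With matched batches the loss approaches zero, and it grows when the EEG and image rows are mispaired. Keeping `text_weight` below 1 is one simple way to let text shape the space without dominating the EEG-image term.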
On the Things-EEG2 200-way zero-shot benchmark, MindAlign achieves 54.1% Top-1 and 83.4% Top-5 accuracy, up from the previous best baseline's 32.4% Top-1 and 64.0% Top-5. Paired Wilcoxon tests confirm that the gains are significant (p < 0.01) against all in-subject baselines. Analysis shows that a compact embedding geometry (CN-CLIP) outperforms much larger backbones, and that the decoding patterns are consistent with the established neurophysiology of visual processing. The framework also generalizes to Things-MEG data. The work is a step toward robust, semantically grounded visual decoding from non-invasive temporal neural signals. The source code is publicly available.
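For readers unfamiliar with the metric, the following is a minimal sketch of how Top-k accuracy is typically computed in a zero-shot retrieval setting: each EEG embedding is ranked against all candidate embeddings by cosine similarity, and a trial counts as a hit if the true candidate lands in the top k. The function and variable names are hypothetical, and the benchmark's actual evaluation code may differ. For scale, random guessing over 200 candidates gives 0.5% Top-1 and 2.5% Top-5.

```python
import math

def cosine(u, v):
    # Cosine similarity; defined as 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_accuracy(eeg_embeddings, candidate_embeddings, labels, k=1):
    """Fraction of trials whose true candidate index is among the k most similar."""
    hits = 0
    for emb, label in zip(eeg_embeddings, labels):
        scores = [cosine(emb, c) for c in candidate_embeddings]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy example with 3 candidates instead of 200.
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eeg = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.9], [0.5, 0.6, 0.1]]
labels = [0, 2, 0]

top1 = top_k_accuracy(eeg, candidates, labels, k=1)
top2 = top_k_accuracy(eeg, candidates, labels, k=2)
```

In the toy data the third trial is closest to the wrong candidate but has the right one in second place, so it counts toward Top-2 and not Top-1.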
- 54.1% Top-1 zero-shot accuracy on Things-EEG2, up from 32.4% baseline
- Tri-modal contrastive learning aligns EEG, images, and LLM-generated text
- Generalizes to Things-MEG data; compact CN-CLIP embeddings outperform much larger backbones
Why It Matters
Zero-shot visual decoding from non-invasive EEG gets a large accuracy boost, with Top-1 rising from 32.4% to 54.1%, bringing brain-computer interfaces of this kind closer to practical use.