NeuralBench: A unified framework to benchmark neuroAI models across 36 tasks
94 datasets, 14 architectures, and a surprising finding about foundation models.
NeuralBench, introduced by a team including Hubert Banville, Stéphane d'Ascoli, and Jean-Rémi King (among others), addresses a critical problem in neuroAI: the lack of standardized evaluation. Deep learning models for brain recordings have proliferated, but inconsistent preprocessing, training, and evaluation across studies make real progress hard to measure. NeuralBench provides a unified benchmarking framework with a standardized interface, starting with electroencephalography (EEG). The accompanying NeuralBench-EEG v1.0 benchmark includes 36 diverse tasks, ranging from cognitive decoding to clinical prediction, evaluated across 14 deep learning architectures and 94 datasets. At this scale, models can be compared under the same conditions, something that was previously impossible.
Two initial findings stand out. First, current foundation models (pre-trained on large EEG corpora) only marginally outperform simple task-specific models, challenging the assumption that large-scale pretraining is a silver bullet. Second, a significant portion of tasks remains highly challenging for all models, particularly clinical prediction and fine-grained cognitive state decoding. The framework is designed for extensibility: preliminary support for MEG and fMRI is already demonstrated, and the full codebase and data access are open-source. The authors invite the community to contribute new tasks, datasets, and models, aiming to establish a unified standard for neuroimaging AI benchmarking.
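To make the idea of a standardized interface concrete, here is a minimal Python sketch of a benchmark loop in which every model is trained and scored on every task with the same splits and the same metric. It is not NeuralBench's actual API: the names `Task`, `run_benchmark`, and the two toy models are hypothetical, and the data is a made-up one-feature example rather than EEG.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types for illustration only; the real framework may differ.
Example = Tuple[Sequence[float], int]            # (features, label)
Predictor = Callable[[Sequence[float]], int]     # features -> predicted label
Trainer = Callable[[List[Example]], Predictor]   # training data -> fitted predictor


@dataclass
class Task:
    name: str
    train: List[Example]
    test: List[Example]


def accuracy(predict: Predictor, data: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    return mean(1.0 if predict(x) == y else 0.0 for x, y in data)


def run_benchmark(tasks: List[Task], models: Dict[str, Trainer]) -> Dict[str, Dict[str, float]]:
    """Train and score every model on every task with identical splits and metric."""
    return {
        model_name: {task.name: accuracy(train(task.train), task.test) for task in tasks}
        for model_name, train in models.items()
    }


def majority_class(train_data: List[Example]) -> Predictor:
    """Baseline: always predict the most frequent training label."""
    labels = [y for _, y in train_data]
    top = max(set(labels), key=labels.count)
    return lambda x: top


def threshold_on_first_feature(train_data: List[Example]) -> Predictor:
    """Toy model: predict 1 when the first feature exceeds the training mean."""
    cut = mean(x[0] for x, _ in train_data)
    return lambda x: int(x[0] > cut)


# Made-up data standing in for a real task.
toy_task = Task(
    name="toy_binary",
    train=[([0.1], 0), ([0.2], 0), ([0.3], 0), ([0.8], 1), ([0.9], 1)],
    test=[([0.15], 0), ([0.85], 1), ([0.7], 1), ([0.05], 0)],
)

results = run_benchmark(
    [toy_task],
    {"majority": majority_class, "threshold": threshold_on_first_feature},
)
print(results)
```

On this toy data the threshold model scores 1.0 and the majority baseline 0.5. The point of the design is that adding a new task or model only means adding an entry to a list or dictionary, while the training and scoring code stays the same for everyone.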
- NeuralBench-EEG v1.0 covers 36 distinct EEG tasks, evaluated on 94 datasets via a unified interface.
- 14 deep learning architectures are benchmarked; foundation models only marginally outperform task-specific ones.
- Open-source framework with preliminary MEG and fMRI support; the authors invite community-contributed tasks, datasets, and models.
Why It Matters
Standardized benchmarking exposes gaps in neuroAI, pushing the field beyond inflated claims toward real clinical and cognitive impact.