EEG Foundation Models fail at handwriting decoding, study shows

Current EEG foundation models can't beat smaller specialist models on fine motor tasks.

Deep Dive

A new preprint from Srinivas Ravishankar and colleagues at UC San Diego challenges the robustness of electroencephalography (EEG) foundation models. Models such as LaBraM and BENDR have claimed state-of-the-art performance on motor imagery (MI) tasks, which typically involve classifying imagined limb movements such as left hand vs. right foot. The authors argue that such coarse tasks may not fully test a model's ability to capture fine-grained motor signals. They introduce handwriting decoding as a more demanding benchmark: classifying which of four letters a user is imagining writing, based solely on EEG data.

When the researchers controlled for movement-onset cues, a confound present in previous datasets, average decoding accuracy across subjects dropped from 41.3% to 32.4%, suggesting that prior successes were partly artifacts of temporal alignment. The best-performing foundation model also lagged behind a carefully tuned, smaller convolutional neural network (the specialist model) on the same task. Improving test-time signal quality (e.g., using higher-quality electrode channels or artifact rejection) raised accuracy from 45% to 78% for the best subject, while simply adding more single-trial training data yielded diminishing returns. The authors have made their code and dataset publicly available and urge the community to adopt handwriting decoding as a litmus test for how well EEG foundation models generalize.
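To put these figures in context, it helps to compare them with chance level. The sketch below assumes the four letters are equally likely, so that chance is 25%; the summary above does not state the class balance, so treat that baseline as an assumption. The function name and dictionary labels are illustrative only and do not come from the authors' code.

```python
# Assumption: four balanced classes, so chance accuracy is 1/4.
CHANCE = 1 / 4

def points_above_chance(accuracy, chance=CHANCE):
    """Return the margin over chance in percentage points, rounded to 0.1."""
    return round((accuracy - chance) * 100, 1)

# Accuracies as reported in the summary above.
reported = {
    "average, onset cues available": 0.413,
    "average, onset cues controlled": 0.324,
    "best subject, before signal-quality gain": 0.45,
    "best subject, after signal-quality gain": 0.78,
}

margins = {label: points_above_chance(acc) for label, acc in reported.items()}

for label, margin in margins.items():
    print(f"{label}: {margin:+.1f} points vs. chance")
```

Under that assumption, controlling for onset cues cuts the average margin over chance by more than half, from about 16 points to about 7.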

Key Points
  • Existing EEG foundation models (e.g., LaBraM, BENDR) claim state-of-the-art results on coarse motor imagery but are outperformed by smaller specialist models on four-letter handwriting decoding.
  • Knowledge of movement onset inflated past results: average accuracy dropped from 41.3% to 32.4% once this confound was controlled.
  • Improving test-time signal quality (45% to 78% for the best subject) helped more than adding single-trial training data, which yielded diminishing returns.

Why It Matters

The study calls into question how well EEG foundation models truly generalize and highlights the need for more challenging, confound-free benchmarks.