OpenAI's Whisper aligns with human brain speech processing, study finds
Intermediate layers of Whisper best predict ECoG brain responses during speech perception.
A new study presented at the ICLR 2026 Workshop on Representational Alignment investigated how OpenAI's Whisper speech foundation model relates to human brain activity during naturalistic speech perception. The researchers, led by Matteo Ciferri with colleagues from multiple institutions, used intracranial electrocorticography (ECoG), a high-resolution method that records from electrodes placed directly on the brain's surface, to capture neural responses while participants listened to speech. They then introduced a time-resolved neural encoder that combines Whisper's internal speech embeddings with a recurrent temporal model and a soft attention mechanism, which let them examine layer by layer how well the model's representations align with cortical activity.
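To make the encoder idea concrete, here is a minimal, untrained forward pass in NumPy. It is a sketch under stated assumptions, not the study's implementation: the simple tanh recurrence, the single-vector attention scoring, the dimensions, and the names `encode_window` and `init_params` are all hypothetical, and random arrays stand in for Whisper embeddings.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def init_params(n_dims, n_hidden, n_electrodes, rng):
    # Hypothetical random weights; a real encoder would learn these from data.
    W_in = rng.standard_normal((n_dims, n_hidden)) / np.sqrt(n_dims)
    W_rec = rng.standard_normal((n_hidden, n_hidden)) / np.sqrt(n_hidden)
    w_att = rng.standard_normal(n_hidden)
    W_out = rng.standard_normal((n_hidden, n_electrodes)) / np.sqrt(n_hidden)
    return W_in, W_rec, w_att, W_out

def encode_window(emb, params):
    """Map a window of speech embeddings (T, D) to one value per electrode.

    Returns the prediction (n_electrodes,) and the attention weights (T,).
    """
    W_in, W_rec, w_att, W_out = params
    n_steps, n_hidden = emb.shape[0], W_rec.shape[0]
    h = np.zeros(n_hidden)
    states = np.zeros((n_steps, n_hidden))
    for t in range(n_steps):
        # Plain tanh recurrence, used here as a stand-in recurrent model.
        h = np.tanh(emb[t] @ W_in + h @ W_rec)
        states[t] = h
    # Soft attention: one score per time step, normalized to sum to 1.
    attn = softmax(states @ w_att)
    context = attn @ states  # attention-weighted summary, shape (n_hidden,)
    return context @ W_out, attn
```

The returned attention weights are what an analysis of attention maps would inspect: weight concentrated on a few neighboring time steps would indicate temporally local alignment between the embeddings and the neural response.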
The results revealed a hierarchical match: intermediate layers of Whisper corresponded most closely to neural activity, outperforming both early and late layers. This suggests that speech foundation models develop representations resembling the hierarchical organization of cortical speech processing. Comparisons with simpler linear baselines showed that temporally structured modeling was essential for capturing high-resolution ECoG responses. Attention maps from the neural encoder showed temporally local alignment between speech embeddings and neural activity, and a phonemic interpretability analysis found anatomically coherent phoneme-category organization among the electrodes that contributed most to encoding. The study indicates that Whisper, which was trained on audio paired with transcripts rather than on any neural data, nonetheless learns features that align with biological speech processing. The authors' encoder also gives computational neuroscience a framework for comparing speech models with brain recordings.
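A layer-by-layer comparison of this kind can be sketched with the simplest option, a linear ridge baseline. The example below is illustrative only: the data are synthetic, built so that the "neural" signal is generated from the middle layer, and every name and size is an assumption. It shows the procedure (fit one model per layer, score each on held-out data), not the study's results.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    # Closed-form ridge regression: solve (X^T X + alpha I) W = X^T Y.
    n_dims = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_dims), X.T @ Y)

def mean_pearson(Y_true, Y_pred):
    # Pearson correlation per electrode, averaged across electrodes.
    rs = [np.corrcoef(Y_true[:, e], Y_pred[:, e])[0, 1]
          for e in range(Y_true.shape[1])]
    return float(np.mean(rs))

def layerwise_scores(layer_feats, Y, n_train, alpha=1.0):
    # Fit one ridge model per layer and score it on the held-out samples.
    scores = []
    for X in layer_feats:
        W = ridge_fit(X[:n_train], Y[:n_train], alpha)
        scores.append(mean_pearson(Y[n_train:], X[n_train:] @ W))
    return scores

# Synthetic stand-ins: 5 "layers" of features and 3 "electrodes".
rng = np.random.default_rng(0)
n_samples, n_dims, n_electrodes, n_layers = 400, 10, 3, 5
layer_feats = [rng.standard_normal((n_samples, n_dims))
               for _ in range(n_layers)]

# By construction, the responses depend only on the middle layer (index 2).
true_weights = rng.standard_normal((n_dims, n_electrodes))
noise = 0.1 * rng.standard_normal((n_samples, n_electrodes))
Y = layer_feats[2] @ true_weights + noise

scores = layerwise_scores(layer_feats, Y, n_train=300)
best_layer = int(np.argmax(scores))
print("best layer:", best_layer)
```

With real recordings, the feature matrices would come from Whisper's per-layer hidden states aligned in time to the ECoG signal, and the held-out split would need to respect the temporal structure of the data.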
- Whisper's intermediate layers (not first or last) best predict human ECoG responses, indicating hierarchical alignment with cortical speech processing.
- The new time-resolved neural encoder uses recurrent temporal modeling and soft attention, outperforming simple linear mappings for high-resolution brain recordings.
- Phonemic analysis showed that encoding-informative electrodes exhibit anatomically coherent phoneme-category organization (e.g., place/manner of articulation), supporting the interpretability of the encoder.
Why It Matters
Speech AI models may help decode brain activity, advancing brain-computer interfaces and neuroscience.