Research & Papers

LLM 'Semantic Neurons' Map to Human Brain Regions, Capturing 94% of Peak Encoding Performance

Sparse autoencoders reveal how GPT-2 XL and Llama-3.1-8B features align with cortical topography.

Deep Dive

A team led by Dongxin Guo, Jikun Wu, and Siu Ming Yiu at HKU bridges mechanistic interpretability and computational neurolinguistics by applying sparse autoencoders (SAEs) to GPT-2 XL and Llama-3.1-8B. After decomposing each layer into 16K–32K features, they built a human-validated feature taxonomy (κ≥0.74) and found that semantic features alone account for 94% of the peak performance that LLM activations reach when used to predict human brain responses (r=0.285, p<0.001, d=1.31). This helps explain why intermediate LLM layers best predict brain activity: semantic content, not syntax or position, drives the alignment.
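To make the "share of peak encoding performance" idea concrete, here is a minimal sketch on synthetic data. It is not the authors' pipeline. The SAE weights are random rather than trained, the "brain" responses are simulated, and every name (`sae_encode`, `encoding_score`, `is_semantic`) and every size is a hypothetical chosen for illustration. It fits a ridge-regression encoding model once on all SAE features and once on a subset labeled semantic, then takes the ratio of the two held-out correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder of a sparse autoencoder: activations -> non-negative features."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def ridge_fit(X, Y, alpha):
    """Closed-form ridge regression weights (no intercept)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def encoding_score(X_tr, Y_tr, X_te, Y_te, alpha=10.0):
    """Mean Pearson r between predicted and held-out responses, averaged over voxels."""
    W = ridge_fit(X_tr, Y_tr, alpha)
    pred = X_te @ W
    pred_c = pred - pred.mean(axis=0)
    y_c = Y_te - Y_te.mean(axis=0)
    r = (pred_c * y_c).sum(axis=0) / (
        np.linalg.norm(pred_c, axis=0) * np.linalg.norm(y_c, axis=0)
    )
    return float(r.mean())

# Toy sizes (assumptions): 400 stimuli, 32-dim activations, 128 features, 20 voxels.
n, d_model, n_feat, n_vox = 400, 32, 128, 20
acts = rng.normal(size=(n, d_model))                 # stand-in for layer activations
W_enc = rng.normal(size=(d_model, n_feat)) / np.sqrt(d_model)
b_enc = np.zeros(n_feat)
feats = sae_encode(acts, W_enc, b_enc)               # untrained SAE, illustration only

# Hypothetical taxonomy: pretend the first 40 features were labeled "semantic".
is_semantic = np.zeros(n_feat, dtype=bool)
is_semantic[:40] = True

# Simulated responses driven mostly by semantic features, plus unit-variance noise.
B = rng.normal(size=(n_feat, n_vox))
B[~is_semantic] *= 0.3
brain = feats @ B + rng.normal(size=(n, n_vox))

tr, te = slice(0, 300), slice(300, n)
score_full = encoding_score(feats[tr], brain[tr], feats[te], brain[te])
score_sem = encoding_score(
    feats[tr][:, is_semantic], brain[tr], feats[te][:, is_semantic], brain[te]
)
ratio = score_sem / score_full
print(f"full r={score_full:.3f}  semantic-only r={score_sem:.3f}  ratio={ratio:.2f}")
```

The ratio printed here depends entirely on the simulated weights and noise level, so it says nothing about the reported 94%. It only shows how a figure of that kind is computed.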

Crucially, the SAE-discovered features recapitulate known cortical semantic organization: five predefined semantic subcategories (e.g., social, spatial, emotional) map onto distinct brain regions with high convergence (Spearman ρ=0.72, hypergeometric p=0.007). The features also predict human reading times beyond lexical baselines (ΔlogLik=38.4, p<0.001), with exploratory evidence that the brain encodes unexpected semantic content. The results hold across English, Chinese, and French, suggesting that the alignment is not specific to one language. Accepted at CoNLL 2026, the work offers a mechanistic explanation for brain-LLM similarity and a new tool for studying language in the cortex.

Key Points
  • Sparse autoencoders decomposed each layer of GPT-2 XL and Llama-3.1-8B into 16K–32K features, which were then sorted into a human-validated taxonomy (κ≥0.74).
  • Semantic features alone achieve 94% of peak brain encoding performance (r=0.285) with Cohen's d=1.31 against baselines.
  • Five a priori semantic subcategories map to distinct brain regions (Spearman ρ=0.72, hypergeometric p=0.007), generalizing across English, Chinese, and French.

Why It Matters

Offers a mechanistic account of why intermediate LLM layers mirror brain activity and opens new avenues for studying language in the cortex with interpretable model features.