Research & Papers

Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences

New research shows language models process Python code and human genome sequences similarly to natural languages.

Deep Dive

A new study from researchers Jingnong Qu, Ashvin Ranjan, and Shane Steinert-Threlkeld challenges a key assumption in AI neuroscience. The team investigated Brain Score (BS), a popular metric that compares activations in language models (LMs) to human brain activity, measured with fMRI, while subjects read. A high BS has typically been interpreted as evidence that models like GPT-4 or Llama 3 process information similarly to the human brain. This paper, published on arXiv, rigorously tests that claim by training LMs on vastly different types of structured data.
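The article does not spell out the exact pipeline, but Brain Score-style metrics are typically computed as cross-validated linear predictivity: fit a regularized linear map from model activations to fMRI voxel responses, then score held-out correlation. Below is a minimal sketch with synthetic stand-in data; the ridge penalty, voxel count, and split scheme are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Stand-in data: in practice these would be LM hidden states and fMRI voxel
# responses recorded for the same stimuli (e.g. sentences read by subjects).
n_stimuli, n_features, n_voxels = 200, 64, 30
activations = rng.standard_normal((n_stimuli, n_features))
# Make voxels partly predictable from activations so the score is nontrivial.
weights = rng.standard_normal((n_features, n_voxels))
voxels = activations @ weights + 2.0 * rng.standard_normal((n_stimuli, n_voxels))

def brain_score_like(X, Y, alpha=1.0, n_splits=5):
    """Cross-validated linear predictivity: fit a ridge map from model
    activations X to voxel responses Y, average held-out Pearson r."""
    scores = []
    for train, test in KFold(n_splits=n_splits, shuffle=True,
                             random_state=0).split(X):
        pred = Ridge(alpha=alpha).fit(X[train], Y[train]).predict(X[test])
        # Per-voxel correlation between predicted and observed responses.
        for v in range(Y.shape[1]):
            scores.append(np.corrcoef(pred[:, v], Y[test][:, v])[0, 1])
    return float(np.mean(scores))

score = brain_score_like(activations, voxels)
print(f"Brain-Score-like predictivity: {score:.3f}")
```

The key point for interpreting the paper: nothing in this procedure is specific to language. Any model whose activations linearly encode the structure driving the voxel responses can score well.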

The researchers found that, as expected, models trained on natural languages from different families performed similarly on BS. Surprisingly, models trained on non-linguistic but structured sequences (specifically Python code, human genome sequences, and even synthetic sequences of nested parentheses) also achieved reasonably high, and sometimes comparable, Brain Scores. This indicates that the BS metric is primarily sensitive to a model's ability to extract general hierarchical and sequential structure, a capability not unique to human language processing.
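To make the "synthetic nested parentheses" condition concrete, here is a minimal, hypothetical generator for balanced bracket sequences of the kind such experiments use. The paper's exact generation scheme is not described here; the depth limit and branching probability are assumptions for illustration.

```python
import random

def nested_parens(max_depth=6, branch_prob=0.6, rng=None):
    """Recursively emit one balanced nested-parentheses string, e.g. '(()(()))'.
    Deeper nesting becomes less likely as depth grows, giving hierarchical
    structure without any linguistic content."""
    rng = rng or random.Random(0)

    def gen(depth):
        if depth >= max_depth or rng.random() > branch_prob:
            return "()"
        # Wrap one or two child groups to create hierarchical nesting.
        children = "".join(gen(depth + 1) for _ in range(rng.randint(1, 2)))
        return f"({children})"

    return gen(0)

samples = [nested_parens(rng=random.Random(seed)) for seed in range(3)]
print(samples)
```

Sequences like these carry hierarchical, long-range dependencies (each opening bracket must eventually be closed) while being entirely non-linguistic, which is exactly what makes them a useful control condition.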

These results have significant implications for interpreting AI alignment with human cognition. The study suggests that while Brain Score is a useful tool for measuring how well LMs capture abstract patterns, a high score alone is not sufficient evidence to conclude an AI model processes information in a human-like way. It calls for more nuanced, multi-faceted benchmarks to truly assess the similarity between artificial and biological intelligence.

Key Points
  • Brain Score gives similar results for models trained on natural languages, Python code, and human genome sequences.
  • The metric measures general structural learning, not specifically human-like language processing.
  • Findings challenge the assumption that high Brain Scores prove AI understands language like humans.

Why It Matters

Forces a reevaluation of how we measure AI's similarity to human cognition, moving beyond single-metric benchmarks.