AI Safety

Mechanistic Interpretability of Biological Foundation Models

37 analyses show attention-based gene network extraction fails to beat trivial baselines, raising doubts about standard LLM interpretability methods.

Deep Dive

In a study published on LessWrong, researcher Ihor Kendiukhov presents an unusually thorough evaluation of mechanistic interpretability methods applied to biological foundation models such as scGPT and Geneformer. Across 37 distinct analyses and 153 statistical tests spanning 4 cell types, attention-based extraction of gene regulatory networks consistently failed: trivial gene-level baselines already capture the signal, and the heads most aligned with known biological regulation proved the most dispensable for the model's actual computation. Crucially, the study identifies a large, formally quantifiable non-additivity bias in activation patching that undermines standard component-ranking methods, a finding likely relevant to large language model (LLM) interpretability. Kendiukhov argues that biological models offer external ground-truth validation, more tractable scale, and direct biomedical benefit, with lower dual-use risk than frontier LLM interpretability.
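The core protocol described above is to score candidate regulatory edges from attention weights and then check whether those scores beat a baseline that uses no pairwise information at all. The sketch below is a hedged illustration of that kind of comparison, not the author's code: the array shapes, the synthetic stand-in data, and the gene-popularity baseline are all assumptions made for the example.

```python
# Hedged sketch: compare attention-derived gene-gene edge scores against a
# trivial gene-level baseline on a synthetic "known" regulatory network.
# Assumed inputs: per-head attention over gene tokens and a binary edge matrix.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_heads, n_genes = 8, 200

attention = rng.random((n_heads, n_genes, n_genes))   # stand-in attention maps
mean_expression = rng.random(n_genes)                 # gene-level statistic
known_edges = rng.random((n_genes, n_genes)) < 0.05   # stand-in ground-truth GRN

mask = ~np.eye(n_genes, dtype=bool)                   # ignore self-edges
labels = known_edges[mask].astype(int)

# Attention-based score: mean attention between each gene pair across heads.
attn_scores = attention.mean(axis=0)[mask]

# Trivial baseline: score(i -> j) from gene-level "popularity" alone,
# with no pairwise information at all.
baseline_scores = np.outer(mean_expression, mean_expression)[mask]

print("attention AUROC:", roc_auc_score(labels, attn_scores))
print("baseline  AUROC:", roc_auc_score(labels, baseline_scores))
```

On the real models, the study's point is that the attention-based score does not meaningfully beat the second number: gene-level statistics alone already capture most of whatever signal the attention maps carry.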

Key Points
  • 37 analyses and 153 statistical tests across 4 cell types show attention-based gene network extraction fails to beat trivial gene-level baselines
  • Found a quantifiable non-additivity bias in activation patching that undermines standard interpretability component rankings (see the sketch after this list)
  • Heads most aligned with known biological regulation were most dispensable for the model's actual computation
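One way to read the non-additivity finding: activation patching typically ranks components by the effect of restoring each one in isolation, which implicitly assumes those effects add up. The toy example below uses a made-up two-component readout with hypothetical numbers, not the study's models, to show how an interaction term breaks that assumption.

```python
# Hedged sketch of non-additivity in activation patching on a toy "model".
import numpy as np

def model_output(a, b):
    # Toy readout with an interaction term between components a and b.
    return 0.5 * a + 0.3 * b + 2.0 * a * b

clean = {"a": 1.0, "b": 1.0}     # component activations on the clean input
corrupt = {"a": 0.0, "b": 0.0}   # component activations on the corrupted input

def patched_output(patch_keys):
    # Run the corrupted input, restoring the listed components to clean values.
    acts = {k: (clean[k] if k in patch_keys else corrupt[k]) for k in clean}
    return model_output(acts["a"], acts["b"])

base = patched_output([])                      # fully corrupted run
effect_a = patched_output(["a"]) - base        # patch component a alone
effect_b = patched_output(["b"]) - base        # patch component b alone
effect_ab = patched_output(["a", "b"]) - base  # patch both together

print("sum of single-component effects:", effect_a + effect_b)   # 0.8
print("joint effect:                   ", effect_ab)             # 2.8
print("non-additivity gap:             ", effect_ab - (effect_a + effect_b))
```

The final gap is the kind of quantity a non-additivity bias measurement would track: when it is large, rankings built from single-component patches can badly misstate which components the computation actually depends on.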

Why It Matters

Reveals fundamental flaws in current AI interpretability methods with direct implications for understanding and safely developing frontier LLMs.