AI Safety

Extracting Performant Algorithms Using Mechanistic Interpretability

Researchers extracted a functional cell differentiation algorithm from a 96-attention-unit transformer model using automated hypothesis search.

Deep Dive

Researchers at Goodfire AI have used mechanistic interpretability to extract a performant biological algorithm from a foundation model. The work focused on scGPT, a transformer model trained on millions of single-cell gene expression profiles to predict masked gene values. By systematically searching the model's 96 attention units (12 layers x 8 heads) with an AI executor-reviewer pair, they identified a compact 8-to-10-dimensional manifold within specific attention heads. This geometric structure encoded the course of hematopoietic differentiation, with stem cells at one end and terminally differentiated blood cells such as T cells and monocytes along distinct branches.
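To make the numbers above concrete, here is a minimal sketch of the two quantities involved: the 96 (layer, head) units being searched, and a "dimensionality" estimate for one head's output. It is not the researchers' search procedure. The activations are synthetic, and the helper names (attention_units, effective_dim, synthetic_head_output), the 64-wide head output, and the 99% variance threshold are all assumptions chosen for illustration.

```python
import numpy as np

N_LAYERS, N_HEADS = 12, 8  # from the article: 12 layers x 8 heads


def attention_units(n_layers=N_LAYERS, n_heads=N_HEADS):
    """Enumerate every (layer, head) pair that a search would visit."""
    return [(layer, head) for layer in range(n_layers) for head in range(n_heads)]


def effective_dim(acts, var_threshold=0.99):
    """Smallest number of principal components whose cumulative
    explained variance reaches var_threshold (a common, rough proxy
    for the dimensionality of a representation)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    return int(np.searchsorted(explained, var_threshold) + 1)


def synthetic_head_output(n_cells=500, latent_dim=9, out_dim=64, noise=0.01, seed=0):
    """Hypothetical stand-in for one head's per-cell output: a
    low-dimensional latent signal mixed into a wider vector, plus noise.
    Real scGPT activations are NOT used here."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_cells, latent_dim))
    mixing = rng.normal(size=(latent_dim, out_dim))
    return latent @ mixing + noise * rng.normal(size=(n_cells, out_dim))


units = attention_units()
acts = synthetic_head_output()
dim = effective_dim(acts)
print(f"{len(units)} attention units; synthetic head has effective dim {dim}")
```

In a real analysis the synthetic array would be replaced by activations recorded from each head. A linear variance threshold is only one of several ways to estimate the dimensionality of a manifold, and it can overestimate when the structure is curved.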

The discovered manifold wasn't just a curious artifact; it represented a functional algorithm. The team developed a three-stage extraction pipeline to convert this internal representation into a standalone tool. Validation on independent datasets, including the Tabula Sapiens atlas and a multi-donor immune panel, confirmed its accuracy in mapping cell development. This suggests that the model, trained only to predict masked gene values, had internally learned and structured a complex biological process (the developmental hierarchy of blood cells) to improve at its core task. The finding mirrors a prior discovery in which the Evo 2 DNA model encoded the evolutionary tree of life in its activations.
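The article does not describe the three pipeline stages, so the sketch below illustrates only the general idea behind validating on an independent dataset: fit a projection and a simple classifier on a source dataset, then apply them unchanged to a target dataset. Everything here is a hypothetical illustration on synthetic data. The function names, the PCA projection, and the nearest-centroid classifier are assumptions and should not be read as the team's method.

```python
import numpy as np


def fit_projection(source, n_components):
    """Fit a linear projection (PCA via SVD) on the source data only."""
    mean = source.mean(axis=0)
    _, _, vt = np.linalg.svd(source - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(data, mean, components):
    """Apply a previously fitted projection without refitting."""
    return (data - mean) @ components.T


def fit_centroids(embedded, labels):
    """Compute one centroid per class in the projected space."""
    classes = np.unique(labels)
    centroids = np.stack([embedded[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(embedded, classes, centroids):
    """Assign each point to the class of its nearest centroid."""
    dists = np.linalg.norm(embedded[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def make_dataset(n_per_class, centers, noise, seed):
    """Synthetic 'cells': Gaussian clouds around shared class centers."""
    rng = np.random.default_rng(seed)
    x = np.concatenate(
        [c + noise * rng.normal(size=(n_per_class, centers.shape[1])) for c in centers]
    )
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return x, y


# Hypothetical source and target datasets that share the same structure.
centers = np.random.default_rng(0).normal(scale=5.0, size=(3, 32))
src_x, src_y = make_dataset(100, centers, noise=1.0, seed=1)
tgt_x, tgt_y = make_dataset(50, centers, noise=1.0, seed=2)

# Fit everything on the source; the target is only ever projected and scored.
mean, comps = fit_projection(src_x, n_components=8)
classes, centroids = fit_centroids(project(src_x, mean, comps), src_y)
pred = predict(project(tgt_x, mean, comps), classes, centroids)
accuracy = float((pred == tgt_y).mean())
print(f"zero-shot accuracy on target: {accuracy:.2f}")
```

The design point is that no parameter is re-estimated on the target data, so accuracy there reflects how well the structure found in the source carries over.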

Key Points
  • Automated search of scGPT's 96 attention units revealed an 8-10D manifold encoding blood cell development.
  • The extracted geometric algorithm was validated via zero-shot transfer to the independent Tabula Sapiens dataset.
  • The result suggests that foundation models can learn complex, structured biological representations (such as differentiation pathways) in order to perform their training task.

Why It Matters

The work offers a blueprint for finding novel, functional algorithms hidden inside black-box AI models, which could accelerate scientific discovery.