AI Safety

[Linkpost] Interpreting Language Model Parameters

A new adversarial decomposition reveals how models compute, even in attention layers.

Deep Dive

Goodfire AI introduces adVersarial Parameter Decomposition (VPD), a method that splits a language model’s parameters into small, faithful subcomponents. VPD improves on prior techniques (SPD/APD) by handling attention layers, where sparse autoencoders (SAEs) struggle, and uses adversarial ablation to identify causally important subnetworks. The authors applied VPD to all 28 million non-embedding parameters of a small 67‑million‑parameter model, and they now consider the approach more or less ready to scale to larger models, a step toward deeper mechanistic interpretability.
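To make the core idea concrete, here is a minimal sketch of a rank-one parameter decomposition in PyTorch, in the spirit of the SPD/APD lineage the post builds on. Everything here is illustrative: `DecomposedLinear`, `num_components`, and the faithfulness objective are assumptions made for exposition, not Goodfire’s actual code or training setup.

```python
import torch
import torch.nn as nn

class DecomposedLinear(nn.Module):
    """A linear layer whose weight is a sum of rank-one subcomponents:
    W = sum_c u_c v_c^T. Each subcomponent can be ablated independently,
    while the full (unablated) sum is trained to match the original layer.
    (Hypothetical sketch, not Goodfire's implementation.)"""

    def __init__(self, in_features, out_features, num_components):
        super().__init__()
        # One (u_c, v_c) pair per subcomponent; small init starts W near zero.
        self.U = nn.Parameter(0.02 * torch.randn(num_components, out_features))
        self.V = nn.Parameter(0.02 * torch.randn(num_components, in_features))

    def weight(self, keep_mask=None):
        # keep_mask: shape (num_components,), 1.0 = keep, 0.0 = ablate.
        U = self.U if keep_mask is None else self.U * keep_mask[:, None]
        return U.T @ self.V  # (out_features, in_features)

    def forward(self, x, keep_mask=None):
        return x @ self.weight(keep_mask).T


def faithfulness_loss(decomposed, original_weight):
    # Without ablation, the subcomponents should sum to the original weights.
    return (decomposed.weight() - original_weight).pow(2).mean()
```

Faithfulness, in this framing, is what lets causal claims about individual subcomponents transfer back to the original model: the decomposition is not an approximation bolted on the side but a re-expression of the same weights.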

Key Points
  • VPD decomposes parameters into simple, faithful subcomponents and uses adversarial ablation to find the causally important ones, outperforming prior methods like SPD and APD (see the sketch after this list).
  • It successfully handles attention layers, a persistent challenge for SAEs and transcoders, and avoids feature splitting.
  • Tested on a 67M-parameter model; the approach is now considered ready for scaling to larger, production-level models.
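As a rough illustration of the adversarial-ablation step, the sketch below uses a gradient-based adversary that searches for a small set of subcomponents (of a `DecomposedLinear` from the earlier sketch) whose removal maximally changes the layer’s output; those subcomponents are flagged as causally important for that input. The soft-mask relaxation, the budget penalty, and all hyperparameters are assumptions for this sketch, since the summary does not specify VPD’s exact adversarial objective.

```python
import torch

def worst_case_ablation(layer, x, target_out, k=4, steps=50, lr=0.3):
    """Illustrative adversary (not necessarily Goodfire's objective): optimize
    a soft keep-mask so that ablating roughly k subcomponents does maximal
    damage to the layer's output on input x."""
    num_components = layer.U.shape[0]
    logits = torch.zeros(num_components, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        keep = torch.sigmoid(logits)                        # soft mask, 1 = keep
        damage = (layer(x, keep) - target_out).pow(2).mean()
        budget = ((num_components - keep.sum()) - k).abs()  # ablate ~k components
        loss = -damage + 0.1 * budget                       # adversary maximizes damage
        opt.zero_grad()
        loss.backward()
        opt.step()
    layer.zero_grad()  # drop gradients that leaked into the decomposition itself
    # Subcomponents the adversary chose to ablate are causally important here.
    return (torch.sigmoid(logits) < 0.5).detach()
```

In a full training loop one would presumably optimize the decomposition against this adversary alongside the faithfulness loss above, so that causal importance concentrates into a few identifiable subcomponents; that coupling is again an assumption about the setup, not a detail given in the post.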

Why It Matters

Scaling interpretability to real models enables safer AI deployments and deeper debugging of neural algorithms.