AI Safety

[Linkpost] Interpreting Language Model Parameters

Adversarial ablation pinpoints causally important parameter subcomponents for the first time.

Deep Dive

A new paper introduces adVersarial Parameter Decomposition (VPD), a method that decomposes a small language model's parameters into interpretable subcomponents, each implementing a small part of the model's algorithm. VPD improves on the authors' earlier Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD) by using adversarial ablation—selectively removing subcomponents in a worst-case manner—to ensure that only causally important pieces are retained. This allows VPD to decompose attention layers, a task that transcoders and SAEs have historically struggled with.
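To make the idea concrete, here is a minimal sketch of what a parameter decomposition with per-subcomponent ablation could look like. The rank-one parameterization, class name, and mask convention are illustrative assumptions, not the paper's exact formulation:

```python
# Illustrative sketch (not the paper's code): a weight matrix W is expressed as
# a sum of n_components rank-one subcomponents, and a mask selects which
# subcomponents stay active on a given forward pass.
import torch
import torch.nn as nn

class DecomposedLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_components: int):
        super().__init__()
        # Each subcomponent c contributes an outer product U[c] V[c]^T to the weight.
        self.U = nn.Parameter(torch.randn(n_components, d_out) * 0.02)
        self.V = nn.Parameter(torch.randn(n_components, d_in) * 0.02)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # mask: shape (n_components,) with values in [0, 1];
        # 1 keeps a subcomponent, 0 ablates it.
        W = torch.einsum("c,co,ci->oi", mask, self.U, self.V)
        return x @ W.T
```

With an all-ones mask the layer reproduces the original computation; setting entries to zero removes individual subcomponents, which is the operation the adversarial ablation exploits.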

Importantly, the adversarial approach reveals that many other subnetwork identification methods fail faithfulness tests: their subnetworks may not actually drive model behavior once ablations are chosen adversarially. VPD also does not suffer from 'feature splitting' (the parameter-space analog of polysemantic features) and compares favorably to per-layer transcoders and cross-layer transcoders (CLTs). While currently demonstrated only on a small model, the authors believe VPD is ready to scale to models of practical interest, opening the door to more reliable mechanistic interpretability for real-world neural networks.
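A rough sketch of what such a faithfulness test might look like, continuing the `DecomposedLinear` example above: an adversary optimizes the mask over subcomponents claimed to be unimportant, trying to maximize the divergence from the full model's outputs. This is an assumed general recipe, not the paper's exact procedure, and the function and argument names are hypothetical:

```python
# Adversarial faithfulness probe (illustrative): if the decomposition is
# faithful, even a worst-case ablation of the "unimportant" subcomponents
# should barely change the model's output distribution.
import torch
import torch.nn.functional as F

def worst_case_kl(model, x, important, steps=100, lr=0.1):
    # model: maps (x, mask) -> logits, e.g. a network built from DecomposedLinear.
    # important: boolean tensor (n_components,) marking subcomponents claimed to matter.
    full_mask = torch.ones(important.shape, dtype=x.dtype)
    logits_full = model(x, full_mask).detach()

    # The adversary controls the mask only over unimportant components,
    # parameterized through a sigmoid so it can move freely in [0, 1].
    raw = torch.zeros(important.shape, requires_grad=True)
    opt = torch.optim.Adam([raw], lr=lr)
    for _ in range(steps):
        mask = torch.where(important, torch.ones_like(raw), torch.sigmoid(raw))
        logits_ablated = model(x, mask)
        kl = F.kl_div(
            F.log_softmax(logits_ablated, dim=-1),
            F.log_softmax(logits_full, dim=-1),
            log_target=True,
            reduction="batchmean",
        )
        opt.zero_grad()
        (-kl).backward()  # gradient ascent: the adversary maximizes divergence
        opt.step()
    return kl.item()  # large values indicate the decomposition is not faithful
```

The point of the worst-case search is that a subnetwork can look faithful under random or mean ablations yet fall apart when the ablation is chosen to hurt, which is the failure mode the summary attributes to other subnetwork identification methods.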

Key Points
  • VPD successfully decomposes attention layers, a key weakness of SAEs and transcoders.
  • Adversarial ablation ensures identified subcomponents are causally faithful to model behavior.
  • VPD avoids feature splitting and compares favorably to per-layer transcoders and CLTs.

Why It Matters

Moves mechanistic interpretability closer to reliably analyzing large, real-world language models.