SAE feature universality hides a random rotation problem – fix found
Decoder columns match, but encoders fail when applied across models; a rotation is the key.
Independent researcher Jordan McCann’s new analysis reveals a subtle flaw in how sparse autoencoder (SAE) universality is measured. Decoder-column cosine similarities between SAEs trained on different random seeds of the same transformer architecture hover around 0.9, a figure usually reported as evidence of shared features. Yet applying one seed’s SAE encoder to another seed’s activations fails catastrophically, yielding negative explained variance. McCann identifies the root cause, which he terms “polymorphism”: the two networks compute the same function, but their residual-stream activation spaces are rotated relative to each other by a rotation statistically indistinguishable from a uniform (Haar) random draw from the special orthogonal group SO(d). The bases are therefore “mutually unintelligible” even though the mathematical content is identical.
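To make "negative explained variance" concrete, here is a minimal sketch in Python, assuming the usual definition of the metric (one minus the ratio of squared reconstruction error to total variance around the mean). The summary does not say exactly how McCann computes it, so the function name and the synthetic data are illustrative only. The point is that a score below zero means the reconstruction is worse than simply predicting the mean activation.

```python
import numpy as np

def explained_variance(x, x_hat):
    """Fraction of variance explained: 1 - SSE / SST, pooled over all dimensions.

    Assumed definition; rows are samples, columns are activation dimensions.
    """
    sse = np.sum((x - x_hat) ** 2)            # squared reconstruction error
    sst = np.sum((x - x.mean(axis=0)) ** 2)   # total variance around the mean
    return 1.0 - sse / sst

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 16))           # synthetic stand-in for activations

# Exact reconstruction scores 1.
ev_perfect = explained_variance(x, x)

# Always predicting the mean scores 0: this is the baseline.
ev_mean = explained_variance(x, np.broadcast_to(x.mean(axis=0), x.shape))

# A reconstruction that points the wrong way is worse than the mean baseline,
# so the score drops below zero.
ev_wrong = explained_variance(x, -x)
```

Under this definition, any encoder whose output is less useful than a constant prediction lands below zero, which is the regime the raw cross-seed numbers are reported to fall into.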
Critically, McCann demonstrates a cheap fix: a single matrix multiplication by an orthogonal Procrustes rotation that aligns the two activation bases. After rotation, cross-model reconstruction scores jump to 0.99 on a toy model and 0.85–0.99 on Pythia-70m (nine seeds). The Frobenius distance between the learned rotation and the identity matches what a uniform distribution on SO(d) predicts, which McCann takes as confirmation of the random-rotation hypothesis. He recommends that future SAE evaluations report both raw cross-model explained variance and post-rotation explained variance, not just decoder cosine similarity. Practical upshot: steering vectors from one model can be transferred to another by applying the same rotation matrix, at negligible computational cost.
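The alignment step can be sketched as follows. This is not McCann's code: it is a minimal NumPy illustration on synthetic Gaussian "activations", with hypothetical names (`fit_procrustes`, `acts_a`, `acts_b`, `steer_a`) and a row-vector convention in which model B's activations are approximately model A's multiplied by an unknown rotation. It fits the rotation with the standard SVD solution to the orthogonal Procrustes problem, compares its Frobenius distance from the identity with the Haar reference value (for a Haar-random R in SO(d) with d ≥ 2, the expected squared distance is exactly 2d, so the distance is roughly sqrt(2d)), and moves a steering vector across with the same matrix.

```python
import numpy as np

def random_rotation(d, rng):
    """Draw a Haar-distributed matrix from SO(d) via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    q = q * np.sign(np.diag(r))    # sign fix so q is Haar-distributed on O(d)
    if np.linalg.det(q) < 0:       # flip one column to land in SO(d)
        q[:, 0] = -q[:, 0]
    return q

def fit_procrustes(acts_a, acts_b):
    """Orthogonal R minimizing ||acts_a @ R - acts_b||_F (rows are samples)."""
    u, _, vt = np.linalg.svd(acts_a.T @ acts_b)
    return u @ vt

rng = np.random.default_rng(0)
d, n = 64, 2000

# Synthetic stand-ins for paired residual-stream activations from two seeds:
# model B sees model A's activations in a randomly rotated basis, plus noise.
acts_a = rng.standard_normal((n, d))
true_rot = random_rotation(d, rng)
acts_b = acts_a @ true_rot + 0.01 * rng.standard_normal((n, d))

rot = fit_procrustes(acts_a, acts_b)

# Relative mismatch between the two activation sets before and after alignment.
err_raw = np.linalg.norm(acts_a - acts_b) / np.linalg.norm(acts_b)
err_aligned = np.linalg.norm(acts_a @ rot - acts_b) / np.linalg.norm(acts_b)

# Distance of the learned rotation from the identity vs. the Haar reference.
dist_to_identity = np.linalg.norm(rot - np.eye(d))
haar_reference = np.sqrt(2 * d)

# Transfer a (hypothetical) steering vector from A's basis into B's basis.
steer_a = rng.standard_normal(d)
steer_b = steer_a @ rot
```

In a real evaluation, `acts_a` and `acts_b` would be activations collected from the two models on the same tokens, and the rotated activations would be fed to the other seed's SAE before computing explained variance. The Procrustes solution is only constrained to be orthogonal, so it can include a reflection; whether that matters in practice is not addressed in the summary.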
- Decoder cosine similarities between SAEs from different seeds are ~0.9, but cross-model encoder reconstruction yields negative explained variance.
- The mismatch is due to a uniform random rotation of residual-stream bases, termed polymorphism; fixed by a single orthogonal Procrustes rotation.
- Post-rotation reconstruction scores reach 0.99 on a toy model and 0.85–0.99 on Pythia-70m (nine seeds) without retraining; the learned rotation's distance from identity matches the Haar (uniform) distribution on SO(d).
Why It Matters
Enables true cross-model SAE transfer and steering-vector reuse, saving compute and making interpretability results more reliable.