Structural Instability of Feature Composition

Sparse autoencoders can't combine features without interference – here's why.

Deep Dive

The paper “Structural Instability of Feature Composition” by Yunpeng Zhou tackles a blind spot in interpretability research: what happens when several features are steered at once. Sparse autoencoders (SAEs) have become the go-to tool for disentangling feature superposition in transformer models, enabling precise single-concept steering. But the Linear Representation Hypothesis assumes that features combine cleanly, a simplification that breaks down with the overcomplete dictionaries used in practice. Zhou models the activation space as a high-dimensional sparse cone manifold and derives an asymptotic collapse threshold from the Gaussian mean width (statistical dimension) of the signal cone.
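To make the Gaussian mean width concrete, here is a minimal Monte Carlo sketch for one simple cone: nonnegative, s-sparse vectors in R^n, restricted to the unit sphere. This cone, the function name `sparse_cone_gaussian_width`, and its parameters are illustrative assumptions, not the paper's construction, and the sketch does not reproduce Zhou's threshold. It only shows how the width, defined as the expected supremum of ⟨g, x⟩ over the set for a standard Gaussian vector g, grows as the sparsity budget s grows.

```python
import numpy as np

def sparse_cone_gaussian_width(n, s, trials=2000, seed=0):
    """Estimate E[sup <g, x>] over unit-norm, nonnegative, s-sparse x in R^n.

    Hypothetical toy cone chosen for illustration; not the paper's signal cone.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((trials, n))
    # Only positive coordinates of g can raise <g, x> when x >= 0.
    pos = np.maximum(g, 0.0)
    # Keep the s largest positive parts in each row (unordered is fine).
    top = np.partition(pos, n - s, axis=1)[:, n - s:]
    # For a fixed support the best unit vector aligns with g, so the
    # supremum is the Euclidean norm of those s entries. (If every entry
    # of g is negative the true supremum is slightly below 0; that event
    # is vanishingly rare for the n used here.)
    sup = np.linalg.norm(top, axis=1)
    return float(sup.mean())

if __name__ == "__main__":
    n = 256
    for s in (1, 8, 64, 256):
        print(f"s={s:3d}  estimated width={sparse_cone_gaussian_width(n, s):.3f}")
```

In this toy setting the width rises steadily with s, which is the sense in which a larger union of active features occupies a "wider" cone and leaves less room before interference dominates.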

In the high-bias regime, ReLU rectification converts microscopic, correlation-induced variance fluctuations into a systematic drift that accumulates under composition: a ratchet effect. Experiments on structured semantic features from CLEVR show that hierarchical correlations accelerate the collapse relative to random baselines. The result is a clear geometric constraint: scaling union-based steering requires explicit interference management, not just linear superposition. For practitioners, this means that current feature steering techniques (like those used in alignment research or controllable generation) may fail unpredictably when multiple concepts are combined. The paper opens a new research direction: steering mechanisms that remain robust under composition.
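The ratchet effect can be illustrated with a small simulation. The sketch below is a toy model under assumptions that are not taken from the paper: each of `k` steered features adds independent zero-mean interference of scale `rho` to an inactive latent, and a fixed negative offset `bias` keeps that latent off on average. The function name and all parameter values are hypothetical. The point is only that the interference averages to zero before the ReLU, while the rectified output can only move upward, so a larger interference variance becomes a larger one-directional leak.

```python
import numpy as np

def leakage_after_relu(k, rho=0.05, bias=0.2, n_latents=200_000, seed=0):
    """Return (mean interference, mean post-ReLU activation) for inactive latents.

    Toy assumption: steering k features adds k independent N(0, rho^2)
    interference terms to each inactive latent's pre-activation.
    """
    rng = np.random.default_rng(seed)
    # The sum of k independent N(0, rho^2) terms is N(0, k * rho^2).
    interference = rng.normal(0.0, rho * np.sqrt(k), size=n_latents)
    pre = interference - bias        # high bias: latent is off on average
    post = np.maximum(pre, 0.0)      # ReLU keeps only upward fluctuations
    return float(interference.mean()), float(post.mean())

if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        lin, rect = leakage_after_relu(k)
        print(f"k={k:3d}  mean interference={lin:+.5f}  mean ReLU leakage={rect:.5f}")
```

In this toy, the mean interference stays near zero for every `k`, while the mean rectified leakage grows as more features are composed. That is the qualitative shape of the drift described above, not a reproduction of the paper's derivation.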

Key Points
  • Derives a compositional-collapse threshold from the Gaussian mean width of the signal cone in a high-dimensional sparse cone manifold.
  • Demonstrates that ReLU rectification turns variance fluctuations into systematic drift, creating a ratchet effect under feature composition.
  • Validates the analysis on CLEVR features, where hierarchical correlations accelerate collapse compared to random baselines.

Why It Matters

Scaling activation steering across multiple features requires new mechanisms for managing interference, because compositions that rely on linear superposition alone are geometrically unstable.