Substrate-Sensitivity
LayerNorm contributes ~30% of self-repair, and quantization alters attack surfaces.
A new LessWrong post by mfatt and Vardhan, part of the AI Safety Camp project MoSSAIC, argues that 'substrate-sensitivity' (the dependence of a model's behavior on implementation details such as LayerNorm and quantization) unifies several AI safety phenomena. The authors define 'substrate' as the computational context surrounding a model's formal function f(x) = y, including architectural components and weight storage formats. They present two case studies: LayerNorm's role in self-repair (the Hydra effect) and quantization's impact on bit-flip attacks.
In Pythia-160M, LayerNorm contributes roughly 30% of observed self-repair when attention heads are ablated: its normalization rescales residual stream activations, partially compensating for the removed head's contribution. Quantization formats such as FP8 and INT4 save memory but change which hardware fault-injection attacks can usefully alter stored weights, creating new vulnerabilities. The authors argue that these 'passive' components are safety-relevant and should figure in risk analyses, expanding the list of substrate-sensitive phenomena.
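The rescaling mechanism can be illustrated with a toy example (a minimal NumPy sketch, not the post's actual experiment; the random vectors and the bare LayerNorm without learned scale/bias are illustrative assumptions): removing a head's contribution shrinks the residual stream, but LayerNorm renormalizes it back to the same scale, so downstream layers see a smaller perturbation than the raw ablation would suggest.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
resid = rng.normal(size=d)            # residual stream without one head
head_out = 0.5 * rng.normal(size=d)   # that head's (hypothetical) contribution

def layernorm(x, eps=1e-5):
    # Bare LayerNorm: center, then divide by the standard deviation
    # (ignoring the learned scale and bias for simplicity).
    x = x - x.mean()
    return x / np.sqrt(x.var() + eps)

clean = layernorm(resid + head_out)
ablated = layernorm(resid)            # zero-ablate the head, then renormalize

# Before LayerNorm, ablation shrinks the stream; after LayerNorm, both
# versions sit at the same scale (norm ≈ sqrt(d) = 8), masking part of
# the ablation from everything downstream.
print(np.linalg.norm(resid + head_out), np.linalg.norm(resid))
print(np.linalg.norm(clean), np.linalg.norm(ablated))
```

This is only the rescaling channel; the post's ~30% figure comes from measuring it in a real model, not from a toy like this.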
- LayerNorm contributes ~30% of self-repair in Pythia-160M when attention heads are ablated
- Quantization formats (FP8, INT4) create vulnerabilities to bit-flip attacks via hardware fault injection
- The authors define 'substrate' as computational context beyond the formal model function f(x) = y
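The altered attack surface can be sketched with a toy comparison (a hedged illustration, not taken from the post; the INT8 scale and stored codes are made-up numbers): in FP32, a single flipped exponent bit can blow a weight up by dozens of orders of magnitude, whereas in a symmetric integer quantization scheme every flip moves the weight by a bounded multiple of the quantization scale.

```python
import struct

def flip_fp32_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 float32 encoding of `value`."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return out

# FP32: flipping the top exponent bit of 0.5 yields 2**127, a catastrophic
# change from a single fault.
print(flip_fp32_bit(0.5, 30))

# Symmetric INT8 (weight ≈ scale * code): a flip is bounded by the quantized
# range, but every flip now shifts the weight by a multiple of the scale.
scale = 0.01
q = 50                       # stored int8 code, representing weight ~0.5
flipped = q ^ (1 << 6)       # flip bit 6 of the stored code -> 114
print(scale * q, scale * flipped)
```

The asymmetry is the point: quantization does not simply make bit flips harmless, it changes which flips matter and by how much.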
Why It Matters
Implementation details often ignored in safety analyses can significantly alter model behavior and create new attack surfaces.