AI Safety

Substrate-Sensitivity

LayerNorm contributes ~30% of self-repair, and quantization alters attack surfaces.

Deep Dive

A new LessWrong post by mfatt and Vardhan, part of the AI Safety Camp project MoSSAIC, argues that 'substrate-sensitivity'—the influence of implementation details like LayerNorm and quantization—unifies several AI safety phenomena. The authors define 'substrate' as the computational context surrounding a model's formal function f(x) = y, including architectural components and weight storage formats. They present two case studies: LayerNorm's role in self-repair (the Hydra effect) and quantization's impact on bit-flip attacks.
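
To make the 'substrate' framing concrete, here is a minimal sketch (an illustration for this newsletter, not code from the post) in which the same formal function f(x) = Wx is evaluated under two storage substrates: full-precision weights versus a simulated symmetric per-tensor INT8 format. The toy W, x, and quantization scheme are all assumptions for the example.

```python
# Illustrative only: the same formal weights W define "the same" function,
# but the stored representation (the substrate) changes the actual outputs.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)).astype(np.float32)   # formal parameters of f(x) = Wx
x = rng.normal(size=4).astype(np.float32)

def f_fp32(x):
    return W @ x                                  # full-precision substrate

def f_int8(x):
    scale = np.abs(W).max() / 127.0                 # symmetric per-tensor scale
    W_stored = np.round(W / scale).astype(np.int8)  # what actually sits in memory
    return (W_stored.astype(np.float32) * scale) @ x

print(f_fp32(x))
print(f_int8(x))   # close to f_fp32(x), but not identical
```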

In Pythia-160M, LayerNorm accounts for ~30% of the observed self-repair when attention heads are ablated: with a head's output removed, LayerNorm rescales the remaining residual stream activations, restoring part of the lost signal. Quantization formats such as FP8 and INT4 save memory but create new vulnerabilities to hardware fault-injection attacks that alter stored weights. The authors argue that these 'passive' components are safety-relevant and should be considered in risk analyses, expanding the list of substrate-sensitive phenomena.
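
The rescaling mechanism is easy to reproduce in a toy setting. The sketch below is an assumed setup, not the authors' experiment: it builds a residual stream from a background component plus one head's output, ablates the head, and compares the drop along a readout direction with LayerNorm's scale frozen at its clean value versus recomputed. The recomputed case recovers part of the drop, which is the self-repair effect.

```python
# Toy construction (not from the post): ablating a head shrinks the residual
# stream's scale, so LayerNorm divides by a smaller std and boosts what remains.
import numpy as np

rng = np.random.default_rng(0)
d = 64
u = rng.normal(size=d)
u /= np.linalg.norm(u)                         # readout (e.g. logit) direction

background = 4.0 * u + rng.normal(size=d)      # rest of the residual stream
head = 4.0 * u + 0.3 * rng.normal(size=d)      # the attention head we ablate

def layer_norm(x):
    return (x - x.mean()) / x.std()

full = background + head                       # clean residual stream
ablated = background                           # head's output removed

logit_full   = u @ layer_norm(full)
logit_frozen = u @ ((ablated - ablated.mean()) / full.std())  # LN scale frozen
logit_abl    = u @ layer_norm(ablated)                        # LN recomputed

direct = logit_full - logit_frozen             # drop if LayerNorm did not adapt
actual = logit_full - logit_abl                # drop once LayerNorm rescales
print(f"share of the drop hidden by rescaling: {1 - actual / direct:.0%}")
```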

Key Points
  • LayerNorm contributes ~30% of self-repair in Pythia-160M when attention heads are ablated
  • Quantization formats (FP8, INT4) create vulnerabilities to bit-flip attacks via hardware fault injection (see the sketch after this list)
  • The authors define 'substrate' as computational context beyond the formal model function f(x) = y
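
As a rough illustration of the bit-flip point (again an assumed setup, not the post's experiments, with an arbitrary example weight and a 1/127 scale), the sketch below flips the most significant bit of an INT8-stored weight; after dequantization the weight jumps from near zero to the edge of its representable range.

```python
# Illustrative only: with just 8 stored bits, one flipped bit can move a
# dequantized weight across much of its representable range.
import numpy as np

scale = 1.0 / 127.0                          # assumed symmetric per-tensor scale
w = np.array([0.05], dtype=np.float32)       # nominal weight value

q = np.round(w / scale).astype(np.int8)      # the byte that actually sits in memory
print("stored:", q[0], "-> dequantized:", float(q[0] * scale))

bits = q.view(np.uint8)                      # reinterpret the stored byte
bits ^= 0x80                                 # simulated hardware fault: flip the MSB
print("flipped:", q[0], "-> dequantized:", float(q[0] * scale))
```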

Why It Matters

Implementation details often ignored in safety analyses can significantly alter model behavior and create new attack surfaces.