AI Safety

Power Steering: Behavior Steering via Layer-to-Layer Jacobian Singular Vectors

A new method computes steering vectors in roughly 15 forward passes instead of costly optimization, mapping a model's entire layer-to-layer sensitivity structure.

Deep Dive

Researchers have developed 'Power Steering,' a new technique for controlling large language model (LLM) behavior by analyzing the mathematical relationships between neural network layers. Created by Omar Ayyub, the method computes layer-to-layer Jacobians—matrices that show how small changes in one layer's activations affect later layers—and extracts their most influential components via singular value decomposition. Crucially, Power Steering uses power iteration to approximate these top components in just ~15 forward passes through the model, making it vastly more efficient than previous approaches.
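The core numerical idea can be illustrated in a few lines. Below is a minimal numpy sketch, not the author's implementation: the layer-to-layer Jacobian is replaced by an explicit toy matrix with a clear spectral gap, and the `jvp`/`vjp` helpers stand in for the Jacobian-vector products that would, in a real LLM, each cost roughly one forward pass. Power iteration on JᵀJ then recovers the top right singular vector, the candidate steering direction at the source layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the layer-to-layer Jacobian, built with a clear spectral
# gap so power iteration converges quickly. In the real method the Jacobian
# is never materialized; it is only touched through Jacobian-vector products.
U, _ = np.linalg.qr(rng.standard_normal((64, 64)))
V, _ = np.linalg.qr(rng.standard_normal((32, 32)))
s = np.concatenate(([10.0], rng.uniform(0.5, 3.0, size=31)))
J = (U[:, :32] * s) @ V.T  # 64x32 "Jacobian", top singular value 10

def jvp(v):
    # Jacobian-vector product; in an LLM this is roughly one forward pass
    # (forward-mode autodiff or a finite-difference perturbation).
    return J @ v

def vjp(u):
    # Transposed product, obtained via reverse-mode autodiff in practice.
    return J.T @ u

def top_right_singular_vector(dim=32, n_iters=15):
    """Power iteration on J^T J: approximates the top right singular
    vector, i.e. the direction at the source layer whose perturbation
    most strongly moves the target layer."""
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        v = vjp(jvp(v))
        v /= np.linalg.norm(v)
    return v

v_est = top_right_singular_vector()
v_true = np.linalg.svd(J)[2][0]  # ground truth from a full SVD (toy only)
alignment = abs(v_est @ v_true)  # close to 1.0 after ~15 iterations
```

Each iteration costs one JVP and one VJP, which is why ~15 iterations translates to a small, fixed number of passes through the model rather than an open-ended optimization loop.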

This efficiency enables researchers to systematically map every possible source/target layer pair in a model, creating a comprehensive 'sensitivity map' of behavioral controls. Unlike contrastive activation addition (CAA), which requires carefully crafted prompt pairs, Power Steering can discover latent behaviors and steering vectors from single prompts. The technique produces results comparable to more computationally expensive methods like MELBO (which uses non-linear optimization) while being orders of magnitude faster, opening new possibilities for interpretability research and controlled AI behavior modification.

Key Points
  • Computes steering vectors in ~15 forward passes via power iteration on layer Jacobians
  • Maps all source/target layer pairs to create complete model sensitivity profiles
  • Achieves comparable performance to MELBO's non-linear optimization at dramatically lower cost
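Because each layer pair is so cheap, the all-pairs sensitivity map reduces to a double loop. The sketch below uses a hypothetical linear toy network (names like `pair_jacobian` are illustrative, not from the paper), where the layer-to-layer Jacobian is exactly a product of weight matrices, so the map can be checked against a full SVD:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy model: four linear "layers" of width 16. For a linear
# network the layer-to-layer Jacobian is just a product of weight matrices,
# which keeps this sketch exact and easy to verify.
d, n_layers = 16, 4
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def pair_jacobian(src, tgt):
    """Jacobian of layer `tgt`'s output w.r.t. layer `src`'s output."""
    J = np.eye(d)
    for W in weights[src + 1 : tgt + 1]:
        J = W @ J
    return J

def top_singular_value(J, n_iters=15):
    # Power iteration on J^T J; the top singular value measures how strongly
    # a perturbation at the source layer can move the target layer.
    v = rng.standard_normal(J.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        w = J.T @ (J @ v)
        v = w / np.linalg.norm(w)
    return np.linalg.norm(J @ v)

# Sensitivity map over all source < target layer pairs.
sensitivity = np.zeros((n_layers, n_layers))
for src in range(n_layers):
    for tgt in range(src + 1, n_layers):
        sensitivity[src, tgt] = top_singular_value(pair_jacobian(src, tgt))
```

In a real transformer the inner product `J.T @ (J @ v)` would be replaced by one forward-mode and one reverse-mode pass per iteration, keeping the per-pair cost at a fixed, small number of passes even when sweeping every layer pair.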

Why It Matters

Enables affordable, systematic exploration of AI model internals for safety research and controlled behavior modification.