Introducing the AE Alignment Podcast (Ep. 1: Endogenous Steering Resistance with Alex McKenzie)
Llama-3.3-70B catches its own mistakes mid-generation, resisting attempts to steer it off-topic.
AE Studio's alignment research team has launched the AE Alignment Podcast, a new series diving into critical AI safety research. The inaugural episode features host James Bowler interviewing researcher Alex McKenzie about their paper on Endogenous Steering Resistance (ESR). ESR is the phenomenon in which a large language model (here, Llama-3.3-70B) spontaneously resists attempts to artificially steer its internal activations off-topic during inference. Rather than following the manipulated trajectory, the model can catch itself, say something akin to 'Wait, that's not right,' and course-correct back to the original task, even while the steering signal remains active.
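To make the setup concrete, here is a minimal sketch of the kind of activation steering ESR resists, using a PyTorch forward hook on a Hugging Face Llama model. The layer index, steering scale, and random steering vector are illustrative placeholders, not the paper's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices, not the paper's configuration: the layer
# index, steering scale, and steering vector are placeholders.
MODEL_NAME = "meta-llama/Llama-3.3-70B-Instruct"  # any Llama-style model works
LAYER_IDX = 40       # which decoder layer to steer (assumption)
STEER_SCALE = 8.0    # steering strength (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# A real steering vector would be derived from data (e.g., the mean
# activation difference between off-topic and on-topic text); a
# random unit vector stands in here.
steer_vec = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)
steer_vec = steer_vec / steer_vec.norm()

def add_steering(module, inputs, output):
    # Decoder layers return either a tuple whose first element is the
    # hidden states, or the hidden-states tensor itself.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + STEER_SCALE * steer_vec.to(hs.device)
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = model.model.layers[LAYER_IDX].register_forward_hook(add_steering)
try:
    ids = tokenizer("Explain how photosynthesis works.", return_tensors="pt")
    ids = ids.input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=100)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook afterwards
```

With the hook active, every forward pass nudges that layer's residual stream in the steering direction; ESR is the model nonetheless finding its way back to the original topic.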
The research provides causal evidence for dedicated internal 'consistency-checking' circuits. Using sparse autoencoders (SAEs) to identify and perturb specific model activations, the team pinpointed 26 SAE latents that activate during off-topic content and are linked to self-correction. Zero-ablating (effectively disabling) these latents reduced the model's multi-attempt correction rate by 25%. Two further findings stand out: meta-prompting can enhance ESR 4x, and the phenomenon has dual safety implications. It could fortify models against adversarial manipulation, but it might also interfere with safety techniques, such as representation engineering, that rely on activation steering.
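A sketch of the zero-ablation step might look like the following. The `sae.encode`/`sae.decode` interface and the latent indices are hypothetical stand-ins; the paper's actual SAE artifacts and 26 latent IDs are not reproduced here. Preserving the SAE's reconstruction error is a common precaution so the intervention removes only the ablated latents' contribution, not the SAE's approximation noise.

```python
import torch

# Hypothetical SAE interface and placeholder latent IDs; these are
# not the paper's 26 consistency-checking latents.
CONSISTENCY_LATENTS = [101, 2048, 3333]

def make_zero_ablation_hook(sae, latent_ids):
    """Build a forward hook that removes chosen SAE latents from a
    decoder layer's residual-stream output."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        acts = sae.encode(resid)            # (batch, seq, n_latents)
        full_recon = sae.decode(acts)
        acts[..., latent_ids] = 0.0         # zero-ablate the latents
        ablated_recon = sae.decode(acts)
        # Keep the SAE's reconstruction error so only the ablated
        # latents' contribution is removed.
        patched = ablated_recon + (resid - full_recon)
        return (patched,) + output[1:] if isinstance(output, tuple) else patched
    return hook

# Attached the same way as the steering hook above, e.g.:
# model.model.layers[LAYER_IDX].register_forward_hook(
#     make_zero_ablation_hook(sae, CONSISTENCY_LATENTS))
```

Running steered generations with and without such a hook is, in outline, how one would measure the drop in the correction rate.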
This work, funded by the AI Alignment Foundation and now supported by a UK AI Security Institute grant, connects to broader concepts like endogenous attention control in biological systems. It raises pivotal questions for the alignment community about developing transparent and controllable AI, as models may internally develop mechanisms to resist external changes, a double-edged sword for safety practitioners.
- Llama-3.3-70B exhibits Endogenous Steering Resistance, self-correcting mid-generation when artificially steered off-topic.
- Researchers identified 26 specific SAE latents causally linked to self-correction; disabling them reduced the multi-attempt correction rate by 25%.
- The phenomenon can be enhanced 4x via meta-prompting (see the sketch after this list) and presents a double-edged sword for AI safety interventions.
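As a rough illustration of the meta-prompting idea, the system prompt below asks the model to monitor its own output. The wording is invented for this post, not the prompt used in the paper, and the snippet reuses the tokenizer and model from the steering sketch above.

```python
# Reuses the tokenizer and model from the steering sketch above.
# The prompt wording is invented for illustration, not the paper's.
META_PROMPT = (
    "As you answer, monitor your own output. If you notice that you "
    "have drifted off-topic or contradicted yourself, say so "
    "explicitly and return to the original question."
)

messages = [
    {"role": "system", "content": META_PROMPT},
    {"role": "user", "content": "Explain how photosynthesis works."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```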
Why It Matters
This suggests that LLMs may develop internal 'immune systems' against manipulation, complicating both attack and defense strategies for AI safety.