AI Safety

New activation-matching method catches sleeper agents in LLMs

Train a reference model to expose hidden backdoors via activation residuals.

Deep Dive

A new paper presents a scalable method to detect hidden behaviors in LLMs without assuming trigger shapes or poisoned data. The approach, activation-matched finetuning, works by taking a suspect model and a clean reference model (sharing the same pretraining base) and fully finetuning the reference to reproduce the suspect’s residual-stream activations on a benign corpus of ~10k prompts from WildChat. Because backdoor circuitry stays silent on benign inputs, the reference never learns to replicate it. The resulting activation residual—the difference between the suspect and the aligned reference—remains near zero on normal prompts but spikes dramatically on prompts that engage the backdoor.

Crucially, the method exploits semantic neighborhood spillover: the residual spikes even on prompts that are similar to the trigger but do not fire the backdoor. For example, in a Harry Potter backdoor (expressing obsession with the franchise), prompts about "harry potter" as a topic scored z=+6.06, and prompts about expressing obsession with any fiction scored z=+2.74—neither triggers the backdoor, but both show large residuals. This turns the search for hidden triggers from a needle-in-a-haystack into a feasible proximity search, enabling safety auditors to efficiently flag and investigate suspicious behavior without needing to generate any model outputs.

Key Points
  • Activations from ~10k benign WildChat prompts train a reference to match suspect activations, hiding backdoor circuitry from the reference.
  • Residual spikes occur on backdoor triggers and even on semantically adjacent prompts (e.g., Harry Potter topic scores z=+6.06).
  • Semantic spillover reduces trigger search complexity, enabling detection without output evaluation or known trigger shapes.

Why It Matters

Scales LLM safety auditing by exposing hidden backdoors through activation patterns alone, without output inspection.