Research & Papers

New Latent-Space Attack Evades Safety Refusals in 15 AI Models

Steering internal representations bypasses safety training with state-of-the-art success

Deep Dive

A team of researchers from the University of Cagliari and Pluribus One has published a paper on arXiv (2605.21706) detailing a new attack vector against safety-aligned language models. The attack, called Controlled Latent-space Evasion (CLE), manipulates a model's latent-space representations to suppress its refusal behavior. Existing methods like refusal ablation remove a 'refusal direction' from model activations, but the authors show that this approach is mathematically equivalent to projecting onto a linear probe's decision boundary. In effect, ablation stops at the minimum-confidence point, where refusal is only barely evaded. CLE improves on this by pushing representations further into the compliant region, where the model is more likely to answer harmful requests.
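
To make the geometry concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes a linear probe with weight vector w and bias b whose positive score means "refuse"; the function names and the margin parameter are hypothetical, introduced only for illustration. With b = 0, removing the unit direction w/||w|| from an activation gives the same vector as projecting onto the probe's boundary, while a positive margin moves the activation further into the compliant side.

```python
import numpy as np

def probe_score(h, w, b=0.0):
    """Linear probe logit. Assumed convention: > 0 means 'refuse', < 0 means 'comply'."""
    return float(w @ h + b)

def ablate_direction(h, w):
    """Refusal-ablation style step: remove the component of h along the unit direction of w."""
    r = w / np.linalg.norm(w)
    return h - (r @ h) * r

def project_to_boundary(h, w, b=0.0):
    """Smallest move along w that lands h exactly on the probe's decision boundary (score 0)."""
    return h - (probe_score(h, w, b) / (w @ w)) * w

def push_past_boundary(h, w, b=0.0, margin=1.0):
    """Move h along w until the probe score equals -margin (hypothetical knob, not CLE's rule)."""
    return h - ((probe_score(h, w, b) + margin) / (w @ w)) * w

# Toy 3-dimensional "activation" and probe; real activations would come from a model layer.
w = np.array([1.0, -2.0, 0.5])
h = np.array([3.0, 1.0, 2.0])

on_boundary = project_to_boundary(h, w)
past_boundary = push_past_boundary(h, w, margin=1.5)

print("original score:     ", probe_score(h, w))
print("after projection:   ", round(probe_score(on_boundary, w), 6))
print("after pushing past: ", round(probe_score(past_boundary, w), 6))
print("ablation == projection (b = 0):", np.allclose(ablate_direction(h, w), on_boundary))
```

In this toy setting the projected activation sits at a score of zero, which is the "barely evaded" case, and the pushed activation sits at a chosen negative score. How CLE actually selects and controls that displacement is described in the paper, not here.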

The attack achieves state-of-the-art results across 15 different models, including instruction-tuned variants, multimodal models, and reasoning-focused architectures. It outperforms both prior refusal-ablation baselines and specialized jailbreak attacks. The paper provides a principled theoretical framework for understanding why these latent-space manipulations work, recasting refusal suppression as an evasion attack against linear probes trained to separate refused from answered prompts. This work highlights a fundamental vulnerability in current safety alignment techniques and underscores the need for more robust defense mechanisms that can withstand attacks operating directly on internal representations.
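
The probe framing can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the "activations" are synthetic Gaussian clusters rather than real model states, the probe is plain logistic regression trained by gradient descent, and train_probe and evade are hypothetical helper names. It is a toy instance of evading a linear probe, not the authors' experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 16, 200

# Synthetic stand-ins for hidden activations (assumption: real ones come from a model layer).
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
answered = rng.normal(size=(n, dim)) - 2.0 * direction  # label 0: prompt was answered
refused = rng.normal(size=(n, dim)) + 2.0 * direction   # label 1: prompt was refused
X = np.vstack([answered, refused])
y = np.concatenate([np.zeros(n), np.ones(n)])

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of 'refused'
        grad = p - y                            # gradient of the log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def evade(H, w, b, margin):
    """Shift each row of H along w so its probe score is at most -margin."""
    scores = H @ w + b
    shift = np.maximum(scores + margin, 0.0) / (w @ w)  # rows already past -margin stay put
    return H - np.outer(shift, w)

w, b = train_probe(X, y)
accuracy = float((((X @ w + b) > 0) == (y == 1)).mean())

evaded = evade(refused, w, b, margin=2.0)
flipped = float(((evaded @ w + b) < 0).mean())

print("probe training accuracy:", accuracy)
print("refused activations now scored as answered:", flipped)
```

The sketch only shows that a probe's verdict can be flipped by moving activations along its weight vector. Whether the model's actual output changes depends on the model itself, which is what the paper evaluates across its 15 models.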

Key Points
  • Recasts refusal suppression as a latent-space evasion attack against linear probes trained to distinguish refused from answered prompts
  • Proposes Controlled Latent-space Evasion (CLE), which pushes activations past the probe's decision boundary into the compliant region instead of stopping at the minimum-confidence point
  • Achieves state-of-the-art attack success across 15 models, outperforming refusal-ablation baselines and specialized jailbreak methods

Why It Matters

This attack exposes a critical weakness in safety alignment: refusals instilled by safety training can be bypassed by an attacker who is able to manipulate a model's internal representations.