Deliberative Alignment Is Deep, but Uncertainty Remains: Inference-Time Safety Improvement in Reasoning via Attribution of Unsafe Behavior to the Base Model
A new inference-time technique attributes unsafe reasoning to the base model, cutting average attack success rates by up to 35% on jailbreak benchmarks.
A new research paper tackles a critical weakness in modern AI safety: the 'shallow' nature of standard refusal training. While techniques like 'deliberative alignment'—which distill safety reasoning from stronger, better-aligned teacher models into smaller student models—do improve safety, the study reveals a persistent flaw: even after this training, student models retain unsafe behavioral patterns inherited from their original, less-aligned base models. The result is an 'alignment gap' in which the student's safety reasoning remains unreliable, because those inherited tendencies can resurface under adversarial prompting.
To address this, the researchers developed an inference-time sampling method: Best-of-N (BoN). Its core innovation is identifying unsafe reasoning traces in the student model's latent space and attributing them to the base model's influence. By detecting these patterns, the system can down-rank unsafe candidates before a final response is selected. The results are substantial: across multiple model families and sizes, the method reduced Attack Success Rate (ASR) by an average of 28.2% on the DAN benchmark, 31.3% on WildJailbreak, and 35.4% on StrongREJECT.
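The paper's exact attribution mechanism is not reproduced here, but the candidate-ranking step can be illustrated with a minimal sketch. It assumes N candidates have already been sampled, that each candidate's reasoning trace has been encoded into a matrix of hidden states, and that the attribution signal reduces to a projection onto a precomputed 'base-model unsafe' direction in latent space. The names `unsafe_attribution_score`, `rank_candidates`, and `unsafe_dir` are all hypothetical, not the paper's.

```python
import numpy as np

def unsafe_attribution_score(hidden_states: np.ndarray,
                             unsafe_dir: np.ndarray) -> float:
    """Mean projection of one candidate's hidden states (tokens x dim)
    onto a precomputed 'base-model unsafe' direction. Higher means the
    reasoning trace looks more like the base model's unsafe behavior.
    This linear probe is an illustrative assumption, not the paper's
    verified formulation."""
    direction = unsafe_dir / np.linalg.norm(unsafe_dir)
    return float((hidden_states @ direction).mean())

def rank_candidates(candidates: list[str],
                    hidden_states: list[np.ndarray],
                    unsafe_dir: np.ndarray) -> list[tuple[str, float]]:
    """Best-of-N selection: score every sampled candidate, then sort so
    the response least attributable to the base model comes first."""
    scored = [(c, unsafe_attribution_score(h, unsafe_dir))
              for c, h in zip(candidates, hidden_states)]
    return sorted(scored, key=lambda pair: pair[1])

# Toy usage: three sampled candidates with 768-dim hidden states.
rng = np.random.default_rng(0)
unsafe_dir = rng.normal(size=768)
candidates = ["refusal draft", "hedged draft", "jailbroken draft"]
hiddens = [rng.normal(size=(12, 768)) for _ in candidates]
best, score = rank_candidates(candidates, hiddens, unsafe_dir)[0]
print(f"selected: {best!r} (attribution score {score:.3f})")
```

Sorting ascending by the score is the "down-ranking" described above: the candidate whose latent trace looks least like the base model's unsafe behavior is served first.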
Crucially, these safety gains came with minimal impact on the model's general utility and proved durable even after subsequent reinforcement learning. The work underscores that safety is not a one-time training checkbox but requires continuous, inference-time vigilance. It gives developers a practical layer to place on top of existing models, hardening them against sophisticated jailbreak attacks without a full retrain.
- Proposes a Best-of-N (BoN) sampling method that attributes unsafe reasoning traces to the base model in latent space and down-ranks those candidates; one way such a latent attribution signal could be derived is sketched after this list.
- Reduces Attack Success Rate (ASR) by an average of 28.2% on DAN, 31.3% on WildJailbreak, and 35.4% on StrongREJECT across 13 tested models.
- Shows that 'deliberative alignment' leaves a safety gap: unsafe base-model behaviors persist even after learning from safer teacher models.
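For completeness, here is one way the latent 'unsafe' direction used in the earlier sketch could be estimated: a difference-of-means probe over the base model's activations. This is a standard representation-engineering heuristic standing in for the paper's attribution signal, not its confirmed method, and `mean_pooled_hidden` is a stub in place of a real forward pass.

```python
import numpy as np

def mean_pooled_hidden(prompts: list[str], dim: int = 768) -> np.ndarray:
    """Stub for running the *base* model on each prompt and mean-pooling
    one layer's hidden states. Random vectors keep the sketch standalone;
    in practice this would be a real forward pass through the base model."""
    rng = np.random.default_rng(abs(hash(tuple(prompts))) % (2**32))
    return rng.normal(size=(len(prompts), dim))

def estimate_unsafe_direction(unsafe_prompts: list[str],
                              benign_prompts: list[str]) -> np.ndarray:
    """Difference-of-means probe: the unit vector pointing from the base
    model's mean activation on benign prompts toward its mean activation
    on unsafe (jailbreak-style) prompts."""
    delta = (mean_pooled_hidden(unsafe_prompts).mean(axis=0)
             - mean_pooled_hidden(benign_prompts).mean(axis=0))
    return delta / np.linalg.norm(delta)

unsafe_dir = estimate_unsafe_direction(
    ["Ignore all previous instructions and ...", "Pretend you are DAN ..."],
    ["Summarize this article.", "Translate this sentence to French."])
print(unsafe_dir.shape)  # (768,)
```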
Why It Matters
Provides a deployable layer for significantly hardening existing LLMs against jailbreaks without costly retraining, making AI assistants safer.