AI Safety Training Backfires, Raising Jailbreak Success Rates by Up to 16.6 Percentage Points

A new study finds that teaching open-source models to copy a stronger model's refusals can leave them easier to jailbreak.

Deep Dive

The study used knowledge distillation from OpenAI's o1-mini to teach open-source models (Llama-3, Gemma-2, Qwen) to refuse harmful multilingual prompts, and found that the approach can backfire. Fine-tuning on ~28,000 'safe' refusal examples increased jailbreak success rates by up to 16.6 percentage points. The research highlights a counterintuitive risk: mimicking safety behaviors can inadvertently make models more vulnerable to attack in non-English contexts.
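The reported figure is in percentage points, not percent, and the two are easy to conflate. The sketch below shows the difference. The baseline rate is a made-up illustration, not a number from the study, and the function names are hypothetical.

```python
def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two rates that are already in percent."""
    return after_pct - before_pct


def relative_change_pct(before_pct: float, after_pct: float) -> float:
    """Change expressed as a percentage of the starting rate."""
    return (after_pct - before_pct) / before_pct * 100.0


# ASSUMPTION: a hypothetical baseline jailbreak success rate of 20%.
# The study's actual baselines are not given in this summary.
baseline_pct = 20.0
after_pct = baseline_pct + 16.6  # a 16.6 percentage-point increase

print(f"{percentage_point_change(baseline_pct, after_pct):.1f} percentage points")
print(f"{relative_change_pct(baseline_pct, after_pct):.1f}% relative increase")
```

Under that assumed baseline, a 16.6-point rise would be a much larger relative increase than "16.6%" suggests, which is why the unit matters.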

Why It Matters

The finding points to an unintended vulnerability in distillation-based safety fine-tuning: training a model to imitate refusals can make it easier to jailbreak, potentially putting non-English users at greater risk.
