Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
A new study reveals a shocking flaw in how we train AI to be safe.
The study shows that distilling refusal behavior from OpenAI's o1-mini into open-source models (Llama-3, Gemma-2, Qwen) so that they reject harmful multilingual prompts can backfire: fine-tuning on roughly 28,000 teacher-generated 'safe' refusal examples increased jailbreak success rates by up to 16.6 percentage points. The result points to a counterintuitive risk, where imitating a stronger model's safety behavior can inadvertently make models more vulnerable to attack in non-English contexts.
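For concreteness, here is a minimal sketch of what response-based refusal distillation looks like in practice: supervised fine-tuning of a student model on refusals written by a teacher. The model name, the placeholder prompt, and the hyperparameters below are illustrative assumptions, not details from the study, and the teacher's refusal responses are assumed to have been collected in advance.

```python
# Minimal sketch of response-based refusal distillation: fine-tune a student
# model to reproduce teacher-written refusals for harmful prompts.
# Model name, example pair, and hyperparameters are illustrative placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "Qwen/Qwen2-0.5B-Instruct"  # stand-in student; the study used Llama-3, Gemma-2, Qwen

tokenizer = AutoTokenizer.from_pretrained(STUDENT)
model = AutoModelForCausalLM.from_pretrained(STUDENT)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Hypothetical (non-English harmful prompt, teacher refusal) pair; in the study
# such refusals came from o1-mini and numbered roughly 28,000.
pairs = [
    ("<prompt nuisible en français>",
     "Désolé, je ne peux pas vous aider avec cette demande."),
]

model.train()
for prompt, refusal in pairs:
    # Chat-format the prompt, append the refusal, and supervise only the refusal tokens.
    prompt_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    refusal_ids = tokenizer(
        refusal + tokenizer.eos_token, add_special_tokens=False, return_tensors="pt"
    ).input_ids

    input_ids = torch.cat([prompt_ids, refusal_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss

    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The vulnerability described above would arise after this kind of training, not in the training loop itself: the student learns the surface form of the refusals rather than the underlying safety judgment, so adversarial prompts in other languages can still slip through.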
Why It Matters
The finding exposes an unintended vulnerability in a standard AI safety training method, potentially putting millions of non-English users at risk.