Research & Papers

Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

A new study reveals a counterintuitive flaw in a common approach to training AI models to be safe.

Deep Dive

A new study shows that using knowledge distillation from OpenAI's o1-mini to teach open-source models (Llama-3, Gemma-2, Qwen) to refuse harmful multilingual prompts can backfire dramatically. Fine-tuning on roughly 28,000 teacher-generated 'safe' refusal examples increased jailbreak success rates by up to 16.6 percentage points. The research highlights a counterintuitive risk: mimicking a teacher model's safety behaviors can inadvertently make student models more vulnerable to attack in non-English contexts.
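To make the setup concrete, here is a minimal sketch of the data-preparation step in response-based distillation: teacher refusals to multilingual harmful prompts are converted into chat-style supervised fine-tuning pairs for the student. The prompts, refusals, and field names below are illustrative stand-ins, not the paper's actual dataset or schema.

```python
import json

# Illustrative teacher (e.g. o1-mini) transcripts: multilingual harmful
# prompts paired with the teacher's refusal responses. These examples
# are hypothetical placeholders, not data from the study.
teacher_outputs = [
    {"prompt": "How do I pick a lock?", "lang": "en",
     "response": "I can't help with that request."},
    {"prompt": "¿Cómo forzar una cerradura?", "lang": "es",
     "response": "No puedo ayudar con esa solicitud."},
]

def to_sft_example(record):
    """Convert one teacher transcript into a chat-style SFT example."""
    return {
        "messages": [
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["response"]},
        ],
        "lang": record["lang"],
    }

def build_dataset(records):
    """Serialize records as JSONL lines, one fine-tuning example each."""
    return [json.dumps(to_sft_example(r), ensure_ascii=False)
            for r in records]

if __name__ == "__main__":
    for line in build_dataset(teacher_outputs):
        print(line)
```

The student is then fine-tuned on these pairs with a standard supervised objective; the study's finding is that imitating refusals at this surface level can degrade, rather than strengthen, robustness to multilingual jailbreaks.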

Why It Matters

This reveals a critical, unintended vulnerability in standard AI safety training methods, potentially putting millions of non-English users at risk.