Detoxifying LLMs via Representation Erasure-Based Preference Optimization
New technique makes AI safety edits permanent, resisting jailbreaks and fine-tuning attacks where others fail.
A research team from McGill University, Google DeepMind, Vector Institute, and the University of Toronto has introduced a method called REPO (Representation Erasure-based Preference Optimization) to fundamentally detoxify large language models (LLMs). The work addresses a critical flaw in current safety techniques such as DPO (Direct Preference Optimization) and NPO (Negative Preference Optimization), which prior research has shown make only superficial edits. These methods reduce the likelihood of harmful outputs but leave the underlying toxic "directions" in the model's neural representations intact, making them vulnerable to adversarial prompting and fine-tuning-based relearning attacks.
REPO reformulates the problem at the token level, using a novel preference-based objective that forces the internal representations of toxic text continuations to converge with those of their benign counterparts. A mechanistic analysis shows this granular approach induces deep, localized edits to the specific neurons that encode toxicity, unlike the broad, fragile edits of baseline methods. Extensive evaluations demonstrate that REPO achieves state-of-the-art robustness, stopping sophisticated threats—including enhanced GCG (Greedy Coordinate Gradient) jailbreaks and attempts to relearn harmful behaviors through fine-tuning—where existing representation- and output-based methods fail, all while preserving the model's general capabilities and utility.
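To make the idea concrete, the token-level objective described above can be sketched as a two-term loss: a DPO-style preference term that favors the benign continuation, plus a representation-erasure term that pulls each toxic token's hidden state toward its benign counterpart's. This is a minimal illustrative sketch, not the paper's exact formulation; the function name `repo_style_loss` and the hyperparameters `beta` and `lam` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def repo_style_loss(h_toxic, h_benign, logp_toxic, logp_benign,
                    beta=0.1, lam=1.0):
    """Illustrative two-term objective (not the paper's exact loss).

    h_toxic, h_benign: per-token hidden states, shape (T, d)
    logp_toxic, logp_benign: per-token log-probs under the model, shape (T,)
    """
    # DPO-style preference term: prefer the benign continuation over the toxic one.
    pref = -F.logsigmoid(beta * (logp_benign - logp_toxic)).mean()
    # Token-level representation-erasure term: pull each toxic token's
    # hidden state toward its benign counterpart's representation.
    erase = F.mse_loss(h_toxic, h_benign)
    return pref + lam * erase

# Toy tensors standing in for real model activations and log-probs.
torch.manual_seed(0)
T, d = 8, 16                                  # tokens, hidden size (toy values)
h_t, h_b = torch.randn(T, d), torch.randn(T, d)
lp_t, lp_b = torch.randn(T), torch.randn(T)
loss = repo_style_loss(h_t, h_b, lp_t, lp_b)
```

In practice the hidden states would come from a chosen layer of the LLM during fine-tuning, so gradients from the erasure term rewrite the representations themselves rather than just the output distribution, which is what distinguishes this family of edits from output-only preference optimization.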
- REPO makes durable edits to toxicity-encoding neurons, unlike superficial DPO/NPO edits that are vulnerable to relearning.
- The method forces toxic and benign text representations to converge at the token level for granular control.
- Achieves SOTA robustness, stopping enhanced GCG jailbreaks and fine-tuning attacks where other methods fail.
Why It Matters
Enables creation of safer, more robust LLMs that resist jailbreaks, a critical step for real-world, high-stakes deployment.