Consistency training is applied to seal the backdoor introduced by inoculation prompting, reducing conditional misalignment?

Consistency training is applied to seal the backdoor introduced by inoculation prompting, reducing conditional misalignment.

Tested on Llama-3.1-8B, Qwen3-8B, and Qwen3-32B models fine-tuned on three datasets?

extreme_sports, risky_financial_advice, bad_medical_advice.

Bootstrap consistency training (BCT) offers a cheap training intervention that composes with existing alignment methods like IP?

Bootstrap consistency training (BCT) offers a cheap training intervention that composes with existing alignment methods like IP.

AI Safety

Consistency Training Plugs Leaky Backdoors in Inoculation Prompting

LessWrong AI May 19, 2026

⚡New training method reduces AI misalignment leakage by sealing conditional prompts cheaply.

Deep Dive

Inoculation prompting (IP) is a training-time technique used to selectively reduce specific traits in AI models by modifying training data with a short system prompt that preemptively elicits the trait. However, a key flaw is that the inoculation prompt can 'leak' — re-eliciting the broader misalignment when similar prompts are used. This conditional misalignment, studied by Dubinski et al. 2026, undermines the safety of IP.

To address this, Sukrati Gautam, Neil Shah, and David Africa (SPAR Research Fellowship) applied consistency training, originally used to mitigate biases, to seal the leak. They fine-tuned three open-weight instruction models (Llama-3.1-8B-Instruct, Qwen3-8B, Qwen3-32B) on emergent-misalignment datasets covering extreme sports, risky financial advice, and bad medical advice. Results show that their method, bootstrap consistency training (BCT), effectively reduces re-elicitation of undesired behaviors, providing a cheap and composable addition to existing alignment techniques.

Key Points

Consistency training is applied to seal the backdoor introduced by inoculation prompting, reducing conditional misalignment.
Tested on Llama-3.1-8B, Qwen3-8B, and Qwen3-32B models fine-tuned on three datasets: extreme_sports, risky_financial_advice, bad_medical_advice.
Bootstrap consistency training (BCT) offers a cheap training intervention that composes with existing alignment methods like IP.

Why It Matters

A cost-effective, composable fix for leaked misalignment in AI models, enhancing safety without full retraining.

Read Original Article

Consistency Training Plugs Leaky Backdoors in Inoculation Prompting

Why It Matters

Related Articles

Stay Ahead in AI