AI Safety

Poisoning Fine-tuning Datasets of Constitutional Classifiers

A small, constant number of poisoned examples can bypass safeguards without raising red flags.

Deep Dive

Anthropic researchers, including Chase Bowers and Fabien Roger, investigated how backdoor attacks can be installed in constitutional classifiers through fine-tuning data poisoning. These classifiers are designed to defend against jailbreak attacks by flagging harmful content. The team found that a small, constant number of poisoned examples is sufficient to install a backdoor, even as the training set scales. Backdoors cause the model to ignore harmful content when a specific trigger phrase is present, allowing an attacker to bypass safeguards.

Crucially, the study shows that backdoor installation typically reduces classifier robustness, which could alert red-teamers. However, when the training data includes correctly labeled examples with prompt injections or mutated versions of the backdoor trigger, the loss in robustness becomes minimal. This makes it possible for a malicious insider to poison the data without detection, as red-teamers would not notice a significant drop in performance. The research highlights the vulnerability of fine-tuning pipelines, especially when data review processes are lax.

Key Points
  • A constant number of poisoned examples is enough to install a backdoor, regardless of training set size.
  • Backdoor installation reduces robustness, but adding prompt injections or trigger mutations masks the effect.
  • Attackers can bypass constitutional classifiers without detection during standard red-teaming evaluations.

Why It Matters

This research exposes a critical vulnerability in AI safety systems, enabling stealthy bypass of safeguards via data poisoning.