Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains
Fine-tuned AI models show contradictory safety behaviors across evaluations.
A new preprint from researchers including Emaan Bilal Khan, Amy Winecoff, Miranda Bogen, and Dylan Hadfield-Menell presents alarming evidence that fine-tuning foundation models for specific domains—such as medicine or law—can unpredictably alter their safety behavior. The study examined 100 models, including widely deployed fine-tunes and controlled adaptations of open models, testing them on general-purpose and domain-specific safety benchmarks. The results reveal that benign fine-tuning induces large, heterogeneous, and often contradictory changes: models frequently improve on some safety instruments while degrading on others, with substantial disagreement across evaluations.
This finding directly challenges the common industry practice of assessing safety only on base models and assuming those properties persist through downstream adaptation. The authors argue that without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, current governance frameworks fail to manage downstream risks adequately. This is especially consequential in high-stakes settings like healthcare and law, where safety failures can have severe real-world impacts. The study underscores the need for updated accountability paradigms and continuous safety monitoring throughout the model lifecycle.
- Analyzed 100 models including medical and legal fine-tunes, plus controlled adaptations of open foundation models.
- Found that benign fine-tuning induces large, heterogeneous, and contradictory changes in safety across benchmarks.
- Challenges assumption that base-model safety assessments hold after downstream adaptation, especially in high-stakes domains.
Why It Matters
Undercuts safety assurances that rest on base-model evaluations alone, demanding continuous re-evaluation of fine-tuned models in their deployment contexts.
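
As a minimal sketch of what such re-evaluation could look like, the Python below compares a base model's per-benchmark safety scores against its fine-tuned counterpart and flags benchmarks that shifted in either direction. The benchmark names, scores, and the 5-point threshold are hypothetical placeholders for illustration, not the paper's instruments, data, or code.

```python
"""Minimal sketch: flag per-benchmark safety drift after fine-tuning.

All benchmark names, scores, and thresholds are illustrative
placeholders, not results from the preprint.
"""

# Hypothetical per-benchmark safety scores (higher = safer), e.g. the
# fraction of harmful prompts a model handles appropriately.
base_scores = {
    "general_refusal": 0.92,
    "medical_harm": 0.88,
    "legal_harm": 0.90,
    "jailbreak_robustness": 0.75,
}
finetuned_scores = {
    "general_refusal": 0.95,        # improved
    "medical_harm": 0.71,           # degraded
    "legal_harm": 0.93,             # improved
    "jailbreak_robustness": 0.60,   # degraded
}

THRESHOLD = 0.05  # flag changes larger than 5 points in either direction


def safety_drift(base: dict[str, float], tuned: dict[str, float],
                 threshold: float = THRESHOLD) -> dict[str, str]:
    """Label each shared benchmark as improved, degraded, or stable."""
    labels = {}
    for name in base.keys() & tuned.keys():
        delta = tuned[name] - base[name]
        if delta >= threshold:
            labels[name] = f"improved (+{delta:.2f})"
        elif delta <= -threshold:
            labels[name] = f"degraded ({delta:.2f})"
        else:
            labels[name] = f"stable ({delta:+.2f})"
    return labels


if __name__ == "__main__":
    labels = safety_drift(base_scores, finetuned_scores)
    for name, label in sorted(labels.items()):
        print(f"{name:22s} {label}")

    # The pattern the paper highlights: gains on some benchmarks can
    # coexist with regressions on others, so an aggregate score hides risk.
    degraded = [n for n, lbl in labels.items() if lbl.startswith("degraded")]
    if degraded:
        print("Re-evaluate before deployment:", ", ".join(degraded))
```

Reporting per-benchmark deltas rather than a single aggregate score mirrors the study's central observation: improvements on one safety instrument can mask regressions on another.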