Why Fine-Tuning Encourages Hallucinations and How to Fix It
New research shows supervised fine-tuning can increase factual errors by up to 20% due to 'interference'.
A team of researchers from institutions including the Allen Institute for AI and the University of Illinois Urbana-Champaign has published a significant paper identifying a core problem in AI development: supervised fine-tuning (SFT) actively encourages models to hallucinate. Their experiments show that teaching a model new factual information through SFT can degrade its pre-existing knowledge, increasing factual errors by up to 20%. This occurs due to 'localized interference,' a phenomenon drawn from the continual learning literature, in which new knowledge overwrites or corrupts overlapping semantic representations in the model's neural network.
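To make the reported degradation concrete, here is a minimal sketch of how one might measure it: score factual accuracy on a held-out QA set before and after fine-tuning and compare the two numbers. The helper names (`factual_accuracy`, `generate_answer`) and the evaluation setup are illustrative assumptions, not the authors' evaluation harness.

```python
def factual_accuracy(model, qa_pairs, generate_answer):
    """Fraction of held-out factual questions the model answers correctly."""
    correct = sum(
        generate_answer(model, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)

# Hypothetical usage: compare the same held-out set before and after SFT.
# acc_before = factual_accuracy(base_model, heldout_qa, generate_answer)
# acc_after  = factual_accuracy(finetuned_model, heldout_qa, generate_answer)
# print(f"Factual accuracy drop after SFT: {acc_before - acc_after:.1%}")
```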
To combat this, the researchers propose a novel self-distillation-based fine-tuning method. This technique acts as a regularizer, minimizing the drift in the model's output distribution as it learns new data. By doing so, it allows for effective acquisition of new factual information while protecting the integrity of knowledge acquired during pre-training. The paper also explores a simpler alternative: freezing specific parameter groups to suppress 'factual plasticity' in scenarios where learning new facts isn't the goal, which can preserve task performance while cutting hallucinations.
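As a rough illustration of the idea, the following PyTorch sketch adds a self-distillation term to the standard SFT loss: a frozen copy of the pre-fine-tuning model serves as the teacher, and a KL-divergence penalty discourages the student's output distribution from drifting away from it. The function name, the `lambda_kd` weight, and the Hugging Face-style model interface (returning `.loss` and `.logits`) are assumptions made for this sketch, not the paper's implementation.

```python
import copy

import torch
import torch.nn.functional as F

def self_distilled_sft_loss(model, reference_model, input_ids, labels, lambda_kd=1.0):
    """Standard SFT cross-entropy plus a KL term that penalizes drift from the
    frozen pre-fine-tuning model's output distribution (self-distillation)."""
    outputs = model(input_ids=input_ids, labels=labels)
    ce_loss = outputs.loss  # learn the new factual data as usual

    with torch.no_grad():  # the frozen snapshot acts as the teacher
        ref_logits = reference_model(input_ids=input_ids).logits

    vocab_size = outputs.logits.size(-1)
    kd_loss = F.kl_div(
        F.log_softmax(outputs.logits, dim=-1).view(-1, vocab_size),
        F.softmax(ref_logits, dim=-1).view(-1, vocab_size),
        reduction="batchmean",  # mean KL per token position
    )
    return ce_loss + lambda_kd * kd_loss

# Usage sketch: snapshot and freeze the model before fine-tuning begins.
# reference_model = copy.deepcopy(model).eval()
# for p in reference_model.parameters():
#     p.requires_grad_(False)
# loss = self_distilled_sft_loss(model, reference_model, batch["input_ids"], batch["labels"])
# loss.backward()
```

The simpler freezing alternative would amount to excluding selected parameter groups from optimization; which groups actually carry 'factual plasticity' is the paper's question, so the filter below is purely a placeholder.

```python
# Hypothetical sketch: freeze MLP blocks (placeholder choice of parameter group).
# for name, param in model.named_parameters():
#     if ".mlp." in name:
#         param.requires_grad_(False)
```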
The investigation tested three hypotheses for the cause of SFT-induced hallucinations: capacity limitations, behavior cloning, and interference. The evidence strongly points to interference among overlapping semantic representations as the primary driver. The proposed self-distillation fix succeeds specifically because it mitigates this interference, offering a more stable path for model specialization. This work provides both a diagnostic framework and a practical toolkit for developers aiming to specialize models such as GPT-4 or Llama 3 into more reliable, factually consistent systems without sacrificing learned capabilities.
- Standard supervised fine-tuning (SFT) can increase factual hallucinations by up to 20% through 'interference' with pre-trained knowledge.
- The primary cause is identified as 'localized interference' where new knowledge corrupts overlapping semantic representations in the model's neural network.
- The proposed fix is a self-distillation-based SFT method that regularizes output drift, enabling new learning while preserving old knowledge.
Why It Matters
Provides a method to create more reliable, specialized AI models without breaking their core knowledge, crucial for enterprise and high-stakes applications.