Research & Papers

GRASP method removes spurious correlations from fine-tuned LLMs, cuts misalignment by 5x

New unsupervised technique reduces the political bias and misalignment that fine-tuning can introduce.

Deep Dive

Fine-tuning a pretrained language model on a curated dataset can inadvertently entangle the target task with unintended latent factors—such as misaligned personas or political slant—leading to biased outputs and poor out-of-distribution generalization. A team from University College London (Gilligan-Lee, Egan, Zhu, O'Riordan) proves that these spurious correlations can be identified without supervision from the weights of a naive LoRA fine-tune. Their method, GRASP (Gradient projection of Associated Spurious Patterns), then uses gradient projection to prevent the model from acquiring new reliance on the identified factor while preserving any pretrained content along that direction.
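To make the gradient-projection idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. This summary does not say how GRASP picks the spurious direction, so the sketch simply assumes it is the top right-singular vector of a naive LoRA update. The helper names (top_direction, project_out), the matrix sizes, and the random stand-in gradients are all hypothetical. The sketch shows one thing: if every gradient step is projected orthogonal to a direction v, the weights never change along v, so whatever the pretrained model already encoded along that direction stays as it was.

```python
import numpy as np

def top_direction(delta_w):
    """Unit right-singular vector of delta_w with the largest singular value.

    Assumption for illustration only: the spurious factor is taken to be the
    dominant input-space direction of a naive LoRA weight update.
    """
    _, _, vt = np.linalg.svd(delta_w, full_matrices=False)
    return vt[0]

def project_out(grad, v):
    """Remove from each row of grad its component along the unit vector v."""
    return grad - np.outer(grad @ v, v)

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2

# Hypothetical naive LoRA update, delta_w = B @ A
B = rng.normal(size=(d_out, rank))
A = rng.normal(size=(rank, d_in))
v = top_direction(B @ A)

# Fresh training run with projected gradient steps
w_pretrained = rng.normal(size=(d_out, d_in))
w = w_pretrained.copy()
for _ in range(5):
    grad = rng.normal(size=(d_out, d_in))  # stand-in for a real loss gradient
    w -= 0.1 * project_out(grad, v)

# How much the layer's response to the direction v changed during training
drift = np.linalg.norm((w - w_pretrained) @ v)
```

In this toy setup the weights still move in every other direction, while the layer's response to v stays at its pretrained value up to floating-point error. That is the distinction drawn with activation steering: the direction itself is left in place, and only new reliance on it is blocked.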

GRASP was validated on three tasks: emergent misalignment from training on insecure code, emergent misalignment from training on bad medical advice, and a novel political-bias experiment using right-skewed Reddit financial advice. In the insecure code case, GRASP completely removed misalignment. In the medical advice case, it reduced misalignment by roughly 5x, beating all baselines on the trade-off between misalignment reduction and task preservation. For political drift, GRASP reduced drift by more than half and even improved financial task performance. The work directly addresses a key limitation of activation steering, which removes the latent factor itself: GRASP instead removes only the spurious correlation, preserving genuine task signal.

Key Points
  • GRASP identifies spurious latent factors from LoRA weights without supervision
  • Removes misalignment entirely for insecure code; reduces it by ~5x for bad medical advice, beating all baselines on the trade-off with task preservation
  • Reduces political drift by >50% while improving financial task performance

Why It Matters

GRASP offers an unsupervised way to curb the bias and misalignment that fine-tuning can introduce in LLMs while preserving task utility.