Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes
New method identifies how fine-tuning alters models like Gemma and LLaMA, enabling targeted fixes.
A research team from McGill University and Google DeepMind has introduced Delta-Crosscoder, a method for understanding how fine-tuning changes a model's internal representations. Traditional crosscoders learn a shared dictionary of interpretable latent directions across a base model and its fine-tuned counterpart, but they struggle in 'narrow fine-tuning' regimes, where behavioral changes are localized and asymmetric. Delta-Crosscoder addresses this limitation by combining three components: BatchTopK sparsity for efficient computation, a delta-based loss function that prioritizes directions that actually change between the two models, and an implicit contrastive signal from paired activations on matched inputs.
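To make two of these components concrete, here is a minimal pure-Python sketch. `batch_topk` follows the usual reading of BatchTopK: the sparsity budget of `k * batch_size` activations is spent across the whole batch, not per sample. `delta_weighted_loss` is only an illustrative assumption about what a delta-based objective could look like (per-model reconstruction error plus a weighted error on the base-to-fine-tuned difference); it is not the loss from the paper, and the names `batch_topk`, `delta_weighted_loss`, and `delta_weight` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def batch_topk(acts: Matrix, k: int) -> Matrix:
    """Keep the k * batch_size largest activations across the whole batch.

    Unlike per-sample TopK, one sample may keep more than k latents
    while another keeps fewer. Assumes non-negative (post-ReLU) inputs.
    """
    budget = k * len(acts)
    # Flatten to (value, row, col) so the ranking spans the whole batch.
    flat = [(v, i, j) for i, row in enumerate(acts) for j, v in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    out = [[0.0] * len(row) for row in acts]
    for v, i, j in flat[:budget]:
        if v > 0.0:  # never "keep" an inactive latent
            out[i][j] = v
    return out


def squared_error(x: List[float], y: List[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def delta_weighted_loss(
    base: List[float],
    tuned: List[float],
    base_hat: List[float],
    tuned_hat: List[float],
    delta_weight: float = 1.0,
) -> float:
    """ASSUMED form of a delta-aware objective, for illustration only.

    base / tuned are paired activations on the same input;
    base_hat / tuned_hat are the crosscoder's reconstructions.
    """
    recon = squared_error(base, base_hat) + squared_error(tuned, tuned_hat)
    true_delta = [t - b for t, b in zip(tuned, base)]
    pred_delta = [t - b for t, b in zip(tuned_hat, base_hat)]
    # Extra penalty when the reconstruction misses what changed.
    return recon + delta_weight * squared_error(true_delta, pred_delta)


if __name__ == "__main__":
    acts = [[0.9, 0.8, 0.0], [0.1, 0.2, 0.0]]
    print(batch_topk(acts, k=1))
    print(delta_weighted_loss([1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0], 2.0))
```

In the small example, both surviving activations come from the first sample, which is the behavior that distinguishes BatchTopK from per-sample TopK. The delta term penalizes a reconstruction that copies the base activation and ignores the fine-tuned change, which is the kind of failure a narrow fine-tune would otherwise hide.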
The researchers validated Delta-Crosscoder across 10 different 'model organisms', including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing tasks, using Gemma, LLaMA, and Qwen models ranging from 1B to 9B parameters. The method reliably isolated latent directions causally responsible for fine-tuned behaviors and enabled effective mitigation strategies. Delta-Crosscoder outperformed SAE-based baselines while matching the performance of non-SAE-based approaches, suggesting that crosscoders remain a powerful tool for model interpretability and safety. The result gives researchers a more precise way to understand and control how fine-tuning changes model behavior.
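The summary does not say how the mitigation was carried out. One common way to act on an isolated latent direction, shown here purely as an assumed illustration and not as the authors' procedure, is to project that direction out of an activation vector. The function name `ablate_direction` is hypothetical.

```python
import math
from typing import List


def ablate_direction(activation: List[float], direction: List[float]) -> List[float]:
    """Remove the component of `activation` that lies along `direction`.

    Computes x - (x . u) u, where u is `direction` scaled to unit length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]


if __name__ == "__main__":
    # The first coordinate is removed; the orthogonal part is untouched.
    print(ablate_direction([3.0, 4.0], [2.0, 0.0]))
```

After ablation the activation has zero projection onto the chosen direction, so anything downstream that reads from that direction no longer receives the signal.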
- Combines BatchTopK sparsity with delta-based loss and contrastive signals for precise model diffing
- Tested on 10 model organisms across diverse tasks, using Gemma, LLaMA, and Qwen models (1B-9B parameters)
- Outperforms SAE-based baselines and enables targeted mitigation of unwanted model behaviors
Why It Matters
Enables precise identification and correction of unwanted behaviors in fine-tuned AI models, improving safety and control.