Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes

arXiv cs.LG March 06, 2026

New method isolates the internal directions that fine-tuning changes in models like Gemma and LLaMA, enabling targeted mitigation.

Deep Dive

A research team from McGill University and Google DeepMind has introduced Delta-Crosscoder, a method for understanding how fine-tuning changes a model's internal representations. Standard crosscoders learn a shared dictionary of interpretable latent directions across a base model and its fine-tuned counterpart. They struggle in "narrow fine-tuning" regimes, where the behavioral change is localized and asymmetric, so the few directions that differ are easily crowded out by the many that the two models share. Delta-Crosscoder addresses this by combining three components: BatchTopK sparsity for efficient computation, a delta-based loss that prioritizes directions that actually change between the models, and an implicit contrastive signal from paired activations on matched inputs.
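To make the first two components concrete, here is a minimal pure-Python sketch. The BatchTopK part follows the general idea of sharing a sparsity budget of k times the batch size across the whole batch, rather than keeping exactly k latents per sample. The delta term is an assumption: this summary does not state the paper's loss, so `delta_weighted_loss` and its `delta_weight` parameter are hypothetical and show only one way an objective could emphasize the difference between the two models' activations.

```python
from typing import List

Matrix = List[List[float]]  # one row per sample in the batch


def batch_topk(latents: Matrix, k: int) -> Matrix:
    """Keep the k * batch_size largest positive activations in the batch.

    The budget is shared across samples, so one sample may keep more than
    k latents and another fewer; only the batch average is bounded by k.
    """
    budget = k * len(latents)
    # Flatten to (value, row, col), ignoring non-positive activations.
    flat = [
        (value, i, j)
        for i, row in enumerate(latents)
        for j, value in enumerate(row)
        if value > 0.0
    ]
    flat.sort(key=lambda item: item[0], reverse=True)
    sparse = [[0.0] * len(row) for row in latents]
    for value, i, j in flat[:budget]:
        sparse[i][j] = value
    return sparse


def mse(target: Matrix, estimate: Matrix) -> float:
    """Mean squared error over every entry of two same-shaped matrices."""
    count = sum(len(row) for row in target)
    total = sum(
        (t - e) ** 2
        for t_row, e_row in zip(target, estimate)
        for t, e in zip(t_row, e_row)
    )
    return total / count


def delta_weighted_loss(
    base: Matrix,
    tuned: Matrix,
    base_hat: Matrix,
    tuned_hat: Matrix,
    delta_weight: float = 1.0,  # hypothetical knob, not from the paper
) -> float:
    """Hypothetical objective: reconstruct both models plus their difference.

    The extra term penalizes errors in (tuned - base), so directions that
    change under fine-tuning contribute more to the loss than shared ones.
    """
    reconstruction = mse(base, base_hat) + mse(tuned, tuned_hat)
    delta = [[t - b for b, t in zip(b_row, t_row)]
             for b_row, t_row in zip(base, tuned)]
    delta_hat = [[t - b for b, t in zip(b_row, t_row)]
                 for b_row, t_row in zip(base_hat, tuned_hat)]
    return reconstruction + delta_weight * mse(delta, delta_hat)
```

With `k=1` and a batch of two samples, `batch_topk` keeps two activations in total, and both may come from the same sample if that is where the largest values are. A real crosscoder would apply this selection to encoder outputs inside a tensor library, with decoders for each model producing `base_hat` and `tuned_hat`.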

The researchers validated Delta-Crosscoder on 10 "model organisms", including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing, using Gemma, LLaMA, and Qwen models ranging from 1B to 9B parameters. The method reliably isolated latent directions causally responsible for the fine-tuned behaviors, and intervening on those directions enabled effective mitigation. Delta-Crosscoder outperformed SAE-based baselines and matched non-SAE-based approaches, which the authors take as evidence that crosscoders remain a useful tool for model diffing, interpretability, and safety work.

Key Points

Combines BatchTopK sparsity with delta-based loss and contrastive signals for precise model diffing
Tested on 10 model organisms built on Gemma, LLaMA, and Qwen models (1B-9B parameters) across diverse tasks
Outperforms SAE-based baselines and enables targeted mitigation of unwanted model behaviors

Why It Matters

Enables precise identification and correction of unwanted behaviors in fine-tuned AI models, improving safety and control.
