AI Safety

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

Researchers find that contrastive vectors reveal model concepts with far greater precision than sparse autoencoders.

Deep Dive

A research team led by Francisco Ferreira da Silva and StefanHex has developed a novel approach to understanding how large language models represent concepts internally. Their method uses contrastive directions, vectors created by subtracting the average activations for one concept from those of another (like English vs. Mandarin or Python vs. Haskell), to probe model behavior. When they perturbed activations along these directions in models including Gemma 2 9B, Llama 3.1 8B, and Qwen 3 1.7B, they observed markedly different responses from those produced by traditional methods.
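As a rough illustration of the construction (not the authors' code), the sketch below builds a contrastive direction as the difference of mean activations over two concept prompt sets. The model name, layer index, and prompts are placeholder assumptions; the study itself works with much larger models and prompt collections.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # small stand-in; the study uses Gemma 2 9B, Llama 3.1 8B, Qwen 3 1.7B
LAYER = 6        # assumed layer at which activations are read (and later perturbed)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_activation(prompts, layer):
    """Average the residual-stream activation at `layer` over all tokens of all prompts."""
    acts = []
    with torch.no_grad():
        for p in prompts:
            out = model(**tok(p, return_tensors="pt"))
            # out.hidden_states[layer] has shape (1, seq_len, d_model)
            acts.append(out.hidden_states[layer].mean(dim=1).squeeze(0))
    return torch.stack(acts).mean(dim=0)

# Toy concept sets; any contrasting pair (languages, programming languages, ...) works the same way.
english  = ["The weather is lovely today.", "She reads a book every evening."]
mandarin = ["今天天气很好。", "她每天晚上读一本书。"]

# Contrastive direction: difference of the two concept means, normalised to unit length.
direction = mean_activation(english, LAYER) - mean_activation(mandarin, LAYER)
direction = direction / direction.norm()
```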

Sparse autoencoders (SAEs), the current gold standard for feature discovery, produced responses comparable to random directions, whereas contrastive vectors elicited strong model responses at perturbation magnitudes roughly ten times smaller. This suggests contrastive directions correspond more closely to computational features the models actually use, while SAE features may reflect dataset artifacts rather than genuine model representations. The researchers quantified responses as the L2 distance between perturbed and unperturbed activations at the second-to-last layer, and observed consistent results across multiple model families and concept pairs.
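Continuing the sketch above, the response measurement can be approximated by adding eps times a direction to the residual stream at one layer via a forward hook, then taking the L2 distance to the unperturbed run at the second-to-last layer. The hook mechanics and magnitudes below are assumptions for a GPT-2-style model, not the paper's exact setup.

```python
import torch

def hidden_at(prompt, measure_layer, hook=None):
    """Run the model (optionally with a perturbation hook on block LAYER) and
    return the hidden states recorded at `measure_layer`."""
    handle = None
    if hook is not None:
        # For GPT-2-style blocks, element 0 of the block's output tuple is the residual stream.
        handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        with torch.no_grad():
            out = model(**tok(prompt, return_tensors="pt"))
    finally:
        if handle is not None:
            handle.remove()
    return out.hidden_states[measure_layer]

def make_hook(vec, eps):
    def hook(module, inputs, output):
        return (output[0] + eps * vec,) + output[1:]   # shift every token's residual stream
    return hook

prompt = "The weather is"
second_to_last = len(model.transformer.h) - 1   # hidden_states index of the second-to-last block
baseline = hidden_at(prompt, second_to_last)

random_dir = torch.randn_like(direction)
random_dir = random_dir / random_dir.norm()

for eps in [1.0, 4.0, 16.0]:                    # illustrative magnitudes, not the paper's sweep
    for name, vec in [("contrastive", direction), ("random", random_dir)]:
        perturbed = hidden_at(prompt, second_to_last, make_hook(vec, eps))
        print(f"eps={eps:5.1f}  {name:11s}  L2 response = {(perturbed - baseline).norm():.3f}")
```

If the reported effect holds, the contrastive direction should show visibly larger L2 responses than the random direction already at the smaller magnitudes.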

The findings challenge current interpretability practices and suggest contrastive probing could become a more reliable method for understanding model internals. This approach could enable more precise steering of model behavior and better monitoring for AI safety applications, potentially allowing researchers to identify and modify specific concepts within models with unprecedented accuracy.

Key Points
  • Contrastive directions (English→Mandarin, Python→Haskell, male→female) elicit model responses at perturbation magnitudes 10x smaller than sparse autoencoder features
  • Tested across three model families (Gemma 2 9B, Llama 3.1 8B, and Qwen 3 1.7B) with consistent results
  • SAE features performed similarly to random directions, suggesting they may capture dataset artifacts rather than genuine model computations

Why It Matters

This could revolutionize AI interpretability: a sharper picture of how concepts are represented internally would enable more precise model steering and better safety monitoring.