Quantifying LLM Attention-Head Stability: Implications for Circuit Universality
Study reveals middle-layer attention heads are least stable, potentially undermining interpretability efforts for models like GPT-4 and Llama 3.
Researchers Karan Bali, Jack Stanley, Praneet Suresh, and Danilo Bzdok published "Quantifying LLM Attention-Head Stability," a systematic study of how transformer circuits vary across independently trained model instances. They found that middle-layer attention heads are the least stable yet most functionally distinct, with deeper models showing stronger mid-depth divergence, and that training with weight decay improves stability by 30-50%. These results challenge the assumption that discovered circuits are universal rather than instance-specific, tempering confidence in mechanistic interpretability for safety-critical AI oversight.
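The core measurement is illustrative to sketch: compare corresponding attention-head parameters across two independently trained instances of the same architecture. The snippet below is a minimal, hypothetical version of such a comparison (per-head cosine similarity of flattened weight matrices), not the authors' actual metric; the array shapes and function name are assumptions for illustration.

```python
import numpy as np

def head_stability(heads_a, heads_b):
    """Per-head cosine similarity between attention-head weight
    matrices from two independently trained model instances.

    heads_a, heads_b: arrays of shape (n_layers, n_heads, d, d).
    Returns an (n_layers, n_heads) similarity matrix; values near 1
    suggest stable heads, values near 0 suggest instance-specific ones.
    """
    # Flatten each head's weight matrix into a vector.
    a = heads_a.reshape(heads_a.shape[0], heads_a.shape[1], -1)
    b = heads_b.reshape(a.shape)
    num = (a * b).sum(axis=-1)
    denom = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / denom

# Sanity check on toy weights: identical instances score 1.0 on every head.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8, 16, 16))  # 4 layers, 8 heads, 16x16 weights
sims = head_stability(w, w)
print(np.allclose(sims, 1.0))  # True
```

In practice one would aggregate such scores by layer depth, which is how a mid-depth dip in stability (as the study reports) would surface.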
Why It Matters
If AI safety oversight depends on understanding a model's internal circuits, then circuit instability across training runs undermines reliable monitoring and control of advanced models.