Finding Highly Interpretable Prompt-Specific Circuits in Language Models
New research overturns a core assumption about how AI models solve tasks internally.
A new paper reveals that language models like GPT-2 and Gemma 2 don't use a single, stable internal "circuit" to solve a task. Instead, they deploy different, prompt-specific mechanisms for the same problem. The researchers developed ACC++, a method that identifies these cleaner, causal pathways from a single forward pass. They found that prompts cluster into families with similar circuits, enabling new automated interpretability pipelines that explain model behavior at the level of individual prompts.
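To make the "prompt families" idea concrete, here is a minimal sketch of how prompts could be grouped by circuit similarity. It assumes each prompt's circuit is represented as a set of edge identifiers and uses Jaccard overlap with a greedy grouping rule; the function names, data, and threshold are illustrative assumptions, not the paper's actual algorithm.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap between two circuits, each a set of (layer, edge) ids."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_circuits(circuits: dict, threshold: float = 0.5) -> list:
    """Greedy clustering: a prompt joins the first family whose
    representative circuit overlaps it above `threshold`; otherwise
    it founds a new family. Purely illustrative, not ACC++ itself."""
    families = []  # list of (representative_circuit, [prompt_ids])
    for prompt_id, circuit in circuits.items():
        for rep, members in families:
            if jaccard(rep, circuit) >= threshold:
                members.append(prompt_id)
                break
        else:
            families.append((circuit, [prompt_id]))
    return [members for _, members in families]

# Toy data: p1 and p2 share most circuit edges; p3 uses a
# distinct mechanism for the same task.
circuits = {
    "p1": frozenset({(0, 3), (5, 1), (9, 6)}),
    "p2": frozenset({(0, 3), (5, 1), (9, 7)}),
    "p3": frozenset({(2, 0), (11, 4)}),
}
print(cluster_circuits(circuits))  # → [['p1', 'p2'], ['p3']]
```

A real pipeline would derive the circuit sets from attribution scores rather than hand-code them, but the grouping step, comparing per-prompt circuits and binning similar ones into families, follows this shape.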
Why It Matters
This fundamentally changes how we interpret AI, moving from task-level to prompt-specific explanations and enabling more accurate debugging and safety analysis.