Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations
A new method translates GPT-2's internal circuits into human-readable explanations, achieving 100% explanation sufficiency on the IOI task.
A new research pipeline aims to solve a core problem in AI interpretability: translating the complex internal circuits of large language models (LLMs) into human-understandable natural language explanations. Developed by Ajay Pravin Mahale, the method first uses causal techniques like activation patching to identify which specific attention heads in a model are responsible for a given behavior. It then generates explanations in two ways, from templates and by querying another LLM, and finally evaluates their faithfulness with adapted ERASER metrics.
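The write-up does not include code, but a minimal sketch of the first step, causal localization via activation patching, might look like the following. It assumes the open-source TransformerLens library, an illustrative IOI prompt pair, and a hypothetical target head (9, 9); the paper's actual prompts, heads, and scoring metric may differ.

```python
# Activation-patching sketch (assumptions: TransformerLens API, an
# illustrative IOI prompt pair, and head (9, 9) as the patch target).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small, 124M params

clean_prompt = "When Mary and John went to the store, John gave a drink to"
corrupt_prompt = "When Mary and John went to the store, Mary gave a drink to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

LAYER, HEAD = 9, 9  # hypothetical head; the paper identifies six such heads
hook_name = f"blocks.{LAYER}.attn.hook_z"  # per-head attention output

def patch_head(z, hook):
    # z: [batch, pos, head_index, d_head]. Overwrite one head's output on
    # the corrupted run with its activation from the clean run.
    z[:, :, HEAD, :] = clean_cache[hook.name][:, :, HEAD, :]
    return z

patched_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_head)]
)

# If restoring this one head recovers the clean answer's logit, the head
# is causally implicated in the IOI behavior.
answer = model.to_single_token(" Mary")
print(patched_logits[0, -1, answer].item())
```

Heads whose restored activations recover the clean behavior are kept; assembling the survivors yields the candidate circuit.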
In a key test on the Indirect Object Identification (IOI) task with GPT-2 Small (124M parameters), the pipeline identified a circuit of just six attention heads that accounted for 61.4% of the model's performance. Explanations generated by an LLM scored 64% higher on quality metrics than template-based ones. The evaluation also surfaced critical insights: explanations were 100% sufficient (the circuit alone reproduces the behavior) but only 22% comprehensive (ablating the circuit removes only a fraction of the behavior), revealing distributed backup mechanisms within the model. Furthermore, the study found essentially no correlation (r = 0.009) between the model's confidence in an answer and the faithfulness of the explanation, a major pitfall for anyone using confidence as a proxy for trust.
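The paper adapts ERASER's rationale metrics to circuits; exactly how is not spelled out in this summary, but a plausible reading, with the ablation scheme (zero vs. mean ablation) and performance measure left as assumptions, is sketched below.

```python
# Circuit-adapted ERASER metrics (a sketch; the exact ablation scheme
# and performance measure are assumptions, not the paper's definitions).

def sufficiency(full_score: float, circuit_only_score: float) -> float:
    """Fraction of full-model performance retained when ONLY the circuit
    heads are left active (all other heads ablated). 1.0 means the
    circuit alone reproduces the behavior."""
    return circuit_only_score / full_score

def comprehensiveness(full_score: float, circuit_ablated_score: float) -> float:
    """Fraction of performance LOST when the circuit heads are ablated.
    Low values imply backup heads elsewhere recover the behavior."""
    return (full_score - circuit_ablated_score) / full_score

# Illustrative numbers matching the reported 100% / 22% findings:
print(sufficiency(full_score=100.0, circuit_only_score=100.0))          # 1.0
print(comprehensiveness(full_score=100.0, circuit_ablated_score=78.0))  # 0.22
```

The asymmetry is the interesting part: a circuit can be fully sufficient yet weakly comprehensive when the model carries redundant backup pathways that take over once the primary circuit is knocked out.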
- Identified a 6-head circuit in GPT-2 Small responsible for 61.4% of performance on the IOI task.
- LLM-generated explanations scored 64% higher on quality metrics than template-based baselines.
- Found near-zero correlation (r = 0.009) between model confidence and explanation faithfulness, a key trust issue (see the sketch below).
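The confidence-faithfulness result is presumably a Pearson correlation computed over per-example pairs; a minimal sketch of that check, with randomly generated placeholder arrays standing in for the real measurements, is:

```python
# Confidence-vs-faithfulness check (assumptions: Pearson r over
# per-example pairs; the arrays below are placeholders, not real data).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Per-example model confidence (e.g. softmax probability of the predicted
# token) and per-example explanation-faithfulness scores.
confidence = rng.uniform(0.5, 1.0, size=200)
faithfulness = rng.uniform(0.0, 1.0, size=200)

r, p = pearsonr(confidence, faithfulness)
print(f"r = {r:.3f}, p = {p:.3f}")
```

A near-zero r means a confident answer says nothing about whether its explanation is faithful, so confidence cannot stand in as a proxy for explanation trustworthiness.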
Why It Matters
This work is a crucial step toward building trustworthy, auditable AI systems by making their 'black box' reasoning more transparent.