Interpreting Negation in GPT-2: Layer- and Head-Level Causal Analysis
A new study maps how GPT-2 processes 'not', tracing the critical logic to just a few attention heads.
A team of researchers has published a detailed causal analysis of how OpenAI's GPT-2 Small model internally processes logical negation, a known weakness of language models. The study, built on a custom dataset of 12,000 affirmative and negated sentence pairs, introduced a metric called the Negation Effect Score (NES) to quantify the model's sensitivity to this transformation. Using two intervention techniques, activation patching and ablation, the team traced how negation signals propagate through the model's architecture.
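The article does not give the paper's exact NES formula, so the sketch below assumes one plausible formulation: the shift in log-probability that GPT-2 Small assigns to a continuation token when 'not' is inserted. The example pair, the sign convention, and the use of the TransformerLens library are illustrative choices, not the paper's setup.

```python
# A possible formulation of the Negation Effect Score (NES). The paper's
# exact definition is not given in this summary, so the sign convention
# and the example pair below are assumptions for illustration only.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small: 12 layers, 12 heads each

def next_token_logprob(prompt: str, target: str) -> float:
    """Log-probability the model assigns to `target` as the next token."""
    target_id = model.to_single_token(target)
    logits = model(prompt, return_type="logits")   # [batch, seq, d_vocab]
    logprobs = logits[0, -1].log_softmax(dim=-1)   # distribution at the last position
    return logprobs[target_id].item()

def negation_effect_score(affirmative: str, negated: str, target: str) -> float:
    """Assumed NES: how much inserting 'not' shifts the target's log-prob.
    A model that is sensitive to negation should show a large shift."""
    return next_token_logprob(affirmative, target) - next_token_logprob(negated, target)

# Illustrative pair, not drawn from the paper's 12,000-pair dataset:
print(negation_effect_score("The sky is", "The sky is not", " blue"))
```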
The findings show that GPT-2's handling of negation is not broadly distributed across its 12 layers of 12 attention heads each. Instead, the logical function for correctly interpreting 'not' is highly localized: the research identified a small set of specific attention heads, concentrated in the model's mid-layers (4 to 6), as causally responsible for the task. When these heads were ablated (temporarily disabled), the model's negation sensitivity was directly and significantly disrupted, confirming their critical role.
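As a rough illustration of this kind of ablation, the sketch below zeroes out individual heads' outputs using TransformerLens forward hooks. The (layer, head) indices are hypothetical placeholders within the mid-layer range the study names, not the paper's actual list.

```python
# Zero-ablation of individual attention heads via TransformerLens hooks.
# The (layer, head) pairs are hypothetical placeholders in the mid-layer
# range named above, not the specific heads the paper identifies.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

CRITICAL_HEADS = [(4, 7), (5, 1), (6, 9)]  # assumed indices for illustration

def make_ablation_hook(head: int):
    def zero_head(z, hook):
        # z: [batch, seq, n_heads, d_head]; zeroing one head removes its
        # contribution to the attention layer's output.
        z[:, :, head, :] = 0.0
        return z
    return zero_head

fwd_hooks = [
    (f"blocks.{layer}.attn.hook_z", make_ablation_hook(head))
    for layer, head in CRITICAL_HEADS
]

clean_logits = model("The door is not open")
ablated_logits = model.run_with_hooks("The door is not open", fwd_hooks=fwd_hooks)
# Comparing clean vs. ablated predictions (e.g., recomputing NES) shows how
# much these heads contribute to handling the negated prompt.
```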
This work provides a clear map of a specific, high-level linguistic function within a transformer's 'black box.' The methodology of using targeted interventions to establish causality, rather than just observing correlations, represents a significant step in mechanistic interpretability. The patterns discovered were consistent across different forms of negation and were also detectable, though with a smaller effect size, on the external xNot360 benchmark, demonstrating the robustness of the finding.
- The study used activation patching (see the sketch after this list) and ablation on a dataset of 12,000 sentence pairs to causally trace negation processing in GPT-2 Small.
- Negation logic is highly localized, relying on a few critical attention heads in mid-layers 4-6, not distributed across the network.
- Ablating these specific heads directly increased the Negation Effect Score (NES), showing they are necessary for correct interpretation of negation.
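The activation patching mentioned in the first bullet can be sketched as follows: run a "clean" prompt and cache its activations, then overwrite a single head's output during a "corrupted" run and observe how the prediction moves. The token-aligned prompt pair and the head under test below are illustrative assumptions, not the paper's materials.

```python
# Single-head activation patching: run a "clean" prompt, cache activations,
# then overwrite one head's output during a "corrupted" run. The token-aligned
# prompt pair and the head under test are illustrative assumptions.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "The lamp is now on"    # same token length as the negated prompt
corrupt_prompt = "The lamp is not on"  # differs only in the " not" token
LAYER, HEAD = 5, 1                     # hypothetical head under test

_, clean_cache = model.run_with_cache(clean_prompt)  # cache every activation

def patch_head(z, hook):
    # Replace this head's output with its activation from the clean run.
    z[:, :, HEAD, :] = clean_cache[hook.name][:, :, HEAD, :]
    return z

patched_logits = model.run_with_hooks(
    corrupt_prompt,
    fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", patch_head)],
)
# If this head carries the negation signal, the patched run's predictions
# should move toward the clean run's, establishing a causal role.
```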
Why It Matters
This precise mapping of a reasoning circuit is a major step towards debugging and improving model logic, making AI more reliable and interpretable.