Causal Interpretation of Neural Network Computations with Contribution Decomposition
New technique decomposes AI decisions into sparse, interpretable motifs, moving beyond correlation to causation.
A Stanford University research team led by Joshua Brendan Melander, Zaki Alaoui, and Shenghua Liu has published a groundbreaking paper titled "Causal Interpretation of Neural Network Computations with Contribution Decomposition." The work introduces CODEC (Contribution Decomposition), a novel analytical framework that moves beyond simply correlating neuron activations with concepts. Instead, CODEC uses sparse autoencoders to decompose the actual computational *contributions* of hidden neurons into interpretable, sparse motifs. This shift from analyzing activation patterns to analyzing contribution flows allows researchers to trace how specific combinations of neurons causally drive a network's final output.
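A minimal sketch of the idea follows (this is not the authors' implementation): for a linear readout, a hidden unit's contribution to a class logit can be taken as its activation times the corresponding readout weight, and those contribution vectors, rather than the raw activations, are then compressed with a small sparse autoencoder whose dictionary elements play the role of contribution motifs. All module names, dimensions, and hyperparameters below are assumptions.

```python
# Hedged sketch (not the authors' code): contrast raw activations with per-class
# contributions (activation x readout weight), then compress the contribution
# vectors with a small sparse autoencoder. Names, dimensions, and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim, n_classes, n_inputs = 64, 10, 512

# Toy classifier: nonlinear backbone followed by a linear readout.
backbone = nn.Sequential(nn.Linear(32, hidden_dim), nn.ReLU())
readout = nn.Linear(hidden_dim, n_classes, bias=False)

x = torch.randn(n_inputs, 32)
with torch.no_grad():
    acts = backbone(x)  # (N, hidden): what activation-based analyses inspect
    # Contribution of hidden unit j to the class-k logit on input n:
    # contrib[n, k, j] = acts[n, j] * W[k, j]; summing over j recovers the logit.
    contrib = acts.unsqueeze(1) * readout.weight.unsqueeze(0)  # (N, classes, hidden)

# One contribution vector per (input, class) pair.
C = contrib.reshape(-1, hidden_dim)

class SparseAE(nn.Module):
    """Tiny sparse autoencoder: ReLU codes plus an L1 penalty encourage sparse motifs."""
    def __init__(self, d_in, d_code):
        super().__init__()
        self.enc = nn.Linear(d_in, d_code)
        self.dec = nn.Linear(d_code, d_in)

    def forward(self, c):
        code = torch.relu(self.enc(c))
        return self.dec(code), code

sae = SparseAE(hidden_dim, d_code=128)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for step in range(200):
    recon, code = sae(C)
    loss = ((recon - C) ** 2).mean() + 1e-3 * code.abs().mean()  # reconstruction + sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each decoder column is a candidate contribution motif: a recurring pattern of
# hidden-unit contributions that jointly push a class logit up or down.
motifs = sae.dec.weight.detach().T  # (n_motifs, hidden)
```

Because summing the reconstructed contributions over hidden units approximately recovers the class logit, each learned motif has a direct, signed effect on the output, which is what makes this a causal rather than merely correlational description.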
Applying CODEC to standard image-classification networks yielded significant mechanistic insights. The researchers found that as information flows through deeper layers, the contribution patterns become sparser and higher-dimensional. A key, unexpected discovery was the progressive decorrelation of positive and negative effects on the output, suggesting that networks learn to separate opposing influences. Crucially, the decomposition enables direct causal manipulation: researchers can edit the sparse contribution modes to alter model behavior predictably, and the same decomposition yields visualizations showing which distinct image components combine to produce a classification.
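As a toy illustration of such an edit (again a hedged sketch, not the paper's procedure): once a contribution vector is expressed in a sparse motif basis, zeroing a single code entry and re-summing the reconstruction shows how much that mode was pushing the logit up or down. The `motifs` and `codes` arrays below are random stand-ins for quantities the method would actually learn.

```python
# Hedged sketch of a contribution-mode edit (my reading of the article, not the
# paper's procedure). `motifs` and `codes` are random stand-ins for the learned
# sparse dictionary and the sparse code of one (input, class) pair.
import torch

torch.manual_seed(1)
hidden, n_modes = 64, 128

motifs = torch.randn(n_modes, hidden) / hidden ** 0.5  # contribution motifs (decoder rows)
codes = torch.relu(torch.randn(n_modes))               # sparse, non-negative mode strengths

# The class logit is (approximately) the sum of the reconstructed contributions.
logit_before = (codes @ motifs).sum()

# Causal edit: silence the single most active contribution mode and recompute.
edited = codes.clone()
top_mode = torch.argmax(codes)
edited[top_mode] = 0.0
logit_after = (edited @ motifs).sum()

print(f"mode {top_mode.item()} suppressed: "
      f"logit {logit_before.item():.3f} -> {logit_after.item():.3f}")
```

In the same spirit, amplifying a mode or mapping its motif back toward the input is one plausible way the visualizations described above could be produced.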
The team further validated CODEC's power by applying it to state-of-the-art computational models of the vertebrate retina. Here, the method successfully uncovered the combinatorial actions of biological interneurons and identified the sources of dynamic receptive fields, bridging AI interpretability with neuroscience. By establishing "contribution modes" as a fundamental unit of analysis, CODEC provides a rich, causal framework for understanding the nonlinear, hierarchical computations in both artificial and biological neural networks, marking a major step toward true mechanistic interpretability.
- CODEC uses sparse autoencoders to decompose neuron contributions into sparse, interpretable motifs, moving beyond activation analysis to causal contribution analysis.
- Applied to image networks, it revealed contributions grow sparser and decorrelate positive/negative effects across layers, enabling direct causal manipulation of outputs.
- Validated on biological retina models, CODEC uncovered combinatorial interneuron actions, bridging AI interpretability and neuroscience for mechanistic insights.
Why It Matters
Provides a causal framework for interpreting and controlling complex AI models, crucial for safety, debugging, and advancing trustworthy AI systems.