Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse
Study shows compressing neural networks by 90% kills interpretable features, with dead neuron rates exceeding 90% in complex models.
A new research paper from Dip Roy, Rajiv Misra, and Sanjay Kumar Singh reveals fundamental limitations in neural network compression. The study, 'Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse,' investigates what happens to interpretable features when AI models undergo extreme sparsification—reducing active neurons by 90%. Using hybrid Variational Autoencoder-Sparse Autoencoder (VAE-SAE) architectures and an adaptive sparsity scheduling framework, researchers progressively reduced active neurons from 500 to 50 over 50 training epochs.
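To make the scheduling idea concrete, below is a minimal sketch of how a Top-k activity budget could be annealed from 500 to 50 active neurons over 50 epochs. The linear schedule and the function names are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of an adaptive sparsity schedule: linearly anneal the Top-k budget
# from 500 to 50 active neurons over 50 epochs. The linear annealing and the
# helper names are assumptions for illustration, not the authors' code.
import torch


def active_neuron_budget(epoch: int, start_k: int = 500, end_k: int = 50,
                         total_epochs: int = 50) -> int:
    """Number of neurons allowed to remain active at a given epoch."""
    frac = min(epoch, total_epochs) / total_epochs
    return round(start_k + frac * (end_k - start_k))


def topk_sparsify(activations: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest activations per sample; zero out the rest."""
    _, indices = activations.topk(k, dim=-1)
    mask = torch.zeros_like(activations)
    mask.scatter_(-1, indices, 1.0)
    return activations * mask


# Example: the budget shrinks as training progresses.
for epoch in (0, 25, 50):
    print(epoch, active_neuron_budget(epoch))  # 500, 275, 50
```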
The findings reveal a striking paradox: while global representation quality (measured by the Mutual Information Gap) remains stable, local feature interpretability collapses systematically. Testing across the dSprites and Shapes3D datasets with both Top-k and L1 sparsification methods revealed dead neuron rates reaching 34.4% on dSprites and 62.7% on Shapes3D with Top-k, and even worse results with L1 regularization (41.7% on dSprites, 90.6% on Shapes3D). Extended training for 100 additional epochs failed to recover dead neurons, and the collapse pattern proved robust across all tested threshold definitions.
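For readers unfamiliar with the dead-neuron metric, the sketch below shows one plausible way to compute it, alongside the L1 penalty that contrasts with hard Top-k selection. The specific threshold and penalty weight are assumptions for illustration; the paper itself tests multiple threshold definitions.

```python
# One plausible dead-neuron measurement: a hidden unit counts as "dead" if its
# activation never exceeds a small threshold across the evaluation set.
# Threshold value and function names are illustrative assumptions.
import torch


def dead_neuron_rate(activations: torch.Tensor, threshold: float = 1e-6) -> float:
    """activations: (num_samples, num_neurons) post-sparsification codes."""
    max_per_neuron = activations.abs().max(dim=0).values
    return (max_per_neuron < threshold).float().mean().item()


def l1_sparsity_penalty(activations: torch.Tensor, weight: float = 1e-3) -> torch.Tensor:
    """L1 sparsification penalizes mean absolute activation rather than
    hard-selecting a Top-k subset of neurons."""
    return weight * activations.abs().mean()
```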
Critically, the research shows this collapse scales with dataset complexity. Shapes3D (RGB images with 6 factors) showed 1.8× more dead neurons than dSprites (grayscale with 5 factors) under Top-k sparsification, and 2.2× more under L1 regularization. These findings establish that interpretability collapse under extreme sparsification is intrinsic to the compression process itself, rather than an artifact of any particular algorithm, training duration, or threshold choice. The work provides empirical evidence for fundamental limits in the sparsification-interpretability relationship that could impact how researchers approach model compression and efficiency.
- Extreme sparsification (90% activation reduction) causes systematic collapse of interpretable features while maintaining global performance metrics
- Dead neuron rates reached 90.6% on the more complex Shapes3D dataset with L1 regularization, showing that collapse scales with dataset complexity
- Extended training for 100 additional epochs failed to recover dead neurons, establishing the collapse as intrinsic to the compression process
Why It Matters
The result challenges the assumption that interpretability can be preserved under aggressive compression, with direct implications for deploying AI systems that must be both efficient and explainable.