Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget
New paper finds that roughly half of a transformer's MLP nonlinearity is unnecessary, enabling a 17.3% perplexity improvement when that budget is reallocated.
An arXiv paper by researcher Peter Balogh argues that transformer models waste significant compute on unnecessary nonlinear operations. The study, 'Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget,' systematically investigates when the Multi-Layer Perceptron (MLP) blocks in transformer architectures actually require their nonlinear activation functions. Through experiments on six models ranging from 162M to 2.8B parameters (including GPT-2 and Pythia architectures) and three text corpora, Balogh shows that the need for nonlinearity is contextual and cannot be predicted from token identity alone, with cross-corpus correlations essentially zero (r < 0.05).
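The summary does not reproduce the paper's exact measurement procedure, but one plausible way to score how linearly an MLP block behaves on each token is to fit a least-squares linear surrogate to the block's input/output pairs and record the per-token residual error. The sketch below assumes that formulation; all names and thresholds in it are illustrative, not Balogh's implementation.

```python
# Illustrative sketch (not the paper's exact metric): score how linearly an MLP
# block acts on each token by fitting a least-squares linear surrogate to its
# input/output pairs and measuring the per-token relative residual error.
import torch
import torch.nn as nn

def per_token_linearity_error(mlp: nn.Module, hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (num_tokens, d) inputs to one layer's MLP block."""
    with torch.no_grad():
        y = mlp(hidden)                                           # true MLP outputs, (num_tokens, d)
        ones = torch.ones(hidden.size(0), 1, dtype=hidden.dtype)
        X = torch.cat([hidden, ones], dim=1)                      # add bias column -> (num_tokens, d+1)
        coef = torch.linalg.lstsq(X, y).solution                  # closed-form linear surrogate
        residual = y - X @ coef
        # 0 means the block acted exactly linearly on that token
        return residual.norm(dim=1) / y.norm(dim=1).clamp_min(1e-8)

# Toy usage: a GELU MLP and a batch of hidden states
d = 64
toy_mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
scores = per_token_linearity_error(toy_mlp, torch.randn(512, d))
print(scores.mean(), (scores < 0.1).float().mean())               # mean error, share of near-linear tokens
```

Comparing such per-token scores across corpora is one way the "unpredictable from token identity" claim could be tested: if the same tokens do not stay linearizable from one corpus to the next, the correlation collapses toward zero.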
Balogh introduces a gating mechanism with just d+1 parameters that decides when to replace a full MLP block with a linear surrogate. The approach exploits the heavily skewed distribution of MLP computations, most of which are near-linear. The results are striking: in GPT-2, the gate routes 25-56% of computations to the linear path at less than 1% perplexity cost, and in GPT-2 Large, 11 of 36 layers actually beat baseline performance with gating. Most compellingly, when given a full training budget to progressively replace middle-layer MLPs with frozen linear matrices, Balogh reports a 17.3% perplexity improvement over a vanilla fine-tuning control, indicating that the nonlinear MLPs in those layers were actively hurting performance. This suggests current transformer architectures systematically misallocate their 'nonlinearity budget.'
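The gate's exact form is not spelled out in this summary, but a d+1 parameter gate is most naturally a single linear projection of the hidden state (d weights plus one bias) squashed into a routing score. The sketch below assumes that form; the class, the learned linear surrogate, and the soft-mix routing rule are illustrative rather than Balogh's implementation.

```python
# Minimal sketch (assumed form) of a d+1 parameter gate that routes each token
# either through the original nonlinear MLP or through a cheap linear surrogate.
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    def __init__(self, mlp: nn.Module, d_model: int):
        super().__init__()
        self.mlp = mlp                                    # original nonlinear MLP block
        self.linear_surrogate = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 1)                 # d weights + 1 bias = d+1 gate parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))                   # (batch, seq, 1) per-token routing score
        # Soft mix during training; at inference a hard threshold (g > 0.5)
        # would skip the nonlinear MLP entirely for "linear" tokens.
        return g * self.linear_surrogate(x) + (1.0 - g) * self.mlp(x)

# Example: wrap a toy MLP block
d = 64
toy_mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
gated = GatedMLP(toy_mlp, d)
out = gated(torch.randn(2, 16, d))                        # (batch=2, seq=16, d)
```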
- Gating mechanism routes 25-56% of GPT-2 MLP computations to linear surrogates with <1% perplexity cost
- In GPT-2 Large, 11 of 36 layers beat baseline performance using the gating approach
- With a full training budget, replacing harmful middle-layer MLPs with frozen linear maps yields a 17.3% perplexity improvement over a vanilla fine-tuning control, suggesting current architectures misallocate compute (see the sketch below)
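The progressive-replacement experiment is described only at a high level here. A minimal sketch of what freezing middle-layer MLPs as linear maps could look like for GPT-2, assuming the Hugging Face transformers layout, a hypothetical choice of layers 4-8, and random initialization of the surrogates (the paper's initialization, layer selection, and schedule may differ):

```python
# Hedged sketch of the "reallocation" experiment: swap the MLP blocks of some
# middle layers of GPT-2 for frozen linear maps, then fine-tune the rest of the
# model and compare perplexity against a vanilla fine-tuning control.
import torch.nn as nn
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
d_model = model.config.n_embd

for i in range(4, 9):                         # hypothetical "middle" layers to linearize
    surrogate = nn.Linear(d_model, d_model)   # could instead be fit by least squares to the old MLP
    for p in surrogate.parameters():
        p.requires_grad = False               # freeze the linear replacement
    model.transformer.h[i].mlp = surrogate    # drop-in: GPT-2's MLP maps (..., d_model) -> (..., d_model)

# Fine-tune the remaining trainable parameters as usual; under the paper's claim,
# this configuration can end up with better perplexity than the unmodified control.
```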
Why It Matters
Could enable faster, cheaper LLMs: if a quarter to half of MLP computations can be served by cheap linear surrogates, a substantial share of inference compute could be eliminated, since MLP blocks account for a large fraction of a transformer's FLOPs.