Research & Papers

New theory proves multi-head attention reduces variance via head decorrelation

Paper derives optimal head count scaling law and Head Diversity Index for Transformers.

Deep Dive

Ernest Fokoué presents a rigorous statistical theory that reframes multi-head attention (MHA) in Transformers as an ensemble of H Nadaraya-Watson (NW) kernel regression estimators, each operating in a distinct learned projection subspace. Building on the algebraic identity between single-head softmax attention and NW estimation, the author derives an explicit Bias-Variance-Covariance decomposition of MHA's mean squared error. The key insight: variance reduction depends not on the number of heads H alone but on how decorrelated the head outputs are, which is governed by the principal angles between the learned projection subspaces. Orthogonal projections yield maximum variance reduction; aligned projections yield none. The paper introduces the Head Diversity Index (HDI), a computable spectral measure of inter-head decorrelation, and proves that MHA mean squared error decreases monotonically in HDI, offering the first rigorous theoretical explanation for observed head specialization.
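The identity between single-head softmax attention and NW estimation can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes scalar values, a single query, and the exponential kernel K(q, k) = exp(q·k / sqrt(d)); the function names and toy data are illustrative only.

```python
import math

def softmax_attention(q, keys, values):
    """Single-head scaled dot-product attention for one query (scalar values)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e / z * v for e, v in zip(exps, values))

def nadaraya_watson(q, keys, values):
    """NW estimate with the (assumed) kernel K(q, k) = exp(q.k / sqrt(d))."""
    d = len(q)
    weights = [
        math.exp(sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d))
        for k in keys
    ]
    # Kernel-weighted average of the observed responses
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Toy data (hypothetical): one 2-d query, three key/value pairs
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
values = [2.0, -1.0, 0.5]

attn_out = softmax_attention(q, keys, values)
nw_out = nadaraya_watson(q, keys, values)
print(attn_out, nw_out)
```

The two functions agree up to floating-point error because the softmax normalizer is exactly the NW denominator: both compute a kernel-weighted average of the values.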

Under a fixed total-dimension budget D = H × d_k, Fokoué solves the optimal head-dimension allocation problem, deriving the MSE-minimizing pair (H*, d_k*) from the data distribution and the smoothness of the regression function. The solution yields a new architectural scaling law: the optimal per-head dimension grows logarithmically with training set size, while the optimal number of heads grows nearly linearly with the total budget D. This framework unifies prior work on NW theory of single-head attention, ensemble learning weighting theory, and the decorrelation-variance-reduction isomorphism between biological and computational ensembles. For practitioners, the implication is that adding more attention heads without enforcing diversity is inefficient; optimized head allocation and diversity metrics such as HDI should guide Transformer architecture design for better performance per parameter.
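Why more heads alone do not help can be illustrated with a textbook simplification rather than the paper's own decomposition. The sketch below assumes H estimators that are averaged with equal weights, each with variance sigma2 and a common pairwise correlation rho; here rho is a hypothetical stand-in for the subspace alignment that the paper measures through principal angles and HDI.

```python
def ensemble_variance(sigma2, rho, H):
    """Variance of the mean of H estimators with equal variance sigma2
    and common pairwise correlation rho (equicorrelated model, assumed).

    Var(mean) = sigma2 / H + (H - 1) / H * rho * sigma2
              = sigma2 * (rho + (1 - rho) / H)
    """
    return sigma2 * (rho + (1.0 - rho) / H)

# Compare fully decorrelated, partly correlated, and fully aligned heads
for rho in (0.0, 0.5, 1.0):
    row = [round(ensemble_variance(1.0, rho, H), 4) for H in (1, 2, 8, 32)]
    print(f"rho={rho}: {row}")
```

Under this model the variance falls as 1/H only when rho is 0. For any positive rho it levels off at rho × sigma2 no matter how many heads are added, and at rho = 1 there is no reduction at all, which mirrors the orthogonal versus aligned projection cases described above.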

Key Points
  • Introduces the Head Diversity Index (HDI), a computable spectral measure of inter-head decorrelation, and proves that MHA error decreases monotonically in HDI
  • Derives the optimal head-dimension allocation (H*, d_k*) under a fixed budget D: per-head dimension grows logarithmically with training set size n, head count grows nearly linearly with D
  • Provides the first rigorous statistical explanation for attention head specialization as an ensemble diversity-variance tradeoff

Why It Matters

Provides theoretical foundations for designing more efficient Transformer architectures with mathematically optimal head allocation.