Random matrix theory reveals optimal attention weights for signal recovery
Causal self-attention with harmonic weights beats mean pooling when early tokens matter most.
A new theoretical paper by Mohamed El Amine Seddik applies random matrix theory to the question of how attention mechanisms help sequence models. The study builds sample covariance matrices from pooled token embeddings drawn from a two-class Gaussian mixture, with the attention weights held fixed. Working in the high-dimensional limit (d, V, N → ∞ with ratios δ and γ), the author derives exact characterizations of the limiting eigenvalue distribution, the outlier eigenvalues, and the alignment of the corresponding eigenvectors with the hidden signal. The bulk spectrum follows a non-Marchenko–Pastur law given by a free multiplicative convolution, reflecting the finite-vocabulary structure. Signal recovery undergoes two BBP-type phase transitions, each governed by a scalar that depends on the geometry of the attention weights and on the positional correlations.
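To make the setup concrete, here is a minimal NumPy sketch of a toy version of it. It is not the paper's exact model: the noise scaling (1/√d), the uniform token sampling, the independent class labels per vocabulary entry, the default sizes, and the function name `top_eigvec_alignment` are all assumptions chosen for illustration. It pools token embeddings with fixed weights, forms the sample covariance, and measures how well the top eigenvector aligns with the hidden signal direction.

```python
import numpy as np

def top_eigvec_alignment(signal_norm, weights, d=60, V=120, N=600, seed=0):
    """Toy experiment (illustrative assumptions, not the paper's exact model).

    Returns |<u, direction>|, where u is the top eigenvector of the sample
    covariance of pooled embeddings and direction is the unit signal vector.
    """
    rng = np.random.default_rng(seed)
    T = len(weights)

    # Hidden signal: signal_norm times a fixed unit direction (assumed e_1).
    direction = np.zeros(d)
    direction[0] = 1.0
    mu = signal_norm * direction

    # Vocabulary of V embeddings from a two-class Gaussian mixture:
    # e_v = y_v * mu + z_v, with y_v = +/-1 and z_v ~ N(0, I_d / d).
    labels = rng.choice([-1.0, 1.0], size=V)
    vocab = labels[:, None] * mu + rng.standard_normal((V, d)) / np.sqrt(d)

    # N sequences of T tokens, sampled uniformly from the vocabulary.
    tokens = rng.integers(0, V, size=(N, T))

    # Pool each sequence with the fixed attention weights: x_i = sum_t a_t e_{token(i, t)}.
    pooled = np.einsum("t,ntd->nd", weights, vocab[tokens])

    # Sample covariance (uncentered) and its top eigenvector.
    cov = pooled.T @ pooled / N
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, -1]
    return float(abs(top @ direction))

uniform = np.full(8, 1.0 / 8)  # mean pooling over T = 8 positions
strong = top_eigvec_alignment(3.0, uniform)
null = top_eigvec_alignment(0.0, uniform)
print(f"alignment with strong signal: {strong:.3f}")
print(f"alignment with no signal:     {null:.3f}")
```

Sweeping `signal_norm` between these two extremes is one way to look for a phase transition in the alignment. The threshold in this toy model should not be expected to match the paper's formulas.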
The key practical insight: the attention weights that maximize the signal-to-noise ratio are simply the normalized top eigenvector of the positional correlation matrix R. As a special case, the analysis shows that parameter-free causal self-attention with τ/d score scaling produces deterministic harmonic weights, which improve signal recovery over standard mean pooling whenever early tokens carry more signal. Within this model, that gives rigorous backing for why position-biased attention patterns (like those in transformers) can be beneficial. Simulations show close agreement between the asymptotic predictions and finite-dimensional experiments.
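The comparison between pooling schemes can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the Rayleigh quotient aᵀRa / ‖a‖² stands in for the SNR objective, the toy matrix R is rank one with geometrically decaying positional signal, and the harmonic weights are obtained by assuming the causal attention rows become uniform over each prefix when scores vanish (the paper's exact weights may differ, for example through a dependence on τ).

```python
import numpy as np

def harmonic_weights(T):
    """Pooling weights from causal attention with all-zero scores (assumed limit).

    Row t of the attention matrix is uniform over positions 1..t, so averaging
    the rows gives w_s = (1/T) * sum_{t >= s} 1/t, a tail of the harmonic series.
    """
    scores = np.zeros((T, T))
    causal_mask = np.tril(np.ones((T, T), dtype=bool))
    logits = np.where(causal_mask, scores, -np.inf)
    attn = np.exp(logits)                    # masked entries become exactly 0
    attn /= attn.sum(axis=1, keepdims=True)  # row-wise softmax
    return attn.mean(axis=0)                 # mean-pool the attention outputs

def rayleigh(a, R):
    """Scale-invariant proxy for SNR (an assumption): a^T R a / ||a||^2."""
    return float(a @ R @ a / (a @ a))

T = 8
signal = 0.5 ** np.arange(T)   # hypothetical: early positions carry more signal
R = np.outer(signal, signal)   # toy positional correlation matrix (rank one)

mean_w = np.full(T, 1.0 / T)
harm_w = harmonic_weights(T)

eigvals, eigvecs = np.linalg.eigh(R)    # eigenvalues in ascending order
opt_w = eigvecs[:, -1]
opt_w = opt_w * np.sign(opt_w.sum())    # fix the sign so the weights are non-negative

for name, w in [("mean", mean_w), ("harmonic", harm_w), ("top eigenvector", opt_w)]:
    print(f"{name:>16}: {rayleigh(w, R):.4f}")
```

For this toy R, the harmonic weights score higher than mean pooling, and the top eigenvector attains the largest possible value, the top eigenvalue of R. If the signal were instead concentrated in late positions, the harmonic weights would be expected to lose their advantage.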
- Derives the exact limiting eigenvalue distribution for pooled sequence embeddings drawn from a Gaussian mixture under high-dimensional asymptotics.
- The SNR-maximizing attention weights are the normalized top eigenvector of the positional correlation matrix R.
- Causal self-attention with τ/d score scaling yields harmonic weights that outperform mean pooling when early tokens have higher signal.
Why It Matters
Provides a theoretical foundation for positional biases in attention, which may help guide more efficient transformer designs for sequence modeling.