The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry
A new study tracks the SVD of every weight matrix during pretraining and validates its findings across nine models.
Yi Liu's paper presents the first systematic study of weight-matrix singular value spectra during transformer pretraining, tracking full SVD decompositions at 25-step intervals across three model scales (30M–285M parameters). The study identifies three key phenomena. First, transient compression waves: stable-rank compression propagates as a traveling wave from early to late layers, creating a steep depth gradient that peaks early and then reverses, with late layers eventually over-compressing past early layers. Second, persistent spectral gradients: a lasting depth gradient develops, forming a non-monotonic inverted-U in deeper models whose peak shifts toward earlier layers as depth increases. Third, a functional Q/K–V asymmetry: value/output projections compress uniformly, while query/key projections carry the full depth-dependent dynamics.
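The compression the paper tracks is naturally summarized by the stable rank of each weight matrix, ||W||_F² / σ_max², which falls as the spectrum concentrates. The snippet below is a minimal sketch of how a per-layer compression profile could be computed at a single checkpoint; the layer matrices here are random placeholders, not the paper's checkpoints, and the paper's exact per-matrix bookkeeping may differ.

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / sigma_max^2, a smooth proxy for effective rank."""
    s = np.linalg.svd(W, compute_uv=False)        # singular values, descending
    return float(np.sum(s ** 2) / (s[0] ** 2))

# Hypothetical usage: compute a depth profile of compression at one checkpoint.
# In a real run, `layer_weights` would hold e.g. each block's Q-projection matrix.
layer_weights = [np.random.randn(256, 256) for _ in range(12)]  # stand-in data
compression_profile = [stable_rank(W) for W in layer_weights]
print(compression_profile)  # lower values indicate stronger spectral compression
```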
Liu formalizes these observations in a two-timescale dynamical model and derives scaling laws (Δα ∝ L^0.26, R² = 0.99). Validation across nine models from three families (custom, GPT-2, Pythia; 30M–1B parameters; 8–36 layers) shows that the spectral exponent α predicts layer importance (ρ = 0.69–0.84, p < 0.02). Spectral-guided pruning outperforms Last-N heuristics by 1.1x–3.6x across seven models in two families (GPT-2 124M–774M, Pythia 160M–1B), with worst-vs-best gaps of up to 23.7x, confirming the causal role of spectral structure. The work provides fundamental insight into transformer training dynamics and offers practical tools for model optimization.
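The spectral exponent α is typically obtained by fitting a power law to the tail of the eigenvalue spectrum of WᵀW (the squared singular values). The sketch below uses a simple continuous maximum-likelihood (Hill-type) estimator over the top half of the spectrum; the cutoff choice (`tail_frac`) and the estimator itself are assumptions for illustration, not the paper's exact fitting procedure.

```python
import numpy as np

def spectral_alpha(W: np.ndarray, tail_frac: float = 0.5) -> float:
    """Power-law exponent alpha fit to the tail of the eigenvalue spectrum of W^T W.

    Uses the continuous MLE (Hill-type) estimator
        alpha = 1 + n / sum(log(lam_i / lam_min))
    over the largest `tail_frac` of eigenvalues. Illustrative only; the paper's
    fitting procedure may differ.
    """
    s = np.linalg.svd(W, compute_uv=False)
    lam = np.sort(s ** 2)                               # eigenvalues of W^T W, ascending
    tail = lam[-max(2, int(tail_frac * lam.size)):]     # largest eigenvalues
    return 1.0 + tail.size / np.sum(np.log(tail / tail[0]))

# Hypothetical usage: score each block's Q-projection by alpha.
layers = {f"block{i}.attn.q_proj": np.random.randn(256, 256) for i in range(12)}
alphas = {name: spectral_alpha(W) for name, W in layers.items()}
for name, a in sorted(alphas.items(), key=lambda kv: kv[1]):
    print(f"{name}: alpha = {a:.2f}")
```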
- Transient compression waves propagate from early to late layers, with late layers eventually over-compressing past early layers.
- Persistent spectral gradients form an inverted-U shape in deeper models, with peaks shifting toward earlier layers as depth increases.
- Spectral-guided pruning outperforms Last-N heuristics by 1.1x–3.6x across GPT-2 and Pythia models (a minimal comparison sketch follows this list).
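To make the comparison concrete, here is a minimal sketch of the two layer-selection strategies: the Last-N baseline drops the final k blocks, while a spectral-guided rule drops the k blocks whose α marks them as least important. Which end of the α range signals low importance is an assumption in this sketch (the source states only that α correlates with layer importance), and the α values below are made up for illustration.

```python
import numpy as np

def last_n_prune(num_layers: int, k: int) -> list[int]:
    """Baseline heuristic: drop the last k transformer blocks."""
    return list(range(num_layers - k, num_layers))

def spectral_prune(alpha_by_layer: list[float], k: int,
                   low_alpha_less_important: bool = True) -> list[int]:
    """Drop the k layers whose spectral exponent marks them as least important.

    The sign convention (low alpha = low importance) is an assumption here,
    controlled by `low_alpha_less_important`.
    """
    order = np.argsort(alpha_by_layer)                  # indices sorted by ascending alpha
    chosen = order[:k] if low_alpha_less_important else order[-k:]
    return sorted(int(i) for i in chosen)

# Hypothetical 12-layer model with made-up, inverted-U-shaped alpha values.
alphas = [2.1, 2.4, 2.8, 3.0, 3.1, 3.0, 2.9, 2.7, 2.5, 2.3, 2.2, 2.6]
print("Last-N prunes:   ", last_n_prune(len(alphas), k=3))
print("Spectral prunes: ", spectral_prune(alphas, k=3))
```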
Why It Matters
This research uncovers fundamental training dynamics and offers practical pruning methods for optimizing large transformer models.