The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry
A new study tracks the SVD of every weight matrix during pretraining and validates its findings across nine models.
Yi Liu's paper presents the first systematic study of weight-matrix singular value spectra during transformer pretraining, tracking full SVD decompositions at 25-step intervals across three model scales (30M–285M parameters). The study identifies three key phenomena. First, transient compression waves: stable-rank compression propagates as a traveling wave from early to late layers, creating a steep depth gradient that peaks early and then reverses, with late layers eventually over-compressing past early layers. Second, persistent spectral gradients: a lasting depth gradient develops, forming a non-monotonic inverted-U in deeper models whose peak shifts toward earlier layers as depth increases. Third, a functional Q/K–V asymmetry: value/output projections compress uniformly, while query/key projections carry the full depth-dependent dynamics.
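The compression the paper tracks is naturally summarized by the stable rank of each weight matrix, ||W||_F² / σ_max², which falls as the spectrum concentrates. The snippet below is a minimal sketch of how a per-layer compression profile could be computed at a single checkpoint; the layer matrices here are random placeholders, not the paper's checkpoints, and the paper's exact per-matrix bookkeeping may differ.

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / sigma_max^2, a smooth proxy for effective rank."""
    s = np.linalg.svd(W, compute_uv=False)        # singular values, descending
    return float(np.sum(s ** 2) / (s[0] ** 2))

# Hypothetical usage: compute a depth profile of compression at one checkpoint.
# In a real run, `layer_weights` would hold e.g. each block's Q-projection matrix.
layer_weights = [np.random.randn(256, 256) for _ in range(12)]  # stand-in data
compression_profile = [stable_rank(W) for W in layer_weights]
print(compression_profile)  # lower values indicate stronger spectral compression
```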
Liu formalizes these observations in a two-timescale dynamical model and derives scaling laws (Δα ∝ L^0.26, R² = 0.99). Validation across nine models from three families (custom, GPT-2, Pythia; 30M–1B parameters; 8–36 layers) shows that the spectral exponent α predicts layer importance (ρ = 0.69–0.84, p < 0.02). Spectral-guided pruning outperforms Last-N heuristics by 1.1x–3.6x across seven models in two families (GPT-2 124M–774M, Pythia 160M–1B), with worst-vs-best gaps of up to 23.7x, confirming the causal role of spectral structure. The work provides fundamental insight into transformer training dynamics and offers practical tools for model optimization.
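The spectral exponent α is typically obtained by fitting a power law to the tail of the eigenvalue spectrum of WᵀW (the squared singular values). The sketch below uses a simple continuous maximum-likelihood (Hill-type) estimator over the top half of the spectrum; the cutoff choice (`tail_frac`) and the estimator itself are assumptions for illustration, not the paper's exact fitting procedure.

```python
import numpy as np

def spectral_alpha(W: np.ndarray, tail_frac: float = 0.5) -> float:
    """Power-law exponent alpha fit to the tail of the eigenvalue spectrum of W^T W.

    Uses the continuous MLE (Hill-type) estimator
        alpha = 1 + n / sum(log(lam_i / lam_min))
    over the largest `tail_frac` of eigenvalues. Illustrative only; the paper's
    fitting procedure may differ.
    """
    s = np.linalg.svd(W, compute_uv=False)
    lam = np.sort(s ** 2)                               # eigenvalues of W^T W, ascending
    tail = lam[-max(2, int(tail_frac * lam.size)):]     # largest eigenvalues
    return 1.0 + tail.size / np.sum(np.log(tail / tail[0]))

# Hypothetical usage: score each block's Q-projection by alpha.
layers = {f"block{i}.attn.q_proj": np.random.randn(256, 256) for i in range(12)}
alphas = {name: spectral_alpha(W) for name, W in layers.items()}
for name, a in sorted(alphas.items(), key=lambda kv: kv[1]):
    print(f"{name}: alpha = {a:.2f}")
```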
- Transient compression waves propagate from early to late layers, with late layers eventually over-compressing past early layers.
- Persistent spectral gradients form an inverted-U shape in deeper models, with peaks shifting toward earlier layers as depth increases.
- Spectral-guided pruning outperforms Last-N heuristics by 1.1x–3.6x across GPT-2 and Pythia models (a minimal comparison sketch follows this list).
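To make the comparison concrete, here is a minimal sketch of the two layer-selection strategies: the Last-N baseline drops the final k blocks, while a spectral-guided rule drops the k blocks whose α marks them as least important. Which end of the α range signals low importance is an assumption in this sketch (the source states only that α correlates with layer importance), and the α values below are made up for illustration.

```python
import numpy as np

def last_n_prune(num_layers: int, k: int) -> list[int]:
    """Baseline heuristic: drop the last k transformer blocks."""
    return list(range(num_layers - k, num_layers))

def spectral_prune(alpha_by_layer: list[float], k: int,
                   low_alpha_less_important: bool = True) -> list[int]:
    """Drop the k layers whose spectral exponent marks them as least important.

    The sign convention (low alpha = low importance) is an assumption here,
    controlled by `low_alpha_less_important`.
    """
    order = np.argsort(alpha_by_layer)                  # indices sorted by ascending alpha
    chosen = order[:k] if low_alpha_less_important else order[-k:]
    return sorted(int(i) for i in chosen)

# Hypothetical 12-layer model with made-up, inverted-U-shaped alpha values.
alphas = [2.1, 2.4, 2.8, 3.0, 3.1, 3.0, 2.9, 2.7, 2.5, 2.3, 2.2, 2.6]
print("Last-N prunes:   ", last_n_prune(len(alphas), k=3))
print("Spectral prunes: ", spectral_prune(alphas, k=3))
```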
Why It Matters
This research uncovers fundamental training dynamics and offers practical pruning methods for optimizing large transformer models.