Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
Batch size shapes LLM representation geometry, new paper shows.
A new paper from researchers Andy Zeyi Liu, Elliot Paquette, and John Sous introduces Spectral Lens, a diagnostic framework that uses spectral measurements of activations and gradients to uncover hidden internal dynamics in large language model (LLM) training. By applying singular value decomposition (SVD) to activation covariance matrices and per-sample gradients across a controlled family of decoder-only models (12, 36, and 48 layers, based on modded NanoGPT), the team reports three empirical findings with practical implications. First, batch size acts as a latent determinant of representation geometry: runs that reach the same loss can settle into systematically distinct activation spectra. Second, the tail of the activation covariance spectrum, measured early in training, reliably forecasts downstream token efficiency, offering a potential early-stopping or hyperparameter-tuning signal. Third, tracking movement in the leading modes of the activation spectrum alongside gradient spectra characterizes changes in the underlying learning dynamics, letting practitioners separate improvements that stem from architectural changes (learning-side) from those that are primarily execution-side (e.g., throughput gains).
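For readers who want a concrete picture of the kind of measurement involved, here is a minimal sketch that computes the eigenvalue spectrum of one layer's activation covariance from activations collected with a PyTorch forward hook. The hook wiring and names such as `make_hook` and `spectra` are illustrative assumptions, not the paper's actual code.

```python
import torch

@torch.no_grad()
def activation_covariance_spectrum(acts: torch.Tensor) -> torch.Tensor:
    """Eigenvalue spectrum of the centered activation covariance.

    acts: (num_tokens, hidden_dim) activations from one layer, with
    batch and sequence dimensions flattened into rows.
    """
    acts = acts.float()
    acts = acts - acts.mean(dim=0, keepdim=True)       # center features
    cov = acts.T @ acts / (acts.shape[0] - 1)          # (d, d) covariance
    # The covariance is symmetric PSD, so its eigenvalues coincide with
    # its singular values; eigvalsh is cheaper than a full SVD here.
    return torch.linalg.eigvalsh(cov).flip(0)          # descending order

# Illustrative wiring: capture activations from a transformer block with a
# forward hook, then log the spectrum on a fixed diagnostic batch.
spectra = {}

def make_hook(name: str):
    def hook(module, inputs, output):
        # output assumed to be (batch, seq_len, hidden_dim); flatten tokens
        acts = output.detach().reshape(-1, output.shape[-1])
        spectra[name] = activation_covariance_spectrum(acts)
    return hook

# e.g. model.blocks[0].register_forward_hook(make_hook("block_0"))
```

Logging these spectra on the same held-out batch at fixed intervals is what makes runs with different batch sizes directly comparable.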
The findings hold consistently across all three model scales, suggesting broad applicability. Beyond the empirical results, the authors provide a mechanistic model linking activation covariance spectra to task-aligned feature learning, giving the observed correlations theoretical grounding. For AI engineers and researchers, Spectral Lens offers a practical, low-overhead diagnostic that can complement, or in some cases replace, standard loss curves and throughput metrics. The ability to predict token efficiency early and to disentangle architectural from execution improvements promises more informed decisions during LLM training, potentially saving compute and accelerating optimization. The paper is available on arXiv under the Machine Learning subject areas (stat.ML and cs.LG).
- Batch size acts as a latent determinant of representation geometry; different batch sizes at equal loss produce distinct activation spectra.
- Early activation covariance tail predicts downstream token efficiency across 12-, 36-, and 48-layer models (a candidate tail statistic is sketched after this list).
- Spectral movement in the leading activation modes, together with gradient spectra, separates learning-side (architectural) from execution-side (throughput) improvements (see the overlap measure below).
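In practice, the second and third findings reduce to scalar summaries of these spectra. The sketch below shows two plausible ones, again as assumptions rather than the paper's exact statistics: a tail-mass fraction (the kind of quantity that could serve as an early token-efficiency signal) and an overlap score between the leading covariance eigenspaces of two checkpoints (to quantify movement in the leading modes).

```python
import torch

@torch.no_grad()
def spectral_tail_mass(eigs: torch.Tensor, head_k: int = 32) -> float:
    """Fraction of total activation variance outside the top-k modes.

    NOTE: the paper's exact tail statistic is not given in this summary;
    tail mass is one simple scalar that can be logged early in training.
    """
    eigs = eigs.clamp_min(0.0)
    total = eigs.sum()
    return float((total - eigs[:head_k].sum()) / total)

@torch.no_grad()
def leading_mode_overlap(acts_a: torch.Tensor,
                         acts_b: torch.Tensor,
                         k: int = 16) -> float:
    """Mean squared principal cosine between the leading-k covariance
    eigenspaces of two activation snapshots: values near 1 mean the
    leading modes are static, values near 0 mean they have rotated away.
    """
    def top_k_eigvecs(acts: torch.Tensor) -> torch.Tensor:
        acts = acts.float()
        acts = acts - acts.mean(dim=0, keepdim=True)
        cov = acts.T @ acts / (acts.shape[0] - 1)
        _, vecs = torch.linalg.eigh(cov)   # eigenvalues ascending
        return vecs[:, -k:]                # leading-k eigenvectors
    m = top_k_eigvecs(acts_a).T @ top_k_eigvecs(acts_b)
    return float((m ** 2).sum() / k)
```

Comparing tail mass across early checkpoints of candidate runs, and overlap scores across consecutive checkpoints of a single run, mirrors the two diagnostic uses described above.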
Why It Matters
Spectral Lens predicts token efficiency early and disentangles architectural gains from throughput improvements, enabling smarter LLM training decisions.