Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
Study shows transformers learn in low-dimensional subspaces, with 68-83% of training variance captured by a single component.
Researcher Yongzhong Xu's paper analyzes the 'grokking' phenomenon, in which AI models suddenly generalize long after memorizing their training data. Applying PCA to transformer attention weights across training, the study finds that optimization occurs in low-dimensional subspaces, with 68-83% of the variance captured by a single component. The key discovery: curvature growth orthogonal to this subspace consistently precedes generalization by a power-law timescale. Causal experiments further show that motion along the learned subspace is necessary for grokking to occur.
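The core measurement can be sketched as PCA over a trajectory of weight checkpoints. The sketch below uses a synthetic trajectory (not the paper's data or code): flattened "attention weight" snapshots that drift mostly along one fixed direction plus small noise, then computes the fraction of variance captured by the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 training checkpoints of a 512-dim flattened weight vector.
n_checkpoints, n_params = 200, 512

# Synthetic low-dimensional training path: steady motion along one unit
# direction, plus small isotropic noise.
direction = rng.standard_normal(n_params)
direction /= np.linalg.norm(direction)
t = np.linspace(0.0, 1.0, n_checkpoints)[:, None]
W = t * direction[None, :] + 0.005 * rng.standard_normal((n_checkpoints, n_params))

# PCA via SVD of the centered checkpoint matrix.
Wc = W - W.mean(axis=0)
_, s, _ = np.linalg.svd(Wc, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(f"variance in first component: {explained[0]:.1%}")
```

By construction the first component dominates here; in the paper's measurements on real transformers, the analogous figure is the reported 68-83%.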
Why It Matters
Understanding the mechanics of grokking could lead to more efficient training and faster generalization in AI models.