The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
Study shows multiplication generalizes first, addition last, with weight decay acting as compression pressure.
New research provides fundamental insights into how AI models learn multiple tasks simultaneously, revealing consistent patterns in the mysterious 'grokking' phenomenon, in which generalization emerges long after training accuracy has saturated. Researcher Yongzhong Xu's paper 'The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure' extends geometric analysis to multi-task settings, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across systematic weight decay sweeps.
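The paper's exact architecture and hyperparameters are not reproduced here, but a minimal sketch of what a shared-trunk multi-task setup can look like is shown below: the modular tasks are enumerated exhaustively and routed through one trunk, with a dedicated task token between the operands. The modulus P, the model sizes, the task-token encoding, and the reading of "mod-sq" are all assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

P = 97  # modulus (an assumption; the paper's value is not given here)

def make_task_data(op: str):
    """Enumerate all (a, b) operand pairs and labels for one modular task."""
    a, b = torch.meshgrid(torch.arange(P), torch.arange(P), indexing="ij")
    a, b = a.flatten(), b.flatten()
    if op == "add":
        y = (a + b) % P
    elif op == "mul":
        y = (a * b) % P
    elif op == "sq":
        y = (a + b) ** 2 % P  # assumed reading of "mod-sq"
    else:
        raise ValueError(op)
    return a, b, y

class SharedTrunkTransformer(nn.Module):
    """One Transformer trunk shared across tasks; the active task is
    signalled by a dedicated token embedded between the operands."""
    def __init__(self, n_tasks: int, d_model: int = 128,
                 n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(P + n_tasks, d_model)  # operands + task tokens
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, P)  # logits over residues mod P

    def forward(self, a, b, task_id: int):
        task_tok = torch.full_like(a, P + task_id)
        x = self.embed(torch.stack([a, task_tok, b], dim=1))  # (B, 3, d_model)
        return self.head(self.trunk(x)[:, -1])  # predict from last position
```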
The study identifies five consistent phenomena. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with delays that are consistent across random seeds. (2) Universal integrability: optimization trajectories remain confined to low-dimensional execution manifolds, and commutator defects orthogonal to these manifolds reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, exposing distinct dynamical regimes and a sharp failure mode when decay is absent.
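One simple way to make the staggered order in (1) measurable is a per-task "grokking clock": track held-out accuracy for each task every epoch and record the first epoch it crosses a threshold. In the sketch below, `train_one_epoch`, `evaluate`, the loader names, and the 0.99 cutoff are hypothetical, not from the paper.

```python
def grokking_clock(model, train_loaders, test_loaders, num_epochs,
                   tasks=("mul", "sq", "add"), threshold=0.99):
    """Return the first epoch at which each task's held-out accuracy
    crosses `threshold`. `train_one_epoch` and `evaluate` are
    hypothetical helpers assumed to exist in the training harness."""
    grok_epoch = {task: None for task in tasks}
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loaders)
        for task in tasks:
            if grok_epoch[task] is None:
                acc = evaluate(model, test_loaders[task])
                if acc >= threshold:
                    grok_epoch[task] = epoch
    # Per the paper, the expected ordering is mul first, then sq, then add.
    return grok_epoch
```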
The remaining two phenomena concern solution geometry. (4) Fragile superposition: final solutions occupy only 4-8 principal trajectory directions yet are distributed across full-rank weights, and small perturbations destroy them. (5) Transverse fragility: removing less than 10% of the gradient components orthogonal to the execution manifold eliminates grokking, although dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization. Together, these findings support a dynamical picture in which multi-task grokking constructs compact superposition subspaces in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy along optimization pathways.
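The paper's exact projection protocol is not detailed here, but the two measurements in (4) and (5) can be sketched as follows: PCA over saved weight checkpoints yields the principal trajectory directions, and scaling down the gradient component orthogonal to that subspace implements a transverse-deletion intervention. The checkpointing scheme and the uniform scaling are assumptions.

```python
import torch

def trajectory_basis(checkpoints, k: int = 8):
    """Top-k principal directions of the flattened weight trajectory.
    `checkpoints` is assumed to be a list of state_dicts saved during
    training (needs at least k of them); k=8 matches the upper end of
    the 4-8 range reported in the paper."""
    flat = torch.stack([
        torch.cat([p.flatten() for p in sd.values()]) for sd in checkpoints
    ])
    flat = flat - flat.mean(dim=0)
    # Right singular vectors span the principal trajectory directions.
    _, _, Vh = torch.linalg.svd(flat, full_matrices=False)
    return Vh[:k]  # (k, n_params), orthonormal rows

def delete_transverse(grad_vec, basis, fraction: float = 0.1):
    """Remove `fraction` of the gradient component orthogonal to the
    trajectory subspace. Per the paper, deleting even <10% of this
    transverse part throughout training eliminates grokking."""
    tangent = basis.T @ (basis @ grad_vec)  # component inside the subspace
    transverse = grad_vec - tangent
    return tangent + (1.0 - fraction) * transverse
```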
- Multiplication generalizes first and addition last, in a consistent staggered grokking order across tasks
- Final solutions occupy only 4-8 principal directions yet are destroyed by small perturbations
- Weight decay systematically shapes grokking dynamics, revealing distinct regimes and a no-decay failure mode (see the sweep sketch below)
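As referenced in the last bullet, a weight decay sweep is the natural experiment behind the phase structure. A rough sketch follows; the grid values, learning rate, epoch budget, and the `train_and_log` helper are assumptions, not the paper's protocol.

```python
import torch

num_epochs = 20_000  # assumed budget; grokking typically needs long runs

# weight_decay=0.0 probes the reported sharp no-decay failure mode.
for wd in (0.0, 1e-3, 1e-2, 1e-1, 3e-1):
    model = SharedTrunkTransformer(n_tasks=2)  # dual-task: mod-add + mod-mul
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=wd)
    history = train_and_log(model, opt, num_epochs)  # hypothetical helper
    # Quantities the paper tracks against wd: grokking timescale,
    # curvature depth, reconstruction threshold, and defect lead.
```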
Why It Matters
Provides fundamental insight into multi-task learning dynamics that could inform training efficiency and model architecture design.