The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
Study shows multiplication generalizes first, addition last, with weight decay acting as compression pressure.
New research provides fundamental insights into how AI models learn multiple tasks simultaneously, revealing consistent patterns in the mysterious 'grokking' phenomenon, in which generalization emerges long after training accuracy has saturated. Researcher Yongzhong Xu's paper 'The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure' extends geometric analysis to multi-task settings, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across systematic weight decay sweeps.
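The paper's exact architecture and hyperparameters are not reproduced here, but a minimal sketch of what a shared-trunk multi-task setup can look like is shown below: the modular tasks are enumerated exhaustively and routed through one trunk, with a dedicated task token between the operands. The modulus P, the model sizes, the task-token encoding, and the reading of "mod-sq" are all assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

P = 97  # modulus (an assumption; the paper's value is not given here)

def make_task_data(op: str):
    """Enumerate all (a, b) operand pairs and labels for one modular task."""
    a, b = torch.meshgrid(torch.arange(P), torch.arange(P), indexing="ij")
    a, b = a.flatten(), b.flatten()
    if op == "add":
        y = (a + b) % P
    elif op == "mul":
        y = (a * b) % P
    elif op == "sq":
        y = (a + b) ** 2 % P  # assumed reading of "mod-sq"
    else:
        raise ValueError(op)
    return a, b, y

class SharedTrunkTransformer(nn.Module):
    """One Transformer trunk shared across tasks; the active task is
    signalled by a dedicated token embedded between the operands."""
    def __init__(self, n_tasks: int, d_model: int = 128,
                 n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(P + n_tasks, d_model)  # operands + task tokens
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, P)  # logits over residues mod P

    def forward(self, a, b, task_id: int):
        task_tok = torch.full_like(a, P + task_id)
        x = self.embed(torch.stack([a, task_tok, b], dim=1))  # (B, 3, d_model)
        return self.head(self.trunk(x)[:, -1])  # predict from last position
```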
The study identifies five consistent phenomena. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with delays that are consistent across random seeds. (2) Universal integrability: optimization trajectories remain confined to low-dimensional execution manifolds, and commutator defects orthogonal to these manifolds reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, exposing distinct dynamical regimes and a sharp failure mode when decay is absent.
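One simple way to make the staggered order in (1) measurable is a per-task "grokking clock": track held-out accuracy for each task every epoch and record the first epoch it crosses a threshold. In the sketch below, `train_one_epoch`, `evaluate`, the loader names, and the 0.99 cutoff are hypothetical, not from the paper.

```python
def grokking_clock(model, train_loaders, test_loaders, num_epochs,
                   tasks=("mul", "sq", "add"), threshold=0.99):
    """Return the first epoch at which each task's held-out accuracy
    crosses `threshold`. `train_one_epoch` and `evaluate` are
    hypothetical helpers assumed to exist in the training harness."""
    grok_epoch = {task: None for task in tasks}
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loaders)
        for task in tasks:
            if grok_epoch[task] is None:
                acc = evaluate(model, test_loaders[task])
                if acc >= threshold:
                    grok_epoch[task] = epoch
    # Per the paper, the expected ordering is mul first, then sq, then add.
    return grok_epoch
```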
The remaining two phenomena concern solution geometry. (4) Fragile superposition: final solutions occupy only 4-8 principal trajectory directions yet are distributed across full-rank weights, and small perturbations destroy them. (5) Transverse fragility: removing less than 10% of the gradient components orthogonal to the execution manifold eliminates grokking, although dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization. Together, these findings support a dynamical picture in which multi-task grokking constructs compact superposition subspaces in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy along optimization pathways.
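The paper's exact projection protocol is not detailed here, but the two measurements in (4) and (5) can be sketched as follows: PCA over saved weight checkpoints yields the principal trajectory directions, and scaling down the gradient component orthogonal to that subspace implements a transverse-deletion intervention. The checkpointing scheme and the uniform scaling are assumptions.

```python
import torch

def trajectory_basis(checkpoints, k: int = 8):
    """Top-k principal directions of the flattened weight trajectory.
    `checkpoints` is assumed to be a list of state_dicts saved during
    training (needs at least k of them); k=8 matches the upper end of
    the 4-8 range reported in the paper."""
    flat = torch.stack([
        torch.cat([p.flatten() for p in sd.values()]) for sd in checkpoints
    ])
    flat = flat - flat.mean(dim=0)
    # Right singular vectors span the principal trajectory directions.
    _, _, Vh = torch.linalg.svd(flat, full_matrices=False)
    return Vh[:k]  # (k, n_params), orthonormal rows

def delete_transverse(grad_vec, basis, fraction: float = 0.1):
    """Remove `fraction` of the gradient component orthogonal to the
    trajectory subspace. Per the paper, deleting even <10% of this
    transverse part throughout training eliminates grokking."""
    tangent = basis.T @ (basis @ grad_vec)  # component inside the subspace
    transverse = grad_vec - tangent
    return tangent + (1.0 - fraction) * transverse
```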
- Multiplication generalizes first and addition last, in a consistent staggered grokking order across tasks
- Final solutions occupy only 4-8 principal directions yet are destroyed by small perturbations
- Weight decay systematically shapes grokking dynamics, revealing distinct regimes and a no-decay failure mode (see the sweep sketch below)
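As referenced in the last bullet, a weight decay sweep is the natural experiment behind the phase structure. A rough sketch follows; the grid values, learning rate, epoch budget, and the `train_and_log` helper are assumptions, not the paper's protocol.

```python
import torch

num_epochs = 20_000  # assumed budget; grokking typically needs long runs

# weight_decay=0.0 probes the reported sharp no-decay failure mode.
for wd in (0.0, 1e-3, 1e-2, 1e-1, 3e-1):
    model = SharedTrunkTransformer(n_tasks=2)  # dual-task: mod-add + mod-mul
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=wd)
    history = train_and_log(model, opt, num_epochs)  # hypothetical helper
    # Quantities the paper tracks against wd: grokking timescale,
    # curvature depth, reconstruction threshold, and defect lead.
```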
Why It Matters
Provides fundamental insight into multi-task learning dynamics that could inform training efficiency and model architecture design.