[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo
New technique achieves zero failures across 300 training runs while dramatically reducing compute time.
Two independent researchers have developed a remarkably simple technique called 'cliptogrok' that dramatically accelerates the onset of 'grokking', the AI training phenomenon in which models suddenly generalize after long training. The method applies per-row ℓ₂ norm clipping to the decoder weights after every optimizer step: roughly five lines of code that add no memory overhead and remove the need for weight decay. In benchmark tests on the standard grokking setup (modular arithmetic with decoder-only transformers), the technique achieved substantial speedups: 66× faster than an AdamW baseline for a 2-layer model (422k parameters) and 18× faster for an 8-layer model (1.6M parameters).
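The authors' exact five-line implementation is in their repo; as a rough illustration only, here is a minimal PyTorch sketch of per-row ℓ₂ norm clipping applied after each AdamW step. The function name `clip_row_norms`, the `max_norm` value, and the toy embedding-plus-linear model are illustrative assumptions, not the authors' code or hyperparameters.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def clip_row_norms(weight: torch.Tensor, max_norm: float) -> None:
    # Rescale any row of `weight` whose l2 norm exceeds max_norm;
    # rows already within the bound are left untouched.
    row_norms = weight.norm(p=2, dim=1, keepdim=True)
    scale = (max_norm / (row_norms + 1e-12)).clamp(max=1.0)
    weight.mul_(scale)

# Toy stand-in for the modular-arithmetic setup: an embedding plus a linear "decoder".
p = 97
model = nn.Sequential(nn.Embedding(p, 64), nn.Flatten(), nn.Linear(64, p))
decoder = model[2]  # the layer whose rows get clipped
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)  # no weight decay needed

x = torch.randint(0, p, (256, 1))  # random tokens, just to make the loop runnable
y = torch.randint(0, p, (256,))

for step in range(100):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    clip_row_norms(decoder.weight, max_norm=1.0)  # the clip runs after every optimizer step
```

The `max_norm=1.0` bound here is arbitrary; in practice it would be a tuned hyperparameter.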
What makes these results particularly compelling is the consistency: zero failures across 300 random seeds, with interquartile-range reductions of 61-72% when using edge initialization. The researchers are transparent about scope limitations: all experiments so far use modular arithmetic, and they are currently testing a 277M-parameter LLM, where results may not transfer cleanly. The complete code, PDF paper, dataset, and training parameters are available on GitHub while they seek arXiv endorsement for a cs.LG submission.
This work addresses one of the most puzzling phenomena in modern AI training: models that suddenly 'grok', generalizing to unseen data only after extended training well past the point of memorization. The cliptogrok method could save significant computational resources if it proves effective beyond modular arithmetic tasks, though the researchers appropriately caution against overgeneralizing their current results.
- Achieves 18-66× speedup on grokking benchmarks with just 5 lines of code
- Zero failures across 300 random seeds with 61-72% IQR reduction
- No additional memory required; eliminates the need for weight decay
Why It Matters
Could dramatically reduce compute costs for training models that exhibit delayed generalization, potentially saving millions in GPU time if the technique transfers beyond modular arithmetic.