[P] Implementing Better PyTorch Schedulers
New tool replaces manual training loop hacks with stateless, picklable scheduling for any optimizer parameter
A PyTorch researcher has developed a scheduling suite that addresses a core limitation of PyTorch's built-in optimizer scheduling: standard schedulers only adjust the learning rate, while this system can schedule any optimizer hyperparameter, including momentum, betas, and weight decay, across different parameter groups. The project grew out of replicating complex training techniques from projects like KellerJordan/modded-nanogpt, where existing approaches proved inadequate for requirements such as NorMuon+Adam optimizers whose parameter groups each need distinct scheduling patterns.
The new scheduler suite offers a stateless design where possible, full picklability for checkpointing, and support for custom functions, presets, cyclic patterns, and per-group overrides. It removes the need for manual training-loop adjustments, where developers typically hardcode logic such as `if global_step > warmup_steps: group['weight_decay'] *= 0.99`. The code currently lives in the researcher's monorepo but could become a standalone package given sufficient interest, potentially serving as a standard tool for PyTorch developers working with complex optimization scenarios such as fine-tuning vision transformers with different weight decay for the feature extractor and the classifier.
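The post does not show the suite's actual API, but the core idea of replacing in-loop mutation with a stateless schedule can be sketched in plain PyTorch: the hyperparameter becomes a pure function of the step, so resuming from a checkpoint only requires the step counter rather than replaying every in-loop tweak. The `weight_decay_at` helper below is a hypothetical stand-in, not the project's interface.

```python
import torch

model = torch.nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def weight_decay_at(step: int, base: float = 0.1, warmup: int = 500) -> float:
    """Hold weight decay flat during warmup, then decay it by 1% per step (stateless)."""
    if step <= warmup:
        return base
    return base * (0.99 ** (step - warmup))

for step in range(1_000):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()
    loss.backward()

    # Instead of mutating group['weight_decay'] in place inside the loop,
    # write the scheduled value for the current step into every param group.
    for group in opt.param_groups:
        group["weight_decay"] = weight_decay_at(step)

    opt.step()
    opt.zero_grad()
```

Because the scheduled value depends only on the step, it pickles cleanly (when defined at module level) and never drifts out of sync with optimizer state across restarts.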
- Enables scheduling of any optimizer hyperparameter (LR, momentum, betas, weight decay), not just learning rates
- Eliminates manual training-loop hacks like `if global_step > warmup_steps: group['weight_decay'] *= 0.99`
- Supports per-group overrides for scenarios like different weight decay for feature extractors vs. classifiers (sketched below)
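A minimal sketch of the per-group scenario, assuming a standard PyTorch param-group split between a feature extractor and a classifier head; the `schedules` dict and `apply_schedules` helper are illustrative assumptions, not the suite's actual override API.

```python
import torch
import torch.nn as nn

# Toy stand-in for a fine-tuning setup: a pretrained-style backbone plus a fresh head,
# each in its own param group with its own weight-decay schedule.
backbone = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
classifier = nn.Linear(64, 10)

opt = torch.optim.AdamW([
    {"params": backbone.parameters(),   "name": "backbone",   "lr": 1e-5, "weight_decay": 0.05},
    {"params": classifier.parameters(), "name": "classifier", "lr": 1e-3, "weight_decay": 0.0},
])

schedules = {
    "backbone":   lambda step: 0.05 * max(0.0, 1.0 - step / 10_000),  # anneal backbone wd to zero
    "classifier": lambda step: 0.0,                                   # keep the head unregularized
}

def apply_schedules(optimizer: torch.optim.Optimizer, step: int) -> None:
    """Write each group's scheduled weight decay before optimizer.step()."""
    for group in optimizer.param_groups:
        group["weight_decay"] = schedules[group["name"]](step)

apply_schedules(opt, step=2_500)
print([(g["name"], g["weight_decay"]) for g in opt.param_groups])
# [('backbone', 0.0375), ('classifier', 0.0)]
```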
Why It Matters
Eliminates error-prone manual scheduling code in training loops, enabling more sophisticated optimization strategies for complex models.