HeLoCo improves distributed AI training by up to 22% with asynchronous low-communication updates
New method corrects stale pseudo-gradients in heterogeneous systems, beating synchronous training by up to 22.1%
Distributed low-communication (DiLoCo) training has become a go-to strategy for scaling large models across many workers: each worker performs multiple local steps before syncing pseudo-gradients, which reduces network bottlenecks. However, its asynchronous variant suffers from stale gradients, that is, updates computed from outdated model states. Stale updates can misalign with the current global optimization direction, especially when workers run on different hardware (system heterogeneity) and data distributions are non-IID. This misalignment degrades validation loss and slows convergence, particularly in decentralized production environments.
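As a rough illustration of the local-steps-then-sync pattern described above, here is a minimal pure-Python sketch. The function names, the toy quadratic loss, and the hyperparameters (`outer_lr`, `beta`) are assumptions made for this example. The outer update uses plain heavy-ball momentum for brevity, which may differ from the outer optimizer used in the experiments.

```python
def local_sgd(params, grad_fn, lr, num_steps):
    """Run num_steps of plain SGD on a private copy of params."""
    local = list(params)
    for _ in range(num_steps):
        grads = grad_fn(local)
        local = [p - lr * g for p, g in zip(local, grads)]
    return local


def pseudo_gradient(start_params, local_params):
    """Starting point minus result, so it can be applied like a gradient."""
    return [s - l for s, l in zip(start_params, local_params)]


def outer_step(global_params, pseudo_grad, momentum, outer_lr=0.7, beta=0.9):
    """Heavy-ball outer update (illustrative hyperparameters, not from the source)."""
    new_momentum = [beta * m + d for m, d in zip(momentum, pseudo_grad)]
    new_params = [p - outer_lr * m for p, m in zip(global_params, new_momentum)]
    return new_params, new_momentum


# Toy loss f(x) = 0.5 * sum(x_i^2), whose gradient is x itself.
def quadratic_grad(x):
    return list(x)


snapshot = [1.0, -2.0]  # global params the worker started from
local = local_sgd(snapshot, quadratic_grad, lr=0.1, num_steps=4)
delta = pseudo_gradient(snapshot, local)  # the only thing sent over the network
params, momentum = outer_step(snapshot, delta, momentum=[0.0, 0.0])
```

In the asynchronous setting, the global parameters may already have moved away from `snapshot` by the time `delta` arrives. Applying `delta` to those newer parameters is what makes the update stale.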
HeLoCo (Heterogeneous Low Communication) tackles this problem with a simple mechanism: it uses the outer momentum, which tracks the global optimization trajectory, as a reference for evaluating each incoming pseudo-gradient. Components that align with the current direction are preserved; conflicting components are corrected before the outer update.
In multilingual language model experiments with heterogeneous workers and non-IID data, HeLoCo consistently improved validation loss. It outperformed existing async DiLoCo baselines by up to 7.5% at a fixed token budget, surpassed async momentum look-ahead by up to 3.3% at a fixed wall-clock budget, and beat the synchronous baseline by up to 22.1% under severe heterogeneity. The authors also analyze how staleness, worker speed, and data distribution shape convergence, offering practical guidance for designing robust distributed training systems at scale.
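The summary does not spell out the exact correction rule, so the following is only one plausible reading of "preserve aligned components, correct conflicting ones": an element-wise sign check against the outer momentum. The function name `correct_pseudo_gradient` and the `damping` parameter are hypothetical choices for this sketch and should not be taken as the authors' formulation.

```python
def correct_pseudo_gradient(pseudo_grad, momentum, damping=0.5):
    """Keep components that agree in sign with the momentum; shrink the rest.

    Assumption for illustration: "conflict" means opposite sign, and
    "correction" means scaling by `damping`. The actual HeLoCo rule may differ.
    """
    corrected = []
    for d, m in zip(pseudo_grad, momentum):
        if d * m >= 0:
            # Aligned with the reference direction (or no reference yet): keep.
            corrected.append(d)
        else:
            # Points against the reference direction: damp before the outer update.
            corrected.append(damping * d)
    return corrected


stale_delta = [0.5, -0.2, 0.3]
outer_momentum = [1.0, 1.0, 0.0]
fixed = correct_pseudo_gradient(stale_delta, outer_momentum)
```

Here only the second component conflicts with the momentum, so it alone is scaled down. Setting `damping=0.0` would drop conflicting components entirely.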
- HeLoCo corrects stale pseudo-gradients by using outer momentum as a directional reference, preserving aligned updates and correcting conflicting ones.
- Outperforms the synchronous baseline by up to 22.1% under severe system heterogeneity and async DiLoCo baselines by up to 7.5% at a fixed token budget.
- Enables efficient multilingual language model training with non-IID data and diverse worker speeds, reducing communication overhead.
Why It Matters
Enables efficient large-scale AI training across diverse hardware and data distributions without sacrificing accuracy or convergence.