Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training
New research co-tunes communication and computation, achieving a 1.07-1.33x speedup over NCCL and AutoCCL.
A team of researchers has introduced Lagom, a system designed to optimize the overlap of communication with computation during distributed training of large language models (LLMs). This overlap is essential for efficiency but becomes particularly hard to exploit when computation itself is the bottleneck, since communication kernels then compete with compute kernels for the same GPU resources. Lagom addresses this by co-tuning communication parameters to dynamically balance resource usage, moving beyond traditional methods that tune communication and computation separately. The system's core innovation lies in intelligently scheduling tasks to prevent idle GPU time, a major source of wasted compute in large-scale AI training jobs.
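As background, the overlap Lagom optimizes is the standard asynchronous-collective pattern shown in the minimal PyTorch sketch below. This is the generic technique, not Lagom's own API; `overlapped_step`, its tensors, and the single-process demo group are illustrative assumptions. A gradient all-reduce is launched asynchronously so an independent matmul can run while the collective is in flight.

```python
import torch
import torch.distributed as dist

def overlapped_step(grad, activations, weight):
    """One step that overlaps a gradient all-reduce with a matmul."""
    # Launch the all-reduce asynchronously; the communication kernel
    # runs concurrently with the computation below instead of after it.
    handle = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)

    # Independent computation proceeds while communication is in flight.
    out = activations @ weight

    # Synchronize only when the reduced gradient is actually needed.
    handle.wait()
    return out

if __name__ == "__main__":
    # Single-process CPU demo group; real training would use the NCCL
    # backend across multiple GPUs (e.g. launched via torchrun).
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)
    out = overlapped_step(torch.randn(1024),
                          torch.randn(64, 128), torch.randn(128, 128))
    print(out.shape)
    dist.destroy_process_group()
```

The catch, and the problem Lagom targets, is that the overlapped kernels are not free: the collective and the matmul share GPU resources, so naive overlap can slow the compute side down.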
Technically, Lagom's power comes from a unified cost model that evaluates computation and communication holistically, paired with a priority-based search algorithm. This combination reduces the complexity of finding a good schedule from an exponential problem to a linear one, making real-time optimization feasible. In tests across various models and parallelization strategies (including data and tensor parallelism), Lagom achieved up to a 33% speedup on high-bandwidth GPU clusters and up to a 27% speedup on low-bandwidth setups compared to NCCL and AutoCCL. For AI labs training models like GPT-4, Llama 3, or Claude 3, this translates directly to faster iteration cycles, potentially reducing both time-to-market and substantial cloud-computing costs.
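The paper's exact algorithm isn't reproduced here, but a greedy, priority-ordered parameter search under a single cost model, the general shape the article describes, might look like the following sketch. Every name, constant, and the toy cost function are assumptions for illustration; the key property is that each collective is tuned once in priority order (linear) rather than jointly over all parameter combinations (exponential).

```python
from dataclasses import dataclass

@dataclass
class Collective:
    name: str
    bytes: float          # message size in bytes
    overlap_flops: float  # computation available to hide this collective

# Candidate channel counts: more channels speed up communication but
# occupy more SMs, slowing the overlapped computation.
CANDIDATE_CHANNELS = (1, 2, 4, 8, 16)

def unified_cost(coll, channels, bw_per_channel=10e9,
                 gpu_flops=100e12, sm_total=108):
    """Toy unified cost: step time is the max of communication time and
    compute time inflated by the SMs the communication kernel takes."""
    comm_time = coll.bytes / (channels * bw_per_channel)
    compute_slowdown = sm_total / (sm_total - 2 * channels)  # ~2 SMs/channel
    compute_time = (coll.overlap_flops / gpu_flops) * compute_slowdown
    return max(comm_time, compute_time)

def priority_search(collectives):
    """Tune each collective once, highest-impact first: linear in the
    number of collectives instead of exponential in their combinations."""
    plan = {}
    for coll in sorted(collectives, key=lambda c: c.bytes, reverse=True):
        plan[coll.name] = min(CANDIDATE_CHANNELS,
                              key=lambda ch: unified_cost(coll, ch))
    return plan

if __name__ == "__main__":
    comms = [Collective("grad_allreduce", 2e9, 5e12),
             Collective("tp_allgather", 5e8, 1e12)]
    print(priority_search(comms))  # e.g. {'grad_allreduce': 4, ...}
```

Because the cost model scores communication and computation together, the search naturally backs off communication aggressiveness exactly when the compute side would suffer, which is the co-tuning behavior the article highlights.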
- Achieves a 1.07-1.33x speedup on high-bandwidth clusters vs. NCCL/AutoCCL, a major gain for expensive AI training runs.
- Reduces optimization complexity from exponential to linear via a unified cost model and a priority-based search algorithm.
- Co-tunes communication parameters to balance GPU resources between communication and computation, specifically tackling the hard case where computation is the bottleneck (see the sketch after this list).
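For a concrete sense of what a "communication parameter" can be: NCCL exposes knobs such as the environment variables NCCL_MIN_NCHANNELS and NCCL_MAX_NCHANNELS, which bound how many channels (and hence SMs) its kernels may occupy. Those variables are real NCCL settings, but capping them as below to protect overlapped computation is our illustration of the general idea, not Lagom's published interface.

```python
import os

def cap_nccl_channels(max_channels):
    """Bound NCCL's channel count. Must run before the process group is
    created; fewer channels leave more SMs for overlapped compute kernels
    at the price of lower collective bandwidth."""
    os.environ["NCCL_MIN_NCHANNELS"] = "1"
    os.environ["NCCL_MAX_NCHANNELS"] = str(max_channels)

# A compute-bound phase favors a small cap; a communication-bound
# phase favors a larger one.
cap_nccl_channels(max_channels=4)
```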
Why It Matters
Cuts training time and cost for massive LLMs, accelerating AI development and making large-scale experimentation more feasible.