SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs
A new fault-tolerance system masks GPU failures with near-constant 2-3x compute overhead, avoiding costly full-job restarts.
A research team including Jin Lee, Zhonghao Chen, and Franck Cappello has introduced SPARe (Stacked Parallelism with Adaptive Reordering), a novel fault-tolerance framework designed for the extreme scale of modern LLM pretraining. As clusters grow beyond 100,000 GPUs, hardware failures become inevitable, and the time spent restarting training jobs can consume more wall-clock time than the actual computation. SPARe addresses this by fundamentally rethinking how redundancy is managed, allowing training to continue even when individual nodes fail, rather than forcing a full system restart. The system is specifically engineered for the 'restart-dominant regime' where existing checkpoint-and-restart methods become prohibitively expensive.
The technical innovation lies in SPARe's ability to 'mask' node failures during the critical gradient-synchronization phase: redundant data shards are stacked across different parallelism groups (data, tensor, or pipeline parallelism), and the execution flow is adaptively reordered around the failed nodes. This approach provides availability comparable to traditional full replication, but with a near-constant computational overhead of only 2 to 3 times the baseline regardless of redundancy level, whereas full replication's overhead grows linearly with the number of replicas. In simulations scaling up to 600,000 GPUs, SPARe demonstrated a 40-50% reduction in total time-to-train compared to conventional checkpoint-and-restart baselines. The framework also includes a joint optimization strategy for redundancy and checkpointing, derived from closed-form mathematical expressions, to systematically minimize expected training time. This represents a critical infrastructure advancement for training the next generation of trillion-parameter models.
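The paper's closed-form expressions are not reproduced in this summary, but the shape of the redundancy-vs-checkpointing trade-off can be illustrated with the classic Young/Daly checkpoint model. The Python sketch below is a toy sweep, not SPARe's actual optimizer: the cluster size, MTBF, checkpoint and restart costs, masked-failure fraction, and 2.5x overhead factor are all invented for illustration.

```python
import math

def expected_wallclock(work, mtbf, ckpt_cost, restart_cost):
    """Daly's exponential checkpoint/restart model: expected wall-clock
    hours to finish `work` hours of failure-free computation, checkpointing
    at the near-optimal Young/Daly interval tau = sqrt(2 * ckpt_cost * mtbf)."""
    tau = math.sqrt(2.0 * ckpt_cost * mtbf)
    segments = work / tau
    return (mtbf * math.exp(restart_cost / mtbf)
            * (math.exp((tau + ckpt_cost) / mtbf) - 1.0) * segments)

# Invented scenario: 100k GPUs with a 50,000-hour per-GPU MTBF gives a
# system-level failure roughly every 30 minutes -- the restart-dominant
# regime the paper targets.
system_mtbf = 50_000.0 / 100_000  # hours between failures, cluster-wide
work = 720.0                      # a 30-day run, if nothing ever failed

# (label, fraction of failures masked by redundancy, compute overhead).
# The 99% masking and 2.5x overhead are assumptions, not paper results.
configs = [("checkpoint/restart only", 0.00, 1.0),
           ("stacked redundancy",      0.99, 2.5)]

for label, masked_fraction, overhead in configs:
    # Masked failures never trigger a restart, so the *effective* MTBF
    # seen by the checkpoint/restart machinery grows accordingly.
    effective_mtbf = system_mtbf / (1.0 - masked_fraction + 1e-12)
    hours = expected_wallclock(work * overhead, effective_mtbf,
                               ckpt_cost=0.05, restart_cost=2.0)
    print(f"{label:>24}: {hours:>12,.0f} expected hours")
```

Even in this crude model, paying a constant compute overhead to mask nearly all failures can beat checkpoint/restart alone once the system-level MTBF drops below the cost of recovering from a failure, which is the regime the summary describes.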
- Cuts simulated time-to-train by 40-50% for LLM pretraining on clusters of up to 600k GPUs
- Maintains a near-constant 2-3x computation overhead, vs. traditional replication, whose overhead grows linearly with redundancy level
- Masks node failures during gradient sync via stacked data shards and adaptive reordering (sketched after this list)
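To make the masking idea concrete, here is a minimal, self-contained Python sketch under stated assumptions: it is not SPARe's algorithm, the round-robin placement and greedy rebalancing are hypothetical, and real gradient shards, parallelism groups, and collectives are reduced to plain Python objects.

```python
# Sketch of failure masking via stacked shards: each gradient shard is
# "stacked" onto multiple parallelism groups, so when a group loses a
# node, survivors reorder their work to cover every shard before the
# synchronization barrier. All names here are hypothetical.
from itertools import cycle

def assign_stacked_shards(num_shards, groups, redundancy=2):
    """Round-robin each shard onto `redundancy` distinct groups
    (requires redundancy <= len(groups))."""
    placement = {s: [] for s in range(num_shards)}
    ring = cycle(groups)
    for shard in range(num_shards):
        while len(placement[shard]) < redundancy:
            g = next(ring)
            if g not in placement[shard]:
                placement[shard].append(g)
    return placement

def reorder_after_failure(placement, failed_groups):
    """Adaptively reassign each shard to a surviving group holding a
    redundant copy; returns None if some shard has no survivor."""
    schedule = {}
    for shard, holders in placement.items():
        survivors = [g for g in holders if g not in failed_groups]
        if not survivors:
            return None  # unmaskable failure: fall back to checkpoint restart
        # Greedy load balancing: pick the least-loaded surviving holder.
        schedule[shard] = min(survivors, key=lambda g: sum(
            1 for s in schedule if schedule[s] == g))
    return schedule

placement = assign_stacked_shards(num_shards=8,
                                  groups=["dp0", "dp1", "dp2", "dp3"])
print(reorder_after_failure(placement, failed_groups={"dp1"}))
```

The point of the sketch is the control flow: as long as every shard has at least one surviving holder, the synchronization step still completes and the failure is masked; only when redundancy is exhausted does the system fall back to a checkpoint restart.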
Why It Matters
Enables faster, more reliable training of massive AI models, reducing costs and accelerating research breakthroughs.