LiveR slashes LLM training downtime from minutes to seconds on volatile GPUs
Live reconfiguration replaces checkpoint-restart, achieving 14-23x speedup.
Training large language models increasingly relies on cheap but volatile GPU capacity, such as spot instances or reclaimable cluster resources. When that capacity changes, traditional elastic training systems halt the job, save a checkpoint, rebuild the distributed runtime for the new topology, and restart. This stop-and-restart cycle costs minutes of downtime, spent on checkpoint I/O, CUDA initialization, and communicator setup.
LiveR, built on Megatron-LM and PyTorch, instead performs a live handoff between mixed-parallel training worlds. While the current world continues training, LiveR asynchronously prepares the target world: it bootstraps new workers in isolation and streams model state over high-bandwidth interconnects, reshaping it across the tensor, pipeline, and data parallel dimensions along the way. A lightweight commit then switches training to the new world with only a brief pause. In tests on multi-node GPU clusters, LiveR reduced reconfiguration downtime from minutes to seconds, achieved a 14-23x speedup over checkpoint/restart, and sustained up to 99% training goodput under volatile conditions, making low-cost GPU capacity far more practical for LLM training.
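The prepare-then-commit pattern described above can be illustrated with a toy, single-process sketch. This is not LiveR's implementation: the `World` class, `prepare_target`, and `run` are hypothetical names, a background thread stands in for asynchronous worker bootstrap, and a dictionary copy stands in for streaming and resharding model state. It only shows the control flow, in which training continues on the old world until the new one is ready and the switch happens at a step boundary.

```python
import threading
import time


class World:
    """Hypothetical stand-in for one distributed training configuration."""

    def __init__(self, name, tp, pp, dp):
        self.name = name
        self.tp, self.pp, self.dp = tp, pp, dp  # parallel layout (illustrative)
        self.state = None  # model/optimizer state; a plain dict in this toy

    def train_step(self, step):
        # Placeholder for a real forward/backward/optimizer step.
        self.state["step"] = step


def prepare_target(target, ready):
    """Runs in the background while the current world keeps training."""
    time.sleep(0.05)  # stands in for worker bootstrap and communicator setup
    target.state = {"step": None}  # buffers exist; contents arrive at commit
    ready.set()


def run(total_steps, resize_at):
    active = World("old", tp=4, pp=2, dp=2)
    active.state = {"step": 0}
    target = World("new", tp=2, pp=2, dp=2)
    ready = threading.Event()
    worker = None
    log = []

    for step in range(1, total_steps + 1):
        if step == resize_at:
            # Capacity change detected: start preparing without pausing training.
            worker = threading.Thread(target=prepare_target, args=(target, ready))
            worker.start()
        active.train_step(step)
        log.append((step, active.name))
        # Commit only at a step boundary, once the target is fully prepared.
        if ready.is_set() and active is not target:
            target.state = dict(active.state)  # toy copy in place of state streaming
            active = target
        time.sleep(0.01)  # stands in for the duration of one training step

    if worker is not None:
        worker.join()
    return log
```

Calling `run(total_steps=40, resize_at=3)` returns a log in which every step from 1 to 40 appears exactly once, the early steps run on the old world, and the later ones run on the new world. The exact step at which the switch lands depends on thread timing.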
- LiveR replaces checkpoint-based resize with a live, bounded-memory handoff between training worlds.
- Cuts reconfiguration downtime from minutes to seconds, achieving 14–23x speedup over checkpoint/restart.
- Sustains up to 99% training goodput on volatile GPU resources like spot instances.
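To see how per-event downtime feeds into goodput, here is a back-of-the-envelope sketch. It assumes goodput means the fraction of wall-clock time spent training rather than reconfiguring, which the summary does not define. The window length, event count, and 180-second restart cost are made-up inputs, not measurements from LiveR. Only the 14x factor comes from the reported range.

```python
def goodput(window_s, reconfigs, downtime_s):
    """Fraction of a wall-clock window spent training rather than reconfiguring."""
    lost = reconfigs * downtime_s
    if lost > window_s:
        raise ValueError("total downtime exceeds the window")
    return (window_s - lost) / window_s


# Hypothetical inputs, chosen only for illustration:
window = 3600.0       # one hour of wall-clock time
events = 6            # capacity changes during that hour
restart = 180.0       # assumed seconds lost per checkpoint-restart
live = restart / 14   # same event at the low end of the reported 14-23x range

restart_goodput = goodput(window, events, restart)
live_goodput = goodput(window, events, live)
print(f"checkpoint-restart: {restart_goodput:.1%}, live handoff: {live_goodput:.1%}")
```

Under these assumed inputs, goodput is about 70% with checkpoint-restart and about 98% with a 14x faster handoff. The gap widens as capacity changes become more frequent.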
Why It Matters
Makes cheap spot GPUs more viable for LLM training: capacity changes cost seconds rather than minutes, so jobs can sustain up to 99% training goodput on low-cost, volatile hardware.