LiveR slashes LLM training downtime from minutes to seconds on volatile GPUs

arXiv cs.DC May 22, 2026

⚡ Live reconfiguration replaces checkpoint/restart, achieving a 14–23x speedup.

Deep Dive

Training large language models increasingly relies on cheap but volatile GPU capacity, such as spot instances or reclaimable cluster resources. When capacity changes, traditional elastic training systems halt, save a checkpoint, rebuild the distributed runtime for the new topology, and restart. This stop-and-restart cycle costs minutes of downtime per reconfiguration, spent on checkpoint I/O, CUDA initialization, and communicator setup.

LiveR, built on Megatron-LM and PyTorch, avoids this by performing a live handoff between mixed-parallel training worlds. While the current world continues training, LiveR asynchronously prepares the target world: it bootstraps new workers in isolation and streams model state over high-bandwidth interconnects, reshaping it across the tensor, pipeline, and data parallel dimensions. A lightweight commit then switches training to the new world with only a brief pause. In tests on multi-node GPU clusters, LiveR reduced reconfiguration downtime from minutes to seconds, achieved a 14–23x speedup over checkpoint/restart, and sustained up to 99% training goodput under volatile conditions, making low-cost GPU capacity far more practical for LLM training.
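The prepare-then-commit idea can be illustrated with a toy sketch. It assumes a 1-D parameter split contiguously across tensor-parallel ranks, reshards it to a new degree through a small staging buffer, and then commits by swapping a single reference. The names `reshard`, `chunk`, and `active_world` are hypothetical and are not LiveR or Megatron-LM APIs. The sketch leaves out pipeline and data parallelism, networking, and how updates made during preparation are reconciled.

```python
from typing import List


def reshard(shards: List[List[float]], new_tp: int, chunk: int = 2) -> List[List[float]]:
    """Re-split a contiguously sharded 1-D parameter into new_tp shards.

    Values are copied through a staging buffer of at most `chunk` elements,
    a stand-in (assumption) for bounded-memory streaming.
    """
    total = sum(len(s) for s in shards)
    if new_tp <= 0 or total % new_tp != 0:
        raise ValueError("parameter length must divide evenly across new_tp ranks")
    per_rank = total // new_tp
    target: List[List[float]] = [[] for _ in range(new_tp)]
    pos = 0  # global index of the next element to place
    for shard in shards:
        for start in range(0, len(shard), chunk):
            staging = shard[start:start + chunk]  # bounded staging buffer
            for value in staging:
                target[pos // per_rank].append(value)
                pos += 1
    return target


# Hypothetical old world: tensor-parallel degree 2.
old_world = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
active_world = old_world

# Prepare the target world (degree 4) without touching the active one.
prepared = reshard(old_world, new_tp=4)

# Commit: a single reference swap makes the new world active.
active_world = prepared
```

The old shards are never modified during preparation, so training could keep reading them until the swap. A real system would also have to carry optimizer state and any steps taken while the target world was being built.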

Key Points

LiveR replaces checkpoint-based resize with a live, bounded-memory handoff between training worlds.
Cuts reconfiguration downtime from minutes to seconds, achieving 14–23x speedup over checkpoint/restart.
Sustains up to 99% training goodput on volatile GPU resources like spot instances.
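
To see why shorter downtime translates into higher goodput, here is a back-of-the-envelope sketch. It assumes a simplified definition of goodput (productive training time divided by wall-clock time) and one reconfiguration per interval. The interval and downtime values are hypothetical, chosen only for illustration, and are not measurements from the paper.

```python
def goodput(train_interval_s: float, downtime_s: float) -> float:
    """Fraction of wall-clock time spent training, under the simplified
    assumption of one reconfiguration per training interval."""
    if train_interval_s <= 0 or downtime_s < 0:
        raise ValueError("interval must be positive and downtime non-negative")
    return train_interval_s / (train_interval_s + downtime_s)


# Hypothetical scenario: capacity changes every 30 minutes of training.
interval_s = 30 * 60
restart_goodput = goodput(interval_s, downtime_s=180)  # minutes-scale downtime
live_goodput = goodput(interval_s, downtime_s=10)      # seconds-scale downtime

print(f"checkpoint/restart: {restart_goodput:.3f}")
print(f"live handoff:       {live_goodput:.3f}")
```

Under these assumed numbers, goodput rises from roughly 0.91 to roughly 0.99. The gap widens as capacity changes become more frequent, which is the regime spot instances tend to produce.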

Why It Matters

Makes cheap, volatile spot GPUs far more practical for LLM training, lowering cost while sustaining up to 99% training goodput.
