Research & Papers

ReCoVer: Resilient LLM pre-training survives 256 GPU failures with 2.23× the throughput of checkpoint-restart

New system keeps training trajectory intact even after losing half the GPU cluster.

Deep Dive

A new preprint from researchers at multiple institutions introduces ReCoVer, a resilient pre-training system for large language models that can absorb large-scale hardware failures without derailing the training trajectory. Existing frameworks either restrict which parallelism schemes can be used or let gradients drift after a failure. ReCoVer instead guarantees that each iteration produces gradients stochastically equivalent to those of a failure-free run by keeping the number of microbatches per iteration constant, so every step still covers the same amount of data no matter how many GPUs remain. The system is built on three decoupled protocol layers: fault-tolerant collectives that isolate failures between replicas, in-step fine-grained recovery that prevents gradient corruption, and a versatile-workload policy that dynamically reassigns microbatch quotas to surviving GPUs. The design acts as a drop-in substrate for both 3D parallelism and Hybrid Sharded Data Parallel (HSDP).
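The quota reassignment idea can be sketched in a few lines of Python. This is a minimal illustration only, not ReCoVer's implementation: the function name `reassign_quotas`, the replica labels, and the even-split rule are all assumptions made for the example, and the preprint's actual policy may weigh replicas differently. What it shows is the invariant the summary describes: the per-iteration microbatch total stays fixed while the per-replica share grows after a failure.

```python
from typing import Dict, List


def reassign_quotas(total_microbatches: int, survivors: List[str]) -> Dict[str, int]:
    """Split a fixed microbatch budget as evenly as possible across survivors.

    Hypothetical helper for illustration; not part of any ReCoVer API.
    """
    if not survivors:
        raise ValueError("no surviving replicas left to take the work")
    base, extra = divmod(total_microbatches, len(survivors))
    # The first `extra` replicas take one more microbatch so the sum is exact.
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(survivors)}


# Assumed setup: 16 microbatches per iteration over 4 data-parallel replicas.
replicas = ["r0", "r1", "r2", "r3"]
before = reassign_quotas(16, replicas)

# Replica r2 fails; the remaining three absorb its share.
after = reassign_quotas(16, [r for r in replicas if r != "r2"])

print(before)  # {'r0': 4, 'r1': 4, 'r2': 4, 'r3': 4}
print(after)   # {'r0': 6, 'r1': 5, 'r3': 5}
```

Because both assignments sum to 16, the gradient is still averaged over the same number of microbatches each step. The cost shows up as more work per surviving replica, not as a smaller batch.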

In end-to-end pre-training tests on up to 512 GPUs, ReCoVer preserved the exact training trajectory of a failure-free reference run even after losing 256 GPUs over the course of the run. Compared with traditional checkpoint-and-restart, ReCoVer achieved 2.23× higher effective throughput after successive failures and had processed 74.9% more tokens by the 234 GPU-hour mark, a gap that widens as training runs get longer. By avoiding the overhead of frequent checkpointing and restarts, ReCoVer offers a practical path to fault-tolerant LLM training at scale, addressing one of the costliest pain points in distributed AI infrastructure.

Key Points
  • ReCoVer maintains stochastic gradient equivalence to failure-free runs by keeping the number of microbatches per iteration constant.
  • Achieves 2.23× higher effective throughput than checkpoint-restart on up to 512 GPUs, even after losing 256 GPUs.
  • Works as a drop-in substrate for both 3D parallelism and Hybrid Sharded Data Parallel (HSDP).

Why It Matters

By training through GPU failures instead of stopping to restart from a checkpoint, ReCoVer cuts the downtime that makes large-scale LLM pre-training slow and expensive.