Research & Papers

TierCheck slashes LLM checkpointing to under 10s with three-tier fault tolerance

New tiered checkpointing system cuts state-saving overhead and recovery time by orders of magnitude.

Deep Dive

TierCheck, developed by a multi-institution research team that includes Patrick P.C. Lee, addresses a critical pain point in large-scale LLM training: frequent interruptions caused by hardware failures that range from individual GPU crashes to cluster-wide outages. Traditional checkpointing systems use a single storage tier, which forces a trade-off between the overhead of saving state and the speed of recovering from a failure. TierCheck sidesteps this trade-off with a cluster-aware, three-tier architecture.

Lightweight differential checkpoints are written to local and peer memory, enabling fast localized recovery when a single node fails. Meanwhile, heavyweight base checkpoints are asynchronously migrated to remote persistent storage so that training can recover from catastrophic failures. The system enforces strict global consistency across all tiers without stalling the training pipeline. In tests with models up to 40 billion parameters, TierCheck achieves end-to-end checkpointing times under 10 seconds, making high-frequency checkpointing practical for the first time. The design thereby meets both core requirements: low overhead during normal operation and fast recovery when failures inevitably occur.
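The tier split described above can be illustrated with a toy, in-process sketch. This is not TierCheck's implementation: the class names, the `recover` failure labels, and the choice to keep a base copy in every tier are assumptions made for illustration, and the remote write is synchronous here where the real system migrates it asynchronously.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = Dict[str, float]  # toy stand-in for model and optimizer tensors


def diff_state(old: State, new: State) -> State:
    """Return only the entries of `new` that differ from `old`."""
    return {k: v for k, v in new.items() if old.get(k) != v}


@dataclass
class Tier:
    """One storage tier holding a base checkpoint plus ordered diffs."""
    base: State = field(default_factory=dict)
    base_step: int = -1
    diffs: List[Tuple[int, State]] = field(default_factory=list)

    def restore(self) -> Tuple[int, State]:
        """Rebuild state by replaying diffs on top of the base."""
        state, step = dict(self.base), self.base_step
        for diff_step, diff in self.diffs:
            state.update(diff)
            step = diff_step
        return step, state


class TieredCheckpointer:
    """Hypothetical three-tier checkpointer: local, peer, remote."""

    def __init__(self) -> None:
        self.local, self.peer, self.remote = Tier(), Tier(), Tier()
        self._last: State = {}

    def save_base(self, step: int, state: State) -> None:
        # Heavyweight full copy; assumed here to land in every tier.
        for tier in (self.local, self.peer, self.remote):
            tier.base, tier.base_step, tier.diffs = dict(state), step, []
        self._last = dict(state)

    def save_diff(self, step: int, state: State) -> None:
        # Lightweight delta; goes only to the two memory tiers.
        delta = diff_state(self._last, state)
        for tier in (self.local, self.peer):
            tier.diffs.append((step, delta))
        self._last = dict(state)

    def recover(self, failure: str) -> Tuple[int, State]:
        # Assumed mapping from failure scope to the tier that survives it.
        tiers = {"process": self.local, "node": self.peer, "cluster": self.remote}
        if failure not in tiers:
            raise ValueError(f"unknown failure type: {failure}")
        return tiers[failure].restore()


ckpt = TieredCheckpointer()
ckpt.save_base(0, {"w": 1.0, "b": 0.0})
ckpt.save_diff(1, {"w": 1.5, "b": 0.0})
ckpt.save_diff(2, {"w": 1.5, "b": 0.2})
print(ckpt.recover("node"))     # latest step, rebuilt from peer memory
print(ckpt.recover("cluster"))  # falls back to the last base checkpoint
```

In this sketch a single-node failure loses no progress because the peer tier holds every diff, while a cluster-wide failure rolls back to the most recent base checkpoint in remote storage.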

Key Points
  • Three-tier storage design: local memory and peer memory for fast localized recovery, plus remote persistent storage for catastrophic failures.
  • End-to-end checkpointing time under 10 seconds for models up to 40B parameters.
  • Maintains global consistency across tiers without stalling training, enabling high-frequency checkpointing.
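Why a shorter checkpoint time makes frequent checkpointing practical can be made concrete with Young's classic first-order approximation, which puts the optimal checkpoint interval at sqrt(2 × checkpoint cost × mean time between failures). The sketch below is not taken from the TierCheck work. The one-hour MTBF and the 600-second baseline are hypothetical inputs, and the model treats checkpoint cost as blocking time, which overstates the cost for a system that overlaps checkpointing with training.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(checkpoint_cost_s: float, interval_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints."""
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)


MTBF_S = 3600.0  # hypothetical: one failure per hour across the cluster

for cost_s in (600.0, 10.0):  # hypothetical slow baseline vs. a 10 s checkpoint
    interval = young_interval(cost_s, MTBF_S)
    share = overhead_fraction(cost_s, interval)
    print(f"cost={cost_s:.0f}s interval={interval:.0f}s overhead={share:.1%}")
```

Under these assumptions, the cheaper checkpoint yields both a shorter optimal interval, so less work is lost per failure, and a smaller share of time spent checkpointing.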

Why It Matters

TierCheck makes LLM training dramatically more resilient to failures, reducing costly downtime and wasted compute.