GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
Erasure coding protects KV caches, cutting recovery time for million-token tasks
As LLMs power million-token, agent-based applications, individual tasks run long enough that a hardware or software fault becomes likely before they finish. A single failure can waste costly compute and degrade the user experience. The key-value (KV) cache, which grows with sequence length, is the most vulnerable component in distributed serving systems. Existing fault-tolerance methods are inefficient: full replication keeps a second copy of the cache, and recomputation repeats work that was already done.
GhostServe, presented at MLSys 2026 by researchers Shakya Jayakody, Youpeng Zhao, and Chinmay Dhanraj Nehate, tackles this with a lightweight checkpointing approach. It operates "in the shadow": as the KV cache streams in, erasure coding generates parity shards that are stored in host memory. On device failure, GhostServe reconstructs the lost cache quickly, allowing inference to resume. In evaluations, it reduced checkpointing latency by up to 2.7x and recovery latency by 2.1x for a single batch, and improved median response latency by 1.2x under failures. The authors position this as a step toward high-availability, cost-effective LLM serving at scale.
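To make the parity-shard idea concrete, here is a minimal sketch of the simplest erasure code, a single XOR parity over equal-sized shards. It is an illustration only: the summary does not say which code GhostServe uses, and the names `make_parity` and `recover_shard`, the four-device layout, and the byte-string shards are all assumptions made for this example.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards: list[bytes]) -> bytes:
    """Build one parity shard by XOR-ing all data shards together.

    Hypothetical helper: stands in for the parity that would be
    kept in host memory, away from the devices holding the data.
    """
    if len({len(s) for s in shards}) != 1:
        raise ValueError("need at least one shard, all of equal length")
    return reduce(xor_bytes, shards)


def recover_shard(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing shard from the survivors and the parity.

    XOR-ing the parity with every surviving shard cancels them out
    and leaves exactly the lost shard. Tolerates one lost shard only.
    """
    return reduce(xor_bytes, survivors, parity)


# Toy "KV cache" split across four devices (assumed layout).
kv_shards = [bytes(d * 16 + i for i in range(8)) for d in range(4)]
host_parity = make_parity(kv_shards)

# Simulate losing device 2, then rebuild its shard.
failed = 2
survivors = [s for i, s in enumerate(kv_shards) if i != failed]
rebuilt = recover_shard(survivors, host_parity)
assert rebuilt == kv_shards[failed]
```

In this sketch the extra storage is one shard for every four data shards (25% overhead), whereas a full replica would double the footprint. A single XOR parity survives only one lost shard; codes such as Reed-Solomon generalize the same idea to multiple failures.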
- Protects streaming KV cache using erasure-coded parity shards stored in host memory
- Reduces checkpointing latency by up to 2.7x and recovery latency by 2.1x vs. existing methods
- Improves median response latency by 1.2x under system failures
- Targets million-token, agent-based LLM inference where long-running tasks are common
Why It Matters
Enables cost-effective, high-availability LLM serving for long-running agent tasks without expensive state replication.