GhostServe slashes LLM fault recovery with 2.7x faster checkpointing

arXiv cs.DC May 05, 2026

⚡ Erasure coding protects KV caches, cutting recovery time for million-token tasks

Deep Dive

As LLMs power million-token, agent-based applications, the long-running nature of these tasks makes them highly susceptible to hardware and software faults. A single failure can waste costly compute and damage user experience. The key-value (KV) cache, which grows with sequence length, is the most vulnerable component in distributed serving systems. Existing fault-tolerance methods are inefficient: full replication keeps a second copy of the cache, and recomputation repeats work that was already done.
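To make "grows with sequence length" concrete, the sketch below estimates per-sequence KV cache size with the standard formula: one key and one value vector per token, per layer, per KV head. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen for illustration and does not come from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Per-sequence KV cache size in bytes.

    The leading 2 accounts for storing both a key and a value vector
    for every token, in every layer, for every KV head.
    """
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, used only to show the scale at one million tokens.
size = kv_cache_bytes(seq_len=1_000_000, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")
```

Under these assumed dimensions, a single million-token sequence holds roughly 122 GiB of cache state, and the size scales linearly with token count, which is why losing it to a fault is expensive.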

GhostServe, presented at MLSys 2026 by researchers Shakya Jayakody, Youpeng Zhao, and Chinmay Dhanraj Nehate, tackles this with a novel checkpointing approach. It operates "in the shadow": as the KV cache streams, GhostServe uses erasure coding to generate parity shards and stores them in host memory. On device failure, it reconstructs the lost cache from those shards so that inference can resume. In evaluations, it reduced checkpointing latency by up to 2.7x and recovery latency by 2.1x for a single batch, and improved median response latency by 1.2x under failures. The authors position this as a step toward high-availability, cost-effective LLM serving at scale.
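The parity idea can be illustrated with the simplest erasure code, single XOR parity (as in RAID 5). This is a minimal sketch, not GhostServe's implementation: the summary does not say which code the system uses, and the function names, shard layout, and byte values here are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards):
    """XOR all equal-length shards into one parity shard."""
    return reduce(xor_bytes, shards)


def recover(shards, parity):
    """Rebuild the single missing shard (marked None) from survivors and parity."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one lost shard")
    survivors = [s for s in shards if s is not None]
    # XOR of the parity with every surviving shard cancels them out,
    # leaving only the lost shard.
    rebuilt = reduce(xor_bytes, survivors, parity)
    idx = missing[0]
    return shards[:idx] + [rebuilt] + shards[idx + 1:]


# Four hypothetical devices, each holding an 8-byte slice of KV cache.
device_shards = [bytes([v] * 8) for v in (17, 42, 99, 200)]
host_parity = make_parity(device_shards)  # stand-in for the copy kept in host memory

# Simulate losing device 2, then rebuild its slice.
damaged = list(device_shards)
damaged[2] = None
restored = recover(damaged, host_parity)
assert restored == device_shards
```

With four data shards, the parity costs one extra shard of storage, compared with four extra shards for full replication. Single XOR parity tolerates only one lost shard; codes such as Reed-Solomon generalize this to multiple simultaneous losses.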

Key Points

Protects streaming KV cache using erasure-coded parity shards stored in host memory
Reduces checkpointing latency by up to 2.7x and recovery latency by 2.1x vs. existing methods
Achieves 1.2x median response latency improvement under system failures
Targets million-token, agent-based LLM inference where long-running tasks are common

Why It Matters

Enables cost-effective, high-availability LLM serving for long-running agent tasks without expensive state replication.
