ObjectCache uses S3 storage for LLM KV cache, adding only 5.6% latency
Conventional wisdom holds that object storage is too slow for real-time AI inference. A new system called ObjectCache reports just 5.6% latency overhead by overlapping data transfer with GPU computation, challenging the assumption that serving large models requires keeping the entire KV cache in expensive GPU memory.
Large language models have a voracious appetite for memory during inference. Each request's key-value (KV) cache can consume gigabytes of GPU memory, especially for long contexts, which forces providers to choose between buying expensive high-bandwidth memory and limiting context lengths. ObjectCache, a system detailed in a recent arXiv paper, takes a radical step: it offloads the prefix KV cache to S3-compatible object storage, the same cheap, elastic storage used for backups and archives. By redesigning the data transfer pipeline to overlap with GPU compute, the system reports only a 5.6% latency increase for 64K-token contexts, and 56–75 milliseconds for 4K-token prompts, a range that many interactive applications can tolerate.
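To make the "gigabytes per request" claim concrete, here is a back-of-the-envelope sketch of KV cache size. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an assumed 7B-class configuration chosen for illustration; it is not taken from the ObjectCache paper, and models that use grouped-query attention would need far less.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


GIB = 1024 ** 3

# Assumed, illustrative model shape (not from the paper).
for tokens in (4096, 65536):
    size = kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128)
    print(f"{tokens:>6} tokens: {size / GIB:.1f} GiB")
```

Under these assumptions each token costs 512 KiB of cache, so a 4K-token prompt needs about 2 GiB and a 64K-token context about 32 GiB, which is why long contexts quickly exhaust GPU memory.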
This approach enters a landscape already crowded with optimization strategies. vLLM's PagedAttention manages the KV cache like virtual memory, reducing fragmentation and enabling larger batch sizes within GPU memory. FlexGen offloads the cache to CPU memory and SSDs, using pipelining to hide latency. CacheGen compresses the cache so that it is cheaper to store and transfer. ObjectCache distinguishes itself by targeting network-attached object storage, which offers effectively unbounded capacity at a far lower cost per gigabyte than GPU memory or even local SSDs. The key insight is that streaming data in parallel with GPU computation, fetching the next chunks while the model processes the previous ones, can mask much of the network latency. If the reported results hold up, this puts the cost benefits of cheap cloud storage within practical reach for inference workloads.
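The overlap idea can be illustrated with a minimal prefetching sketch. This is not ObjectCache's implementation: `fetch_chunk` and `compute_on` are hypothetical stand-ins that simulate a storage read and GPU work with `time.sleep`, and the 20 ms delays are arbitrary. The point is only that fetching chunk i+1 while chunk i is being processed hides most of the fetch time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(i, delay=0.02):
    """Hypothetical stand-in for reading KV chunk i from object storage."""
    time.sleep(delay)  # simulated network latency
    return f"kv-chunk-{i}"


def compute_on(chunk, delay=0.02):
    """Hypothetical stand-in for GPU work that consumes one chunk."""
    time.sleep(delay)  # simulated compute time
    return chunk.upper()


def run_serial(num_chunks):
    """Fetch, then compute, one chunk at a time (no overlap)."""
    return [compute_on(fetch_chunk(i)) for i in range(num_chunks)]


def run_overlapped(num_chunks):
    """Prefetch chunk i+1 in a background thread while computing on chunk i."""
    results = []
    if num_chunks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_chunk, 0)
        for i in range(num_chunks):
            chunk = pending.result()  # block until chunk i has arrived
            if i + 1 < num_chunks:
                # Start the next fetch before computing on the current chunk.
                pending = pool.submit(fetch_chunk, i + 1)
            results.append(compute_on(chunk))
    return results


if __name__ == "__main__":
    for fn in (run_serial, run_overlapped):
        start = time.perf_counter()
        fn(8)
        print(f"{fn.__name__}: {time.perf_counter() - start:.2f}s")
```

When fetch and compute take similar time per chunk, the overlapped loop pays for roughly one exposed fetch instead of one per chunk. If fetches are slower than compute, the pipeline stalls on `pending.result()`, which is where network tail latency would show up.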
The implications extend beyond a single paper. If production systems adopt ObjectCache's principles, the cost per token for long-context inference could drop significantly. The LLM inference market is projected to exceed $10 billion by 2025, with memory costs accounting for a large share. However, several less obvious risks temper the excitement. S3 egress costs can be substantial, especially for high-throughput services that read cached prefixes frequently, and could erase the savings from cheaper storage. Network tail latency, inherent to any cloud storage, could cause unpredictable slowdowns for latency-sensitive applications. Storing prompt prefixes in shared object storage also raises data privacy concerns, since the cached state is derived from user prompts and tenants might share underlying storage infrastructure. Finally, the system's reliance on careful prefetching and pipelining makes it vulnerable to bursty request patterns or sudden spikes in context length.
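The egress concern is easy to check with simple arithmetic. The sketch below is a hypothetical cost model: the function name, the workload figures, and every price are placeholders invented for illustration, not any provider's actual rates. Substitute real numbers for your own region and provider before drawing conclusions.

```python
def monthly_offload_cost(stored_gb, reads_per_month, gb_per_read,
                         storage_price_per_gb_month,
                         transfer_price_per_gb,
                         price_per_1k_requests,
                         requests_per_read=1):
    """Rough monthly bill for serving KV cache reads from object storage."""
    storage = stored_gb * storage_price_per_gb_month
    transfer = reads_per_month * gb_per_read * transfer_price_per_gb
    requests = reads_per_month * requests_per_read / 1000 * price_per_1k_requests
    return {
        "storage": storage,
        "transfer": transfer,
        "requests": requests,
        "total": storage + transfer + requests,
    }


# Placeholder workload and prices (assumptions, not real pricing).
workload = dict(stored_gb=1000, reads_per_month=100_000, gb_per_read=2,
                storage_price_per_gb_month=0.02, price_per_1k_requests=0.0004)

free_transfer = monthly_offload_cost(transfer_price_per_gb=0.0, **workload)
paid_transfer = monthly_offload_cost(transfer_price_per_gb=0.05, **workload)

print(f"no transfer fee:   {free_transfer['total']:.2f}")
print(f"with transfer fee: {paid_transfer['total']:.2f}")
```

With these placeholder inputs the storage line item stays small, while any per-gigabyte transfer charge scales with read volume and dominates the total. Whether such a charge applies at all depends on where the GPUs sit relative to the storage.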
ObjectCache is a pragmatic step toward tiered storage for LLMs, but its real-world viability will hinge on production S3 performance guarantees, cost modeling that includes egress fees, and the ability to absorb worst-case latency spikes. For now, it offers a compelling blueprint for trading marginal latency for dramatic cost savings in scenarios where real-time interactivity is not paramount, such as batch processing, offline analysis, or applications with relaxed latency budgets.
- ObjectCache uses S3-compatible object storage for LLM KV cache, achieving only 5.6% latency overhead for 64K contexts by overlapping data transfer with GPU computation.
- Compared to vLLM (GPU memory), FlexGen (local SSD/CPU), and CacheGen (compression), ObjectCache offers cheaper scalability but introduces dependency on network reliability and egress costs.
- Hidden risks include S3 egress fees potentially negating savings, tail latency variability for real-time use, and data privacy concerns from using shared object storage for prompt prefixes.
Why It Matters
ObjectCache challenges the high-cost memory orthodoxy for LLM inference, opening the door to dramatically cheaper serving through cloud object storage.