ObjectCache uses S3 storage for LLM KV cache, adding only 5.6% latency

arXiv cs.DC May 25, 2026

⚡Conventional wisdom holds that object storage is too slow for real-time AI inference, but a new system called ObjectCache adds just 5.6% latency for 64K-token contexts by overlapping data transfer with GPU computation, challenging the assumption that low-latency serving requires keeping the KV cache in expensive memory.

Deep Dive

ObjectCache, a new system from researchers at ETH Zurich and Hewlett Packard Labs, tackles the growing cost of KV cache memory in large language model (LLM) serving. Current servers store the prefix KV cache in remote DRAM pools to keep time-to-first-token (TTFT) low, but this requires expensive memory and increases cluster size. ObjectCache instead stores the cache in S3-compatible object storage (e.g., Ceph RGW, DAOS), which is far cheaper and scales to much larger capacities. The key innovation is a co-designed storage protocol and transfer schedule that ensures object storage delivers KV cache data in exactly the order in which the GPU's transformer layers consume it. Because each layer's data arrives just before that layer runs, the transfer for later layers can overlap with compute on earlier ones, and across concurrent requests, which hides most of the storage latency.
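The overlap described above can be illustrated with a small timing model. This is a minimal sketch, not ObjectCache's actual scheduler: the function names and the per-layer times are hypothetical, and it assumes transfers run back to back in layer order while each layer's compute waits for both its own KV data and the previous layer's compute.

```python
from typing import Sequence


def ttft_serial(transfer: Sequence[float], compute: Sequence[float]) -> float:
    """Fetch the whole KV cache first, then run every layer (no overlap)."""
    return sum(transfer) + sum(compute)


def ttft_pipelined(transfer: Sequence[float], compute: Sequence[float]) -> float:
    """Layerwise overlap: layer i starts once its KV data has arrived
    and layer i-1 has finished computing."""
    arrival = 0.0  # time at which the current layer's KV data is fully received
    done = 0.0     # time at which the previous layer finished computing
    for t, c in zip(transfer, compute):
        arrival += t                   # transfers are issued back to back, in layer order
        done = max(done, arrival) + c  # wait for data and for the previous layer
    return done


# Hypothetical per-layer times in milliseconds (illustrative, not from the paper).
transfer_ms = [2.0] * 4
compute_ms = [3.0] * 4

print(ttft_serial(transfer_ms, compute_ms))     # 20.0
print(ttft_pipelined(transfer_ms, compute_ms))  # 14.0, versus 12.0 with no transfer at all
```

In this toy model, when per-layer compute is at least as long as per-layer transfer, only the first layer's transfer is exposed. When transfer dominates, the pipeline becomes transfer-bound and the overhead grows, which is consistent with shorter contexts being harder to hide.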

The prototype runs on a 100 Gbps RoCE cluster and uses NIXL, an inference transfer library that abstracts the storage backend. For the 64K-token contexts common in today's systems, ObjectCache adds only 5.6% latency over the ideal local DRAM baseline. For shorter 4K contexts, where less compute is available to hide transfer time, it adds 56–75 ms over the optimal local layerwise baseline. Under shared bandwidth caps, its scheduler reduces added TTFT by 1.2–1.8x compared to equal bandwidth sharing. This is a practical step toward decoupling LLM serving cost from memory capacity, enabling smaller clusters and lower operational expenses.
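To get a feel for how much data such a transfer involves, the following back-of-the-envelope sketch estimates KV cache size and raw wire time at 100 Gbps. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumption chosen for illustration and is not taken from the paper; the result ignores protocol overhead, storage latency, and contention.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def transfer_seconds(tokens: int, bytes_per_token: int, link_gbps: float) -> float:
    """Ideal time to move the cache over a link of the given speed (no overhead)."""
    bytes_per_second = link_gbps * 1e9 / 8
    return tokens * bytes_per_token / bytes_per_second


# Hypothetical model configuration (an assumption, not from the source).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

for tokens in (4 * 1024, 64 * 1024):
    seconds = transfer_seconds(tokens, per_token, link_gbps=100.0)
    print(tokens, round(seconds * 1000, 1), "ms")
# 4096 tokens  -> about 42.9 ms
# 65536 tokens -> about 687.2 ms
```

Under these assumptions a 64K-token prefix is roughly 8 GiB of KV data, so the raw transfer alone takes hundreds of milliseconds. That is why delivering the data in layer order and overlapping it with compute matters more than the raw link speed.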

Key Points

ObjectCache uses S3-compatible object storage for LLM KV cache, achieving only 5.6% latency overhead for 64K contexts by overlapping data transfer with GPU computation.
Compared to vLLM (GPU memory), FlexGen (local SSD/CPU), and CacheGen (compression), ObjectCache offers cheaper scalability but introduces dependency on network reliability and egress costs.
Hidden risks include S3 egress fees potentially negating savings, tail latency variability for real-time use, and data privacy concerns from using shared object storage for prompt prefixes.

Why It Matters

ObjectCache challenges the high-cost memory orthodoxy for LLM inference, opening the door to dramatically cheaper serving through cloud object storage.
