Research & Papers

[R] ContextCache: Persistent KV Cache with Content-Hash Addressing — 29x TTFT speedup for tool-calling LLMs

New system eliminates redundant prefill, cutting time-to-first-token from 5.6 seconds to 200ms for LLM agents with 50 tools.

Deep Dive

A new research paper introduces ContextCache, a system designed to dramatically accelerate tool-calling large language models (LLMs) by eliminating redundant computation. In standard deployments, LLM agents that use tools must re-process the lengthy, static JSON schema definitions (the tool descriptions) with every user request; this prefill step dominates latency. ContextCache addresses this by persistently caching the Key-Value (KV) states generated during the first prefill of a given toolset. It indexes these cached states with a SHA-256 hash of the sorted schema text, so subsequent requests with the same toolset skip recomputing those tokens entirely and proceed directly to the user's query.
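The addressing scheme is simple enough to sketch. The snippet below is a minimal, hypothetical illustration (not the authors' code) of how such a content-hash key could be derived, assuming tool schemas are plain JSON objects; the helper name toolset_cache_key is invented here:

```python
import hashlib
import json

def toolset_cache_key(tool_schemas: list[dict]) -> str:
    """Derive a content-hash cache key for a set of tool schemas.

    Sketch of the addressing idea: serialize the schemas deterministically
    (sorted order, sorted keys) and take the SHA-256 digest of that text.
    Identical toolsets then map to the same persisted KV-cache entry,
    regardless of the order in which the tools were registered.
    """
    canonical = json.dumps(sorted(tool_schemas, key=json.dumps), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the key is derived from content rather than from session identity, any request that loads the same toolset hits the same cache entry, which is what makes the cache shareable across users and restarts.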

The technical breakthrough is 'group caching': all tools are cached together as a single block. The researchers found that caching tools independently broke model performance catastrophically, dropping tool-selection accuracy from 85% to 10%, because models rely on cross-tool attention during prefill. Group caching preserves model quality exactly. Benchmarks on Qwen3-8B (4-bit quantized) show TTFT holding steady at ~200ms from 5 to 50 tools, while a full prefill grows from 466ms to 5,625ms, a 29x speedup at the upper limit that skips 99% of prompt tokens per request. The main current limitation is memory usage with eager attention at very high tool counts (75+), which FlashAttention integration could mitigate. This work directly enables more responsive and scalable LLM agents in production.
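To make the group-caching idea concrete, here is a minimal sketch using a HuggingFace-style causal LM. The model ID is a small stand-in, the cache is a plain in-memory dict, and the function names are invented; the paper's persistence layer, Qwen3-8B 4-bit setup, and eviction policy are not shown:

```python
import copy
import hashlib

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"   # stand-in, not the paper's model
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

kv_store = {}  # content hash -> (past_key_values, prefix_length)

def get_group_cache(tool_prompt: str):
    """Prefill ALL tool schemas together as one block (group caching), so
    cross-tool attention is computed once, then persist the KV states."""
    key = hashlib.sha256(tool_prompt.encode("utf-8")).hexdigest()
    if key not in kv_store:
        ids = tok(tool_prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(input_ids=ids, use_cache=True)
        kv_store[key] = (out.past_key_values, ids.shape[1])
    return kv_store[key]

def first_token_logits(tool_prompt: str, user_query: str) -> torch.Tensor:
    """On a cache hit, only the user's tokens are prefilled; the tool block's
    KV states are reused, which is what collapses TTFT."""
    past, prefix_len = get_group_cache(tool_prompt)
    past = copy.deepcopy(past)  # avoid mutating the persisted cache entry
    query_ids = tok(user_query, return_tensors="pt",
                    add_special_tokens=False).input_ids
    # The attention mask must cover the cached prefix plus the new query tokens.
    mask = torch.ones(1, prefix_len + query_ids.shape[1], dtype=torch.long)
    with torch.no_grad():
        out = model(input_ids=query_ids, attention_mask=mask,
                    past_key_values=past, use_cache=True)
    return out.logits[:, -1]  # distribution over the first generated token
```

A per-tool variant would prefill each schema in isolation and stitch the resulting KV blocks together, so cross-tool attention is never computed; that is the configuration the paper reports as collapsing tool-selection accuracy.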

Key Points
  • Achieves 29x TTFT speedup for 50-tool setups, reducing latency from 5.6 seconds to 200ms
  • Uses 'group caching' of all tools as one block to preserve 100% of model accuracy (catastrophic failure with independent caching)
  • Indexes cached KV states with a SHA-256 content hash, allowing 99% of prompt tokens to be skipped per subsequent request

Why It Matters

Enables production deployment of responsive LLM agents with large, stable toolkits, removing a major latency bottleneck for real-time applications.