Research & Papers

Resident KV Claims: A New Contract for AI Cache Memory Under Pressure

A proposed conformance contract makes hidden KV-cache eviction in LLM inference visible to the scheduler.

Deep Dive

Lukas Stepanek's arXiv paper (2605.24259) tackles a growing pain in LLM serving: KV-cache reuse mechanisms expose priority, offload, and routing hints, but lack a portable contract for what happens when 'resident KV' (cached prefixes for future reuse) and 'active KV' (current request compute) exceed available memory. Existing systems silently evict residents, breaking reuse guarantees. Stepanek's 'resident KV claims' formally bind future-reuse intent to a materialization predicate, lifecycle state, and feasibility outcome.

In vLLM allocator probes, a 60-block resident claim and a 70-block active prefill together need 130 blocks, oversubscribing an 80-block pool. A write no-admit policy prevents the active request from becoming reusable but still allows residents to be evicted. A minimal vLLM prototype shows that hard protected resident claims convert this failure into a scheduler-visible active refusal that directly attributes the blocking claim. The result is not a speedup; it is a runtime contract that turns unreported resident loss into reconstructable arbitration, with companion litmus tests that distinguish eviction, soft priority, demotion, expiry, and active refusal.
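The oversubscription scenario above can be sketched as a toy allocator. This is an illustrative model only, not vLLM's allocator or the paper's prototype: the names `Pool`, `ResidentClaim`, `Outcome`, and `admit_active` are hypothetical, and the eviction order is an assumption made for the example. It contrasts a baseline that quietly evicts the resident with a hard protected claim that refuses the active request and names the blocking claim.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    ADMITTED = "admitted"                  # fits without touching residents
    RESIDENT_EVICTED = "resident_evicted"  # baseline: resident is lost
    ACTIVE_REFUSED = "active_refused"      # scheduler-visible refusal


@dataclass
class ResidentClaim:
    # Hypothetical record of future-reuse intent for a cached prefix.
    claim_id: str
    blocks: int
    hard_protected: bool = False


@dataclass
class Pool:
    # Toy KV block pool; capacity is counted in blocks.
    capacity: int
    residents: list = field(default_factory=list)

    def free_blocks(self) -> int:
        return self.capacity - sum(c.blocks for c in self.residents)

    def admit_active(self, needed: int):
        """Return (outcome, claim ids that were evicted or are blocking)."""
        if needed <= self.free_blocks():
            return Outcome.ADMITTED, []

        # Only unprotected residents may be reclaimed.
        evictable = [c for c in self.residents if not c.hard_protected]
        reclaimable = self.free_blocks() + sum(c.blocks for c in evictable)
        if needed > reclaimable:
            # Refuse the active request and attribute the protected claims.
            blocking = [c.claim_id for c in self.residents if c.hard_protected]
            return Outcome.ACTIVE_REFUSED, blocking

        # Evict unprotected residents (in list order) until the request fits.
        evicted = []
        for claim in evictable:
            if needed <= self.free_blocks():
                break
            self.residents.remove(claim)
            evicted.append(claim.claim_id)
        return Outcome.RESIDENT_EVICTED, evicted


# Numbers from the probe: 80-block pool, 60-block resident, 70-block prefill.
baseline = Pool(80, [ResidentClaim("prefix-A", 60)])
baseline_result = baseline.admit_active(70)

protected = Pool(80, [ResidentClaim("prefix-A", 60, hard_protected=True)])
protected_result = protected.admit_active(70)

print(baseline_result)
print(protected_result)
```

In the baseline pool the prefill is admitted at the cost of the resident prefix; in a real system without a contract that loss would go unreported, whereas the toy returns it only to make the comparison visible. With the hard protected claim, the same request is refused and the result carries the id of the claim that blocked it, which is the attribution the summary describes.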

Key Points
  • 60-block resident claim + 70-block active prefill (130 blocks total) exceed the 80-block usable KV pool in vLLM probes.
  • Hard protected resident claims convert silent eviction into scheduler-visible active refusal with attribution.
  • Companion MicroRuntime and vLLM litmus suite distinguish 9 different failure modes (eviction, demotion, expiry, etc.).

Why It Matters

Enables predictable KV-cache reuse in LLM servers, crucial for efficient long-context and multi-turn inference.