Research & Papers

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

Apple researchers let transformer layers share KV caches across depth, slashing memory with little to no performance loss.

Deep Dive

A team of Apple researchers (Filippova, Grangier, Cuturi, Monteiro) introduced Stochastic KV Routing, a novel training technique that slashes the memory footprint of transformer LLMs by enabling adaptive depth-wise cache sharing. During autoregressive generation, models cache Key-Value (KV) states to avoid redundant computation, but this cache consumes massive memory, driving up serving costs. While prior work focused on compressing or evicting caches along the temporal axis, this paper argues the depth dimension—across layers—offers an orthogonal and robust optimization avenue. The key insight: a full KV cache for every layer is often redundant.
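
To get a sense of the memory at stake, here is a back-of-the-envelope sketch of per-request KV cache size when every layer keeps its own cache. The model configuration below is purely illustrative and not taken from the paper:

```python
# Rough KV cache size for a hypothetical decoder (illustrative numbers only).
n_layers = 32        # transformer layers, each holding its own KV cache
n_kv_heads = 8       # key/value heads (grouped-query attention assumed)
head_dim = 128       # dimension per head
seq_len = 8192       # cached context length
batch = 16           # concurrent sequences being served
bytes_per_elem = 2   # fp16 / bf16 storage

# factor of 2 accounts for storing both keys and values
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")  # ~16 GiB for this configuration
```

Under these assumptions the cache alone occupies on the order of tens of gigabytes, which is why sharing caches across layers is attractive.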

The method is elegantly simple: during training, each layer randomly chooses to attend either to its own KV states or to those of a preceding layer. This stochastic process makes the model robust to various depth-wise cache sharing strategies at deployment, without requiring knowledge of hardware constraints upfront. The approach works during both pre-training and fine-tuning, and for larger models in data-constrained settings, it often preserves or even improves performance while significantly reducing cache memory. This contrasts with existing cross-layer sharing methods that typically hurt throughput or increase time-to-first-token. The paper suggests this technique could lower the cost of serving large language models, making them more accessible for real-world applications.
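
A minimal sketch of what this training-time routing could look like, using toy PyTorch layers. This is not the authors' implementation; the layer structure, the choice of which preceding layers are eligible, and the sampling probability are all assumptions made here for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLayer(nn.Module):
    """One illustrative decoder layer with separate Q/K/V projections (hypothetical)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

    def attend(self, q, k, v):
        out = F.scaled_dot_product_attention(
            self._split_heads(q), self._split_heads(k), self._split_heads(v),
            is_causal=True)
        b, h, t, d = out.shape
        return self.o_proj(out.transpose(1, 2).reshape(b, t, h * d))

def forward_with_stochastic_kv_routing(layers, hidden, share_prob=0.5):
    """Each layer keeps its own queries, but with probability `share_prob`
    it reuses the K/V states of a randomly chosen preceding layer instead
    of projecting its own (share_prob is an assumed hyperparameter)."""
    kv_bank = []                                 # K/V actually used by earlier layers
    for i, layer in enumerate(layers):
        q = layer.q_proj(hidden)
        if i > 0 and torch.rand(()) < share_prob:
            k, v = kv_bank[torch.randint(0, i, ()).item()]    # route to a preceding layer
        else:
            k, v = layer.k_proj(hidden), layer.v_proj(hidden)  # use this layer's own K/V
        kv_bank.append((k, v))
        hidden = hidden + layer.attend(q, k, v)                # residual connection
    return hidden

# usage: 4 toy layers, a batch of 2 sequences of length 16
layers = nn.ModuleList(ToyLayer(d_model=64, n_heads=4) for _ in range(4))
out = forward_with_stochastic_kv_routing(layers, torch.randn(2, 16, 64))
```

At deployment the random sampling would be replaced by a fixed sharing pattern chosen to fit the memory budget; because training exposed the model to many such patterns, it tolerates whichever one is picked.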

Key Points
  • Stochastic KV Routing reduces memory footprint by allowing layers to share caches across depth, avoiding redundant storage.
  • The method uses random cross-layer attention during training, making models robust to unknown cache-sharing strategies at deployment.
  • For larger models in data-constrained settings, it often preserves or improves performance while cutting cache memory.

Why It Matters

Cuts LLM serving costs by reducing KV cache memory, enabling more efficient deployment of large models.