Research & Papers

The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

A new paper argues that a core LLM optimization silently alters model outputs, producing systematic, deterministic errors rather than random noise.

Deep Dive

A new research paper from Ranjith Chodavarapu and Lei Xu shatters a long-held assumption in large language model (LLM) inference. The study, "The Illusion of Equivalence," demonstrates that KV caching—a standard technique that speeds up autoregressive text generation by storing and reusing previous attention computations—is not numerically equivalent to cache-free recomputation under FP16 precision. The cache-on and cache-off paths perform their floating-point reductions in different orders, and because FP16 arithmetic is non-associative, those reorderings produce different values, leading to deterministic divergence in the generated token sequences.
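The mechanism is easy to reproduce in isolation. The sketch below is an illustration of the underlying arithmetic, not the paper's code: it accumulates the same FP16 values with two different groupings, mimicking a fresh full reduction versus a reused partial one.

```python
import numpy as np

# Illustration only (not the paper's code): FP16 rounds after every addition,
# so the same values reduced with different groupings can give different totals.
rng = np.random.default_rng(0)
scores = rng.standard_normal(1024).astype(np.float16)

# Cache-free analogue: one fresh left-to-right reduction over all values.
full = np.float16(0.0)
for s in scores:
    full = np.float16(full + s)

# Cache-like analogue: a partial sum carried over from "earlier steps",
# combined with the remaining values in a different association.
partial = np.float16(0.0)
for s in scores[:512]:
    partial = np.float16(partial + s)
tail = np.float16(0.0)
for s in scores[512:]:
    tail = np.float16(tail + s)
incremental = np.float16(partial + tail)

print(full, incremental, bool(full == incremental))
# The totals typically differ in the low-order bits: same inputs,
# different operation order, different FP16 result.
```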

The researchers tested three open-weight models—LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B—on the GSM8K benchmark and observed a 100% token divergence rate across all sampling strategies, including greedy decoding. This rules out randomness and confirms the error is systematic. Intriguingly, the cache-on path often yielded higher accuracy, indicating the divergence has a predictable direction. A controlled test using FP32 precision reduced divergence by eight orders of magnitude and eliminated token flips, pinpointing FP16 non-associativity as the sole cause.
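As a rough sketch of how such a comparison can be run—assuming the Hugging Face Transformers API, with a placeholder prompt and token budget rather than the authors' exact harness—one can generate the same greedy continuation with the cache on and off and look for the first mismatch:

```python
# Sketch only: a cache-on vs cache-off greedy comparison via Hugging Face
# Transformers. The model name (one of the paper's three), prompt, and
# 64-token budget are placeholder choices, not the paper's exact setup.
# Assumes a CUDA GPU and access to the gated LLaMA-2 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16
).cuda().eval()

prompt = ("Natalia sold clips to 48 of her friends in April, and then she "
          "sold half as many clips in May. How many clips did Natalia sell "
          "altogether?")
inputs = tok(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    # Greedy decoding removes sampling randomness: any mismatch is numerical.
    cached = model.generate(**inputs, do_sample=False, max_new_tokens=64,
                            use_cache=True, pad_token_id=tok.eos_token_id)
    uncached = model.generate(**inputs, do_sample=False, max_new_tokens=64,
                              use_cache=False, pad_token_id=tok.eos_token_id)

# Compare only the newly generated tokens and report the first disagreement.
n = inputs["input_ids"].shape[1]
mismatch = (cached[0, n:] != uncached[0, n:]).nonzero()
print("first divergent token index:",
      mismatch[0].item() if len(mismatch) else "none")
```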

Further analysis revealed that the divergence propagates through model layers in architecturally predictable patterns. Models using Grouped-Query Attention showed sharp divergence at the first layer, while Gemma's design led to uniform accumulation across all layers. Critically, the researchers used activation patching and found that manipulating the residual stream could not recover the cache-free trajectory, localizing the root cause to the stateful KV cache itself. This provides a mechanistic framework for understanding a previously overlooked source of numerical instability in production LLM systems.
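Activation patching swaps an activation recorded in one run into another run and checks whether downstream behavior follows. A minimal PyTorch sketch of the general idea—my illustration; the layer path, hook placement, and shapes are assumptions, not the paper's tooling—looks like this:

```python
# Sketch of residual-stream activation patching with PyTorch forward hooks.
# An assumed implementation of the general technique, not the paper's code;
# the layer paths referenced in the usage outline are placeholders.
import torch

def record_hidden(store):
    """Record a decoder layer's output hidden states during a reference run."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        store.append(hidden.detach().clone())
    return hook

def patch_hidden(saved):
    """Overwrite the layer's hidden states with a saved activation.
    Returning a value from a forward hook replaces the module's output."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (saved,) + output[1:]
        return saved
    return hook

# Usage outline against a Transformers-style model (paths are assumptions):
#   layer = model.model.layers[k]
#   store = []
#   h = layer.register_forward_hook(record_hidden(store))
#   ...run the prompt with use_cache=False, then h.remove()...
#   h = layer.register_forward_hook(patch_hidden(store[0]))
#   ...rerun with use_cache=True and compare the generated tokens...
# The paper reports that such residual-stream patches do not restore the
# cache-free trajectory, implicating the KV cache state itself.
```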

Key Points
  • FP16 KV caching causes 100% token divergence in the tested models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B), proving it is not equivalent to recomputation.
  • The divergence is deterministic and systematic, not random; the cache-on path actually improved GSM8K accuracy in 8 of 9 test conditions.
  • The root cause is FP16's non-associative arithmetic; FP32 precision reduces divergence by 8 orders of magnitude and eliminates token flips (see the sketch after this list).
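The precision control is easy to mimic on toy data. The following sketch uses illustrative values and magnitudes, not the paper's measurements, to show the same reordering gap at FP16 and at FP32:

```python
import numpy as np

# Toy version of the precision control (illustrative, not the paper's data):
# measure the gap between two summation orders at FP16 and at FP32.
rng = np.random.default_rng(0)
base = rng.standard_normal(4096)

for dtype in (np.float16, np.float32):
    x = base.astype(dtype)
    forward = dtype(0.0)
    for v in x:
        forward = dtype(forward + v)
    backward = dtype(0.0)
    for v in x[::-1]:
        backward = dtype(backward + v)
    print(dtype.__name__, abs(float(forward) - float(backward)))
# Expect the FP32 gap to be many orders of magnitude smaller than FP16's,
# mirroring the paper's reported eight-order-of-magnitude reduction.
```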

Why It Matters

The finding challenges a core assumption in LLM deployment: model outputs can change based on inference optimization settings alone, which undermines reproducibility and complicates reliability guarantees for production systems.