Image & Video

New bias correction makes INT2 KV caches viable for video diffusion models, using 50% less memory than INT4

arXiv eess.IV May 27, 2026

⚡Quantized keys steal attention—researchers fix it at near-zero cost.

Deep Dive

Chunk-wise autoregressive video diffusion models store a KV cache of previous chunks to avoid redundant computation, but as videos grow longer, this cache rapidly becomes a memory bottleneck. Quantizing the cache to low bitwidths reduces memory pressure but typically harms video quality. The authors show the key driver is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys—a phenomenon they term the Jensen bias. This causes quantized keys to steal attention mass from the unquantized current chunk, degrading output fidelity.
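The inflation effect can be reproduced with a toy experiment. The sketch below is an illustration only, not the authors' setup: it assumes every clean logit is 0 and adds zero-mean Gaussian noise (a stand-in for quantization error, which in practice is bounded) to the cached-key logits alone. The function names and parameters are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_cached_mass(noise_std, n_cached=8, n_current=8, trials=2000, seed=0):
    """Average attention mass on cached keys when zero-mean noise with the
    given standard deviation is added to the cached-key logits only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Clean logits are all 0, so without noise the cached keys receive
        # exactly n_cached / (n_cached + n_current) of the attention mass.
        cached = [rng.gauss(0.0, noise_std) for _ in range(n_cached)]
        current = [0.0] * n_current
        weights = softmax(cached + current)
        total += sum(weights[:n_cached])
    return total / trials

clean_mass = mean_cached_mass(0.0)  # 0.5 by construction
noisy_mass = mean_cached_mass(1.0)  # noise is zero-mean, yet the mass grows
print(f"clean: {clean_mass:.3f}  noisy: {noisy_mass:.3f}")
```

Although the noise averages to zero in the logits, the exponential is convex, so the average of exp(logit + noise) exceeds exp(logit). The cached keys therefore gain attention mass at the expense of the noise-free current chunk.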

The team proposes a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. The method uses a second-order Taylor approximation, making the computational overhead negligible and requiring no additional memory beyond the cache. Tested on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, the correction recovers most of the quality lost to aggressive quantization, achieving near-BF16 video quality. It even outperforms INT4 quantization while using 50% less memory, making it a practical solution for long-video generation.
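The summary above does not give the correction formula, so the following is only a sketch of how such a correction could look under stated assumptions. It assumes unclipped uniform rounding with step size Δ, whose per-element error is roughly uniform with variance Δ²/12 and independent across elements. The logit error then has variance σ² ≈ scale² · ‖q‖² · Δ² / 12, and a second-order Taylor expansion gives E[exp(ε)] ≈ 1 + σ²/2. Subtracting log(1 + σ²/2) from each cached-key logit removes that inflation in expectation. All function names here are hypothetical.

```python
import math
import random

def quantize(vec, step):
    """Round each element to the nearest multiple of `step` (no clipping)."""
    return [step * round(x / step) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def jensen_log_correction(query, step, scale):
    """Hypothetical log-domain correction for one cached key's logit.

    Assumes rounding error uniform in [-step/2, step/2] per element, so the
    logit error has variance scale^2 * ||q||^2 * step^2 / 12.
    """
    q_norm_sq = dot(query, query)
    noise_var = (scale ** 2) * q_norm_sq * (step ** 2) / 12.0
    # Second-order Taylor approximation: E[exp(eps)] ~= 1 + var / 2.
    return math.log1p(0.5 * noise_var)

def mean_inflation(query, step, scale, trials=20000, seed=0):
    """Monte Carlo estimate of E[exp(logit error)] with and without correction."""
    rng = random.Random(seed)
    correction = jensen_log_correction(query, step, scale)
    raw_sum = 0.0
    corrected_sum = 0.0
    for _ in range(trials):
        key = [rng.gauss(0.0, 2.0) for _ in query]
        logit_error = scale * (dot(query, quantize(key, step)) - dot(query, key))
        raw_sum += math.exp(logit_error)
        corrected_sum += math.exp(logit_error - correction)
    return raw_sum / trials, corrected_sum / trials

dim = 16
query = [1.0] * dim
raw, corrected = mean_inflation(query, step=1.0, scale=1.0 / math.sqrt(dim))
print(f"raw inflation: {raw:.4f}  after correction: {corrected:.4f}")
```

The correction needs only the query norm and the step size already stored with each quantized key, which is consistent with the claim that no memory beyond the cache is required. A real INT2 scheme also clips to four levels and may use per-group step sizes, neither of which is modeled here.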

Key Points

Discovered Jensen bias: quantization noise inflates cached-key attention due to softmax convexity.
Proposed on-the-fly correction using quantization step sizes and query norms with negligible overhead.
At INT2, reaches near-BF16 quality on three video models (MAGI-1, SkyReels-V2, HY-WorldPlay) while using 50% less memory than INT4.

Why It Matters

Enables high-quality long video generation by halving KV-cache memory relative to INT4 while keeping fidelity close to BF16.
