FastDMS: 6.4x KV-cache compression running faster than vLLM BF16/FP8
New open-source KV-cache technique compresses memory 5.6–7.6x and decodes 1.5–2.3x faster than vLLM's BF16 baseline.
FastDMS is a new, MIT-licensed kernel implementation of Dynamic Memory Sparsification (DMS), a technique originally proposed by researchers from NVIDIA, the University of Warsaw, and the University of Edinburgh. DMS uses learned per-head token eviction to compress the KV-cache, reclaiming physical memory as slots are evicted. The author replicated the original DMS results on Llama 3.2 1B (WikiText-2) with a perplexity delta of only -0.28% at 6.4x compression, then spent weeks optimizing the kernels for speed.
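To make the mechanism concrete, here is a minimal PyTorch sketch of per-head token eviction: a learned score per cached token decides what each head keeps, and compacting the survivors is what reclaims memory. The `evict_kv` function, its `keep_ratio` parameter, and the external score predictor are illustrative assumptions, not FastDMS's actual kernels or API.

```python
import torch

def evict_kv(keys, values, scores, keep_ratio=1 / 6.4):
    """Illustrative per-head token eviction (not the FastDMS kernel).

    keys, values: [num_heads, seq_len, head_dim] cached tensors.
    scores:       [num_heads, seq_len] retention scores from a learned
                  predictor (assumed here; DMS learns eviction per head).
    keep_ratio:   fraction of tokens each head retains (~1/6.4 here).
    """
    num_heads, seq_len, head_dim = keys.shape
    keep = max(1, int(seq_len * keep_ratio))
    # Each head keeps its own top-scoring tokens; sorting the indices
    # preserves the original token order in the compacted cache.
    idx = scores.topk(keep, dim=-1).indices.sort(dim=-1).values
    gather = idx.unsqueeze(-1).expand(-1, -1, head_dim)
    # Gathering into smaller tensors is what actually frees memory.
    return keys.gather(1, gather), values.gather(1, gather)
```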
In benchmarks, FastDMS decisively outperforms vLLM's dense BF16 and FP8 KV-cache strategies. For Llama 3.2 1B at 8K context, FastDMS uses 0.056 GiB versus vLLM BF16's 0.312 GiB (a 5.6x memory reduction) while decoding at 698.9 tok/s versus 459.4 tok/s (1.52x faster). With an int4 speed profile, decoding reaches 1060 tok/s (2.31x vLLM BF16). On Qwen3 8B, memory savings reach 7.6x. FastDMS also surpasses TurboQuant in both memory and speed. The project is available on GitHub with reference implementations and a trainer for reproducibility.
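As a sanity check, the dense BF16 baseline is close to what the standard KV-cache formula predicts from Llama 3.2 1B's published configuration (16 layers, 8 KV heads, head dim 64); the gap to vLLM's reported 0.312 GiB is plausibly allocator overhead, which this back-of-envelope ignores.

```python
# Raw KV-cache bytes for Llama 3.2 1B at 8K context in BF16.
layers, kv_heads, head_dim, seq_len = 16, 8, 64, 8192
kv_tensors = 2       # one K and one V cache per layer
bytes_per_elem = 2   # BF16
total = layers * kv_tensors * kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{total / 2**30:.3f} GiB")  # 0.250 GiB of raw tensors
```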
- FastDMS compresses the KV-cache 5.6x on Llama 3.2 1B and 7.6x on Qwen3 8B (8K context).
- Decoding is 1.5–2.3x faster than the vLLM BF16 baseline (e.g., 1060 tok/s with the int4 speed profile; a generic quantization sketch follows this list).
- Open-source under the MIT license; trained predictors for Llama 3.2 1B and the original NVIDIA DMS checkpoint for Qwen3 8B are available.
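The summary does not describe how the int4 speed profile stores values, so the following is only a generic sketch of symmetric int4 quantization with per-group scales, packing two 4-bit values per byte; the function names and group size are illustrative assumptions.

```python
import torch

def pack_int4(x, group_size=64):
    """Generic symmetric int4 quantization with per-group scales
    (illustrative; not FastDMS's actual int4 layout).
    x: [..., n] with n divisible by group_size."""
    g = x.reshape(*x.shape[:-1], -1, group_size)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7
    q = (g / scale).round().clamp(-8, 7).to(torch.int8)  # int4 range [-8, 7]
    nib = (q & 0xF).to(torch.uint8)                      # low nibble per value
    packed = nib[..., 0::2] | (nib[..., 1::2] << 4)      # two values per byte
    return packed, scale

def unpack_int4(packed, scale):
    lo = (packed & 0xF).to(torch.int8)
    hi = (packed >> 4).to(torch.int8)
    # Sign-extend the 4-bit two's-complement nibbles.
    lo = torch.where(lo > 7, lo - 16, lo)
    hi = torch.where(hi > 7, hi - 16, hi)
    q = torch.stack((lo, hi), dim=-1).flatten(-2)        # restore interleaving
    return (q.float() * scale).flatten(-2)
```

At half a byte per value plus one scale per 64 values, this stores roughly a quarter of BF16's footprint; since decoding is memory-bandwidth bound, that kind of reduction tends to translate into throughput gains as well.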
Why It Matters
Practical KV-cache compression lets LLMs run longer contexts or serve more users on the same GPU hardware.