TurboQuant study: FP8 remains best default for KV-cache quantization
FP8 delivers 2x KV-cache capacity with negligible accuracy loss
A new study comparing TurboQuant variants against FP8 quantization for the large language model KV cache concludes that FP8 remains the best default. The analysis shows that FP8, enabled via --kv-cache-dtype fp8, doubles cache capacity with negligible accuracy loss while matching BF16 on most metrics. In memory-constrained serving scenarios, FP8 outperforms BF16, which makes it the clear choice for production deployments.
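The 2x capacity figure follows from element width: BF16 stores each cached key or value element in 2 bytes, FP8 in 1 byte. The sketch below works through that arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128) and the 16 GiB cache budget are hypothetical values chosen for illustration, not figures from the study, and the calculation ignores any scaling-factor metadata.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache needed to hold one token across all layers."""
    # Each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def tokens_that_fit(budget_bytes, bytes_per_token):
    """Whole tokens that fit in a fixed KV-cache memory budget."""
    return budget_bytes // bytes_per_token


# Hypothetical model shape and memory budget (assumptions, not from the study).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET_BYTES = 16 * 1024**3  # 16 GiB reserved for the KV cache

bf16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)  # 2 bytes per element
fp8_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)   # 1 byte per element

bf16_tokens = tokens_that_fit(BUDGET_BYTES, bf16_per_token)
fp8_tokens = tokens_that_fit(BUDGET_BYTES, fp8_per_token)

print(f"BF16: {bf16_per_token} bytes/token, {bf16_tokens} tokens")
print(f"FP8:  {fp8_per_token} bytes/token, {fp8_tokens} tokens")
print(f"Capacity gain: {fp8_tokens / bf16_tokens:.1f}x")
```

Under these assumptions the same budget holds twice as many tokens in FP8, which can be spent on longer contexts or on more concurrent sequences.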
TurboQuant variants fared poorly in comparison. The k8v4 variant offers only a marginal capacity gain over FP8 (2.4x vs 2x) while consistently degrading latency and throughput, which is not worth the trade-off. The 4bit-nc variant is the most practical TurboQuant option, but it pays for its extra memory savings with moderate accuracy drops and higher inference costs, so it is viable only for edge deployments where memory is the dominant constraint. The higher-compression variants (k3v4-nc and 3bit-nc) show meaningful accuracy degradation, especially on reasoning and very long-context tasks, along with substantial latency and throughput penalties, which rules them out for production use.
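To put the 2.4x figure in context, the sketch below compares it with the gain implied by nominal bit widths alone. It assumes that "k8v4" means 8-bit keys and 4-bit values, which is an interpretation of the variant name rather than something the summary states. The gap between the ideal and reported figures is presumably metadata overhead such as scaling factors, but that too is an assumption.

```python
BF16_BITS = 16  # baseline width of each cached key or value element


def ideal_capacity_gain(key_bits, value_bits):
    """Capacity multiplier over BF16 if only payload bits are counted."""
    return (2 * BF16_BITS) / (key_bits + value_bits)


def effective_bits_per_element(reported_gain):
    """Average bits per element implied by a reported capacity multiplier."""
    return BF16_BITS / reported_gain


fp8_gain = ideal_capacity_gain(8, 8)   # FP8 keys and values
k8v4_gain = ideal_capacity_gain(8, 4)  # assumed: 8-bit keys, 4-bit values

# 2.4x is the k8v4 capacity figure reported in the study summary.
k8v4_effective_bits = effective_bits_per_element(2.4)

print(f"FP8 ideal gain:  {fp8_gain:.2f}x")
print(f"k8v4 ideal gain: {k8v4_gain:.2f}x (reported: 2.4x)")
print(f"k8v4 effective bits per element at 2.4x: {k8v4_effective_bits:.2f}")
```

Whichever figure is used, k8v4 holds only about 20 to 33 percent more tokens than FP8, a modest return for a consistent latency and throughput cost.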
- FP8 quantization delivers 2x KV-cache capacity with negligible accuracy loss and matches BF16 on most metrics.
- TurboQuant k8v4 offers only marginally more capacity than FP8 (2.4x vs 2x) and consistently degrades throughput and latency.
- TurboQuant 4bit-nc is viable for edge deployments where memory is the dominant constraint, but it pays for the extra capacity with moderate accuracy and speed costs.
Why It Matters
The study gives clear guidance for LLM serving: default to FP8 to gain efficiency without sacrificing accuracy.