Implemented TurboQuant and results don’t fully match paper
A developer's implementation reaches only ~95.8% correlation at 4-bit versus the paper's >99%, with attention top-1 accuracy dropping to ~67%.
A developer (Reddit user Routine-Thanks-572) attempted to implement TurboQuant, a KV cache quantization method from a recent arXiv paper (2504.19874), from scratch. Preliminary results show a clear discrepancy: the MSE variant performs as expected, with good compression and low distortion, but the PROD variant, which the paper claims achieves >99% correlation at 4-bit, reaches only ~95.8% in this implementation. More critically, even at that correlation level, attention quality degrades severely: a simple simulation achieves only ~67% top-1 accuracy. This suggests that high correlation does not guarantee good ranking preservation, and that attention is highly sensitive to ordering errors.
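The gap between score correlation and ranking preservation is easy to reproduce. Below is a minimal sketch, not the developer's simulation: it uses plain per-vector uniform 4-bit quantization as a stand-in for TurboQuant, with synthetic Gaussian keys and queries, and every name and parameter in it is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, n_queries = 256, 512, 1000  # illustrative sizes

def quantize_4bit(x):
    """Per-vector uniform 4-bit scalar quantization (a simple stand-in,
    not TurboQuant's MSE or PROD quantizer)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-12       # 16 levels -> 15 steps
    codes = np.round((x - lo) / scale).clip(0, 15)
    return codes * scale + lo              # dequantize back to floats

keys = rng.standard_normal((n_keys, d))
keys_q = quantize_4bit(keys)

top1_hits, corrs = 0, []
for _ in range(n_queries):
    q = rng.standard_normal(d)
    exact = keys @ q        # exact attention logits over all keys
    approx = keys_q @ q     # logits against quantized keys
    corrs.append(np.corrcoef(exact, approx)[0, 1])
    top1_hits += int(exact.argmax() == approx.argmax())

print(f"mean score correlation: {np.mean(corrs):.4f}")
print(f"top-1 agreement:        {top1_hits / n_queries:.3f}")
```

Even when the measured correlation sits near 0.99, a noticeable fraction of queries change their argmax key: quantization noise only has to swap two closely spaced logits to break the top-1 ranking.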
Beyond the accuracy gap, the developer hit several technical hurdles: a variance-scaling mismatch (unit variance vs 1/d) initially broke the MSE variant and required re-deriving the QJL variance scaling, and bit packing was necessary to realize any actual compression. The developer asks whether the discrepancy stems from missing scaling factors in PROD, is inherent to small dimensions (d=256), or reflects results in the paper that rely on larger dimensions or setups. The full code is open-sourced on GitHub (https://github.com/Ashx098/Turboquant-Implementation) for community feedback and verification.
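The unit-vs-1/d question is a standard random-projection convention mismatch. As a sketch of why it matters (an illustration, not the paper's QJL derivation): with Gaussian projection entries of unit variance the inner-product estimator needs a 1/m correction, while entries of variance 1/d need d/m, and mixing the two conventions silently rescales every estimate by a factor of d:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 256, 4096  # input dimension, projection dimension (illustrative)
x, y = rng.standard_normal(d), rng.standard_normal(d)

# Convention A: projection entries ~ N(0, 1); unbiased estimate needs 1/m.
S_unit = rng.standard_normal((m, d))
est_a = (S_unit @ x) @ (S_unit @ y) / m

# Convention B: projection entries ~ N(0, 1/d); unbiased estimate needs d/m.
S_scaled = rng.standard_normal((m, d)) / np.sqrt(d)
est_b = (S_scaled @ x) @ (S_scaled @ y) * d / m

print(f"true <x,y>: {x @ y:+.2f}   est A: {est_a:+.2f}   est B: {est_b:+.2f}")
# Pairing convention A's matrix with convention B's rescaling (or vice
# versa) leaves every estimate off by a factor of d -- exactly the kind of
# silent bug that can break an MSE-optimal quantizer's distortion.
```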
- PROD variant correlation at 4-bit is ~95.8%, far below the paper's claimed >99%.
- Attention quality drops to ~67% top-1 accuracy despite ~95.8% correlation, showing that score ranking is not preserved.
- Other implementation issues: variance scaling (unit vs 1/d) and bit packing, which is mandatory for actual compression (see the packing sketch after this list).
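For concreteness, here is a minimal sketch of the kind of bit packing the post refers to, with hypothetical function names: two 4-bit codes are stored per byte, so a 4-bit quantizer actually halves memory relative to leaving each code in its own int8:

```python
import numpy as np

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit codes (values 0..15) into bytes,
    two codes per byte. Without this step each code still occupies a full
    int8 (or float), so no real memory is saved."""
    codes = codes.astype(np.uint8)
    return (codes[0::2] << 4) | codes[1::2]

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_nibbles: recover the original 4-bit codes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

codes = np.random.default_rng(2).integers(0, 16, size=256, dtype=np.uint8)
assert np.array_equal(unpack_nibbles(pack_nibbles(codes)), codes)
print(f"{codes.nbytes} bytes -> {pack_nibbles(codes).nbytes} bytes")
```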
Why It Matters
Ranking preservation, not just correlation, is what matters for attention; KV cache quantization needs metrics that capture it.