RotorQuant: 10-19x faster alternative to TurboQuant via Clifford rotors (44x fewer params)
New method uses Clifford algebra to slash parameters by 44x while matching TurboQuant's accuracy.
Scrya has introduced RotorQuant, a vector quantization technique that replaces the core rotation step of the popular TurboQuant method. Instead of multiplying by a dense d×d random orthogonal matrix, RotorQuant uses Clifford rotors in the Cl(3,0) algebra: it splits a vector into groups of three dimensions and rotates each group with a 4-parameter rotor via the sandwich product. For d=128, this replaces the 16,384 (128 × 128) fused multiply-add operations of the dense multiply with roughly 100, and cuts the parameter count 44x, from 16,399 to just 372.
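The chunk-and-rotate step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Scrya's implementation: each rotor is stored as a unit 4-tuple and applied through the equivalent unit-quaternion form of the sandwich product, the sign convention for the bivector components is arbitrary, and zero-padding the last chunk (128 is not a multiple of 3) is a guess at how the remainder might be handled. The helper names (`random_rotor`, `rotate3`, `make_rotors`, `rotate_vector`) are hypothetical.

```python
import math
import random

def random_rotor(rng):
    """Sample a unit rotor as 4 numbers: scalar part w and bivector part (x, y, z)."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def conjugate(rotor):
    """Reverse of a unit rotor; applying it undoes the rotation."""
    w, x, y, z = rotor
    return (w, -x, -y, -z)

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate3(rotor, v):
    """Sandwich product R v R~ for one 3-vector, in expanded quaternion form."""
    w, x, y, z = rotor
    u = (x, y, z)
    t = tuple(2.0 * c for c in cross(u, v))
    uxt = cross(u, t)
    return tuple(v[i] + w * t[i] + uxt[i] for i in range(3))

def make_rotors(d, rng):
    """One rotor per group of three dimensions (the last group may be padded)."""
    return [random_rotor(rng) for _ in range(math.ceil(d / 3))]

def rotate_vector(rotors, vec):
    """Zero-pad vec to a multiple of 3 (an assumption), then rotate each chunk."""
    padded = list(vec) + [0.0] * (3 * len(rotors) - len(vec))
    out = []
    for i, rotor in enumerate(rotors):
        out.extend(rotate3(rotor, padded[3 * i:3 * i + 3]))
    return out

rng = random.Random(0)
d = 128
rotors = make_rotors(d, rng)
vec = [rng.gauss(0.0, 1.0) for _ in range(d)]
rotated = rotate_vector(rotors, vec)
# Rotating back with the conjugate rotors recovers the original vector.
restored = rotate_vector([conjugate(r) for r in rotors], rotated)[:d]
```

In this sketch, d=128 yields 43 rotors, or 172 rotor coefficients, and each chunk costs a few dozen multiplications at most. The rotation preserves vector length and is undone exactly by the conjugate rotors, which is the property a quantizer's rotate and un-rotate steps rely on. How the rest of the reported 372 parameters is allocated is not itemized here.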
In benchmarks on a Qwen2.5-3B-Instruct KV cache, RotorQuant performs nearly identically to TurboQuant, with a cosine similarity of 0.990 versus 0.991. The real gains are in speed: its fused CUDA kernel runs 10-19x faster than a cuBLAS matrix multiplication on an RTX PRO 4000, and its Metal shader achieves 9-31x speedups on an Apple M4 chip. RotorQuant shows a higher mean squared error on synthetic random vectors, but when it is combined with QJL correction on real models, attention fidelity is preserved and top-1/top-5 retrieval results can even improve. The fused kernel keeps operations in registers to avoid memory round-trips, which is key to outperforming the highly optimized BLAS routines used by TurboQuant.
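For readers unfamiliar with the fidelity metric, the sketch below shows how a cosine similarity between an original vector and its quantized reconstruction might be computed. It is illustrative only: the uniform per-coordinate quantizer is a simple stand-in, not RotorQuant or TurboQuant, and the random 128-dimensional vector is synthetic rather than a real KV cache entry. The names `cosine_similarity` and `uniform_quantize` are hypothetical.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def uniform_quantize(vec, bits):
    """Stand-in quantizer: snap each coordinate to one of 2**bits evenly spaced levels."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((x - lo) / step) * step for x in vec]

rng = random.Random(0)
key = [rng.gauss(0.0, 1.0) for _ in range(128)]

# Compare the original vector with coarse and fine reconstructions.
score_3bit = cosine_similarity(key, uniform_quantize(key, bits=3))
score_8bit = cosine_similarity(key, uniform_quantize(key, bits=8))
```

A score of 1.0 means the reconstruction points in exactly the same direction as the original, so figures such as 0.990 and 0.991 indicate that both methods leave the direction of the cached vectors almost unchanged.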
- Uses Clifford rotors to replace dense matrix multiplication, slashing parameters by 44x (from 16,399 to 372 for d=128).
- Fused kernels run 10-19x faster than cuBLAS matrix multiplication on NVIDIA CUDA and achieve 9-31x speedups on Apple Metal.
- Maintains near-identical fidelity (0.990 cosine similarity versus 0.991 for TurboQuant) when quantizing the KV cache of Qwen2.5-3B-Instruct.
Why It Matters
RotorQuant enables faster and more efficient LLM inference and retrieval, making advanced quantization practical for real-time applications on consumer hardware.