Open Source

RotorQuant: 10-19x faster alternative to TurboQuant via Clifford rotors (44x fewer params)

New method uses Clifford algebra to cut parameters by 44x while nearly matching TurboQuant's accuracy.

Deep Dive

Scrya has introduced RotorQuant, a vector quantization technique that replaces the core rotation step of the popular TurboQuant method. Instead of multiplying by a dense d×d random orthogonal matrix, RotorQuant uses Clifford rotors in the Cl(3,0) algebra: it splits a vector into groups of three dimensions and rotates each group with a 4-parameter rotor via the sandwich product. For d=128, this replaces 16,384 fused multiply-add operations with roughly 100 and cuts the parameter count 44x, from 16,399 to 372.
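The chunk-and-rotate step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not RotorQuant's actual code: it relies on the standard correspondence between unit rotors in Cl(3,0) (one scalar plus three bivector coefficients) and unit quaternions, it zero-pads the last group because 128 is not a multiple of 3 (the source does not say how the remainder is handled), and the names random_rotor, rotate3, rotate_vector, and inverse_rotor are hypothetical. It covers only the rotation, not the quantization or QJL correction.

```python
import math
import random

def random_rotor(rng):
    """Draw a random unit rotor, stored as 4 numbers (w, x, y, z)."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate3(rotor, v):
    """Sandwich product on one 3-vector, expanded as
    v' = v + 2w(u x v) + 2 u x (u x v), valid for a unit rotor."""
    w, u = rotor[0], rotor[1:]
    t = cross(u, v)
    tt = cross(u, t)
    return tuple(v[i] + 2.0 * (w * t[i] + tt[i]) for i in range(3))

def rotate_vector(rotors, x):
    """Split x into groups of 3 and rotate each group with its own rotor.
    Zero-padding the tail is an assumption made for this sketch."""
    padded = list(x) + [0.0] * (-len(x) % 3)
    out = []
    for g, rotor in enumerate(rotors):
        out.extend(rotate3(rotor, padded[3 * g:3 * g + 3]))
    return out

def inverse_rotor(rotor):
    """The reverse of a unit rotor undoes its rotation."""
    w, x, y, z = rotor
    return (w, -x, -y, -z)

# Usage: rotate a d=128 vector, then undo the rotation.
rng = random.Random(0)
d = 128
n_groups = (d + 2) // 3          # 43 groups after padding to 129
rotors = [random_rotor(rng) for _ in range(n_groups)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = rotate_vector(rotors, x)
x_back = rotate_vector([inverse_rotor(r) for r in rotors], y)

norm_x = math.sqrt(sum(c * c for c in x))
norm_y = math.sqrt(sum(c * c for c in y))
max_err = max(abs(a - b) for a, b in zip(x, x_back))
```

Each group is rotated independently, so the transform is block-diagonal and still orthogonal: norm_y equals norm_x up to floating-point error, and applying the reversed rotors recovers x. Each group needs only its four rotor coefficients rather than a row of a dense matrix, which is where the parameter savings described above come from.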

In benchmarks on a Qwen2.5-3B-Instruct KV cache, RotorQuant performs almost identically to TurboQuant, with a cosine similarity of 0.990 versus 0.991. The main gains are in speed: its fused CUDA kernel runs 10-19x faster than a cuBLAS matrix multiplication on an RTX PRO 4000, and its Metal shader achieves 9-31x speedups on an Apple M4 chip. RotorQuant shows a higher synthetic mean squared error on random vectors, but when applied with QJL correction to real models it preserves attention fidelity and can even improve results on top-1/top-5 retrieval tasks. The fused kernel design, which keeps operations in registers to avoid memory round-trips, is what lets it outperform the highly optimized BLAS routines used by TurboQuant.

Key Points
  • Uses Clifford rotors to replace dense matrix multiplication, slashing parameters by 44x (from 16,399 to 372 for d=128).
  • Fused kernels run 10-19x faster than a cuBLAS matrix multiplication on NVIDIA CUDA and 9-31x faster on Apple Metal.
  • Maintains near-identical fidelity to TurboQuant (0.990 versus 0.991 cosine similarity) when quantizing the KV cache of Qwen2.5-3B-Instruct.

Why It Matters

A much cheaper rotation step makes quantization for LLM inference and retrieval faster and more efficient, which could make advanced quantization practical for real-time applications on consumer hardware.