TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed)
Custom Metal kernels compress Qwen 32B's KV cache by 4.6x while maintaining 98% of FP16 speed.
Developer Anton Rozanov has ported Google's recent TurboQuant research paper to Apple's MLX framework, producing a highly optimized implementation that dramatically reduces the memory footprint of large language models on Mac hardware. The core achievement is a 4.6x compression of the KV (key-value) cache, the memory-hungry structure that stores attention keys and values for the conversation context, for the 32-billion-parameter Qwen2.5 model. On an M4 Pro Mac with 48GB of memory, this compression shrinks the cache for a 16,000-token context from 4.2GB to just 897MB, a significant saving on Apple Silicon, where model weights and cache compete for the same unified memory.
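The headline figures hold up against a quick back-of-envelope check. The sketch below is not taken from the port; it simply computes the FP16 KV cache size from the standard formula, assuming the published Qwen2.5-32B attention configuration (64 layers, grouped-query attention with 8 KV heads of dimension 128), and lands close to the reported 4.2GB and 897MB.

```python
# Rough KV cache sizing; the model configuration values are assumptions
# based on the published Qwen2.5-32B architecture.
n_layers, n_kv_heads, head_dim = 64, 8, 128
context_len = 16_000
fp16_bytes = 2  # bytes per stored value

# Keys and values are both cached, hence the leading factor of 2.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes
fp16_cache_gb = bytes_per_token * context_len / 1e9
print(f"FP16 KV cache @ {context_len} tokens: {fp16_cache_gb:.1f} GB")  # ~4.2 GB

# At the reported 4.6x ratio the budget works out to roughly 3.5 bits per
# value, i.e. low-bit codes plus per-group quantization metadata.
print(f"Compressed: {fp16_cache_gb / 4.6 * 1000:.0f} MB ({16 / 4.6:.1f} bits/value)")
```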
The technical breakthrough was overcoming a severe performance penalty. An initial implementation ran at only 28% of FP16 speed, making it impractical. Rozanov solved this by writing custom fused Metal kernels for the quantization and dequantization operations and by adding an incremental decode buffer. These optimizations brought performance up to 98% of native FP16 speed while preserving identical output quality, making the compression effectively 'free' in both accuracy and latency. The code is open source, and a pull request has been submitted to the main `mlx-lm` repository for integration.
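For intuition about where the initial slowdown came from, here is a minimal, purely illustrative MLX sketch of the naive pattern: quantize each appended block of keys and values, then dequantize the entire cache back to FP16 before every attention call. The class name, 4-bit width, and group size are assumptions for the sketch, not the PR's actual code; the port avoids this full round trip by fusing the work into custom Metal kernels and keeping the newest tokens in an incremental FP16 decode buffer.

```python
import mlx.core as mx

class NaiveQuantizedKVCache:
    """Illustrative only: appends quantized K/V blocks for one layer and
    dequantizes the whole cache on demand. Names and parameters are assumptions."""

    def __init__(self, bits=4, group_size=64):
        self.bits, self.group_size = bits, group_size
        self.keys, self.values = [], []  # per-update (codes, scales, biases, shape)

    def _pack(self, x):
        # x: (n_kv_heads, n_new_tokens, head_dim); quantize in groups along head_dim.
        codes, scales, biases = mx.quantize(
            x.reshape(-1, x.shape[-1]), self.group_size, self.bits
        )
        return codes, scales, biases, x.shape

    def update(self, k, v):
        self.keys.append(self._pack(k))
        self.values.append(self._pack(v))

    def _assemble(self, parts):
        blocks = [
            mx.dequantize(c, s, b, self.group_size, self.bits)
            .astype(mx.float16)
            .reshape(shape)
            for c, s, b, shape in parts
        ]
        return mx.concatenate(blocks, axis=1)  # stitch along the token axis

    def fetch(self):
        # Round-tripping everything to FP16 here is the slow path; the fused
        # Metal kernels dequantize inside attention instead.
        return self._assemble(self.keys), self._assemble(self.values)

# Usage sketch for a single layer with 8 KV heads and head dim 128.
cache = NaiveQuantizedKVCache()
cache.update(mx.random.normal((8, 16, 128)), mx.random.normal((8, 16, 128)))
k, v = cache.fetch()  # (8, 16, 128) float16 tensors
```

Materializing the whole FP16 cache on every decode step reintroduces much of the memory traffic the compression was meant to remove, which is presumably why fusing dequantization into the attention path and buffering recent tokens recovers nearly all of the native speed.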
This work directly addresses the primary bottlenecks for running state-of-the-art LLMs locally: memory capacity and bandwidth. By compressing the KV cache with minimal speed loss, users can either run significantly larger models on the same Apple Silicon hardware or maintain much longer conversation contexts with existing models. It represents a major step in making advanced AI more accessible and efficient on consumer-grade Macs.
- Achieves 4.6x compression of the KV cache for Qwen2.5-32B, reducing a 16K context from 4.2GB to 897MB.
- Runs at 98% of the original FP16 speed, up from an initial 28%, via custom fused Metal kernels.
- Enables running larger models or longer contexts on memory-constrained Apple Silicon Macs without quality loss.
Why It Matters
Dramatically increases the feasible model size and context length for local AI on Macs, making advanced LLMs more accessible.