llama.cpp b9455 adds quantized KV cache with tensor parallelism for 2-4x memory savings
Run larger contexts on consumer GPUs with tensor-parallel KV cache quantization.
Deep Dive
llama.cpp's b9455 release introduces quantized KV cache support with tensor parallelism, along with a fix for partial views and the removal of an overly strict assert. Builds are available for Linux, Windows, macOS, and Android.
Key Points
- Quantized KV cache cuts KV cache memory by 50-75% relative to FP16, depending on the quantization type, which can bring 128k-token contexts within reach of 24 GB GPUs
- Tensor parallelism support distributes the quantized cache across multiple GPUs for faster inference
- Available immediately for Linux, Windows, macOS (including Apple Silicon), and Android with multiple backends (CUDA, Vulkan, ROCm)
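As a rough check on the 50-75% figure, the sketch below estimates KV cache size for a hypothetical model at a 128k-token context. The model shape (32 layers, 8 KV heads, head dimension 128) and the function name are illustrative assumptions, not taken from the release. The per-block byte sizes follow ggml's q8_0 and q4_0 layouts, which store 32 values per block plus a 2-byte scale.

```python
# Minimal sketch: estimate KV cache size for different cache types.
# The model shape used below is a hypothetical example, not a specific model.

# (values per block, bytes per block) for each cache type
BLOCK_LAYOUT = {
    "f16": (1, 2),     # 2 bytes per value
    "q8_0": (32, 34),  # 32 int8 values + 2-byte scale
    "q4_0": (32, 18),  # 32 4-bit values (16 bytes) + 2-byte scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Return the total size in bytes of the K and V caches combined."""
    values_per_block, bytes_per_block = BLOCK_LAYOUT[cache_type]
    # One K and one V vector per layer, per KV head, per token.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    if n_values % values_per_block != 0:
        raise ValueError("value count must be a multiple of the block size")
    return n_values // values_per_block * bytes_per_block


if __name__ == "__main__":
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=131072)
    baseline = kv_cache_bytes(cache_type="f16", **shape)
    for name in BLOCK_LAYOUT:
        size = kv_cache_bytes(cache_type=name, **shape)
        saved = 1 - size / baseline
        print(f"{name}: {size / 2**30:.1f} GiB ({saved:.0%} saved)")
```

Under these assumptions the cache shrinks from 16 GiB at f16 to about 8.5 GiB with q8_0 and 4.5 GiB with q4_0, savings of roughly 47% and 72%. Model weights and compute buffers still need their own memory on top of this.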
Why It Matters
Developers can now run long-context LLMs locally on consumer GPUs, reducing cloud dependency for RAG and document analysis.