Developer Tools

llama.cpp b9455 adds quantized KV cache with tensor parallelism, cutting cache memory by half or more

Run larger contexts on consumer GPUs with tensor-parallel KV cache quantization.

Deep Dive

llama.cpp's b9455 release introduces quantized KV cache support with tensor parallelism, along with a partial view fix and the removal of an overly strict assert. Builds are available for Linux, Windows, macOS, and Android.

Key Points
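  • A quantized KV cache cuts the memory used to store attention keys and values by roughly 50-75% relative to 16-bit storage, depending on the quantization type, which can bring 128k-token contexts within reach of 24GB GPUs for some models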
  • Tensor parallelism support distributes the quantized cache across multiple GPUs for faster inference
  • Available immediately for Linux, Windows, macOS (including Apple Silicon), and Android with multiple backends (CUDA, Vulkan, ROCm)
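
To make the 50-75% figure concrete, here is a rough back-of-the-envelope sketch of KV cache size. It assumes the common ggml block layouts for q8_0 and q4_0 (32 values per block plus one 16-bit scale), and the model dimensions are illustrative assumptions resembling an 8B-class model with grouped-query attention, not figures from the release. The per-GPU helper assumes tensor parallelism splits the cache evenly, which is a simplification.

```python
# Bits stored per cached value, including the per-block scale.
# Assumption: q8_0 and q4_0 pack 32 values per block with one 16-bit scale.
BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,  # 8.5 bits per value
    "q4_0": (32 * 4 + 16) / 32,  # 4.5 bits per value
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Approximate combined size of the K and V caches, in bytes."""
    # Factor of 2 covers one K and one V entry per layer, head, and token.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return n_values * BITS_PER_VALUE[cache_type] / 8


def savings_vs_f16(cache_type):
    """Fraction of cache memory saved relative to 16-bit storage."""
    return 1.0 - BITS_PER_VALUE[cache_type] / BITS_PER_VALUE["f16"]


def per_gpu_bytes(total_bytes, n_gpus):
    """Per-GPU share, assuming an even split across GPUs (a simplification)."""
    return total_bytes / n_gpus


# Illustrative (hypothetical) model dimensions and a 128k-token context.
N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX = 32, 8, 128, 131072

for cache_type in ("f16", "q8_0", "q4_0"):
    total = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, cache_type)
    print(
        f"{cache_type}: {total / 2**30:.1f} GiB total, "
        f"{per_gpu_bytes(total, 2) / 2**30:.2f} GiB per GPU on 2 GPUs, "
        f"{savings_vs_f16(cache_type):.0%} saved vs f16"
    )
```

With these assumed dimensions the cache comes to about 16 GiB at f16, 8.5 GiB at q8_0, and 4.5 GiB at q4_0. Model weights and compute buffers come on top of that, so whether a 128k-token context fits in 24GB depends on the model and its own quantization.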

Why It Matters

Developers can now run long-context LLMs locally on consumer GPUs, reducing cloud dependency for RAG and document analysis.
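
For readers who want to try this locally, the sketch below assembles a `llama-server` command line that requests a quantized cache through the `--cache-type-k` and `--cache-type-v` options. The model path and the helper name are hypothetical, the list of cache types is only a subset, and the sketch leaves out any multi-GPU or tensor-parallel options because the release summary does not name them. Depending on the build, a quantized V cache may also require flash attention to be enabled.

```python
import shlex

# Subset of cache types used for illustration; other types may exist.
SUPPORTED_CACHE_TYPES = {"f16", "q8_0", "q4_0"}


def build_llama_server_cmd(model_path, n_ctx, cache_type, n_gpu_layers=99):
    """Return an argv list for llama-server with a quantized KV cache.

    Hypothetical helper: it only builds the command and does not run it.
    """
    if cache_type not in SUPPORTED_CACHE_TYPES:
        raise ValueError(f"unsupported cache type for this sketch: {cache_type}")
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(n_ctx),
        "--n-gpu-layers", str(n_gpu_layers),  # offload layers to the GPU
        "--cache-type-k", cache_type,         # storage type for the K cache
        "--cache-type-v", cache_type,         # storage type for the V cache
    ]


# Hypothetical model file; substitute a real GGUF path.
cmd = build_llama_server_cmd("models/example.gguf", 131072, "q8_0")
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the command, or to pass it to `subprocess.run` once the path and context size match your hardware.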