Developer Tools

llama.cpp b9473 optimizes KV cache for sliding window attention

New release cuts memory usage by storing only non-masked cells in SWA checkpoints

Deep Dive

llama.cpp has released version b9473. The key update: the KV cache for sliding window attention (SWA) checkpoints now stores only non-masked cells. Builds are available for macOS (Apple Silicon, Intel), Linux (CPU, Vulkan, ROCm, OpenVINO), Windows (CPU, CUDA, Vulkan), Android, iOS, and more.

Key Points
  • KV cache for SWA checkpoints now stores only non-masked cells, reducing memory usage
  • The optimization benefits models that use sliding window attention (e.g., some Mistral and Gemma models)
  • Builds cover macOS, Linux, Windows, Android, and iOS; backends include CPU, CUDA, Vulkan, ROCm, and OpenVINO
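
To make "stores only non-masked cells" concrete, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not llama.cpp's actual code: the Cell class, the function names, and the exact window rule (a cell is masked once it is n_swa or more positions behind the next token) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    pos: int     # token position held by this cache cell
    data: bytes  # stand-in for the cell's K/V tensors (hypothetical)


def is_masked_swa(cell_pos: int, query_pos: int, n_swa: int) -> bool:
    """Assumed window rule: a cell is masked once it falls n_swa or more
    positions behind the querying token."""
    return query_pos - cell_pos >= n_swa


def checkpoint_non_masked(cells, next_pos, n_swa):
    """Keep only the cells the next token could still attend to."""
    return [c for c in cells if not is_masked_swa(c.pos, next_pos, n_swa)]


# Ten cached tokens (positions 0..9), window of 4, next token at position 10.
cells = [Cell(pos=p, data=bytes(4)) for p in range(10)]
ckpt = checkpoint_non_masked(cells, next_pos=10, n_swa=4)
print([c.pos for c in ckpt])  # [7, 8, 9]
```

In this toy case the checkpoint holds 3 cells instead of 10. The saving grows with the ratio of context length to window size, since cells outside the window can never be attended to again.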

Why It Matters

Storing fewer cells per SWA checkpoint lowers memory use during local LLM inference, leaving more headroom for larger models on consumer hardware.