llama.cpp b9473 optimizes KV cache for sliding window attention
New release cuts memory usage by storing only non-masked cells in SWA checkpoints
Deep Dive
llama.cpp has released build b9473. The key update: the KV cache for sliding window attention (SWA) checkpoints now stores only non-masked cells. Builds are available for macOS (Apple Silicon, Intel), Linux (CPU, Vulkan, ROCm, OpenVINO), Windows (CPU, CUDA, Vulkan), Android, iOS, and more.
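To illustrate why keeping only non-masked cells saves memory, here is a minimal Python sketch. It is not llama.cpp's implementation: the function names are hypothetical, and the masking rule (a cell is masked once it sits `window` or more positions behind the query) is an assumption made for illustration.

```python
def non_masked_cells(positions, window, query_pos):
    """Return the cached positions that a query at query_pos can still attend to.

    Assumed convention (illustrative only): a cell is masked once it is
    `window` or more positions behind the query.
    """
    return [p for p in positions if 0 <= query_pos - p < window]


def checkpoint_sizes(n_tokens, window):
    """Compare a full checkpoint with one that keeps only non-masked cells."""
    positions = list(range(n_tokens))  # one cache cell per processed token
    next_pos = n_tokens                # position of the next token to decode
    kept = non_masked_cells(positions, window, next_pos)
    return len(positions), len(kept)


# Hypothetical numbers: 8192 cached tokens, a 1024-token sliding window.
full, kept = checkpoint_sizes(n_tokens=8192, window=1024)
print(f"full checkpoint: {full} cells, non-masked only: {kept} cells")
```

Under these assumptions, the number of cells worth saving is bounded by the window size rather than by the total context length, so the saving grows as the context gets longer.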
Key Points
- KV cache for SWA checkpoints now stores only non-masked cells, reducing memory usage
- Optimization benefits models using sliding window attention (e.g., Mistral, Gemma)
- Supports macOS, Linux, Windows, Android, and iOS; backends include CPU, CUDA, Vulkan, ROCm, and OpenVINO
Why It Matters
Reduces the memory footprint of SWA checkpoints, making local LLM inference more efficient and leaving more headroom for larger models on consumer hardware.