llama.cpp b9473 optimizes KV cache for sliding window attention

llama.cpp Releases June 02, 2026

⚡ New release cuts memory usage by storing only non-masked cells in SWA checkpoints

Deep Dive

llama.cpp has released version b9473. The key update: the KV cache for sliding window attention (SWA) checkpoints now stores only non-masked cells. Builds are available for macOS (Apple Silicon, Intel), Linux (CPU, Vulkan, ROCm, OpenVINO), Windows (CPU, CUDA, Vulkan), Android, iOS, and more.
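To make "stores only non-masked cells" concrete, here is a minimal Python sketch of the general idea. It is not llama.cpp's implementation: the `Cell` type, the `is_masked` and `make_checkpoint` helpers, and the window convention (a query at position p attends to cells whose position is within `window` of p) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    pos: int     # token position held by this cache cell
    data: bytes  # stand-in for the cell's K/V tensors


def is_masked(cell_pos: int, query_pos: int, window: int) -> bool:
    """Return True if the cell is invisible to a query at query_pos.

    Assumed convention: a cell is masked when it lies in the future or
    is `window` or more positions behind the query.
    """
    return cell_pos > query_pos or query_pos - cell_pos >= window


def make_checkpoint(cells: list[Cell], next_pos: int, window: int) -> list[Cell]:
    """Keep only the cells that the next token could still attend to."""
    return [c for c in cells if not is_masked(c.pos, next_pos, window)]


# Hypothetical cache holding positions 0..9, with a window of 4 tokens.
cells = [Cell(pos=i, data=bytes(4)) for i in range(10)]
checkpoint = make_checkpoint(cells, next_pos=10, window=4)

kept_positions = [c.pos for c in checkpoint]
print(f"kept {len(checkpoint)} of {len(cells)} cells: {kept_positions}")
```

In this toy setup the checkpoint holds 3 cells instead of 10, because cells that fall outside the window can never be attended to again and do not need to be saved.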

Key Points

KV cache for SWA checkpoints now stores only non-masked cells, reducing memory usage
Optimization benefits models using sliding window attention (e.g., Mistral, Gemma)
Supports macOS, Linux, Windows, Android, and iOS; backends include CPU, CUDA, Vulkan, ROCm, and OpenVINO

Why It Matters

Lower memory use for SWA checkpoints makes local LLM inference more efficient and can help larger models fit on consumer hardware.
