llama.cpp b9512 introduces return filter for memory savings
New release reduces memory usage during LLM inference with smarter filter handling.
The latest release of llama.cpp, version b9512, brings a memory optimization that matters for anyone running LLMs locally. The core change is the addition of a 'return filter', a mechanism that deallocates filter state after each generation step. Previously, filters used in token selection or sampling could accumulate in memory, reducing headroom for larger context windows or models. By returning (freeing) these filters once they are no longer needed, the update can save tens to hundreds of megabytes, depending on batch size and model architecture. This is especially beneficial on systems with 8 GB or 16 GB of RAM, where headroom is tight. The commit, signed and co‑authored by lvyichen (StepFun), is part of the ongoing effort to make local LLM inference more efficient.
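The accumulate-versus-release pattern described above can be illustrated with a toy loop. This is only a sketch of the general idea, not llama.cpp's code: `VOCAB_SIZE`, `build_filter`, and `generate` are hypothetical names, and the "filter" is assumed to be a simple per-step byte buffer.

```python
import tracemalloc

VOCAB_SIZE = 50_000  # hypothetical vocabulary size, chosen for illustration


def build_filter(step):
    # Hypothetical per-step filter: one byte per vocabulary entry.
    return bytearray(VOCAB_SIZE)


def generate(steps, release):
    """Run a toy generation loop and return peak traced memory in bytes."""
    tracemalloc.start()
    retained = []
    for step in range(steps):
        step_filter = build_filter(step)
        # ... token selection would consult step_filter here ...
        if release:
            del step_filter  # free the filter as soon as the step is done
        else:
            retained.append(step_filter)  # filter state piles up across steps
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


if __name__ == "__main__":
    print("peak bytes, filters retained:", generate(100, release=False))
    print("peak bytes, filters released:", generate(100, release=True))
```

In this toy, retaining filters makes peak memory grow with the number of steps, while releasing them keeps it roughly constant. The real savings in llama.cpp depend on what the filters actually hold.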
The release also reaffirms llama.cpp’s cross‑platform reach. Pre‑built binaries are available for macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x), Windows (x64, arm64), and Android arm64, along with builds for many GPU backends (Vulkan, ROCm, CUDA, OpenVINO, SYCL, HIP). Notably, the macOS Apple Silicon builds come in both standard and KleidiAI‑enabled variants for performance‑sensitive users. With about 115,000 GitHub stars and roughly 19,200 forks, the project continues to be the go‑to solution for developers and researchers running models like Llama, Mistral, and Qwen on‑device. The b9512 tag is available for immediate download from the releases page, and users can test the memory savings directly.
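One way to check the savings yourself is to compare peak resident memory between an older build and b9512 on the same prompt. The helper below is a minimal sketch for Linux and macOS using only the Python standard library. The function name `peak_rss_mib` and the binary and model paths in the usage note are placeholders, not part of llama.cpp.

```python
import os
import subprocess
import sys


def peak_rss_mib(cmd):
    """Run cmd to completion; return (exit_code, peak resident set size in MiB)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # os.wait4 reports resource usage for this specific child (Unix only).
    _, status, usage = os.wait4(proc.pid, 0)
    proc.returncode = os.waitstatus_to_exitcode(status)  # requires Python 3.9+
    # ru_maxrss is reported in bytes on macOS and in kibibytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return proc.returncode, usage.ru_maxrss / divisor


if __name__ == "__main__" and len(sys.argv) > 1:
    code, mib = peak_rss_mib(sys.argv[1:])
    print(f"exit code {code}, peak RSS {mib:.1f} MiB")
```

Saved as `peak_rss.py` (an assumed filename), it could be run once per build, for example `python peak_rss.py ./old-build/llama-cli -m model.gguf -p "Hello" -n 64` and then the same command against the b9512 binary. Peak RSS covers host memory only, so it will not capture GPU-side allocations.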
- New 'return filter' mechanism (PR #24125) frees memory by deallocating filters after generation steps.
- Supports macOS (Apple Silicon & Intel), Linux (x64/arm64/s390x), Windows (x64/arm64), Android arm64, plus GPU backends.
- Part of llama.cpp's ongoing optimization work; the project now has 115k stars and 19.2k forks on GitHub.
Why It Matters
Reducing memory overhead lets users run larger models or longer contexts on consumer hardware, democratizing local AI.