llama.cpp b9410 uses f16 mask for Flash Attention, saving VRAM
llama.cpp's latest release cuts VRAM usage with f16 Flash Attention masks...
llama.cpp, the popular C++ implementation for running large language models locally, has released version b9410. The key change is that Flash Attention (FA) now uses f16 (float16) masks to save VRAM. Each f16 element takes 2 bytes, half the size of a 32-bit float, which helps users running models on consumer GPUs with limited memory. The release also adds a new `llama_cast` function and includes formatting improvements.
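To give a rough sense of scale, the sketch below compares the size of a dense attention mask stored as 32-bit floats with the same mask stored as f16. It is back-of-the-envelope arithmetic, not code from the release: the `mask_bytes` helper, the dense `n_kv x n_tokens` shape, and the example sizes are all assumptions chosen for illustration. The actual layout, padding, and savings in llama.cpp may differ.

```python
def mask_bytes(n_kv: int, n_tokens: int, bytes_per_elem: int) -> int:
    """Bytes needed for a dense mask with one element per (KV cell, token) pair.

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    return n_kv * n_tokens * bytes_per_elem


# Assumed example sizes, not taken from the release notes.
n_kv = 32768     # KV cache cells (context length)
n_tokens = 2048  # tokens processed in one batch

f32_size = mask_bytes(n_kv, n_tokens, 4)  # 32-bit float: 4 bytes per element
f16_size = mask_bytes(n_kv, n_tokens, 2)  # f16: 2 bytes per element
saved_mib = (f32_size - f16_size) / 2**20

print(f"f32 mask: {f32_size / 2**20:.0f} MiB")  # f32 mask: 256 MiB
print(f"f16 mask: {f16_size / 2**20:.0f} MiB")  # f16 mask: 128 MiB
print(f"saved:    {saved_mib:.0f} MiB")         # saved:    128 MiB
```

Under these assumptions the mask shrinks by half. The saving grows with both context length and batch size, so it is most noticeable for long-context runs.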
The release is available for multiple platforms: macOS (Apple Silicon, Intel, and iOS XCFramework), Linux (x64, arm64, and s390x, plus Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32 builds), Android (arm64), and Windows (x64 CPU, arm64 CPU, CUDA 12/13, Vulkan, and HIP). The update carries GitHub's verified signature and received positive reactions from the community.
- Uses f16 masks for Flash Attention to reduce VRAM usage
- Adds `llama_cast` function and formatting improvements
- Available on macOS, Linux, Windows, and Android with multiple backends
Why It Matters
Lower VRAM requirements let more users run larger LLMs locally on consumer hardware.