llama.cpp b9437 adds auto flash attention benchmark flag
New release simplifies local LLM benchmarking with `-fa auto` and a default `-ngl` of -1.
The open-source llama.cpp project, widely used for running large language models locally on consumer hardware, has released build b9437. This update focuses on the benchmarking experience in the llama-bench tool. The headline feature is support for the `-fa auto` flag, which lets llama-bench choose the flash attention setting automatically instead of requiring it to be forced on or off for each run. Flash attention can significantly reduce memory usage and speed up inference, but configuring it manually is error-prone because support varies by backend and hardware. The `auto` option removes that guesswork.
Additionally, the default value for `-ngl` (the number of model layers offloaded to the GPU) has been changed to -1, matching the convention used in other llama.cpp command-line tools. This reduces friction when running benchmarks out of the box. The release also updates the README with current usage examples. Precompiled binaries are available for a wide range of platforms, including macOS (Apple Silicon and Intel), multiple Linux variants (CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, HIP), and Android arm64. For developers tuning local LLM deployments, this release streamlines performance testing by reducing the amount of manual flag tweaking needed.
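As a rough illustration of how the two changes affect a benchmark invocation, the sketch below assembles a llama-bench command line in Python. The helper name `build_bench_command` and the model path are hypothetical and chosen only for this example. The `-fa auto` value follows the release description above, and `-m` and `-ngl` are llama-bench's flags for the model file and the GPU layer count.

```python
def build_bench_command(model_path, flash_attn="auto", n_gpu_layers=None,
                        binary="llama-bench"):
    """Return an argv list for a llama-bench run (hypothetical helper).

    flash_attn:   value passed to -fa; "auto" is the option added in b9437.
    n_gpu_layers: value passed to -ngl; None omits the flag so the tool's
                  own default (-1 as of this release) applies.
    """
    cmd = [binary, "-m", model_path, "-fa", str(flash_attn)]
    if n_gpu_layers is not None:
        cmd += ["-ngl", str(n_gpu_layers)]
    return cmd


# Placeholder model path; substitute a real GGUF file before running.
default_run = build_bench_command("models/example.gguf")
# Explicitly keep every layer on the CPU for comparison.
cpu_only_run = build_bench_command("models/example.gguf", n_gpu_layers=0)

print(" ".join(default_run))
print(" ".join(cpu_only_run))
```

Either list can be passed to `subprocess.run` once a llama-bench binary is on the `PATH`. Leaving `-ngl` out relies on the new default, while passing it explicitly keeps results comparable with older builds.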
- llama-bench now supports `-fa auto` for automatic flash attention selection during benchmarks
- Default `-ngl` value changed to -1 for consistency with other llama.cpp tools
- Prebuilt binaries provided for macOS, Linux, Windows, and Android across CPU and various GPU backends
Why It Matters
Simpler local LLM benchmarking helps developers optimize model inference speed and memory usage on diverse hardware.