llama.cpp b9437 adds auto flash attention benchmark flag

llama.cpp Releases May 31, 2026

⚡ New release streamlines local LLM benchmarking with `-fa auto` and a default `-ngl` of -1.

Deep Dive

The open-source llama.cpp project, widely used for running large language models locally on consumer hardware, has published release b9437. This update focuses on the llama-bench benchmarking tool. The headline feature is support for `-fa auto`, which lets llama-bench select the flash attention setting automatically during a benchmark run instead of requiring it to be set by hand. Flash attention can significantly reduce memory usage and speed up inference, but configuring it manually is error-prone because support varies across backends and hardware. The `auto` option removes that guesswork.

The default value for `-ngl` (the number of model layers offloaded to the GPU) has also changed to -1, matching the convention used by other llama.cpp command-line tools. This reduces friction when running benchmarks out of the box, since the same `-ngl` habits now carry over between tools. The release also updates the README with the latest usage examples.

Precompiled binaries are available for a wide range of platforms: macOS (Apple Silicon and Intel), multiple Linux variants (CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, HIP), and Android arm64. For developers tuning local LLM deployments, this release streamlines performance testing by reducing the amount of manual flag tweaking needed.
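As a rough illustration of how the two flags might be used together, the Python sketch below assembles a llama-bench command line and, optionally, runs it. It is a minimal sketch under stated assumptions, not an official wrapper: the helper names `build_bench_command` and `run_bench` are hypothetical, the model path is a placeholder, and the `llama-bench` binary is assumed to be on the `PATH`. Only `-m`, `-fa`, and `-ngl` are passed.

```python
import subprocess


def build_bench_command(model_path, flash_attn="auto", n_gpu_layers=-1,
                        binary="llama-bench"):
    """Return the argv list for one llama-bench run.

    Hypothetical helper: it only assembles arguments and does not
    check that the binary or the model file exists.
    """
    return [
        binary,
        "-m", str(model_path),      # path to a GGUF model (placeholder)
        "-fa", str(flash_attn),     # "auto" is the value added in b9437
        "-ngl", str(n_gpu_layers),  # -1 is the new default in b9437
    ]


def run_bench(model_path, **kwargs):
    """Run llama-bench and return whatever it prints to stdout.

    Raises subprocess.CalledProcessError if the tool exits non-zero.
    """
    cmd = build_bench_command(model_path, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `build_bench_command("model.gguf")` yields the equivalent of `llama-bench -m model.gguf -fa auto -ngl -1`. With b9437, passing `-ngl -1` explicitly should be redundant since it is now the default; it is spelled out here only to make the setting visible.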

Key Points

llama-bench now supports `-fa auto` for automatic flash attention selection during benchmarks
Default `-ngl` value changed to -1 for consistency with other llama.cpp tools
Prebuilt binaries provided for macOS, Linux, Windows, and Android across CPU and various GPU backends

Why It Matters

Simpler local LLM benchmarking helps developers optimize model inference speed and memory usage on diverse hardware.
