b8922
Running LLMs in-browser just got faster with flash attention for WebGPU.
The llama.cpp project has released version b8922, a significant update that brings flash attention (FLASH_ATTN_EXT) to the WebGPU backend, making the attention step faster and more memory-efficient for large language models running locally in the browser. The release adds a tile flash attention fallback for browsers that lack subgroup matrix support, alongside a vec path from which the mnk parameter has been dropped. Developers can expect improved performance when running models such as Llama or Mistral directly in the browser without server-side compute.
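To make the memory-efficiency claim concrete, here is a minimal CPU-side sketch of the streaming-softmax idea behind flash attention for a single query vector: the KV sequence is walked in tiles with a running max and denominator, so the full score matrix is never materialized. This is an illustrative sketch only, not the WGSL shader shipped in the release, and all names are made up for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// q: [d], k/v: [T][d] flattened row-major. Returns the attention output [d].
std::vector<float> flash_attn_one_query(const std::vector<float>& q,
                                        const std::vector<float>& k,
                                        const std::vector<float>& v,
                                        std::size_t T, std::size_t d,
                                        std::size_t tile = 64) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> o(d, 0.0f);  // weighted-value accumulator
    float m = -INFINITY;            // running max of scores
    float l = 0.0f;                 // running softmax denominator

    for (std::size_t t0 = 0; t0 < T; t0 += tile) {
        const std::size_t t1 = std::min(t0 + tile, T);

        // Scores for this KV tile only.
        std::vector<float> s(t1 - t0);
        float tile_max = -INFINITY;
        for (std::size_t j = t0; j < t1; ++j) {
            float dot = 0.0f;
            for (std::size_t c = 0; c < d; ++c) dot += q[c] * k[j * d + c];
            s[j - t0] = dot * scale;
            tile_max = std::max(tile_max, s[j - t0]);
        }

        // Rescale previous partial results to the new running max.
        const float new_m = std::max(m, tile_max);
        const float corr  = std::exp(m - new_m);
        for (float& x : o) x *= corr;
        l *= corr;

        // Accumulate this tile's contribution.
        for (std::size_t j = t0; j < t1; ++j) {
            const float p = std::exp(s[j - t0] - new_m);
            for (std::size_t c = 0; c < d; ++c) o[c] += p * v[j * d + c];
            l += p;
        }
        m = new_m;
    }

    for (float& x : o) x /= l;  // final softmax normalization
    return o;
}
```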
Key technical improvements include staging the KV cache for the tile version of flash attention, merging bindings when the K and V buffers overlap, and moving path selection into the shader library so a single decision object determines which flash-attention variant runs. The update also fixes buffer overlap issues when nwg==1 and turns skip_validation off, so validation checks are no longer bypassed. The release is part of the ongoing effort to make local AI inference more accessible and performant across platforms, with pre-built binaries available for macOS, Linux, Windows, Android, and iOS.
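A rough sketch of what a single path-selection decision object could look like follows. The type, the field names (including reading nwg as a workgroup count), and the selection rules are assumptions made for illustration; they are not the code actually shipped in b8922.

```cpp
#include <cstdint>

// Hypothetical flash-attention variants for the WebGPU backend.
enum class FlashAttnPath {
    SubgroupMatrix,  // preferred path when subgroup matrix support exists
    Tile,            // tiled fallback for browsers without subgroup matrices
    Vec,             // vector path, e.g. single-token decode (assumption)
};

// Hypothetical decision object: chosen once, then read by both pipeline
// setup and shader generation so they cannot disagree.
struct FlashAttnDecision {
    FlashAttnPath path;
    uint32_t workgroups;     // nwg: workgroups splitting one KV sequence (assumption)
    bool     stage_kv;       // copy KV tiles through a staging buffer first
    bool     merge_kv_binds; // bind K and V once when their buffers overlap
};

// Illustrative selector; the real selection criteria live in the shader library.
FlashAttnDecision select_flash_attn_path(bool has_subgroup_matrix,
                                         uint32_t n_tokens,
                                         bool kv_buffers_overlap) {
    FlashAttnDecision d{};
    if (n_tokens == 1) {
        d.path = FlashAttnPath::Vec;            // batch of one: vector path
    } else if (has_subgroup_matrix) {
        d.path = FlashAttnPath::SubgroupMatrix;
    } else {
        d.path = FlashAttnPath::Tile;           // browser lacks subgroup matrices
    }
    d.workgroups     = (d.path == FlashAttnPath::Vec) ? 1u : 4u;
    d.stage_kv       = (d.path == FlashAttnPath::Tile);
    d.merge_kv_binds = kv_buffers_overlap;
    return d;
}
```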
- Enables flash attention (FLASH_ATTN_EXT) for the WebGPU backend, with a tile fallback path for browsers without subgroup matrix support and a separate vec path.
- Includes optimized KV cache staging and shader path selection for improved inference performance.
- Available as pre-built binaries for macOS, Linux, Windows, Android, and iOS, including CPU and GPU variants.
Why It Matters
Faster in-browser LLM inference enables more responsive AI apps without server costs.