b8922
Running LLMs in-browser just got faster with flash attention for WebGPU.
The llama.cpp project has released version b8922, a significant update that brings flash attention (FLASH_ATTN_EXT) to the WebGPU backend, making the attention step faster and more memory-efficient for large language models running locally in the browser. The release adds a tile flash attention fallback for browsers that lack subgroup matrix support, alongside a vec path from which the mnk parameter has been dropped. Developers can expect improved performance when running models such as Llama or Mistral directly in the browser without server-side compute.
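To make the memory-efficiency claim concrete, here is a minimal CPU-side sketch of the streaming-softmax idea behind flash attention for a single query vector: the KV sequence is walked in tiles with a running max and denominator, so the full score matrix is never materialized. This is an illustrative sketch only, not the WGSL shader shipped in the release, and all names are made up for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// q: [d], k/v: [T][d] flattened row-major. Returns the attention output [d].
std::vector<float> flash_attn_one_query(const std::vector<float>& q,
                                        const std::vector<float>& k,
                                        const std::vector<float>& v,
                                        std::size_t T, std::size_t d,
                                        std::size_t tile = 64) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> o(d, 0.0f);  // weighted-value accumulator
    float m = -INFINITY;            // running max of scores
    float l = 0.0f;                 // running softmax denominator

    for (std::size_t t0 = 0; t0 < T; t0 += tile) {
        const std::size_t t1 = std::min(t0 + tile, T);

        // Scores for this KV tile only.
        std::vector<float> s(t1 - t0);
        float tile_max = -INFINITY;
        for (std::size_t j = t0; j < t1; ++j) {
            float dot = 0.0f;
            for (std::size_t c = 0; c < d; ++c) dot += q[c] * k[j * d + c];
            s[j - t0] = dot * scale;
            tile_max = std::max(tile_max, s[j - t0]);
        }

        // Rescale previous partial results to the new running max.
        const float new_m = std::max(m, tile_max);
        const float corr  = std::exp(m - new_m);
        for (float& x : o) x *= corr;
        l *= corr;

        // Accumulate this tile's contribution.
        for (std::size_t j = t0; j < t1; ++j) {
            const float p = std::exp(s[j - t0] - new_m);
            for (std::size_t c = 0; c < d; ++c) o[c] += p * v[j * d + c];
            l += p;
        }
        m = new_m;
    }

    for (float& x : o) x /= l;  // final softmax normalization
    return o;
}
```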
Key technical improvements include staging the KV cache for the tile version of flash attention, merging bindings when the K and V buffers overlap, and moving path selection into the shader library so a single decision object determines which flash-attention variant runs. The update also fixes buffer overlap issues when nwg==1 and turns skip_validation off, so validation checks are no longer bypassed. The release is part of the ongoing effort to make local AI inference more accessible and performant across platforms, with pre-built binaries available for macOS, Linux, Windows, Android, and iOS.
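A rough sketch of what a single path-selection decision object could look like follows. The type, the field names (including reading nwg as a workgroup count), and the selection rules are assumptions made for illustration; they are not the code actually shipped in b8922.

```cpp
#include <cstdint>

// Hypothetical flash-attention variants for the WebGPU backend.
enum class FlashAttnPath {
    SubgroupMatrix,  // preferred path when subgroup matrix support exists
    Tile,            // tiled fallback for browsers without subgroup matrices
    Vec,             // vector path, e.g. single-token decode (assumption)
};

// Hypothetical decision object: chosen once, then read by both pipeline
// setup and shader generation so they cannot disagree.
struct FlashAttnDecision {
    FlashAttnPath path;
    uint32_t workgroups;     // nwg: workgroups splitting one KV sequence (assumption)
    bool     stage_kv;       // copy KV tiles through a staging buffer first
    bool     merge_kv_binds; // bind K and V once when their buffers overlap
};

// Illustrative selector; the real selection criteria live in the shader library.
FlashAttnDecision select_flash_attn_path(bool has_subgroup_matrix,
                                         uint32_t n_tokens,
                                         bool kv_buffers_overlap) {
    FlashAttnDecision d{};
    if (n_tokens == 1) {
        d.path = FlashAttnPath::Vec;            // batch of one: vector path
    } else if (has_subgroup_matrix) {
        d.path = FlashAttnPath::SubgroupMatrix;
    } else {
        d.path = FlashAttnPath::Tile;           // browser lacks subgroup matrices
    }
    d.workgroups     = (d.path == FlashAttnPath::Vec) ? 1u : 4u;
    d.stage_kv       = (d.path == FlashAttnPath::Tile);
    d.merge_kv_binds = kv_buffers_overlap;
    return d;
}
```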
- Enables flash attention (FLASH_ATTN_EXT) for the WebGPU backend, with a tile fallback path for browsers without subgroup matrix support and a separate vec path.
- Includes optimized KV cache staging and shader path selection for improved inference performance.
- Available as pre-built binaries for macOS, Linux, Windows, Android, and iOS, including CPU and GPU variants.
Why It Matters
Faster in-browser LLM inference enables more responsive AI apps without server costs.