b8105
The latest update patches a critical CUDA kernel selection bug, improving performance for GPU users.
Deep Dive
The open-source project llama.cpp, maintained by the ggml organization, released version b8105. This update fixes a bug in the kernel selection logic for tile-based Flash Attention (FA) on CUDA GPUs, as detailed in pull request #19686. The fix ensures the correct, faster kernel is selected during inference, which can improve the speed and stability of running models such as Llama 3 on NVIDIA hardware across Windows and Linux.
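To make the nature of the fix more concrete, here is a minimal, purely illustrative sketch of what host-side kernel selection logic can look like. The function name, thresholds, and head-dimension checks below are assumptions for the sake of the example and do not reflect llama.cpp's actual CUDA code or the specific condition changed in PR #19686; the point is only that a wrong predicate in code like this can silently route inference to a slower kernel.

```cpp
// Illustrative only: a toy dispatcher choosing between a tile-based and a
// vector-based Flash Attention kernel. Names and thresholds are hypothetical.
#include <cstdio>

enum class fa_kernel { tile, vec };

// Hypothetical rule: use the tile kernel when the head dimension is supported
// and the KV sequence is long enough to keep the GPU busy; otherwise fall
// back to the vector kernel.
fa_kernel select_fa_kernel(int head_dim, int kv_len, bool tile_supported) {
    if (!tile_supported) {
        return fa_kernel::vec;
    }
    // A selection bug of the kind fixed in b8105 lives in a predicate like
    // this: if the condition is wrong, the slower kernel is picked even
    // though the faster tile kernel would work.
    const bool head_dim_ok = (head_dim == 64 || head_dim == 128);
    const bool kv_long     = kv_len >= 256;
    return (head_dim_ok && kv_long) ? fa_kernel::tile : fa_kernel::vec;
}

int main() {
    struct { int head_dim, kv_len; } cases[] = {{128, 4096}, {128, 64}, {80, 4096}};
    for (const auto & c : cases) {
        const fa_kernel k = select_fa_kernel(c.head_dim, c.kv_len, /*tile_supported=*/true);
        std::printf("head_dim=%3d kv_len=%5d -> %s kernel\n",
                    c.head_dim, c.kv_len, k == fa_kernel::tile ? "tile" : "vec");
    }
    return 0;
}
```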
Why It Matters
For developers running local LLMs, this fix means more reliable and potentially faster performance on consumer and server GPUs.