GATED_DELTA_NET for Vulkan merged in llama.cpp
A newly merged Vulkan backend optimization in llama.cpp delivers roughly 30% faster token generation in early AMD GPU benchmarks.
llama.cpp, the open-source powerhouse behind efficient local AI inference, has merged a significant performance optimization for its Vulkan compute backend. The change, titled 'Gated Delta Net' and submitted via GitHub Pull Request #20334, is now part of the project's main branch and the latest release. It targets the Vulkan API, a cross-platform graphics and compute standard, to improve how large language models (LLMs) run on compatible hardware, with AMD GPUs seeing immediate benefits.
Early benchmark results from a community tester demonstrate the tangible impact. Running on a Fedora Linux system with an AMD Radeon RX 7800 XT GPU, inference speed for the 27-billion-parameter Qwen 3.5 model increased from approximately 28 tokens per second to about 36 tokens per second, a roughly 30% uplift in token generation (36 ÷ 28 ≈ 1.29), a critical metric for real-time AI applications. The optimization works by refining how the gated-delta-net portion of the model's computational graph executes on the Vulkan backend, reducing overhead and improving hardware utilization.
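For readers unfamiliar with the name: "gated delta net" refers to a linear-attention-style layer in which a fast-weight state matrix is decayed by a per-token gate and updated with the delta rule, as described in the Gated DeltaNet literature. The NumPy sketch below shows the sequential form of that recurrence for illustration only; it is not the code from the PR, and the function name, tensor shapes, and key normalization here are assumptions. The actual Vulkan kernel operates on GGML tensors in batched form.

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Sequential reference sketch of the gated delta rule recurrence:
        S_t = alpha_t * S_{t-1} @ (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T
        o_t = S_t @ q_t
    q, k, v : (T, d) arrays of per-token queries, keys, and values
    alpha   : (T,) decay gates in (0, 1)
    beta    : (T,) write strengths in (0, 1)
    """
    T, d = q.shape
    S = np.zeros((d, d))   # fast-weight state holding value-key associations
    out = np.empty_like(q)
    I = np.eye(d)
    for t in range(T):
        # keys are typically L2-normalized so the erase term is well-behaved
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-12)
        # decay the old state, erase the stale value bound to k_t, write the new one
        S = alpha[t] * S @ (I - beta[t] * np.outer(kt, kt)) \
            + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]  # read the state with the query
    return out
```

Production kernels chunk this recurrence to expose parallelism across tokens; the Vulkan work in the PR concerns that kind of batched GPU execution rather than a token-by-token loop like the one above.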
This merge is part of llama.cpp's ongoing effort to democratize high-performance AI by supporting a wide range of hardware, from Metal on Apple Silicon and CUDA on NVIDIA GPUs to this enhanced Vulkan path for AMD and Intel GPUs. For developers and enthusiasts using AMD graphics cards, the update lowers the barrier to running powerful local models like Qwen 3.5 27B, making interactive applications more responsive and feasible without expensive, proprietary hardware setups.
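As a concrete illustration of that accessibility, here is a minimal sketch of loading a GGUF model with full GPU offload through the llama-cpp-python bindings, which wrap llama.cpp. The model filename is a placeholder, not a real artifact, and a package built with the Vulkan backend is assumed.

```python
# Sketch: running a GGUF model with GPU offload via llama-cpp-python.
# Assumes the package was built with the Vulkan backend, e.g.:
#   CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen-3.5-27b-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=-1,                        # offload every layer to the GPU
    n_ctx=4096,                             # context window size
)

result = llm("Explain the Vulkan API in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```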
- llama.cpp merged the 'Gated Delta Net' optimization for its Vulkan backend via GitHub PR #20334.
- On an AMD RX 7800 XT, Qwen 3.5 27B inference rose from ~28 to ~36 tokens/sec, a roughly 30% gain.
- The update enhances hardware accessibility, making powerful local LLMs more viable on non-NVIDIA GPUs.
Why It Matters
Lowers the cost and raises the performance of running local LLMs on AMD hardware, challenging NVIDIA's dominance in AI inference.