b8317
A new commit adds a fused GPU shader, cutting ~32K redundant global reads and ~16K exp() calls per token.
The llama.cpp project has released a significant update with commit b8317, adding a Vulkan compute shader for the GATED_DELTA_NET operation. The shader implements the gated delta net recurrence as a single fused kernel and supports scalar gates, KDA vector gates, GQA (Grouped Query Attention) broadcast, multi-token sequences, and permuted inputs. The implementation passes all 13 test-backend-ops cases on an AMD Radeon 890M using the RADV GFX1150 driver.
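For readers unfamiliar with the operation, the sketch below shows one common form of a gated delta-rule recurrence on the CPU. It is an illustrative reference only, not the shader's actual code: the function name, shapes, and gating convention are assumptions, and a scalar gate is simply the special case where every entry of `g` is equal (a per-channel `g` corresponds to the vector-gate path).

```python
import math

def gated_delta_step(S, q, k, v, g, beta):
    """One step of a gated delta-rule recurrence (illustrative sketch).

    S    : d_k x d_v state matrix (list of lists)
    q, k : length-d_k query/key vectors; v: length-d_v value vector
    g    : per-key-dim log-decay gates (all equal => scalar gate)
    beta : delta-rule learning-rate scalar
    """
    d_k, d_v = len(S), len(S[0])
    a = [math.exp(gi) for gi in g]  # exp() computed once per gate
    # Decay: scale row i of the state by exp(g[i]).
    S = [[a[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # Delta-rule error: how far S is from storing v at key k (k^T S vs v).
    kS = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    err = [beta * (v[j] - kS[j]) for j in range(d_v)]
    # Rank-1 update: S += k * err^T.
    S = [[S[i][j] + k[i] * err[j] for j in range(d_v)] for i in range(d_k)]
    # Read out: o = S^T q.
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o
```

The decay and the rank-1 update are the two pieces the commit fuses into one pass over the state.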
The optimization delivers measurable gains through several techniques: vec4 dot products that map onto dp4 hardware intrinsics; caching exponential gate values in shared memory, which eliminates roughly 32,000 redundant global reads and 16,000 redundant exp() calls per token; and fusing the decay with the rank-1 update, cutting the inner-loop work from 12 scalar operations to 3 vec4 operations. Together these changes yield a 5.4% throughput increase on KDA paths while leaving non-KDA performance unchanged. The update also adds proper handling for interleaved head layouts and FLOAT_TYPE casting for consistent behavior across Vulkan configurations.
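The exp()-caching idea can be illustrated with a simple CPU analogue (the function names and shapes here are hypothetical; in the actual shader the cached values live in workgroup shared memory rather than a Python list):

```python
import math

def decay_state_naive(gates, state):
    """Scale row i of the state by exp(gates[i]), recomputing exp()
    for every element -- the redundant pattern the commit removes."""
    calls = 0
    out = []
    for g, row in zip(gates, state):
        new_row = []
        for x in row:
            new_row.append(x * math.exp(g))  # one exp() per element
            calls += 1
        out.append(new_row)
    return out, calls

def decay_state_cached(gates, state):
    """Compute exp() once per gate and reuse it across the whole row --
    the shared-memory caching pattern, modeled here as a plain list."""
    a = [math.exp(g) for g in gates]  # one exp() per gate, cached
    out = [[x * ai for x in row] for ai, row in zip(a, state)]
    return out, len(a)
```

With a 4-row, 8-column state the naive version calls exp() 32 times versus 4 for the cached version; at real head dimensions and token counts that ratio is what produces the thousands of eliminated calls and reads per token.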
This release represents a collaborative effort with contributions from both human developers and AI systems, notably Claude Opus 4.6, which co-authored multiple commits. The update is part of llama.cpp's ongoing effort to optimize inference across diverse hardware platforms, including macOS Apple Silicon, Linux Vulkan/ROCm systems, and Windows with CUDA, Vulkan, and SYCL support. The GATED_DELTA_NET operation is particularly relevant for modern transformer architectures that use gating mechanisms for improved efficiency and performance.
- Implements Vulkan compute shader for GATED_DELTA_NET with 5.4% KDA throughput gain
- Eliminates ~32K redundant global reads and ~16K exp() calls per token via shared memory caching
- Supports scalar/KDA vector gates, GQA broadcast, and passes all 13 test cases on AMD Radeon 890M
Why It Matters
Enables more efficient GPU inference for modern gated transformer architectures, reducing computational overhead for AI applications.