llama.cpp b8299
A new commit adds GPU acceleration for DeltaNet models, pushing Qwen3.5-0.8B to 213 tokens/sec on an M4 Max.
The open-source project llama.cpp, maintained by ggml-org, has released a significant performance update with commit b8299. The core addition is a fused Metal kernel for the GGML_OP_GATED_DELTA_NET operation, the recurrence at the heart of the gated-delta-net attention used by models like Qwen3.5. This kernel enables GPU-accelerated inference for these models on Apple Silicon Macs, supporting both GDA (scalar gate) and KDA (per-row gate) modes for head sizes of 64 and 128. The result is a tangible 25% speed increase, pushing a quantized Qwen3.5-0.8B model from 170 to 213 tokens per second on an Apple M4 Max.
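To make the GDA/KDA distinction concrete, here is a minimal NumPy sketch of one step of a gated delta-rule recurrence. This is an illustrative reference only, not llama.cpp's kernel: the function name, state layout (keys by values), and the exact placement of the decay gate are assumptions for exposition. The one thing it pins down is the difference between the two gate modes: a scalar decay (GDA-style) versus a per-row decay vector (KDA-style).

```python
import numpy as np

def gated_delta_step(S, k, v, beta, alpha):
    """One step of a gated delta rule (illustrative sketch, not llama.cpp code).

    S     : (d_k, d_v) running state matrix for one head
    k, v  : (d_k,) key and (d_v,) value vectors at this step
    beta  : scalar write strength in [0, 1]
    alpha : decay gate; a scalar models a GDA-style gate, a (d_k,)
            vector models a KDA-style per-row gate
    """
    decay = alpha if np.isscalar(alpha) else alpha[:, None]  # broadcast per-row gate
    S = decay * S - beta * np.outer(k, k @ S)  # decay state, erase old association
    S = S + beta * np.outer(k, v)              # write the new key-value association
    return S

d_k, d_v = 4, 4
rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)

S_gda = gated_delta_step(S, k, v, beta=0.9, alpha=0.95)                # scalar gate
S_kda = gated_delta_step(S, k, v, beta=0.9, alpha=np.full(d_k, 0.95))  # per-row gate
# A uniform per-row gate reduces to the scalar case
assert np.allclose(S_gda, S_kda)
```

The fused kernel's job is to run this whole erase-then-write update per token inside a single GPU dispatch instead of as separate elementwise ops.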
Beyond Apple hardware, the commit includes substantial CUDA backend improvements for Nvidia GPUs. These changes optimize the gated delta net operation by sharding computations across warps, which reduces register pressure and lets more cooperative thread arrays (CTAs) stay resident per streaming multiprocessor. With more work in flight, the warp scheduler can hide memory latency better, improving overall throughput. The update also refactors the internal API for building the delta network and ensures graceful fallback to CPU for unsupported configurations, maintaining broad compatibility across macOS, Linux, Windows, and openEuler systems.
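The "fused kernel where supported, CPU fallback otherwise" behavior can be sketched as a simple dispatch check. The function and parameter names here are hypothetical, not llama.cpp's actual backend API; only the supported configurations (GDA/KDA modes, head sizes 64 and 128) come from the commit description.

```python
# Head sizes the fused Metal kernel handles, per the commit description
SUPPORTED_HEAD_SIZES = {64, 128}
SUPPORTED_GATE_MODES = {"gda", "kda"}

def dispatch_gated_delta_net(head_size, gate_mode, gpu_fn, cpu_fn):
    """Route the op to the fused GPU kernel when the configuration is
    supported, otherwise fall back to the CPU path.
    (Illustrative pattern only; names are not llama.cpp's API.)"""
    if head_size in SUPPORTED_HEAD_SIZES and gate_mode in SUPPORTED_GATE_MODES:
        return gpu_fn
    return cpu_fn

# Usage: a supported config takes the fast path, anything else degrades gracefully
fast = lambda: "gpu"
slow = lambda: "cpu"
assert dispatch_gated_delta_net(128, "kda", fast, slow)() == "gpu"
assert dispatch_gated_delta_net(96, "gda", fast, slow)() == "cpu"
```

This kind of per-op capability check is what keeps unsupported shapes working (just slower) rather than failing outright.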
- Adds a fused Metal kernel for GGML_OP_GATED_DELTA_NET, enabling GPU acceleration for Qwen3.5 on Apple Silicon.
- Delivers a 25% performance boost, from 170 to 213 tokens/sec for Qwen3.5-0.8B Q4_K_M on an M4 Max.
- Includes CUDA optimizations that reduce register pressure and improve warp scheduling for Nvidia GPUs.
Why It Matters
This significantly lowers the hardware barrier for running state-of-the-art models locally, making efficient AI inference more accessible on consumer Apple devices.