llama.cpp b9862 eliminates redundant CUDA copies for faster inference

llama.cpp Releases July 04, 2026

⚡ Removes up to 4 extra GPU copy calls per decode step for GDN models (with MTP draft length 3).

Deep Dive

llama.cpp, the popular open-source C/C++ project for running large language models locally, has shipped release b9862 with a targeted performance improvement for models that use the Gated Delta Net (GDN) architecture. The commit, from the ggml-org team, removes redundant CUDA memory copies that occurred during GDN inference.

Previously, the GDN kernel wrote recurrent state snapshots into the tail of its output buffer. The compute graph then immediately issued separate `ggml_cuda_cpy` calls to copy those snapshots into the `ssm_states_all` cache. With multi-token prediction (MTP) at a draft length of 3, this pattern triggered 4 extra CUDA copy kernels per decode step, adding unnecessary latency and memory bandwidth pressure.
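The old data flow can be pictured with a small host-side C++ sketch. It is an illustration only, not llama.cpp code: `copy_snapshot`, `run_old_path`, and `CopyStats` are hypothetical names, plain `std::vector` buffers stand in for GPU memory, and it assumes one copy launch per snapshot, which would match the 4 copies reported for draft length 3.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one device-to-device copy kernel launch.
// It copies on the host and counts how many times it was called.
struct CopyStats {
    int launches = 0;
};

void copy_snapshot(const std::vector<float>& src, std::size_t src_off,
                   std::vector<float>& dst, std::size_t dst_off,
                   std::size_t n, CopyStats& stats) {
    assert(src_off + n <= src.size());
    assert(dst_off + n <= dst.size());
    for (std::size_t i = 0; i < n; ++i) {
        dst[dst_off + i] = src[src_off + i];
    }
    ++stats.launches;
}

// Old path (assumed layout): the "kernel" places n_snapshots state
// snapshots after its n_out regular outputs, then every snapshot is
// copied into the cache by a separate launch.
CopyStats run_old_path(std::size_t n_out, std::size_t n_snapshots,
                       std::size_t state_size, std::vector<float>& cache) {
    std::vector<float> out(n_out + n_snapshots * state_size, 0.0f);

    // Stand-in for the kernel: snapshot s is filled with the value s + 1.
    for (std::size_t s = 0; s < n_snapshots; ++s) {
        for (std::size_t i = 0; i < state_size; ++i) {
            out[n_out + s * state_size + i] = static_cast<float>(s + 1);
        }
    }

    // One extra copy per snapshot: tail of the output buffer -> cache.
    CopyStats stats;
    for (std::size_t s = 0; s < n_snapshots; ++s) {
        copy_snapshot(out, n_out + s * state_size,
                      cache, s * state_size, state_size, stats);
    }
    return stats;
}
```

Calling `run_old_path(16, 4, 8, cache)` with a cache of 32 floats performs 4 copy launches in this toy model. Each of those launches reads and rewrites data the kernel had already produced, which is the overhead the release targets.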

The fix detects this exact pattern in the compute graph: `gated_delta_net -> view -> cpy`. When it is safe to do so, the CUDA GDN kernel now writes the state snapshots directly into the recurrent cache, skipping both the intermediate writes and the copy kernels. The optimization matters most for models that rely heavily on stateful recurrent layers, such as some state-space model variants.
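As a rough illustration of what matching a `gated_delta_net -> view -> cpy` chain involves, here is a minimal C++ sketch over a toy graph. The `Op` enum, the `Node` struct, and `find_fusable_copies` are hypothetical and do not mirror ggml's real data structures. The safety checks the actual change performs are not described in the release notes, so they are left out here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical op tags; these are not ggml's real enum values.
enum class Op { GatedDeltaNet, View, Cpy, Other };

// Toy graph node: an op plus the index of its single input (-1 if none).
struct Node {
    Op  op;
    int src;
};

// Returns the indices of cpy nodes that terminate a
// gated_delta_net -> view -> cpy chain in the toy graph.
std::vector<int> find_fusable_copies(const std::vector<Node>& nodes) {
    std::vector<int> result;
    const int n = static_cast<int>(nodes.size());
    for (int i = 0; i < n; ++i) {
        if (nodes[i].op != Op::Cpy) continue;

        // The copy must read from a view...
        const int v = nodes[i].src;
        if (v < 0 || v >= n || nodes[v].op != Op::View) continue;

        // ...and that view must slice a gated_delta_net output.
        const int g = nodes[v].src;
        if (g < 0 || g >= n || nodes[g].op != Op::GatedDeltaNet) continue;

        result.push_back(i);
    }
    return result;
}
```

In this sketch, every returned index marks a copy the kernel could absorb by writing to the copy's destination itself. A copy whose view comes from any other op is left alone, which loosely corresponds to the "when safe" condition described above.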

No API or configuration changes are required; the speedup is transparent for anyone already running llama.cpp with CUDA. The release also includes the usual cross-platform binaries for Linux, Windows, macOS, and Android. Users should see lower inference latency and reduced GPU memory bandwidth use in applicable scenarios.

Key Points

Removes redundant CUDA copy operations for GDN (Gated Delta Net) state snapshots.
Eliminates up to 4 extra `ggml_cuda_cpy` calls per decode step when using MTP draft length 3.
Directly writes recurrent state into cache, reducing memory bandwidth and latency on NVIDIA GPUs.

Why It Matters

Faster LLM inference on consumer and enterprise GPUs with zero configuration changes.
