Developer Tools

llama.cpp b9862 eliminates redundant CUDA copies for faster inference

The GDN optimization removes 4 extra CUDA copy kernels per decode step when multi-token prediction runs with a draft length of 3.

Deep Dive

llama.cpp, the popular open-source C/C++ implementation for running large language models locally, has released version b9862 with a targeted performance boost for models using Gated Delta Net (GDN) architectures. The commit, authored by the ggml-org team, removes redundant CUDA memory copy operations that occurred during GDN inference.

Previously, the GDN kernel wrote recurrent state snapshots into the tail of its output buffer. The compute graph then immediately issued separate `ggml_cuda_cpy` calls to copy those snapshots into the `ssm_states_all` cache. With multi-token prediction (MTP) at a draft length of 3, this pattern launched 4 extra CUDA copy kernels per decode step, adding avoidable latency and memory bandwidth pressure.
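The old and new data flows can be compared with a small CPU-only sketch. This is a toy model, not llama.cpp code: `StepResult`, `snapshot_value`, `step_with_copies`, and `step_direct` are hypothetical names, plain `std::vector` buffers stand in for GPU memory, and the snapshot count is a free parameter set to 4 only to match the copy count cited above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model (not llama.cpp code): tracks simulated kernel
// launches and the final contents of a stand-in recurrent state cache.
struct StepResult {
    std::vector<float> cache;    // stand-in for the recurrent state cache
    std::size_t launches = 0;    // simulated kernel launches in this step
};

// Stand-in for the value the kernel would compute for snapshot s, element i.
static float snapshot_value(std::size_t s, std::size_t i) {
    return static_cast<float>(s * 1000 + i);
}

// Old flow: snapshots land in the tail of the output buffer, then one copy
// per snapshot moves them into the cache.
StepResult step_with_copies(std::size_t n_out, std::size_t n_snap,
                            std::size_t state_size) {
    assert(state_size > 0);
    StepResult r;
    r.cache.assign(n_snap * state_size, 0.0f);
    std::vector<float> out(n_out + n_snap * state_size, 0.0f);

    // Simulated GDN kernel: snapshots follow the first n_out output values.
    for (std::size_t s = 0; s < n_snap; ++s)
        for (std::size_t i = 0; i < state_size; ++i)
            out[n_out + s * state_size + i] = snapshot_value(s, i);
    r.launches += 1;

    // Simulated copy kernels: output tail -> cache slot, one per snapshot.
    for (std::size_t s = 0; s < n_snap; ++s) {
        std::copy_n(out.data() + n_out + s * state_size, state_size,
                    r.cache.data() + s * state_size);
        r.launches += 1;
    }
    return r;
}

// New flow: the simulated kernel writes each snapshot straight into its
// cache slot, so no separate copy launches are needed.
StepResult step_direct(std::size_t n_snap, std::size_t state_size) {
    assert(state_size > 0);
    StepResult r;
    r.cache.assign(n_snap * state_size, 0.0f);
    for (std::size_t s = 0; s < n_snap; ++s)
        for (std::size_t i = 0; i < state_size; ++i)
            r.cache[s * state_size + i] = snapshot_value(s, i);
    r.launches += 1;
    return r;
}
```

Calling `step_with_copies(8, 4, 16)` and `step_direct(4, 16)` yields identical cache contents, but the first records five simulated launches and the second only one. The difference of four corresponds to the copy kernels the release removes.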

The fix detects the specific graph pattern `gated_delta_net -> view -> cpy`. When the fusion is safe, the CUDA GDN kernel writes the state snapshots directly into the recurrent cache, skipping both the intermediate write to the output buffer and the copy kernels. The optimization matters most for models that rely heavily on stateful recurrent layers, such as certain state-space model variants.
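The pattern match can be illustrated on a toy graph. The sketch below is not the ggml implementation: `Op`, `Node`, `is_fusable_cpy`, and `count_fusable_copies` are hypothetical, each node has a single input for simplicity, and the one safety condition checked (the view has no consumer other than the copy) is an illustrative guess, since the release notes do not spell out the real conditions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy graph (not ggml types).
enum class Op { GatedDeltaNet, View, Cpy, Other };

struct Node {
    Op  op;
    int src;  // index of the single input node, or -1 for none
};

// Counts how many nodes read the output of node `idx`.
static std::size_t count_consumers(const std::vector<Node>& g, int idx) {
    std::size_t n = 0;
    for (const Node& node : g)
        if (node.src == idx) ++n;
    return n;
}

// True when node i is the `cpy` at the end of a
// gated_delta_net -> view -> cpy chain.
bool is_fusable_cpy(const std::vector<Node>& g, std::size_t i) {
    assert(i < g.size());
    const Node& cpy = g[i];
    if (cpy.op != Op::Cpy || cpy.src < 0) return false;

    const Node& view = g[static_cast<std::size_t>(cpy.src)];
    if (view.op != Op::View || view.src < 0) return false;

    const Node& producer = g[static_cast<std::size_t>(view.src)];
    if (producer.op != Op::GatedDeltaNet) return false;

    // Assumed safety check: nothing else reads the view, so skipping the
    // intermediate write cannot starve another consumer.
    return count_consumers(g, cpy.src) == 1;
}

// Number of copy nodes that a fusion pass could drop in this toy graph.
std::size_t count_fusable_copies(const std::vector<Node>& g) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < g.size(); ++i)
        if (is_fusable_cpy(g, i)) ++n;
    return n;
}
```

For a graph with one `GatedDeltaNet` node feeding four `View` nodes, each followed by its own `Cpy`, `count_fusable_copies` returns 4. A copy whose view comes from a different producer, or whose view is also read by another node, is left alone.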

No API or configuration changes are required: the speedup is transparent for anyone already running GDN-based models on llama.cpp with CUDA. The release also ships the usual prebuilt binaries for Linux, Windows, macOS, and Android. In applicable scenarios, users should see lower inference latency and less GPU memory bandwidth consumed.

Key Points
  • Removes redundant CUDA copy operations for GDN (Gated Delta Net) state snapshots.
  • Eliminates the 4 extra `ggml_cuda_cpy` calls per decode step that occur with an MTP draft length of 3.
  • Directly writes recurrent state into cache, reducing memory bandwidth and latency on NVIDIA GPUs.

Why It Matters

Faster inference for GDN-based models on NVIDIA GPUs, with zero configuration changes.
