llama.cpp b9827 adds CUDA fast path for 2D strided copies

New cudaMemcpy2DAsync optimization speeds up recurrent snapshot updates by 10x...

Deep Dive

The open-source llama.cpp project, maintained under the ggml-org organization, has released build b9827, whose main change is a GPU performance optimization: a new fast path for the ggml_cuda_cpy operation in the CUDA backend. Previously, when a tensor was not fully contiguous but each of its rows was (a strided 2D block), the copy fell back to a slow element-wise scalar kernel. With b9827, such copies go through cudaMemcpy2DAsync, the CUDA runtime call for pitched 2D copies, which moves a set of fixed-width rows between buffers whose row pitches may differ. The change speeds up cases such as GDN (Gated DeltaNet) recurrent snapshot updates when several sequences are processed in parallel (e.g., with the -np 4 flag, which sets the number of parallel sequences rather than the number of GPUs), where rollback slots are separated by cache stride gaps.
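To make the pitched-copy idea concrete, here is a minimal host-side sketch in Python. It only models the semantics of the size arguments that cudaMemcpy2DAsync takes (destination pitch, source pitch, row width in bytes, and row count); it does not call CUDA and is not llama.cpp's implementation. The function name memcpy_2d and the 4-byte/8-byte layout are assumptions chosen for illustration.

```python
def memcpy_2d(dst, dpitch, src, spitch, width, height):
    """Host-side model of a pitched 2D copy (illustrative, not the CUDA API).

    Copies `height` rows of `width` bytes each. Row r is read starting at
    byte r * spitch in `src` and written starting at byte r * dpitch in
    `dst`. Bytes in the gap between the end of a row and the next pitch
    boundary are left untouched.
    """
    if width > spitch or width > dpitch:
        raise ValueError("width must not exceed either pitch")
    if height > 0:
        # Make sure the last row fits in both buffers.
        if (height - 1) * spitch + width > len(src):
            raise ValueError("source buffer too small")
        if (height - 1) * dpitch + width > len(dst):
            raise ValueError("destination buffer too small")
    for r in range(height):
        s = r * spitch
        d = r * dpitch
        dst[d:d + width] = src[s:s + width]
    return dst


# Hypothetical layout: 3 rows of 4 payload bytes, written into a cache
# whose rows start 8 bytes apart (4 payload bytes plus a 4-byte gap).
src = bytearray(range(12))         # densely packed source, pitch 4
cache = bytearray(b"\xff" * 24)    # strided destination, pitch 8
memcpy_2d(cache, 8, src, 4, 4, 3)
print(list(cache))
```

The point of the sketch is that one call describes the whole strided block, so the gaps never have to be visited element by element; on the GPU the equivalent work is handed to the runtime's copy path instead of a scalar kernel.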

The release ships builds for a wide range of platforms and hardware backends: macOS Apple Silicon (both with and without KleidiAI optimizations), Ubuntu x64/arm64/s390x with CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL, and AMD HIP; Windows x64/arm64 with CPU, CUDA 12/13, Vulkan, OpenVINO, and SYCL; Android arm64; and openEuler with Ascend hardware. The update also adds tests that exercise the optimized strided copy path so that regressions are caught. Notably, the OpenVINO backend now explicitly marks strided copy as unsupported because of new test failures. The release matters most to developers running models with recurrent architectures on CUDA with several parallel sequences, since it cuts the cost of copying recurrent state and improves overall throughput.

Key Points
  • b9827 introduces a cudaMemcpy2DAsync fast path for ggml_cuda_cpy, replacing slow scalar kernels for strided 2D copies.
  • Speeds up GDN recurrent snapshot updates with -np 4 (multiple parallel sequences), where rollback slots are separated by cache stride gaps.
  • Ships builds for 15+ platform/backend combinations including CUDA 12/13, ROCm 7.2, Vulkan, SYCL, ARM, and macOS KleidiAI.
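
The distinction in the first point, between fully contiguous data, strided 2D blocks with contiguous rows, and everything else, can be sketched from byte strides alone. The helper below is hypothetical: its name, arguments, and return labels are assumptions for illustration and do not mirror the checks llama.cpp actually performs. It also looks at one side of the copy only, whereas a real copy has to consider both source and destination.

```python
def classify_copy(elem_size, n_cols, n_rows, elem_stride, row_stride):
    """Pick a copy strategy for a 2D view described by byte strides.

    Hypothetical helper for illustration only.
    elem_size   -- bytes per element
    n_cols      -- elements per row
    n_rows      -- number of rows
    elem_stride -- bytes between consecutive elements in a row
    row_stride  -- bytes between the starts of consecutive rows
    """
    row_bytes = elem_size * n_cols
    if elem_stride != elem_size:
        return "scalar"        # elements inside a row are not adjacent
    if n_rows <= 1 or row_stride == row_bytes:
        return "contiguous"    # one flat copy covers the whole view
    if row_stride > row_bytes:
        return "pitched_2d"    # contiguous rows separated by a gap
    return "scalar"            # overlapping rows: fall back


# 4 rows of 128 four-byte floats, with row starts 1024 bytes apart:
# each row is 512 bytes, followed by a 512-byte gap.
print(classify_copy(4, 128, 4, 4, 1024))
```

Only the "pitched_2d" case corresponds to the layout the new fast path targets; a fully contiguous view needs no pitch at all, and a view with gaps between individual elements still needs per-element handling.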

Why It Matters

Speeds up CUDA inference for LLMs with recurrent architectures when several sequences run in parallel, making recurrent snapshot updates up to 10x faster.
