Developer Tools

llama.cpp b9873 fixes KV rotation crash on unallocated buffers

A critical patch prevents a GGML_ASSERT abort during DFlash speculative decoding.

Deep Dive

The llama.cpp project has released version b9873, a patch release that addresses a critical crash bug in the K/V rotation input handling. The issue occurred in certain speculative decoding workflows, specifically the KV-injection pass of DFlash speculative decoding, where the rotation tensor's buffer could be unallocated (NULL) when the graph only stores K/V without attending. When set_input_k_rot and set_input_v_rot were called, they invoked ggml_backend_buffer_is_host() on that NULL buffer, which triggered a GGML_ASSERT abort.

The fix adds a guard check: the rotation input functions now verify that the tensor's buffer is non-null before proceeding, the same check already used for kq_mask inputs. Since an unallocated buffer has no data to upload, skipping the operation is safe and correct. The patch was contributed by liminfei-amd and signed with a verified GPG key.

The release includes pre-built binaries for Apple Silicon (optionally with KleidiAI), Intel macOS, iOS XCFramework, Linux on x64/arm64/s390x, various GPU backends (Vulkan, ROCm, CUDA, SYCL, HIP), Android arm64, Windows (x64/arm64, CUDA, Vulkan, OpenCL, OpenVINO), and openEuler platforms.
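To make the NULL-buffer guard concrete, here is a minimal C++ sketch of the pattern. It is not the llama.cpp source: MockBuffer, MockTensor, mock_buffer_is_host, set_rot_unguarded and set_rot_guarded are hypothetical stand-ins invented for illustration, and the mock query throws an exception where the real code aborts through GGML_ASSERT.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a backend buffer; not the real ggml type.
struct MockBuffer {
    bool is_host;
};

// Hypothetical stand-in for a graph input tensor.
// buffer == nullptr models a tensor that was never backed with memory.
struct MockTensor {
    MockBuffer *       buffer = nullptr;
    std::vector<float> data;
};

// Models a backend query that rejects a NULL buffer. The real code aborts via
// GGML_ASSERT; this sketch throws so the failure can be observed.
bool mock_buffer_is_host(const MockBuffer * buf) {
    if (buf == nullptr) {
        throw std::logic_error("buffer is NULL");
    }
    return buf->is_host;
}

// Pre-fix shape: queries the buffer unconditionally.
bool set_rot_unguarded(MockTensor & t, const std::vector<float> & rot) {
    if (!mock_buffer_is_host(t.buffer)) {
        return false; // device buffers are out of scope for this sketch
    }
    t.data = rot;
    return true;
}

// Post-fix shape: return early when there is no buffer to write into.
bool set_rot_guarded(MockTensor & t, const std::vector<float> & rot) {
    if (t.buffer == nullptr) {
        return false; // unallocated: nothing to upload, so skip
    }
    return set_rot_unguarded(t, rot);
}
```

Calling set_rot_guarded on a default-constructed MockTensor returns false and leaves its data untouched, while set_rot_unguarded on the same tensor throws. Once the tensor points at a host MockBuffer, both functions copy the rotation values.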

Key Points
  • Fixes a GGML_ASSERT crash on a NULL buffer during K/V rotation input (issue #25191).
  • Affects DFlash speculative decoding's KV-injection pass when the rotation tensor's buffer is unallocated.
  • Adds the same buffer-non-null guard already used for kq_mask inputs; patch by liminfei-amd.

Why It Matters

Speculative decoding setups in llama.cpp that run a store-only KV-injection pass, such as DFlash, no longer abort when the rotation tensor's buffer is unallocated.
