Developer Tools

llama.cpp b9873 fixes KV rotation crash on unallocated buffers

llama.cpp Releases July 05, 2026

⚡ A critical patch prevents a GGML_ASSERT abort during DFlash speculative decoding.

Deep Dive

The llama.cpp project has released b9873, a patch release that fixes a critical crash in K/V rotation input handling. The crash occurred in certain speculative decoding workflows, specifically the KV-injection pass of DFlash speculative decoding, where the graph only stores K/V without attending and the rotation tensor's buffer can therefore be unallocated (NULL). In that case, set_input_k_rot and set_input_v_rot called ggml_backend_buffer_is_host() on the NULL buffer, which triggered a GGML_ASSERT abort.

The fix adds a guard: the rotation input functions now check that the tensor's buffer is non-null before proceeding, the same check already used for kq_mask inputs. An unallocated buffer has no data to upload, so skipping the operation is safe and correct. The patch was contributed by liminfei-amd and signed with a verified GPG key.

The release includes pre-built binaries for Apple Silicon (with optional KleidiAI), Intel macOS, iOS XCFramework, Linux on x64/arm64/s390x, various GPU backends (Vulkan, ROCm, CUDA, SYCL, HIP), Android arm64, Windows (x64/arm64, CUDA, Vulkan, OpenCL, OpenVINO), and openEuler platforms.
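The null-buffer guard described above can be sketched in isolation. The following is a minimal illustration, not the actual llama.cpp patch: mock_buffer, mock_tensor, mock_buffer_is_host, and both set_input_rot_* functions are hypothetical stand-ins invented for this example, and the mock throws an exception where the real code would abort through GGML_ASSERT.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the real ggml types.
struct mock_buffer {
    bool is_host;
};

struct mock_tensor {
    const mock_buffer * buffer = nullptr;  // nullptr models an unallocated tensor
    std::vector<float>  data;              // models the tensor's backing storage
};

// Models a buffer query that cannot tolerate a null buffer. It throws instead of
// aborting so the failure can be observed without killing the process.
bool mock_buffer_is_host(const mock_buffer * buf) {
    if (buf == nullptr) {
        throw std::logic_error("assertion failed: buffer is null");
    }
    return buf->is_host;
}

// Behavior before the fix (as described in the article): the buffer is queried
// unconditionally, so an unallocated tensor triggers the failure.
bool set_input_rot_unguarded(mock_tensor & dst, const std::vector<float> & rot) {
    if (!mock_buffer_is_host(dst.buffer)) {
        return false;  // this sketch only handles host buffers
    }
    dst.data = rot;    // "upload" the rotation data
    return true;
}

// Behavior after the fix: return early when there is no buffer, because there is
// nowhere to upload to. Returns true only if data was written.
bool set_input_rot_guarded(mock_tensor & dst, const std::vector<float> & rot) {
    if (dst.buffer == nullptr) {
        return false;  // unallocated: skip the upload
    }
    return set_input_rot_unguarded(dst, rot);
}

// Usage sketch exercising both the unallocated and the allocated case.
void demo() {
    const std::vector<float> rot = {1.0f, 0.0f, 0.0f, 1.0f};

    mock_tensor unallocated;  // buffer stays nullptr
    assert(!set_input_rot_guarded(unallocated, rot));
    assert(unallocated.data.empty());

    mock_buffer host{true};
    mock_tensor allocated;
    allocated.buffer = &host;
    assert(set_input_rot_guarded(allocated, rot));
    assert(allocated.data == rot);
}
```

Calling demo() runs through both cases: the guarded function skips the unallocated tensor without touching it and fills the allocated one. Passing the same unallocated tensor to set_input_rot_unguarded raises the mock's exception, which stands in for the abort the release fixes.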

Key Points

Fixes GGML_ASSERT crash on NULL buffer during K/V rotation input (issue #25191).
Affects DFlash speculative decoding's KV-injection pass when buffer is unallocated.
Adds same buffer-non-null guard as used for kq_mask inputs; patch by liminfei-amd.

Why It Matters

Keeps DFlash speculative decoding in llama.cpp from aborting when a graph stores K/V without attending, removing a crash from advanced LLM inference pipelines.
