llama.cpp b9479 fixes session state bug saving wrong tokens

llama.cpp Releases June 02, 2026

⚡ A bug that replayed a token at the wrong position is now patched.

Deep Dive

The latest llama.cpp release (b9479) from ggml-org fixes a critical bug in the common_prompt_batch_decode function that affected session state save and restore in completion.cpp and save-load-state.cpp. When saving state, the code stored only n-1 of the prompt's n tokens in both session_tokens and the KV cache. On reload, if the prompt matched, the code replayed the last saved token (token n-1 of n) at the next position, effectively duplicating that token in the wrong place. This produced incorrect model output and could corrupt long-running sessions.

The fix stores all n tokens in session_tokens, while the memory state reflects that only n-1 tokens have been processed, because the save happens before the last token is decoded in common_prompt_batch_decode. The commit, co-authored by fairydreaming, resolves issue #23400 and has been tested on transformer, recurrent, and hybrid model architectures. The release is available for macOS (Apple Silicon and Intel), iOS, Linux (x64, arm64, s390x), Android, and Windows (x64, arm64), with backends including CUDA, Vulkan, ROCm, OpenVINO, and SYCL.
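The off-by-one described here can be illustrated with a toy model. This is a minimal sketch and not llama.cpp code: SavedSession, save_buggy, save_fixed, and replay_on_restore are hypothetical names, the KV cache is reduced to a count of processed positions, and the restore step is assumed to decode the last recorded token at the first free position.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using token = int;

// Hypothetical stand-in for what a session file records.
struct SavedSession {
    std::vector<token> session_tokens; // tokens written alongside the state
    std::size_t        n_processed;    // positions already held in the memory (KV) state
};

// Behaviour described for the bug: only the first n-1 tokens are recorded.
SavedSession save_buggy(const std::vector<token>& prompt) {
    assert(prompt.size() >= 2);
    return { std::vector<token>(prompt.begin(), prompt.end() - 1), prompt.size() - 1 };
}

// Behaviour described for the fix: all n tokens are recorded,
// while the memory state still covers only n-1 processed tokens.
SavedSession save_fixed(const std::vector<token>& prompt) {
    assert(prompt.size() >= 2);
    return { prompt, prompt.size() - 1 };
}

// Assumed restore step after a prompt match: the last recorded token
// is decoded at the first position the memory state does not cover.
// Returns {token to decode, position to decode it at}.
std::pair<token, std::size_t> replay_on_restore(const SavedSession& s) {
    return { s.session_tokens.back(), s.n_processed };
}
```

For a four-token prompt {10, 20, 30, 40}, the buggy save leads the restore step to decode token 30 at position 3, a duplicate of the token already at position 2. The fixed save decodes token 40 at position 3, which is the token that was still waiting to be processed when the state was saved.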

Key Points

Bug in common_prompt_batch_decode saved only n-1 tokens instead of all n tokens in session_tokens and KV cache
Caused token replay at the wrong position when restoring a session after a prompt match
Fix stores all n tokens in session_tokens while memory state reflects n-1 processed tokens; tested on multiple model types

Why It Matters

Fixes a subtle session corruption bug, ensuring reliable state restoration in production LLM applications.
