llama.cpp b9479 fixes session state bug that saved one token too few
A bug causing token replay in the wrong position is now patched.
The latest llama.cpp release (b9479) from ggml-org fixes a critical bug in the common_prompt_batch_decode function that affected session state save and restore in completion.cpp and save-load-state.cpp. When saving state, the code stored only n-1 tokens in both the session_tokens array and the KV cache. When the session tokens were loaded back and the prompt matched, the code replayed the last saved token (token n-1) into the next position, effectively duplicating that token in the wrong place. This led to incorrect model output and could corrupt long-running sessions.
The fix ensures that all n tokens are stored in session_tokens, while the memory state accurately reflects that only n-1 tokens have been processed (saving occurs before the last token is decoded in common_prompt_batch_decode). The commit, co-authored by fairydreaming, resolves issue #23400 and has been tested on transformer, recurrent, and hybrid model architectures. The release is available for all major platforms, including macOS (Apple Silicon and Intel), iOS, Linux (x64, arm64, s390x), Android, and Windows (x64, arm64), with backends including CUDA, Vulkan, ROCm, OpenVINO, and SYCL.
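The off-by-one is easier to see in a toy model. The sketch below is not llama.cpp code: save_state, restore_state, and the store_all_tokens flag are hypothetical names, the "cache" is just a list of token ids indexed by position, and it assumes n is the full prompt length. It models only the replay step described above, not whatever decoding follows it.

```python
def save_state(prompt, store_all_tokens):
    """Toy save: state is written before the last prompt token is decoded."""
    # The cache therefore holds only the first n-1 tokens in both variants.
    cache = list(prompt[:-1])
    # Old behavior: session_tokens also holds n-1 tokens.
    # Fixed behavior: session_tokens holds all n tokens.
    session_tokens = list(prompt) if store_all_tokens else list(prompt[:-1])
    return session_tokens, cache


def restore_state(session_tokens, cache):
    """Toy restore: on a prompt match, replay the last saved token
    at the next free cache position."""
    restored = list(cache)
    restored.append(session_tokens[-1])
    return restored


prompt = [11, 22, 33, 44]  # n = 4 made-up token ids

buggy = restore_state(*save_state(prompt, store_all_tokens=False))
fixed = restore_state(*save_state(prompt, store_all_tokens=True))

print(buggy)  # [11, 22, 33, 33]  token 33 is replayed into the next position
print(fixed)  # [11, 22, 33, 44]  the replayed token is the real last token
```

In the toy version the cache contents are identical in both variants. Only the length of session_tokens changes, which is what decides which token is replayed.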
- Bug in common_prompt_batch_decode saved only n-1 tokens in both session_tokens and the KV cache
- Caused token replay at the wrong position when restoring a session after a prompt match
- Fix stores all n tokens in session_tokens while memory state reflects n-1 processed tokens; tested on multiple model types
Why It Matters
Fixes a subtle session corruption bug, ensuring reliable state restoration in production LLM applications.