llama.cpp b9464 optimizes speculative decoding with new helper function
Speculative decoding gets an n_max fix and a reusable helper for faster local inference.
llama.cpp's latest tag, b9464, released on June 1, focuses on stability and performance for speculative decoding, a technique that accelerates LLM inference by letting a smaller draft model propose tokens that the larger target model then verifies. Key changes include a fix for n_outputs_max (the maximum number of draft tokens) and the removal of the draft-simple auto-enable behaviour, which could cause unintended activation. Developers also extracted a new common_speculative_n_max() helper function in common/speculative to centralise max-draft-size logic previously scattered across server code.
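For readers new to the technique, the draft-and-verify idea can be sketched in a few lines of C++. This is a toy illustration under stated assumptions, not llama.cpp code: the "models" are plain functions that return a greedy next token, and the names Token, NextTokenFn, and speculative_step are hypothetical. A real implementation verifies all draft tokens in a single batched pass of the target model, which is where the speed-up comes from; the sketch calls the target sequentially for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Token = int;
// Toy stand-in for a model: returns the greedy next token for a context.
using NextTokenFn = std::function<Token(const std::vector<Token> &)>;

// One speculative step (hypothetical helper, greedy acceptance only).
// The draft proposes up to n_draft tokens; the target keeps the longest
// matching prefix and then contributes one token of its own, so every
// step yields at least one token.
std::vector<Token> speculative_step(const std::vector<Token> & ctx,
                                    const NextTokenFn & draft,
                                    const NextTokenFn & target,
                                    std::size_t n_draft) {
    assert(draft && target); // both callables must be set

    // 1. The draft model proposes tokens autoregressively.
    std::vector<Token> proposal;
    std::vector<Token> draft_ctx = ctx;
    for (std::size_t i = 0; i < n_draft; ++i) {
        const Token t = draft(draft_ctx);
        proposal.push_back(t);
        draft_ctx.push_back(t);
    }

    // 2. The target model checks each proposed token in order.
    std::vector<Token> accepted;
    std::vector<Token> verify_ctx = ctx;
    for (const Token t : proposal) {
        const Token expected = target(verify_ctx);
        if (expected != t) {
            // First mismatch: keep the target's token, drop the rest.
            accepted.push_back(expected);
            return accepted;
        }
        accepted.push_back(t);
        verify_ctx.push_back(t);
    }

    // 3. Every draft token matched: the target adds one extra token.
    accepted.push_back(target(verify_ctx));
    return accepted;
}
```

The n_draft argument plays the role of the maximum draft size discussed here: a larger value saves more target passes when the draft is accurate, and wastes more draft work when it is not.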
Additionally, the draft context now always produces n_parallel outputs, ensuring consistent behaviour in parallel generation scenarios. CI improvements enable server tests on pull requests, catching regressions earlier. Builds cover Apple Silicon (arm64, with KleidiAI optional), Linux (x64/arm64 with Vulkan/ROCm), and Windows (x64/arm64 with CUDA 12/13). These refinements make llama.cpp more reliable for local LLM serving and experimentation.
- Fixes n_outputs_max in speculative decoding to prevent draft token overrun
- Introduces common_speculative_n_max() helper to unify draft-size logic across server and CLI
- Draft context now always uses n_parallel outputs, improving reproducibility
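To illustrate why centralising the draft-size logic helps, here is a minimal sketch of what such a helper might look like. The actual signature and rules of common_speculative_n_max() are not given here, so everything below is an assumption: the DraftLimits struct, its fields, the draft_n_max name, and the specific clamping rules are all hypothetical and chosen only to show the pattern of computing the limit in one place.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical inputs; the field names are illustrative, not llama.cpp's.
struct DraftLimits {
    int n_max_configured; // user-configured cap on draft tokens per step
    int n_ctx_free;       // free slots left in the context window (>= 0)
    int n_predict_left;   // tokens still to generate; negative means unlimited
};

// Hypothetical helper: a single place that decides how many tokens the
// draft model may propose, so every caller applies the same limits.
int draft_n_max(const DraftLimits & lim) {
    assert(lim.n_ctx_free >= 0); // a negative free-slot count is a caller bug

    int n = std::max(0, lim.n_max_configured);

    // Reserve one context slot for the token the target model samples.
    n = std::min(n, std::max(0, lim.n_ctx_free - 1));

    // Do not draft past the requested generation length.
    if (lim.n_predict_left >= 0) {
        n = std::min(n, std::max(0, lim.n_predict_left - 1));
    }
    return n;
}
```

With one function owning these rules, a caller such as a server slot only needs to fill in the struct and use the result, rather than repeating the clamping at each call site.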
Why It Matters
Speculative decoding can make local inference up to 2-3x faster when the draft model's proposals are frequently accepted; this update makes it more robust for production use.