Developer Tools

llama.cpp b9464 optimizes speculative decoding with new helper function

Speculative decoding gets an n_outputs_max fix and a reusable helper for faster local inference.

Deep Dive

llama.cpp's latest tag b9464, released on June 1, focuses on stability and performance for speculative decoding, a technique that speeds up LLM inference by letting a smaller draft model propose several tokens that the main model then verifies. Key changes include a fix for n_outputs_max (the maximum number of draft tokens) and the removal of the draft-simple auto-enable behaviour, which could cause unintended activation. The release also extracts a new common_speculative_n_max() helper function in common/speculative, centralising max-draft-size logic that was previously scattered across server code.
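To make the draft-then-verify idea concrete, here is a minimal toy sketch of one greedy speculative decoding round. It is not llama.cpp code: the names speculative_step, target_next and draft_next are hypothetical, the "models" are plain functions over integer tokens, and the n_max argument simply stands in for whatever draft-size cap the engine computes.

```python
from typing import Callable, List

# Assumption: a "model" is any function that maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], n_max: int) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round (greedy variant)."""
    # 1. The draft model proposes up to n_max tokens autoregressively.
    proposed: List[int] = []
    for _ in range(n_max):
        proposed.append(draft(tokens + proposed))

    # 2. The target model checks each proposed position. A real engine scores all
    #    positions in one batched forward pass; this loop is only for clarity.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(tokens + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(tokens + accepted))
    return accepted


# Toy models: the target counts upward modulo 10; the draft agrees except after a 3.
def target_next(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_next(seq: List[int]) -> int:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(target_next, draft_next, [0], n_max=4))
```

Each round yields at least one token and at most n_max + 1, and in this greedy variant the output always matches what the target model would have produced on its own. That is why the draft-size cap matters: it bounds how much work is speculated per round.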

Additionally, the draft context now always produces n_parallel outputs, ensuring consistent behaviour in parallel generation scenarios. CI improvements enable server tests on pull requests, catching regressions earlier. Builds cover Apple Silicon (arm64, with KleidiAI optional), Linux (x64/arm64 with Vulkan/ROCm), and Windows (x64/arm64 with CUDA 12/13). These refinements make llama.cpp more reliable for local LLM serving and experimentation.

Key Points
  • Fixes n_outputs_max in speculative decoding to prevent draft token overrun
  • Introduces common_speculative_n_max() helper to unify draft-size logic previously scattered across server code
  • Draft context now always produces n_parallel outputs, giving consistent behaviour in parallel generation
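
As a rough illustration of why a single shared helper is useful, the sketch below computes a draft-size cap in one place so that every caller applies the same limits. The function name draft_n_max, its parameters, and the specific limits are assumptions made for this example; they do not reflect the actual signature or logic of common_speculative_n_max().

```python
def draft_n_max(requested: int, n_ctx: int, n_past: int, n_remaining: int) -> int:
    """Hypothetical cap on how many tokens to draft in one round.

    requested   -- draft length the user asked for
    n_ctx       -- context size of the model
    n_past      -- tokens already in the context
    n_remaining -- tokens still allowed by the generation budget
    """
    # Reserve one slot in each limit for the token the target model produces itself.
    room_in_ctx = n_ctx - n_past - 1
    room_in_budget = n_remaining - 1
    # Never return a negative count, even when the context or budget is exhausted.
    return max(0, min(requested, room_in_ctx, room_in_budget))


if __name__ == "__main__":
    print(draft_n_max(16, n_ctx=128, n_past=120, n_remaining=512))
```

Keeping the clamp in one function means a correction to any limit only has to be made once, rather than in each code path that starts a draft.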

Why It Matters

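Speculative decoding can make local inference substantially faster (2-3x is a commonly cited figure, though actual gains depend on the model pair and workload); this update makes it more robust for production use.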