Developer Tools

llama.cpp b9109 adds parallel drafting for faster speculative decoding

New release lets several draft strategies propose tokens on each step and sends the most promising draft to the target model, boosting inference speed.

Deep Dive

The latest release of llama.cpp (b9109) by ggml-org introduces parallel drafting support for speculative decoding, a technique that accelerates autoregressive generation by having a smaller draft model propose tokens that a larger target model then verifies. Instead of relying on a single draft model, this update accepts a vector of speculator types, which run one after another in a draft loop. At each step, the system scores every draft by the product of its acceptance probability and its length, then keeps the draft with the highest score, that is, the one expected to yield the most accepted tokens. Async evaluation lets drafting continue while the target model is verifying. As a result, developers can now combine different draft strategies (for example, a fast n-gram model and a small transformer) to improve acceptance rates on diverse inputs.
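To make the draft-then-verify idea concrete, here is a minimal toy sketch of greedy verification. It is not llama.cpp code: `verify_draft` and `toy_target` are hypothetical names, the "target model" is a stand-in function, and the target is queried one position at a time for readability, whereas a real engine would typically score all draft positions in one batched pass.

```python
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def verify_draft(context: List[Token], draft: List[Token],
                 target_next: NextToken) -> List[Token]:
    """Return the tokens to append: the accepted draft prefix plus one target token."""
    accepted: List[Token] = []
    for proposed in draft:
        expected = target_next(context + accepted)
        if proposed != expected:
            # First mismatch: keep the target's token, discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
    # The whole draft matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(tokens: Sequence[Token]) -> Token:
    # Hypothetical stand-in for the target model: always predicts last token + 1 (mod 10).
    return (tokens[-1] + 1) % 10


context = [1, 2, 3]
print(verify_draft(context, [4, 5, 9, 9], toy_target))  # [4, 5, 6]
```

In this toy run the first two drafted tokens match what the target would have produced, so three tokens are emitted for one verification step instead of one. The output is identical to what the target alone would generate, which is why a good draft saves time without changing results.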

Beyond parallel drafting, the release refactors the entire speculative system: it unifies contexts between the main and draft models, adds a dedicated context for multi-sequence speculative generation, and fixes image processing through the draft context for multimodal models. The server side gains a draft prompt cache, checkpointing for long-running drafts, and clearer context naming. The release also updates the `speculative-simple` example and improves naming consistency across the codebase.

Builds ship for all major platforms, including macOS (Apple Silicon with optional KleidiAI acceleration), Linux (CPU, Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA, Vulkan, SYCL, HIP), and Android/iOS. llama.cpp is the most-starred open-source LLM inference project (110k+ stars), and this update reinforces its position as a go-to engine for local LLM deployment.

Key Points
  • Parallel drafting support: multiple speculator types run sequentially; the best draft (maximizing expected accepted tokens) is selected per iteration.
  • Chain-of-speculators: users can define a chain of draft models using a vector of `common_speculative_type`, with automatic selection based on acceptance probability and draft length.
  • Server improvements include draft prompt caching, checkpointing for non-ckpt models, async eval, and full support for multimodal draft processing.
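
The selection rule described in the key points can be sketched as follows. This is an illustration under stated assumptions, not the llama.cpp implementation: `DraftCandidate` and `pick_best` are hypothetical names, and the acceptance probabilities are made-up inputs, since the summary above does not say how the engine estimates them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DraftCandidate:
    speculator: str      # hypothetical label, e.g. "ngram" or "small-transformer"
    tokens: List[int]    # the drafted token ids
    accept_prob: float   # assumed estimate of the draft's acceptance probability

    def score(self) -> float:
        # Expected accepted tokens, approximated as probability times draft length.
        return self.accept_prob * len(self.tokens)


def pick_best(candidates: List[DraftCandidate]) -> Optional[DraftCandidate]:
    """Return the non-empty draft with the highest score, or None if there is none."""
    non_empty = [c for c in candidates if c.tokens]
    if not non_empty:
        return None
    return max(non_empty, key=lambda c: c.score())


candidates = [
    DraftCandidate("ngram", [11, 12, 13, 14, 15, 16, 17, 18], 0.25),
    DraftCandidate("small-transformer", [11, 12, 13, 14], 0.75),
]
best = pick_best(candidates)
print(best.speculator, best.score())  # small-transformer 3.0
```

Here the n-gram draft is twice as long but scores 0.25 × 8 = 2.0, while the shorter transformer draft scores 0.75 × 4 = 3.0, so the shorter draft wins. A long draft only pays off if it is likely to be accepted.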

Why It Matters

Faster, more efficient local LLM inference: developers can now mix draft strategies to cut latency, and because the target model still verifies every drafted token, output quality is not sacrificed.