Developer Tools

b8786

Critical update restores GPU backend sampling, reversing a roughly 30% inference speed regression for popular reasoning models.

Deep Dive

The open-source project llama.cpp, maintained by ggml-org, has released a crucial performance fix in commit b8786. The update resolves a roughly 30% speed regression affecting several advanced "reasoning" language models, including Google's Gemma 4, Kimi K2, and Mistral's Ministral 3. The bug was introduced when support for thinking tags (like `thinking_start_tag`) was added: the change inadvertently forced the creation of a "reasoning budget sampler" even when no token budget was set. Although effectively idle, that sampler had the critical side effect of disabling backend sampling, a key optimization in which the GPU selects the next token directly, avoiding costly data transfers to the CPU.
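To make the failure mode concrete, here is a minimal C++ sketch of the pattern described above. Every name in it (`sampler_chain`, `build_chain_buggy`, the `reasoning_budget` parameter) is a hypothetical stand-in, not llama.cpp's actual internals; the point is only how unconditionally adding a CPU-side sampler knocks out the GPU fast path.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a sampler pipeline; not llama.cpp's real API.
struct sampler_chain {
    std::vector<std::string> samplers;
    bool backend_sampling = true; // GPU selects the next token directly
};

// Pre-fix behavior (sketch): the reasoning-budget sampler is added even
// when reasoning_budget == -1, the "unlimited" default.
void build_chain_buggy(sampler_chain &chain, int reasoning_budget) {
    (void) reasoning_budget; // the budget is never consulted: that is the bug
    chain.samplers.push_back("reasoning_budget");
    // Any CPU-side sampler forces token data back to the host each step,
    // so even an idle budget sampler disables backend (GPU) sampling.
    chain.backend_sampling = false;
}
```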

The fix is elegantly simple: the code now checks whether a reasoning budget is explicitly configured (e.g., 0, 128, or 1024 tokens) before creating the sampler. If the budget is unlimited (the default value of -1), the sampler is skipped entirely and backend GPU sampling remains active. This restores the original inference speeds; in the reported case, performance on Vulkan jumped back from 70 tokens/second to 98 tokens/second. The commit also preserves the sampler when `grammar_lazy` mode is in use, so tool-calling functionality remains intact. The update is vital for developers and users running these state-of-the-art reasoning models locally, ensuring they get the full computational efficiency llama.cpp is known for.
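Continuing the same hypothetical sketch (reusing the `sampler_chain` stand-in above; none of this is llama.cpp's real API), the fixed logic might look like this: the sampler is created only when a budget is explicitly set, or when lazy grammar mode needs it for tool calling.

```cpp
#include <cassert>

// Post-fix behavior (sketch): skip the sampler for the default -1 budget,
// but keep it when grammar_lazy needs it for tool calling.
void build_chain_fixed(sampler_chain &chain, int reasoning_budget, bool grammar_lazy) {
    if (reasoning_budget >= 0 || grammar_lazy) {
        chain.samplers.push_back("reasoning_budget");
        chain.backend_sampling = false; // a CPU-side sampler still costs the fast path
    }
    // Default case (budget == -1, no lazy grammar): no CPU sampler is
    // added, so backend (GPU) sampling stays enabled.
}

int main() {
    sampler_chain chain;
    build_chain_fixed(chain, /*reasoning_budget=*/-1, /*grammar_lazy=*/false);
    assert(chain.backend_sampling); // GPU fast path preserved by default
}
```

The shape of the change mirrors the commit's description: the default of -1 now means "no sampler at all" rather than "an idle sampler", which is what keeps backend sampling alive.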

Key Points
  • Fixes a roughly 30% speed regression (throughput had dropped from 98 to 70 t/s on Vulkan) caused by disabled GPU backend sampling.
  • Affects reasoning models that use thinking tags, including Gemma 4, Kimi K2, LFM2, and Ministral 3.
  • Patch ensures the reasoning budget sampler is created only when a token limit is explicitly set, keeping GPU backend sampling active by default.

Why It Matters

Restores full local inference speed for cutting-edge reasoning models, crucial for developers building performant AI applications.