Developer Tools

b8964

Multi-think blocks in Qwen models now stay on budget thanks to a new fix...

Deep Dive

The llama.cpp project has released version b8964, a maintenance update that addresses a critical bug in the reasoning budget system for local LLM inference. The issue affected models that interleave multiple think blocks per response, and was specifically observed on the unsloth/Qwen3.6-27B-GGUF model. Because the DONE state absorbed every subsequent token, including any new start tag, think blocks after the first ran unbudgeted and reasoning budget tracking broke.

The fix advances the start_matcher in the DONE branch and, on a match, re-arms the tracker to the COUNTING state with a fresh budget. The release includes a regression test (test-reasoning-budget: test 6) to prevent future breakage. The b8964 release ships with prebuilt binaries for macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x with CPU, Vulkan, ROCm, OpenVINO, SYCL variants), Windows (x64, arm64, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64), and iOS (XCFramework).
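The described state-machine change can be sketched as follows. This is an illustrative C++ sketch under assumptions, not the actual llama.cpp implementation: the BudgetTracker type, its fields, and the literal "<think>"/"</think>" token strings are hypothetical stand-ins, with only the COUNTING/DONE states and the re-arm-on-match behavior taken from the release notes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-block budget tracker (illustrative only).
enum class State { IDLE, COUNTING, DONE };

struct BudgetTracker {
    State state = State::IDLE;
    int   budget;        // tokens allowed per think block
    int   remaining = 0; // budget left in the current block
    int   over      = 0; // tokens emitted past a block's budget

    explicit BudgetTracker(int b) : budget(b) {}

    void feed(const std::string &tok) {
        switch (state) {
        case State::IDLE:
            if (tok == "<think>") { state = State::COUNTING; remaining = budget; }
            break;
        case State::COUNTING:
            if (tok == "</think>")   { state = State::DONE; }
            else if (remaining > 0)  { --remaining; }
            else                     { ++over; }
            break;
        case State::DONE:
            // The bug: DONE previously swallowed every later token, so a
            // second think block was never budgeted. The fix keeps the start
            // matcher running here and re-arms COUNTING with a fresh budget.
            if (tok == "<think>") { state = State::COUNTING; remaining = budget; }
            break;
        }
    }
};
```

With a budget of 2 tokens, a stream containing two think blocks of 3 tokens each now charges both blocks against the budget, instead of letting the second run free.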

Key Points
  • Fixes reasoning budget bug where multiple think blocks after the first ran unbudgeted in models like unsloth/Qwen3.6-27B-GGUF
  • Fix advances start_matcher in DONE branch and re-arms to COUNTING with fresh budget on match
  • Includes regression test (test-reasoning-budget: test 6) and prebuilt binaries for macOS, Linux, Windows, Android, and iOS

Why It Matters

Local LLM users get reliable reasoning budget tracking for multi-think models, enabling accurate token cost management.