llama.cpp b8590
The latest update adds a toggle to rein in AI reasoning costs and ships pre-built binaries for 24+ hardware targets.
The open-source community behind the widely used llama.cpp project has rolled out a new release, version b8590, published via GitHub Actions on March 31st. The headline change is a new control mechanism: the ability to disable backend sampling when a 'reasoning budget' is enabled. Introduced in pull request #21209, the feature lets developers constrain the computational cost of complex reasoning tasks, offering finer-grained control over resource allocation during inference. The change responds to community feedback and is a step towards more efficient, predictable model behavior in applications that rely on chain-of-thought or multi-step reasoning.
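In practice, the budget is surfaced through llama-server's options. The sketch below is illustrative only: it assumes a local llama-server started with the upstream --reasoning-budget flag (where 0 disables thinking) and queried through its OpenAI-compatible /v1/chat/completions endpoint; the exact interaction between the budget and backend sampling in b8590 is as described in PR #21209, not reproduced here.

```python
# Minimal sketch: query a local llama-server that was launched with a
# reasoning budget, e.g.:
#   llama-server -m model.gguf --reasoning-budget 0   # 0 = disable thinking
# (flag semantics per upstream llama.cpp; sampling behavior per PR #21209)
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # llama-server's OpenAI-style API
    json={
        "messages": [
            {"role": "user", "content": "Explain KV caching in two sentences."}
        ],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```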
Beyond the core feature, the b8590 release dramatically expands the project's out-of-the-box accessibility. The team now provides a suite of 24+ pre-compiled binaries targeting a wide range of hardware. This includes native builds for Apple Silicon and Intel Macs, multiple Linux configurations (supporting CPU, Vulkan, and ROCm 7.2 for AMD GPUs), and extensive Windows support covering CPU, CUDA 12/13 for NVIDIA GPUs, Vulkan, SYCL, and even experimental HIP builds. Notably, the release also adds official support for openEuler, the Huawei-backed Linux distribution, with builds optimized for its Ascend 310P and 910B AI accelerators. This broad compatibility lowers the barrier to entry, letting developers and researchers deploy efficient LLM inference on nearly any hardware stack without wrestling with complex compilation.
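Because every target is published as a release asset, picking the right binary can be scripted rather than compiled for. The snippet below is a hedged sketch against the public GitHub releases API; the repository path (ggml-org/llama.cpp) and the platform substrings in asset names are assumptions based on current upstream naming conventions, not details from the release notes.

```python
# Hypothetical asset picker: list prebuilt binaries for tag b8590 and match
# the local OS. Repo path and asset-name substrings are assumptions.
import platform
import requests

TAG = "b8590"
url = f"https://api.github.com/repos/ggml-org/llama.cpp/releases/tags/{TAG}"
assets = requests.get(url, timeout=30).json()["assets"]

# Crude OS hint; real asset names also encode the backend (cpu, vulkan, cuda...).
hint = {"Darwin": "macos", "Windows": "win", "Linux": "ubuntu"}.get(platform.system(), "")

for asset in assets:
    if hint and hint in asset["name"].lower():
        print(asset["name"], "->", asset["browser_download_url"])
```

Pairing the OS hint with a backend substring (e.g. 'vulkan' or 'cuda') would narrow the 24+ assets down to the one matching a given stack.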
- Adds a 'reasoning budget' control that can disable backend sampling, constraining the cost of chain-of-thought tasks (PR #21209).
- Ships 24+ pre-built binaries covering macOS, Linux, Windows, and openEuler across CPU and 7+ GPU/accelerator backends.
- Adds official builds for specialized backends, including AMD ROCm 7.2, Intel OpenVINO, and Huawei Ascend AI processors.
Why It Matters
This update makes efficient, local LLM inference more controllable and accessible across a wider range of professional and research hardware environments.