b9020
The latest release of the 108K-starred llama.cpp improves forced tool calls and whitespace parsing.
llama.cpp, the wildly popular open-source C/C++ LLM inference engine, has shipped version b9020. This maintenance release focuses on refinements to the autoparser system, which handles structured output and tool calling—critical for developers building AI agents that require reliable function execution. The core fixes address how newlines are handled during forced tool calls and how whitespace is processed: the optspace() function has been moved to chat-peg-parser, and whitespace is now trimmed on the final apply step rather than during initial parsing. Because the new behavior accepts content that was previously blocked, some server tests are now invalid and have been commented out. The release is a sign of the project's maturation, making local LLM tool calling behave more predictably.
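The trim-on-apply change can be illustrated with a small sketch. This is illustrative Python, not llama.cpp's actual C++ code: it shows why trimming whitespace once at the final apply step preserves content that per-chunk trimming during parsing would destroy.

```python
# Illustrative sketch (not llama.cpp's implementation): trimming each chunk
# during incremental parsing vs. trimming once on the final apply step.
def parse_trim_eagerly(chunks):
    # Trim every chunk as it is parsed: whitespace that spans chunk
    # boundaries inside the payload is lost.
    return "".join(c.strip() for c in chunks)

def parse_trim_on_apply(chunks):
    # Keep the raw text while parsing; trim only the outer whitespace
    # once the assembled result is applied.
    return "".join(chunks).strip()

chunks = ['  {"city": ', '"Paris"}  \n']
print(parse_trim_eagerly(chunks))   # interior space after ':' is lost
print(parse_trim_on_apply(chunks))  # interior whitespace preserved
```

Deferring the trim keeps partial whitespace ambiguity open until the stream is complete, which matters when output arrives token by token.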
Behind the scenes, the fixes were committed by a community contributor and carry GitHub's verified signature, reflecting the project's open governance. b9020 is distributed through GitHub release assets and supports a vast array of platforms: macOS (Apple Silicon with optional KleidiAI, Intel), iOS (XCFramework), Linux (x64, arm64, s390x with CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows (x64, arm64, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler (x86/aarch64 with various backends). For developers using llama.cpp in production or research, this update directly improves reliability when using constrained grammar, JSON mode, or function calling—a key capability for running local AI agents without cloud dependencies.
- Fixes newline handling in forced tool calls within common/autoparser and chat/autoparser modules.
- Whitespace is now trimmed on the final apply step instead of during parsing; server tests invalidated by the new behavior have been commented out.
- Available across 25+ platform configurations including CPU, CUDA, Vulkan, ROCm, SYCL, OpenVINO, HIP, and mobile (Android/iOS).
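For context on the code path these fixes touch, here is a sketch of an OpenAI-compatible tool-calling request such as llama-server accepts on its chat completions endpoint. The model name, tool schema, and the use of `tool_choice` to force a call are illustrative assumptions, not taken from the release notes.

```python
import json

# Hypothetical request payload for llama-server's OpenAI-compatible
# chat completions API; model name and tool schema are placeholders.
def build_tool_call_request(prompt: str) -> dict:
    return {
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        # Forcing a tool call (rather than free-form prose) exercises the
        # forced-tool-call parsing path that this release's fixes address.
        "tool_choice": "required",
    }

payload = build_tool_call_request("What's the weather in Paris?")
print(json.dumps(payload, indent=2))
```

A response to such a request is parsed by the autoparser into structured `tool_calls`, which is where the newline and whitespace handling fixed in b9020 comes into play.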
Why It Matters
Essential update for developers running local LLMs with function calling, improving reliability of tool use.