Developer Tools

llama.cpp b9468 adds real-time reasoning interruption via control endpoint

Stop a model's reasoning mid-generation with a single API call instead of waiting for the complete output.

Deep Dive

The open-source LLM inference engine llama.cpp (114k stars on GitHub) has released b9468, which centers on a new server capability: real-time reasoning interruption. Developers can now call `POST /v1/chat/completions/control` with an `id_slot` or a completion ID and the `reasoning_end` action to make the model stop generating reasoning tokens mid-stream. The implementation guards against time-of-check/time-of-use (TOCTOU) races by targeting the completion ID rather than the raw slot, so a control request is matched to a specific completion and not to whatever the slot happens to be running when the request arrives. The action only takes effect if the slot's `reasoning_control` flag is enabled. Applications can therefore cut a reasoning phase short on demand instead of waiting for it to finish on its own.
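As a rough illustration of the call described above, the following Python sketch builds and sends such a control request using only the standard library. The endpoint path and the `reasoning_end` action come from the release notes; the JSON field names `id` and `action`, the base URL, and the helper names are assumptions made for this example and should be checked against the server documentation before use.

```python
import json
import urllib.request

# Endpoint path as named in the release notes.
CONTROL_PATH = "/v1/chat/completions/control"


def build_reasoning_end_request(base_url: str, completion_id: str) -> urllib.request.Request:
    """Build (but do not send) a request asking the server to end reasoning."""
    # ASSUMPTION: the field names "id" and "action" are illustrative only;
    # the real request schema may differ.
    payload = {"id": completion_id, "action": "reasoning_end"}
    return urllib.request.Request(
        base_url.rstrip("/") + CONTROL_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def end_reasoning(base_url: str, completion_id: str, timeout: float = 5.0) -> int:
    """Send the control request and return the HTTP status code."""
    request = build_reasoning_end_request(base_url, completion_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A client would typically read the completion ID from the first streamed chunk and then call something like `end_reasoning("http://localhost:8080", completion_id)` while the stream is still open. Passing the completion ID rather than a slot number follows the TOCTOU reasoning in the release notes.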

Alongside the server change, the WebUI was tidied up: the skip button now appears only during the reasoning phase, which is detected through a new `isReasoning` stream state in the chat store, rather than for the whole generation. Internal refactors moved the control endpoint strings into shared constants, and the completion ID is now relayed through the agentic flow so that the button works correctly in multi-turn conversations. The release also includes platform builds for macOS (arm64 including KleidiAI), Linux (x64, arm64, Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA, Vulkan, HIP), and Android arm64.
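To make the `isReasoning` idea concrete, here is a minimal Python sketch of how a client might track the reasoning phase from streamed chunks and decide when to show a skip control. It is not the WebUI's code: the `StreamState` class is hypothetical, and it assumes OpenAI-style chunks in which reasoning text arrives in a `reasoning_content` delta field and the answer in `content`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamState:
    """Hypothetical client-side mirror of an `isReasoning` stream state."""

    is_reasoning: bool = False
    completion_id: Optional[str] = None

    def on_chunk(self, chunk: dict) -> None:
        # Remember the completion ID so a later control call can target it.
        self.completion_id = chunk.get("id", self.completion_id)
        choices = chunk.get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        # ASSUMPTION: reasoning tokens arrive under "reasoning_content".
        if delta.get("reasoning_content"):
            self.is_reasoning = True
        elif delta.get("content"):
            # The first answer token marks the end of the reasoning phase.
            self.is_reasoning = False

    def on_done(self) -> None:
        # The stream has finished, so nothing is left to skip.
        self.is_reasoning = False

    @property
    def show_skip_button(self) -> bool:
        # Only offer the skip while reasoning and once an ID is known.
        return self.is_reasoning and self.completion_id is not None
```

Tying the button to both the reasoning flag and a known completion ID reflects the two points made above: the control is only meaningful during reasoning, and it needs the completion ID to address the right request.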