Developer Tools

llama.cpp b9151 adds prompt timings, log reductions, and platform updates

New release prints per-request prompt timings and sampling parameters and cuts log clutter.

Deep Dive

llama.cpp, the popular C++ framework for running large language models locally, has tagged version b9151. The release focuses on operational quality: logging has been trimmed to cut noise during inference, and a set of fixes addresses the server build, environment variable parsing, and the common module’s initial verbosity print. Most notably, the server now prints prompt processing timings along with the sampling parameters used for each request. This shows developers how long each prompt took to process and which settings (temperature, top-k, etc.) were applied, making it easier to debug and tune local LLM performance.
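To make the per-request view concrete, here is a minimal Python sketch of a client that sends explicit sampling parameters and then reads prompt timings back from the reply. It rests on several assumptions not taken from the release notes: that llama-server is listening on 127.0.0.1:8080, that it accepts a JSON POST to /completion with `temperature`, `top_k`, and `n_predict`, and that the response carries a `timings` object with `prompt_n` and `prompt_ms`. The helper names are hypothetical, and the exact format of the new log lines is not shown in the release, so this reads the HTTP response instead of parsing logs. Check the field names against the server documentation for your build.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on this address and port.
SERVER_URL = "http://127.0.0.1:8080/completion"


def build_request(prompt, temperature=0.7, top_k=40, n_predict=64):
    """Build a completion payload with explicit sampling parameters.

    The field names are assumed to match the server's /completion API.
    """
    return {
        "prompt": prompt,
        "temperature": temperature,
        "top_k": top_k,
        "n_predict": n_predict,
    }


def summarize_timings(response):
    """Pull prompt-processing figures out of a decoded response dict.

    Returns None when the response has no 'timings' object (assumed name).
    """
    timings = response.get("timings")
    if not timings:
        return None
    prompt_n = timings.get("prompt_n", 0)      # prompt tokens processed
    prompt_ms = timings.get("prompt_ms", 0.0)  # time spent on the prompt
    # Derive throughput ourselves so we do not depend on further fields.
    rate = prompt_n / (prompt_ms / 1000.0) if prompt_ms > 0 else 0.0
    return {
        "prompt_tokens": prompt_n,
        "prompt_ms": prompt_ms,
        "prompt_tokens_per_second": rate,
    }


def complete(prompt, **sampling):
    """POST the payload to the server and return the decoded JSON reply."""
    body = json.dumps(build_request(prompt, **sampling)).encode("utf-8")
    request = urllib.request.Request(
        SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

With a server running, `summarize_timings(complete("Hello", temperature=0.2, top_k=20))` would return the prompt token count, elapsed milliseconds, and derived tokens per second for that one request, which can then be compared with what the server logs for the same call.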

Platform coverage is broad: b9151 ships pre-built binaries for macOS (Apple Silicon with optional KleidiAI, Intel, iOS XCFramework), Linux (x64/arm64/s390x CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows (x64/arm64 CPU, CUDA 12.4/13.1, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler (x86 and aarch64 with 310p/910b ACL Graph). The range of GPU backends (CUDA, ROCm, Vulkan, SYCL, HIP) covers NVIDIA, AMD, and Intel hardware. For the local AI community, this release brings better observability and hardware flexibility while keeping the simplicity that made llama.cpp a go-to tool.

Key Points
  • Logs reduced and server builds fixed for cleaner operation
  • Server now prints prompt processing timings and sampling parameters per request
  • Broad platform support via pre-built binaries: CUDA 12.4/13.1, ROCm 7.2, OpenVINO, SYCL, HIP, openEuler

Why It Matters

Better observability and broad hardware coverage make local LLM deployments easier for developers to debug and run in production.