Developer Tools

llama.cpp release b9276 adds prompt token tracking to API

Now clients can monitor prompt evaluation progress with new token counts.

Deep Dive

The open-source llama.cpp project, known for efficient local inference of large language models, has published release b9276. The update improves observability by exposing three previously internal token counters on the server's /slots endpoint: n_prompt_tokens (total tokens in the prompt), n_prompt_tokens_processed (tokens processed so far), and n_prompt_tokens_cache (tokens served from cache). Clients that poll the endpoint can now follow prompt evaluation as it happens, which supports debugging, progress bars, and performance tuning in applications that use llama.cpp's API for local LLM deployments.
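As a rough illustration of how a client might use these counters, the Python sketch below polls /slots and turns the numbers into a text progress bar. Only the three field names come from the release. The rest are assumptions: that the server listens on http://localhost:8080, that /slots returns a JSON list of slot objects, and that the counters sit at the top level of each object. The helper names (fetch_slots, prompt_progress, render_bar, watch) are hypothetical. The release summary does not say whether n_prompt_tokens_processed includes cached tokens, so the sketch reports the cache count separately rather than folding it into the ratio.

```python
import json
import time
import urllib.request
from typing import Optional


def fetch_slots(base_url: str = "http://localhost:8080") -> list:
    """GET <base_url>/slots; assumed to return a JSON list of slot objects."""
    with urllib.request.urlopen(f"{base_url}/slots", timeout=5) as resp:
        return json.load(resp)


def prompt_progress(slot: dict) -> Optional[float]:
    """Fraction of the prompt evaluated so far, or None if counters are absent or zero."""
    total = slot.get("n_prompt_tokens")
    done = slot.get("n_prompt_tokens_processed")
    if not isinstance(total, int) or not isinstance(done, int) or total <= 0:
        return None
    # Clamp in case the server reports more processed tokens than the total.
    return min(done / total, 1.0)


def render_bar(fraction: float, width: int = 20) -> str:
    """Render a plain-text progress bar for a fraction between 0 and 1."""
    filled = int(fraction * width)
    return f"[{'#' * filled}{'-' * (width - filled)}] {fraction:.0%}"


def watch(base_url: str = "http://localhost:8080", interval: float = 0.5) -> None:
    """Poll /slots forever and print one line per slot that reports counters."""
    while True:
        for index, slot in enumerate(fetch_slots(base_url)):
            fraction = prompt_progress(slot)
            if fraction is not None:
                cached = slot.get("n_prompt_tokens_cache", 0)
                print(f"slot {index}: {render_bar(fraction)} ({cached} tokens from cache)")
        time.sleep(interval)
```

Calling watch() against a running server would print one line per active slot every half second. If the counters turn out to be nested inside another object in the real response, only prompt_progress and the cache lookup would need to change.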

With over 112k stars on GitHub and 18.6k forks, llama.cpp remains one of the most popular open-source tools for running models like Llama, Mistral, and Gemma on consumer hardware. The b9276 release ships precompiled binaries for macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x), Windows (x64, arm64), Android, and iOS, with support for CPU, CUDA, Vulkan, ROCm, OpenVINO, SYCL, and HIP backends. This broad platform coverage means developers can use the new prompt monitoring on edge devices, servers, and mobile environments alike.

Key Points
  • New /slots endpoint fields: n_prompt_tokens, n_prompt_tokens_processed, n_prompt_tokens_cache
  • Enables real-time tracking of prompt evaluation progress, which clients previously could not observe
  • Available on all major platforms (macOS, Linux, Windows, Android, iOS) with multiple acceleration backends

Why It Matters

Improves observability for local LLM inference, enabling better debugging and user feedback in AI applications.