Llama.cpp release b9276 adds prompt token tracking to API

llama.cpp Releases May 22, 2026

⚡ Clients can now monitor prompt evaluation progress through new token counts.

Deep Dive

The open-source llama.cpp project, widely used for efficient local inference of large language models, has published release b9276. The update improves observability of the server's /slots endpoint by exposing three token counters that were previously internal: n_prompt_tokens (total tokens in the prompt), n_prompt_tokens_processed (tokens processed so far), and n_prompt_tokens_cache (tokens served from cache). By polling the endpoint, clients can now follow prompt evaluation as it happens, which supports debugging, progress bars, and performance tuning in applications built on llama.cpp's API for local LLM deployments.
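As a rough illustration of how a client might use these counters, the Python sketch below turns one slot's values into a progress fraction. Several things here are assumptions rather than details from the release notes: that /slots returns a JSON array of slot objects with the three fields at the top level of each object, that the server listens on http://localhost:8080 with the endpoint enabled, and that n_prompt_tokens_processed divided by n_prompt_tokens is a reasonable progress measure. The helper names prompt_progress and fetch_slots are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def prompt_progress(slot: dict) -> Optional[float]:
    """Return the fraction of the prompt evaluated so far, or None if unknown.

    Assumes the slot object carries the counters named in the release
    at its top level; the exact JSON layout is an assumption.
    """
    total = slot.get("n_prompt_tokens")
    done = slot.get("n_prompt_tokens_processed")
    if not total or done is None:
        # Missing fields or an empty prompt: no meaningful progress value.
        return None
    # Clamp in case the counters briefly disagree.
    return min(done / total, 1.0)


def fetch_slots(base_url: str = "http://localhost:8080") -> list:
    """Fetch the current slot list from a llama.cpp server.

    The base URL is a placeholder for a local server; adjust as needed.
    """
    with urllib.request.urlopen(f"{base_url}/slots", timeout=5) as resp:
        return json.load(resp)
```

A client could call fetch_slots() on a short interval and pass each returned slot to prompt_progress() to drive a progress bar. The n_prompt_tokens_cache field could be reported alongside it to show how much of the prompt was reused from cache.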

With over 112k stars and 18.6k forks on GitHub, llama.cpp remains one of the most popular open-source tools for running models such as Llama, Mistral, and Gemma on consumer hardware. The b9276 release ships precompiled binaries for macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x), Windows (x64, arm64), Android, and iOS, with support for CPU, CUDA, Vulkan, ROCm, OpenVINO, SYCL, and HIP backends. With this breadth of platform coverage, developers can integrate prompt monitoring across edge devices, servers, and mobile environments.

Key Points

New /slots endpoint fields: n_prompt_tokens, n_prompt_tokens_processed, n_prompt_tokens_cache
Enables real-time tracking of prompt evaluation progress, previously impossible for clients
Available on all major platforms (macOS, Linux, Windows, Android, iOS) with multiple acceleration backends

Why It Matters

Improves observability for local LLM inference, enabling better debugging and user feedback in AI applications.
