Developer Tools

llama.cpp b9100 adds post-sampling probability support for transparency

llama.cpp's latest release can now return post-sampling probabilities, giving developers deeper insight into model decisions.

Deep Dive

llama.cpp, the popular open-source C++ implementation for running large language models locally, has released version b9100. The headline feature is backend sampling support for returning post-sampling probabilities. Previously, sampling yielded only the chosen token, with no probability details attached. Developers can now retrieve the probability distribution over the vocabulary as it stands after sampling, which shows how confident the model was and which alternatives it considered. In addition, the server component never returns a post-sampling probability of 0.0, avoiding misleading zero values. The release includes builds for Linux (x64, arm64, s390x), Windows (x64, arm64) with various backends (CUDA, Vulkan, SYCL, HIP), macOS (Apple Silicon and Intel), iOS, Android, and openEuler with ACL Graph support.
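To make the term concrete, the sketch below contrasts probabilities computed over the whole vocabulary with probabilities renormalized after a sampling step. It is a conceptual illustration only, not llama.cpp's code: the function names, the toy logits, and the choice of top-k followed by temperature are all assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities, optionally scaled by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def post_sampling_probs(logits, top_k, temperature):
    """Hypothetical helper: probabilities after top-k filtering and temperature.

    Only the top_k highest-logit token ids survive; their probabilities are
    renormalized so they sum to 1 over the surviving candidates.
    """
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    probs = softmax([logits[i] for i in kept], temperature)
    return dict(zip(kept, probs))

# Toy logits for a four-token vocabulary (illustrative values).
logits = [2.0, 1.0, 0.5, -1.0]

pre = softmax(logits)                                          # before sampling
post = post_sampling_probs(logits, top_k=2, temperature=0.7)   # after sampling

print("pre-sampling: ", [round(p, 3) for p in pre])
print("post-sampling:", {i: round(p, 3) for i, p in post.items()})
```

In this toy case the top token's probability rises after sampling because the filtered-out tokens no longer share the probability mass, which is why the two views can tell different stories about model confidence.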

The update addresses a long-standing request from the local AI community for more transparency in local model inference. With post-sampling probabilities exposed, developers can implement custom sampling strategies, debug generation more effectively, and exercise finer control over output quality. The feature is particularly valuable for applications that require deterministic outputs or fine-grained tuning of temperature and top-k settings. The steady release cadence of llama.cpp, which has over 109,000 stars on GitHub, reflects the project's maturity and commitment to continuous improvement. The b9100 tag also includes various bug fixes and performance optimizations across backends, reinforcing llama.cpp's role as a critical tool for running LLMs on consumer hardware.
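As one example of the debugging use case, the sketch below flags generated tokens whose post-sampling probability falls under a threshold. It assumes the probabilities have already been collected into simple records; the `TokenProb` type, the `flag_low_confidence` helper, the threshold, and the sample values are hypothetical and do not mirror any llama.cpp data structure.

```python
from dataclasses import dataclass

@dataclass
class TokenProb:
    """Hypothetical record: a generated token and its post-sampling probability."""
    token: str
    prob: float

def flag_low_confidence(tokens, threshold=0.5):
    """Return (position, token, prob) for every token below the threshold."""
    return [
        (position, item.token, item.prob)
        for position, item in enumerate(tokens)
        if item.prob < threshold
    ]

# Illustrative values only; real numbers would come from the inference run.
generated = [
    TokenProb("The", 0.92),
    TokenProb(" capital", 0.81),
    TokenProb(" is", 0.97),
    TokenProb(" Lyon", 0.34),
]

flagged = flag_low_confidence(generated)
for position, token, prob in flagged:
    print(f"token {position} {token!r} was sampled with probability {prob:.2f}")
```

A report like this points to the positions where the model was least certain, which is often a useful starting point when adjusting temperature or top-k.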

Key Points
  • Backend sampling now returns post-sampling probabilities (probs) for deeper model insight
  • Server ensures non-zero post-sampling probabilities, avoiding misleading zero values
  • Supports all major platforms: Linux, Windows, macOS, iOS, Android, with multiple GPU backends

Why It Matters

Gives local LLM users probability transparency, enabling better sampling control and debugging in production apps.