b8978
New update discards low-probability tokens for smoother local LLM runs
Deep Dive
The llama.cpp project (108k+ stars) released b8978, an update to its speculative decoding ("spec") path that discards the last drafted token when the draft model assigns it a low probability. The release includes pre-built binaries for macOS (Apple Silicon, Intel), Linux (x64, arm64, Vulkan, ROCm, OpenVINO, SYCL), Windows (x64, arm64, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64), iOS, and openEuler.
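The idea behind the change can be sketched as follows. In speculative decoding, a small draft model proposes several tokens that the larger target model then verifies in one batch; trimming a trailing low-probability draft token avoids spending verification work on a token that is likely to be rejected anyway. This is a minimal illustrative sketch, not llama.cpp's actual implementation: the function name, list-based representation, and threshold value are all assumptions.

```python
# Hypothetical sketch of the "discard last low-probability draft token" step.
# DRAFT_P_MIN is an assumed, illustrative threshold (the real cutoff in
# llama.cpp is an internal/configurable detail).
DRAFT_P_MIN = 0.75

def trim_draft(drafted: list[int], probs: list[float],
               p_min: float = DRAFT_P_MIN) -> tuple[list[int], list[float]]:
    """Drop the last drafted token if the draft model's probability
    for it falls below p_min; otherwise return the draft unchanged."""
    if drafted and probs[-1] < p_min:
        return drafted[:-1], probs[:-1]
    return drafted, probs

# The last proposed token (id 42) has probability 0.3 < 0.75, so it is
# dropped before the target model verifies the batch.
tokens, probs = trim_draft([101, 7, 42], [0.9, 0.8, 0.3])
```

Dropping only the final token is a cheap heuristic: earlier draft tokens have already "paid for themselves" by extending the verified prefix, while a weak trailing token is the most likely rejection point.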
Key Points
- Speculative decoding fix discards the last drafted token when its probability is low, improving output quality
- Pre-built binaries for macOS, Linux, Windows, Android, iOS, and openEuler across CPU and GPU backends
- GPU support includes CUDA 12/13, Vulkan, ROCm, OpenVINO, SYCL, and HIP for accelerated inference
Why It Matters
Boosts local LLM reliability on consumer hardware, enabling faster, more coherent AI without cloud dependency.