Developer Tools

b8940

Critical bug fix for partial reads and writes in recurrent state serialization

Deep Dive

The ggml-org team has released llama.cpp version b8940, a maintenance update that addresses a critical bug in recurrent state serialization. The previous code handled only full-tensor reads and writes; a partial read or write triggered the assertion GGML_ASSERT(size == ggml_nbytes(tensor)), which surfaced as crashes when tested with llama-server. This fix handles partial reads and writes correctly, improving stability for users running recurrent models.

This release ships prebuilt binaries for a wide range of platforms: macOS (Apple Silicon, Intel, iOS), Linux (x64, arm64, s390x, with CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32/FP16 backends), Windows (x64, arm64, with CPU, CUDA 12/13, Vulkan, SYCL, and HIP backends), and Android arm64. The release is signed and carries GitHub's verified signature.

Key Points
  • Fixes recurrent state serialization so partial reads and writes no longer crash llama-server
  • Previously only full-tensor reads and writes were supported; partial ones triggered a GGML_ASSERT failure
  • Supports macOS, Linux, Windows, Android, and openEuler with multiple backends including CUDA, Vulkan, ROCm, and SYCL

Why It Matters

Ensures stable recurrent-model inference and state save/restore across platforms, which is critical for production deployments built on llama.cpp.