llama.cpp b9751 patches memory reporting in its multimodal library
The popular local LLM runner gets a targeted fix for memory usage tracking.
llama.cpp, the widely adopted open-source project for running large language models on local hardware, has released build b9751. The update contains a single targeted fix: it corrects the memory usage reported by `mtmd_get_memory_usage`, a function in mtmd, the project's multimodal library. Inaccurate figures from this function could mislead anyone sizing memory for multimodal workloads, for example when judging whether a model and its additional components (such as a vision encoder) will fit on a given device. The fix, tracked in pull request #24867, is meant to give developers more reliable numbers for resource planning during inference.
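To illustrate why an under-reported figure matters, here is a minimal sketch of a pre-flight memory check. It does not call llama.cpp: `MemoryReport`, `fits_in_budget`, the 10% headroom, and all byte counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryReport:
    """Hypothetical per-component memory report, in bytes."""
    component: str
    reported_bytes: int


def fits_in_budget(reports, budget_bytes, headroom=0.10):
    """Return True if the summed reported usage plus headroom fits the budget."""
    total = sum(r.reported_bytes for r in reports)
    return total * (1 + headroom) <= budget_bytes


GIB = 1024 ** 3

# Illustrative numbers only, not measurements from llama.cpp.
accurate = [
    MemoryReport("text model", 5 * GIB),
    MemoryReport("multimodal projector", 2 * GIB),
]
# Same setup, but the projector's usage is under-reported as zero.
under_reported = [
    MemoryReport("text model", 5 * GIB),
    MemoryReport("multimodal projector", 0),
]

budget = 6 * GIB
print(fits_in_budget(accurate, budget))        # False
print(fits_in_budget(under_reported, budget))  # True
```

With accurate figures the check rejects the 6 GiB device; with the under-reported figure it wrongly passes, and the shortfall would only surface at load time or during inference.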
The release ships prebuilt binaries for the major platforms and hardware accelerators. For macOS, builds support both Apple Silicon (arm64) and Intel (x64), with optional KleidiAI acceleration on ARM. Linux covers x64 and ARM64 with Vulkan, ROCm 7.2, OpenVINO, and SYCL backends. Windows provides CPU builds plus GPU support via CUDA (versions 12 and 13), Vulkan, OpenVINO, SYCL, and HIP. Android gets an ARM64 CPU build. Builds for openEuler (a Chinese Linux distribution) are listed but currently disabled. This breadth lets llama.cpp be deployed on everything from local workstations to edge devices.
- Fixes memory usage reporting in the multimodal library (`mtmd_get_memory_usage`, PR #24867).
- Builds for 20+ platform and backend combinations across macOS, Linux, Windows, and Android (openEuler builds currently disabled).
- GPU support spans CUDA 12/13, Vulkan, ROCm, OpenVINO, SYCL, HIP, and OpenCL.
Why It Matters
Accurate memory figures help developers judge whether a model will fit before they deploy it, which keeps llama.cpp dependable across the diverse hardware it targets.