llama.cpp's ggml update fixes memory leaks and expands platform support
The update fixes a memory leak and ships pre-built binaries for more than 20 configurations, including ARM and Vulkan targets.
The llama.cpp project has released an update to its underlying ggml tensor library, published under the release tag b9320. The patch addresses two problems. First, it corrects the ggml context size calculation; a miscalculated context size could lead to memory corruption or incorrect tensor shapes. Second, it plugs a memory leak that may have caused gradual resource exhaustion in long-running inference sessions. For stability, the update also moves the split state cache back into the context structure and increases the headroom reserved for statically allocated tensors, which should reduce crashes during large model runs. Finally, an obsolete include was removed to tidy up compilation.
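To make the context size and headroom points concrete, here is a minimal Python sketch of the general idea, not ggml's actual code. It assumes a fixed-size metadata arena in which every tensor consumes a fixed number of bytes; `TENSOR_OVERHEAD`, `context_size`, and `Arena` are hypothetical names and values chosen for illustration only.

```python
# Hypothetical per-tensor metadata cost in bytes. The real figure in ggml
# depends on the build and platform; this value is illustrative only.
TENSOR_OVERHEAD = 368


def context_size(n_tensors: int, headroom: int = 0) -> int:
    """Bytes to reserve for n_tensors plus `headroom` spare tensor slots."""
    if n_tensors < 0 or headroom < 0:
        raise ValueError("n_tensors and headroom must be non-negative")
    return (n_tensors + headroom) * TENSOR_OVERHEAD


class Arena:
    """Fixed-size bump allocator standing in for a metadata-only context."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.used = 0

    def new_tensor(self) -> int:
        """Reserve one tensor slot and return its byte offset."""
        if self.used + TENSOR_OVERHEAD > self.size:
            # An undersized context surfaces here as a hard failure.
            raise MemoryError("context too small for another tensor")
        offset = self.used
        self.used += TENSOR_OVERHEAD
        return offset
```

With `Arena(context_size(4))`, a fifth call to `new_tensor()` raises `MemoryError`, while `Arena(context_size(4, headroom=2))` accepts six tensors. In this toy model, an undercounted tensor total exhausts the arena early, and extra headroom absorbs tensors the estimate missed.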
The most visible change is the breadth of platform support. The release provides pre-built binaries for more than 20 configurations: macOS (Apple Silicon with and without KleidiAI optimizations, Intel x64); Linux (x64 CPU, ARM64 CPU, s390x CPU, x64/ARM64 Vulkan, x64 ROCm 7.2, OpenVINO, SYCL FP32/FP16); Android ARM64; Windows (x64 CPU, ARM64 CPU, CUDA 12/13 DLLs, Vulkan, SYCL, HIP); and openEuler (x86 310p and ARM64 310p, with ACL Graph support). In practice, this means a ready-made build exists for hardware ranging from older laptops to high-end server GPUs, without compiling from source. The update also includes unspecified UI improvements, likely in the default interface. For developers and power users running local LLMs, this is a meaningful stability and portability upgrade.
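As a rough illustration of matching a machine to one of the CPU builds listed above, the following Python sketch maps the values reported by the standard `platform` module to the article's build labels. The `CPU_BUILDS` table and `pick_cpu_build` function are hypothetical; the labels are descriptive names taken from the list above, not actual release file names, and GPU variants (CUDA, Vulkan, ROCm, SYCL, HIP) would need a separate check.

```python
import platform
from typing import Optional

# Keys are (platform.system(), platform.machine()) pairs as commonly reported
# by Python. Values are descriptive labels, not real download names.
CPU_BUILDS = {
    ("Darwin", "arm64"): "macOS Apple Silicon",
    ("Darwin", "x86_64"): "macOS Intel x64",
    ("Linux", "x86_64"): "Linux x64 CPU",
    ("Linux", "aarch64"): "Linux ARM64 CPU",
    ("Linux", "s390x"): "Linux s390x CPU",
    ("Windows", "AMD64"): "Windows x64 CPU",
    ("Windows", "ARM64"): "Windows ARM64 CPU",
}


def pick_cpu_build(system: Optional[str] = None,
                   machine: Optional[str] = None) -> Optional[str]:
    """Return the matching CPU build label, or None if nothing matches."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return CPU_BUILDS.get((system, machine))


if __name__ == "__main__":
    # Prints the label for the current machine, or None if it is not listed.
    print(pick_cpu_build())
```

Passing explicit arguments, for example `pick_cpu_build("Linux", "s390x")`, makes the lookup easy to check without access to that hardware. Note that Android devices typically also report a Linux ARM64 pair, so this simple table cannot distinguish them from ordinary Linux ARM64 systems.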
- Fixes ggml context size calculation and memory leak, improving inference reliability
- Provides pre-built binaries for 20+ configurations, including macOS ARM64 with KleidiAI, Linux ROCm, and Windows CUDA 12/13
- Adds headroom for statically allocated tensors and moves the split state cache back into the context structure to reduce crashes during large model runs
Why It Matters
This update makes local LLM inference more stable and easier to deploy across a wide range of hardware, from laptops to servers.