llama.cpp's ggml update fixes memory leaks and expands platform support

llama.cpp Releases May 26, 2026

⚡ A critical memory leak is fixed, and pre-built binaries now cover 20+ platform configurations, including ARM and Vulkan targets.

Deep Dive

The llama.cpp project has released build b9320, which carries a significant update to its underlying ggml tensor library. The patch addresses two critical issues. First, it fixes the ggml context size calculation, an error that could previously lead to memory corruption. Second, it plugs a memory leak that may have caused gradual resource exhaustion in long-running inference sessions. To improve stability, the update also moves the split state cache back into the context structure and increases the headroom reserved for statically allocated tensors, which should reduce crashes during large model runs. Finally, an obsolete include was removed to tidy up compilation.
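To make the idea of "headroom" concrete, here is a minimal Python sketch of how a context budget with spare static tensor slots might be reasoned about. It is not the code from the patch: the class name, its fields, and the 368-byte per-tensor cost are all hypothetical placeholders chosen for illustration (in ggml itself the per-tensor metadata cost is reported by ggml_tensor_overhead(), and the release notes do not state the figures involved).

```python
from dataclasses import dataclass

# Hypothetical per-tensor metadata cost in bytes. This is a placeholder for
# illustration only; the real value is reported by ggml at runtime.
TENSOR_OVERHEAD_BYTES = 368


@dataclass
class ContextBudget:
    """Hypothetical model of a context sized by counting tensor slots."""

    n_dynamic_tensors: int    # tensors created per graph build
    n_static_tensors: int     # tensors allocated once and kept
    static_headroom: int = 0  # spare static slots reserved as a safety margin

    def size_bytes(self) -> int:
        # Total metadata bytes reserved for the context.
        slots = self.n_dynamic_tensors + self.n_static_tensors + self.static_headroom
        return slots * TENSOR_OVERHEAD_BYTES

    def fits(self, tensors_created: int) -> bool:
        # True if the metadata for this many tensors fits in the reserved size.
        return tensors_created * TENSOR_OVERHEAD_BYTES <= self.size_bytes()


tight = ContextBudget(n_dynamic_tensors=100, n_static_tensors=10)
padded = ContextBudget(n_dynamic_tensors=100, n_static_tensors=10, static_headroom=8)

# A run that needs two more tensors than planned overflows the tight budget
# but fits once headroom is reserved.
print(tight.fits(112), padded.fits(112))
```

The point of the sketch is only that an undercounted budget fails as soon as a model needs slightly more tensors than expected, while a small reserved margin absorbs the difference at a modest memory cost.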

The most visible change is the breadth of platform support. The release provides pre-built binaries for over 20 configurations: macOS (Apple Silicon with and without KleidiAI optimizations, Intel x64), Linux (x64 CPU, ARM64 CPU, s390x CPU, x64/ARM64 Vulkan, x64 ROCm 7.2, OpenVINO, SYCL FP32/FP16), Android ARM64, Windows (x64 CPU, ARM64 CPU, CUDA 12/13 DLLs, Vulkan, SYCL, HIP), and openEuler (x86 310p, ARM64 310p, with ACL Graph support). With this range of targets, llama.cpp can be installed without a local build on hardware ranging from older laptops to high-end server GPUs. The update also includes UI improvements that the release notes do not specify, likely in the default interface. For developers and power users running local LLMs, this is an important stability and portability upgrade.

Key Points

Fixes ggml context size calculation and memory leak, improving inference reliability
Expands pre-built binaries to 20+ platforms including macOS ARM64/KleidiAI, Linux ROCm, and Windows CUDA 12/13
Adds static tensor headroom and re-caches split state to reduce crashes during large model runs

Why It Matters

This update makes local LLM inference more stable and easier to deploy across a wide range of hardware, from laptops to servers.
