Developer Tools

b9028

The new release reduces GPU memory usage, enabling larger LLMs on consumer hardware.

Deep Dive

ggml-org's llama.cpp released b9028, adding an option to save memory in device buffers (#22679). The release provides builds for macOS (Apple Silicon and Intel), Linux (x64, ARM, and s390x, with Vulkan, ROCm, and other backends), Windows (CPU, CUDA 12/13, Vulkan), Android, and openEuler.

Key Points
  • Memory-saving option in device buffers reduces GPU VRAM usage, enabling larger models on limited hardware.
  • Supports multiple backends: CUDA, Vulkan, ROCm, OpenVINO, SYCL, HIP, plus CPU on all major OSes.
  • Release b9028 is built and tested through automated CI pipelines for macOS, Linux, Windows, and Android.
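The VRAM reduction above complements llama.cpp's existing partial-offload controls, which already let a model larger than available VRAM run by keeping some layers on the CPU. A minimal invocation sketch, assuming a locally downloaded GGUF model; the path and values are illustrative placeholders, and the specific new option from #22679 is not shown:

```shell
# Invocation sketch for llama.cpp's llama-cli (model path is a placeholder):
#   -m    path to a GGUF model file
#   -ngl  number of transformer layers to offload to the GPU backend;
#         lowering it trades speed for VRAM so larger models still fit
#   -c    context window size in tokens
#   -p    prompt text
llama-cli \
  -m ./models/llama-7b-q4_k_m.gguf \
  -ngl 20 \
  -c 4096 \
  -p "Hello"
```

Tuning `-ngl` down until the model loads is a common way to fit a model on limited VRAM; the b9028 buffer option reduces how much must be trimmed.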

Why It Matters

Lowers the hardware bar for running large LLMs locally, making on-device AI inference more accessible.