llama.cpp b9460 cuts VRAM usage with smarter output allocation
New release saves VRAM by dynamically limiting output slots during inference
llama.cpp, the popular open-source C++ library for running large language models locally (114k stars, 19k forks on GitHub), has released version b9460 with a VRAM optimization. The update caps the maximum number of output slots per context at what the active sequences need, rather than reserving space for the full potential output. The change, implemented in PR #23861, can reduce GPU memory consumption during inference, especially when running multiple sequences or processing batches.
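To get a feel for why the number of reserved output slots matters, here is a rough back-of-the-envelope sketch. It assumes each output slot holds one 32-bit float per vocabulary entry; the function name, the batch size of 2048, and the four active sequences are illustrative assumptions, not values taken from llama.cpp or PR #23861. Only the 128,256-token vocabulary (Llama 3) is a real figure.

```python
# Illustrative estimate only: not llama.cpp code or its actual allocation logic.
# Assumption: one output slot stores n_vocab logits as 32-bit floats.

def logits_buffer_bytes(n_outputs: int, n_vocab: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold logits for n_outputs output slots."""
    return n_outputs * n_vocab * bytes_per_value

N_VOCAB = 128_256          # Llama 3 vocabulary size
FULL_BATCH_OUTPUTS = 2048  # hypothetical: one slot reserved per token in a batch
ACTIVE_SEQ_OUTPUTS = 4     # hypothetical: one slot per active sequence

full = logits_buffer_bytes(FULL_BATCH_OUTPUTS, N_VOCAB)
capped = logits_buffer_bytes(ACTIVE_SEQ_OUTPUTS, N_VOCAB)

MIB = 1024 * 1024
print(f"reserved for full batch: {full / MIB:.1f} MiB")
print(f"capped to active seqs:   {capped / MIB:.1f} MiB")
```

Actual savings depend on the model, the backend, and how llama.cpp sizes its buffers, so treat these numbers as an order-of-magnitude illustration rather than a measurement.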
The release also adds a new n_outputs_per_seq parameter for finer control over output allocation, moves n_outputs_max into the server context so that it is configured in one place, and renames all ubatch references to batch for consistency. Prebuilt binaries are available for macOS (Apple Silicon, Intel), Linux (CPU, Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, HIP), and Android (arm64). For developers and hobbyists running models like Llama 3 or Mistral on consumer hardware, the update can free VRAM for larger models or lower the memory overhead of the models they already run.
- Limits max outputs per context to active sequences only, saving VRAM
- Adds n_outputs_per_seq parameter for granular control of output allocation
- Moves n_outputs_max to server context for centralized configuration
Why It Matters
A smaller VRAM footprint makes it easier to run larger LLMs on consumer hardware.