llama.cpp b9460 cuts VRAM usage with smarter output allocation

llama.cpp Releases June 02, 2026

⚡ New release saves VRAM by dynamically limiting output slots during inference

Deep Dive

llama.cpp, the popular open-source C++ library for running large language models locally (114k stars, 19k forks on GitHub), has released version b9460 with a memory optimization. The update limits the maximum number of output slots per context to what the active sequences need, rather than reserving space for the full potential output. This change, implemented in PR #23861, can reduce GPU memory consumption during inference, particularly when running multiple sequences or processing batches.

The release also adds a new n_outputs_per_seq parameter for finer control over output allocation, moves n_outputs_max into the server context so that it is configured in one place, and changes all ubatch references to batch for consistency. Prebuilt binaries are available for macOS (Apple Silicon, Intel), Linux (CPU, Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, HIP), and Android (arm64). For developers and hobbyists running models like Llama 3 or Mistral on consumer hardware, the freed VRAM may help fit a larger model or run an existing one with lower memory overhead.
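How much memory an output-slot cap saves depends on the model and the settings in use. The following is a rough back-of-the-envelope sketch, not llama.cpp code. It assumes each output slot holds one float32 logit per vocabulary entry and that slots are reserved as a fixed count per active sequence. The function names, that per-sequence rule, and every number are hypothetical and are not taken from the release.

```python
# Back-of-the-envelope model of output-buffer size. This is NOT llama.cpp code.
# Assumption: each output slot stores one float32 logit per vocabulary entry.

BYTES_PER_LOGIT = 4  # float32


def output_buffer_bytes(n_slots: int, n_vocab: int) -> int:
    """Bytes needed to hold logits for n_slots output positions."""
    return n_slots * n_vocab * BYTES_PER_LOGIT


def slots_for_active_sequences(n_active_seqs: int, n_outputs_per_seq: int = 1) -> int:
    """Hypothetical rule: reserve a fixed number of slots per active sequence."""
    return n_active_seqs * n_outputs_per_seq


# Illustrative numbers only.
n_vocab = 128_000   # vocabulary size
n_batch = 2048      # slots reserved if every batch position could produce output
n_active_seqs = 4   # sequences actually being decoded

full = output_buffer_bytes(n_batch, n_vocab)
trimmed = output_buffer_bytes(slots_for_active_sequences(n_active_seqs), n_vocab)

print(f"full reservation:    {full / 2**20:.1f} MiB")
print(f"active-sequence cap: {trimmed / 2**20:.1f} MiB")
```

Under these assumptions the reservation scales with the number of slots, so capping slots at the active-sequence count shrinks it by the same ratio. Real savings will differ with the backend and with how the buffers are actually laid out.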

Key Points

Limits max outputs per context to active sequences only, saving VRAM
Adds n_outputs_per_seq parameter for granular control of output allocation
Moves n_outputs_max to server context for centralized configuration

Why It Matters

Allows running larger LLMs on consumer hardware with reduced VRAM footprint
