Open Source

llama.cpp PR cuts VRAM usage by 1.2GB with smarter logit allocation

A new optimization reserves logits only for active sequences, saving significant GPU memory.

Deep Dive

A new pull request from developer am17an aims to reduce GPU memory usage in llama.cpp by changing how space for logits is allocated within the llama_context. The change, which continues the work of PR #23764, reserves logits memory only for the number of sequences actually being processed (n_seqs) rather than for every possible token upfront. In the developer's testing with the -ub 2048 flag and MTP (Multi-Token Prediction) enabled, this saved an additional 1.2GB of VRAM. The optimization matters most for local users running large models on consumer GPUs with limited memory.
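To get a feel for why the reservation size matters, here is a rough back-of-the-envelope sketch in Python. It assumes the logits buffer holds one 32-bit float per vocabulary entry for each reserved output row. The vocabulary size of 128,256 and the single active sequence are illustrative assumptions, not figures from the PR, and the function name is hypothetical. Actual savings depend on the model, the MTP configuration, and the backend, so this is not expected to reproduce the 1.2GB figure.

```python
def logits_reserve_bytes(n_rows: int, n_vocab: int, bytes_per_logit: int = 4) -> int:
    """Bytes needed to hold n_rows rows of logits, one value per vocab entry."""
    return n_rows * n_vocab * bytes_per_logit


N_VOCAB = 128_256  # assumed vocabulary size, for illustration only
N_UBATCH = 2048    # matches the -ub 2048 setting described above
N_SEQS = 1         # assumed: a single active sequence

# Reserving a row for every token in the micro-batch vs. one row per sequence.
full = logits_reserve_bytes(N_UBATCH, N_VOCAB)
per_seq = logits_reserve_bytes(N_SEQS, N_VOCAB)

GIB = 1024 ** 3
print(f"reserve for all tokens: {full / GIB:.2f} GiB")
print(f"reserve per sequence:   {per_seq / 1024:.0f} KiB")
```

Under these assumptions the per-token reservation is on the order of a gigabyte, while the per-sequence reservation is well under a megabyte, which is consistent in scale with the saving reported in the PR.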

The PR is still marked as a draft because the developer believes a better API might exist. The proposed approach keeps the default reservation at all tokens but lets the server context override it to 1 whenever possible. This would let inference servers minimize memory use while preserving flexibility for other use cases. Early testing with llama-perplexity shows no regressions. If merged, the change could lower the hardware barrier for running 7B-13B parameter models locally, freeing VRAM for larger batch sizes or longer contexts.
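The "default to all tokens, let the server override to 1" idea can be sketched as a small configuration pattern. This is only an illustration in Python: the class, field, and function names below are hypothetical and do not correspond to the actual llama.cpp API, which the PR author has not yet settled on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextParams:
    """Hypothetical stand-in for context parameters (not the llama.cpp struct)."""
    n_ubatch: int
    # Hypothetical field: None means "reserve logits for every token in the micro-batch".
    n_logits_reserve: Optional[int] = None


def resolve_logits_rows(params: ContextParams) -> int:
    """Number of logits rows to reserve for a context."""
    if params.n_logits_reserve is None:
        return params.n_ubatch  # default: full allocation
    # Clamp an explicit override to the range [1, n_ubatch].
    return max(1, min(params.n_logits_reserve, params.n_ubatch))


# A generic caller keeps the default and gets the full reservation.
default_ctx = ContextParams(n_ubatch=2048)
# A server that only samples the last token per sequence opts in to the minimum.
server_ctx = ContextParams(n_ubatch=2048, n_logits_reserve=1)

print(resolve_logits_rows(default_ctx))
print(resolve_logits_rows(server_ctx))
```

Keeping full allocation as the default means callers that read logits for every position, such as perplexity evaluation, would keep working unchanged, while a server could opt in to the smaller reservation.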

Key Points
  • Builds on PR #23764 to further reduce VRAM usage in llama.cpp
  • Saves 1.2GB of VRAM with -ub 2048 and MTP enabled
  • Proposes an API that defaults to full logits allocation but lets server contexts reduce the reservation to 1

Why It Matters

Lowers VRAM requirements for local LLM inference, enabling larger models or longer contexts on consumer GPUs.