Llama.cpp Fix Slashes SSM Model VRAM Use by 8x, Unlocks 1M Context
A small server fix makes million-token context windows practical for SSM hybrid models on a single GPU.
A critical bug fix in llama.cpp's server has been merged, drastically reducing VRAM usage for SSM hybrid models like Qwen3Next and Nemotron 3 Nano. Previously, running a 1M context model with 8 parallel slots could consume 48GB of VRAM on cache alone. With the fix, cache usage drops to the expected 6GB, an 8x reduction. A single 48GB GPU can now serve eight simultaneous users with 1M context each, provided the model weights fit in the remaining memory.
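The arithmetic behind those figures can be sketched in a few lines. This is only an illustration, not llama.cpp code: the helper `cache_gb` is hypothetical, and it assumes the bug inflated the expected cache size by a factor equal to the parallel count, which is what the reported 48GB and 6GB figures suggest.

```python
def cache_gb(expected_gb: float, parallel: int, buggy: bool) -> float:
    """Cache size in GB.

    Assumption (not confirmed by the source): before the fix, the cache
    was inflated by a factor equal to the number of parallel slots.
    """
    return expected_gb * parallel if buggy else expected_gb


EXPECTED_GB = 6.0   # expected cache size reported for the 1M-context case
PARALLEL = 8        # parallel count from the report
GPU_GB = 48.0       # VRAM of the single GPU in the example

before = cache_gb(EXPECTED_GB, PARALLEL, buggy=True)
after = cache_gb(EXPECTED_GB, PARALLEL, buggy=False)
reduction = before / after
headroom = GPU_GB - after  # VRAM left for weights and everything else

print(f"before fix: {before:.0f} GB, after fix: {after:.0f} GB")
print(f"reduction: {reduction:.0f}x, headroom on a {GPU_GB:.0f} GB GPU: {headroom:.0f} GB")
```

Under that assumption, the cache alone would have filled the whole 48GB card before the fix, while afterwards most of the card is left for the model weights and other buffers.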
Why It Matters
For developers, this sharply lowers the cost and hardware barrier to serving long-context models at scale.