llama.cpp merges llama-server VRAM fix for SSM models
A simple server fix just unlocked massive context windows for everyone.
Deep Dive
A critical bug fix in llama.cpp's server has been merged, drastically reducing VRAM usage for SSM hybrid models such as Qwen3-Next and Nemotron 3 Nano. Previously, running a 1M-context model with eight parallel slots could waste 48 GB of VRAM on cache alone, an eight-fold blow-up that matches the slot count. The fix brings usage down to the expected 6 GB, meaning a single 48 GB GPU can now serve eight simultaneous users with 1M context each, a major efficiency breakthrough.
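To make the serving scenario concrete, here is a minimal sketch of eight concurrent clients hitting llama-server's OpenAI-compatible chat endpoint, one request per slot. The endpoint path and default port are standard llama-server behavior; the model file, launch flags in the comment, and the prompt are illustrative assumptions, not taken from the fix itself.

```python
# Minimal sketch: eight concurrent clients against one llama-server instance.
# Assumes the server was launched with eight parallel slots and a large
# context, e.g.:  llama-server -m model.gguf -c 1048576 -np 8
# (-c/--ctx-size and -np/--parallel are real llama-server flags; the model
# path is a placeholder.)
import concurrent.futures
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/chat/completions"  # llama-server default port

def ask(user_id: int) -> str:
    """Send one chat request and return the assistant's reply text."""
    payload = json.dumps({
        "messages": [{"role": "user", "content": f"Hello from user {user_id}"}],
        "max_tokens": 32,
    }).encode()
    req = urllib.request.Request(
        SERVER_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Eight simultaneous requests, one per server slot.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for i, answer in enumerate(pool.map(ask, range(8))):
            print(f"user {i}: {answer!r}")
```

With the fix, each of these eight slots shares the roughly 6 GB of cache the article cites instead of ballooning to 48 GB, leaving the rest of a 48 GB card for model weights.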
Why It Matters
For developers, this dramatically lowers the cost and hardware barrier to serving cutting-edge, long-context AI models at scale.