vLLM v0.22.0 matures DeepSeek V4 support and cuts batch-invariant inference latency by 28.9%
459 commits, 230 contributors, and a new Rust frontend for data-parallel serving.
vLLM v0.22.0 is a large release, with 459 commits from 230 contributors, focused on production-grade stability and performance. The biggest highlight is DeepSeek V4 maturity: the model was reorganized into a dedicated package (`vllm/models/deepseek_v4/`) and gained NVFP4 fused MoE support, improved CUDA graph handling, and MTP speculative decoding for faster generation. Model Runner V2 (MRv2) takes a step toward becoming the default engine: an oracle now selects it automatically for Qwen3 dense models, and it gains sleep-mode weight reload and shared KV-cache layers. An experimental Rust frontend integration also landed, including a DP Supervisor for data-parallel serving, with the aim of better performance and scalability.
Batch-invariant inference also gets faster: it now supports Cutlass FP8, for a 28.9% end-to-end latency reduction on compatible hardware. The new multi-tier KV cache offloading framework moves cached data beyond CPU memory to secondary tiers (Python filesystem secondary tiers, DSv4, and Mooncake disk storage), which makes room for much longer contexts. Model support expanded significantly, with new architectures (MiniCPM-V 4.6, InternS2 Preview, OpenVLA, MolmoWeb) and improvements to speculative decoding backends (custom callable proposer, post-norm EAGLE-3, peagle speculators). Tool-calling parsers such as Apertus and better Qwen3Coder schema resolution round out the release, making vLLM more versatile for production AI serving.
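To make the idea of multi-tier offloading concrete, here is a minimal sketch of a two-tier block store: a bounded in-memory tier that spills its least recently used blocks to files on disk and promotes them back on access. It is a conceptual toy, not vLLM's offloading framework; `TieredKVStore`, its methods, and the `.bin` file layout are hypothetical names chosen for illustration.

```python
import os
import tempfile
from collections import OrderedDict
from typing import Optional


class TieredKVStore:
    """Toy two-tier store: bounded in-memory LRU tier backed by files on disk.

    Hypothetical illustration only; this is not vLLM's offloading API.
    """

    def __init__(self, memory_capacity: int, disk_dir: str) -> None:
        self.memory_capacity = memory_capacity
        self.disk_dir = disk_dir
        self.memory: "OrderedDict[str, bytes]" = OrderedDict()
        os.makedirs(disk_dir, exist_ok=True)

    def _disk_path(self, block_id: str) -> str:
        return os.path.join(self.disk_dir, f"{block_id}.bin")

    def put(self, block_id: str, data: bytes) -> None:
        # Insert into the fast tier and mark the block as most recently used.
        self.memory[block_id] = data
        self.memory.move_to_end(block_id)
        # Spill least recently used blocks to disk once over capacity.
        while len(self.memory) > self.memory_capacity:
            victim_id, victim_data = self.memory.popitem(last=False)
            with open(self._disk_path(victim_id), "wb") as f:
                f.write(victim_data)

    def get(self, block_id: str) -> Optional[bytes]:
        # Fast path: the block is still in memory.
        if block_id in self.memory:
            self.memory.move_to_end(block_id)
            return self.memory[block_id]
        # Slow path: load from disk and promote back into memory.
        path = self._disk_path(block_id)
        if os.path.exists(path):
            with open(path, "rb") as f:
                data = f.read()
            self.put(block_id, data)
            return data
        return None

    def tier_of(self, block_id: str) -> Optional[str]:
        # Report which tier would serve the block, or None if it is unknown.
        if block_id in self.memory:
            return "memory"
        if os.path.exists(self._disk_path(block_id)):
            return "disk"
        return None


# Usage: a memory tier of two blocks, with a temporary directory as the disk tier.
demo = TieredKVStore(memory_capacity=2, disk_dir=tempfile.mkdtemp())
for name in ("a", "b", "c"):
    demo.put(name, name.encode())
print(demo.tier_of("a"), demo.tier_of("c"))
```

The trade-off the sketch shows is the usual one for tiered caches: a block that no longer fits in the fast tier is kept on slower storage rather than discarded, so a later request pays a read instead of a recomputation.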
- DeepSeek V4 reorganized into dedicated package with NVFP4 fused MoE, CUDA graph improvements, and MTP speculative decoding for faster inference.
- Batch-invariant inference gains Cutlass FP8 support delivering a 28.9% end-to-end latency improvement.
- New experimental Rust frontend integration includes a DP Supervisor for data-parallel serving, and multi-tier KV cache offloading extends KV cache storage beyond CPU memory.
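For readers who want to translate the 28.9% figure into a speedup factor, the arithmetic can be sketched as below. This assumes the number is a reduction in end-to-end latency relative to a previous baseline that the article does not specify; the function name and the 1000 ms baseline are illustrative, not measurements from the release.

```python
def speedup_from_latency_reduction(reduction_pct: float) -> float:
    """Convert a percentage latency reduction into an equivalent speedup factor.

    If latency drops by r percent, the new latency is (1 - r/100) of the old
    one, so the same work completes 1 / (1 - r/100) times faster.
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Illustrative baseline only: a request that used to take 1000 ms.
baseline_ms = 1000.0
reduction_pct = 28.9
new_latency_ms = baseline_ms * (1.0 - reduction_pct / 100.0)
speedup = speedup_from_latency_reduction(reduction_pct)
print(f"{baseline_ms:.0f} ms -> {new_latency_ms:.0f} ms, about {speedup:.2f}x faster")
```

Under that assumption, a 28.9% latency reduction corresponds to roughly a 1.41x speedup, which is a larger number than the percentage alone might suggest.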
Why It Matters
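For teams running large language models in production, vLLM v0.22.0 brings lower latency for batch-invariant inference, more mature DeepSeek V4 support, and KV cache offloading that reaches beyond CPU memory.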