vLLM v0.22.0 matures DeepSeek V4 support and cuts batch-invariant inference latency by 28.9%

vLLM Releases May 30, 2026

⚡459 commits, 230 contributors, and a new Rust frontend for data-parallel serving.

Deep Dive

vLLM v0.22.0 is a large release, with 459 commits from 230 contributors, focused on production stability and performance. The headline change is more mature DeepSeek V4 support: the model code was reorganized into a dedicated package (`vllm/models/deepseek_v4/`) and gained NVFP4 fused MoE support, improved CUDA graph handling, and MTP speculative decoding for faster generation. Model Runner V2 (MRv2) moves closer to becoming the default engine: an oracle now selects it automatically for Qwen3 dense models, and it gains sleep-mode weight reload and shared KV-cache layers. An experimental Rust frontend integration also landed, including a DP Supervisor for data-parallel serving, and is aimed at better performance and scalability.
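To make the speculative decoding idea concrete, here is a minimal sketch of a greedy draft-and-verify loop. It is not vLLM's implementation or API: `speculative_step`, `generate`, and the toy `draft_next` and `target_next` functions are hypothetical names for illustration. In an MTP setup the extra prediction heads would play the role of the draft, and a real engine verifies all proposed tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(context: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Propose k draft tokens, keep the prefix the target agrees with,
    then append one token chosen by the target itself."""
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    accepted: List[Token] = []
    for tok in proposed:
        if target_next(context + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first disagreement: discard the rest of the draft

    # The target always contributes one token (a correction or a bonus),
    # so every step makes progress even if the whole draft is rejected.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prompt) + max_new]


# Toy models: the target counts upward modulo 10; the draft agrees
# except right after a 4, where it guesses wrong.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


result = generate([0], draft_next, target_next, k=3, max_new=8)
```

With greedy verification the output matches what the target alone would produce. The gain comes from accepting several tokens per target step whenever the draft is right.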

On the performance side, batch-invariant inference now supports Cutlass FP8, for a reported 28.9% end-to-end latency reduction on compatible hardware. The new multi-tier KV cache offloading framework allows offloading beyond CPU memory to Python filesystem secondary tiers, DSv4, and Mooncake disk storage, making room for much longer contexts.

Model support also expanded with new architectures (MiniCPM-V 4.6, InternS2 Preview, OpenVLA, MolmoWeb) and improvements to speculative decoding backends (custom callable proposer, post-norm EAGLE-3, peagle speculators). New tool-calling parsers such as Apertus and better Qwen3Coder schema resolution round out the release.
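The idea behind multi-tier offloading can be sketched as a chain of bounded caches in which evictions cascade from the fastest tier to slower ones. This is an illustrative toy under stated assumptions, not vLLM's offloading framework: the `Tier` and `TieredCache` classes, the LRU policy, and the tier names and capacities are all hypothetical.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class Tier:
    """A bounded LRU store standing in for one memory tier."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, key: str, value: bytes) -> Optional[Tuple[str, bytes]]:
        """Insert a block; return the evicted (key, value) if over capacity."""
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)  # least recently used
        return None


class TieredCache:
    def __init__(self, tiers: List[Tier]) -> None:
        self.tiers = tiers  # ordered fastest to slowest

    def put(self, key: str, value: bytes, level: int = 0) -> None:
        # Evictions cascade downward; blocks pushed out of the last
        # tier are dropped and would have to be recomputed.
        if level >= len(self.tiers):
            return
        evicted = self.tiers[level].put(key, value)
        if evicted is not None:
            self.put(evicted[0], evicted[1], level + 1)

    def get(self, key: str) -> Optional[bytes]:
        for tier in self.tiers:
            if key in tier.blocks:
                value = tier.blocks.pop(key)
                self.put(key, value)  # promote to the fastest tier
                return value
        return None

    def where(self, key: str) -> Optional[str]:
        for tier in self.tiers:
            if key in tier.blocks:
                return tier.name
        return None


cache = TieredCache([Tier("gpu", 2), Tier("cpu", 2), Tier("disk", 4)])
for i in range(5):
    cache.put(f"block{i}", b"kv")
```

After these five inserts the oldest block sits in the `disk` tier instead of being discarded, and a later `cache.get("block0")` promotes it back to the `gpu` tier. A real system would also have to account for transfer cost between tiers, which this sketch ignores.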

Key Points

DeepSeek V4 reorganized into dedicated package with NVFP4 fused MoE, CUDA graph improvements, and MTP speculative decoding for faster inference.
Batch-invariant inference gains Cutlass FP8 support delivering a 28.9% end-to-end latency improvement.
New experimental Rust frontend with DP Supervisor enables data-parallel serving, and multi-tier KV cache offloading extends KV cache storage beyond CPU memory.
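A latency reduction and a speedup factor are not the same number, so it can help to convert between them. The sketch below assumes the 28.9% figure means end-to-end latency fell by 28.9% relative to the previous baseline. The helper names and the 100 ms baseline are made up for illustration.

```python
def new_latency(baseline_ms: float, reduction: float) -> float:
    """Latency after cutting `reduction` (a fraction) off the baseline."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return baseline_ms * (1.0 - reduction)


def speedup_from_latency_reduction(reduction: float) -> float:
    """Equivalent speedup factor: old latency divided by new latency."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


after_ms = new_latency(100.0, 0.289)              # about 71.1 ms
speedup = speedup_from_latency_reduction(0.289)   # about 1.41x
```

Under that reading, a request that took 100 ms would take about 71.1 ms, which corresponds to roughly a 1.41x speedup. Throughput scales by the same factor only if concurrency and batching stay the same.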

Why It Matters

For teams running large language models in production, vLLM v0.22.0 brings lower latency for batch-invariant FP8 inference, more mature DeepSeek V4 support, and KV cache offloading beyond CPU memory.
