Developer Tools

v0.18.0

The high-performance inference engine's latest release features 445 commits from 213 contributors and enables new deployment architectures, from gRPC serving to GPU-less multimodal preprocessing.

Deep Dive

vLLM, a leading open-source high-performance inference engine for LLMs, has released version 0.18.0. This major release, comprising 445 commits from 213 contributors, introduces several enterprise-grade features. Most notably, it now supports gRPC serving alongside the existing HTTP/REST API, enabling high-performance, low-latency RPC-based deployments. A new `vllm launch render` command allows GPU-less preprocessing and rendering of multimodal inputs, separating compute-intensive input preparation from GPU inference. Furthermore, NGram speculative decoding has been optimized to run entirely on the GPU and is now compatible with the async scheduler, drastically reducing the overhead of this speed-up technique.
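To make the NGram idea concrete, here is a minimal, dependency-free sketch of the proposal step: draft tokens are guessed by matching the most recent n tokens against an earlier occurrence in the sequence, and the target model then verifies the drafts in a single forward pass. This is an illustration of the general technique only, not vLLM's implementation; the function name and parameters are invented for the example, and the real version runs this matching on the GPU.

```python
def ngram_propose(tokens: list[int], n: int = 3, k: int = 5) -> list[int]:
    """Propose up to k draft tokens by matching the last n tokens
    against an earlier occurrence in the sequence.

    Illustrative sketch only; vLLM's optimized version performs
    this lookup entirely on the GPU.
    """
    if len(tokens) <= n:
        return []
    pattern = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the pattern.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start : start + n] == pattern:
            # Propose the tokens that followed the match last time;
            # the target model will accept or reject them in one pass.
            return tokens[start + n : start + n + k]
    return []


# Example: the sequence repeated "1 2 3" before, so after "... 2 3"
# the lookup proposes "4 5" as draft tokens.
print(ngram_propose([1, 2, 3, 4, 5, 1, 2, 3], n=2, k=2))  # -> [4, 5]
```

Because the whole loop is just token matching, keeping it GPU-resident and compatible with the async scheduler avoids per-step host-device synchronization, which is presumably where the reported overhead reduction comes from.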

Under the hood, significant improvements target memory management and model support. KV cache offloading is now smarter: only frequently reused attention blocks are stored in CPU memory. A new FlexKV backend and support for multiple KV groups provide more offloading options. The update also marks a milestone for Elastic Expert Parallelism (EP), integrating with NIXL-EP to enable dynamic GPU scaling for Mixture-of-Experts (MoE) models. On the model front, v0.18.0 adds support for new architectures such as Sarvam MoE and OLMo Hybrid, along with speculative decoding targets for models like Qwen3.5 and Kimi K2.5. Performance fixes and dependency updates, including FlashInfer 0.6.6, round out a release focused on scalability, efficiency, and broader model compatibility.
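As a rough mental model of frequency-based offloading, the sketch below keeps a reuse counter per cache block and only admits blocks into CPU memory once they have proven reusable. All names and the threshold heuristic are invented for illustration; vLLM's actual offloading logic and the FlexKV backend implement their own policies.

```python
from collections import Counter


class FrequencyOffloadPolicy:
    """Toy sketch of frequency-based KV offloading: attention blocks
    evicted from the GPU are kept in CPU memory only if they have been
    reused at least `threshold` times. Illustrative names only, not
    vLLM's actual offloading or FlexKV code.
    """

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.hits = Counter()  # block hash -> reuse count

    def record_access(self, block_hash: int) -> None:
        self.hits[block_hash] += 1

    def should_offload(self, block_hash: int) -> bool:
        # Offload only blocks that have proven reusable; cold blocks
        # are simply dropped, saving host memory and copy bandwidth.
        return self.hits[block_hash] >= self.threshold


policy = FrequencyOffloadPolicy(threshold=2)
policy.record_access(0xABC)
print(policy.should_offload(0xABC))  # False: seen once, not worth keeping
policy.record_access(0xABC)
print(policy.should_offload(0xABC))  # True: reused, keep it in CPU memory
```

The payoff of such a policy is that CPU memory holds only blocks likely to be requested again, rather than mirroring the entire cache.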

Key Points
  • Adds gRPC serving support via a new `--grpc` flag for high-performance RPC-based model deployment (see the client sketch after this list).
  • Introduces GPU-less rendering for multimodal preprocessing, decoupling it from GPU inference workloads.
  • Enhances KV cache offloading with frequency-based CPU storage and adds a new FlexKV backend.
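For a sense of what an RPC-based deployment looks like from the client side, here is a hedged sketch using grpcio. The stub and message names below are placeholders, since the actual service definitions ship with vLLM's proto files; only the channel and stub mechanics are standard grpcio usage.

```python
import grpc

# Placeholder stubs: the real module, service, and message names come from
# vLLM's .proto definitions and will differ from these assumed names.
from vllm_grpc_pb2 import GenerateRequest        # assumed message name
from vllm_grpc_pb2_grpc import GenerationStub    # assumed service stub name

# Connect to a vLLM server started with gRPC serving enabled.
channel = grpc.insecure_channel("localhost:50051")
stub = GenerationStub(channel)

# One unary request/response round trip; a production client would more
# likely use a streaming RPC to receive tokens as they are generated.
response = stub.Generate(GenerateRequest(prompt="Hello, world", max_tokens=16))
print(response)
```

Compared with HTTP/REST, gRPC's persistent HTTP/2 connections and binary protobuf encoding are what make the low-latency, high-throughput claims plausible for service-to-service traffic.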

Why It Matters

This release makes deploying and scaling production AI applications more efficient and flexible: gRPC serving and elastic expert parallelism ease scaling, while smarter KV offloading and GPU-resident speculative decoding cut costs and latency.