Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub
Achieved record 1.1M tok/s inference speed using 96 B200 GPUs and novel MTP-1 speculative decoding.
A Google Cloud team has demonstrated a landmark result in large language model inference, pushing the Qwen 3.5 27B model past 1.1 million generated tokens per second. The feat was accomplished on a 12-node cluster with 96 of NVIDIA's latest B200 GPUs (eight per node), orchestrated via Google Kubernetes Engine (GKE). Crucially, the team used the standard vLLM v0.18.0 serving framework without custom kernels, showing the performance is attainable with existing open-source tooling. The results, detailed in a Medium post, highlight the raw throughput potential of modern AI hardware when paired with carefully tuned software configurations.
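Because the stack is stock vLLM, the cluster looks like any OpenAI-compatible endpoint to callers. A minimal client sketch follows; the gateway URL and served model id are illustrative placeholders, not details from the post:

```python
# Hypothetical client against the cluster's OpenAI-compatible vLLM endpoint.
# The base URL and model id below are placeholders, not from the benchmark.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference-gateway.example:8000/v1",  # placeholder gateway address
    api_key="EMPTY",  # vLLM accepts any key unless --api-key auth is configured
)

resp = client.completions.create(
    model="Qwen/Qwen3.5-27B",  # placeholder id for the served model
    prompt="Summarize the following article: ...",
    max_tokens=128,
)
print(resp.choices[0].text)
```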
The breakthrough hinged on four specific optimizations that together lifted per-node throughput from 9,500 to 95,000 tokens/sec. The most critical was MTP-1 speculative decoding, in which a small, fast draft model proposes tokens that the larger model verifies in parallel; without it, the memory-bound decode phase left GPU compute utilization effectively at 0%. The other three changes were choosing data parallelism (DP=8) over tensor parallelism (TP=8), shrinking the context window from 131K to 4K tokens, and storing the KV cache in FP8 precision. The setup sustained 96.5% scaling efficiency at 12 nodes, though the team noted that routing requests through an Inference Gateway with KV-cache-aware routing added 35% overhead.
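As a concrete illustration, the four settings map onto vLLM engine arguments roughly as follows. This is a minimal sketch under stated assumptions, not the team's published config: the model id is a placeholder, and the speculative-decoding keys in particular vary across vLLM versions, so check the release you run (data parallelism is also often set at the serving layer, e.g. `vllm serve --data-parallel-size 8`).

```python
# Sketch of the four optimizations as vLLM engine arguments (names follow
# recent vLLM releases; treat the speculative_config keys as assumptions).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B",   # placeholder model id
    tensor_parallel_size=1,     # TP=1 per replica...
    data_parallel_size=8,       # ...with DP=8: one full model copy per GPU
    max_model_len=4096,         # context window cut from 131K to 4K
    kv_cache_dtype="fp8",       # FP8 KV cache halves cache memory vs FP16
    speculative_config={        # MTP-1: draft one token per step for the
        "method": "mtp",        # target model to verify in parallel
        "num_speculative_tokens": 1,
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

The DP-over-TP choice trades inter-GPU communication for memory: a 27B model fits on a single B200, so eight independent replicas per node avoid the all-reduce traffic that TP=8 would incur on every layer.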
This benchmark is significant because it resets expectations for production-scale AI inference. It demonstrates that with the right combination of hardware (B200s), orchestration (GKE), and serving techniques (vLLM with speculative decoding), organizations can now serve dense 27B-parameter models at unprecedented speeds for applications requiring massive, real-time text generation.
- Achieved 1,103,941 tokens/sec using 96 NVIDIA B200 GPUs on 12 nodes with standard vLLM.
- MTP-1 speculative decoding was the single biggest win, lifting the GPUs from near-zero utilization to viable throughput.
- Maintained 96.5% scaling efficiency at 12 nodes, i.e. near-linear gains as nodes are added (sanity-checked in the sketch below).
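The scaling claim can be checked directly from the reported figures; the small gap versus the quoted 96.5% suggests the single-node baseline was slightly above the rounded 95,000 tokens/sec.

```python
# Back-of-the-envelope check of scaling efficiency from the reported numbers.
per_node = 95_000        # tokens/sec on a single node (rounded, from the post)
nodes = 12
measured = 1_103_941     # aggregate tokens/sec at 12 nodes

ideal = per_node * nodes             # 1,140,000 tok/s if scaling were perfectly linear
efficiency = measured / ideal        # fraction of ideal throughput retained
print(f"{efficiency:.1%}")           # -> 96.8%, close to the quoted 96.5%
```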
Why It Matters
This benchmark proves ultra-fast, large-model inference is commercially viable today, unlocking real-time applications for summarization, translation, and code generation at scale.