Research & Papers

[D] 1M tokens/second serving Qwen 3.5 27B on B200 GPUs: benchmark results and findings

Achieving 1.1 million tokens per second with 97% scaling efficiency on new B200 hardware.

Deep Dive

Google Cloud engineers have demonstrated a landmark inference speed of 1.1 million tokens per second with the Qwen 3.5 27B model. The benchmark ran on a cluster of 96 NVIDIA B200 GPUs orchestrated by Google Kubernetes Engine (GKE), using the vLLM v0.18.0 serving framework. A key technical finding was that data parallelism (DP=8) proved far more effective than tensor parallelism for a model of this size on B200s, boosting throughput by nearly 4x. The team's 'InferenceMAX' methodology tested a worst-case scenario: 1,024 input tokens and 512 output tokens per request, with a 0% prefix-cache hit rate.
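
To make the workload concrete, below is a minimal client sketch in the spirit of that setup: it fires fixed-shape requests (1,024 tokens in, 512 out) with randomized prompts so the prefix cache never hits. The endpoint URL, model identifier, and concurrency level are placeholder assumptions, word count stands in for exact token count, and none of this is the team's actual harness.

    # Sketch of an InferenceMAX-style worst-case load against a vLLM
    # OpenAI-compatible endpoint; URL, model id, and concurrency are assumed.
    import asyncio
    import random
    import time

    from openai import AsyncOpenAI

    ENDPOINT = "http://localhost:8000/v1"    # assumption: local vLLM server
    MODEL = "Qwen/Qwen3.5-27B"               # assumption: placeholder model id
    INPUT_TOKENS, OUTPUT_TOKENS = 1024, 512  # workload shape from the post
    CONCURRENCY = 64                         # assumption: in-flight requests

    client = AsyncOpenAI(base_url=ENDPOINT, api_key="EMPTY")

    def random_prompt(n: int) -> str:
        # Unique random numbers per request keep prefix-cache hits at ~0%;
        # n words is only a rough proxy for n tokens.
        return " ".join(str(random.randint(0, 10**6)) for _ in range(n))

    async def one_request() -> int:
        resp = await client.completions.create(
            model=MODEL,
            prompt=random_prompt(INPUT_TOKENS),
            max_tokens=OUTPUT_TOKENS,
            temperature=1.0,
            extra_body={"ignore_eos": True},  # vLLM extra: emit all 512 tokens
        )
        return resp.usage.completion_tokens

    async def main() -> None:
        start = time.perf_counter()
        counts = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
        elapsed = time.perf_counter() - start
        print(f"{sum(counts) / elapsed:,.0f} output tokens/sec from this client")

    asyncio.run(main())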

The architecture showed exceptional scaling efficiency, reaching 97.1% at 8 nodes and 96.5% at 12 nodes, while Time Per Output Token (TPOT) remained flat at ~46 ms. A critical discovery was the necessity of Multi-Process Service (MPS) mode: without it, GPU utilization sat at 0% even at the baseline multi-token prediction setting (MTP-1), and higher MTP levels (MTP-5) caused outright crashes. The test also compared routing overhead, finding that a KV-cache-aware Inference Gateway added about 35% more latency than simple ClusterIP round-robin routing, with a single Endpoint Picker (EPP) pod identified as the system bottleneck.
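
As a back-of-envelope check, the standard definition of scaling efficiency (aggregate throughput divided by node count times single-node throughput) lets us recover what these figures imply about per-node throughput and the number of requests in flight:

    # Back-of-envelope check on the reported figures; the formulas are the
    # standard definitions, not anything taken from the original post.
    TOTAL_TPS = 1.1e6   # aggregate output tokens/sec on 96 GPUs (12 nodes)
    NODES = 12
    EFFICIENCY = 0.965  # reported scaling efficiency at 12 nodes
    TPOT_S = 0.046      # ~46 ms per output token, flat across node counts

    # efficiency = TOTAL_TPS / (NODES * single_node_tps), solved for one node:
    single_node_tps = TOTAL_TPS / (NODES * EFFICIENCY)
    print(f"implied single-node throughput: {single_node_tps:,.0f} tok/s")

    # Each sequence decodes at 1/TPOT tokens/sec, so sustaining the aggregate
    # rate implies roughly TOTAL_TPS * TPOT_S sequences in flight at once.
    print(f"implied concurrent sequences: {TOTAL_TPS * TPOT_S:,.0f}")

That works out to roughly 95,000 tokens/sec per node and about 50,000 sequences decoding concurrently, consistent with the flat TPOT: latency holds steady while batch size does the work.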

Key Points
  • Achieved 1.1M tokens/sec serving Qwen 3.5 27B using 96 NVIDIA B200 GPUs on GKE.
  • Data parallelism (DP=8) delivered nearly 4x higher throughput than tensor parallelism for this model on B200s (a launch sketch follows this list).
  • System showed 97.1% scaling efficiency with 8 nodes; MPS mode was essential for any GPU utilization.
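
As a rough illustration of the DP-over-TP choice (not the team's actual deployment), a vLLM server could be launched with eight data-parallel replicas roughly as follows; the model identifier is a placeholder and flag spellings may vary across vLLM versions:

    # Hedged sketch: favor data parallelism over tensor parallelism for a
    # model that fits on a single B200, per the post's DP=8 finding.
    import subprocess

    subprocess.run(
        [
            "vllm", "serve", "Qwen/Qwen3.5-27B",  # assumed model identifier
            "--data-parallel-size", "8",    # 8 full replicas, one per GPU
            "--tensor-parallel-size", "1",  # no intra-layer sharding
            "--max-model-len", "2048",      # headroom for 1024 in + 512 out
        ],
        check=True,
    )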

Why It Matters

This benchmark sets a new public performance standard for large-scale AI inference, proving the scalability of next-gen hardware for enterprise deployment.