Open Source

Community benchmark: MTP delivers 3.34x faster inference on Gemma 4 and Qwen 3.6

r/LocalLLaMA May 30, 2026

⚡vLLM hits 132 tok/s on Gemma 4 with multi-token prediction – 3.34x speedup.

Deep Dive

A developer recently benchmarked Multi-Token Prediction (MTP) on two dense models, Google's Gemma 4 31B (FP8) and Alibaba's Qwen 3.6 27B (GGUF Q8_0), using both vLLM and llama.cpp on an RTX PRO 6000 Blackwell with 96GB VRAM. The best result came from vLLM with Gemma 4: 132.52 tok/s with n=5 speculative tokens, a 3.34x speedup over standard decoding (39.69 tok/s). For Qwen 3.6 on llama.cpp, the best was 117.70 tok/s with n=3, a 2.59x improvement.

The tester notes that vLLM's MTP support is more mature than llama.cpp's, and that the optimal number of speculative tokens is not simply the highest setting; it depends on the model and engine combination. For Gemma 4, n=5 was best; for Qwen 3.6, n=3 hit the sweet spot, with throughput oscillating at higher values.
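The speedup figures follow directly from the reported throughputs. The Python sketch below reproduces that arithmetic and adds a small helper for picking the best speculative token count from a sweep. The function names and the idea of storing a sweep as a mapping from n to tok/s are illustrative assumptions, not part of the original benchmark. The Qwen 3.6 baseline printed here is derived from the reported 2.59x ratio; the tester's own baseline figure is not quoted in this summary.

```python
from typing import Mapping


def speedup(mtp_tok_s: float, baseline_tok_s: float) -> float:
    """Ratio of MTP throughput to standard-decoding throughput."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return mtp_tok_s / baseline_tok_s


def implied_baseline(mtp_tok_s: float, reported_speedup: float) -> float:
    """Back out the baseline throughput from an MTP result and its speedup."""
    if reported_speedup <= 0:
        raise ValueError("speedup must be positive")
    return mtp_tok_s / reported_speedup


def best_token_count(sweep: Mapping[int, float]) -> int:
    """Return the n with the highest measured tok/s (smallest n wins ties).

    `sweep` is a hypothetical container: {speculative token count: tok/s}.
    """
    if not sweep:
        raise ValueError("sweep is empty")
    return min(sweep, key=lambda n: (-sweep[n], n))


# Figures reported in the benchmark.
gemma_speedup = speedup(132.52, 39.69)
# Derived value, not a published measurement.
qwen_baseline = implied_baseline(117.70, 2.59)

print(f"Gemma 4 on vLLM: {gemma_speedup:.2f}x")
print(f"Qwen 3.6 on llama.cpp: implied baseline of about {qwen_baseline:.2f} tok/s")
```

To use `best_token_count`, fill the mapping with your own measurements for each n you test. Because throughput can oscillate at higher values, it is safer to sweep the whole range than to stop at the first drop.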

Despite the speed gains, the tester observed negligible VRAM overhead (the draft model is only 76M parameters for Gemma 4) and no quality degradation, since the target model verifies every drafted token before accepting it. The findings support the claim that dense models benefit significantly from MTP due to their uniform forward pass. Although this is not a rigorous academic evaluation, the directional data suggests MTP is a practical optimization for production inference. Developers using vLLM or llama.cpp should benchmark their own models to find the optimal speculative token count for their model, engine, and hardware.
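To see why verification preserves quality, consider a toy version of the draft-then-verify loop under greedy decoding. This is a minimal sketch, not how vLLM or llama.cpp implement MTP: the two "models" are hypothetical stand-in functions over integer tokens, and the target checks drafted tokens one at a time, whereas a real engine scores them in a single batched forward pass. Engines that sample rather than decode greedily use a probabilistic acceptance rule, which is not shown here.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Standard decoding: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, n: int) -> List[int]:
    """Draft n tokens, keep the prefix the target agrees with, then repeat."""
    tokens = list(prompt)
    produced = 0
    while produced < max_new:
        # The draft model proposes n tokens autoregressively.
        proposal: List[int] = []
        for _ in range(n):
            proposal.append(draft(tokens + proposal))

        # The target verifies each proposed token in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch
                break
        else:
            # Every drafted token matched, so the target adds one more.
            accepted.append(target(tokens + accepted))

        accepted = accepted[: max_new - produced]
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens[len(prompt):]


def toy_target(tokens: List[int]) -> int:
    """Hypothetical deterministic target model."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical draft model that is wrong on every fourth position."""
    guess = toy_target(tokens)
    return (guess + 1) % 50 if len(tokens) % 4 == 0 else guess


reference = greedy_decode(toy_target, [1, 2, 3], 20)
for n in range(1, 7):
    assert speculative_decode(toy_target, toy_draft, [1, 2, 3], 20, n) == reference
print("speculative output matches standard decoding for n = 1..6")
```

Every token that reaches the output is one the target itself would have chosen, so the draft model only affects how many tokens are produced per step, never which ones. A draft that is often wrong wastes work on rejected tokens, which is one plausible reason a larger n is not always faster.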

Key Points

vLLM + Gemma 4 hit 132.52 tok/s, a 3.34x speedup over standard decoding at 39.69 tok/s.
Optimal speculative token count varies: n=5 for Gemma 4 on vLLM, n=3 for Qwen 3.6 on llama.cpp.
MTP uses a tiny draft model (76M params) and shows negligible VRAM increase; quality is preserved via token verification.
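A rough calculation shows why a 76M-parameter draft model barely registers on a 96GB card. The sketch below estimates weight memory only. The draft model's precision is not stated in the source, so both byte widths are assumptions, and the KV cache and activations are ignored.

```python
def weight_footprint_gb(num_params: int, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


DRAFT_PARAMS = 76_000_000  # draft model size reported for Gemma 4
GPU_VRAM_GB = 96           # RTX PRO 6000 Blackwell, as reported

# Assumed precisions; the source does not say which one the draft model uses.
for label, width in (("1 byte/param (e.g. FP8)", 1), ("2 bytes/param (e.g. FP16)", 2)):
    gb = weight_footprint_gb(DRAFT_PARAMS, width)
    share = 100 * gb / GPU_VRAM_GB
    print(f"{label}: {gb:.3f} GB, about {share:.2f}% of {GPU_VRAM_GB} GB")
```

Under either assumption the draft weights take well under 1% of the card's memory, which is consistent with the tester's observation of negligible overhead.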

Why It Matters

In this benchmark, MTP made dense model inference roughly 2.6x to 3.3x faster without quality loss, which can cut costs for real-time applications.
