Gemma 4 26B Nears 600 Tok/s on a Single RTX 5090
A 2.56x speedup over baseline at the optimal speculative-token count.
A recent Reddit benchmark tested DFlash speculative decoding with Google's Gemma 4 26B (A4B) model on an RTX 5090 (32 GB VRAM) under vLLM 0.19.2rc1. The main model was quantized to AWQ 4-bit and paired with a DFlash draft model. On a workload of 256 input and 1024 output tokens, the baseline without speculative decoding reached ~228 output tok/s with a mean end-to-end latency of 4455 ms. Enabling DFlash with num_speculative_tokens=13 and max_num_batched_tokens=8192 pushed throughput to ~578 tok/s and cut mean latency to 1738 ms, a 2.56x speedup.
The benchmark also surfaced a useful nuance: the fastest setting on average was not the best for tail latency. num_speculative_tokens=13 with max_num_batched_tokens=4096 gave slightly better mean latency, but its p95 (tail) latency was worse; moving to 8192 batched tokens smoothed out the tail, making that configuration the better fit for serving. The post includes charts and a recommended vLLM command (a sketch of such a launch follows), and the author asks whether similar optimal speculative-token counts hold on an RTX 4090 or with other Gemma/Qwen models (see the sweep sketch after the key points below).
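The post's exact command is not reproduced here, but a launch along these lines illustrates where the two key knobs sit. This is a minimal sketch: the checkpoint path is a placeholder, and the --speculative-config flag and its JSON shape follow recent vLLM CLI conventions, which may differ in the 0.19.2rc1 build used in the benchmark.

```bash
# Hypothetical launch sketch: flag names follow recent vLLM conventions
# and may differ in the 0.19.2rc1 build from the post.
# AWQ_GEMMA_PATH is a placeholder; the post does not name the exact AWQ checkpoint.
vllm serve "$AWQ_GEMMA_PATH" \
  --quantization awq \
  --max-num-batched-tokens 8192 \
  --speculative-config '{"model": "z-lab/gemma-4-26B-A4B-it-DFlash",
                         "num_speculative_tokens": 13}'
```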
- Baseline: 228 tok/s; with DFlash speculative decoding: 578 tok/s (2.56x improvement).
- Optimal settings: num_speculative_tokens=13, max_num_batched_tokens=8192 for clean tail latency (p95).
- Setup: RTX 5090 32GB, vLLM 0.19.2rc1, Gemma 4 26B AWQ 4-bit, draft model z-lab/gemma-4-26B-A4B-it-DFlash.
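To probe the author's open question about other GPUs and models, one could sweep the speculative-token count and compare mean and p95 latency at each value. The loop below is a hypothetical sketch: vllm bench serve and its dataset flags are assumed from recent vLLM tooling and are not confirmed for 0.19.2rc1, and the fixed sleep stands in for a proper health check.

```bash
# Hypothetical sweep over num_speculative_tokens, mirroring the post's
# 256-in / 1024-out workload. 'vllm bench serve' is assumed from recent
# vLLM releases and may not exist under this name in 0.19.2rc1.
for K in 5 9 13 17; do
  vllm serve "$AWQ_GEMMA_PATH" \
    --quantization awq \
    --max-num-batched-tokens 8192 \
    --speculative-config "{\"model\": \"z-lab/gemma-4-26B-A4B-it-DFlash\",
                           \"num_speculative_tokens\": $K}" \
    --port 8000 &
  SERVER_PID=$!
  sleep 180  # crude wait for weights to load; poll /health in practice

  vllm bench serve \
    --model "$AWQ_GEMMA_PATH" \
    --dataset-name random \
    --random-input-len 256 \
    --random-output-len 1024 \
    --num-prompts 64 \
    --port 8000

  kill "$SERVER_PID" && wait "$SERVER_PID" 2>/dev/null
done
```

Keeping max_num_batched_tokens fixed at 8192 isolates the effect of the speculative-token count; the post's 4096-vs-8192 comparison could be added as an outer loop.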
Why It Matters
Speculative decoding, where a small draft model proposes a run of tokens that the large model verifies in a single forward pass, can dramatically boost local LLM inference on high-end consumer GPUs, enabling faster, more cost-effective serving.