Google's Gemma 4: MTP vs DFlash on H100 reveals MoE tradeoffs
MTP hits 3.11x speedup on dense 31B, DFlash leads on MoE at 1.73x
A recent benchmark by a Reddit user compared two speculative decoding methods, Google's MTP (multi-token prediction) and z-lab's DFlash, on Gemma 4 models using a single H100 80GB, vLLM, and the SPEED-Bench qualitative dataset (880 prompts across 11 categories). On the dense 31B model at concurrency 1, MTP reached 125.3 output tok/s, a 3.11x speedup over the non-speculative baseline, while DFlash reached 122.1 tok/s (3.03x). At concurrency 16, MTP hit 953 tok/s versus DFlash's 725 tok/s. The results flipped for the MoE 26B-A4B model: at concurrency 1, DFlash delivered 306.4 tok/s (1.73x) versus MTP's 264.2 tok/s (1.49x), and at concurrency 16 DFlash reached 1,957 tok/s against MTP's 1,808 tok/s. Both methods gain less on the MoE model because only 3.8B of its 25.2B total parameters are active per token, so each target forward pass is already cheap and speculation has less compute to save.
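As a rough consistency check on the figures above, dividing each reported throughput by its reported speedup gives the baseline throughput that pair implies. The sketch below does only that arithmetic. The implied baselines are derived here rather than reported by the benchmark, and the dictionary layout and function names are illustrative.

```python
# Figures quoted in the summary above (concurrency 1): method -> (speedup, output tok/s).
REPORTED = {
    "dense 31B": {"MTP": (3.11, 125.3), "DFlash": (3.03, 122.1)},
    "MoE 26B-A4B": {"MTP": (1.49, 264.2), "DFlash": (1.73, 306.4)},
}


def implied_baseline(speedup: float, tok_per_s: float) -> float:
    """Baseline throughput implied by a reported speedup and throughput."""
    return tok_per_s / speedup


def implied_baselines(reported: dict) -> dict:
    """Apply implied_baseline to every (model, method) pair."""
    return {
        model: {
            method: implied_baseline(speedup, tok_per_s)
            for method, (speedup, tok_per_s) in methods.items()
        }
        for model, methods in reported.items()
    }


if __name__ == "__main__":
    for model, methods in implied_baselines(REPORTED).items():
        for method, baseline in methods.items():
            print(f"{model:12s} {method:7s} implied baseline = {baseline:.1f} tok/s")
```

Under this arithmetic both methods imply a baseline of about 40 tok/s for the dense model and about 177 tok/s for the MoE model, so the two sets of figures are internally consistent. The MoE baseline is already more than four times faster, which fits the explanation that speculation has less target compute to save there.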
Across workloads, coding, math, STEM, and reasoning showed greater gains because their token patterns are more predictable, while writing, summarization, and roleplay improved less. Acceptance rates declined sharply after the first few draft positions: position-1 acceptance was ~80% for MTP and ~75% for DFlash, dropping below 20% by position 8. Higher acceptance did not guarantee higher throughput: MTP accepted more tokens, but DFlash still outperformed it on the MoE model because DFlash generates its draft in a single pass. The benchmark shows that the best speculative decoding method depends on model architecture, concurrency, and task type; users should test both approaches in their own setups rather than assume one is universally better.
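One way to see why higher acceptance does not have to mean higher throughput is a simple cost model: tokens emitted per verification step, divided by the relative cost of that step including drafting. The sketch below is an idealized illustration, not the benchmark's methodology. It assumes each per-position rate is the fraction of steps in which that draft position was accepted, and that verifying a draft costs the same as one ordinary target pass. Only the position-1 values (~0.80, ~0.75) and the below-0.20 value at position 8 come from the summary above; the intermediate rates and both draft cost ratios are hypothetical.

```python
def expected_tokens_per_step(marginal_accept: list[float]) -> float:
    """Expected tokens emitted per target verification pass.

    marginal_accept[i] is assumed to be the fraction of steps in which
    draft position i+1 was accepted. The extra 1.0 is the token the
    target model itself contributes on every step.
    """
    return 1.0 + sum(marginal_accept)


def estimated_speedup(marginal_accept: list[float], draft_cost_ratio: float) -> float:
    """Idealized speedup over plain decoding.

    draft_cost_ratio = (time to produce the whole draft) / (time for one
    target pass). Verification is assumed to cost one target pass.
    """
    return expected_tokens_per_step(marginal_accept) / (1.0 + draft_cost_ratio)


# Hypothetical acceptance curves (only the endpoints echo the article).
higher_acceptance = [0.80, 0.66, 0.54, 0.44, 0.36, 0.29, 0.23, 0.18]
lower_acceptance = [0.75, 0.61, 0.49, 0.40, 0.32, 0.26, 0.21, 0.17]

# Hypothetical draft costs relative to one target forward pass.
costly_draft = 0.60  # e.g. a draft produced over several sequential steps
cheap_draft = 0.35   # e.g. a draft produced in a single pass

if __name__ == "__main__":
    a = estimated_speedup(higher_acceptance, costly_draft)
    b = estimated_speedup(lower_acceptance, cheap_draft)
    print(f"higher acceptance, costly draft: {a:.2f}x")
    print(f"lower acceptance, cheap draft:   {b:.2f}x")
```

With these made-up inputs the method with lower acceptance comes out ahead, because its cheaper draft more than offsets the fewer accepted tokens. When the target pass itself is cheap, as with a small active parameter count, the draft cost ratio grows and this effect becomes more pronounced.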
- On the dense 31B model at concurrency 1: MTP = 3.11x speedup (125.3 tok/s), DFlash = 3.03x (122.1 tok/s)
- On the MoE 26B-A4B model at concurrency 1: DFlash = 1.73x speedup (306.4 tok/s), MTP = 1.49x (264.2 tok/s); both gains are smaller than on the dense model because only 3.8B parameters are active per token
- At concurrency 16: dense MTP hit 953 tok/s vs. DFlash's 725 tok/s; MoE DFlash hit 1,957 tok/s vs. MTP's 1,808 tok/s
Why It Matters
The choice of speculative decoding method depends on model architecture: in this benchmark the dense model favored MTP and the MoE model favored DFlash, a difference that affects production LLM serving cost and latency.