Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB
Run a 35B model on an 8GB GPU? DFlash makes it 33% faster.
A developer has demonstrated that DFlash speculative decoding in llama.cpp can dramatically improve generation speed for large MoE models on severely VRAM-limited hardware. The target was Qwen3.5-35B-A3B (Q5_K_M, 24.44 GiB) running on an RTX 2080 SUPER with only 8GB of VRAM, paired with a tiny DFlash draft model (474M parameters, Q4_K_M, just 267.8 MiB). By combining MoE expert CPU offload (-ncmoe 34) with DFlash drafting, they ran the 35B model entirely on consumer-grade hardware.
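For a sense of scale, here is a rough back-of-envelope of the VRAM budget in Python. The model and draft sizes come from the post; the headroom reserved for KV cache, activations, and runtime overhead is an assumed figure for illustration only.

```python
# Rough VRAM budget for the setup described above (illustrative; the exact
# split depends on the GGUF layer layout and KV-cache settings).
GIB = 1024 ** 3
MIB = 1024 ** 2

target_model = 24.44 * GIB   # Qwen3.5-35B-A3B, Q5_K_M (from the post)
draft_model  = 267.8 * MIB   # 474M DFlash draft, Q4_K_M (from the post)
vram         = 8 * GIB       # RTX 2080 SUPER
headroom     = 1.5 * GIB     # assumed: KV cache, activations, CUDA overhead

# Whatever does not fit on the card has to live in system RAM, which is what
# -ncmoe 34 arranges by keeping the MoE expert tensors of 34 layers on the CPU.
min_offload = target_model + draft_model - (vram - headroom)
print(f"Weights that must stay in system RAM: {min_offload / GIB:.1f} GiB")
# -> roughly 18 GiB of expert weights end up served from system RAM
```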
The results are striking: baseline inference without DFlash ran at ~26.8 tok/s, while DFlash with a draft length of 6 tokens reached ~35.6-35.8 tok/s, a 33-34% improvement. The draft acceptance rate peaked at 99.3% at draft-max 6, meaning nearly every drafted token was accepted and little compute was wasted. Longer draft lengths actually reduced throughput, because lower acceptance rates meant more drafted tokens were discarded, as the sketch below illustrates. This technique unlocks viable local inference for state-of-the-art MoE models on budget GPUs, bypassing the need for expensive professional hardware.
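The draft-length tradeoff is easy to see with a small model of the verify cycle. This is a simplified sketch, not DFlash's actual acceptance logic: it assumes each drafted token is accepted independently with a fixed probability, and the 85%/12-token comparison point is hypothetical (the post only reports 99.3% at draft-max 6).

```python
# Minimal model of the draft-length tradeoff (a sketch, not the DFlash
# implementation): per cycle the draft model proposes `draft_len` tokens, the
# target verifies them in one batched pass, and the leading run of accepted
# tokens plus one token from the target itself is kept.
def expected_kept(p_accept: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass, assuming each
    drafted token is accepted independently with probability p_accept."""
    return 1.0 + sum(p_accept ** k for k in range(1, draft_len + 1))

def wasted_fraction(p_accept: float, draft_len: int) -> float:
    """Share of drafted tokens that end up discarded."""
    return 1.0 - (expected_kept(p_accept, draft_len) - 1.0) / draft_len

# At the post's 99.3% acceptance, draft-max 6 yields ~6.9 tokens per target
# pass with only ~2% of the drafted tokens wasted.
print(expected_kept(0.993, 6), wasted_fraction(0.993, 6))

# A longer draft at a hypothetical lower acceptance rate drafts twice as many
# tokens yet emits fewer per pass, throwing away roughly 60% of the work.
print(expected_kept(0.85, 12), wasted_fraction(0.85, 12))
```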
- DFlash speculative decoding with Qwen3.5-35B-A3B achieves 35.6-35.8 tok/s on an 8GB RTX 2080 SUPER, a 33% gain over baseline 26.8 tok/s.
- Uses a tiny 474M parameter draft model (267.8 MiB) with Q4_K_M quantization, achieving 99.3% acceptance at draft-max 6.
- Combining MoE expert CPU offload (-ncmoe 34) with DFlash allowed a 24.44 GiB model to run on a GPU with 8GB of VRAM.
Why It Matters
Enables running large MoE models on consumer GPUs, democratizing access to high-performance AI inference.