Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB
Run a 35B model on an 8GB GPU? DFlash makes it 33% faster.
A developer has demonstrated that DFlash speculative decoding in llama.cpp can dramatically improve generation speed for large MoE models on severely VRAM-limited hardware. The target was Qwen3.5-35B-A3B (Q5_K_M, 24.44 GiB) running on an RTX 2080 SUPER with only 8GB of VRAM, paired with a tiny DFlash draft model (474M parameters, Q4_K_M, just 267.8 MiB). By combining MoE expert CPU offload (-ncmoe 34) with DFlash drafting, they ran the 35B model entirely on consumer-grade hardware.
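For a sense of scale, here is a rough back-of-envelope of the VRAM budget in Python. The model and draft sizes come from the post; the headroom reserved for KV cache, activations, and runtime overhead is an assumed figure for illustration only.

```python
# Rough VRAM budget for the setup described above (illustrative; the exact
# split depends on the GGUF layer layout and KV-cache settings).
GIB = 1024 ** 3
MIB = 1024 ** 2

target_model = 24.44 * GIB   # Qwen3.5-35B-A3B, Q5_K_M (from the post)
draft_model  = 267.8 * MIB   # 474M DFlash draft, Q4_K_M (from the post)
vram         = 8 * GIB       # RTX 2080 SUPER
headroom     = 1.5 * GIB     # assumed: KV cache, activations, CUDA overhead

# Whatever does not fit on the card has to live in system RAM, which is what
# -ncmoe 34 arranges by keeping the MoE expert tensors of 34 layers on the CPU.
min_offload = target_model + draft_model - (vram - headroom)
print(f"Weights that must stay in system RAM: {min_offload / GIB:.1f} GiB")
# -> roughly 18 GiB of expert weights end up served from system RAM
```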
The results are striking: baseline inference without DFlash ran at ~26.8 tok/s, while DFlash with a draft length of 6 tokens reached ~35.6-35.8 tok/s, a 33-34% improvement. The draft acceptance rate peaked at 99.3% at draft-max 6, meaning nearly every drafted token was accepted and little compute was wasted. Longer draft lengths actually reduced throughput, because lower acceptance rates meant more drafted tokens were discarded, as the sketch below illustrates. This technique unlocks viable local inference for state-of-the-art MoE models on budget GPUs, bypassing the need for expensive professional hardware.
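The draft-length tradeoff is easy to see with a small model of the verify cycle. This is a simplified sketch, not DFlash's actual acceptance logic: it assumes each drafted token is accepted independently with a fixed probability, and the 85%/12-token comparison point is hypothetical (the post only reports 99.3% at draft-max 6).

```python
# Minimal model of the draft-length tradeoff (a sketch, not the DFlash
# implementation): per cycle the draft model proposes `draft_len` tokens, the
# target verifies them in one batched pass, and the leading run of accepted
# tokens plus one token from the target itself is kept.
def expected_kept(p_accept: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass, assuming each
    drafted token is accepted independently with probability p_accept."""
    return 1.0 + sum(p_accept ** k for k in range(1, draft_len + 1))

def wasted_fraction(p_accept: float, draft_len: int) -> float:
    """Share of drafted tokens that end up discarded."""
    return 1.0 - (expected_kept(p_accept, draft_len) - 1.0) / draft_len

# At the post's 99.3% acceptance, draft-max 6 yields ~6.9 tokens per target
# pass with only ~2% of the drafted tokens wasted.
print(expected_kept(0.993, 6), wasted_fraction(0.993, 6))

# A longer draft at a hypothetical lower acceptance rate drafts twice as many
# tokens yet emits fewer per pass, throwing away roughly 60% of the work.
print(expected_kept(0.85, 12), wasted_fraction(0.85, 12))
```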
- DFlash speculative decoding with Qwen3.5-35B-A3B achieves 35.6-35.8 tok/s on an 8GB RTX 2080 SUPER, a 33% gain over baseline 26.8 tok/s.
- Uses a tiny 474M parameter draft model (267.8 MiB) with Q4_K_M quantization, achieving 99.3% acceptance at draft-max 6.
- Combining MoE expert CPU offload (-ncmoe 34) with DFlash allowed a 24.44 GiB model to run on a GPU with 8GB of VRAM.
Why It Matters
Enables running large MoE models on consumer GPUs, democratizing access to high-performance AI inference.