z-lab released gemma-4-26B-A4B-it-DFlash. Anybody tried it yet?
New speculative decoding method promises 2x+ speed on sparse 26B models for long-context inference.
z-lab quietly dropped Gemma-4-26B-A4B-it-DFlash, a sparse Mixture-of-Experts model fine-tuned from Google's Gemma 4. The headline feature is DFlash, a new speculative decoding technique that replaces the traditional multi-token prediction (MTP) approach. DFlash drafts with parallel block diffusion, generating a block of candidate tokens in one shot rather than sequentially, and keeps persistent state across iterations, preserving KV cache positions and RoPE offsets. As a result, inference speed doesn't degrade as conversation length grows, unlike MTP, where the KV cache balloons rapidly. Preliminary benchmarks suggest DFlash can deliver up to 2x speed gains on long-context tasks for sparse models of this size.
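To make the mechanism concrete, here's a minimal, purely illustrative Python sketch of block-drafted speculative decoding with a persistent cache position. The `draft_block` and `verify` functions are toy stand-ins for the diffusion drafter and the target model's batched verification pass; nothing here is z-lab's actual implementation.

```python
# Toy sketch of block-drafted speculative decoding (NOT z-lab's DFlash code).
# Key idea: the drafter proposes a whole block at once, the target verifies
# the block in a single pass, and a persistent cache position advances by
# the number of tokens kept, so nothing is re-encoded as context grows.
import random

BLOCK = 4       # tokens drafted per diffusion pass (assumed)
VOCAB = 32000   # toy vocabulary size

def draft_block(ctx, cache_pos):
    # Stand-in for one parallel diffusion pass: proposes BLOCK candidates
    # at once, conditioned on the persistent cache position.
    random.seed(hash((tuple(ctx[-4:]), cache_pos)))
    return [random.randrange(VOCAB) for _ in range(BLOCK)]

def verify(ctx, candidates):
    # Stand-in for the target model scoring the whole block in one forward
    # pass: accept the longest valid prefix, then emit one corrected token.
    n = 0
    for t in candidates:
        if t % 3 == 0:  # toy rejection rule standing in for an argmax mismatch
            break
        n += 1
    correction = (sum(ctx) + n) % VOCAB  # toy "target model" token
    return candidates[:n], correction

ctx = [1, 2, 3]        # prompt token ids
cache_pos = len(ctx)   # persistent KV cache position / RoPE offset
while len(ctx) < 32:
    block = draft_block(ctx, cache_pos)
    accepted, fix = verify(ctx, block)
    ctx += accepted + [fix]
    cache_pos += len(accepted) + 1  # advance the offset; never rebuild the cache
print(ctx)
```

Because `cache_pos` only ever advances, the cost per iteration stays flat regardless of how long `ctx` gets, which is the property the long-context speedup claim rests on.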
The catch: DFlash is currently supported only in vLLM, the high-throughput inference engine, and not yet in llama.cpp, which limits testing to users with compatible GPU setups. The community is eagerly awaiting porting efforts, since sparse models like Gemma 4 26B and the upcoming Qwen 3.6 35B stand to benefit most from this technique. If DFlash proves robust, it could become the default speculative decoding method for large language models, slashing latency and memory usage for production deployments that rely on long-context reasoning.
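For anyone who wants to kick the tires, a launch script might look roughly like the sketch below, assuming DFlash plugs into vLLM's existing `speculative_config` interface. The `"dflash"` method string and the parameter values are guesses, not a documented API; check the model card for the real invocation.

```python
# Speculative, untested sketch: assumes DFlash is exposed through vLLM's
# speculative_config dict (the same mechanism vLLM uses for methods like
# "ngram"). The "dflash" method name and token count below are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="z-lab/gemma-4-26B-A4B-it-DFlash",
    speculative_config={
        "method": "dflash",           # hypothetical method name
        "num_speculative_tokens": 4,  # drafted block size (assumed)
    },
)
out = llm.generate(["Summarize this contract:"], SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)
```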
- DFlash uses parallel block diffusion drafting instead of MTP's sequential token prediction.
- Persistent state across iterations prevents KV cache ballooning in long-context sessions.
- Currently exclusive to vLLM; llama.cpp support is not yet available.
Why It Matters
Faster long-context inference on sparse models means cheaper, more scalable AI for enterprise applications like document analysis and chat.