Open Source

BeeLlama v0.2.0: DFlash boosts Qwen 3.6 27B 4.4x, Gemma 4 31B 4.9x on RTX 3090

r/LocalLLaMA May 22, 2026

⚡ Single RTX 3090 hits 177.8 tps on Gemma 4 31B with near-zero prompt processing overhead.

Deep Dive

BeeLlama v0.2.0 overhauls its DFlash speculative decoding engine, delivering large generation speedups for large language models on consumer hardware. In benchmarks on a single RTX 3090 (24GB), Qwen 3.6 27B reached up to 164 tokens per second (tps), 4.40x the llama.cpp baseline, while Gemma 4 31B reached 177.8 tps (4.93x). Prompt processing speed stayed at 0.99x of baseline, so the generation speedup comes without a meaningful cost to context ingestion.
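To put the reported multipliers in concrete terms, the short sketch below backs out the baseline generation rate each figure implies (reported rate divided by speedup) and the time to generate 1,000 tokens before and after. Only the four reported numbers come from the article. The helper names and the 1,000-token workload are illustrative assumptions, and nothing here is a new measurement.

```python
# Back out the baseline rate implied by each reported DFlash figure.
# The (tps, speedup) pairs are the article's numbers; everything else
# (function names, the 1000-token workload) is illustrative.
REPORTED = {
    "Qwen 3.6 27B": (164.0, 4.40),  # (tokens/s with DFlash, speedup factor)
    "Gemma 4 31B": (177.8, 4.93),
}


def implied_baseline_tps(dflash_tps: float, speedup: float) -> float:
    """Baseline tokens/s implied by a reported rate and its speedup factor."""
    return dflash_tps / speedup


def seconds_for(tokens: int, tps: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate of `tps`."""
    return tokens / tps


for model, (tps, speedup) in REPORTED.items():
    base = implied_baseline_tps(tps, speedup)
    print(
        f"{model}: implied baseline ~{base:.1f} t/s; "
        f"1000 tokens in {seconds_for(1000, base):.1f}s -> "
        f"{seconds_for(1000, tps):.1f}s"
    )
```

Both models land on an implied baseline in the mid-30s tokens per second, which suggests the two results were measured against comparable llama.cpp setups. The article does not state the baseline rates directly.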

Beyond raw speed, the update adds full Gemma 4 31B support with efficient DFlash and vision capabilities, plus GGUFs for upstream architecture compatibility. Improvements include lower DFlash overhead, cleaner prefill handling, drafter K/V projection caching, safer CUDA execution, and stricter validation of draft/target models. Reasoning and tool-call boundaries were tightened, with safer fallbacks when grammar or sampler state requires full logits. The release also adds adaptive probing around baseline profit behavior and relaxes strictness in the verifier path. For developers running large models locally, this is a significant gain in inference efficiency.
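For readers unfamiliar with the draft/target split mentioned above, here is a minimal sketch of generic greedy speculative decoding. It is not BeeLlama's or DFlash's implementation, whose internals the article does not describe. The toy "models" are plain functions, and every name in the block is a hypothetical chosen for illustration. The point it shows is that a cheap drafter proposes several tokens, the target model checks them, and the output stays identical to what the target alone would produce.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_generate(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding with the target model only."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: draft up to k tokens, keep the verified prefix."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The drafter proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target verifies the proposal position by position.
        #    A real engine scores all positions in one batched forward pass;
        #    this toy version calls the target sequentially for clarity.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always emit the target's own choice
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
    return out


# Toy models: the target counts upward mod 10; the drafter agrees
# except after token 5, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    same = speculative_generate(toy_target, toy_draft, [0], 12) == \
        greedy_generate(toy_target, [0], 12)
    print("outputs identical:", same)
```

The speedup in a real system comes from the batched verification step: when the drafter is usually right, several tokens are accepted for roughly the cost of one target forward pass. This sketch omits refinements such as the extra "bonus" token after a fully accepted draft and sampling-based acceptance.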

Key Points

Qwen 3.6 27B: 164 tps (4.40x) on single RTX 3090 with DFlash; Gemma 4 31B: 177.8 tps (4.93x)
Prompt processing speed unchanged (0.99x) — no trade-off for generation speedup
New features: vision support, GGUFs, tighter reasoning/tool-call boundaries, safer fallback logic

Why It Matters

DFlash brings server-grade speculative decoding efficiency to consumer GPUs, enabling real-time local inference for 27B+ models.
