BeeLlama v0.2.0: DFlash boosts Qwen 3.6 27B 4.4x, Gemma 4 31B 4.9x on RTX 3090
A single RTX 3090 reaches 178 tokens per second on Gemma 4 31B with near-zero prompt-processing overhead.
BeeLlama v0.2.0 overhauls its DFlash speculative decoding engine, delivering large generation speedups for large language models on consumer hardware. In benchmarks on a single RTX 3090 (24 GB), Qwen 3.6 27B reached up to 164 tokens per second (tps), a 4.40x improvement over the llama.cpp baseline, while Gemma 4 31B reached 177.8 tps (4.93x). Prompt processing speed remained virtually unchanged (0.99x of baseline), so the generation speedup does not come at the cost of context ingestion.
Beyond raw speed, the update adds full Gemma 4 31B support with efficient DFlash and vision capabilities, plus GGUFs for upstream architecture compatibility. Improvements include lower DFlash overhead, cleaner prefill handling, drafter K/V projection caching, safer CUDA execution, and stricter validation of draft/target models. Reasoning and tool-call boundaries were tightened, with safer fallbacks when grammar or sampler state requires full logits. The release also adaptively probes around baseline profit behavior and reduces verifier path strictness. For developers running large models locally, this is a significant leap in inference efficiency.
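For readers unfamiliar with the draft/target split mentioned above, here is a minimal sketch of generic greedy speculative decoding. It is not DFlash's actual algorithm or BeeLlama's API: the toy models and every name (`toy_target`, `toy_draft`, `speculative_step`, `generate`) are hypothetical, and a real engine would verify all drafted tokens in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function over integer tokens.
NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical small drafter: agrees with the target except after a 7."""
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     ctx: List[int], k: int) -> List[int]:
    """One draft-then-verify round; returns the tokens accepted this round."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(ctx + proposal))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(ctx + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Everything matched: the verification pass also yields one extra token.
    accepted.append(target(ctx + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_new tokens; also report how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_new], rounds


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_draft, [0], n_new=12)
    print(tokens, rounds)
```

Because the target confirms or corrects every position, the output is identical to plain greedy decoding with the target alone; the gain comes from needing fewer target rounds than tokens whenever the drafter guesses well.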
- Qwen 3.6 27B: 164 tps (4.40x) on a single RTX 3090 with DFlash; Gemma 4 31B: 177.8 tps (4.93x)
- Prompt processing speed unchanged (0.99x), so the generation speedup carries no trade-off
- New features: vision support, GGUFs, tighter reasoning/tool-call boundaries, safer fallback logic
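As a quick sanity check on the figures above, dividing each reported throughput by its speedup gives the implied llama.cpp baseline, roughly 37 tps for Qwen 3.6 27B and 36 tps for Gemma 4 31B. The sketch below does that arithmetic; the helper name `implied_baseline` is made up for illustration, and because the 164 tps figure is an "up to" number, the results are approximate.

```python
# Reported figures: (tokens/s with DFlash, speedup over the llama.cpp baseline).
reported = {
    "Qwen 3.6 27B": (164.0, 4.40),
    "Gemma 4 31B": (177.8, 4.93),
}


def implied_baseline(tps: float, speedup: float) -> float:
    """Baseline throughput implied by a reported throughput and speedup."""
    return tps / speedup


if __name__ == "__main__":
    for model, (tps, speedup) in reported.items():
        base = implied_baseline(tps, speedup)
        # Per-token latency in milliseconds, before and after the speedup.
        print(f"{model}: ~{base:.1f} tps baseline, "
              f"{1000 / base:.1f} ms/token -> {1000 / tps:.1f} ms/token")
```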
Why It Matters
DFlash brings server-grade speculative decoding efficiency to consumer GPUs, enabling real-time local inference for 27B+ models.