Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090
Speculative decoding runs Qwen3.6-27B at up to 78 tok/s on consumer hardware.
The Luce team has released Luce DFlash, a new open-source engine that brings speculative decoding to consumer GPUs. Built as a standalone C++/CUDA stack on top of ggml, it runs entirely on a single 24 GB RTX 3090 and hosts the Qwen3.6-27B model.

The engine achieves a mean 1.98x speedup over autoregressive generation across the HumanEval, GSM8K, and Math500 benchmarks, with peak throughput of 78 tok/s on HumanEval. It pairs the target model with a matched DFlash draft model (bf16, ~3.46 GB) published by z-lab and employs DDTree tree-verify speculative decoding with a block size of 16 and a default budget of 22 (a simplified sketch of the decoding loop follows the bullet list below).

On the memory side, the KV cache is compressed to TQ3_0 (3.5 bits per value, ~9.7x vs F16), and a 4096-slot ring buffer allows up to 256K of context to fit in 24 GB (rough sizing arithmetic also below). For long prompts, the prefill ubatch auto-bumps from 16 to 192, reaching ~913 tok/s on 13K-token prompts. At decode, sliding-window flash attention sustains 89.7 tok/s at 60K context, versus 25.8 tok/s without it.

The engine serves an OpenAI-compatible HTTP endpoint or a local chat REPL and requires only CUDA 12+ and an NVIDIA GPU (RTX 3090/4090/5090, DGX Spark, or Jetson AGX Thor). No Python runtime, vLLM, or llama.cpp is needed. The project is MIT-licensed and available on GitHub.
- Mean 1.98x speedup over autoregressive decoding on Qwen3.6-27B across HumanEval, GSM8K, Math500
- Runs on single RTX 3090 (24 GB) with 256K context via TQ3_0 KV cache compression
- Standalone C++/CUDA engine, no Python runtime, vLLM, or llama.cpp required
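To make the decoding loop concrete, here is a minimal Python sketch of greedy block speculative decoding, the linear special case of what a tree-verify scheme like DDTree does; DDTree drafts a tree of candidates per step rather than a single chain. The `draft_step` and `target_logits` interfaces are hypothetical stand-ins for illustration, not Luce DFlash's actual API:

```python
# Minimal sketch of greedy block speculative decoding: the linear special
# case of tree-verify schemes like DDTree, which draft a *tree* of
# candidates per step instead of a single chain. `draft_step` and
# `target_logits` are hypothetical stand-ins, not Luce DFlash's API.

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

def speculative_step(prompt_ids, draft_step, target_logits, block_size=16):
    # 1. Draft: the small model proposes `block_size` tokens autoregressively.
    ctx = list(prompt_ids)
    draft = []
    for _ in range(block_size):
        tok = draft_step(ctx)        # greedy next token from the draft model
        draft.append(tok)
        ctx.append(tok)

    # 2. Verify: a single batched target-model pass scores every position.
    #    logits[i] is the target's next-token distribution given
    #    prompt_ids + draft[:i], for i in 0..block_size.
    logits = target_logits(prompt_ids, draft)

    # 3. Accept the longest prefix on which the target agrees, then emit the
    #    target's own token at the first disagreement (or a bonus token when
    #    the whole block is accepted), so each step yields at least 1 token.
    out = []
    for i, tok in enumerate(draft):
        if argmax(logits[i]) != tok:
            out.append(argmax(logits[i]))
            break
        out.append(tok)
    else:
        out.append(argmax(logits[block_size]))
    return out
```

Each step costs roughly one target-model forward pass but can emit up to block_size + 1 tokens when the draft agrees, which is where a near-2x end-to-end speedup can come from. The tree variant raises the acceptance odds by verifying several candidate continuations at once under a fixed node budget (plausibly what the default budget of 22 controls).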
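The memory claim can be sanity-checked with rough arithmetic. The model shape below is a hypothetical stand-in, since the announcement does not publish Qwen3.6-27B's configuration; only the 3.5 bpv rate comes from the article, and the ring buffer and sliding window are ignored here:

```python
# Back-of-envelope KV-cache sizing at a given bits-per-value (bpv) rate.
# The model shape below is a HYPOTHETICAL stand-in (the announcement does
# not give Qwen3.6-27B's config); only 3.5 bpv comes from the article, and
# the ring buffer / sliding window are ignored entirely.

def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bpv):
    n_values = 2 * ctx_len * n_layers * n_kv_heads * head_dim  # 2x: K and V
    return n_values * bpv / 8

GiB = 1024 ** 3
shape = dict(n_layers=48, n_kv_heads=4, head_dim=128)  # guessed, for scale

for name, bpv in [("F16", 16.0), ("TQ3_0", 3.5)]:
    size = kv_cache_bytes(256 * 1024, bpv=bpv, **shape)
    print(f"{name:6s} KV cache @ 256K ctx: {size / GiB:5.1f} GiB")

# With these guessed shapes: F16 ~24.0 GiB vs TQ3_0 ~5.2 GiB. Note the raw
# bit ratio is 16/3.5 ~= 4.6x; the article's ~9.7x figure presumably
# measures against a different baseline or includes metadata overheads.
```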
Why It Matters
Speculative decoding plus aggressive KV-cache compression brings a 27B-parameter model into real-time territory, tens of tokens per second, on a single consumer GPU that many developers already own, making local deployment practical for latency-sensitive applications.
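Because the server speaks the OpenAI API, existing client tooling should point at it unchanged. Here is a minimal sketch using the standard openai Python client; the base URL, port, and model id are illustrative assumptions, not values documented by the project:

```python
# Sketch: pointing the standard OpenAI Python client at a locally running
# OpenAI-compatible server. The base_url and model id are ASSUMPTIONS for
# illustration; check Luce DFlash's docs for the real values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # hypothetical model id
    messages=[{"role": "user",
               "content": "Explain speculative decoding in one paragraph."}],
)
print(resp.choices[0].message.content)
```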