BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on a 3090, 2-3x faster than baseline (peak 135 tps!)
200k context, speculative decoding, and vision on one GPU — no VRAM tricks.
Deep Dive
Anbeeld released BeeLlama.cpp, a performance-focused fork of llama.cpp that runs large GGUF models such as Qwen 3.6 27B at Q5 with a 200k-token, practically lossless KV cache and vision support on a single RTX 3090 or 4090. It combines DFlash speculative decoding, TurboQuant KV-cache compression (up to 7.5x), adaptive draft control, and reasoning-loop protection, and the server supports full multimodal inference and CPU offloading.
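The post does not describe DFlash's internals, so here is only a minimal, hedged sketch of what draft-and-verify speculative decoding with adaptive draft control does in general. The `draft_sample`, `target_sample`, and `target_accepts` functions below are toy stand-ins with a faked acceptance rate, not BeeLlama.cpp's API; only the control flow is the point.

```python
import random

VOCAB = list(range(100))

def draft_sample(ctx):
    # stand-in for a small, fast draft model
    return random.choice(VOCAB)

def target_sample(ctx):
    # stand-in for the large target model
    return random.choice(VOCAB)

def target_accepts(ctx, tok):
    # stand-in for verification; pretend ~70% of drafted tokens are accepted
    return random.random() < 0.7

def speculative_step(ctx, draft_len):
    """Draft draft_len tokens cheaply, verify them, keep the accepted prefix."""
    drafted = []
    for _ in range(draft_len):
        drafted.append(draft_sample(ctx + drafted))  # draft autoregressively
    accepted = []
    for tok in drafted:
        if not target_accepts(ctx + accepted, tok):
            break                                    # first rejection ends the run
        accepted.append(tok)
    # the target model always contributes one token of its own per step,
    # so a step never yields fewer tokens than plain decoding
    accepted.append(target_sample(ctx + accepted))
    return accepted, len(accepted) - 1, len(drafted)

def adapt_draft_len(draft_len, n_accepted, n_drafted, lo=2, hi=16):
    """Grow the draft window when acceptance is high, shrink it when low."""
    rate = n_accepted / max(n_drafted, 1)
    if rate > 0.75:
        return min(draft_len + 2, hi)
    if rate < 0.4:
        return max(draft_len - 2, lo)
    return draft_len

ctx, draft_len = [1, 2, 3], 8
for _ in range(5):
    new_tokens, n_acc, n_drafted = speculative_step(ctx, draft_len)
    ctx += new_tokens
    draft_len = adapt_draft_len(draft_len, n_acc, n_drafted)
    print(f"accepted {n_acc}/{n_drafted} drafted tokens, next draft_len = {draft_len}")
```

The speedup in a real engine comes from verifying the whole drafted run in one batched forward pass of the large model; growing or shrinking the draft window with the observed acceptance rate is one plausible reading of "adaptive draft control".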
Key Points
- 2-3x speedup over baseline llama.cpp with DFlash speculative decoding, peaking at 135 tps on a single RTX 3090.
- Up to 7.5x KV-cache compression via TurboQuant/TCQ, enabling 200k context at Q5 with minimal quality loss (see the arithmetic sketch after this list).
- Full multimodal support (text+vision), adaptive draft control, and CPU offloading — all in a Windows-friendly fork.
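To see why roughly 7.5x KV-cache compression makes a 200k-token context plausible on a 24 GB card, here is a back-of-the-envelope estimate. The layer count, KV-head count, and head dimension are assumed GQA-style values (the post does not give Qwen 3.6 27B's architecture), and the effective bit width is simply fp16 divided by the claimed ratio; only the estimation method carries over.

```python
# Back-of-the-envelope KV-cache sizing under assumed model dimensions.
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    # 2x for keys and values, one entry per layer per KV head per position
    n_values = 2 * n_layers * n_kv_heads * head_dim * ctx_len
    return n_values * bits_per_value / 8

GIB = 1024 ** 3
ctx = 200_000
n_layers, n_kv_heads, head_dim = 48, 8, 128   # assumed GQA-style layout

fp16 = kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, 16)
compressed = fp16 / 7.5                        # the claimed compression ratio

print(f"fp16 KV cache:   {fp16 / GIB:5.1f} GiB")   # ~36.6 GiB, too big for 24 GB
print(f"7.5x compressed: {compressed / GIB:5.1f} GiB")  # ~4.9 GiB
# At ~2.1 effective bits per value, the cache shrinks to a few GiB, leaving
# room for Q5 weights on a single 3090-class GPU, which is what makes the
# single-GPU 200k-context claim plausible.
```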
Why It Matters
BeeLlama.cpp lets practitioners run large-context models locally on consumer GPUs, unlocking privacy and cost savings.