Open Source

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!)

r/LocalLLaMA May 10, 2026

⚡200k context, speculative decoding, and vision on one GPU — no VRAM tricks.

Deep Dive

Anbeeld released BeeLlama.cpp, a performance-focused fork of llama.cpp that runs large GGUF models such as Qwen 3.6 27B at Q5 with a 200k-token, practically lossless KV cache and vision support on a single RTX 3090 or 4090. It combines DFlash speculative decoding, TurboQuant KV-cache compression (up to 7.5x), adaptive draft control, and reasoning-loop protection. The server supports full multimodal input and CPU offloading.

Key Points

2-3x speedup over baseline llama.cpp with DFlash speculative decoding, peaking at 135 tokens per second (tps) on a single RTX 3090.
Up to 7.5x KV-cache compression via TurboQuant/TCQ, enabling 200k context at Q5 with minimal quality loss.
Full multimodal support (text+vision), adaptive draft control, and CPU offloading — all in a Windows-friendly fork.
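
To give a feel for why KV-cache compression matters at 200k context, here is a rough back-of-the-envelope sketch. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not the actual Qwen 3.6 27B configuration, and the formula is the generic one for an uncompressed attention KV cache rather than anything taken from BeeLlama.cpp itself. Only the 200k context length and the 7.5x ratio come from the post.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    """Size of a plain KV cache: one K and one V vector per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model shape (assumption, NOT the real Qwen 3.6 27B config).
N_LAYERS = 64
N_KV_HEADS = 8
HEAD_DIM = 128

N_TOKENS = 200_000        # context length quoted in the post
FP16_BYTES = 2            # uncompressed 16-bit cache entries
COMPRESSION_RATIO = 7.5   # "up to 7.5x" quoted in the post

baseline = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_TOKENS, FP16_BYTES)
compressed = baseline / COMPRESSION_RATIO

print(f"fp16 KV cache:       {baseline / GIB:.1f} GiB")
print(f"at 7.5x compression: {compressed / GIB:.1f} GiB")
```

Under these assumed dimensions the uncompressed cache alone would come to roughly 48.8 GiB, well beyond the 24 GB of an RTX 3090, while a 7.5x reduction would bring it to about 6.5 GiB. The real figures depend on the model's actual architecture and on how the fork stores the cache.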

Why It Matters

BeeLlama.cpp lets practitioners run long-context models locally on consumer GPUs, which keeps data private and avoids hosted-inference costs.
