Open Source

BeeLlama.cpp runs Qwen 3.6 27B at 135 tps on a single 3090

200k context, speculative decoding, and vision on one GPU — no VRAM tricks.

Deep Dive

Anbeeld released BeeLlama.cpp, a performance-focused fork of llama.cpp that runs large GGUF models such as Qwen 3.6 27B at Q5 quantization with a 200k-token context, a practically lossless KV cache, and vision on a single RTX 3090 or 4090. It combines DFlash speculative decoding, TurboQuant KV-cache compression (up to 7.5x), adaptive draft control, and reasoning-loop protection. The server supports full multimodal input and CPU offloading.
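For readers new to speculative decoding, here is a minimal toy sketch of the general draft-and-verify idea with greedy decoding. It is not DFlash and not BeeLlama.cpp code: `speculative_step`, `target`, and `draft` are hypothetical names, and the two "models" are plain functions over token IDs. A real engine verifies all drafted tokens in one batched pass of the large model, which is where the speedup comes from.

```python
from typing import Callable, List

# A "model" here is just a function: token history -> next token (greedy).
NextToken = Callable[[List[int]], int]

def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Return the tokens emitted by one draft-and-verify round."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal in order. (Done sequentially
    #    here for clarity; real engines batch this into one forward pass.)
    emitted: List[int] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target contributes one extra token.
    emitted.append(target(ctx))
    return emitted

# Toy models (assumptions for illustration only): the target counts upward
# modulo 10; the draft agrees except right after token 4.
def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10

def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

print(speculative_step(target, draft, [1], k=4))  # [2, 3, 4, 5]
```

The output always matches what the target model would have produced on its own. The draft only changes how many tokens are emitted per round, which is why a draft that is accepted more often yields a higher tokens-per-second figure.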

Key Points
  • 2-3x speedup over baseline llama.cpp with DFlash speculative decoding, peaking at 135 tokens per second (tps) on a single RTX 3090.
  • Up to 7.5x KV-cache compression via TurboQuant/TCQ, enabling 200k context at Q5 with minimal quality loss.
  • Full multimodal support (text and vision), adaptive draft control, and CPU offloading, all in a Windows-friendly fork.
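To see why KV-cache compression matters for a 200k context on a 24 GB card, here is a rough back-of-the-envelope sketch. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not the real Qwen configuration, and `kv_cache_bytes` is an illustrative helper, not part of BeeLlama.cpp. Only the 200k context length and the 7.5x ratio come from the announcement.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: float) -> float:
    # Each token stores one K and one V vector per layer,
    # each of size n_kv_heads * head_dim.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Hypothetical model shape (assumption, NOT the actual Qwen 3.6 27B config).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128
CONTEXT = 200_000          # tokens, from the announcement
COMPRESSION = 7.5          # "up to 7.5x", from the announcement
GIB = 1024 ** 3

fp16 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, 2)
compressed = fp16 / COMPRESSION

print(f"fp16 KV cache:   {fp16 / GIB:.1f} GiB")        # 48.8 GiB
print(f"7.5x compressed: {compressed / GIB:.1f} GiB")  # 6.5 GiB
```

Under these assumed dimensions, an uncompressed fp16 cache would be about twice the card's total VRAM, while the compressed cache leaves room for the model weights.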

Why It Matters

BeeLlama.cpp lets professionals run long-context models locally on consumer GPUs, which keeps data private and avoids cloud inference costs.
