Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB
Run a minimally quantized 27B model at 80 tokens per second on a single $10k GPU build.
A Reddit user (u/__JockY__) presents a detailed recipe for running Qwen3.6-27B-FP8 — the official FP8-quantized variant from Qwen — on a single RTX 5000 PRO 48GB GPU. The rest of the hardware is modest: 64GB of system RAM and a decent CPU/motherboard, aimed at the common "what should I buy for $10k" question. Using vLLM 0.20.1 with CUDA 12.9, the setup leverages Blackwell-accelerated FP8 for the model weights while keeping the KV cache in bfloat16 (non-quantized). The configuration includes the FlashInfer attention backend, MTP (multi-token prediction) with 2 speculative tokens, and custom compilation flags. This approach yields roughly 80 tokens per second at 200k context length, with a concurrency factor of 1.09x. The poster emphasizes that the non-quantized KV cache avoids the early compaction and endless loops seen in long Claude sessions, making the setup suitable for agentic coding where reliability matters.
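For readers who want to try something along these lines, here is a minimal Python sketch of such a vLLM launch. It is not the poster's exact command: the Hugging Face repo id, memory fraction, and speculative-decoding details are assumptions, and only generic vLLM options are used.

```python
import os

# vLLM selects the FlashInfer attention backend via an environment variable.
os.environ.setdefault("VLLM_ATTENTION_BACKEND", "FLASHINFER")

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B-FP8",   # assumed repo id for the official FP8 quant
    max_model_len=200_000,          # the 200k-token context described in the post
    kv_cache_dtype="auto",          # keep the KV cache in the model's bf16, not fp8
    gpu_memory_utilization=0.95,    # leave a little headroom on the 48GB card
    # MTP with 2 speculative tokens would go through speculative_config, e.g.
    # speculative_config={"method": "<model-specific MTP method>", "num_speculative_tokens": 2},
    # but the exact method name for this model is not given in the post.
)

outputs = llm.generate(
    ["Write a Python function that parses RFC 3339 timestamps."],
    SamplingParams(temperature=0.2, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```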
Performance benchmarks are still running, but initial results show 60–90 TPS with MTP=2 on code-writing tasks. The key advantage is minimal quantization: only the model weights use FP8 (with Blackwell-native acceleration), while the KV cache stays in full bfloat16. This markedly reduces error accumulation over long contexts compared with the more aggressively quantized setups that 24GB cards force. The poster positions this as a definitive answer for anyone wondering what to buy for $10k, claiming the RTX 5000 PRO with this setup offers a quiet, cool, and fast environment for running state-of-the-art open-weight LLMs locally. The full recipe and vLLM server command are shared in the post, making it reproducible for other enthusiasts.
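To see why a full bf16 KV cache at 200k tokens can coexist with FP8 weights inside 48GB, here is a rough sizing sketch. The layer, head, and dimension values are illustrative placeholders, not the actual Qwen3.6-27B architecture, so treat the result as an order-of-magnitude check only.

```python
# Back-of-the-envelope sizing: FP8 weights plus a bf16 KV cache at 200k tokens.
# Layer/head/dim values are placeholders for illustration, NOT the real config.
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    # factor of 2 for keys and values; bf16 = 2 bytes per element
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 2**30

weights_gib = 27e9 / 2**30  # ~27B parameters at 1 byte each (FP8) ≈ 25 GiB
cache_gib = kv_cache_gib(tokens=200_000, layers=48, kv_heads=4, head_dim=128)
print(f"weights ≈ {weights_gib:.1f} GiB, 200k bf16 KV cache ≈ {cache_gib:.1f} GiB")
# With these placeholder values: ~25 GiB + ~18 GiB, leaving a few GiB for
# activations and CUDA graphs on a 48 GB card.
```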
- Achieves 60–90 TPS for code writing with MTP=2 (80 TPS typical) and supports a 200k-token context
- Uses Qwen's official FP8 quant (Qwen3.6-27B-FP8) with a bfloat16 KV cache to avoid error compounding
- Runs on a single RTX 5000 PRO 48GB GPU with vLLM 0.20.1, CUDA 12.9, and the FlashInfer attention backend; a client-side sketch follows this list
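Once the server is up, any OpenAI-compatible client can drive it for agentic coding. A minimal check might look like the following; the model name is an assumed repo id and must match whatever the server was launched with.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B-FP8",  # assumed id; must match the served model name
    messages=[{"role": "user",
               "content": "Refactor this loop into a list comprehension: for x in xs: ys.append(x*x)"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```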
Why It Matters
Shows that a minimally quantized 27B model can run locally on a single GPU at practical speeds and full context, enabling reliable long-context agentic coding.