Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM
32GB VRAM delivers 73 tok/s generation at 200k context depth
A developer tested Alibaba's Qwen3.6-27B model in the NVFP4 quantization format on a single NVIDIA RTX 5090 (32GB VRAM) with the vLLM serving framework. The setup used compressed-tensors quantization, an FP8_E4M3 KV cache, the FlashInfer attention backend, and multi-token prediction (MTP) with 3 speculative tokens. The model ran in text-only mode with a maximum context length of 230,400 tokens, validated up to 200k. Key vLLM parameters included `--kv-cache-dtype fp8_e4m3`, `--enable-chunked-prefill`, and `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'`. GPU memory peaked at 30.5 GiB of the available 32.6 GiB, leaving a small margin for stability.
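A minimal launch command along these lines would approximate the configuration described above. The KV-cache, chunked-prefill, and speculative-decoding flags are the ones quoted in the post, and 230,400 is the reported maximum context length; the model path, port, and memory-utilization value are illustrative assumptions.

```bash
# Sketch of a vLLM launch matching the reported setup.
# Assumptions: local model path, port 8000, and the 0.95 memory-utilization value.
vllm serve ./Qwen3.6-27B-NVFP4 \
  --port 8000 \
  --max-model-len 230400 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-chunked-prefill \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --gpu-memory-utilization 0.95
```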
Performance benchmarks with llama-benchy showed consistent results across a ladder of context depths. At 200k context, prefill averaged 2,883 tok/s with a time-to-first-token (TTFT) of roughly 70 seconds, and generation averaged 73.6 tok/s across 10 runs (stddev 13.5 tok/s). At 1k context, prefill reached 20,901 tok/s. A stability run completed 10/10 passes with no failures. Together, these results show that 200k-context inference is feasible on consumer hardware with appropriate quantization and speculative decoding, opening the door to local deployment of advanced reasoning models.
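The figures above come from llama-benchy; as a rough sanity check without that harness, TTFT can be approximated from the first streamed byte of vLLM's OpenAI-compatible endpoint. The URL, served model name, and prompt below are assumptions, not values from the post.

```bash
# Rough TTFT check against the running server (not the llama-benchy harness).
# With streaming enabled, curl's time_starttransfer approximates time to the
# first generated token; URL, model name, and prompt are assumptions.
curl -s -o /dev/null \
  -w 'TTFT ~ %{time_starttransfer}s, total %{time_total}s\n' \
  http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen3.6-27B-NVFP4", "prompt": "Explain speculative decoding in one paragraph.", "max_tokens": 128, "stream": true}'
```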
- Model: Qwen3.6-27B-NVFP4, quantized with compressed-tensors, on a single RTX 5090 (32GB VRAM).
- Achieves 200k context length using FP8 KV cache and MTP (3 speculative tokens) in vLLM.
- Performance at 200k: prefill ~2,883 tok/s, generation ~73.6 tok/s, TTFT ~70s, validated over 10 runs.
Why It Matters
Shows that large-context AI models can run efficiently on consumer GPUs, enabling local, private deployment.