Open Source

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090, served by vLLM 0.19

New NVFP4-quantized model hits 80 tokens/sec on a consumer GPU.

Deep Dive

Qwen3.6-27B, the latest iteration of Alibaba's Qwen model family, can now run at approximately 80 tokens per second on a single NVIDIA RTX 5090 GPU, thanks to a new NVFP4-quantized build with multi-token prediction (MTP). The quantized model, uploaded to Hugging Face by user sakamakismile, uses the vLLM 0.19.1rc1 inference engine to reach this speed while maintaining a 218k-token context window. This is a significant leap in local LLM inference efficiency: earlier setups needed multiple GPUs, or ran markedly slower, to serve similar context lengths.
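A back-of-the-envelope calculation shows why 4-bit quantization is what makes a 27B-parameter model fit on a single 32 GB RTX 5090 at all. The sketch below assumes NVFP4 stores 4-bit values plus roughly one 8-bit scale per 16-element block (the published format's layout); the numbers ignore activations, the KV cache, and any layers kept in higher precision, so they are a lower bound, not a measurement from this recipe.

```python
# Rough weight-memory estimate for a 27B-parameter model.
# Assumption: NVFP4 = 4-bit values + one 8-bit scale per 16-element
# block (~4.5 bits/param). Activations and KV cache are excluded.

PARAMS = 27e9  # 27B parameters, per the model name


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Approximate weight memory in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(16)            # unquantized half precision
nvfp4_gb = weight_gb(4 + 8 / 16)   # 4-bit values + shared block scales

print(f"FP16:  {fp16_gb:.1f} GB")   # ~54 GB: far over a 32 GB card
print(f"NVFP4: {nvfp4_gb:.1f} GB")  # ~15.2 GB: leaves room for KV cache
```

The roughly 39 GB saved on weights is what leaves headroom for the long-context KV cache on a 32 GB card.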

The achievement builds on earlier work with Qwen3.5-27B, which reached 77 tps on the same hardware, and demonstrates the rapid optimization of open-source models for consumer-grade GPUs. The NVFP4 quantization reduces memory footprint without major accuracy loss, while MTP improves throughput by predicting multiple tokens in parallel. This setup allows developers and researchers to run large language models locally for tasks like long-document analysis, code generation, and chat applications, all without cloud dependency. The recipe is publicly available, enabling others to replicate or adapt it for different models.
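For readers who want to try a similar setup, a serving command along these lines is the usual vLLM pattern. This is a hypothetical sketch, not the published recipe: the Hugging Face repo id is a placeholder, and the flag values (memory utilization, the exact quantization flag, which is often auto-detected from the checkpoint config) are assumptions to be adjusted against the actual upload.

```shell
# Hypothetical vLLM serving command for an NVFP4 checkpoint.
# The repo id below is a placeholder, not the actual upload.
# --max-model-len sets the ~218k context window;
# --quantization modelopt_fp4 selects NVFP4 kernels (vLLM can often
# detect this automatically from the checkpoint config).
vllm serve <hf-user>/<qwen3.6-27b-nvfp4-repo> \
  --max-model-len 218000 \
  --gpu-memory-utilization 0.95 \
  --quantization modelopt_fp4
```

Once running, the server exposes an OpenAI-compatible endpoint on port 8000 by default, so existing client code can point at it without changes.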

Key Points
  • Qwen3.6-27B achieves ~80 tokens per second on a single RTX 5090
  • Supports a 218k token context window using vLLM 0.19.1rc1
  • NVFP4 quantization with MTP enables efficient local inference

Why It Matters

Enables high-speed, large-context LLM inference on consumer hardware, democratizing AI access.