Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vLLM 0.19
100+ tokens per second and full 256k context on one consumer GPU.
A new community-driven optimization pushes Qwen3.6-27B to remarkable performance: 105-108 tokens per second of text generation (TG) while retaining the full native 256,000-token context window on a single RTX 5090 (32GB VRAM). The breakthrough pairs Lorbus's INT4 AutoRound quantized model (available on HuggingFace) with vLLM 0.19. Key configuration choices include the flashinfer attention backend, the fp8_e4m3 KV cache dtype, and MTP (multi-token prediction) speculative decoding with 3 speculative tokens, which accelerates generation without compromising output quality.
The community report highlights that this INT4 quantization delivers "decent KLD" (Kullback-Leibler divergence from the full-precision model), notably better than NVFP4 alternatives, while also being the smallest of the quantizations compared. That compact footprint lets the model handle the full 256k context natively, with no token-compression tricks. The setup runs gpu-memory-utilization at 0.93 with a maximum of 2 concurrent sequences, and enables chunked prefill, prefix caching, and automatic tool calling. This makes Qwen3.6-27B a practical, high-throughput option for developers who need long-context inference on consumer hardware.
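As a rough guide, here is a minimal sketch of how these settings could map onto vLLM's offline Python API. The HuggingFace repo id is hypothetical (the report only names Lorbus's INT4 AutoRound model), and the exact `speculative_config` keys for MTP vary across vLLM versions, so treat this as a starting point rather than the exact community recipe.

```python
# Minimal sketch: mapping the reported settings onto vLLM's offline Python API.
import os

# The flashinfer attention backend is selected via an environment variable,
# which must be set before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Lorbus/Qwen3.6-27B-AutoRound-INT4",  # hypothetical repo id; substitute the actual HF path
    max_model_len=262144,          # full native 256k context window
    kv_cache_dtype="fp8_e4m3",     # FP8 KV cache roughly halves cache memory
    gpu_memory_utilization=0.93,   # leave headroom for activations
    max_num_seqs=2,                # at most 2 concurrent sequences
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    # MTP speculative decoding with 3 speculative tokens; the exact
    # "method" value is an assumption and depends on the vLLM version.
    speculative_config={"method": "mtp", "num_speculative_tokens": 3},
)

outputs = llm.generate(
    ["Summarize the key findings of this report in three sentences."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

Auto tool calling is a server-side feature rather than an offline-API argument; when serving, the equivalent would be `vllm serve ... --enable-auto-tool-choice` together with a `--tool-call-parser` that matches the model's chat template.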
- Achieves 105-108 tokens per second generation speed on 1x RTX 5090 (32GB VRAM)
- Runs full native 256k context window without token compression techniques
- Uses Lorbus's INT4 AutoRound quantized model with MTP speculative decoding (3 speculative tokens)
Why It Matters
Long-context, high-speed AI inference is now viable on a single consumer GPU, democratizing advanced LLM deployment.