Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vLLM 0.19
100+ tokens per second and full 256k context on one consumer GPU.
A new community-driven optimization pushes Qwen3.6-27B to remarkable performance: 105-108 tokens per second of text generation (TG) while retaining the full native 256,000-token context window on a single RTX 5090 (32GB VRAM). The breakthrough pairs Lorbus's INT4 AutoRound quantized model (available on HuggingFace) with vLLM 0.19. Key configuration choices include the flashinfer attention backend, the fp8_e4m3 KV cache dtype, and MTP (multi-token prediction) speculative decoding with 3 speculative tokens, which accelerates generation without compromising output quality.
The community report highlights that this INT4 quantization delivers "decent KLD" (Kullback-Leibler divergence from the full-precision model), notably better than NVFP4 alternatives, while also being the smallest of the quantizations compared. That compact footprint lets the model handle the full 256k context natively, with no token-compression tricks. The setup runs gpu-memory-utilization at 0.93 with a maximum of 2 concurrent sequences, and enables chunked prefill, prefix caching, and automatic tool calling. This makes Qwen3.6-27B a practical, high-throughput option for developers who need long-context inference on consumer hardware.
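As a rough guide, here is a minimal sketch of how these settings could map onto vLLM's offline Python API. The HuggingFace repo id is hypothetical (the report only names Lorbus's INT4 AutoRound model), and the exact `speculative_config` keys for MTP vary across vLLM versions, so treat this as a starting point rather than the exact community recipe.

```python
# Minimal sketch: mapping the reported settings onto vLLM's offline Python API.
import os

# The flashinfer attention backend is selected via an environment variable,
# which must be set before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Lorbus/Qwen3.6-27B-AutoRound-INT4",  # hypothetical repo id; substitute the actual HF path
    max_model_len=262144,          # full native 256k context window
    kv_cache_dtype="fp8_e4m3",     # FP8 KV cache roughly halves cache memory
    gpu_memory_utilization=0.93,   # leave headroom for activations
    max_num_seqs=2,                # at most 2 concurrent sequences
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    # MTP speculative decoding with 3 speculative tokens; the exact
    # "method" value is an assumption and depends on the vLLM version.
    speculative_config={"method": "mtp", "num_speculative_tokens": 3},
)

outputs = llm.generate(
    ["Summarize the key findings of this report in three sentences."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

Auto tool calling is a server-side feature rather than an offline-API argument; when serving, the equivalent would be `vllm serve ... --enable-auto-tool-choice` together with a `--tool-call-parser` that matches the model's chat template.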
- Achieves 105-108 tokens per second generation speed on 1x RTX 5090 (32GB VRAM)
- Runs full native 256k context window without token compression techniques
- Uses Lorbus's INT4 AutoRound quantized model with MTP speculative decoding (3 speculative tokens)
Why It Matters
Long-context, high-speed AI inference is now viable on a single consumer GPU, democratizing advanced LLM deployment.