Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests)
A developer reports 585 tokens/sec aggregate throughput across 8 simultaneous requests on two RTX 3090s, using custom vLLM optimizations.
A developer has demonstrated remarkable performance running Alibaba's Qwen3.5-27B model locally, achieving 100+ tokens/second decode speeds with a 170k-token context window on just two RTX 3090 GPUs. The setup uses vLLM with tensor parallelism across the NVLink-connected GPUs, plus custom optimizations including multi-token prediction (MTP) with 5 speculative tokens instead of the typical 3. The developer compiled vLLM from source and used a specialized quantization approach that keeps the linear attention layers at full precision while quantizing the full attention layers to int4, a format the RTX 3090 supports efficiently in hardware.
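For readers curious how such settings map onto vLLM, the sketch below shows roughly equivalent Python engine arguments. It is a minimal illustration, not the author's launch script: the model ID, memory fraction, and especially the speculative/MTP config keys and the mixed int4 quantization of attention layers are assumptions that vary by vLLM version and checkpoint.

```python
from vllm import LLM, SamplingParams

# Hedged sketch of the reported engine settings, assuming a recent vLLM build.
# Every value below is illustrative; the author's custom mixed-precision
# quantization (full-precision linear attention, int4 full attention) is not
# reproduced here.
llm = LLM(
    model="Qwen/Qwen3.5-27B",        # placeholder model ID / local checkpoint
    tensor_parallel_size=2,          # shard across the two NVLinked RTX 3090s
    max_model_len=170_000,           # ~170k-token context window
    gpu_memory_utilization=0.95,
    speculative_config={             # MTP-style speculative decoding
        "method": "mtp",             # assumed method name; version-dependent
        "num_speculative_tokens": 5, # 5 draft tokens instead of the usual 3
    },
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```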
The configuration delivers exceptional throughput, reaching 585 tokens/second across 8 simultaneous requests while maintaining decode speeds above 60 tokens/second even in worst-case scenarios. Prefill performance is equally impressive at approximately 1500 tokens/second. The developer shared their build and launch scripts, along with fixes for tool-calling bugs in Qwen3.5 when using MTP. This achievement shows how careful optimization of open-source inference engines like vLLM can make large language models with extensive context windows practical for local deployment, potentially reducing reliance on expensive cloud APIs while maintaining competitive performance.
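Aggregate-throughput numbers like the 585 tokens/second figure are typically measured by firing concurrent requests at the running server. The hedged probe below targets a local vLLM OpenAI-compatible endpoint; the base URL, model name, prompts, and token budget are assumptions for illustration, not the author's benchmark scripts.

```python
import asyncio
import time

from openai import AsyncOpenAI

# Hypothetical 8-way concurrency probe against a locally served vLLM
# OpenAI-compatible endpoint (default: http://localhost:8000/v1).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request(prompt: str) -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3.5-27B",  # placeholder; must match the served model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens  # tokens generated for this request


async def main() -> None:
    prompts = [f"Summarize topic {i} in one paragraph." for i in range(8)]
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request(p) for p in prompts))
    elapsed = time.perf_counter() - start
    print(f"~{sum(counts) / elapsed:.0f} tok/s aggregate across 8 requests")


asyncio.run(main())
```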
- Achieved 100+ tokens/sec decode with 170k context on dual RTX 3090s using vLLM and tensor parallelism
- Reached 585 tokens/sec throughput across 8 simultaneous requests with custom int4 quantization for attention layers
- Used MTP (multi-token prediction) with 5 speculative tokens instead of the standard 3, improving mean acceptance length (see the note below)
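As a rough intuition for why acceptance length matters (general speculative-decoding arithmetic, not figures from the post): each verification step costs about one target-model forward pass but emits roughly the mean acceptance length in tokens, so decode speed scales with that length, ignoring draft overhead. The numbers below are made up purely to illustrate the relationship.

```python
# Toy illustration with assumed numbers: higher mean acceptance length from
# MTP speculative decoding means more tokens emitted per verification step.
verification_steps_per_sec = 35  # assumed target-model forward-pass rate
for mean_acceptance in (1.0, 2.0, 3.0):
    print(f"acceptance {mean_acceptance:.1f} -> "
          f"~{verification_steps_per_sec * mean_acceptance:.0f} tok/s decode")
```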
Why It Matters
Shows how carefully optimized local deployment can deliver cloud-level AI performance at significantly lower operational cost.