Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests)
A developer reports 585 tokens/sec aggregate throughput across 8 simultaneous requests on two RTX 3090s, using custom vLLM optimizations.
A developer has demonstrated remarkable performance running Alibaba's Qwen3.5-27B model locally, achieving 100+ tokens/second decode speeds with a 170k-token context window on just two RTX 3090 GPUs. The setup uses vLLM with tensor parallelism across the NVLink-connected GPUs, plus custom optimizations including multi-token prediction (MTP) with 5 speculative tokens instead of the typical 3. The developer compiled vLLM from source and used a specialized quantization approach that keeps the linear attention layers at full precision while quantizing the full attention layers to int4, a format the RTX 3090 supports efficiently in hardware.
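For readers curious how such settings map onto vLLM, the sketch below shows roughly equivalent Python engine arguments. It is a minimal illustration, not the author's launch script: the model ID, memory fraction, and especially the speculative/MTP config keys and the mixed int4 quantization of attention layers are assumptions that vary by vLLM version and checkpoint.

```python
from vllm import LLM, SamplingParams

# Hedged sketch of the reported engine settings, assuming a recent vLLM build.
# Every value below is illustrative; the author's custom mixed-precision
# quantization (full-precision linear attention, int4 full attention) is not
# reproduced here.
llm = LLM(
    model="Qwen/Qwen3.5-27B",        # placeholder model ID / local checkpoint
    tensor_parallel_size=2,          # shard across the two NVLinked RTX 3090s
    max_model_len=170_000,           # ~170k-token context window
    gpu_memory_utilization=0.95,
    speculative_config={             # MTP-style speculative decoding
        "method": "mtp",             # assumed method name; version-dependent
        "num_speculative_tokens": 5, # 5 draft tokens instead of the usual 3
    },
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```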
The configuration delivers exceptional throughput, reaching 585 tokens/second across 8 simultaneous requests while maintaining decode speeds above 60 tokens/second even in worst-case scenarios. Prefill performance is equally impressive at approximately 1500 tokens/second. The developer shared their build and launch scripts, along with fixes for tool-calling bugs in Qwen3.5 when using MTP. This achievement shows how careful optimization of open-source inference engines like vLLM can make large language models with extensive context windows practical for local deployment, potentially reducing reliance on expensive cloud APIs while maintaining competitive performance.
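Aggregate-throughput numbers like the 585 tokens/second figure are typically measured by firing concurrent requests at the running server. The hedged probe below targets a local vLLM OpenAI-compatible endpoint; the base URL, model name, prompts, and token budget are assumptions for illustration, not the author's benchmark scripts.

```python
import asyncio
import time

from openai import AsyncOpenAI

# Hypothetical 8-way concurrency probe against a locally served vLLM
# OpenAI-compatible endpoint (default: http://localhost:8000/v1).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request(prompt: str) -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3.5-27B",  # placeholder; must match the served model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens  # tokens generated for this request


async def main() -> None:
    prompts = [f"Summarize topic {i} in one paragraph." for i in range(8)]
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request(p) for p in prompts))
    elapsed = time.perf_counter() - start
    print(f"~{sum(counts) / elapsed:.0f} tok/s aggregate across 8 requests")


asyncio.run(main())
```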
- Achieved 100+ tokens/sec decode with 170k context on dual RTX 3090s using vLLM and tensor parallelism
- Reached 585 tokens/sec throughput across 8 simultaneous requests with custom int4 quantization for attention layers
- Used MTP (multi-token prediction) with 5 speculative tokens instead of the standard 3, improving mean acceptance length (see the note below)
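As a rough intuition for why acceptance length matters (general speculative-decoding arithmetic, not figures from the post): each verification step costs about one target-model forward pass but emits roughly the mean acceptance length in tokens, so decode speed scales with that length, ignoring draft overhead. The numbers below are made up purely to illustrate the relationship.

```python
# Toy illustration with assumed numbers: higher mean acceptance length from
# MTP speculative decoding means more tokens emitted per verification step.
verification_steps_per_sec = 35  # assumed target-model forward-pass rate
for mean_acceptance in (1.0, 2.0, 3.0):
    print(f"acceptance {mean_acceptance:.1f} -> "
          f"~{verification_steps_per_sec * mean_acceptance:.0f} tok/s decode")
```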
Why It Matters
Shows how carefully optimized local deployment can deliver cloud-level AI performance at significantly lower operational cost.