Qwen3.6 27B hits 1000 tps on V100s with batching
128 concurrent requests push Qwen3.6 to 1000 tokens/sec on older V100 GPUs
Deep Dive
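A Reddit user reported roughly 80 tokens per second (t/s) of single-user generation and about 3000 t/s of prompt processing (prefill) on V100 GPUs, both without multi-token prediction (MTP). With 128 concurrent requests, aggregate generation reached about 1000 t/s, which the user described as a "big number" that is far from their actual needs.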
Key Points
- About 1000 t/s of aggregate generation reported with 128 concurrent requests on V100 GPUs
- Single-user throughput is ~80 t/s for generation and ~3000 t/s for prefill, both without MTP
- Suggests Qwen3.6 27B can run efficiently on older hardware, which could make inference more cost-effective
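To put the reported figures in perspective, here is a rough back-of-the-envelope sketch of what the batched number means per request. It assumes the ~1000 t/s aggregate is split evenly across all 128 requests, which the original report does not state, and the function name `batching_summary` is purely illustrative.

```python
def batching_summary(aggregate_tps: float, concurrent_requests: int, single_user_tps: float) -> dict:
    """Compare batched aggregate throughput against single-user throughput.

    Assumption (not from the source): the aggregate rate is shared
    evenly by all concurrent requests.
    """
    per_request_tps = aggregate_tps / concurrent_requests
    return {
        # Average generation speed each request would see under even sharing
        "per_request_tps": per_request_tps,
        # How much more total output the GPUs produce compared with one user
        "aggregate_gain": aggregate_tps / single_user_tps,
        # How much slower each individual request is compared with one user
        "per_request_slowdown": single_user_tps / per_request_tps,
    }


# Approximate figures from the Reddit report
summary = batching_summary(aggregate_tps=1000.0, concurrent_requests=128, single_user_tps=80.0)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under that even-split assumption, each request would see about 7.8 t/s, so batching raises total output by roughly 12.5x while each individual stream runs about 10x slower than the single-user case. That trade-off fits the poster's remark that the headline figure is far from their actual needs.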
Why It Matters
The report suggests that a modern 27B model can serve batched workloads efficiently on older V100s, which could reduce hardware costs for AI inference.