Open Source

Qwen3.6 27B hits 1000 tps on V100s with batching

r/LocalLLaMA May 25, 2026

⚡128 concurrent requests push Qwen3.6 to 1000 tokens/sec on older V100 GPUs

Deep Dive

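A Reddit user reported about 80 tokens per second (t/s) for single-user generation and about 3000 t/s for prompt processing (prefill) with Qwen3.6 27B on V100 GPUs, without multi-token prediction (MTP). With 128 concurrent requests, generation throughput reached roughly 1000 t/s, a result the user described as a "big number" that is far from their actual needs.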

Key Points

1000 tokens per second generation achieved with 128 concurrent requests on V100 GPUs
Single-user throughput is ~80 t/s generation and 3000 t/s prefill without MTP
Highlights Qwen3.6 27B's efficiency on older hardware, enabling cost-effective inference
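
To put these figures in perspective, the short sketch below works through the arithmetic. It assumes the 1000 t/s figure is the aggregate across all 128 requests and that throughput is split evenly between them; neither assumption is confirmed by the original post. The function names are illustrative only.

```python
# Back-of-the-envelope arithmetic on the reported figures.
# Assumption: 1000 t/s is the total across all 128 requests, shared evenly.

def per_request_rate(aggregate_tps: float, concurrent_requests: int) -> float:
    """Average generation speed seen by each request, in tokens per second."""
    if concurrent_requests <= 0:
        raise ValueError("concurrent_requests must be positive")
    return aggregate_tps / concurrent_requests


def batching_gain(aggregate_tps: float, single_user_tps: float) -> float:
    """How many times more total throughput batching gives than one user alone."""
    if single_user_tps <= 0:
        raise ValueError("single_user_tps must be positive")
    return aggregate_tps / single_user_tps


AGGREGATE_TPS = 1000.0    # reported with 128 concurrent requests
CONCURRENT_REQUESTS = 128
SINGLE_USER_TPS = 80.0    # reported single-user generation speed

per_request = per_request_rate(AGGREGATE_TPS, CONCURRENT_REQUESTS)
gain = batching_gain(AGGREGATE_TPS, SINGLE_USER_TPS)

print(f"Per-request speed under load: {per_request:.1f} t/s")
print(f"Total throughput vs. single user: {gain:.1f}x")
```

Under these assumptions, each of the 128 requests would see under 8 t/s, about a tenth of the single-user speed, while total throughput is 12.5 times higher. That gap is consistent with the user's remark that the headline number is far from what they actually need.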

Why It Matters

The report suggests that a modern 27B model can be served efficiently on older V100 GPUs, which could reduce hardware costs for AI inference.
