Qwen 3.6 27B MTP on v100 32GB: 54 t/s
An aging V100 GPU runs a 27B model at 54 tokens per second using a new multi-token prediction (MTP) trick.
Deep Dive
A developer achieved 54-55 tokens per second on a V100 32GB GPU using am17an's multi-token prediction (MTP) branch of llama.cpp; without MTP, the same setup managed 29-30 t/s. The configuration used a q8_0-quantized KV cache, a 200k-token context limit, and a 150W GPU power cap. The model performed well as a local VSCode copilot for tool calls, sub-agents, and code reviews, though speed fell to 40-45 t/s once the context grew past 50k tokens.
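A setup like the one described might be launched roughly as follows. This is a sketch, not the developer's actual command: the model filename and quantization are hypothetical, and MTP support comes from building am17an's branch of llama.cpp rather than from any flag shown here.

```shell
# Cap the V100's power draw at 150 W, as reported (requires root).
sudo nvidia-smi -pl 150

# Serve the model with a q8_0-quantized KV cache and a 200k-token context.
# -ctk/-ctv set the KV-cache quantization for keys and values,
# -c sets the context size, and -ngl offloads all layers to the GPU.
# The model filename below is a placeholder, not from the source.
./llama-server -m model-27b-q4_k_m.gguf -c 200000 -ctk q8_0 -ctv q8_0 -ngl 99
```

Quantizing the KV cache to q8_0 roughly halves its memory footprint versus f16, which is what makes a 200k-token context plausible within 32GB alongside the model weights.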
Key Points
- Achieved 54-55 t/s on a V100 32GB GPU using am17an's MTP branch of llama.cpp (vs 29-30 t/s without MTP).
- Used a q8_0 KV cache with a 200k-token context limit and a 150W power cap; speed dropped to 40-45 t/s beyond 50k tokens.
- Successfully handled tool calls, sub-agents, and code reviews as a local VSCode copilot.
Why It Matters
Enables capable local AI coding assistants on older, cheaper GPUs, lowering the hardware bar for developers.