Open Source

Qwen 3.6 27B MTP on V100 32GB: 54 t/s

An old V100 GPU runs a 27B model at 54 tokens per second using a new multi-token prediction trick.

Deep Dive

A developer achieved 54-55 tokens per second on a V100 32GB GPU using am17an's MTP branch of llama.cpp. Without MTP, speed was 29-30 t/s. The setup used q8_0 KV cache quantization, a 200k-token context limit, and a 150W GPU power cap. The model performed well as a VSCode copilot for tool calls, sub-agents, and code reviews, though speed fell to 40-45 t/s after processing 50k tokens.
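The post does not include the exact command, but a hypothetical invocation approximating this setup might look like the sketch below. Flag names are from mainline llama.cpp (`-ngl`, `-c`, `-ctk`, `-ctv`); am17an's MTP branch may add its own options, the model filename is illustrative, and on mainline a q8_0-quantized V cache typically also requires flash attention to be enabled.

```shell
# Cap the V100's power draw at 150 W (requires root; resets on reboot)
sudo nvidia-smi -pl 150

# Launch llama-server from am17an's MTP branch of llama.cpp:
#   -ngl 99      offload all layers to the GPU
#   -c 200000    200k-token context / KV-cache limit
#   -ctk/-ctv    q8_0-quantized KV cache
# Model filename is illustrative, not from the original post.
./llama-server -m qwen-27b-q4_k_m.gguf \
  -ngl 99 -c 200000 -ctk q8_0 -ctv q8_0
```

Quantizing the KV cache to q8_0 roughly halves its memory footprint versus f16, which is what makes a 200k-token context feasible in 32GB alongside the model weights.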

Key Points
  • Achieved 54-55 t/s on a V100 32GB GPU using am17an's MTP branch of llama.cpp (vs 29-30 t/s without MTP).
  • Used q8_0 KV cache with 200k token limit and 150W power cap; speed dropped to 40-45 t/s after 50k tokens.
  • Successfully handled tool calls, sub-agents, and code reviews as a local VSCode copilot.

Why It Matters

Enables powerful local AI coding assistants on older GPUs, reducing hardware costs for developers.