Open Source

Community tests of Qwen3.5 on V100s

Open-source AI model achieves 80 tokens/second on 7-year-old V100 GPUs, making high-speed inference affordable.

Deep Dive

Alibaba's open-source Qwen3.5 language models are demonstrating that cutting-edge AI inference doesn't require cutting-edge hardware. Community tests show the 27B and 35B parameter versions reaching 40 tokens per second in standard dense mode and 80 t/s when running as a Mixture-of-Experts (MoE) model. These numbers were recorded on two 7-year-old NVIDIA V100 GPUs connected via NVLink, challenging the assumption that only the newest H100 or Blackwell GPUs can deliver usable inference speeds for large models. The results suggest that efficient model architectures and software optimization can breathe new life into existing data center infrastructure.
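Conceptually, an MoE layer routes each token through only a few small expert networks instead of the whole stack of weights, which is what makes the higher decoding speed possible. Below is a minimal sketch of that top-k routing pattern; the dimensions, expert count, and top-k value are illustrative placeholders, not Qwen3.5's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal Mixture-of-Experts layer: a router scores the experts for
    each token and only the top-k of them run, so most of the weights
    stay idle on any given decoding step."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # per-token mixing weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

With top_k=2 of 8 experts, each token exercises only a quarter of the expert weights per layer, which is the mechanism behind dense-versus-MoE throughput gaps like the one reported here.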

Technically, the tests employed a 'graph split' technique to distribute the computational load across the two GPUs. The near doubling of speed in MoE mode, where only a subset of the model's 'experts' is activated per token, highlights the architecture's efficiency gains. For enterprises and researchers, this translates into the ability to deploy capable, modern AI assistants, coding copilots, or analytical tools on hardware that is often written off as legacy. That dramatically reduces the total cost of ownership for AI workloads and could accelerate adoption in cost-sensitive sectors such as academia and smaller tech firms, a reminder that the AI hardware arms race has a compelling software counterpoint.
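The source doesn't name the inference stack or describe the exact split, so the following is only a simplified illustration of the general idea: a pipeline-style PyTorch sketch that assumes two CUDA devices and places the first half of the layers on cuda:0 and the second half on cuda:1, so activations cross the interconnect once per forward pass.

```python
import torch
import torch.nn as nn

class SplitStack(nn.Module):
    """Toy two-way split of a transformer stack: the first half of the
    layers lives on one GPU, the second half on the other, and the
    activations hop across the interconnect once per forward pass."""

    def __init__(self, n_layers=8, d_model=512, dev0="cuda:0", dev1="cuda:1"):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead=8,
                                                  batch_first=True)
        half = n_layers // 2
        self.front = nn.Sequential(*(make() for _ in range(half))).to(dev0)
        self.back = nn.Sequential(*(make() for _ in range(n_layers - half))).to(dev1)

    def forward(self, x):
        x = self.front(x.to(self.dev0))
        return self.back(x.to(self.dev1))  # one cross-GPU activation transfer

if torch.cuda.device_count() >= 2:
    model = SplitStack()
    print(model(torch.randn(1, 16, 512)).shape)  # torch.Size([1, 16, 512])
```

A layer-wise split like this keeps cross-GPU traffic down to a single activation transfer per step, in contrast to tensor-parallel splits that communicate inside every layer, which is where a fast link such as NVLink earns its keep.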

Key Points
  • Qwen3.5 27B/35B models hit 80 tokens/sec in MoE mode on dual V100s
  • Performance doubles from 40 t/s (dense) to 80 t/s (MoE) via efficient activation
  • Uses graph splitting across two NVLink-connected GPUs to balance load on older hardware

Why It Matters

Lowers AI deployment costs by getting more out of legacy hardware, enabling wider access to advanced models without massive GPU investments.