Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results
A custom 2x RTX PRO 6000 Blackwell build achieves a record 198 tokens/second on a 122B-parameter model using a PCIe switch.
A developer known as Visual Synthesizer has demonstrated a highly optimized 2-GPU inference server that reaches a remarkable 198 tokens/second on the Qwen3.5-122B model. The build uses two NVIDIA RTX PRO 6000 Blackwell GPUs (96GB GDDR7 each), an AMD EPYC 4564P CPU, 128GB DDR5 ECC RAM, and a c-payne PM50100 Gen5 PCIe switch on an ASRock Rack B650D4U server board. The result was produced with the SGLang inference engine using b12x kernels and speculative decoding, running an FP4-quantized (modelopt_fp4) checkpoint of the model. Three verification runs showed consistent results of 197, 200, and 198 tok/s, and a curl test confirmed 2000 tokens generated in 12.7 seconds.
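The curl check is straightforward to reproduce once the server is running. Here is a minimal sketch of the same measurement, assuming SGLang's OpenAI-compatible API on its default port 30000 and a hypothetical served-model name; both may differ from the actual configuration in the repo.

```python
"""Throughput spot-check against a local SGLang server (sketch).

Assumptions, not taken from the original post: the server listens on
localhost:30000 (SGLang's default) and the model is served under the
hypothetical name "qwen3.5-122b-fp4".
"""
import time

import requests

URL = "http://localhost:30000/v1/completions"  # assumed endpoint/port

payload = {
    "model": "qwen3.5-122b-fp4",  # hypothetical served-model name
    "prompt": "Explain PCIe peer-to-peer transfers in one paragraph.",
    "max_tokens": 2000,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.perf_counter() - start
resp.raise_for_status()

tokens = resp.json()["usage"]["completion_tokens"]
# The end-to-end rate includes prompt processing (TTFT), so it reads a
# little below the steady-state decode speed.
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s end-to-end")
```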
The secret to this performance is the PCIe switch topology (PIX), which routes peer-to-peer GPU communication directly through the switch fabric at sub-microsecond latency, bypassing the CPU root complex. This is critical for Mixture-of-Experts (MoE) tensor parallelism, where each forward pass involves dozens of small all-reduce operations that are latency-sensitive rather than bandwidth-bound. The setup also runs the CPU under the performance governor and sets specific kernel and driver parameters (pci=noacs, uvm_disable_hmm=1) to prevent P2P hangs. The server holds up across context lengths: time-to-first-token (TTFT) scales from 1.8 seconds at 4K context to 23.3 seconds at 150K context, while decode speed stays steady at ~198 tok/s. All benchmark data, raw JSONs, and configuration details are publicly available on GitHub.
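Whether peer-to-peer traffic actually takes the switch path shows up in the driver's topology matrix. The following sketch (assuming only that nvidia-smi is on PATH) checks for the PIX relationship between the two GPUs:

```python
"""Topology sanity check (sketch). `nvidia-smi topo -m` prints an
interconnect matrix; PIX in the GPU0<->GPU1 cell means the path crosses
at most one PCIe bridge (here, the PM50100 switch), while NODE or SYS
means traffic detours through the CPU root complex."""
import subprocess

matrix = subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout
print(matrix)

gpu_rows = [line for line in matrix.splitlines() if line.startswith("GPU")]
if not any("PIX" in row for row in gpu_rows):
    raise SystemExit("GPUs not behind a common PCIe switch; P2P will cross the root complex")
```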
- Achieved 198 tokens/second on Qwen3.5-122B using two RTX PRO 6000 Blackwell GPUs and a PCIe switch for direct GPU communication.
- PCIe switch topology (PIX) cuts latency for MoE tensor parallelism, making the build 18% faster than a comparable Threadripper setup (a rough latency-budget sketch follows this list).
- Full methodology and verification data (three runs: 197, 200, 198 tok/s) are publicly available on GitHub for replication.
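To make the latency point concrete, here is some back-of-the-envelope arithmetic. Every figure below is an assumption for illustration, not a measurement from the post; it only shows that at these message sizes, per-operation latency, not bandwidth, is what consumes the token budget.

```python
"""Illustrative latency budget for tensor-parallel decode (all figures assumed)."""
layers = 60                   # assumed transformer depth for a ~122B model
allreduces_per_layer = 2      # typical TP pattern: one after attention, one after MLP
payload_bytes = 16 * 1024     # per-token hidden-state slice, assumed
bw = 64e9                     # ~PCIe Gen5 x16 bandwidth in bytes/s, assumed

xfer_us = payload_bytes / bw * 1e6   # ~0.26 us: moving the data is nearly free
token_budget_ms = 1000 / 198         # ~5.05 ms available per token at 198 tok/s

for hop_us in (2.0, 10.0):           # assumed switch-hop vs. root-complex latency
    comm_ms = layers * allreduces_per_layer * (hop_us + xfer_us) / 1000
    print(f"{hop_us:4.1f} us/op -> {comm_ms:.2f} ms of the {token_budget_ms:.2f} ms token budget")
```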
Why It Matters
This demonstrates how optimized hardware topology, not just raw GPU power, can drastically improve inference speed for large language models, making high-performance AI more accessible.