55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell
A targeted CUTLASS patch fixed the MoE GEMM tiles SM120 was skipping, pushing throughput from 55 to 282 tokens/second.
An engineer from Verdict AI has released a critical optimization that dramatically speeds up large Mixture-of-Experts (MoE) models like Qwen3.5-397B on NVIDIA's new Blackwell workstation GPUs. The problem stemmed from a mismatch in CUTLASS, NVIDIA's core linear algebra library: its autotuner budgets shared memory for datacenter Blackwell chips (B200) with 228KB per streaming multiprocessor (SM), but workstation GPUs like the RTX PRO 6000 (SM120) expose only 99KB. Every optimized tile configuration therefore overflowed the budget and was skipped, forcing the model onto slow fallback kernels and leaving over 50% of potential performance unused.
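To make the failure mode concrete, here is a minimal sketch of the kind of shared-memory check involved, assuming a 4-stage pipeline and a simplified footprint formula; the names, stage count, and formula are illustrative, not CUTLASS's actual builder logic:

```cpp
#include <cstdio>

struct TileShape { int m, n, k; };

// Rough per-tile SMEM footprint for NVFP4 operands: A (m*k) and B (n*k) at
// 4 bits/element, plus one FP8 scale factor per 16 elements along K,
// multiplied by the pipeline stage count. The formula and the 4-stage
// pipeline are simplifying assumptions for illustration.
constexpr int smem_bytes(TileShape t, int stages = 4) {
    int a  = t.m * t.k / 2;             // FP4 operand A
    int b  = t.n * t.k / 2;             // FP4 operand B
    int sf = (t.m + t.n) * (t.k / 16);  // FP8 block scale factors
    return stages * (a + b + sf);
}

int main() {
    constexpr int kB200Smem  = 228 * 1024;  // datacenter Blackwell budget
    constexpr int kSm120Smem = 99 * 1024;   // RTX PRO 6000 (SM120) budget

    const TileShape candidates[] = { {128, 256, 128}, {128, 256, 64} };
    for (TileShape t : candidates) {
        int bytes = smem_bytes(t);
        // A K=128 tile budgeted against the 228KB datacenter limit
        // overflows SM120's 99KB, so every optimized config is rejected
        // there and execution falls back to a slow generic kernel; the
        // K=64 shape fits comfortably.
        std::printf("%3dx%3dx%3d: %6d B  SM100 %-4s SM120 %s\n",
                    t.m, t.n, t.k, bytes,
                    bytes <= kB200Smem  ? "ok" : "skip",
                    bytes <= kSm120Smem ? "ok" : "skip");
    }
}
```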
The fix is a targeted patch to CUTLASS's `sm120_blockscaled_mma_builder.inl`. Switching from the default K=128 tile shapes to smaller K=64 shapes brings the tiles within the 99KB SMEM limit, but exposed a layout mismatch in how the block scale factors are computed and fed into the MMA; the patch corrects that layout. The result is a massive performance uplift: on a 4x RTX PRO 6000 setup running Qwen3.5-397B-A17B-NVFP4, throughput climbed from a baseline of 55 tokens/second (measured under WSL2) to 282 tokens/second, a 5x improvement.
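For intuition on the layout side, here is a minimal sketch of the scale-factor bookkeeping involved; the identifiers are hypothetical, not the actual types from `sm120_blockscaled_mma_builder.inl`. NVFP4 stores one FP8 scale factor per 16-element block along K, so halving the K tile halves the number of scale columns the MMA's scale-factor layout must describe:

```cpp
#include <cstdio>

// One scale factor covers 16 consecutive FP4 elements along K (NVFP4).
constexpr int kSfVectorSize = 16;

struct SfLayout {
    int rows;     // tile extent in M (or N for the B operand)
    int sf_cols;  // scale factors along K
};

constexpr SfLayout make_sf_layout(int rows, int tile_k) {
    return { rows, tile_k / kSfVectorSize };
}

int main() {
    // Default K=128 tile: 8 scale columns per row.
    SfLayout sf128 = make_sf_layout(128, 128);
    // K=64 tile: 4 scale columns per row. If the builder's layout still
    // assumes 8, the kernel addresses scales for K-blocks the smaller tile
    // never touches -- the kind of mismatch the patch corrects.
    SfLayout sf64  = make_sf_layout(128, 64);

    std::printf("K=128: %d x %d scale factors per A tile\n", sf128.rows, sf128.sf_cols);
    std::printf("K=64 : %d x %d scale factors per A tile\n", sf64.rows,  sf64.sf_cols);
}
```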
Verdict AI has submitted a pull request to the FlashInfer project and provides a pre-built Docker image (`verdictai/vllm-blackwell-k64`) for easy deployment. The optimization is crucial for professionals running state-of-the-art MoE models on high-end workstation hardware, effectively bridging the performance gap with far more expensive datacenter systems.
- Patched CUTLASS to fix SM120's MoE GEMM tile overflow, enabling K=64 shapes on 99KB shared memory.
- Boosted Qwen3.5-397B inference speed from 55 to 282 tokens/sec (5x faster) on 4x RTX PRO 6000 GPUs.
- Pre-built Docker image available, with PR submitted to FlashInfer for community-wide Blackwell optimization.
Why It Matters
Unlocks datacenter-level inference speeds for massive 397B-parameter MoE models on affordable workstation GPUs, democratizing high-performance AI.