Open Source

I spent 8+ hours benchmarking every MoE backend for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 (SM120). Here's what I found.

Deep dive reveals 50.5 tok/s max, not 130+, due to broken NVIDIA kernels on new Blackwell workstation GPUs.

Deep Dive

A comprehensive technical benchmark has exposed a critical performance bottleneck for running large MoE (Mixture of Experts) models on NVIDIA's new workstation GPUs. Testing Alibaba's 397-billion-parameter Qwen3.5-397B NVFP4 model on a 4x RTX PRO 6000 Blackwell setup (SM120), the researcher achieved a sustained decode rate of only 50.5 tokens/second using the Marlin W4A16 backend. That figure contradicts community claims of 130+ tok/s, yet it is the best currently achievable: a fundamental bug in NVIDIA's software stack blocks the faster path.

The root cause is that NVIDIA's CUTLASS library, specifically the grouped GEMM kernels designed to use the GPU's native FP4 tensor cores for efficient MoE inference, fails to initialize on the SM120 architecture. This forces a fallback to the Marlin backend, which dequantizes FP4 weights to FP16 and effectively leaves half the theoretical hardware throughput on the table. The bug is isolated to the workstation (SM120) variant; the SM121 variant found in systems such as NVIDIA's DGX Spark works correctly. Attempts to combine Multi-Token Prediction (MTP) with Marlin produced a 22% performance regression, further limiting optimization options. An issue (#3096) has been filed with NVIDIA's CUTLASS team, but it remains unresolved.
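
To make the cost of that fallback concrete, the minimal PyTorch sketch below shows what a W4A16 path does conceptually: 4-bit weights are unpacked and rescaled to FP16 on the fly, and the GEMM then runs at the FP16 tensor-core rate instead of the native FP4 rate. The packing layout, group size, and scale handling here are simplified assumptions for illustration, not the actual Marlin kernel or the NVFP4 format.

```python
# Illustration only: a generic W4A16 "dequantize, then FP16 GEMM" path.
# Layout, group size, and scaling are assumptions, not Marlin/NVFP4 specifics.
import torch

def dequant_w4_to_fp16(packed: torch.Tensor, scales: torch.Tensor,
                       group_size: int = 128) -> torch.Tensor:
    """Unpack two 4-bit values per byte, shift to a signed range, rescale to FP16."""
    w16 = packed.to(torch.int16)
    low, high = w16 & 0x0F, (w16 >> 4) & 0x0F
    w_int = torch.stack([low, high], dim=-1).flatten(start_dim=-2)  # [out, in]
    w = (w_int - 8).to(torch.float16)                               # crude symmetric 4-bit range
    w = w.view(w.shape[0], -1, group_size) * scales.unsqueeze(-1)   # one FP16 scale per group
    return w.reshape(w.shape[0], -1)

out_features, in_features, group_size = 256, 512, 128
packed = torch.randint(0, 256, (out_features, in_features // 2), dtype=torch.uint8)
scales = torch.rand(out_features, in_features // group_size, dtype=torch.float16)
x = torch.randn(4, in_features, dtype=torch.float16)

w_fp16 = dequant_w4_to_fp16(packed, scales, group_size)  # extra dequant work every forward pass
if torch.cuda.is_available():
    x, w_fp16 = x.cuda(), w_fp16.cuda()
    y = x @ w_fp16.t()                     # runs on FP16 tensor cores, ~half the FP4 peak
else:
    y = x.float() @ w_fp16.float().t()     # CPU fallback so the sketch runs anywhere
```

The per-token dequantization overhead plus the halved tensor-core rate is exactly what a working CUTLASS FP4 grouped GEMM path would avoid.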

Key Points
  • Max performance of 50.5 tok/s on 4x RTX PRO 6000 GPUs using the Marlin backend, far below community claims of 130+ tok/s.
  • NVIDIA's CUTLASS kernels fail on the SM120 architecture, blocking access to native FP4 tensor cores and cutting theoretical throughput in half (see the capability check after this list).
  • Multi-Token Prediction (MTP) causes a 22% performance regression with Marlin due to activation mismatches from FP4-to-FP16 dequantization.
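
For the second point, a quick way to check which Blackwell variant a card reports is to query its compute capability. The sketch below assumes a PyTorch install and uses the write-up's mapping of SM120 (FP4 grouped GEMM fails to load) versus SM121 (works); treat that mapping as the report's finding rather than NVIDIA documentation.

```python
import torch

# Print each GPU's compute capability; (12, 0) == SM120 is where the report
# says CUTLASS's FP4 grouped-GEMM kernels fail to initialize.
for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    name = torch.cuda.get_device_name(idx)
    affected = (major, minor) == (12, 0)
    print(f"GPU {idx}: {name} (SM{major}{minor}) -> "
          f"{'expect Marlin fallback' if affected else 'native FP4 path may work'}")
```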

Why It Matters

Professionals investing in high-end NVIDIA workstation GPUs for local AI inference cannot reach the hardware's advertised FP4 throughput on cutting-edge MoE models until this core software bug is fixed.