MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — bench numbers
Benchmark shows new 96GB Blackwell GPUs deliver 2800 tokens/sec at high concurrency, a 22x speedup over single-request throughput.
A detailed community benchmark documents the inference performance of MiniMax's M2.7 model on NVIDIA's newly released professional-grade hardware. The test system paired two RTX PRO 6000 Blackwell GPUs, each with 96GB of VRAM, connected via a PCIe Gen5 switch. Running the SGLang runtime with an NVFP4 (4-bit) quantized version of the model, the setup achieved a peak aggregate throughput of 2800 tokens per second while serving 128 concurrent requests. That is roughly 22x the single-request decode speed of 127.7 tokens/sec, showing how effectively the hardware scales under high-volume inference workloads.
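A quick sanity check on the reported figures shows how the aggregate number decomposes: at 128 concurrent requests, each individual stream still decodes at roughly 22 tokens/sec, far slower per stream than a lone request but vastly higher in total. The sketch below is just arithmetic on the numbers quoted above.

```python
# Sanity-check of the reported scaling figures (numbers taken from the benchmark).
single_req_tps = 127.7   # tokens/sec at 1 concurrent request
aggregate_tps = 2800.0   # tokens/sec at 128 concurrent requests
concurrency = 128

speedup = aggregate_tps / single_req_tps       # ~21.9x, rounded to 22x in the headline
per_request_tps = aggregate_tps / concurrency  # effective tokens/sec seen by each stream

print(f"speedup: {speedup:.1f}x, per-request: {per_request_tps:.1f} tok/s")
# → speedup: 21.9x, per-request: 21.9 tok/s
```

The near-linear gap between 127.7 and 2800 tok/s is typical of batched decoding: each added request costs little extra GPU time until memory bandwidth or KV cache capacity saturates.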
The benchmark provides granular data on both decode throughput and prefill latency across different context lengths. For prefill (initial prompt processing), the system maintained speeds over 17,000 tokens/sec for 8K contexts, dropping to ~9,900 tokens/sec for a full 128K context. The results highlight the trade-offs between concurrency, latency, and context length, noting that the 128K context is practical only for single requests due to KV cache memory limits. This data is crucial for developers and enterprises evaluating the cost-performance of the new Blackwell architecture for deploying medium-sized language models in production environments, offering a real-world preview before widespread availability.
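The prefill throughputs above translate directly into time-to-first-token. The sketch below assumes "8K" means 8192 tokens and "128K" means 131072 tokens; the throughput figures are the ones reported in the benchmark.

```python
# Rough time-to-first-token implied by the reported prefill throughputs.
# Assumes "8K" = 8192 tokens and "128K" = 131072 tokens.
prefill_tps = {8_192: 17_000, 131_072: 9_900}  # context tokens -> prefill tokens/sec

for ctx, tps in prefill_tps.items():
    ttft = ctx / tps  # seconds before the first output token can decode
    print(f"{ctx // 1024}K context: ~{ttft:.1f} s prefill")
# → 8K context: ~0.5 s prefill
# → 128K context: ~13.2 s prefill
```

In other words, even though per-token prefill speed only drops by about 40% from 8K to 128K, the absolute wait before the first generated token grows from under a second to over thirteen seconds.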
- Peak throughput of 2800 tokens/sec achieved with 128 concurrent requests on dual RTX PRO 6000 Blackwell GPUs.
- Hardware setup includes two 96GB GPUs and uses an NVFP4 (4-bit) quantized version of MiniMax's M2.7 model.
- Benchmark details performance trade-offs, showing 128K context processing is limited to single requests due to KV cache memory constraints.
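The KV cache constraint in the last bullet can be illustrated with standard transformer cache arithmetic. Every model dimension below (layer count, KV heads, head size, FP8 cache, free VRAM) is a placeholder assumption for illustration, not MiniMax's published configuration; the point is only that a 128K-token cache easily reaches tens of gigabytes per request.

```python
# Illustrative KV-cache sizing. Per-token cost = 2 (K and V) * layers * kv_heads
# * head_dim * bytes_per_element. ALL dimensions here are ASSUMED placeholders,
# not MiniMax's actual architecture.
layers, kv_heads, head_dim = 60, 8, 128  # hypothetical GQA configuration
bytes_per_elem = 1                       # FP8 KV cache (assumption)

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # bytes per cached token
ctx_128k = 131_072
per_request_gb = per_token * ctx_128k / 2**30

free_vram_gb = 40  # assumed VRAM left over after weights and activations
max_concurrent = int(free_vram_gb * 2**30 // (per_token * ctx_128k))
print(f"{per_request_gb:.1f} GiB of KV cache per 128K request; "
      f"~{max_concurrent} such requests fit in {free_vram_gb} GiB")
# → 15.0 GiB of KV cache per 128K request; ~2 such requests fit in 40 GiB
```

With per-request cache sizes in this range, the remaining VRAM after loading even a 4-bit model supports only a handful of full-context streams, which is consistent with the benchmark's observation that 128K contexts are practical only at single-request concurrency.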
Why It Matters
Provides early real-world performance data for NVIDIA's workstation-class Blackwell GPUs, guiding enterprise AI infrastructure decisions.