Open Source

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings

A deep-dive review reveals the RTX 6000 Max-Q's blistering 120+ tokens/sec inference speed but a problematic 15-minute vLLM startup time.

Deep Dive

An in-depth user review of NVIDIA's professional-grade RTX 6000 Max-Q GPU highlights its formidable capabilities and significant quirks for running large language models locally. The card excels at inference, running the Qwen3-Coder-Next model with its full 262k-token context entirely in VRAM at over 120 tokens per second using FP8 quantization. However, the experience is marred by severe operational issues, including a painfully long 15-minute startup when running vLLM inside Docker containers and a loud, high-pitched coil whine that the reviewer notes would be intolerable in a shared workspace. Power draw is another critical consideration: the full system pulls over 600 W under load, which tripped the breaker on a 900 W UPS even though the machine was plugged into one of the UPS's non-battery-backed outlets.
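
The review does not include the exact launch configuration, but a minimal vLLM offline-inference sketch in Python gives a sense of the settings involved. The model id, context length, and memory-utilization value below are illustrative assumptions, not details confirmed by the reviewer.

  # Minimal sketch of FP8 inference with a long context window via vLLM's Python API
  # (illustrative settings only, not the reviewer's exact configuration).
  from vllm import LLM, SamplingParams

  MODEL_ID = "Qwen/Qwen3-Coder-Next"  # placeholder id; substitute the FP8 checkpoint actually used

  llm = LLM(
      model=MODEL_ID,
      quantization="fp8",           # FP8 weights keep the model entirely in VRAM
      max_model_len=262144,         # the ~262k context window cited in the review
      gpu_memory_utilization=0.95,  # leave headroom for the KV cache
  )

  params = SamplingParams(temperature=0.2, max_tokens=512)
  outputs = llm.generate(["Write a function that merges two sorted lists."], params)
  print(outputs[0].outputs[0].text)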

The technical deep dive covers specific workarounds and performance characteristics. The reviewer recommends NVIDIA's open kernel-module drivers over the proprietary ones, plus a custom fan curve, since the Max-Q's default cooling behavior is insufficient under sustained load. While vLLM in Docker suffers the long startup delay, running it directly on the host OS or switching to llama.cpp avoids the problem. The reviewer also tested the Qwen3.5-122B model with a 4-bit quant and found it superior, and the card's 96GB of VRAM proved ample enough that a second GPU wasn't deemed necessary. For professionals, the card is a powerful but finicky tool: it offers state-of-the-art local inference speed for large models but demands careful setup and tolerance for its substantial noise and power footprint.
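
On the thermal and power side, the reviewer's exact tooling isn't given, but a small monitoring loop over NVIDIA's NVML bindings (the pynvml package) is one way to watch power draw, temperature, and fan speed while a model is serving. Treating the card as GPU index 0 and the two-second polling interval are assumptions.

  # Sketch: poll GPU power, temperature, and fan speed through NVML (pynvml).
  # Generic monitoring loop, not a tool taken from the review.
  import time
  import pynvml

  pynvml.nvmlInit()
  handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the RTX 6000 Max-Q is GPU 0

  try:
      while True:
          power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
          temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
          fan_pct = pynvml.nvmlDeviceGetFanSpeed(handle)             # percent of maximum fan speed
          print(f"power={power_w:6.1f} W  temp={temp_c:3d} C  fan={fan_pct:3d} %")
          time.sleep(2)
  except KeyboardInterrupt:
      pass
  finally:
      pynvml.nvmlShutdown()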

Key Points
  • Achieves 120+ tokens/sec running Qwen3-Coder-Next with FP8 quantization and the full 262k context window (a rough way to sanity-check such a figure is sketched after this list).
  • Suffers from a major 15-minute startup delay in vLLM Docker containers and from severe, clearly audible coil whine.
  • System power draw exceeds 600 W under load, requiring careful UPS management to avoid tripping breakers.
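
For readers who want to sanity-check a throughput figure like 120+ tokens/sec on their own hardware, a single timed request against an OpenAI-compatible endpoint (both vLLM's serve command and llama.cpp's llama-server expose one) gives a rough end-to-end number. The URL, port, and model name below are assumptions, and a one-off request is not the same methodology the reviewer used.

  # Rough single-request throughput check against an OpenAI-compatible endpoint
  # (assumed URL, port, and model name; adjust for your own server).
  import time
  import requests

  BASE_URL = "http://localhost:8000/v1"  # vLLM's default port; llama-server typically uses 8080
  payload = {
      "model": "Qwen3-Coder-Next",       # placeholder; must match the name the server advertises
      "prompt": "Write a Python function that parses an ISO-8601 timestamp.",
      "max_tokens": 512,
      "temperature": 0.0,
  }

  start = time.perf_counter()
  resp = requests.post(f"{BASE_URL}/completions", json=payload, timeout=600)
  resp.raise_for_status()
  elapsed = time.perf_counter() - start

  usage = resp.json().get("usage", {})
  generated = usage.get("completion_tokens", 0)
  print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/sec")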

Why It Matters

Provides real-world performance data and pitfalls for professionals considering high-end NVIDIA GPUs for cost-effective local AI inference.