Open Source

Qwen3.6-27B-NVFP4 - images

Blackwell FP4 tensor cores deliver speed but compromise SVG generation quality

Deep Dive

A developer (u/Usual-Carrot6352) compiled llama.cpp with NVFP4 support on a Lenovo Legion 7i Gen10 equipped with an NVIDIA RTX 5090 (Blackwell), an Intel Core Ultra 9 275HX, and 32GB RAM, running Qwen3.6-27B (a 27B-parameter model from Alibaba's Qwen team) in the NVFP4 GGUF format. The build flags -DGGML_CUDA_NVFP4=ON and -DGGML_CUDA_GRAPHS=ON enable Blackwell's FP4 tensor cores and MXFP4 support. The model ran at 37 tokens/second with a 131,072-token context, -ngl 99 (all layers offloaded to the GPU), and 16 CPU threads.
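A minimal sketch of that build-and-run setup. The two NVFP4 flags are taken from the post; the standard -DGGML_CUDA=ON switch, the binary path, and the model filename are assumptions and may differ across llama.cpp revisions:

```shell
# Configure and build llama.cpp with CUDA plus the NVFP4 path enabled
# (the NVFP4/GRAPHS flags are as reported in the post)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NVFP4=ON -DGGML_CUDA_GRAPHS=ON
cmake --build build --config Release -j

# Launch with the settings from the post: all layers offloaded (-ngl 99),
# 131,072-token context (-c), 16 CPU threads (-t). Model filename is hypothetical.
./build/bin/llama-cli -m qwen3.6-27b-nvfp4.gguf -ngl 99 -c 131072 -t 16
```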

The developer tested SVG image generation with prompts such as 'pelican riding a bicycle,' 'capybara wearing a kimono drinking matcha tea,' 'flamingo knitting a colorful sweater,' and 'Victorian-era robot reading a newspaper.' While the NVFP4 quant delivered fast inference, the generated images were judged to be 'kinda looking kids cartoons,' with less creativity and detail than output from the Q6_K quant (which also wasn't perfect but was preferred). The NVFP4 format trades output quality on generative tasks for speed and memory efficiency.

Key Points
  • Llama.cpp built with NVFP4 support achieved 37 t/s on RTX 5090 with Qwen3.6-27B, using 131K context
  • NVFP4 quant uses Blackwell's FP4 tensor cores (architecture 120a) for efficient inference on a 27B-parameter model
  • SVG image generation with NVFP4 produced simpler, cartoonish outputs vs Q6_K; user prefers Q6_K for quality over speed

Why It Matters

NVFP4 boosts inference speed on Blackwell GPUs but reduces output quality for creative tasks - a speed vs quality tradeoff.