Open Source

Qwen3.6-27B-NVFP4 - images

Blackwell FP4 tensor cores deliver speed but compromise SVG generation quality

Deep Dive

A developer (u/Usual-Carrot6352) compiled llama.cpp with NVFP4 support on a Lenovo Legion 7i Gen10 equipped with an NVIDIA RTX 5090 (Blackwell), an Intel Core Ultra 9 275HX, and 32GB RAM, running Qwen3.6-27B (a 27B-parameter model from Alibaba's Qwen team) in the NVFP4 GGUF format. The build flags -DGGML_CUDA_NVFP4=ON and -DGGML_CUDA_GRAPHS=ON enable Blackwell's FP4 tensor cores and MXFP4 support. The model ran at 37 tokens/second with a 131,072-token context, -ngl 99 (all layers offloaded to the GPU), and 16 CPU threads.
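A minimal sketch of that build-and-run setup. The two NVFP4 flags are taken from the post; the standard -DGGML_CUDA=ON switch, the binary path, and the model filename are assumptions and may differ across llama.cpp revisions:

```shell
# Configure and build llama.cpp with CUDA plus the NVFP4 path enabled
# (the NVFP4/GRAPHS flags are as reported in the post)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NVFP4=ON -DGGML_CUDA_GRAPHS=ON
cmake --build build --config Release -j

# Launch with the settings from the post: all layers offloaded (-ngl 99),
# 131,072-token context (-c), 16 CPU threads (-t). Model filename is hypothetical.
./build/bin/llama-cli -m qwen3.6-27b-nvfp4.gguf -ngl 99 -c 131072 -t 16
```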

The developer tested SVG image generation with prompts such as 'pelican riding a bicycle,' 'capybara wearing a kimono drinking matcha tea,' 'flamingo knitting a colorful sweater,' and 'Victorian-era robot reading a newspaper.' While the NVFP4 quant delivered fast inference, the generated images were judged to be 'kinda looking kids cartoons,' with less creativity and detail than output from the Q6_K quant (which also wasn't perfect but was preferred). The NVFP4 format trades output quality on generative tasks for speed and memory efficiency.

Key Points
  • Llama.cpp built with NVFP4 support achieved 37 t/s on RTX 5090 with Qwen3.6-27B, using 131K context
  • NVFP4 quant uses Blackwell's FP4 tensor cores (architecture 120a) for efficient inference on a 27B-parameter model
  • SVG image generation with NVFP4 produced simpler, cartoonish outputs vs Q6_K; user prefers Q6_K for quality over speed

Why It Matters

NVFP4 boosts inference speed on Blackwell GPUs but reduces output quality for creative tasks - a speed vs quality tradeoff.