Open Source

llama.cpp benchmark: native vs. non-native NVFP4 on Blackwell - summary

Native NVFP4 in llama.cpp b8967 speeds up prefill by up to 68% on Blackwell.

Deep Dive

A detailed benchmark of llama.cpp on an NVIDIA RTX 5090 with an AMD Ryzen 9 9950X3D compared two builds: b8966 (without native NVFP4) and b8967 (with native NVFP4 support), using the Qwen3.6-27B-NVFP4 model (17.50 GiB, 26.90B parameters). The results show a significant prompt-processing speedup with native NVFP4, ranging from 42.6% to 68.3% depending on prompt length and context depth. For example, at pp512 (a 512-token prompt), throughput jumped from 3295.10 t/s to 5546.93 t/s, a 68.3% gain. At longer context depths like d32768, the advantage was still substantial: 43.6% faster (3560.58 t/s vs. 2479.39 t/s). The average uplift across all tested scenarios was roughly 57%.
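
The percentages follow directly from the quoted throughputs; the short Python snippet below recomputes them (the labels are mine):

    # Sanity check: recompute the quoted speedups (t/s = tokens/second).
    cases = {
        "pp512":            (3295.10, 5546.93),  # b8966 -> b8967
        "prefill @ d32768": (2479.39, 3560.58),
    }
    for name, (b8966, b8967) in cases.items():
        gain = (b8967 / b8966 - 1) * 100
        print(f"{name}: {b8966:.2f} -> {b8967:.2f} t/s (+{gain:.1f}%)")
    # pp512:            3295.10 -> 5546.93 t/s (+68.3%)
    # prefill @ d32768: 2479.39 -> 3560.58 t/s (+43.6%)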

Token generation speed, however, remained effectively identical between the two builds, with both achieving around 73 t/s at base context and gradually declining to ~67 t/s at d32768. The tiny differences were within benchmark noise (-0.1% to 0.0%). This indicates that native NVFP4 primarily accelerates the prefill phase (prompt ingestion) without affecting autoregressive decoding. In practice, that means faster responses for RAG workloads, document analysis, and code-heavy prompts, where a large context is processed upfront, while chat feels unchanged once decoding starts. Both builds showed similar context-length scaling, with generation speed dropping about 9% from base context to d32768.
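
To make the practical impact concrete, here is a rough back-of-the-envelope sketch (not from the source) of how the prefill gain translates into time-to-first-token for a long RAG-style prompt; the 32,768-token prompt length is an illustrative assumption, and scheduling overhead is ignored:

    # Rough time-to-first-token (TTFT) estimate for a long prompt,
    # assuming prefill runs at the d32768 rates quoted above.
    prompt_tokens = 32_768  # illustrative prompt length (assumption)
    prefill_tps = {
        "b8966 (no native NVFP4)": 2479.39,
        "b8967 (native NVFP4)":    3560.58,
    }
    for build, tps in prefill_tps.items():
        print(f"{build}: ~{prompt_tokens / tps:.1f} s before the first token")
    # b8966 (no native NVFP4): ~13.2 s before the first token
    # b8967 (native NVFP4):    ~9.2 s before the first token

Since decode speed is identical on both builds, all of the ~4 s saved comes off the wait before the first token appears.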

Key Points
  • Native NVFP4 in llama.cpp b8967 improves prompt processing by 43–68% (avg 57%) on RTX 5090.
  • Token generation speed unchanged at ~73 t/s, with no meaningful difference between builds.
  • Largest gains at short/medium contexts (up to 1.7x faster); still 1.43x faster at long contexts (d32768).
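
For anyone wanting to reproduce the comparison, a minimal llama-bench run along these lines should be close; the binary and model paths are placeholders, and the -d depth sweep assumes a recent llama-bench build that supports it (the source does not give the exact command):

    # Sketch of a llama-bench run against both builds. Paths are
    # placeholders; confirm your llama-bench build supports -d (depth).
    import subprocess

    MODEL = "models/Qwen3.6-27B-NVFP4.gguf"  # placeholder path

    for build in ("b8966", "b8967"):
        subprocess.run([
            f"./{build}/llama-bench",
            "-m", MODEL,
            "-p", "512",      # prompt-processing test (pp512)
            "-n", "128",      # token-generation test
            "-d", "0,32768",  # context depths, e.g. base and d32768
            "-ngl", "99",     # offload all layers to the GPU
        ], check=True)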

Why It Matters

Faster prompt processing on Blackwell GPUs boosts RAG and document analysis without sacrificing generation speed.