Open Source

llama.cpp benchmark: native vs. non-native NVFP4 on Blackwell - summary

Native NVFP4 in llama.cpp b8967 speeds up prefill by up to 68% on Blackwell.

Deep Dive

A detailed benchmark of llama.cpp on an NVIDIA RTX 5090 with an AMD Ryzen 9 9950X3D compared two builds: b8966 (without native NVFP4) and b8967 (with native NVFP4 support), using the Qwen3.6-27B-NVFP4 model (17.50 GiB, 26.90B parameters). The results show a significant prompt-processing speedup with native NVFP4, ranging from 42.6% to 68.3% depending on prompt length and context depth. For example, at pp512 (a 512-token prompt), throughput jumped from 3295.10 t/s to 5546.93 t/s, a 68.3% gain. At longer context depths like d32768, the advantage was still substantial: 43.6% faster (3560.58 t/s vs. 2479.39 t/s). The average uplift across all tested scenarios was roughly 57%.
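
The percentages follow directly from the quoted throughputs; the short Python snippet below recomputes them (the labels are mine):

    # Sanity check: recompute the quoted speedups (t/s = tokens/second).
    cases = {
        "pp512":            (3295.10, 5546.93),  # b8966 -> b8967
        "prefill @ d32768": (2479.39, 3560.58),
    }
    for name, (b8966, b8967) in cases.items():
        gain = (b8967 / b8966 - 1) * 100
        print(f"{name}: {b8966:.2f} -> {b8967:.2f} t/s (+{gain:.1f}%)")
    # pp512:            3295.10 -> 5546.93 t/s (+68.3%)
    # prefill @ d32768: 2479.39 -> 3560.58 t/s (+43.6%)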

Token generation speed, however, remained effectively identical between the two builds, with both achieving around 73 t/s at base context and gradually declining to ~67 t/s at d32768. The tiny differences were within benchmark noise (-0.1% to 0.0%). This indicates that native NVFP4 primarily accelerates the prefill phase (prompt ingestion) without affecting autoregressive decoding. In practice, that means faster responses for RAG workloads, document analysis, and code-heavy prompts, where a large context is processed upfront, while chat feels unchanged once decoding starts. Both builds showed similar context-length scaling, with generation speed dropping about 9% from base context to d32768.
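
To make the practical impact concrete, here is a rough back-of-the-envelope sketch (not from the source) of how the prefill gain translates into time-to-first-token for a long RAG-style prompt; the 32,768-token prompt length is an illustrative assumption, and scheduling overhead is ignored:

    # Rough time-to-first-token (TTFT) estimate for a long prompt,
    # assuming prefill runs at the d32768 rates quoted above.
    prompt_tokens = 32_768  # illustrative prompt length (assumption)
    prefill_tps = {
        "b8966 (no native NVFP4)": 2479.39,
        "b8967 (native NVFP4)":    3560.58,
    }
    for build, tps in prefill_tps.items():
        print(f"{build}: ~{prompt_tokens / tps:.1f} s before the first token")
    # b8966 (no native NVFP4): ~13.2 s before the first token
    # b8967 (native NVFP4):    ~9.2 s before the first token

Since decode speed is identical on both builds, all of the ~4 s saved comes off the wait before the first token appears.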

Key Points
  • Native NVFP4 in llama.cpp b8967 improves prompt processing by 43–68% (avg 57%) on RTX 5090.
  • Token generation speed unchanged at ~73 t/s, with no meaningful difference between builds.
  • Largest gains at short/medium contexts (up to 1.7x faster); still 1.43x faster at long contexts (d32768).
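
For anyone wanting to reproduce the comparison, a minimal llama-bench run along these lines should be close; the binary and model paths are placeholders, and the -d depth sweep assumes a recent llama-bench build that supports it (the source does not give the exact command):

    # Sketch of a llama-bench run against both builds. Paths are
    # placeholders; confirm your llama-bench build supports -d (depth).
    import subprocess

    MODEL = "models/Qwen3.6-27B-NVFP4.gguf"  # placeholder path

    for build in ("b8966", "b8967"):
        subprocess.run([
            f"./{build}/llama-bench",
            "-m", MODEL,
            "-p", "512",      # prompt-processing test (pp512)
            "-n", "128",      # token-generation test
            "-d", "0,32768",  # context depths, e.g. base and d32768
            "-ngl", "99",     # offload all layers to the GPU
        ], check=True)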

Why It Matters

Faster prompt processing on Blackwell GPUs boosts RAG and document analysis without sacrificing generation speed.