Open Source

Qwen3.6-35B-A3B KLDs - INTs and NVFPs

Real GPU benchmarks show INT8 outperforms FP8 in accuracy for large models

Deep Dive

Phaelon74 has released detailed Kullback-Leibler divergence (KLD) benchmarks for the Qwen3.6-35B-A3B model, comparing INT8, FP8, and NVFP4 quantization formats. Measured against real logits on RTX 6000 GPUs in vLLM, the tests show that INT8 quantization preserves quality better than FP8, even though FP8 is expected to be faster in 8-bit mode. The NVFP4 format diverges significantly from the full-precision baseline, with NVFP4A16 more accurate than NVFP4A4 but potentially slower.

The benchmarks emphasize that KLD measures raw mathematical divergence from full-precision logits, not task-specific accuracy. A quant with worse KLD might still perform better on specific evaluations, depending on the use case. The author advises choosing quantization based on workload: FP8 for speed-critical tasks, INT8 for quality-sensitive applications, and NVFP4 for memory-constrained scenarios where the trade-offs are acceptable. All tests are reproducible via the author's vLLM fork.
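The metric itself is simple: compare the probability distribution a quantized model produces at each token position against the full-precision model's distribution, and average the per-token KL divergence. The sketch below is an illustrative implementation, not the author's benchmark code; the shapes, epsilon, and toy inputs are assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(ref_logits, quant_logits):
    """Mean per-token KL(P_ref || P_quant) in nats.

    ref_logits / quant_logits: arrays of shape (tokens, vocab),
    e.g. full-precision vs. quantized model outputs on the same prompt.
    """
    p = softmax(ref_logits)
    log_p = np.log(p + 1e-12)                      # epsilon avoids log(0)
    log_q = np.log(softmax(quant_logits) + 1e-12)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())

# Toy example: "quantized" logits modeled as reference plus small noise.
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 32))                     # 4 tokens, 32-word vocab
quant = ref + rng.normal(scale=0.05, size=ref.shape)
print(mean_kld(ref, ref))    # identical distributions -> 0.0
print(mean_kld(ref, quant))  # small positive divergence
```

A KLD of zero means the quantized model reproduces the full-precision distribution exactly; larger values mean more drift. As the article notes, this says nothing directly about downstream task accuracy.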

Key Points
  • INT8 quantization achieves better quality than FP8 on Qwen3.6-35B-A3B, confirmed by real GPU benchmarks
  • NVFP4 shows significant KLD divergence, with NVFP4A16 outperforming NVFP4A4 in accuracy
  • FP8 should be faster in 8-bit mode due to native kernel support, but quality is lower than INT8
  • Benchmarks use real logits on RTX 6000 GPUs in vLLM, with each test taking 3-5 minutes
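To build intuition for why INT8 and FP8 can differ in quality at the same bit width: INT8 uses uniformly spaced levels set by a per-tensor (or per-channel) scale, while FP8 trades mantissa bits for exponent range, so its error is relative to each value's magnitude. The sketch below is a rough simulation under stated assumptions (per-tensor symmetric INT8; a crude E4M3 model that rounds the mantissa to 3 explicit bits and ignores exponent range, saturation, and subnormals), not the kernels vLLM actually runs.

```python
import numpy as np

def int8_roundtrip(w):
    # Per-tensor symmetric INT8: one scale maps max|w| onto level 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

def fp8_e4m3_roundtrip(w):
    # Crude E4M3 simulation: keep each value's binary exponent and round
    # the frexp mantissa (in [0.5, 1)) to 4 bits, matching E4M3's
    # 3 explicit mantissa bits. Exponent clipping/subnormals are ignored.
    m, e = np.frexp(w)
    return np.ldexp(np.round(m * 16) / 16, e)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1 << 16)  # Gaussian-ish weight tensor
for name, rt in [("INT8", int8_roundtrip), ("FP8-E4M3 (sim)", fp8_e4m3_roundtrip)]:
    err = np.abs(w - rt(w)).mean()
    print(f"{name}: mean abs round-trip error = {err:.2e}")
```

Which format wins depends on the weight distribution and quantization granularity, which is exactly why the author recommends measuring KLD on the actual model rather than reasoning from bit widths alone.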

Why It Matters

Practical guidance for selecting quantization formats to balance speed, memory, and accuracy in production.