Open Source

Qwen3.5-35B GGUF quants (16–22 GiB) - KLD + speed comparison

Independent testing reveals a 36.2% token-generation speed gap between quantized versions of Alibaba's 35B model.

Deep Dive

An independent benchmarking analysis of Alibaba's Qwen3.5-35B model quantizations reveals significant performance variations that challenge the notion of a single 'best' version. The researcher measured Kullback-Leibler divergence (KLD) across multiple GGUF quantizations ranging from 16 to 22 GiB, using a multilingual dataset drawn from FLORES 200 combined with technical calibration data. This approach yields more realistic accuracy measurements than English-only benchmarks, showing how each quantization method shifts the model's probability distributions relative to the FP16 baseline.
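The core of the KLD methodology can be illustrated with a small sketch: at each token position, compare the quantized model's predicted distribution against the FP16 baseline, then summarize with the mean and 99th percentile (the two sort keys the post's tables use). The logits below are hypothetical stand-ins, not the author's actual harness, which would read per-token logits from llama.cpp.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def per_token_kld(fp16_logits, quant_logits):
    # KL(P_fp16 || P_quant) at each token position
    p = softmax(fp16_logits)
    q = softmax(quant_logits)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)

# Hypothetical data: 1000 token positions, tiny 8-entry vocab for illustration
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 8))
quant = base + rng.normal(scale=0.05, size=base.shape)  # simulated quantization noise

kld = per_token_kld(base, quant)
print(f"KLD mean: {kld.mean():.6f}")
print(f"KLD 99%:  {np.percentile(kld, 99):.6f}")
```

A lower KLD mean means the quantized model's next-token distributions stay closer to FP16 on average; the 99th percentile captures worst-case drift on rare tokens, which the mean can hide.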

Performance testing on RTX 3090 hardware revealed dramatic speed differences, with token generation rates varying by 36.2% between the slowest (Unsloth's UD-Q3_K_XL at ~105 tokens/second) and fastest (Mungert's iq4_nl at ~143 tokens/second) quantizations. The analysis includes detailed tables sorted by both KLD mean and KLD 99% values, allowing users to make informed decisions based on their specific accuracy versus speed trade-offs. Notably, the researcher avoided declaring a winner, emphasizing that GPU-constrained users must consider their unique requirements when selecting a quantization.
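The headline 36.2% figure follows directly from the two endpoint rates reported in the post (both approximate):

```python
slowest = 105.0  # tokens/s, Unsloth UD-Q3_K_XL (approximate, as reported)
fastest = 143.0  # tokens/s, Mungert iq4_nl (approximate, as reported)

# Gap expressed relative to the slowest quantization
gap = (fastest - slowest) / slowest * 100
print(f"Speed gap: {gap:.1f}%")  # → 36.2%
```

Note the gap is measured relative to the slowest quant; relative to the fastest it would be a smaller 26.6%, so the baseline choice matters when quoting such comparisons.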

The benchmark results highlight how different quantization techniques from providers like Unsloth, AesSedai, and bartowski affect both model accuracy and inference speed. The comprehensive testing methodology, which includes both prompt processing (PP/s) and token generation (TG/s) metrics, provides valuable data for developers deploying Qwen3.5-35B in production environments where both accuracy and latency matter.

Key Points
  • 36.2% token generation speed variation between fastest and slowest Qwen3.5-35B quantizations
  • Multilingual KLD testing using FLORES 200 dataset plus technical calibration data
  • Quantizations range from 16 to 22 GiB with different accuracy/speed trade-offs

Why It Matters

Helps developers choose optimal quantizations for production deployments by balancing accuracy, speed, and hardware constraints.