PSA: Qwen 3.5 requires bf16 KV cache, NOT f16!!
Using the wrong KV cache format quietly degrades model quality; switching to bf16 is a simple but important fix for developers running Qwen 3.5 locally.
A critical configuration issue has been identified for developers running Alibaba's Qwen 3.5 35B A3B model locally. In popular inference engines such as llama.cpp, the default Key-Value (KV) cache format is fp16 (float16), which leads to degraded model quality. The official Qwen-team implementations, such as vLLM, correctly default to bf16 (brain float16) for the cache. This mismatch means that users who do not manually specify `-ctk bf16 -ctv bf16` in llama.cpp are unknowingly running a less accurate version of the model, as shown by perplexity (PPL) benchmarks on the wikitext-2-raw dataset.
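For reference, a minimal llama.cpp invocation with the bf16 cache flags might look like the sketch below; the binary name, model path, and prompt are placeholders, and only the `-ctk`/`-ctv` flags are the point of the example.

```
# Hypothetical example: run Qwen 3.5 35B A3B with a bf16 KV cache in llama.cpp.
# -ctk / -ctv set the cache type for the K and V tensors respectively;
# without them, llama.cpp falls back to its f16 default.
./llama-cli \
  -m ./Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf \
  -ctk bf16 -ctv bf16 \
  -p "Explain the difference between bf16 and fp16."
```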
Technical tests using the Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf file showed that while fp16 and fp32 caches yielded a PPL of 6.5511, the correct bf16 cache achieved a slightly better PPL of 6.5497. Although the numerical difference is small, it indicates the model was optimized for bf16, and that an fp16 cache introduces avoidable numerical error at inference time. This finding matters for the open-source AI community because it affects benchmarking, fine-tuning, and application development. Developers must now add these flags manually to their llama.cpp commands, a step that highlights the growing complexity of deploying state-of-the-art models and the importance of aligning local inference settings with the original training framework.
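The perplexity comparison can be reproduced with llama.cpp's perplexity tool; a rough sketch is shown below, assuming the GGUF file and a local copy of wikitext-2-raw (paths are placeholders). Running the same command with the f16 and bf16 cache types and comparing the final PPL values is what distinguishes the two configurations.

```
# Sketch: compare KV cache formats on wikitext-2-raw (paths are assumptions).
# First run uses the f16 default, second forces bf16; compare the reported PPL.
./llama-perplexity -m ./Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf \
  -f ./wikitext-2-raw/wiki.test.raw -ctk f16 -ctv f16

./llama-perplexity -m ./Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf \
  -f ./wikitext-2-raw/wiki.test.raw -ctk bf16 -ctv bf16
```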
- Qwen 3.5 35B A3B requires bf16 KV cache in llama.cpp, not the default fp16, for accurate results.
- Perplexity tests show a small but measurable quality gap (6.5497 vs 6.5511 PPL) when using the correct cache format.
- Official vLLM implementations default to bf16, making this a crucial fix for local deployment and benchmarking.
Why It Matters
Ensures developers and researchers get the true, intended performance from Alibaba's flagship open-source model for accurate applications and benchmarks.