llama.cpp b9434 fixes Qwen 3.5/3.6 multi-GPU inference
Tensor parallelism granularity fix for Qwen models on 3 GPUs lands in llama.cpp
The open-source llama.cpp project, maintained by ggml-org and with 114k stars on GitHub, has released version b9434. The patch addresses tensor parallelism (TP) granularity for Qwen 3.5 and Qwen 3.6 models distributed across exactly 3 GPUs. Tensor parallelism splits the weight tensors inside each layer across multiple GPUs, which reduces memory use per device and can speed up inference. Each split has to respect a granularity, the smallest unit a tensor dimension may be cut into, and a device count such as 3 that does not divide those dimensions evenly can lead to load imbalance or crashes. The fix is meant to ensure correct sharding for these newer Qwen architectures.
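To make the granularity idea concrete, here is a minimal Python sketch of one way a tensor dimension could be divided across devices in whole blocks. It is an illustration only, not llama.cpp's actual sharding code: the function name `split_dim` and the example sizes (16 attention heads of 128 elements) are hypothetical and do not come from the release notes or from any Qwen configuration.

```python
def split_dim(dim: int, n_devices: int, granularity: int) -> list[int]:
    """Split `dim` elements across `n_devices` in whole blocks of `granularity`."""
    if n_devices <= 0 or granularity <= 0:
        raise ValueError("n_devices and granularity must be positive")
    if dim % granularity != 0:
        raise ValueError("dim must be a multiple of granularity")
    blocks = dim // granularity
    base, extra = divmod(blocks, n_devices)
    # The first `extra` devices take one additional block, so shard sizes
    # differ by at most one block and always sum to `dim`.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devices)]


if __name__ == "__main__":
    # Hypothetical sizes: 16 attention heads of 128 elements each.
    print(split_dim(16 * 128, 3, 128))  # [768, 640, 640]
    print(split_dim(16 * 128, 4, 128))  # [512, 512, 512, 512]
```

With 4 devices the 16 blocks divide evenly, but with 3 devices one shard has to carry an extra block. An implementation that assumes an even split, or that picks a granularity the dimension is not a multiple of, has no valid answer in the 3-GPU case, which is the general kind of failure a granularity fix guards against.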
Alongside the TP fix, the release resolves an issue related to afmoe (likely a mixture-of-experts model architecture). Build artifacts are available for all major platforms: macOS (Apple Silicon and Intel, with KleidiAI support disabled), Linux (x64/arm64 with Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, HIP), and Android arm64. That platform coverage suits developers deploying LLMs locally or on edge devices. The TP fix itself targets users running Qwen's latest models on multi-GPU setups, where it should improve reliability for inference servers and experimental applications alike.
- Fixes tensor parallelism granularity for Qwen 3.5 and 3.6 models on 3-GPU configurations
- Resolves an issue related to afmoe (likely mixture-of-experts related)
- Released for all major platforms including Windows CUDA 12/13, Linux ROCm/Vulkan, macOS, and Android
Why It Matters
This is a critical fix for developers running Qwen 3.5/3.6 models across multiple GPUs with llama.cpp, particularly on 3-GPU setups.