Developer Tools

llama.cpp b9434 fixes Qwen 3.5/3.6 multi-GPU inference

Tensor parallelism granularity fix for Qwen models on 3 GPUs lands in llama.cpp

Deep Dive

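The open-source llama.cpp project, maintained by ggml-org and holding 114k stars on GitHub, has released build b9434. The patch addresses tensor parallelism (TP) granularity for Qwen 3.5 and Qwen 3.6 models when they are distributed across exactly 3 GPUs. Tensor parallelism splits the weight tensors inside each layer across multiple GPUs (as opposed to assigning whole layers to different devices), which reduces memory use per device and can speed up inference. Granularity generally refers to the unit size in which a tensor dimension is divided; when a dimension does not divide cleanly across the device count, as can happen with 3 GPUs, the resulting shards can be unbalanced or invalid, leading to load imbalances or crashes. The fix ensures proper sharding for these newer Qwen architectures.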
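
To make the granularity idea concrete, here is a minimal Python sketch of splitting one tensor dimension across devices in whole blocks. It is an illustration only, not llama.cpp's implementation: the function name split_with_granularity, the 2048-column dimension, and the block size of 128 are all hypothetical values chosen for the example.

```python
def split_with_granularity(total: int, n_devices: int, granularity: int) -> list[int]:
    """Split `total` elements across `n_devices` in multiples of `granularity`.

    Hypothetical helper for illustration; not taken from llama.cpp.
    """
    if total % granularity != 0:
        raise ValueError("total must be a multiple of granularity")
    blocks = total // granularity
    base, extra = divmod(blocks, n_devices)
    # The first `extra` devices take one more block, so shard sizes
    # differ by at most one block and every shard stays block-aligned.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devices)]


# Assumed example: a 2048-wide dimension that must be cut in blocks of 128.
print(split_with_granularity(2048, 2, 128))  # [1024, 1024]
print(split_with_granularity(2048, 3, 128))  # [768, 640, 640]
```

With 2 devices the 16 blocks divide evenly, but with 3 devices they cannot, so one shard has to be larger than the others. A splitter that ignores the block size would instead produce shards that cut through a block, which is the kind of case a granularity fix has to handle.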

Alongside the TP fix, the release also resolves an issue related to afmoe, a mixture-of-experts model architecture supported by llama.cpp. Build artifacts are available for all major platforms: macOS (Apple Silicon and Intel, with KleidiAI support disabled), Linux (x64/arm64 with Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, HIP), and Android arm64. This broad platform coverage makes llama.cpp a practical choice for developers deploying LLMs locally or on edge devices. The fix targets users running Qwen's latest models on 3-GPU setups, improving reliability for both inference servers and experimental AI applications.

Key Points
  • Fixes tensor parallelism granularity for Qwen 3.5 and 3.6 models on 3-GPU configurations
  • Resolves an issue with afmoe, a mixture-of-experts model architecture
  • Released for all major platforms including Windows CUDA 12/13, Linux ROCm/Vulkan, macOS, and Android

Why It Matters

This is a targeted reliability fix for developers running Qwen 3.5/3.6 models across 3 GPUs with llama.cpp.