Developer Tools

llama.cpp b8377

The latest llama.cpp release enables efficient Qwen 3.5 inference across 15+ platform configurations, including Windows with CUDA and Apple Silicon macOS.

Deep Dive

The ggml-org team behind the widely used llama.cpp project has released version b8377, a significant update to the open-source inference engine that powers local AI model execution. The release introduces official support for Alibaba Cloud's Qwen 3.5 model family, enabling developers to run these models efficiently on consumer hardware. It also includes a critical fix to the Qwen 3.5 implementation, replacing 'cont' with 'reshape' in the alpha calculation to ensure correct tensor handling and model accuracy.
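In ggml, the tensor library underneath llama.cpp, those two operations differ in cost: a reshape reinterprets an existing buffer under a new logical shape, while a cont materializes a contiguous copy. The sketch below illustrates the difference using the library's `ggml_reshape_2d` and `ggml_cont` calls; the tensor named `alpha` and its shape are hypothetical stand-ins for illustration, not the actual Qwen 3.5 model code.

```cpp
#include "ggml.h"

int main() {
    // Small scratch context; ggml allocates tensors and graph nodes here.
    ggml_init_params params {};
    params.mem_size = 16 * 1024 * 1024; // 16 MiB is plenty for this sketch
    ggml_context * ctx = ggml_init(params);

    // Hypothetical stand-in for a per-head alpha vector of 64 values.
    ggml_tensor * alpha = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 64);

    // ggml_reshape_2d: reinterpret the same buffer as a [64, 1] view.
    // No data is copied; the node only changes the logical shape.
    ggml_tensor * view = ggml_reshape_2d(ctx, alpha, 64, 1);

    // ggml_cont: schedule a contiguous copy of the tensor. Correct, but
    // it adds an extra buffer and a memcpy to the compute graph.
    ggml_tensor * copy = ggml_cont(ctx, alpha);

    (void) view;
    (void) copy; // in real model code these would feed downstream nodes

    ggml_free(ctx);
    return 0;
}
```

Swapping a copy for a view removes an allocation and a memcpy from the compute graph, which is executed on every decoding step.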

What makes this release particularly noteworthy is its comprehensive cross-platform support. The team provides pre-built binaries for more than 15 platform configurations: Windows with CUDA 12.4 and 13.1 for NVIDIA GPU acceleration, macOS builds for both Apple Silicon (arm64) and Intel (x64) architectures, and Linux variants with Vulkan, ROCm 7.2, and OpenVINO backends. This coverage means developers can deploy Qwen 3.5 models across diverse environments, from iOS applications to enterprise servers, without complex compilation steps.
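To give a sense of how those packages are consumed, here is a minimal C++ sketch against llama.cpp's C API that loads a GGUF model and creates an inference context. The function names match recent releases of `llama.h` but have been renamed across versions, so treat the exact identifiers as assumptions; the model path is a placeholder supplied on the command line.

```cpp
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    // Offload as many layers as the backend (CUDA, Metal, Vulkan, ...) allows.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;

    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model: %s\n", argv[1]);
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to create context\n");
        return 1;
    }

    // ... tokenize the prompt, decode, and sample here ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

In principle the same few lines build unchanged against any of the pre-built packages; only the linked backend library (CUDA, Metal, Vulkan, ROCm, and so on) differs.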

The release also includes specialized builds for niche platforms like openEuler with Huawei Ascend NPU support (310p and 910b with ACL Graph), demonstrating the project's commitment to hardware diversity. This comes at a time when efficient local inference is becoming increasingly important for privacy-sensitive applications, cost-effective deployment, and edge computing scenarios where cloud API calls are impractical or too expensive.

Key Points
  • Adds official support for Alibaba's Qwen 3.5 model family, with a tensor-operation fix ('cont' → 'reshape' in the alpha calculation)
  • Provides pre-built binaries for 15+ platforms, including Windows CUDA 12.4/13.1, macOS Apple Silicon, and Linux ROCm
  • Includes specialized builds for Huawei Ascend NPUs and openEuler for enterprise deployment scenarios

Why It Matters

Enables cost-effective, private deployment of state-of-the-art models across diverse hardware, reducing cloud dependency.