Developer Tools

llama.cpp b9169 adds multimodal chunks for Qwen3A

Supports audio tokens, limits memory with chunk size...

Deep Dive

The latest release of llama.cpp, b9169, improves multimodal (mtmd) support for the Qwen3A model. The commit adds chunk-based processing for multimodal data: large inputs are split into smaller pieces that are processed one at a time, so memory use no longer grows with the full input length. It also fixes the preprocessing logic and corrects the ordering of audio tokens. A limit tied to the attention mask (attn_mask) now caps the chunk size, which prevents excessive memory allocation during attention operations. Together, these changes make it more practical to run Qwen3A, a model that likely handles both text and audio inputs, on consumer hardware with limited VRAM.
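To illustrate why capping the chunk size bounds attention memory, here is a minimal Python sketch. It is not llama.cpp's implementation: the function names, the 3000-frame input, and the 1024-frame cap are hypothetical values chosen for illustration, and it assumes a dense square attention mask per chunk.

```python
from typing import List, Tuple


def split_into_chunks(n_frames: int, max_chunk: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges covering n_frames.

    Each range is at most max_chunk frames long (hypothetical helper,
    not part of llama.cpp).
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    return [
        (start, min(start + max_chunk, n_frames))
        for start in range(0, n_frames, max_chunk)
    ]


def peak_mask_elements(n_frames: int, max_chunk: int) -> int:
    """Size of the largest attention mask held at once, in elements.

    Assumes one dense (len x len) mask per chunk, processed sequentially.
    """
    chunks = split_into_chunks(n_frames, max_chunk)
    return max(((end - start) ** 2 for start, end in chunks), default=0)


if __name__ == "__main__":
    n_frames, max_chunk = 3000, 1024  # illustrative values only
    print(split_into_chunks(n_frames, max_chunk))
    print("unchunked mask elements:", n_frames * n_frames)
    print("peak chunked mask elements:", peak_mask_elements(n_frames, max_chunk))
```

Under these assumptions the mask for the full input grows quadratically with its length, while the peak mask for a chunked input is bounded by the square of the chunk cap, regardless of how long the input is.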

The release ships builds for a wide range of platforms: macOS (Apple Silicon, Intel), iOS (XCFramework), Linux (x64, arm64, and s390x, with Vulkan, ROCm, and SYCL variants), Windows (x64 and arm64, with CUDA, Vulkan, SYCL, and HIP variants), Android (arm64 CPU), and openEuler (x86 and aarch64, with ACL Graph). Binary assets are provided for each architecture, so users can run local inference without building from source. The update continues llama.cpp's focus on running advanced models on user-owned devices and reducing reliance on cloud APIs.

Key Points
  • Adds mtmd chunk support for Qwen3A multimodal inference, with an attn_mask limit that caps chunk size and memory use.
  • Fixes preprocessing and corrects the ordering of audio tokens for multimodal processing.
  • Available across 16+ platform builds including macOS, Windows, Linux, Android, iOS, and openEuler.

Why It Matters

Enables efficient local multimodal (text+audio) inference for Qwen3A on consumer devices.