Open Source

mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)

New Qwen3 models process audio and vision, enabling multimodal AI on consumer hardware.

Deep Dive

The open-source AI community has taken a significant step toward democratizing multimodal AI with the release of quantized Qwen3 models featuring audio support. The team at ggml-org has published GGUF (GPT-Generated Unified Format) versions of Alibaba's Qwen3 models on Hugging Face, making them accessible for local deployment. The flagship Qwen3-Omni-30B model employs a Mixture of Experts (MoE) architecture that can process both visual and audio inputs simultaneously, representing a major advancement for open-source multimodal AI. These quantized models are optimized to run efficiently on consumer hardware, bypassing the need for expensive cloud API calls.

Two specialized automatic speech recognition (ASR) models, Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, complement the multimodal offering. These smaller, dedicated models provide efficient and accurate speech-to-text conversion, which can be used independently or in conjunction with the larger Omni model. The availability of these models in the GGUF format means developers can now integrate audio understanding into applications running entirely on local machines, from desktop assistants to edge computing devices. This release effectively brings capabilities similar to proprietary models like GPT-4o into the open-source ecosystem, enabling greater experimentation and customization without vendor lock-in.

Key Points
  • Qwen3-Omni-30B is a multimodal MoE model accepting both vision and audio inputs, available in 'Thinking' and 'Instruct' variants.
  • Specialized Qwen3-ASR models (1.7B and 0.6B parameters) provide efficient automatic speech recognition for local deployment.
  • All models are released in GGUF format, optimized for local inference on consumer hardware without cloud dependencies.

Why It Matters

This brings powerful, multimodal AI with audio understanding to local devices, enabling private, customizable applications and reducing reliance on cloud APIs.