Qwen3.5-Omni Technical Report
The massive omnimodal model surpasses Gemini-3.1 Pro on key audio tasks and introduces a new 'Audio-Visual Vibe Coding' capability.
The Qwen Team, part of Alibaba, has detailed its latest flagship model, Qwen3.5-Omni, in a new technical report. The model marks a significant step in omnimodal AI, scaling to hundreds of billions of parameters and supporting a 256k-token context length. It was trained on a large corpus of heterogeneous text-vision pairs and over 100 million hours of audio-visual content, which underpins its multimodal capabilities. Both its 'Thinker' and 'Talker' components use a novel Hybrid Attention Mixture-of-Experts (MoE) framework designed for efficient long-sequence inference.
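The report gives no implementation details for these layers, but the general pattern a sparse MoE feed-forward block follows is top-k expert routing: a small router sends each token to a handful of experts rather than through one dense feed-forward network. Below is a minimal PyTorch sketch of that pattern only; the layer sizes, expert count, and top-2 routing are assumptions for illustration, not the actual Qwen3.5-Omni configuration, and the hybrid-attention part is omitted entirely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEBlock(nn.Module):
    """Minimal top-k routed mixture-of-experts feed-forward block.

    Illustrative only: sizes, expert count, and top-2 routing are
    assumptions, not the actual Qwen3.5-Omni configuration.
    """

    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.size(-1))         # flatten to (n_tokens, d_model)
        weights, chosen = self.router(tokens).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):             # each token visits its top_k experts
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(tokens[mask])
        return out.reshape_as(x)


block = SparseMoEBlock()
print(block(torch.randn(2, 16, 1024)).shape)       # torch.Size([2, 16, 1024])
```

The sparsity is the point of such designs: each token pays the compute cost of only its routed experts per layer while the total parameter count stays large, which is what keeps inference over very long sequences tractable.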
In benchmark performance, Qwen3.5-Omni-plus achieves state-of-the-art (SOTA) results across 215 subtasks spanning audio and audio-visual understanding, reasoning, and interaction. It notably surpasses Google's Gemini-3.1 Pro on key audio tasks and matches it in comprehensive audio-visual understanding. The model also handles long-form inputs, supporting over 10 hours of audio understanding and up to 400 seconds of 720p video sampled at 1 FPS (roughly 400 frames). A key technical innovation is ARIA, a new method that dynamically aligns text and speech units to address instability in streaming speech synthesis, yielding more natural conversational speech with minimal latency.
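The report does not describe how ARIA performs this alignment, so the following is only a hypothetical sketch of the underlying idea: buffer incoming text tokens, keep a small look-ahead window, and emit speech units only once enough text context is available, so unit generation never runs ahead of the text stream. The units-per-token ratio, the look-ahead size, and `fake_unit_decoder` are invented for illustration.

```python
from collections import deque

# Hypothetical sketch of dynamic text/speech-unit alignment in a streaming
# synthesis loop. The actual ARIA algorithm is not described in the report;
# the constants and decoder below are invented for illustration.
UNITS_PER_TOKEN = 4      # assumed average number of speech units per text token
LOOKAHEAD_TOKENS = 2     # hold back a small text window to keep emission stable


def fake_unit_decoder(token: str) -> list[str]:
    """Stand-in for a speech-unit decoder; returns dummy unit IDs."""
    return [f"{token}:u{i}" for i in range(UNITS_PER_TOKEN)]


def stream_speech_units(text_token_stream):
    """Yield speech units as text tokens arrive, keeping a small look-ahead
    buffer so unit emission never outruns the available text context."""
    buffer = deque()
    for token in text_token_stream:
        buffer.append(token)
        while len(buffer) > LOOKAHEAD_TOKENS:    # enough context: emit units now
            yield from fake_unit_decoder(buffer.popleft())
    while buffer:                                # flush the buffer at end of stream
        yield from fake_unit_decoder(buffer.popleft())


if __name__ == "__main__":
    tokens = ["Hello", ",", "how", "are", "you", "?"]
    print(list(stream_speech_units(iter(tokens))))
```

In a real system the decoder would be the Talker's speech-unit head, and the look-ahead would presumably adapt to the text and prosody rather than being a fixed constant.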
Furthermore, Qwen3.5-Omni expands linguistic support to 10 languages with human-like emotional nuance in speech generation. It exhibits superior audio-visual grounding, generating script-level structured captions with precise temporal sync and automated scene segmentation. Most remarkably, the report notes the emergence of a novel capability termed 'Audio-Visual Vibe Coding,' where the model can perform coding tasks directly from audio-visual instructions, hinting at a new frontier for multimodal AI assistants.
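The report does not publish the caption format, but a 'script-level structured caption' with timestamps and scene boundaries can be pictured as nested records like the hypothetical sketch below; every field name and value here is an illustrative assumption, not the model's actual output schema.

```python
from dataclasses import dataclass, asdict
import json


# Hypothetical sketch of a script-level structured caption with per-event
# timestamps and scene segmentation. Field names and values are illustrative
# assumptions, not the report's format.
@dataclass
class CaptionEvent:
    start_s: float        # event start time in seconds
    end_s: float          # event end time in seconds
    speaker: str          # e.g. "narrator", "speaker_1"
    text: str             # spoken or described content


@dataclass
class Scene:
    scene_id: int
    start_s: float
    end_s: float
    summary: str
    events: list[CaptionEvent]


scenes = [
    Scene(0, 0.0, 12.5, "Presenter introduces the demo",
          [CaptionEvent(0.0, 4.2, "speaker_1", "Welcome to the walkthrough."),
           CaptionEvent(4.2, 12.5, "speaker_1", "Today we build a small web app.")]),
]
print(json.dumps([asdict(s) for s in scenes], indent=2))
```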
- Achieves SOTA across 215 audio and audio-visual subtasks, beating Gemini-3.1 Pro on key audio tasks, with training on 100M+ hours of audio-visual data.
- Introduces ARIA for stable, low-latency streaming speech and supports 10-hour audio, 400-second video, and a 256k context window.
- Demonstrates novel 'Audio-Visual Vibe Coding' (direct code generation from audio-visual instructions; see the sketch after this list) and multilingual emotional speech across 10 languages.
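To make the 'Audio-Visual Vibe Coding' idea concrete, here is a hypothetical sketch of what such a request could look like: a screen-recording clip plus a spoken instruction, with the model asked to return code. The payload shape, field names, and model id are assumptions for illustration and do not reflect the actual Qwen API.

```python
import base64
import json


def build_vibe_coding_request(video_path: str, audio_path: str) -> str:
    """Assemble a hypothetical audio-visual coding request.

    The message schema, field names, and model id below are illustrative
    assumptions, not the real Qwen API.
    """
    with open(video_path, "rb") as f:
        video_b64 = base64.b64encode(f.read()).decode()
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "qwen3.5-omni",                     # illustrative model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "video", "data": video_b64, "fps": 1},   # screen recording
                {"type": "audio", "data": audio_b64},             # spoken instruction
                {"type": "text",
                 "text": "Implement the page layout I sketched and described."},
            ],
        }],
    }
    return json.dumps(payload)


# Example (paths are placeholders):
# request_body = build_vibe_coding_request("demo_screen.mp4", "instruction.wav")
```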
Why It Matters
This sets a new bar for multimodal AI assistants, enabling professional-grade analysis of long media files and complex, context-rich interactions.