Viral Wire

Alibaba's Qwen3.5-LiveTranslate-Flash triples languages to 60 with voice cloning

Real-time translation now covers 60 languages, preserves the speaker's voice in spoken output, and averages 2.8 s of latency.

Deep Dive

Alibaba's Qwen team released Qwen3.5-LiveTranslate-Flash on May 19, 2026, a major update to its real-time multimodal translation model. The update more than triples language coverage, from 18 to 60 languages, and introduces real-time voice cloning. Built on the new Thinker-Talker architecture (replacing Qwen3-Omni), the system separates translation processing from speech generation: one component handles multilingual audio and visual context for translation, while another generates speech output in the target language. The model processes speech in smaller semantic units rather than waiting for complete sentences, which reduces latency while preserving natural flow across languages with different sentence structures. Alibaba reports an average speech-to-speech latency of 2.8 seconds, which it describes as a significant improvement over the previous version.
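
The idea of translating in small semantic units instead of whole sentences can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the punctuation-or-length segmentation rule, and the stub thinker and talker callables are hypothetical stand-ins, not Alibaba's implementation or API.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical boundary rule: close a unit at punctuation or after max_words.
BOUNDARY_MARKS = (",", ".", "?", "!", ";", ":")


def semantic_units(words: Iterable[str], max_words: int = 6) -> Iterator[List[str]]:
    """Group an incoming word stream into short units that can be translated early."""
    unit: List[str] = []
    for word in words:
        unit.append(word)
        if word.endswith(BOUNDARY_MARKS) or len(unit) >= max_words:
            yield unit
            unit = []
    if unit:  # flush whatever is left when the stream ends
        yield unit


def run_pipeline(
    words: Iterable[str],
    thinker: Callable[[str], str],
    talker: Callable[[str], str],
    max_words: int = 6,
) -> Iterator[str]:
    """Translate each unit (thinker), then synthesize it (talker), one unit at a time."""
    for unit in semantic_units(words, max_words):
        translated = thinker(" ".join(unit))
        yield talker(translated)


# Stub components so the sketch runs without any model (assumptions, not real APIs).
def stub_thinker(text: str) -> str:
    return text.upper()  # placeholder for translation


def stub_talker(text: str) -> str:
    return f"<audio:{text}>"  # placeholder for speech synthesis


utterance = "Hello everyone, welcome to the stream today and thanks for joining us.".split()
chunks = list(run_pipeline(utterance, stub_thinker, stub_talker))
for chunk in chunks:
    print(chunk)
```

With these stubs, the 12-word utterance is emitted as three units, so output can start after the first two words instead of after the full sentence. A real system would choose boundaries from meaning, not punctuation alone.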

Of the 60 supported languages, 29 have spoken output and 31 are text-only. Three voice cloning modes (pre-registered, clone-once, and real-time) let translated speech preserve the speaker's vocal characteristics, which is particularly valuable for streamers, hosts, and guests in multilingual interactions. The system retains support for up to 1,000 dynamically configurable hotwords for names, brands, and technical terms. Alibaba positions the model for multilingual meetings, livestream localization, online classrooms, and business negotiations. The company also provides a browser-based LiveTranslate demo for testing speech translation and voice cloning. Future work aims to reduce latency further, expand language and dialect support, improve terminology consistency over long conversations, and enhance multimodal interaction involving gestures, lip movement, and facial expressions.
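
As a rough illustration of what "dynamically configurable hotwords" and the three cloning modes might look like from a caller's side, here is a small Python sketch. The class and member names are hypothetical, and plain string substitution stands in for whatever term handling the model actually performs; only the 1,000-term limit and the three mode names come from the announcement.

```python
from enum import Enum

MAX_HOTWORDS = 1000  # limit stated in the article


class VoiceCloneMode(Enum):
    """The three modes named in the article; the enum itself is hypothetical."""
    PRE_REGISTERED = "pre-registered"
    CLONE_ONCE = "clone-once"
    REAL_TIME = "real-time"


class HotwordList:
    """Hypothetical registry of preferred renderings for names, brands, and terms."""

    def __init__(self) -> None:
        self._terms: dict[str, str] = {}

    def add(self, source: str, target: str) -> None:
        # Updating an existing term is always allowed; new terms respect the cap.
        if source not in self._terms and len(self._terms) >= MAX_HOTWORDS:
            raise ValueError(f"hotword limit of {MAX_HOTWORDS} reached")
        self._terms[source] = target

    def remove(self, source: str) -> None:
        self._terms.pop(source, None)

    def apply(self, text: str) -> str:
        # Stand-in behavior: literal substitution, not model-level biasing.
        for source, target in self._terms.items():
            text = text.replace(source, target)
        return text

    def __len__(self) -> int:
        return len(self._terms)


hotwords = HotwordList()
hotwords.add("live translate", "LiveTranslate")
mode = VoiceCloneMode("clone-once")
print(mode.name, hotwords.apply("welcome to live translate"))
```

Terms can be added or removed between utterances, which is the sense of "dynamic" assumed here.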

Key Points
  • Language coverage expanded from 18 to 60 languages: 29 with spoken output and 31 text-only
  • Real-time voice cloning with three modes (pre-registered, clone-once, real-time) to preserve speaker identity across languages
  • Average speech-to-speech latency reduced to 2.8 seconds using the new Thinker-Talker architecture and semantic unit processing

Why It Matters

Real-time multilingual communication with preserved speaker identity transforms livestreaming, global meetings, and online education.