X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
A compact 0.4B-parameter model rivals billion-scale systems at zero-shot cross-lingual voice cloning.
A team of researchers from multiple institutions has released X-Voice, a compact 0.4B-parameter voice cloning model that can replicate a speaker's voice and read text in 30 languages, without needing a transcript of the reference audio. The model is trained on a 420K-hour multilingual corpus and uses the International Phonetic Alphabet (IPA) as a unified text representation, which keeps pronunciation consistent across languages. To avoid complex pre-processing such as forced alignment or speaker verification, the authors designed a two-stage training paradigm. In stage one, a conditional flow-matching model (X-Voice$_{s1}$) synthesizes 10K hours of speaker-consistent audio segments. In stage two, the model is fine-tuned on these segments with the prompt text masked out, yielding X-Voice$_{s2}$, which can clone voices from raw audio alone.
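For readers unfamiliar with flow matching, the training objective can be illustrated with a minimal sketch. It assumes the straight-line (optimal-transport) path between noise and data that is common in flow-matching TTS; the summary does not state which path X-Voice uses, and the function names below are hypothetical, not taken from the released code.

```python
import random

def flow_matching_pair(x1, t, rng):
    """Build one training example for conditional flow matching.

    x1  : clean feature vector (for example, one mel-spectrogram frame)
    t   : time in [0, 1]; 0 is pure noise, 1 is clean data
    rng : random.Random instance used to draw the noise sample
    Returns (x_t, velocity): the noisy input and the regression target.
    Assumption: straight-line path x_t = (1 - t) * x0 + t * x1.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # Gaussian noise
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
    velocity = [b - a for a, b in zip(x0, x1)]             # constant along the path
    return x_t, velocity

def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

# At t = 1 the interpolated point coincides with the clean data.
x1 = [0.5, -1.0, 2.0]
x_t_end, v_end = flow_matching_pair(x1, t=1.0, rng=random.Random(0))
```

During training, a network would receive `x_t`, `t`, and the conditioning (text and audio prompt) and be optimized so that `mse` between its output and `velocity` is small. At inference time, integrating the predicted velocity from noise toward `t = 1` produces the synthesized features.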
Architecturally, X-Voice extends F5-TTS with dual-level injection of language identifiers and decoupled classifier-free guidance scheduling. Subjective listening tests and objective metrics show that X-Voice surpasses existing flow-matching multilingual systems like LEMAS-TTS, and achieves zero-shot cross-lingual cloning performance on par with billion-scale models such as Qwen3-TTS – all while being significantly smaller. The full model, training code, and evaluation data have been open-sourced to encourage further research and transparency. This release lowers the barrier for multilingual voice cloning, enabling applications in dubbing, accessibility, and personal assistants without depending on proprietary APIs.
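The summary does not define "decoupled classifier-free guidance scheduling", so the following is only one plausible reading, sketched under stated assumptions: the text/language condition and the speaker prompt each get their own guidance weight, and each weight may vary over the sampling time `t`. All names here (`decoupled_cfg`, `linear_schedule`, `constant`) are hypothetical and are not drawn from the X-Voice code.

```python
def constant(value):
    """Schedule that returns the same guidance weight at every step."""
    return lambda t: value

def linear_schedule(start, end):
    """Schedule that moves linearly from `start` at t=0 to `end` at t=1."""
    return lambda t: start + (end - start) * t

def decoupled_cfg(v_uncond, v_text, v_full, w_text, w_spk, t):
    """Combine three velocity predictions with separately scheduled weights.

    v_uncond : prediction with all conditions dropped
    v_text   : prediction with text and language only (assumed setup)
    v_full   : prediction with text, language, and speaker prompt
    w_text, w_spk : callables mapping t in [0, 1] to a guidance weight
    """
    wt, ws = w_text(t), w_spk(t)
    return [
        u + wt * (a - u) + ws * (f - a)   # text term plus speaker term
        for u, a, f in zip(v_uncond, v_text, v_full)
    ]

# Example: strong text guidance, weaker speaker guidance, halfway through sampling.
guided = decoupled_cfg([0.0], [1.0], [3.0], constant(2.0), constant(0.5), t=0.5)
```

When both schedules return the same weight `w`, the expression collapses to the standard classifier-free guidance form `v_uncond + w * (v_full - v_uncond)`, so this sketch can be read as a generalization of ordinary guidance rather than a replacement for it.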
- X-Voice is a 0.4B-parameter zero-shot cross-lingual voice cloning model supporting 30 languages.
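- Trained on 420K hours of multilingual data with IPA as a unified text representation; a two-stage training scheme removes the need to transcribe the prompt audio.
- Outperforms the flow-matching system LEMAS-TTS and performs on par with billion-scale models such as Qwen3-TTS in zero-shot cross-lingual cloning; the model, training code, and evaluation data are open-source.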
Why It Matters
An open-source model that clones voices across 30 languages without prompt transcripts lowers the barrier to building multilingual dubbing and accessibility tools without relying on proprietary APIs.