Audio & Speech

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

A tiny 0.4B model rivals billion-scale voice cloning with zero-shot cross-lingual ability.

Deep Dive

A team of researchers from multiple institutions has released X-Voice, a compact 0.4B-parameter voice cloning model that can clone a speaker's voice and read text in 30 languages, without needing a transcript of the reference audio. The model is trained on a 420K-hour multilingual corpus and uses the International Phonetic Alphabet (IPA) as a unified text representation, so that pronunciation is handled consistently across languages. To avoid complex forced-alignment or speaker-verification pre-processing, the authors designed a two-stage training paradigm. In stage one, a conditional flow-matching model (X-Voice$_{s1}$) is trained and then used to synthesize 10K hours of speaker-consistent audio segments. In stage two, the model is fine-tuned on these segments with the prompt text masked out, yielding X-Voice$_{s2}$, which can clone voices from raw audio alone.
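
The difference between the two stages can be illustrated with a minimal sketch of how the text condition might be built. This is not the authors' code: the `MASK` token, the `Utterance` class, and `build_text_condition` are hypothetical names, and the choice to replace each prompt symbol with a placeholder (rather than dropping the prompt text altogether) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List

MASK = "<mask>"  # hypothetical placeholder token, not taken from the X-Voice release


@dataclass
class Utterance:
    prompt_ipa: List[str]  # IPA symbols for the reference (prompt) audio
    target_ipa: List[str]  # IPA symbols for the speech to be generated


def build_text_condition(utt: Utterance, mask_prompt: bool) -> List[str]:
    """Return the text sequence the model is conditioned on.

    mask_prompt=False mimics stage one (prompt transcript visible);
    mask_prompt=True mimics stage two (prompt transcript hidden).
    """
    if mask_prompt:
        prompt = [MASK] * len(utt.prompt_ipa)
    else:
        prompt = list(utt.prompt_ipa)
    return prompt + list(utt.target_ipa)


utt = Utterance(
    prompt_ipa=["h", "ə", "l", "oʊ"],
    target_ipa=["b", "ɔ̃", "ʒ", "u", "ʁ"],
)
stage_one = build_text_condition(utt, mask_prompt=False)
stage_two = build_text_condition(utt, mask_prompt=True)
```

In this sketch the stage-two input carries no information about what the prompt says, only about the target text, which is the property the masked fine-tuning is meant to teach the model to cope with. A real system would also need a placeholder scheme that does not depend on the transcript length at inference time, since no transcript is available then.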

Architecturally, X-Voice extends F5-TTS with dual-level injection of language identifiers and decoupled classifier-free guidance scheduling. Subjective listening tests and objective metrics show that X-Voice surpasses existing flow-matching multilingual systems such as LEMAS-TTS, and achieves zero-shot cross-lingual cloning performance on par with billion-scale models such as Qwen3-TTS, despite having only 0.4B parameters. The full model, training code, and evaluation data have been open-sourced to encourage further research and transparency. This release lowers the barrier for multilingual voice cloning, enabling applications in dubbing, accessibility, and personal assistants without depending on proprietary APIs.
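
To make "decoupled classifier-free guidance scheduling" more concrete, here is a small sketch of one common way to give two conditions their own guidance weights, each varying over the sampling time `t`. The exact formulation used by X-Voice is not described above, so the combination rule, the function names `linear_schedule` and `decoupled_cfg`, and the example weights are all assumptions.

```python
from typing import Callable, List

Vector = List[float]


def linear_schedule(start: float, end: float) -> Callable[[float], float]:
    """Return w(t) that moves linearly from `start` at t=0 to `end` at t=1."""
    return lambda t: start + (end - start) * t


def decoupled_cfg(
    v_null: Vector,   # model output with no conditioning
    v_text: Vector,   # model output with text conditioning only
    v_full: Vector,   # model output with text and audio-prompt conditioning
    w_text: float,    # guidance weight for the text condition
    w_audio: float,   # guidance weight for the audio-prompt condition
) -> Vector:
    """Combine three predictions, scaling each condition's contribution separately."""
    return [
        n + w_text * (t - n) + w_audio * (f - t)
        for n, t, f in zip(v_null, v_text, v_full)
    ]


# Hypothetical schedules: text guidance decays, audio guidance stays constant.
text_weight = linear_schedule(2.0, 1.0)
audio_weight = linear_schedule(0.5, 0.5)

t = 0.0
guided = decoupled_cfg(
    v_null=[0.0, 1.0],
    v_text=[1.0, 1.0],
    v_full=[3.0, 2.0],
    w_text=text_weight(t),
    w_audio=audio_weight(t),
)
```

With both weights set to 1 the result reduces to the fully conditioned prediction `v_full`, which is a useful sanity check for any guidance rule of this shape.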

Key Points
  • X-Voice is a 0.4B-parameter zero-shot cross-lingual voice cloning model supporting 30 languages.
  • Trained on 420K hours of multilingual data with IPA as a unified text representation; a second fine-tuning stage with the prompt text masked removes the need for a transcript of the prompt audio.
  • Outperforms LEMAS-TTS and is on par with billion-scale Qwen3-TTS on zero-shot cross-lingual cloning; the model, training code, and evaluation data are open-source.

Why It Matters

Open-source voice cloning for 30 languages that needs no transcript of the reference audio lowers the barrier to building multilingual dubbing and accessibility tools without proprietary APIs.