One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
Cross-lingual voice cloning keeps speaker identity intact across three languages.
A team led by Amanuel Gizachew Abebe and Yasmin Moslem has submitted a system to the IWSLT 2026 Cross-Lingual Voice Cloning shared task, which addresses the challenge of preserving a speaker's voice identity when generating speech in a different language. The work focuses on the specialized domain of scientific communication, targeting Arabic, Chinese, and French. The researchers evaluate several state-of-the-art voice cloning models before building their own systems on the OmniVoice foundation model. To overcome the scarcity of parallel multilingual speech data for scientific content, they generate augmented training data via multi-model ensemble distillation over the ACL 60/60 corpus, a multilingual speech collection drawn from ACL conference talks. Fine-tuning on this synthetic data yields consistent gains in intelligibility metrics, Word Error Rate (WER) and Character Error Rate (CER), across all three languages while maintaining high speaker similarity scores.
The paper, published on arXiv (2604.26136), demonstrates that fine-tuning on domain-specific synthetic data can significantly enhance cross-lingual voice cloning performance without sacrificing voice identity. The system's ability to generate fluent scientific speech in multiple languages from a single speaker's voice has implications for global knowledge dissemination, making research accessible to non-native speakers. The researchers note that while the approach shows promise, further work is needed to handle tonal languages like Chinese more naturally and to expand to additional low-resource languages. The submission is part of the IWSLT 2026 evaluation campaign, which benchmarks progress in spoken language translation and related technologies.
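The WER and CER figures reported in the paper are standard edit-distance metrics; the paper does not publish its evaluation code, so the following is a generic sketch of how these metrics are conventionally computed, with a word-level distance for WER and a character-level distance for CER:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, via dynamic programming."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the empty-prefix row
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: the same computation at the character level."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

In practice, transcripts from an ASR model run over the cloned speech are scored against the reference text; lower WER/CER means more intelligible output. Note that whitespace-based word splitting is a simplification for Chinese, where CER is the more meaningful metric.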
- System based on OmniVoice foundation model fine-tuned on synthetic data from ACL 60/60 corpus
- Achieves consistent WER and CER improvements across Arabic, Chinese, and French
- Multi-model ensemble distillation used for data augmentation to overcome domain-specific data scarcity
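The speaker similarity scores mentioned above are conventionally computed as the cosine similarity between speaker embeddings extracted from the reference and the generated audio by a speaker-verification model. The paper does not specify its scoring pipeline, so this is a minimal sketch that assumes the embeddings have already been extracted:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A score near 1.0 indicates the cloned voice is close to the original speaker in the embedding space; cross-lingual systems aim to keep this high while WER/CER improve.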
Why It Matters
Enables researchers to deliver scientific talks in multiple languages using their own voice, breaking language barriers.