Audio & Speech

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Researchers achieve zero-shot TTS for 600+ languages using a novel diffusion architecture and 581k-hour dataset.

Deep Dive

A research team led by Han Zhu has introduced OmniVoice, a text-to-speech model that supports over 600 languages through a novel diffusion-based architecture. Unlike conventional two-stage pipelines that first convert text to semantic tokens and then to audio, OmniVoice uses a discrete non-autoregressive (NAR) approach to map text directly to multi-codebook acoustic tokens. This simplification rests on two key innovations: a full-codebook random masking strategy for efficient training, and initialization from a pre-trained large language model (LLM) to ensure strong speech intelligibility.
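To make the masking idea concrete, here is a minimal sketch of what "full-codebook" masking could look like for training: the same randomly chosen frames are hidden in every codebook at once, rather than masking each codebook independently. The function name, the `MASK` id, and the array shapes are illustrative assumptions, not details from the paper.

```python
import numpy as np

MASK = -1  # hypothetical mask token id (not from the paper)

def full_codebook_random_mask(tokens, mask_ratio, rng):
    """Mask randomly chosen frames across ALL codebooks at once.

    tokens: (num_codebooks, T) integer array of acoustic token ids.
    Hiding every codebook at a selected frame forces the model to
    reconstruct that frame's entire acoustic stack from the text,
    rather than copying information from sibling codebooks.
    """
    num_codebooks, T = tokens.shape
    n_mask = max(1, round(mask_ratio * T))
    idx = rng.choice(T, size=n_mask, replace=False)
    masked = tokens.copy()
    masked[:, idx] = MASK            # same frames hidden in every codebook
    target = np.zeros(T, dtype=bool)
    target[idx] = True               # loss is computed only on these frames
    return masked, target

# toy usage: 8 codebooks, 100 frames, 30% of frames masked
rng = np.random.default_rng(0)
codes = rng.integers(0, 1024, size=(8, 100))
masked, target = full_codebook_random_mask(codes, 0.3, rng)
```

Because a masked frame carries no residual signal in any codebook, the training objective more closely matches inference, where whole frames must be produced from scratch.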

The model was trained on a massive 581k-hour multilingual dataset curated entirely from open-source data, representing the broadest language coverage achieved to date in TTS research. OmniVoice delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks while retaining zero-shot capability: it can synthesize speech in a voice unseen during training from only a short reference prompt. The researchers have made their code and pre-trained models publicly available, potentially democratizing high-quality speech synthesis for hundreds of under-resourced languages.

This advancement represents a significant leap in making speech technology truly global and accessible. By leveraging diffusion models and massive open-source datasets, OmniVoice addresses the longstanding challenge of creating natural-sounding speech for languages with limited training data. The model's architecture also demonstrates how insights from large language models can be effectively transferred to speech generation tasks, opening new possibilities for cross-modal AI applications.

Key Points
  • Novel diffusion language model architecture enables direct text-to-acoustic token mapping, simplifying traditional two-stage TTS pipelines
  • Trained on 581k hours of open-source multilingual data, achieving coverage of over 600 languages—the broadest to date
  • Achieves state-of-the-art performance in Chinese, English, and multilingual benchmarks with publicly available code and models
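The direct text-to-token mapping in the first point is typically realized as iterative parallel unmasking: generation starts from a fully masked token grid and commits the most confident predictions over a few rounds. The sketch below follows this common NAR decoding scheme (in the style of MaskGIT); the paper's exact sampler may differ, and `MASK`, `VOCAB`, and the toy stand-in model are assumptions for illustration.

```python
import numpy as np

MASK = -1      # hypothetical mask id
VOCAB = 1024   # assumed codebook vocabulary size

def iterative_unmask(predict, num_codebooks, T, steps):
    """NAR generation: start fully masked, reveal tokens over `steps` rounds.

    `predict(tokens)` stands in for the text-conditioned model and must
    return per-position probabilities of shape (num_codebooks, T, VOCAB).
    Each round commits the most confident predictions among still-masked
    positions; every position is decided after the final round.
    """
    tokens = np.full((num_codebooks, T), MASK, dtype=np.int64)
    for step in range(steps):
        n_masked = int((tokens == MASK).sum())
        if n_masked == 0:
            break
        probs = predict(tokens)
        choice = probs.argmax(-1)               # greedy token per position
        conf = probs.max(-1)
        conf[tokens != MASK] = -np.inf          # decided positions stay fixed
        k = max(1, n_masked // (steps - step))  # equal share per round
        top = np.argsort(conf.ravel())[::-1][:k]
        c, t = np.unravel_index(top, conf.shape)
        tokens[c, t] = choice[c, t]
    return tokens

# toy stand-in model: uniform random distributions over the vocabulary
rng = np.random.default_rng(0)
def toy_predict(tokens):
    p = rng.random(tokens.shape + (VOCAB,))
    return p / p.sum(-1, keepdims=True)

out = iterative_unmask(toy_predict, num_codebooks=8, T=50, steps=10)
```

Because all positions are predicted in parallel each round, the number of model calls is the fixed step count rather than the sequence length, which is the main speed advantage over autoregressive token-by-token decoding.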

Why It Matters

Democratizes high-quality speech synthesis for hundreds of under-resourced languages, enabling truly global voice applications.