Raon-OpenTTS matches closed-source TTS with 615K hours of open data

arXiv eess.AS May 21, 2026

⚡ New open TTS model achieves 1.78% word error rate on Seed-TTS-Eval.

Deep Dive

A team of researchers from KAIST, University of Washington, and other institutions has released Raon-OpenTTS, a fully open text-to-speech system designed to challenge the dominance of proprietary models. The project includes Raon-OpenTTS-Pool, a massive open dataset of 615K hours (240M speech segments) aggregated from publicly available English speech corpora and web-sourced recordings. After applying a model-based filtering pipeline, they derived Raon-OpenTTS-Core, a high-quality subset of 510K hours and 194M segments. Using this curated data, they trained a series of diffusion transformer (DiT) models ranging from 0.3B to 1B parameters. On the Seed-TTS-Eval benchmark, Raon-OpenTTS-1B achieves a word error rate (WER) of just 1.78% and a speaker similarity (SIM) of 0.749, ranking second on WER and first on SIM among recent open-weight TTS baselines. On the harder CV3-Hard-EN benchmark, it tops both metrics with a WER of 6.15% and SIM of 0.775, outperforming all open-weight alternatives.

To support rigorous evaluation, the team also introduces Raon-OpenTTS-Eval, a structured benchmark that assesses TTS robustness across clean, noisy, in-the-wild, and expressive speech conditions. On this benchmark, Raon-OpenTTS-1B achieves the best average WER and SIM among all evaluated models, including closed-source systems like Qwen3-TTS and CosyVoice 3, and ranks second in human preference (CMOS). The researchers have made the entire pipeline open: the data pool, filtering scripts, training code, and model checkpoints are all publicly available. This release is a significant step toward democratizing high-quality TTS, providing the research community with both the data and models needed to build robust speech synthesis without relying on proprietary datasets.
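The WER and SIM figures quoted above follow standard definitions: WER is the word-level edit distance between a reference text and a transcript, divided by the number of reference words, and SIM is usually a cosine similarity between two speaker embeddings. The article does not describe the benchmarks' scoring pipelines, so the following is only a minimal sketch of those two formulas. In practice the transcript typically comes from an ASR model run on the synthesized audio and the embeddings from a speaker-verification model; neither is shown here, and the function names `word_error_rate` and `cosine_similarity` are illustrative, not taken from the Raon-OpenTTS code.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Simplified normalization (lowercase + whitespace split); real benchmarks
    # usually apply a more careful text normalizer. This is an assumption.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on a mat")` counts one substitution against six reference words, giving a WER of about 16.7%. A benchmark-level score such as 1.78% is an aggregate of this kind of measurement over the whole test set.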

Key Points

Raon-OpenTTS-Pool: 615K hours, 240M segments from public corpora and web recordings; filtered to 510K hours (194M segments) for training.
Raon-OpenTTS-1B achieves 1.78% WER and 0.749 SIM on Seed-TTS-Eval (second on WER, first on SIM among open-weight baselines), and 6.15% WER / 0.775 SIM on CV3-Hard-EN (first on both metrics).
All code, data, filtering pipeline, and checkpoints are open-sourced, enabling reproducible TTS research.
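
The dataset sizes above imply how much of the pool survives filtering and how long a typical segment is. The short calculation below works this out from the rounded figures in the article (615K/510K hours, 240M/194M segments), so the results are approximate; the helper names are illustrative.

```python
# Rounded figures as reported in the article.
POOL_HOURS, POOL_SEGMENTS = 615_000, 240_000_000
CORE_HOURS, CORE_SEGMENTS = 510_000, 194_000_000


def retention(kept: float, total: float) -> float:
    """Fraction of the original pool that remains after filtering."""
    return kept / total


def mean_segment_seconds(hours: float, segments: float) -> float:
    """Average segment duration in seconds."""
    return hours * 3600 / segments


hours_kept = retention(CORE_HOURS, POOL_HOURS)
segments_kept = retention(CORE_SEGMENTS, POOL_SEGMENTS)
pool_mean = mean_segment_seconds(POOL_HOURS, POOL_SEGMENTS)
core_mean = mean_segment_seconds(CORE_HOURS, CORE_SEGMENTS)

print(f"hours kept:    {hours_kept:.1%}")
print(f"segments kept: {segments_kept:.1%}")
print(f"mean segment:  {pool_mean:.2f} s (pool), {core_mean:.2f} s (core)")
```

On these figures, filtering keeps roughly 83% of the hours and 81% of the segments, and the average segment is a little over 9 seconds both before and after filtering.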

Why It Matters

Open TTS models and data now rival proprietary systems, lowering barriers for robust speech synthesis development.
