Sarashina2.2-TTS cracks Japanese kanji polyphony with 361k hours of data
A new TTS model reads all 2,136 Joyo kanji with state-of-the-art accuracy.
Researchers from multiple institutions have unveiled Sarashina2.2-TTS, a Japanese-focused text-to-speech system built on large language models. The paper, posted on arXiv, addresses a longstanding challenge in Japanese speech synthesis: kanji polyphony, where a single kanji character has multiple possible readings depending on context. The team trained the model on approximately 361,000 hours of speech data, combining Japanese and English to improve cross-lingual robustness.
To tackle polyphony directly, they designed a data augmentation pipeline covering all 2,136 Joyo kanji (regular-use characters designated by Japan's Agency for Cultural Affairs). They also introduce the Joyo Kanji Yomi Benchmark, which includes 4,378 readings, and a new metric called Kana-CER that measures pronunciation correctness by comparing synthesized speech to reference readings in kana space. In the reported experiments, Sarashina2.2-TTS achieves state-of-the-art kanji-level reading accuracy, matches top baselines on general sentence pronunciation, and delivers the highest speaker similarity in zero-shot Japanese TTS. Notably, it is the only system in the comparison that maintains stable Japanese pronunciation regardless of prompt language, a result that points to the value of balanced bilingual training.
- Trained on 361k hours of speech with balanced Japanese and English data for cross-lingual robustness.
- Targeted data augmentation covers all 2,136 Joyo kanji, yielding state-of-the-art reading accuracy on the new Joyo Kanji Yomi Benchmark.
- Introduces the Kana-CER metric, which measures pronunciation correctness in kana space and so removes orthographic variation from the comparison.
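
To make the Kana-CER idea concrete, here is a minimal Python sketch. It assumes the metric is a standard character error rate (Levenshtein distance divided by reference length) computed over kana strings; the paper's exact definition may differ. It also assumes the synthesized audio has already been transcribed into kana by some upstream step, which is not shown. The function names `to_hiragana`, `edit_distance`, and `kana_cer` are illustrative and do not come from the paper.

```python
def to_hiragana(text: str) -> str:
    """Fold katakana (U+30A1..U+30F6) onto hiragana so script choice is ignored."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def kana_cer(reference_kana: str, hypothesis_kana: str) -> float:
    """Assumed definition: edit distance over kana, normalized by reference length."""
    ref = to_hiragana(reference_kana)
    hyp = to_hiragana(hypothesis_kana)
    if not ref:
        raise ValueError("reference reading must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Same reading written in katakana instead of hiragana: no error.
print(kana_cer("きょう", "キョウ"))
# A wrong reading of the same kanji (e.g. "なま" vs "せい" for 生): every kana differs.
print(kana_cer("なま", "せい"))
```

Because both sides are compared as kana rather than as mixed kanji-kana text, a transcript that spells the same sounds with different characters is not penalized, while a wrong reading of a polyphonic kanji is.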
Why It Matters
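Kanji misreadings are a persistent bottleneck in Japanese TTS. By pairing targeted coverage of all 2,136 Joyo kanji with a dedicated benchmark and metric for reading accuracy, Sarashina2.2-TTS brings accurate, natural-sounding speech closer for everyday kanji-based text.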