Audio & Speech

CC-G2PnP: Streaming Grapheme-to-Phoneme and Prosody with Conformer-CTC for Unsegmented Languages

A new Conformer-CTC model processes text chunk-by-chunk, eliminating the need for explicit word boundaries.

Deep Dive

Researchers Yuma Shirahata and Ryuichi Yamamoto developed CC-G2PnP, a streaming grapheme-to-phoneme and prosody model. Built on a Conformer-CTC architecture, it processes text chunk-by-chunk with minimal look-ahead, enabling real-time inference. Crucially, its CTC decoder learns the alignment between graphemes and phonemes on its own, without requiring word boundaries. In tests on unsegmented Japanese, it significantly outperformed baseline streaming models in phoneme-and-prosody (PnP) label prediction accuracy, paving the way for seamless, low-latency text-to-speech in more languages.
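To make the streaming idea concrete, here is a minimal sketch of chunk-by-chunk processing with a small look-ahead window. The chunk and look-ahead sizes, the `stream_chunks` helper, and the `mock_g2p` predictor are all illustrative assumptions, not the paper's actual interface; a real system would run the Conformer-CTC model where the mock sits.

```python
def stream_chunks(text: str, chunk_size: int = 4, look_ahead: int = 2):
    """Yield (chunk, context) pairs: each chunk plus a few future
    characters of look-ahead context. Latency is bounded by
    chunk_size + look_ahead characters, independent of input length."""
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        context = text[start + chunk_size:start + chunk_size + look_ahead]
        yield chunk, context

def mock_g2p(chunk: str, context: str) -> list:
    """Stand-in for the Conformer-CTC predictor (hypothetical): maps
    each grapheme to a placeholder phoneme label. A real model would
    use `context` to disambiguate readings at chunk boundaries."""
    return ["/%s/" % ch for ch in chunk]

def streaming_g2p(text: str) -> list:
    """Run the mock predictor over the stream and collect labels."""
    labels = []
    for chunk, context in stream_chunks(text):
        labels.extend(mock_g2p(chunk, context))
    return labels
```

For example, `stream_chunks("abcdefgh")` emits `("abcd", "ef")` and then `("efgh", "")`, so the predictor never waits for the full sentence, only for the next few characters of look-ahead.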

Why It Matters

This work enables real-time, high-quality voice synthesis for unsegmented languages such as Japanese, Chinese, and Thai, removing a major barrier for global AI voice applications.