PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
New AI system from Korean researchers achieves more accurate lip-sync than human voice actors in cross-language dubbing.
A research team from Gwangju Institute of Science and Technology has developed PS-TTS, a novel text-to-speech system specifically designed for automated dubbing that synchronizes translated speech with on-screen lip movements. The system tackles two critical synchronization challenges: isochrony (matching speech duration) and phonetic synchronization (matching lip movements). Their approach uses a language model to paraphrase the translated text for timing alignment, then employs dynamic time warping with vowel distance metrics so that target-language vowels produce mouth shapes similar to those of the source-language vowels.
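The alignment step can be sketched as classic dynamic time warping over two vowel sequences. The articulatory features and the Euclidean distance below are illustrative assumptions for the "vowel distance metric" the article mentions, not the paper's actual formulation:

```python
# Hypothetical sketch of DTW with a vowel distance metric.
# The feature table (openness, frontness, roundedness) is an
# illustrative assumption, not the system's actual metric.

VOWEL_FEATURES = {
    "a": (1.0, 0.5, 0.0),
    "e": (0.5, 1.0, 0.0),
    "i": (0.0, 1.0, 0.0),
    "o": (0.5, 0.0, 1.0),
    "u": (0.0, 0.0, 1.0),
}

def vowel_distance(v1, v2):
    """Euclidean distance between toy articulatory feature vectors."""
    f1, f2 = VOWEL_FEATURES[v1], VOWEL_FEATURES[v2]
    return sum((a - b) ** 2 for a, b in zip(f1, f2)) ** 0.5

def dtw_vowel_cost(source, target):
    """Minimum cumulative cost of aligning target-language vowels
    to source-language vowels via dynamic time warping."""
    n, m = len(source), len(target)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = vowel_distance(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a source vowel
                                 cost[i][j - 1],      # skip a target vowel
                                 cost[i - 1][j - 1])  # match the pair
    return cost[n][m]

# Lower cost = a paraphrase whose vowels better track the
# on-screen mouth shapes.
print(dtw_vowel_cost(list("aio"), list("aeo")))  # → 0.5
```

Under this framing, the language model's paraphrase candidates can be rescored by DTW cost, preferring the wording whose vowel sequence most closely mirrors the original actor's mouth movements.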
The researchers extended this to PS-Comet, which jointly optimizes for both phonetic similarity and semantic preservation using a multi-objective approach. In comprehensive evaluations across Korean, English, and French language pairs, both systems outperformed standard TTS without phonetic synchronization on objective metrics. Remarkably, they also surpassed professional voice actors in lip-sync accuracy on Korean-to-English and English-to-Korean dubbing tasks. PS-Comet performed best across all language pairs, balancing lip synchronization with meaning preservation more effectively than the phonetic-only approach.
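The multi-objective trade-off can be illustrated as a weighted ranking of candidate paraphrases. The scoring functions, weights, and candidate scores below are invented for illustration; the actual system pairs phonetic similarity with a learned semantic-quality metric (COMET-style), not hand-set numbers:

```python
# Illustrative sketch of multi-objective candidate ranking.
# alpha and the candidate scores are assumptions for demonstration.

def combined_score(phonetic_sim, semantic_sim, alpha=0.5):
    """Weighted trade-off between lip-sync and meaning preservation.

    phonetic_sim, semantic_sim: similarity scores in [0, 1].
    alpha: assumed hyperparameter weighting phonetic synchronization.
    """
    return alpha * phonetic_sim + (1 - alpha) * semantic_sim

def pick_best_paraphrase(candidates):
    """candidates: list of (text, phonetic_sim, semantic_sim) tuples."""
    return max(candidates, key=lambda c: combined_score(c[1], c[2]))

candidates = [
    ("literal translation", 0.40, 0.95),   # faithful, but poor lip-sync
    ("loose paraphrase",    0.90, 0.55),   # great lip-sync, meaning drifts
    ("balanced paraphrase", 0.75, 0.85),   # the compromise
]
print(pick_best_paraphrase(candidates)[0])  # → "balanced paraphrase"
```

This captures why a joint objective beats phonetic-only scoring: a candidate that maximizes lip-sync alone can lose the translation's meaning, while the combined score rewards candidates that do well on both axes.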
The technology represents a significant advancement in automated dubbing, addressing what has been a persistent challenge in cross-language media localization. By achieving better lip synchronization than human professionals while maintaining semantic accuracy, PS-TTS could dramatically reduce the cost and time required for high-quality dubbing of films, television shows, and educational content. The system's demonstrated effectiveness across multiple language pairs suggests broad applicability for global media distribution.
- PS-TTS uses phonetic synchronization with dynamic time warping to match vowel mouth shapes across languages
- PS-Comet variant jointly optimizes for both lip-sync accuracy and semantic preservation using multi-objective training
- Both systems outperformed professional voice actors on lip-sync accuracy in Korean-English dubbing and were evaluated across Korean, English, and French language pairs
Why It Matters
Enables cheaper, faster, and more natural automated dubbing for global media while maintaining lip synchronization better than human professionals.