ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis
New method creates high-quality custom voices using only 3 minutes of real audio and unlimited synthetic speech.
A research team led by Youngwon Choi has introduced ZeSTA (Zero-Shot TTS Augmentation), a framework for personalized speech synthesis that enables high-quality voice cloning from minimal real data. The system addresses a central bottleneck in AI voice technology: the traditional need for extensive recordings of a target speaker. ZeSTA uses zero-shot text-to-speech (ZS-TTS) systems as an unlimited data augmentation source, generating synthetic speech that is linguistically rich and phonetically diverse. This lets developers train personalized voice models from as little as 3 minutes of actual speaker audio, dramatically reducing data collection requirements while preserving natural voice quality.
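The augmentation loop itself is conceptually simple: clone the target speaker onto a large, diverse text set. A minimal sketch follows, assuming a generic ZS-TTS interface; the `ZeroShotTTS` object, its `synthesize` call, and the file layout are illustrative stand-ins, not the authors' API.

```python
# Sketch: using a zero-shot TTS model as an unlimited augmentation source.
# `zs_tts.synthesize` is a hypothetical interface for whatever ZS-TTS system
# is available; only the overall flow follows the paper's idea.
from pathlib import Path

import soundfile as sf  # pip install soundfile


def build_synthetic_corpus(zs_tts, reference_wav: Path, texts: list[str],
                           out_dir: Path, sample_rate: int = 24_000) -> list[Path]:
    """Clone the reference speaker onto a large, phonetically diverse text set."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        # Zero-shot synthesis: condition on the ~3 minutes of real reference audio.
        wav = zs_tts.synthesize(text=text, reference_audio=reference_wav)
        path = out_dir / f"synth_{i:06d}.wav"
        sf.write(str(path), wav, sample_rate)
        paths.append(path)
    return paths
```

The synthetic corpus then joins the small pool of real recordings as training data for the personalized model, which is where the domain-conditioning described next comes in.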
The technical innovation centers on domain-conditioned training, which distinguishes real from synthetic speech through lightweight domain embeddings, preventing the speaker-similarity degradation that typically occurs when synthetic and real data are mixed. Combined with strategic oversampling of the real recordings, this stabilizes model adaptation under extremely limited target data without modifying the base TTS architecture. Experiments on LibriTTS and proprietary datasets show better speaker-similarity preservation than naive synthetic augmentation while maintaining intelligibility and perceptual quality. The framework is a step toward making personalized voice technology more broadly accessible, with potential applications ranging from audiobook narration to voice assistants built from minimal training data.
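A minimal sketch of the two training-side ideas, assuming a PyTorch-style base TTS whose conditioning vector can be extended; the module and parameter names (including the `real_boost` value) are illustrative assumptions, not taken from the ZeSTA paper or codebase.

```python
# Sketch: lightweight domain embedding plus real-data oversampling.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

REAL, SYNTH = 0, 1  # domain labels: real recording vs. ZS-TTS output


class DomainConditioning(nn.Module):
    """Adds a learned real/synthetic embedding to the existing speaker
    conditioning, so the model can separate the two domains without any
    change to the base TTS architecture."""

    def __init__(self, cond_dim: int):
        super().__init__()
        self.domain_emb = nn.Embedding(2, cond_dim)  # two domains

    def forward(self, speaker_cond: torch.Tensor, domain: torch.Tensor) -> torch.Tensor:
        return speaker_cond + self.domain_emb(domain)


def make_loader(dataset, domains: list[int], real_boost: float = 4.0,
                batch_size: int = 16) -> DataLoader:
    """Oversample the scarce real recordings relative to synthetic ones.
    `real_boost` is an assumed hyperparameter, not a value from the paper."""
    weights = torch.tensor([real_boost if d == REAL else 1.0 for d in domains])
    sampler = WeightedRandomSampler(weights, num_samples=len(domains),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

At synthesis time, selecting the real-domain embedding would steer generation toward the characteristics of the genuine recordings; that is the usual move in domain-conditioned training, though the paper's exact inference configuration may differ.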
- Enables personalized TTS with only 3 minutes of real speaker audio using synthetic augmentation
- Uses domain-conditioned training with lightweight embeddings to prevent speaker similarity degradation
- Tested on LibriTTS and proprietary datasets, outperforming naive synthetic-augmentation baselines
Why It Matters
Dramatically reduces data needed for custom voice creation, making personalized TTS accessible for more applications.