Audio & Speech

Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids' Story Speech Synthesis

This approach could make synthesized children's audiobooks and educational audio sound markedly more natural and expressive.

Deep Dive

Researchers have developed a text-to-speech (TTS) model tailored to expressive children's story narration. The system combines emotion-coherent data augmentation with self-supervised contrastive style training to generate natural, multi-sentence speech with appropriate pauses and emotional tone in a single inference step. In evaluations, it outperformed baseline models on naturalness and style suitability while producing pause distributions closer to those of human narrators. The paper was accepted at the IEEE Spoken Language Technology Workshop 2024.
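The paper summarized above does not spell out its contrastive objective, but self-supervised contrastive style training is commonly built on an NT-Xent-style loss: style embeddings from two augmented views of the same utterance are pulled together, while embeddings from other utterances in the batch act as negatives. The sketch below is a generic illustration under that assumption, not the authors' exact formulation; the function name and shapes are hypothetical.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along the given axis.
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def nt_xent_loss(z1, z2, temperature=0.1):
    """Generic NT-Xent contrastive loss (assumed form, not the paper's).

    z1, z2: (N, D) style embeddings of two augmented views of the same
    N utterances. Pair (z1[i], z2[i]) is a positive; every other pair
    in the batch is a negative.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature                        # (2N, 2N) logits
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    # Index of each row's positive: row i pairs with row i+n and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - _logsumexp(sim, axis=1)
    return -log_prob.mean()
```

Minimizing this loss encourages the style encoder to assign similar embeddings to emotion-coherent augmentations of the same sentence, which is what lets a downstream TTS decoder condition on a stable style representation.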

Why It Matters

By generating expressive, multi-sentence narration in a single pass, this technology could streamline audiobook production and make educational audio content more engaging for children.