Researchers Teach Text-to-Speech Systems to Sigh, Laugh, and Cry with 78.8% Emotion Accuracy
New AI model adds realistic non-verbal sounds for more expressive emotional speech synthesis.
Deep Dive
Researchers (Zhou et al.) developed an emotional text-to-speech (TTS) system that controls non-verbal vocalizations (NVs) in addition to verbal prosody. Using a fine-grained annotation scheme on the EARS corpus, their model achieves 78.8% emotion recognition accuracy and an expressiveness score (eMOS) of 4.20. Sadness reaches 98.3% accuracy. Naturalness dips slightly, but the system marks a step toward more human-like AI speech.
Key Points
- Fine-grained annotation scheme enables precise control over NV types, frequencies, and durations.
- Emotion recognition accuracy reaches 78.8% overall, with sadness at 98.3% and happiness and fear each above 82%.
- Minor trade-off in perceived naturalness yields significant expressiveness improvement (eMOS 4.20).
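
To make the idea of annotating NV types, frequencies, and durations concrete, here is a minimal Python sketch. It is not the authors' annotation format: the `NVEvent` record, its field names, the `summarize` helper, and the example labels and timings are all hypothetical, chosen only to show how per-utterance NV annotations could be tallied by type.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NVEvent:
    """One non-verbal vocalization in an utterance (hypothetical schema)."""
    kind: str          # illustrative label such as "laugh", "sigh", or "cry"
    start_s: float     # onset time in seconds
    duration_s: float  # length in seconds


def summarize(events):
    """Return (count per NV type, total duration in seconds per NV type)."""
    counts = Counter(e.kind for e in events)
    durations = {}
    for e in events:
        durations[e.kind] = durations.get(e.kind, 0.0) + e.duration_s
    return dict(counts), durations


# Made-up annotations for a single utterance, for illustration only.
events = [
    NVEvent("sigh", 0.0, 0.5),
    NVEvent("laugh", 1.25, 0.75),
    NVEvent("laugh", 3.0, 0.25),
]
counts, durations = summarize(events)
print(counts)
print(durations)
```

A summary like this is the kind of signal a synthesis model could be conditioned on, although how the actual system consumes its annotations is not described here.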
Why It Matters
Moves AI voices closer to natural human speech by adding realistic emotional sighs, laughs, and cries.