Audio & Speech

Researchers Teach Text-to-Speech Systems to Sigh, Laugh, and Cry with 78.8% Emotion Recognition Accuracy

New AI model adds realistic non-verbal sounds for more expressive emotional speech synthesis.

Deep Dive

Researchers (Zhou et al.) developed an emotional TTS system that controls non-verbal vocalizations (NVs) such as sighs, laughs, and cries, going beyond verbal prosody. Using a fine-grained annotation scheme on the EARS corpus, their model achieves 78.8% emotion recognition accuracy and an expressiveness score of 4.20 eMOS, with sadness recognized at 98.3%. Although naturalness dips slightly, the system marks a step toward more human-like AI speech.
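For readers unfamiliar with the two headline metrics, here is a minimal Python sketch of how overall and per-emotion recognition accuracy, and a mean opinion score such as eMOS, are conventionally computed. The function names, the 1-to-5 rating scale, and the toy data are assumptions for illustration; they are not the paper's evaluation code or results.

```python
from collections import defaultdict
from statistics import mean


def recognition_accuracy(intended, perceived):
    """Return (overall accuracy, per-emotion accuracy).

    `intended` holds the emotion each clip was synthesized to convey;
    `perceived` holds the emotion a listener or classifier assigned to it.
    """
    if len(intended) != len(perceived) or not intended:
        raise ValueError("need two non-empty label lists of equal length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for target, heard in zip(intended, perceived):
        totals[target] += 1
        hits[target] += int(target == heard)
    overall = sum(hits.values()) / len(intended)
    per_emotion = {emotion: hits[emotion] / totals[emotion] for emotion in totals}
    return overall, per_emotion


def mean_opinion_score(ratings):
    """Average listener ratings, assuming a 1-to-5 scale (an assumption here)."""
    if not ratings or any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be non-empty and lie in [1, 5]")
    return mean(ratings)


# Toy data only: made-up labels and ratings, not figures from the study.
intended = ["sad", "sad", "happy", "happy", "fear"]
perceived = ["sad", "sad", "happy", "fear", "fear"]
overall, per_emotion = recognition_accuracy(intended, perceived)
emos = mean_opinion_score([4, 4, 5, 3])
```

Splitting accuracy by intended emotion is what allows one class, such as sadness, to score far above the overall figure.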

Key Points
  • Fine-grained annotation scheme enables precise control over NV types, frequencies, and durations.
  • Emotion recognition accuracy reaches 78.8% overall, with sadness at 98.3% and both happiness and fear above 82%.
  • A minor drop in perceived naturalness accompanies a significant gain in expressiveness (eMOS 4.20).
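
As a rough illustration of what a fine-grained annotation covering NV type, frequency, and duration might look like, the Python sketch below models one utterance with timed NV events. The class names, field names, and label strings ("sigh", "cry") are hypothetical; the actual scheme used on the EARS corpus is not described in this summary.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NVEvent:
    """One non-verbal vocalization in an utterance (hypothetical schema)."""
    kind: str          # NV type, e.g. "sigh", "laugh", "cry" (illustrative labels)
    start_s: float     # onset time in seconds
    duration_s: float  # length in seconds


@dataclass
class AnnotatedUtterance:
    """Transcript plus emotion label and NV events (hypothetical schema)."""
    text: str
    emotion: str
    events: list[NVEvent] = field(default_factory=list)

    def nv_counts(self) -> Counter:
        """Frequency of each NV type in this utterance."""
        return Counter(event.kind for event in self.events)

    def nv_total_duration(self, kind: str) -> float:
        """Total seconds spent on one NV type."""
        return sum(e.duration_s for e in self.events if e.kind == kind)


# Made-up example utterance, not taken from the corpus.
utt = AnnotatedUtterance(
    text="I can't believe it's over.",
    emotion="sad",
    events=[
        NVEvent("sigh", start_s=0.0, duration_s=0.5),
        NVEvent("cry", start_s=2.0, duration_s=1.5),
        NVEvent("sigh", start_s=4.0, duration_s=0.25),
    ],
)
```

With events stored this way, `utt.nv_counts()` gives the per-type frequency and `utt.nv_total_duration("sigh")` gives the total duration, which are the three properties the annotation scheme is said to control.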

Why It Matters

The work brings AI voices closer to human speech by adding realistic emotional sighs, laughs, and cries.