Audio & Speech

Researchers Teach Text-to-Speech Systems to Sigh, Laugh, and Cry with 78.8% Emotion Accuracy

arXiv eess.AS May 26, 2026

⚡ New AI model adds realistic non-verbal sounds to make synthesized speech more emotionally expressive.

Deep Dive

Researchers (Zhou et al.) developed an emotional text-to-speech (TTS) system that controls non-verbal vocalizations (NVs) such as sighs, laughs, and cries, in addition to verbal prosody. Using a fine-grained annotation scheme on the EARS corpus, their model reaches 78.8% overall emotion recognition accuracy and an expressiveness score (eMOS) of 4.20. Sadness reaches 98.3% accuracy, and happy and fear each exceed 82%. Perceived naturalness dips slightly, but the system marks a step toward more human-like AI speech.
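
For readers wondering how a 98.3% score for one emotion can sit alongside a 78.8% overall figure, the sketch below shows the standard way overall and per-emotion accuracy are computed from paired labels. It is a minimal illustration, not the paper's evaluation code: the summary does not say whether judgments came from listeners or a classifier, and the function name `recognition_accuracy` and the toy labels are assumptions made up for this example.

```python
from collections import Counter


def recognition_accuracy(true_labels, predicted_labels):
    """Return (overall accuracy, per-emotion accuracy) for paired label lists."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    # How many samples exist for each intended emotion.
    totals = Counter(true_labels)
    # How many of those samples were recognized as the intended emotion.
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    per_emotion = {emotion: correct[emotion] / n for emotion, n in totals.items()}
    overall = sum(correct.values()) / len(true_labels)
    return overall, per_emotion


# Toy labels invented for illustration only (not data from the paper).
truth = ["sad", "sad", "happy", "happy", "fear", "fear"]
guess = ["sad", "sad", "happy", "fear", "fear", "happy"]
overall, per_emotion = recognition_accuracy(truth, guess)
```

Because the overall figure pools every sample, a few weakly recognized emotions can pull it well below the best per-emotion score.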

Key Points

Fine-grained annotation scheme enables precise control over NV types, frequencies, and durations.
Emotion recognition accuracy reaches 78.8% overall, with sadness at 98.3% and happy/fear over 82%.
Expressiveness improves significantly (eMOS 4.20) at the cost of a minor drop in perceived naturalness.
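
To make "NV types, frequencies, and durations" concrete, here is one way such annotations could be represented and summarized. This is a hypothetical sketch: the `NVEvent` record, its field names, and the `summarize` helper are assumptions for illustration and are not taken from the paper or the EARS corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NVEvent:
    """One annotated non-verbal vocalization (hypothetical schema)."""
    nv_type: str       # e.g. "laugh", "sigh", "cry"
    start_s: float     # onset within the utterance, in seconds
    duration_s: float  # length of the vocalization, in seconds


def summarize(events):
    """Map each NV type to (occurrence count, total duration in seconds)."""
    summary = {}
    for event in events:
        count, total = summary.get(event.nv_type, (0, 0.0))
        summary[event.nv_type] = (count + 1, total + event.duration_s)
    return summary


# Invented example utterance: two laughs and one sigh.
events = [
    NVEvent("laugh", start_s=0.5, duration_s=0.5),
    NVEvent("sigh", start_s=2.0, duration_s=1.0),
    NVEvent("laugh", start_s=4.0, duration_s=0.25),
]
summary = summarize(events)
```

A per-type summary like this is the kind of signal a synthesis model could be conditioned on, though how the authors actually encode their annotations is not described in this summary.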

Why It Matters

Brings AI voices closer to natural human speech by adding realistic emotional sighs, laughs, and cries.
