Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration
A new metric, DS-WED, validated against a 2000-rating human benchmark, targets the flat, robotic sound of AI-generated speech.
A team of researchers has tackled a core problem in AI voice synthesis: flat, robotic speech. In a paper accepted to ICASSP 2026, they address the lack of reliable ways to measure 'prosody diversity': the variations in rhythm, stress, and intonation that make speech sound natural and expressive. Existing acoustic metrics capture only partial views of prosody and correlate poorly with human perception, leaving developers without a dependable tool for improving their text-to-speech (TTS) models.
To solve this, the team built ProsodyEval, a dedicated benchmark dataset. It contains 1000 speech samples generated by 7 mainstream zero-shot TTS systems, accompanied by 2000 human Prosody Mean Opinion Score (PMOS) ratings. Building on this data, they propose a novel objective metric called Discretized Speech Weighted Edit Distance (DS-WED). Instead of operating on raw audio features, DS-WED quantifies diversity by computing a weighted edit distance over semantic tokens extracted by self-supervised models such as HuBERT and WavLM, capturing higher-level prosodic patterns.
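To make the mechanism concrete, here is a minimal sketch of a weighted edit distance over discrete token sequences, plus a pairwise diversity score built on it. The function names, the uniform operation weights, and the length normalization are illustrative assumptions; the paper defines DS-WED's actual weighting and aggregation.

```python
from itertools import combinations

def weighted_edit_distance(a, b, sub_w=1.0, ins_w=1.0, del_w=1.0):
    """Levenshtein distance between token sequences with per-operation
    weights. Uniform weights are a placeholder for the paper's scheme."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_w
    for j in range(1, n + 1):
        d[0][j] = j * ins_w
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub_w)
            d[i][j] = min(sub, d[i - 1][j] + del_w, d[i][j - 1] + ins_w)
    return d[m][n]

def prosody_diversity(token_seqs):
    """Mean pairwise, length-normalized distance across several renditions
    of the same text; higher values indicate more prosodic variety."""
    pairs = list(combinations(token_seqs, 2))
    return sum(weighted_edit_distance(a, b) / max(len(a), len(b))
               for a, b in pairs) / len(pairs)

# Three hypothetical token sequences for the same sentence:
print(prosody_diversity([[3, 3, 7, 9], [3, 7, 7, 9, 1], [3, 3, 7, 9]]))
```

Because identical renditions yield zero distance, a system that always produces the same prosody scores zero diversity, which is exactly the flatness the metric is designed to expose.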
The results are striking: DS-WED achieves a substantially higher correlation with human judgment than existing acoustic metrics. Using this new tool, the researchers benchmarked state-of-the-art open-source TTS systems on the LibriSpeech and Seed-TTS evaluation sets. Their exploration revealed key factors influencing prosody diversity, including the choice of generative modeling paradigm and duration control. A critical finding is that current large audio language models (LALMs) still struggle to capture these nuanced variations, charting a clear path for future model improvement.
- Introduced ProsodyEval, a benchmark with 2000 human ratings for 1000 samples from 7 TTS systems.
- Proposed DS-WED, a new metric using weighted edit distance over HuBERT/WavLM tokens, which aligns far better with human scores (a tokenization sketch follows this list).
- Benchmarked modern TTS systems and found current large audio language models are still limited in capturing prosodic variation.
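For context, the discrete tokens such a metric consumes are typically produced by quantizing self-supervised speech features. The sketch below assumes the Hugging Face transformers and scikit-learn APIs, a base HuBERT checkpoint, and a k-means codebook fitted offline; the paper's actual layer choice, codebook size, and tokenizer may differ.

```python
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Illustrative setup: the checkpoint and layer index are assumptions,
# not the paper's configuration.
CHECKPOINT = "facebook/hubert-base-ls960"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = HubertModel.from_pretrained(CHECKPOINT).eval()

def hubert_tokens(waveform_16khz, codebook: KMeans, layer: int = 6):
    """Quantize frame-level HuBERT features into a discrete token sequence
    using a k-means codebook fitted beforehand on HuBERT features."""
    inputs = extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    # (1, frames, dim) -> list of frame-level cluster ids
    return codebook.predict(hidden.squeeze(0).numpy()).tolist()
```

Token sequences produced this way can be fed directly to the weighted-edit-distance sketch above.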
Why It Matters
Provides developers with the first reliable tool to measure and improve expressiveness in AI-generated voices, moving beyond robotic speech.