Audio & Speech

Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment

CPPs values above 10 dB are associated with robotic-sounding TTS voices, study finds.

Deep Dive

A new study from Huanchen Cai and Sten Ternström introduces voice mapping as a systematic, metric-driven framework for assessing text-to-speech (TTS) synthesis quality. The researchers evaluated six influential TTS models (Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS) using three objective metrics: crest factor (the ratio of a signal's peak level to its average level), spectrum balance (how energy is distributed between high and low frequencies), and smoothed cepstral peak prominence (CPPs, a measure of voice source periodicity). The goal is to move beyond subjective listening tests and provide reproducible benchmarks for vocal effort, expressiveness, and naturalness.
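To make the first of these metrics concrete, here is a minimal sketch of a crest factor calculation. It assumes the common definition of peak amplitude over RMS level, expressed in dB. The study's exact definition, windowing, and analysis settings are not given in this summary, and the function name and test signals are illustrative only.

```python
import math


def crest_factor_db(samples):
    """Crest factor in dB: 20 * log10(peak / RMS) over one frame of samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    peak = max(abs(x) for x in samples)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("crest factor is undefined for a silent frame")
    return 20.0 * math.log10(peak / rms)


# Illustrative signals: one second of a 100 Hz sine at 16 kHz, and a square
# wave derived from it. These are synthetic and not taken from the study.
sample_rate = 16000
sine = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(sample_rate)]
square = [1.0 if s >= 0 else -1.0 for s in sine]

sine_cf = crest_factor_db(sine)      # about 3.01 dB for a pure sine
square_cf = crest_factor_db(square)  # 0 dB: peak equals RMS for a square wave
```

A signal with sharp peaks relative to its overall level has a higher crest factor than a flatter one, which is why the sine scores about 3 dB above the square wave.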

The results reveal clear differences between the models. VITS exhibited the largest voice range, suggesting a superior ability to model dynamic vocal effort across different speaking styles. Glow-TTS, though more limited in range, achieved the highest spectrum balance in soft phonation contexts, indicating better handling of breathy or gentle speech. Critically, the study reports practical CPPs reference values: values between 7 and 8 dB correlate with natural, human-like quality, while values above 10 dB sound distinctly robotic. This gives developers a quantitative target for fine-tuning.
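The reported reference values could be applied as a simple check on a measured CPPs value. The sketch below is an assumption-laden illustration: the function name and labels are hypothetical, and values outside the two reported bands are deliberately left uncharacterized because the summary above says nothing about them.

```python
def describe_cpps(cpps_db):
    """Map a CPPs value in dB onto the two bands reported in the summary."""
    if 7.0 <= cpps_db <= 8.0:
        return "natural range"      # 7 to 8 dB: correlates with natural quality
    if cpps_db > 10.0:
        return "robotic range"      # above 10 dB: reported as robotic timbre
    return "not characterized"      # no claim is made for other values


labels = [describe_cpps(v) for v in (7.5, 9.0, 10.5)]
```

Computing CPPs itself requires a cepstral analysis of the audio, which is outside the scope of this sketch.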

These findings underscore how unevenly TTS systems handle voice dynamics and expressiveness. The proposed voice mapping approach could standardize quality evaluation, helping researchers identify trade-offs (e.g., voice range vs. naturalness in soft speech). For industry, it offers a clear path to auditing synthetic voices and guiding model selection, whether for virtual assistants, audiobooks, or accessibility tools.

Key Points
  • Six TTS models (Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, VITS) were evaluated using crest factor, spectrum balance, and CPPs.
  • VITS achieved the largest voice range; Glow-TTS, despite a narrower range, achieved the highest spectrum balance in soft phonation.
  • CPPs values of 7 to 8 dB correlate with natural-sounding speech; values above 10 dB sound robotic.

Why It Matters

Voice mapping provides objective, reproducible metrics for benchmarking and improving the naturalness and expressiveness of commercial TTS systems.