Research & Papers

Neural networks for Text-to-Speech evaluation

New neural models achieve 73.7% accuracy on side-by-side TTS comparisons and a 0.40 RMSE on MOS prediction, beating the 0.62 RMSE consistency baseline set by human raters.

Deep Dive

A research team led by Ilya Trofimenko and David Kocharyan has introduced a suite of neural models designed to automate the expensive and slow process of evaluating Text-to-Speech (TTS) system quality. The work tackles both relative (Side-by-Side, or SBS) and absolute (Mean Opinion Score, or MOS) assessment settings. For SBS, they propose NeuralSBS, a model built on the HuBERT audio representation, which achieves 73.7% accuracy on the SOMOS benchmark dataset. For MOS prediction, they enhance existing MOSNet techniques and introduce WhisperBert, a novel stacking ensemble that combines audio features from OpenAI's Whisper model with textual embeddings from BERT.
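As a rough illustration of the ingredients involved, the sketch below pulls per-utterance audio features from Whisper's encoder and text features from BERT using the Hugging Face transformers library. The checkpoint names, mean-pooling, and helper functions (audio_embedding, text_embedding) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the paper's architecture): extracting the two feature
# streams a Whisper + BERT MOS predictor could build on, via Hugging Face
# transformers. Checkpoint names and mean-pooling are illustrative choices.
import torch
from transformers import (BertModel, BertTokenizer,
                          WhisperFeatureExtractor, WhisperModel)

whisper_fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base").eval()
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def audio_embedding(waveform):
    """Mean-pool Whisper encoder states over time; `waveform` is a 16 kHz
    mono float array containing the synthesized speech sample."""
    feats = whisper_fe(waveform, sampling_rate=16000,
                       return_tensors="pt").input_features
    enc = whisper.encoder(feats).last_hidden_state    # (1, frames, dim)
    return enc.mean(dim=1).squeeze(0)                 # (dim,)

@torch.no_grad()
def text_embedding(transcript):
    """Mean-pool BERT token states for the sample's transcript."""
    toks = bert_tok(transcript, return_tensors="pt", truncation=True)
    hidden = bert(**toks).last_hidden_state           # (1, tokens, dim)
    return hidden.mean(dim=1).squeeze(0)              # (dim,)
```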

The most significant result is that the best MOS-prediction models achieve a Root Mean Square Error (RMSE) of approximately 0.40. This is a substantial improvement over the established human inter-rater RMSE baseline of 0.62, meaning the AI can predict a speech sample's quality score more consistently than human evaluators agree with each other. The study also provides crucial engineering insights, showing that naive fusion of text and audio via cross-attention can hurt performance, while their ensemble-based 'stacking' approach is more effective. They also report negative results with other advanced architectures such as SpeechLM and with zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 Flash), underscoring the need for dedicated, trained metric-learning frameworks rather than general-purpose models.
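To make the terminology concrete, here is a toy sketch of what 'stacking' means in this setting (base predictors' outputs feeding a small meta-regressor) and of how the RMSE figures above are computed. The synthetic data, the Ridge meta-learner, and the variable names are assumptions for illustration, not the paper's implementation.

```python
# Toy illustration of 'stacking' and of the RMSE metric behind the
# 0.40-vs-0.62 comparison. The synthetic labels, the Ridge meta-learner,
# and the in-sample fit are placeholders, not the authors' setup (which
# would train and evaluate on held-out data).
import numpy as np
from sklearn.linear_model import Ridge

def rmse(pred, target):
    """Root Mean Square Error between predicted and reference MOS."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rng = np.random.default_rng(0)
n = 500
human_mos = rng.uniform(1.0, 5.0, n)            # placeholder MOS labels
audio_pred = human_mos + rng.normal(0, 0.5, n)  # stand-in audio-only model
text_pred = human_mos + rng.normal(0, 0.6, n)   # stand-in text-aware model

# Stacking: the base models' predictions become the features of a small
# meta-regressor, instead of fusing raw audio/text features inside one network.
X_meta = np.column_stack([audio_pred, text_pred])
meta = Ridge(alpha=1.0).fit(X_meta, human_mos)
stacked_pred = meta.predict(X_meta)

print("audio-only RMSE:", rmse(audio_pred, human_mos))
print("stacked RMSE:   ", rmse(stacked_pred, human_mos))
```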

Key Points
  • NeuralSBS model achieves 73.7% accuracy for side-by-side TTS comparisons, automating a costly human task.
  • WhisperBert ensemble model predicts MOS scores with a 0.40 RMSE, outperforming the human consistency baseline of 0.62 RMSE.
  • The research shows that naive cross-attention fusion of text and audio can hurt performance, and that dedicated trained models beat zero-shot LLM evaluators like Gemini 2.5 Flash on this task.

Why It Matters

Enables rapid, scalable, and objective quality testing for voice AI products, drastically reducing development time and cost.