Research & Papers

Neural networks for Text-to-Speech evaluation

New neural models achieve 73.7% accuracy on side-by-side TTS comparisons and a 0.40 RMSE on MOS prediction, beating the 0.62 RMSE consistency baseline set by human raters.

Deep Dive

A research team led by Ilya Trofimenko and David Kocharyan has introduced a suite of neural models designed to automate the expensive and slow process of evaluating Text-to-Speech (TTS) system quality. The work tackles both relative (Side-by-Side, or SBS) and absolute (Mean Opinion Score, or MOS) assessment settings. For SBS, they propose NeuralSBS, a model built on the HuBERT audio representation, which achieves 73.7% accuracy on the SOMOS benchmark dataset. For MOS prediction, they enhance existing MOSNet techniques and introduce WhisperBert, a novel stacking ensemble that combines audio features from OpenAI's Whisper model with textual embeddings from BERT.
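As a rough illustration of the ingredients involved, the sketch below pulls per-utterance audio features from Whisper's encoder and text features from BERT using the Hugging Face transformers library. The checkpoint names, mean-pooling, and helper functions (audio_embedding, text_embedding) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the paper's architecture): extracting the two feature
# streams a Whisper + BERT MOS predictor could build on, via Hugging Face
# transformers. Checkpoint names and mean-pooling are illustrative choices.
import torch
from transformers import (BertModel, BertTokenizer,
                          WhisperFeatureExtractor, WhisperModel)

whisper_fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base").eval()
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def audio_embedding(waveform):
    """Mean-pool Whisper encoder states over time; `waveform` is a 16 kHz
    mono float array containing the synthesized speech sample."""
    feats = whisper_fe(waveform, sampling_rate=16000,
                       return_tensors="pt").input_features
    enc = whisper.encoder(feats).last_hidden_state    # (1, frames, dim)
    return enc.mean(dim=1).squeeze(0)                 # (dim,)

@torch.no_grad()
def text_embedding(transcript):
    """Mean-pool BERT token states for the sample's transcript."""
    toks = bert_tok(transcript, return_tensors="pt", truncation=True)
    hidden = bert(**toks).last_hidden_state           # (1, tokens, dim)
    return hidden.mean(dim=1).squeeze(0)              # (dim,)
```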

The most significant result is that the best MOS-prediction models achieve a Root Mean Square Error (RMSE) of approximately 0.40. This is a substantial improvement over the established human inter-rater RMSE baseline of 0.62, meaning the AI can predict a speech sample's quality score more consistently than human evaluators agree with each other. The study also provides crucial engineering insights, showing that naive fusion of text and audio via cross-attention can hurt performance, while their ensemble-based 'stacking' approach is more effective. They also report negative results with other advanced architectures such as SpeechLM and with zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 Flash), underscoring the need for dedicated, trained metric-learning frameworks rather than general-purpose models.
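To make the terminology concrete, here is a toy sketch of what 'stacking' means in this setting (base predictors' outputs feeding a small meta-regressor) and of how the RMSE figures above are computed. The synthetic data, the Ridge meta-learner, and the variable names are assumptions for illustration, not the paper's implementation.

```python
# Toy illustration of 'stacking' and of the RMSE metric behind the
# 0.40-vs-0.62 comparison. The synthetic labels, the Ridge meta-learner,
# and the in-sample fit are placeholders, not the authors' setup (which
# would train and evaluate on held-out data).
import numpy as np
from sklearn.linear_model import Ridge

def rmse(pred, target):
    """Root Mean Square Error between predicted and reference MOS."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rng = np.random.default_rng(0)
n = 500
human_mos = rng.uniform(1.0, 5.0, n)            # placeholder MOS labels
audio_pred = human_mos + rng.normal(0, 0.5, n)  # stand-in audio-only model
text_pred = human_mos + rng.normal(0, 0.6, n)   # stand-in text-aware model

# Stacking: the base models' predictions become the features of a small
# meta-regressor, instead of fusing raw audio/text features inside one network.
X_meta = np.column_stack([audio_pred, text_pred])
meta = Ridge(alpha=1.0).fit(X_meta, human_mos)
stacked_pred = meta.predict(X_meta)

print("audio-only RMSE:", rmse(audio_pred, human_mos))
print("stacked RMSE:   ", rmse(stacked_pred, human_mos))
```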

Key Points
  • NeuralSBS model achieves 73.7% accuracy for side-by-side TTS comparisons, automating a costly human task.
  • WhisperBert ensemble model predicts MOS scores with a 0.40 RMSE, outperforming the human consistency baseline of 0.62 RMSE.
  • The research shows that naive cross-attention fusion of text and audio can hurt performance, and that dedicated trained models beat zero-shot LLM evaluators like Gemini 2.5 Flash on this task.

Why It Matters

Enables rapid, scalable, and objective quality testing for voice AI products, drastically reducing development time and cost.