UniSRM: Unified speech reward model brings reasoning to audio quality assessment
ACL 2026 paper scores speech with interpretable reasoning, not just numbers
Evaluating speech generation still relies heavily on human Mean Opinion Scores (MOS), which are expensive, subjective, and hard to reproduce. Recent AudioLLM-based judges exist, but they target only narrow scenarios such as utterance-level quality or single-turn dialogue. This paper, from researchers at multiple universities, introduces UniSRM, a unified speech reward model that produces multi-dimensional, interpretable reward signals backed by reliable reasoning. The authors also created UniSRM-Data and UniSRM-Bench, which cover tasks ranging from utterance-level quality to context-level coherence.
UniSRM uses a two-stage pipeline that enables reasoning-based, fine-grained assessment. A key innovation is Reasoning-Consistent Rewards, which improve the reliability of the reasoning process. According to the authors' experiments, UniSRM delivers more reliable and human-aligned judgments across a broad range of speech evaluation tasks. The paper was accepted at ACL 2026 (Main). It offers a practical foundation for scalable, unified evaluation of speech quality and could reduce dependence on expensive human judges in industry and research.
- UniSRM uses a two-stage pipeline for reasoning-based fine-grained speech assessment, moving beyond simple scalar scores
- The authors introduce UniSRM-Data and UniSRM-Bench, covering diverse tasks from utterance-level quality to context-level coherence
- The Reasoning-Consistent Rewards technique improves the reliability of the model's interpretable judgments
Why It Matters
UniSRM automates speech quality evaluation with interpretable reasoning, reducing reliance on costly human MOS ratings.