Audio & Speech

UniSRM: Unified speech reward model brings reasoning to audio quality assessment

ACL 2026 paper scores speech with interpretable reasoning, not just numbers

Deep Dive

Evaluating speech generation still relies heavily on human Mean Opinion Scores (MOS), which are expensive, subjective, and hard to reproduce. Recent AudioLLM-based judges exist, but they target only narrow scenarios such as utterance-level quality or single-turn dialogue. This paper, from researchers at multiple universities, introduces UniSRM, a unified speech reward model that produces multi-dimensional, interpretable reward signals backed by reliable reasoning. The authors also built UniSRM-Data and UniSRM-Bench, which cover tasks ranging from utterance-level quality to context-level coherence.

UniSRM uses a two-stage pipeline that enables reasoning-based, fine-grained assessment. A key innovation is Reasoning-Consistent Rewards, which improve the reliability of the reasoning process. The authors report that UniSRM delivers more reliable and human-aligned judgments across a broad range of speech evaluation tasks. The paper was accepted at ACL 2026 (Main). It provides a practical foundation for scalable, unified evaluation of speech quality and could reduce dependence on expensive human judges in industry and research.
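
"Human-aligned" is commonly quantified as the rank correlation between an automatic judge's scores and human MOS. The sketch below illustrates that general idea only; it is not the paper's evaluation protocol, and the utterance IDs, listener ratings, and model scores are all made-up placeholders.

```python
from statistics import mean


def mos(ratings):
    """Mean Opinion Score: the average of listener ratings (often on a 1-5 scale)."""
    return mean(ratings)


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: three listener ratings per utterance, one model score each.
human_ratings = {
    "utt_a": [4, 5, 4],
    "utt_b": [2, 3, 2],
    "utt_c": [3, 3, 4],
    "utt_d": [1, 2, 1],
}
model_scores = {"utt_a": 4.2, "utt_b": 2.9, "utt_c": 3.1, "utt_d": 1.5}

ids = sorted(human_ratings)
human_mos = [mos(human_ratings[i]) for i in ids]
model = [model_scores[i] for i in ids]
rho = spearman(human_mos, model)
print(f"Spearman correlation between model scores and human MOS: {rho:.3f}")
```

A rank correlation is used here rather than a raw score difference because a judge can be useful for ranking systems even when its scale is offset from human ratings.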

Key Points
  • UniSRM uses a two-stage pipeline for reasoning-based fine-grained speech assessment, moving beyond simple scalar scores
  • Introduces UniSRM-Data and UniSRM-Bench, covering diverse tasks from utterance-level quality to context-level coherence
  • The Reasoning-Consistent Rewards technique improves the reliability of the model's interpretable judgments

Why It Matters

UniSRM could automate speech quality evaluation with interpretable reasoning, potentially reducing reliance on costly human MOS judges