Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models
A new framework uses LLMs as meta-evaluators to judge speech quality, outperforming dedicated models like DNSMOS.
A research team from Academia Sinica and other institutions has published a paper, accepted to IEEE ICASSP 2026, introducing GatherMOS. This novel framework repurposes large language models (LLMs) as 'meta-evaluators' for a specialized task: non-intrusive speech quality assessment. Instead of training a model from scratch, GatherMOS prompts an LLM to aggregate and reason over diverse, pre-existing signals. These inputs include lightweight acoustic descriptors and pseudo-labels generated by established speech quality models like DNSMOS and VQScore. The LLM's objective is to synthesize this heterogeneous information to accurately infer a perceptual Mean Opinion Score (MOS), the gold standard for judging how humans perceive audio quality.
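The aggregation step can be sketched roughly as follows. The prompt template, descriptor names, and score values below are illustrative assumptions for exposition, not the paper's actual prompt:

```python
def build_mos_prompt(pseudo_labels, acoustic_descriptors):
    """Assemble heterogeneous quality signals into one LLM prompt.

    pseudo_labels: dict of model name -> score from existing quality models.
    acoustic_descriptors: dict of descriptor name -> value.
    """
    lines = [
        "You are a speech quality meta-evaluator.",
        "Given the signals below, estimate the Mean Opinion Score (1-5).",
        "",
        "Pseudo-labels from existing quality models:",
    ]
    for model, score in pseudo_labels.items():
        lines.append(f"- {model}: {score:.2f}")
    lines.append("")
    lines.append("Lightweight acoustic descriptors:")
    for name, value in acoustic_descriptors.items():
        lines.append(f"- {name}: {value}")
    lines.append("")
    lines.append("Respond with a single MOS value.")
    return "\n".join(lines)

# Hypothetical values for a single noisy test clip.
prompt = build_mos_prompt(
    {"DNSMOS": 3.42, "VQScore": 0.81},
    {"SNR (dB)": 12.5, "RMS energy": 0.031},
)
print(prompt)
```

The resulting string would be sent to the LLM, whose free-text reply is parsed into a numeric MOS prediction.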
The study explores both zero-shot and few-shot in-context learning setups. In the zero-shot setting, GatherMOS delivered stable performance across diverse acoustic conditions, evidence of the LLM's inherent reasoning capability. Performance improved markedly in the few-shot setting when the in-context support samples closely matched the test conditions. In experiments on the VoiceBank-DEMAND benchmark, GatherMOS consistently outperformed not only the individual models it aggregates (DNSMOS, VQScore) and simple score averaging, but also more complex learning-based predictors such as CNN-BLSTM and MOS-SSL, particularly when those baselines were trained with limited labeled data.
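One way to realize the few-shot matching idea is to pick labeled support clips whose acoustic descriptors are closest to the test clip. The selection heuristic, feature choice, and data layout below are our own assumptions, not the paper's published method:

```python
import math

def select_support(test_feats, labeled_pool, k=2):
    """Return the k labeled examples nearest to test_feats (Euclidean)."""
    def dist(example):
        return math.sqrt(
            sum((a - b) ** 2 for a, b in zip(test_feats, example["feats"]))
        )
    return sorted(labeled_pool, key=dist)[:k]

# Hypothetical pool: each entry has acoustic features [SNR dB, RMS] and a
# human-rated MOS label.
pool = [
    {"feats": [12.0, 0.030], "mos": 3.5},   # similar noise level to the test clip
    {"feats": [30.0, 0.100], "mos": 4.6},   # much cleaner speech
    {"feats": [11.5, 0.028], "mos": 3.3},   # similar noise level to the test clip
]
support = select_support([12.5, 0.031], pool, k=2)
print([ex["mos"] for ex in support])  # → [3.5, 3.3]
```

The selected examples would then be prepended to the prompt as in-context demonstrations, which is where the paper reports the largest gains.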
These results highlight a significant shift in methodology. The research demonstrates that an LLM acting as an intelligent aggregator can be a more practical and effective strategy for quality evaluation than building or fine-tuning a dedicated model, especially when high-quality labeled data is scarce. It validates the potential of using general-purpose foundation models as sophisticated reasoning engines for narrow, expert domains by providing them with the right contextual signals.
- GatherMOS uses an LLM as a meta-evaluator to aggregate signals from DNSMOS, VQScore, and acoustic features to predict speech MOS.
- It outperformed specialized models like CNN-BLSTM and MOS-SSL on the VoiceBank-DEMAND dataset, especially with limited training data.
- The framework works in both zero-shot and few-shot modes, with few-shot learning providing large gains when the support examples match the test conditions.
Why It Matters
Enables highly accurate, data-efficient audio quality testing for telecom, voice AI, and media production without training massive specialized models.