DNSMOS-C boosts speech quality assessment with contrastive learning
New model achieves better accuracy and generalization without added compute.
DNSMOS-C, developed by Xinyu Liang and colleagues (accepted at Interspeech 2026), upgrades the DNSMOS Pro speech quality assessment framework by integrating a Mean Opinion Score (MOS)-guided triplet contrastive loss. Unlike prior methods that depend on large pre-trained self-supervised learning (SSL) encoders and multi-stage training, DNSMOS-C jointly learns speech representations and MOS regression within a single, unified pipeline. This design keeps the model compact and efficient while improving the organization of its latent space according to perceptual quality.
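To make the idea of a MOS-guided triplet loss concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not say how triplets are selected, how the margin is set, or how the two losses are weighted. The sketch assumes that, for each anchor, the sample with the closer MOS label acts as the positive and the sample with the more distant label acts as the negative. The names `mos_guided_triplet_loss`, `joint_loss`, `margin`, and `weight` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mos_guided_triplet_loss(embeddings, mos, margin=0.2):
    """Mean hinge loss over all triplets (anchor, positive, negative).

    Assumption: a triplet is valid when the positive's MOS label is
    strictly closer to the anchor's label than the negative's is.
    """
    total, count = 0.0, 0
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            for q in range(n):
                if len({a, p, q}) < 3:
                    continue  # the three indices must be distinct
                if abs(mos[a] - mos[p]) < abs(mos[a] - mos[q]):
                    d_pos = euclidean(embeddings[a], embeddings[p])
                    d_neg = euclidean(embeddings[a], embeddings[q])
                    total += max(0.0, d_pos - d_neg + margin)
                    count += 1
    return total / count if count else 0.0


def joint_loss(predicted_mos, embeddings, mos, weight=0.5, margin=0.2):
    """MSE regression loss plus a weighted triplet term (weight is assumed)."""
    mse = sum((p - t) ** 2 for p, t in zip(predicted_mos, mos)) / len(mos)
    return mse + weight * mos_guided_triplet_loss(embeddings, mos, margin)


# Toy data: three clips with MOS labels 1, 3 and 5 and 1-D embeddings.
labels = [1.0, 3.0, 5.0]
ordered = [[0.0], [1.0], [2.0]]    # embedding order matches MOS order
shuffled = [[2.0], [0.0], [1.0]]   # embedding order contradicts MOS order

loss_ordered = mos_guided_triplet_loss(ordered, labels)
loss_shuffled = mos_guided_triplet_loss(shuffled, labels)
print(loss_ordered, loss_shuffled)
```

In this toy case the embeddings that already follow the MOS ordering incur no triplet penalty, while the shuffled ones are penalized. That is the sense in which such a loss pushes the latent space toward a quality ordering. Because both terms are summed into one objective, the representation and the regressor can be trained in a single stage.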
Experimental results across multiple datasets show that DNSMOS-C consistently outperforms DNSMOS Pro in correlation metrics and generalizes better on challenging out-of-domain test sets. The contrastive supervision, applied directly to intermediate embeddings, encourages a low-dimensional quality ordering to emerge in the latent space. This ordering improves interpretability and training stability without adding computational overhead. The approach is particularly relevant to real-time speech applications, where both accuracy and low latency are critical.
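The summary does not name the correlation metrics used. As an assumption, the sketch below shows the two that are most common in MOS prediction work: Pearson correlation, which measures linear agreement, and Spearman rank correlation, which measures monotonic agreement. The helper names and the toy scores are hypothetical and are not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy example: predictions that preserve the ordering of the human scores
# but are not linearly related to them.
human_mos = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]

pcc = pearson(human_mos, predicted)
srcc = spearman(human_mos, predicted)
print(pcc, srcc)
```

Here the rank correlation is perfect because the predictions preserve the ordering, while the linear correlation falls short of 1 because the relationship is curved. Reporting both is a common way to separate ranking quality from calibration.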
- Uses a MOS-guided triplet contrastive loss to organize the latent space by perceptual quality.
- Achieves better correlation and generalization than DNSMOS Pro without extra computational cost.
- Eliminates reliance on large SSL encoders by jointly learning representations and regression in one framework.
Why It Matters
DNSMOS-C enables more accurate, lightweight speech quality assessment for real-time communications and audio AI products.