Audio & Speech

DNSMOS-C boosts speech quality assessment with contrastive learning

New model achieves better accuracy and generalization without added compute.

Deep Dive

DNSMOS-C, developed by Xinyu Liang and colleagues (accepted at Interspeech 2026), upgrades the DNSMOS Pro speech quality assessment framework by integrating a Mean Opinion Score (MOS)-guided triplet contrastive loss. Unlike prior methods that depend on large pre-trained self-supervised learning (SSL) encoders and multi-stage training, DNSMOS-C jointly learns speech representations and MOS regression within a single, unified pipeline. This design keeps the model compact and efficient while improving the organization of its latent space according to perceptual quality.
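To make the idea of a MOS-guided triplet loss concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the triplet selection rule, the Euclidean distance, and the names `min_gap`, `margin`, and `weight` are all illustrative assumptions. It only shows how a triplet term can be added to a MOS regression loss so that clips with similar MOS sit closer together in embedding space.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mos_guided_triplets(mos, min_gap=0.5):
    """Pick (anchor, positive, negative) indices using MOS labels only.

    Hypothetical rule: the positive must be closer to the anchor in MOS
    than the negative is, by at least `min_gap`.
    """
    n = len(mos)
    triplets = []
    for a in range(n):
        for p in range(n):
            for q in range(n):
                if len({a, p, q}) < 3:
                    continue
                if abs(mos[a] - mos[q]) - abs(mos[a] - mos[p]) >= min_gap:
                    triplets.append((a, p, q))
    return triplets

def triplet_loss(embeddings, mos, margin=0.2, min_gap=0.5):
    """Mean hinge loss: the positive should be nearer than the negative by `margin`."""
    triplets = mos_guided_triplets(mos, min_gap)
    if not triplets:
        return 0.0
    total = 0.0
    for a, p, q in triplets:
        d_ap = euclidean(embeddings[a], embeddings[p])
        d_an = euclidean(embeddings[a], embeddings[q])
        total += max(0.0, d_ap - d_an + margin)
    return total / len(triplets)

def joint_loss(pred, mos, embeddings, weight=0.5):
    """MOS regression (MSE) plus the weighted triplet term, optimized together."""
    mse = sum((p - m) ** 2 for p, m in zip(pred, mos)) / len(mos)
    return mse + weight * triplet_loss(embeddings, mos)

# Toy data: three clips with 1-D embeddings.
mos = [1.0, 2.0, 4.5]
ordered = [[0.0], [1.0], [3.5]]    # embedding order follows MOS
scrambled = [[0.0], [3.5], [1.0]]  # embedding order contradicts MOS

ordered_loss = triplet_loss(ordered, mos)
scrambled_loss = triplet_loss(scrambled, mos)
```

In this toy case the quality-ordered embeddings incur no triplet penalty while the scrambled ones do, which is the kind of pressure that would push a latent space toward a perceptual quality ordering.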

Experimental results across multiple datasets show that DNSMOS-C consistently outperforms DNSMOS Pro on correlation metrics and generalizes better to challenging out-of-domain test sets. The contrastive supervision is applied directly to intermediate embeddings, which encourages the latent space to organize itself along a low-dimensional quality ordering. This ordering improves interpretability and training stability without adding computational overhead. The approach is particularly relevant for real-time speech applications, where both accuracy and low latency are critical.
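The summary does not name the correlation metrics used. Speech quality predictors are commonly scored with the Pearson linear correlation and the Spearman rank correlation between predicted and human MOS, so the sketch below assumes those two. The function names and the toy scores are hypothetical and are not results from the paper.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up scores for four clips, for illustration only.
human_mos = [1.0, 2.0, 3.0, 4.0]
predicted_mos = [1.2, 1.9, 3.5, 3.6]

plcc = pearson(human_mos, predicted_mos)
srcc = spearman(human_mos, predicted_mos)
```

Here the predictions preserve the human ranking exactly, so the rank correlation is perfect even though the linear correlation is not. That difference is why the two metrics are often reported side by side.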

Key Points
  • DNSMOS-C uses a MOS-guided triplet contrastive loss to organize its latent space by perceptual quality.
  • Achieves better correlation and generalization than DNSMOS Pro without extra computational cost.
  • Eliminates reliance on large SSL encoders by jointly learning representations and regression in one framework.

Why It Matters

Enables more accurate, lightweight speech quality assessment for real-time communications and audio AI products.
