Robust Nasality Representation Learning for Cleft Palate-Related Velopharyngeal Dysfunction Screening in Real-World Settings
A new two-stage AI framework achieves perfect in-domain clinical accuracy and outperforms baselines on noisy internet audio.
A multi-institutional research team has published a novel AI framework designed to screen for velopharyngeal dysfunction (VPD), a condition causing hypernasal speech that is often associated with cleft palate. The core innovation addresses a major hurdle in medical AI: models that perform well in controlled clinical settings often fail under the 'domain shift' of real-world audio recorded on different devices, against different backgrounds, and at varying quality. To solve this, the researchers proposed a two-stage approach. First, they pre-trained an encoder with supervised contrastive learning on an auxiliary speech corpus, teaching it to distinguish oral from nasal speech contexts and yielding a robust, 'nasality-focused' representation.
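To make the first stage concrete, here is a minimal sketch of supervised contrastive pre-training in PyTorch. Only the objective itself (a supervised contrastive loss over oral/nasal context labels) comes from the paper's description; the `NasalityEncoder` architecture, mel-spectrogram front-end, and batch construction are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NasalityEncoder(nn.Module):
    """Toy stand-in for the nasality-focused speech encoder."""
    def __init__(self, n_mels: int = 64, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, mel):
        # mel: (batch, n_mels, frames) -> L2-normalized embedding (batch, embed_dim)
        return F.normalize(self.net(mel), dim=-1)

def supcon_loss(z, labels, tau=0.07):
    """Supervised contrastive loss (Khosla et al., 2020): pull together
    embeddings sharing a label (oral vs. nasal context), push apart the rest."""
    sim = z @ z.T / tau                                    # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))        # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)    # log-softmax per anchor
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # Average the log-probability over each anchor's positives, then over anchors.
    per_anchor = torch.where(pos, log_prob, torch.zeros_like(log_prob)).sum(1)
    return -(per_anchor / pos.sum(1).clamp(min=1)).mean()

# Usage: batches of mel-spectrogram excerpts labeled by oral (0) / nasal (1) context.
encoder = NasalityEncoder()
mel = torch.randn(16, 64, 50)          # 16 excerpts, 64 mel bins, 50 frames
labels = torch.randint(0, 2, (16,))    # oral/nasal context labels
loss = supcon_loss(encoder(mel), labels)
loss.backward()
```

The design intuition is that clustering embeddings by oral/nasal context, rather than by speaker or recording condition, is what gives the downstream representation its robustness to domain shift.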
This specialized encoder is then frozen and paired with simple classifiers that analyze short 0.5-second chunks of speech. The probabilities from these chunk-level predictions are aggregated into a final screening decision.

The results are striking. On an in-domain clinical cohort of 82 subjects, the model achieved a perfect macro-F1 score of 1.000 and 100% accuracy. More importantly, its robustness was demonstrated on a separate, noisy dataset of 131 public internet recordings, where large pre-trained speech models degraded significantly. Here, the proposed method achieved a macro-F1 of 0.679 and an accuracy of 0.695, outperforming the strongest baseline (MFCC features, at macro-F1 = 0.612). This marks a significant step toward deployable, real-world speech pathology tools that aren't crippled by background noise or poor microphone quality.
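Putting the two stages together, a minimal sketch of the inference path is below, reusing the `NasalityEncoder` from the sketch above. The 0.5-second chunk length comes from the paper; the non-overlapping chunking, mean-probability aggregation, sampling rate, and placeholder featurizer are assumptions for illustration, and the authors' exact aggregation rule may differ.

```python
import torch
import torch.nn as nn

SAMPLE_RATE = 16_000                        # assumed sampling rate
CHUNK = SAMPLE_RATE // 2                    # 0.5 s of audio per chunk

@torch.no_grad()
def screen_utterance(wave, encoder, classifier, featurize):
    """Return P(VPD) for one utterance by averaging chunk-level probabilities."""
    encoder.eval()
    n = wave.numel() // CHUNK               # number of whole 0.5 s chunks
    chunks = wave[: n * CHUNK].reshape(n, CHUNK)
    feats = torch.stack([featurize(c) for c in chunks])
    probs = classifier(encoder(feats)).softmax(dim=-1)   # (n_chunks, 2)
    return probs[:, 1].mean()               # mean-pooled probability of the VPD class

# Freeze the stage-1 encoder and attach a lightweight linear head.
encoder = NasalityEncoder()                 # from the sketch above
for p in encoder.parameters():
    p.requires_grad_(False)
classifier = nn.Linear(128, 2)

# Placeholder featurizer mapping a raw chunk to a (n_mels, frames) tensor;
# a real pipeline would compute a mel spectrogram here.
featurize = lambda c: c.reshape(64, -1)[:, :50]

wave = torch.randn(SAMPLE_RATE * 3)         # 3 s of audio
p_vpd = screen_utterance(wave, encoder, classifier, featurize)
print(f"screening probability: {p_vpd.item():.3f}")
```

Because the encoder stays frozen, only the small linear head needs training on clinical data, which keeps the second stage cheap and reduces the risk of overfitting to one recording domain.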
- Achieved 100% accuracy (macro-F1 = 1.000) on an in-domain clinical cohort of 82 subjects.
- Outperformed all baselines on a challenging out-of-domain set of 131 real-world internet recordings (macro-F1 = 0.679 vs. 0.612 for the MFCC baseline).
- Uses a novel two-stage framework: supervised contrastive pre-training of a nasality-focused encoder, followed by lightweight classification on 0.5-second audio chunks with aggregated probabilities.
Why It Matters
This brings reliable, AI-powered speech disorder screening closer to real-world use in telehealth and remote patient monitoring, beyond the clinic.