When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
Even with clinical context, nine models showed no improvement—until fine-tuning.
A new paper from researchers including Pehuén Moure and Shih-Chii Liu evaluates whether current audio-language models can leverage multimodal clinical context to improve automatic speech recognition (ASR) for dysarthric speech. Using the Speech Accessibility Project (SAP) dataset, the study tested nine state-of-the-art models on prompts enriched with diagnosis labels, clinician-derived speech ratings, and progressively detailed clinical descriptions. The results are stark: across all models, adding such context yielded negligible improvements and often increased word error rate (WER), revealing a fundamental limitation in how these models integrate non-acoustic information at inference time.
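To make the evaluation setup concrete, here is a minimal sketch of how tiered clinical context might be prepended to a transcription prompt. The tiers mirror the paper's description (diagnosis label, clinician-derived rating, free-text clinical description), but the templates and the `build_prompt` helper are illustrative assumptions, not the released benchmark's exact prompts.

```python
# Hypothetical sketch of tiered clinical-context prompting. The field names
# and templates are illustrative, not the authors' exact prompt formats.

BASE_INSTRUCTION = "Transcribe the following speech audio verbatim."

def build_prompt(diagnosis=None, severity=None, description=None):
    """Compose an ASR prompt with progressively richer clinical context."""
    context = []
    if diagnosis:       # e.g. "Parkinson's disease"
        context.append(f"Speaker diagnosis: {diagnosis}.")
    if severity:        # clinician-derived rating, e.g. "mild"
        context.append(f"Dysarthria severity: {severity}.")
    if description:     # free-text clinical description
        context.append(f"Clinical notes: {description}")
    return " ".join(context + [BASE_INSTRUCTION])

# Tier 0 (no context) vs. a richer tier (diagnosis + severity)
print(build_prompt())
print(build_prompt(diagnosis="Down syndrome", severity="mild"))
```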
To overcome this, the researchers conducted context-dependent fine-tuning using LoRA (Low-Rank Adaptation). By training on a mixture of clinical prompt formats, they achieved a WER of 0.066, a 52% relative reduction over the frozen baseline, while maintaining performance when context is absent. Subgroup analyses highlighted significant gains for speakers with Down syndrome and those with mild dysarthria, suggesting the approach is particularly effective for certain populations. The fine-tuned models also preserved generalizability, avoiding catastrophic forgetting.
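For intuition, here is a minimal sketch of what context-dependent LoRA fine-tuning with prompt-format mixing can look like, using the Hugging Face peft library. The checkpoint, rank, scaling factor, and target modules are placeholder assumptions rather than the paper's settings; the key idea is that each training example samples a clinical prompt format, including an empty-context format, so the adapter helps when metadata is present without hurting when it is absent.

```python
# Minimal LoRA fine-tuning sketch (Hugging Face peft). The checkpoint and
# hyperparameters are illustrative assumptions, not the paper's settings.
import random
from transformers import Qwen2AudioForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct"
)

config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
)
model = get_peft_model(model, config)     # freezes the base model;
model.print_trainable_parameters()        # only adapter weights train

def training_prompt(example):
    """Sample one of several clinical prompt formats per utterance,
    including the empty-context format, so the fine-tuned model keeps
    its performance when no metadata is available at inference."""
    variants = [
        "Transcribe the speech audio verbatim.",
        f"Speaker diagnosis: {example['diagnosis']}. "
        "Transcribe the speech audio verbatim.",
        f"Diagnosis: {example['diagnosis']}; severity: {example['severity']}. "
        "Transcribe the speech audio verbatim.",
    ]
    return random.choice(variants)
```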
This work provides a crucial benchmark for measuring progress toward inclusive ASR systems. It clarifies that current audio-language models cannot natively utilize clinical metadata, but that lightweight fine-tuning can unlock substantial gains. The findings have direct implications for accessibility technologies, healthcare documentation, and voice-controlled interfaces used by individuals with speech impairments. The authors release their benchmark and fine-tuned models to the research community.
- All nine audio-language models tested failed to leverage clinical context prompts; added context often worsened WER.
- LoRA fine-tuning with mixed clinical prompts achieved a 52% relative WER reduction, from ~0.138 to 0.066 (quick arithmetic check below).
- Significant gains observed for Down syndrome speakers and mild-severity dysarthria; no loss when context was unavailable.
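A quick arithmetic check on the headline numbers: going from a frozen-baseline WER of roughly 0.138 to 0.066 is a relative reduction of (0.138 − 0.066) / 0.138 ≈ 0.52, consistent with the reported 52%.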
Why It Matters
Shows that current audio-language models can't use clinical context for dysarthric speech at inference time, but that lightweight fine-tuning unlocks major accessibility gains.