Whisper-based model predicts speech intelligibility for hearing-impaired listeners with an RMSE of 24.39
New word-level acoustic fusion boosts prediction correlation by 0.011 over baseline
Hearing-impaired individuals often struggle to understand speech in noisy environments. A new paper by Kazushi Nakazawa tackles the problem of predicting how much of a sentence a listener with hearing loss can understand, a critical metric for hearing aid tuning and rehabilitation. The approach, submitted to arXiv on May 22, 2026, reformulates the problem as word-level correctness modeling.
The model uses a frozen Whisper encoder to process the degraded speech, followed by a teacher-forced decoder conditioned on the canonical transcript (the correct sentence). Sentence-level intelligibility is then derived by averaging the predicted correctness probabilities over the valid reference words. To enrich the transcript-conditioned decoder states, the model adds two acoustic fusion branches: a word-aligned local branch based on character-level cross-attention alignment, and an utterance-level global branch for calibration.
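The averaging step can be illustrated with a minimal sketch. This is not the paper's code: the function name `sentence_intelligibility`, the boolean validity mask, and the 0 to 100 output scale are assumptions made for illustration, with the scale chosen so the score is comparable to the reported RMSE values.

```python
from typing import Sequence


def sentence_intelligibility(
    word_probs: Sequence[float], valid_mask: Sequence[bool]
) -> float:
    """Average predicted word-correctness probabilities over valid words.

    Hypothetical helper: word_probs[i] is the predicted probability that
    reference word i was understood, and valid_mask[i] marks the words that
    count toward the score. The result is scaled to 0-100 (an assumption).
    """
    if len(word_probs) != len(valid_mask):
        raise ValueError("word_probs and valid_mask must have the same length")
    kept = [p for p, ok in zip(word_probs, valid_mask) if ok]
    if not kept:
        raise ValueError("no valid reference words to average over")
    return 100.0 * sum(kept) / len(kept)


# Four reference words, the last one excluded from scoring.
probs = [0.9, 0.2, 0.7, 0.5]
mask = [True, True, True, False]
score = sentence_intelligibility(probs, mask)
print(f"predicted intelligibility: {score:.1f}")
```

Because the sentence score is a plain mean of per-word probabilities, any improvement in word-level predictions carries through directly to the sentence-level estimate.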
Results on the CPC3 evaluation set show modest gains over the baseline. The decoder-only baseline achieves an RMSE of 24.92 and a correlation of 0.795. The full joint fusion model reaches an incorrect-word F1 of 0.778, a Matthews correlation coefficient (MCC) of 0.626, a correlation of 0.806, and an RMSE of 24.39, which is 0.53 lower than the baseline. According to the paper, a similar trend with Whisper medium indicates that the improvement stems from word-level prediction granularity and alignment-aware fusion. The results suggest that explicitly modeling word correctness with transcript priors and multi-scale acoustic features yields more precise, and potentially more clinically useful, intelligibility estimates for hearing-impaired populations.
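For readers unfamiliar with the reported metrics, the following is a minimal sketch of how they are commonly computed. It is not the paper's evaluation script: all function names are hypothetical, the correlation is assumed to be Pearson's (the article does not specify), and "incorrect word" is treated as the positive class for F1.

```python
import math
from typing import Sequence, Tuple


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root-mean-square error between predicted and true sentence scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (assumed correlation type)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def incorrect_word_f1_and_mcc(
    pred_correct: Sequence[bool], true_correct: Sequence[bool]
) -> Tuple[float, float]:
    """F1 with 'incorrect word' as the positive class, plus MCC."""
    pairs = list(zip(pred_correct, true_correct))
    tp = sum(1 for p, t in pairs if not p and not t)  # predicted and truly incorrect
    fp = sum(1 for p, t in pairs if not p and t)      # predicted incorrect, truly correct
    fn = sum(1 for p, t in pairs if p and not t)      # predicted correct, truly incorrect
    tn = sum(1 for p, t in pairs if p and t)          # predicted and truly correct
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return f1, mcc


# Toy sentence-level scores on a 0-100 scale (illustrative, not CPC3 data).
sentence_rmse = rmse([80.0, 40.0, 60.0], [100.0, 50.0, 50.0])
sentence_corr = pearson([80.0, 40.0, 60.0], [100.0, 50.0, 50.0])

# Toy word-level decisions for a four-word sentence.
f1, mcc = incorrect_word_f1_and_mcc(
    [True, False, False, True], [True, False, True, True]
)
print(f"RMSE={sentence_rmse:.2f} corr={sentence_corr:.3f} F1={f1:.3f} MCC={mcc:.3f}")
```

RMSE and correlation assess the sentence-level scores, while F1 and MCC assess the per-word decisions, which is why the article reports both groups of numbers.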
- Uses a frozen Whisper encoder and a teacher-forced decoder conditioned on the canonical transcript to predict word-level correctness
- Adds a word-aligned local acoustic branch and an utterance-level global acoustic branch, the latter for calibration
- Achieves RMSE 24.39 and correlation 0.806 on CPC3 evaluation set, outperforming decoder-only baseline (RMSE 24.92, corr 0.795)
Why It Matters
Better intelligibility prediction means improved hearing aid tuning and personalized auditory rehabilitation for millions with hearing loss.