Whisper-based model predicts speech intelligibility for hearing-impaired with 24.39 RMSE

arXiv eess.AS May 25, 2026

⚡ New word-level acoustic fusion boosts prediction correlation by 0.011 over baseline

Deep Dive

People with hearing loss often struggle to understand speech in noisy environments. A new paper by Kazushi Nakazawa tackles the problem of predicting how much of a sentence a listener with hearing loss can understand, a critical metric for hearing aid tuning and rehabilitation. The approach, submitted to arXiv on May 22, 2026, reformulates the problem as word-level correctness modeling. A frozen Whisper encoder processes the degraded speech, and a teacher-forced decoder conditioned on the canonical transcript (the correct sentence) predicts whether each reference word was understood. Sentence-level intelligibility is then derived by averaging the predicted correctness probabilities over valid reference words. To enrich the transcript-conditioned decoder states, the model adds two acoustic fusion branches: a word-aligned local branch based on character-level cross-attention alignment and an utterance-level global branch for calibration.
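
The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name, the boolean validity mask, and the scaling to a 0-100 percentage are all assumptions made for the example.

```python
from typing import Sequence


def sentence_intelligibility(word_probs: Sequence[float],
                             valid: Sequence[bool]) -> float:
    """Average per-word correctness probabilities over valid reference words.

    word_probs: predicted probability that each reference word was understood.
    valid:      hypothetical mask marking which reference words count toward
                the score (the summary does not say how validity is decided).
    Returns a score scaled to 0-100 (the percentage scale is an assumption).
    """
    if len(word_probs) != len(valid):
        raise ValueError("word_probs and valid must have the same length")
    kept = [p for p, v in zip(word_probs, valid) if v]
    if not kept:
        raise ValueError("no valid reference words to average over")
    return 100.0 * sum(kept) / len(kept)


# Four reference words, the last one excluded from scoring.
score = sentence_intelligibility([1.0, 0.5, 0.0, 0.75],
                                 [True, True, True, False])
print(score)  # 50.0
```

Because the sentence score is a mean of probabilities rather than a count of thresholded decisions, a model that is uncertain about several words still yields a graded sentence-level estimate.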

Results on the CPC3 evaluation set show a small improvement over the baseline. The decoder-only baseline achieves an RMSE of 24.92 and a correlation of 0.795. The full joint fusion model reaches an RMSE of 24.39 and a correlation of 0.806, a reduction of 0.53 in RMSE and a gain of 0.011 in correlation, along with an incorrect-word F1 of 0.778 and a Matthews correlation coefficient (MCC) of 0.626. A similar trend with Whisper medium suggests that the improvement stems from prediction granularity at the word level and alignment-aware fusion. The work indicates that explicitly modeling word correctness with transcript priors and multi-scale acoustic features can yield more precise and clinically useful intelligibility estimates for listeners with hearing loss.
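
For readers unfamiliar with the four reported metrics, the sketch below shows one common way to compute them. It rests on assumptions: the summary says only "correlation", so Pearson correlation is used here, and the word-level metrics treat "incorrect" as the positive class with binary labels already thresholded. The function names are illustrative.

```python
import math
from typing import Sequence, Tuple


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root mean squared error between predicted and true sentence scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def pearson(pred: Sequence[float], true: Sequence[float]) -> float:
    """Pearson correlation (assumed; the summary does not name the type)."""
    n = len(true)
    mp, mt = sum(pred) / n, sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


def incorrect_word_f1_and_mcc(pred_correct: Sequence[bool],
                              true_correct: Sequence[bool]) -> Tuple[float, float]:
    """F1 and MCC with 'incorrect word' treated as the positive class."""
    pairs = list(zip(pred_correct, true_correct))
    tp = sum(1 for p, t in pairs if not p and not t)  # predicted wrong, was wrong
    fp = sum(1 for p, t in pairs if not p and t)      # predicted wrong, was right
    fn = sum(1 for p, t in pairs if p and not t)      # predicted right, was wrong
    tn = sum(1 for p, t in pairs if p and t)          # predicted right, was right
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return f1, mcc


# Toy data, not taken from the paper.
print(rmse([10.0, 20.0], [13.0, 24.0]))
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(incorrect_word_f1_and_mcc([True, False, False, True],
                                [True, False, True, True]))
```

RMSE and correlation are computed on sentence-level scores, while F1 and MCC are computed on word-level decisions, which is why the article reports both kinds of number for the same model.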

Key Points

Uses frozen Whisper encoder and teacher-forced decoder conditioned on canonical transcript to predict word-level correctness
Adds word-aligned local acoustic branch and utterance-level global acoustic branch for better calibration
Achieves RMSE 24.39 and correlation 0.806 on CPC3 evaluation set, outperforming decoder-only baseline (RMSE 24.92, corr 0.795)

Why It Matters

Better intelligibility prediction could support improved hearing aid tuning and more personalized auditory rehabilitation for the millions of people with hearing loss.
