Canary-WavLM fusion reaches 24.96 RMSE on hearing aid speech intelligibility prediction
New frame-aligned fusion method predicts speech intelligibility without a clean reference audio signal.
A new paper from Kazushi Nakazawa (arXiv, May 2026) tackles a tough problem: predicting how well hearing-impaired listeners understand speech processed by hearing aids when no clean reference signal is available. The approach combines two pretrained speech encoders, Canary and WavLM, using a frame-aligned fusion strategy. Instead of simply averaging scores or using cross-attention, the model downsamples the WavLM features in time with a learnable strided convolution, fuses them onto the coarser Canary timeline, and then pools the fused sequence into a single intelligibility score.
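The align-fuse-pool order described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the 4:1 frame ratio, the kernel size, fusion by concatenation, mean pooling, and the linear scoring head are all assumptions made for the example, and every function name is hypothetical. In a trained model the convolution weights would be learned; here they are fixed to a simple average so the result is easy to check by hand.

```python
# Illustrative sketch only: shapes, stride, concatenation, mean pooling and the
# linear head are assumptions, not details taken from the paper.


def strided_conv1d(frames, kernels, bias, stride):
    """Downsample a [T, D_in] sequence with a strided 1-D convolution.

    kernels[o][k][i] is the weight for output channel o, kernel tap k,
    input channel i. Returns a [T_out, D_out] list of frames.
    """
    kernel_size = len(kernels[0])
    out = []
    for start in range(0, len(frames) - kernel_size + 1, stride):
        window = frames[start:start + kernel_size]
        out.append([
            bias[o] + sum(
                w * x
                for tap, frame in zip(kernels[o], window)
                for w, x in zip(tap, frame)
            )
            for o in range(len(kernels))
        ])
    return out


def fuse_frame_aligned(coarse, fine_down):
    """Concatenate the two streams frame by frame on the coarse timeline."""
    n = min(len(coarse), len(fine_down))  # trim any length mismatch
    return [coarse[t] + fine_down[t] for t in range(n)]


def mean_pool(frames):
    """Average over time to get one utterance-level vector."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def predict_score(coarse, fine, kernels, bias, head_w, head_b, stride):
    """Align fine features to the coarse timeline, fuse, pool, then score."""
    fine_down = strided_conv1d(fine, kernels, bias, stride)
    fused = fuse_frame_aligned(coarse, fine_down)
    pooled = mean_pool(fused)
    return head_b + sum(w * x for w, x in zip(head_w, pooled))


# Toy input: 3 coarse frames (2 dims) and 12 fine frames (1 dim), ratio 4:1.
coarse = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fine = [[float(t)] for t in range(12)]
kernels = [[[0.25], [0.25], [0.25], [0.25]]]  # one output channel, 4 taps
bias = [0.0]

fine_down = strided_conv1d(fine, kernels, bias, stride=4)
fused = fuse_frame_aligned(coarse, fine_down)
score = predict_score(coarse, fine, kernels, bias,
                      head_w=[1.0, 1.0, 1.0], head_b=0.0, stride=4)
```

The point of the ordering is that each fused frame pairs features covering roughly the same stretch of audio. A pool-late variant would instead average each stream over time first and combine only the two utterance-level vectors, discarding that local correspondence.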
The best configuration achieved an Eval RMSE of 24.96 ±0.06 and an Eval correlation of 0.796 ±0.001, outperforming single-backbone baselines, uniform score averaging, pool-late fusion, and even reverse alignment. The paper also presents severity, enhancement-system, layer-window, and temporal-shift analyses, emphasizing that preserving coarse local temporal correspondence before pooling provides a useful inductive bias for non-intrusive intelligibility prediction. This work is part of the 3rd Clarity Prediction Challenge and offers a practical path to evaluating hearing aid performance without cumbersome reference audio.
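For readers unfamiliar with the two reported metrics, the sketch below shows how an RMSE and a correlation coefficient are typically computed from predicted and listener-measured intelligibility scores. It assumes the correlation is Pearson's and that scores are on a 0 to 100 scale; the article states neither, and the numbers in the example are made up purely for illustration.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between predicted and measured scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Made-up scores (assumed 0-100 scale), not data from the paper.
predicted = [10.0, 50.0, 90.0]
measured = [20.0, 40.0, 100.0]

error = rmse(predicted, measured)
corr = pearson(predicted, measured)
```

Lower RMSE means predictions sit closer to the measured scores in absolute terms, while a higher correlation means the model ranks utterances more consistently with listeners, so the two numbers capture different aspects of prediction quality.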
- Combines two frozen encoders (Canary and WavLM) with a learnable strided convolution for temporal alignment before fusion.
- Best configuration reaches Eval RMSE 24.96 ±0.06 and correlation 0.796 ±0.001 on the 3rd Clarity Prediction Challenge, ahead of all baselines the paper compares against.
- Frame-aligned fusion, which preserves coarse temporal correspondence before pooling, outperforms single-backbone models, uniform score averaging, pool-late fusion, and reverse alignment.
Why It Matters
The method lets hearing aid developers estimate speech intelligibility directly from processed audio, which could speed up tuning when no reference recordings are available.