Audio & Speech

Canary-WavLM fusion cuts hearing aid speech prediction error to 24.96 RMSE

New frame-aligned fusion method predicts speech intelligibility without a clean reference audio signal.

Deep Dive

A new paper from Kazushi Nakazawa (arXiv, May 2026) tackles a hard problem: predicting how well hearing-impaired listeners understand speech processed by hearing aids without access to a clean reference signal, a setting known as non-intrusive prediction. The approach combines two pretrained speech encoders, Canary and WavLM, through a frame-aligned fusion strategy. Rather than averaging per-model scores or using cross-attention, the model downsamples the WavLM features in time with a learnable strided convolution, fuses them onto the coarser Canary timeline, and only then pools the result into a single intelligibility score.
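
The data flow described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the feature dimensions, the 4:1 frame-rate ratio, fusion by concatenation, the tanh mixing layer, and all names (`strided_conv1d`, `frame_aligned_score`, `params`) are assumptions made for the example, and random arrays stand in for real encoder outputs and trained weights.

```python
import numpy as np


def strided_conv1d(x, weight, bias, stride):
    """Strided 1-D convolution over time, no padding.

    x:      (T_in, D_in) frame sequence
    weight: (K, D_in, D_out) kernel (learnable in a real model)
    bias:   (D_out,)
    Returns (T_out, D_out) with T_out = (T_in - K) // stride + 1.
    """
    k = weight.shape[0]
    t_out = (x.shape[0] - k) // stride + 1
    out = np.empty((t_out, weight.shape[2]))
    for t in range(t_out):
        window = x[t * stride : t * stride + k]  # (K, D_in)
        # Contract over kernel position and input channel.
        out[t] = np.tensordot(window, weight, axes=([0, 1], [0, 1])) + bias
    return out


def frame_aligned_score(fine, coarse, params, stride=4):
    """Align the fine stream to the coarse timeline, fuse per frame, then pool."""
    aligned = strided_conv1d(fine, params["conv_w"], params["conv_b"], stride)
    t = min(aligned.shape[0], coarse.shape[0])  # trim to a common length
    # Assumption: fusion is a per-frame concatenation of the two streams.
    fused = np.concatenate([aligned[:t], coarse[:t]], axis=1)
    hidden = np.tanh(fused @ params["mix_w"])  # (t, H)
    pooled = hidden.mean(axis=0)  # pooling happens after fusion
    return float(pooled @ params["head_w"] + params["head_b"])


# Random stand-ins for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
fine = rng.standard_normal((200, 16))   # fine-grained stream, e.g. WavLM-like
coarse = rng.standard_normal((50, 12))  # coarse stream, e.g. Canary-like

params = {
    "conv_w": rng.standard_normal((4, 16, 8)) * 0.1,  # K=4, D_in=16, D_out=8
    "conv_b": np.zeros(8),
    "mix_w": rng.standard_normal((8 + 12, 10)) * 0.1,
    "head_w": rng.standard_normal(10) * 0.1,
    "head_b": 0.0,
}

score = frame_aligned_score(fine, coarse, params, stride=4)
```

The point of the sketch is the ordering: the two streams are matched frame by frame before any pooling, so local temporal correspondence survives into the fused representation. A pool-late variant would instead average each stream over time first and combine the two resulting vectors.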

The best configuration achieved an Eval RMSE of 24.96 ±0.06 and an Eval correlation of 0.796 ±0.001, outperforming single-backbone baselines, uniform score averaging, pool-late fusion, and even reverse alignment. The paper also presents severity, enhancement-system, layer-window, and temporal-shift analyses, emphasizing that preserving coarse local temporal correspondence before pooling provides a useful inductive bias for non-intrusive intelligibility prediction. This work is part of the 3rd Clarity Prediction Challenge and offers a practical path to evaluating hearing aid performance without cumbersome reference audio.
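
For readers unfamiliar with the two reported metrics, the sketch below computes RMSE and a correlation coefficient on toy values. It assumes the correlation is Pearson's and that scores are on a 0 to 100 scale; the summary above specifies neither, and the numbers are made up for illustration rather than taken from the paper.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between predicted and true scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Toy example: predictions compressed toward the middle of the scale.
target = [0.0, 50.0, 100.0]
predicted = [10.0, 50.0, 90.0]

error = rmse(predicted, target)
corr = pearson(predicted, target)
```

In this toy case the predictions are a linear function of the targets, so the correlation is perfect even though the RMSE is not zero. That is why the two metrics are usually reported together: RMSE measures absolute error, while correlation measures whether the ranking and trend are right.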

Key Points
  • Combines two frozen encoders (Canary and WavLM) with a learnable strided convolution for temporal alignment before fusion.
  • Best configuration reaches Eval RMSE 24.96 ±0.06 and Eval correlation 0.796 ±0.001 on the 3rd Clarity Prediction Challenge.
  • Frame-aligned fusion, which preserves coarse temporal correspondence before pooling, outperforms single-backbone baselines, uniform score averaging, pool-late fusion, and reverse alignment.

Why It Matters

Could let hearing aid developers estimate speech intelligibility directly from processed audio, speeding up tuning when no clean reference recording is available.