Audio & Speech

Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus

Researchers combine SSL models with a novel fusion technique to reduce word error rate (WER) on naturalistic speech.

Deep Dive

Researchers Szu-Jui Chen and John H.L. Hansen from the University of Texas at Dallas have published a new study advancing automatic speech recognition (ASR) for naturalistic environments. Their work, accepted for publication in Speech Communication (2026), focuses on the Fearless Steps (FS) APOLLO corpus—a massive dataset of Apollo mission audio—and introduces a novel deep cross-attention (DCA) fusion method. This technique combines features from multiple self-supervised learning (SSL) models, such as wav2vec 2.0 and HuBERT, to improve recognition accuracy in challenging, real-world scenarios.
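To make the fusion idea concrete, here is a minimal sketch of cross-attention fusion between two SSL feature streams (e.g., one from wav2vec 2.0, one from HuBERT). This is an illustrative reconstruction, not the authors' implementation: the projection matrices, residual connection, and single-head design are assumptions for clarity, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(feats_a, feats_b, seed=0):
    """Fuse two feature streams of shape (frames, dim):
    queries come from stream A, keys/values from stream B.
    Projections are random placeholders for learned weights."""
    rng = np.random.default_rng(seed)
    d = feats_a.shape[-1]
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    Q, K, V = feats_a @ Wq, feats_b @ Wk, feats_b @ Wv
    # each frame of A attends over all frames of B
    attn = softmax(Q @ K.T / np.sqrt(d))
    # residual connection: attended B features enrich A's stream
    return feats_a + attn @ V
```

In a real system the fused frames would feed the downstream ASR encoder; a "deep" variant could stack several such blocks so each SSL representation repeatedly conditions on the other.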

The team tested their approach on the FS Challenge Phase-4 corpus and the CHiME-6 dataset, finding that previous feature refinement and fusion methods were less effective on the Apollo data. Their DCA method achieved a 1.1% absolute reduction in word error rate (WER), a significant gain for noisy, multi-speaker audio. This advancement enables more accurate metadata creation for the Apollo community resource, aiding researchers in history, linguistics, and engineering. The study underscores the potential of SSL feature fusion for specialized ASR tasks, offering a blueprint for handling other historical or domain-specific audio archives.

Key Points
  • Deep cross-attention (DCA) fusion method combines SSL models (e.g., wav2vec 2.0, HuBERT) for ASR.
  • Achieves a 1.1% absolute WER reduction on the Fearless Steps Apollo Phase-4 corpus.
  • Also evaluated on the CHiME-6 dataset, outperforming prior feature refinement and fusion methods.

Why It Matters

Enables better ASR for noisy historical audio, unlocking new research in Apollo mission analysis.