Audio & Speech

Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts

Training on messy broadcast data, not just clean speech, improves AI models for multiple audio tasks.

Deep Dive

A team from IRCAM and Sorbonne Université has published a paper, accepted at LREC 2026, investigating a critical but often overlooked factor in audio AI: the impact of pretraining data on Self-Supervised Learning (SSL) models. While most state-of-the-art audio encoders like Wav2Vec 2.0 or HuBERT are trained on clean, segmented speech datasets like LibriSpeech, the researchers argue this limits their real-world utility. To test this, they constructed a massive, diverse pretraining corpus from French audiovisual broadcasts, automatically annotating it for content like speech, music, and noise. They then created smaller, controlled subsets to isolate the effect of data composition.
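The controlled-subset idea can be sketched as a simple sampling procedure over automatically annotated segments. This is an illustrative reconstruction, not the paper's actual pipeline; the function name, label set, and tuple layout are all assumptions for the sketch.

```python
import random
from collections import defaultdict

def build_subset(segments, targets, budget_s, seed=0):
    """Sample a pretraining subset matching target content proportions.

    segments: list of (segment_id, label, duration_s), with labels coming
              from automatic annotation (e.g. 'speech', 'music', 'noise').
    targets:  dict mapping label -> fraction of the total duration budget.
    budget_s: total subset duration in seconds.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for seg in segments:
        by_label[seg[1]].append(seg)
    subset = []
    for label, frac in targets.items():
        pool = by_label[label][:]
        rng.shuffle(pool)
        quota = frac * budget_s
        used = 0.0
        for seg in pool:
            if used >= quota:
                break
            subset.append(seg)
            used += seg[2]  # accumulate sampled duration for this label
    return subset

# Toy example: a diverse, broadcast-like mix vs. a speech-only subset.
segs = [(f"s{i}", lab, 10.0)
        for i, lab in enumerate(["speech", "music", "noise"] * 100)]
diverse = build_subset(segs, {"speech": 0.5, "music": 0.3, "noise": 0.2},
                       budget_s=1000.0)
speech_only = build_subset(segs, {"speech": 1.0}, budget_s=1000.0)
```

Holding the duration budget fixed while varying only the label mix is what lets an experiment like this isolate data composition from data quantity.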

Their experiments trained SSL models on these subsets and evaluated them on multiple downstream tasks, including Automatic Speech Recognition (ASR), voice activity detection, music detection, and speaker recognition. The key finding is that models pretrained on the diverse, 'messy' broadcast data consistently outperformed those trained only on clean speech, with performance gains of up to 15% on non-speech tasks like music detection. This demonstrates that exposing models to the full spectrum of real-world audio makes them more robust and versatile. The study also performed a membership inference attack, revealing that models can memorize training data, which underscores the necessity of data deduplication, both for privacy and for preventing overfitting. This research provides a blueprint for building more general-purpose audio foundation models and could help bridge the separate communities working on speech and music AI.
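A common, minimal form of the membership inference attack mentioned above is a loss-threshold test: if the model's loss on a sample is suspiciously low, the sample was likely seen during training. The sketch below uses a stand-in scoring function and an illustrative threshold; the paper's actual attack protocol is not reproduced here.

```python
import statistics

def membership_scores(loss_fn, candidates):
    """Score candidates with the model's loss: lower loss suggests the
    model has memorized a sample during pretraining."""
    return {cid: loss_fn(x) for cid, x in candidates.items()}

def infer_members(scores, threshold):
    """Flag samples whose loss falls below a calibrated threshold."""
    return {cid for cid, s in scores.items() if s < threshold}

# Toy illustration: 'seen' items get low loss, 'unseen' items high loss.
seen = {"a": 0.1, "b": 0.2}
unseen = {"c": 1.1, "d": 0.9}
loss = lambda x: x  # stand-in: the candidate value *is* its loss here
scores = membership_scores(loss, {**seen, **unseen})
threshold = statistics.mean(scores.values())  # illustrative calibration
members = infer_members(scores, threshold)    # -> {'a', 'b'}
```

If such an attack succeeds, duplicated training segments are the most exposed, which is one reason deduplication doubles as a privacy measure.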

Key Points
  • Models trained on diverse French broadcast audio (speech, music, noise) outperformed clean-speech-only models on tasks like music detection by up to 15%.
  • The study used a large, automatically annotated corpus of TV/radio content to create controlled training subsets, proving data composition directly impacts model capability.
  • A membership inference attack highlighted significant memorization, stressing the critical need for data deduplication in SSL training pipelines for privacy and performance.
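The deduplication step stressed in the last point can be sketched, in its simplest exact-match form, as hashing raw audio bytes; this is a simplification, since production pipelines typically use acoustic fingerprints to also catch near-duplicates such as re-encodes or level changes.

```python
import hashlib

def dedupe(clips):
    """Drop exact duplicates by hashing raw audio bytes.

    clips: list of (name, audio_bytes) pairs; keeps the first
    occurrence of each distinct byte sequence.
    """
    seen, kept = set(), []
    for name, data in clips:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept

clips = [("a.wav", b"\x00\x01"), ("b.wav", b"\x00\x01"), ("c.wav", b"\x02")]
print(dedupe(clips))  # -> ['a.wav', 'c.wav']
```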

Why It Matters

This research provides a roadmap for building more robust, general-purpose audio AI that works in the noisy real world, not just in lab conditions.