Audio & Speech

FSD50K-Solo: New method uses diffusion models to clean audio datasets

A diffusion model synthesizes controlled training mixtures, and a classifier trained on them filters multi-source clips out of FSD50K, leaving high-quality single-source sound events.

Deep Dive

The FSD50K dataset, while large and open, contains many multi-source samples: clips with background interference or overlapping events that degrade training data quality. To address this, researchers developed an automated curation framework that uses a generative diffusion model to synthesize controlled noisy mixtures of single-class events. These mixtures serve as supervision for a pre-trained audio encoder paired with a discriminative classifier, which learns to identify multi-source samples and filter them out of the raw FSD50K corpus. The result is FSD50K-Solo, a model-curated subset of single-source audio samples, validated against a human-expert test set. The authors present the method as a scalable way to clean open audio datasets while reducing the need for costly manual annotation. The work was published at EUSIPCO 2026, and the cleaner subset is intended to give audio AI researchers better training data for neural networks.
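To make the supervision idea concrete, here is a minimal sketch of how labeled single-source and multi-source examples could be assembled. It is not the authors' pipeline: the summary says the mixtures come from a generative diffusion model, whereas this sketch swaps in plain additive mixing of two existing clips at a chosen signal-to-noise ratio. The function names (`mix_at_snr`, `build_training_pairs`) and the SNR values are hypothetical.

```python
import math
import random

def rms(signal):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))

def mix_at_snr(target, interferer, snr_db):
    """Add `interferer` to `target`, scaled so that the target-to-interferer
    RMS ratio equals `snr_db` decibels. Assumes neither clip is silent."""
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    if rms(target) == 0.0 or rms(interferer) == 0.0:
        raise ValueError("clips must not be silent")
    # Gain that places the interferer snr_db below (or above) the target.
    gain = rms(target) / (rms(interferer) * 10 ** (snr_db / 20))
    return [t + gain * i for t, i in zip(target, interferer)]

def build_training_pairs(clips, snr_choices_db, rng):
    """Return (waveform, label) pairs: 0 = single-source, 1 = multi-source.
    Each clip appears once unmixed and once mixed with a different clip."""
    pairs = [(clip, 0) for clip in clips]
    for clip in clips:
        other = rng.choice([c for c in clips if c is not clip])
        snr_db = rng.choice(snr_choices_db)
        pairs.append((mix_at_snr(clip, other, snr_db), 1))
    return pairs

# Toy stand-ins for two single-class sound events (hypothetical data).
tone_a = [math.sin(2 * math.pi * 5 * k / 100) for k in range(100)]
tone_b = [0.3 * math.sin(2 * math.pi * 13 * k / 100) for k in range(100)]
pairs = build_training_pairs([tone_a, tone_b], [0.0, 10.0], random.Random(0))
```

Controlling the SNR is what makes the mixtures "controlled": the same two events can yield anything from a barely audible background to a fully overlapping event, so a classifier sees the whole range of interference levels.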

Key Points
  • Framework uses a generative diffusion model to synthesize controlled noisy mixtures of single-class events as supervision.
  • Combines pre-trained audio encoder with discriminative classifier to automatically filter multi-source samples.
  • Produces FSD50K-Solo, a model-curated single-source subset; the method is presented as applicable to other large-scale audio corpora.
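
The filtering step in the points above can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the summary does not name the encoder or the classifier, so short fixed-length lists stand in for frozen encoder embeddings, a small logistic regression stands in for the discriminative classifier, and the 0.5 threshold is an arbitrary choice. All function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(embeddings, labels, steps=500, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log loss.
    Labels: 0 = single-source, 1 = multi-source."""
    dim = len(embeddings[0])
    n = len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def filter_single_source(corpus, w, b, threshold=0.5):
    """Keep ids of clips whose predicted multi-source probability
    is below `threshold`."""
    kept = []
    for clip_id, x in corpus.items():
        p_multi = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        if p_multi < threshold:
            kept.append(clip_id)
    return kept

# Hypothetical 2-D embeddings; a real encoder would emit far more dimensions.
train_x = [[0.0, 0.0], [0.1, -0.1], [-0.2, 0.1],
           [2.0, 2.0], [1.9, 2.1], [2.2, 1.8]]
train_y = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(train_x, train_y)
raw_corpus = {"clip_a": [0.0, 0.1], "clip_b": [2.1, 2.0]}
solo_ids = filter_single_source(raw_corpus, weights, bias)
```

Keeping the encoder frozen and training only a light classifier on top is one common way to reuse a pre-trained model; lowering the threshold would trade a smaller curated subset for fewer multi-source clips slipping through.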

Why It Matters

Cleaner, single-source training data gives sound event detection models less ambiguous examples to learn from, which can make them more robust in real-world applications.