Temporal Pooling Strategies for Training-Free Anomalous Sound Detection with Self-Supervised Audio Embeddings
A novel 'relative deviation pooling' technique achieves state-of-the-art results on key benchmarks without any model training.
A team of researchers has published a paper on arXiv that challenges the standard approach to training-free Anomalous Sound Detection (ASD). The work, by Kevin Wilkinghoff, Sarthak Yadav, and Zheng-Hua Tan, systematically evaluates how temporal pooling strategies affect the performance of systems built on pre-trained, self-supervised audio embeddings. These systems matter for industrial monitoring because they can detect unusual sounds, such as those from failing machines, using only recordings of normal operation, which removes the need for costly collections of rare 'anomalous' samples. The researchers found that the field has relied too heavily on simple temporal mean pooling, leaving performance gains on the table.
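To make "training-free" concrete, here is a minimal sketch of one common way such a system can score a clip: compare its pooled embedding against a bank of embeddings from normal recordings. The nearest-neighbor cosine-distance scoring, the function name `anomaly_score`, and the parameter `k` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def anomaly_score(test_embedding, normal_embeddings, k=1):
    """Hypothetical scorer: mean cosine distance to the k nearest normal clips.

    test_embedding:    shape (D,), pooled embedding of the clip under test
    normal_embeddings: shape (N, D), pooled embeddings of normal recordings
    """
    # Normalize to unit length so the dot product is the cosine similarity.
    ref = normal_embeddings / np.linalg.norm(normal_embeddings, axis=1, keepdims=True)
    query = test_embedding / np.linalg.norm(test_embedding)
    distances = 1.0 - ref @ query
    # A clip far from every normal reference gets a high score.
    return float(np.sort(distances)[:k].mean())

# Toy data: normal clips cluster around one direction in embedding space.
normal_bank = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]])
score_similar = anomaly_score(np.array([1.0, 0.02]), normal_bank)
score_different = anomaly_score(np.array([0.0, 1.0]), normal_bank)
```

No model weights are updated anywhere in this sketch: the normal recordings are only embedded and stored, which is why no anomalous samples are needed.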
The core contribution is Relative Deviation Pooling (RDP), an adaptive method that emphasizes informative temporal deviations within an audio clip, along with a hybrid strategy that combines RDP with generalized mean pooling. In experiments across five standard datasets, their methods consistently outperformed mean pooling. Most notably, on the challenging DCASE2025 ASD benchmark, their training-free approach surpassed all previously reported trained systems, including complex ensembles. The result indicates that pooling is a key lever for getting more out of existing audio foundation models, which could support more reliable and efficient predictive maintenance and audio surveillance tools.
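The pooling step turns a sequence of per-frame embeddings into a single vector per clip. The sketch below contrasts mean pooling with two alternatives. The sign-preserving form of generalized mean pooling is one possible choice for signed embeddings, and `deviation_weighted_pool` is only a hypothetical illustration of "emphasizing temporal deviations"; it is not the paper's RDP formula, which is not given here. All function and parameter names are assumptions.

```python
import numpy as np

def mean_pool(frames):
    """Baseline: average the frame embeddings over time. frames has shape (T, D)."""
    return frames.mean(axis=0)

def generalized_mean_pool(frames, p=3.0):
    """Sign-preserving generalized mean (an assumed variant for signed values).

    p = 1 recovers mean pooling; larger p gives large-magnitude frames more influence.
    """
    powered = np.sign(frames) * np.abs(frames) ** p
    m = powered.mean(axis=0)
    return np.sign(m) * np.abs(m) ** (1.0 / p)

def deviation_weighted_pool(frames, sharpness=1.0, eps=1e-8):
    """Hypothetical stand-in for deviation-aware pooling (NOT the paper's RDP).

    Frames that deviate more from the clip mean, relative to the mean's norm,
    receive larger softmax weights in the weighted average.
    """
    mu = frames.mean(axis=0)
    rel_dev = np.linalg.norm(frames - mu, axis=1) / (np.linalg.norm(mu) + eps)
    logits = sharpness * rel_dev
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames

# Toy clip: ten identical frames except for one short deviating event.
clip = np.ones((10, 4))
clip[3] = 5.0
pooled_mean = mean_pool(clip)
pooled_gem = generalized_mean_pool(clip, p=3.0)
pooled_dev = deviation_weighted_pool(clip)
```

On this toy clip, mean pooling dilutes the single deviating frame across all ten frames, while the other two functions let it contribute more to the clip-level vector. How the paper combines RDP with generalized mean pooling in its hybrid strategy is not described here, so the sketch does not attempt it.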
- Introduced Relative Deviation Pooling (RDP), a novel adaptive method that focuses on informative temporal deviations in audio embeddings.
- Achieved state-of-the-art results on the DCASE2025 ASD benchmark, outperforming all prior trained and ensemble systems with a training-free method.
- Systematically evaluated pooling strategies across multiple audio embedding models, showing that hybrid RDP and generalized mean pooling consistently beats standard mean pooling.
Why It Matters
Enables more accurate, data-efficient machine health monitoring and audio surveillance without the cost and complexity of model training.