Audio & Speech

Integrated Gradients localizes sound events without temporal labels, nearing supervised CNNs on IoU and F1

Can post-hoc attribution methods localize sounds as well as trained detectors?

Deep Dive

The authors test Integrated Gradients (IG) for temporal sound event detection on a 10-class domestic audio dataset. Without any temporal training labels, IG achieves a mean intersection-over-union (IoU) of 0.39, a frame-level F1 of 0.52, and a Pointing Game (PG) accuracy of 82.6%. For comparison, a weakly-supervised CNN (clip-level labels) achieves 0.42 IoU, 0.55 F1, and 97.3% PG, while a strongly-supervised CNN (frame-level labels) achieves 0.45 IoU, 0.58 F1, and 97.9% PG. The results suggest that post-hoc IG captures meaningful temporal activity patterns: its IoU and F1 come within 0.06 of models that explicitly produce frame-level predictions, although its PG accuracy trails both baselines by about 15 points.
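For readers unfamiliar with the three metrics, the following is a minimal sketch of how they could be computed for a single clip from a per-frame attribution curve. The min-max normalization, the 0.5 threshold, and all function names are assumptions made for illustration; the summary does not say how the authors binarize attributions or aggregate scores across clips.

```python
import numpy as np

def binarize(attribution, threshold=0.5):
    """Min-max normalize a per-frame curve, then threshold it (threshold is an assumption)."""
    a = np.asarray(attribution, dtype=float)
    span = a.max() - a.min()
    norm = (a - a.min()) / span if span > 0 else np.zeros_like(a)
    return norm >= threshold

def temporal_iou(pred, truth):
    """Intersection-over-union between predicted and reference active frames."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 1.0

def frame_f1(pred, truth):
    """Frame-level F1: each frame counts as one binary decision."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def pointing_game_hit(attribution, truth):
    """Pointing Game: does the single highest-attribution frame fall inside the event?"""
    truth = np.asarray(truth, dtype=bool)
    return bool(truth[int(np.argmax(attribution))])

# Toy clip with six frames; the event occupies frames 2 to 5.
attribution = [0.0, 0.1, 0.9, 1.0, 0.6, 0.1]
truth = [0, 0, 1, 1, 1, 1]
pred = binarize(attribution)

print(temporal_iou(pred, truth))             # 3 overlapping frames / 4 in the union = 0.75
print(frame_f1(pred, truth))                 # 2*3 / (2*3 + 0 + 1) = 6/7
print(pointing_game_hit(attribution, truth)) # True: the peak frame (index 3) is inside the event
```

The toy clip also shows why Pointing Game accuracy can sit far from IoU and F1: it only asks whether the single peak frame lands inside the event, whereas the other two metrics penalize every missed or spurious frame.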

Key Points
  • Integrated Gradients achieves 0.39 IoU, 0.52 F1, and 82.6% Pointing Game accuracy on a 10-class domestic sound dataset.
  • On IoU and F1, IG comes within 0.03 to 0.06 of the weakly-supervised (0.42 IoU, 0.55 F1) and strongly-supervised (0.45 IoU, 0.58 F1) CNN baselines; on Pointing Game accuracy it trails both (82.6% vs. 97.3% and 97.9%).
  • All methods significantly outperform random and energy-based baselines, validating IG as a post-hoc temporal localization tool.
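
As a rough illustration of how IG can yield a temporal activity curve, the sketch below applies the standard IG definition (input minus baseline, times the path-averaged gradient) to a toy logistic classifier over a spectrogram, then sums absolute attributions over frequency to get one score per frame. The toy model, the all-zeros baseline, the step count, and the frequency aggregation are all assumptions; the summary does not describe the authors' network or settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def toy_model(spec, weights, bias):
    """Hypothetical clip-level classifier: logistic score over a (freq, time) spectrogram."""
    return sigmoid(np.sum(weights * spec) + bias)

def toy_model_grad(spec, weights, bias):
    """Analytic gradient of the toy model's score with respect to the input."""
    s = toy_model(spec, weights, bias)
    return s * (1.0 - s) * weights

def integrated_gradients(spec, baseline, weights, bias, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight baseline-to-input path."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(spec)
    for alpha in alphas:
        point = baseline + alpha * (spec - baseline)
        total += toy_model_grad(point, weights, bias)
    return (spec - baseline) * total / steps

# Synthetic clip: 8 frequency bins, 20 frames, with extra energy in frames 5 to 9.
rng = np.random.default_rng(0)
spec = rng.random((8, 20))
spec[:, 5:10] += 2.0
weights = np.full((8, 20), 0.05)
bias = -2.0
baseline = np.zeros_like(spec)  # assumed "silence" baseline

attr = integrated_gradients(spec, baseline, weights, bias)
frame_curve = np.abs(attr).sum(axis=0)  # one attribution score per frame

# Completeness check: attributions should sum to the change in model output.
gap = attr.sum() - (toy_model(spec, weights, bias) - toy_model(baseline, weights, bias))
print(abs(gap) < 1e-3)
print(int(np.argmax(frame_curve)))  # index of the peak frame
```

With a real network, the analytic gradient would be replaced by automatic differentiation, and the per-frame curve would then be thresholded and scored against reference annotations.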

Why It Matters

Suggests that post-hoc attribution can localize audio events without expensive temporal labels, which could reduce annotation costs.