New CSU framework tackles unreliable labels in audio tagging systems
Class-wise supervision unreliability modeling improves AudioSet robustness by 12%
Weakly labeled datasets like AudioSet have driven progress in audio tagging, but annotation quality varies significantly across sound classes. Labels can be incomplete, ambiguous, or unreliable, especially as synthetic audio increasingly mixes with real recordings and generated samples often fail to match their semantic labels. Prior research focused mainly on missing positive labels, leaving three other sources of unreliable supervision underexplored: spurious additions (incorrect extra labels), misassignments between similar classes (e.g., 'dog bark' vs. 'wolf howl'), and weakened label evidence (marginal or ambiguous presence of the labeled sound). These introduce a class-dependent optimization bias that standard training ignores.
To address this, the paper introduces the Class-wise Supervision Unreliability (CSU) framework. CSU learns a separate unreliability parameter for each sound class during training and uses it to dynamically down-weight supervision from less reliable classes. This requires no changes to model architecture or inference, only a small training overhead. The authors also release ESC-FreeGen50, a manually verified benchmark of 50 sound classes that includes both real and generated audio for controlled evaluation. Experiments on controlled benchmarks and AudioSet show that CSU improves robustness across architectures (CNNs, Transformers) and across all three sources of unreliability, with up to 5% absolute mAP gain on the most unreliable classes. The method is simple, practical, and easy to integrate into existing audio tagging pipelines.
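The summary does not give CSU's exact loss, so the following is only a minimal sketch of the general idea of learning a per-class weight that down-weights supervision. It swaps in a well-known stand-in, uncertainty-style weighting of the form exp(-s_c) * BCE_c + s_c, rather than the authors' formulation. The names `s`, `class_weighted_loss`, and `update_unreliability` are hypothetical, and the model's predictions are held fixed for illustration.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def class_weighted_loss(probs, labels, s):
    """Return (total loss, per-class mean BCE).

    probs, labels: one list per clip, each with one entry per class.
    s: hypothetical per-class log-unreliability; larger s[c] means a
       smaller weight exp(-s[c]) on class c's supervision.
    The "+ s[c]" term stops the trivial solution of ignoring every class.
    """
    n_clips, n_classes = len(probs), len(s)
    per_class = [
        sum(bce(probs[i][c], labels[i][c]) for i in range(n_clips)) / n_clips
        for c in range(n_classes)
    ]
    total = sum(math.exp(-s[c]) * per_class[c] + s[c] for c in range(n_classes))
    return total, per_class

def update_unreliability(s, per_class, lr=0.1):
    """One gradient-descent step on s; d(total)/d(s_c) = 1 - exp(-s_c) * BCE_c."""
    return [s_c - lr * (1.0 - math.exp(-s_c) * l_c)
            for s_c, l_c in zip(s, per_class)]

# Toy data (made up): class 0 labels agree with predictions, class 1 labels do not.
probs = [[0.9, 0.1], [0.8, 0.05]]
labels = [[1, 1], [1, 1]]
s = [0.0, 0.0]
total, per_class = class_weighted_loss(probs, labels, s)
s_new = update_unreliability(s, per_class)
weights = [math.exp(-v) for v in s_new]  # class 1 ends up with the smaller weight
```

In this stand-in, a class whose labels persistently disagree with the model accumulates a larger `s[c]` and contributes less to the training signal, while inference is untouched because `s` is used only in the loss. A high loss can also come from a genuinely hard class, so this should be read as an illustration of the mechanism, not as a reproduction of the paper's method.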
- CSU learns per-class unreliability parameters to down-weight problematic labels without changing model architecture or inference.
- Addresses three underexplored sources of unreliable supervision: spurious additions, misassignments between similar classes, and weakened label evidence.
- Introduces the ESC-FreeGen50 benchmark: a manually verified set of 50 sound classes combining real and generated audio for controlled evaluation.
Why It Matters
CSU offers a practical way to make audio tagging more reliable as synthetic data proliferates, with potential benefits for applications such as voice assistants and surveillance systems.