Audio & Speech

New CSU framework tackles unreliable labels in audio tagging systems

arXiv eess.AS May 19, 2026

⚡Class-wise supervision unreliability modeling improves AudioSet robustness by 12%

Deep Dive

Weakly labeled datasets like AudioSet have driven progress in audio tagging, but annotation quality varies significantly across sound classes. Labels can be incomplete, ambiguous, or unreliable, and the problem grows as synthetic audio increasingly mixes with real recordings, because generated samples often fail to match their semantic labels. Prior research focused mainly on missing-positive labels, but three other sources of unreliable supervision remain underexplored: spurious additions (incorrect extra labels), misassignments between similar classes (e.g., 'dog bark' vs. 'wolf howl'), and weakened label evidence (marginal or ambiguous presence). These errors introduce a class-dependent optimization bias that standard training ignores.

To address this, the paper introduces the Class-wise Supervision Unreliability (CSU) framework. CSU learns a separate unreliability parameter for each sound class during training and uses it to dynamically down-weight supervision from less reliable classes. This requires no changes to model architecture or inference, only a small training overhead. The authors also release ESC-FreeGen50, a manually verified benchmark of 50 sound classes that includes both real and generated audio for controlled evaluation. Experiments on controlled benchmarks and AudioSet show that CSU improves robustness across architectures (CNNs, Transformers) and across all three sources of unreliability, with up to 5% absolute mAP gain on the most unreliable classes. The method is simple, practical, and easy to integrate into existing audio tagging pipelines.
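The summary above does not give the CSU objective, so the following is only a minimal sketch of the general idea of learning one down-weighting parameter per class. It assumes a substitute technique, a log-variance-style uncertainty weighting in which each class loss L_c is scaled by exp(-s_c) and a penalty s_c keeps the weights from collapsing to zero. The names `per_class_bce`, `weighted_loss`, and `update_unreliability`, the parameter `s`, and the toy data are all hypothetical and are not taken from the paper.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and target y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def per_class_bce(probs, labels):
    """Mean BCE per class; probs and labels are lists of rows, one row per clip."""
    n_clips, n_classes = len(probs), len(probs[0])
    return [
        sum(bce(probs[i][c], labels[i][c]) for i in range(n_clips)) / n_clips
        for c in range(n_classes)
    ]

def weighted_loss(class_losses, s):
    """Assumed objective (not the paper's): sum_c exp(-s_c) * L_c + s_c."""
    weights = [math.exp(-s_c) for s_c in s]
    total = sum(w * l + s_c for w, l, s_c in zip(weights, class_losses, s))
    return total, weights

def update_unreliability(class_losses, s, lr=0.1):
    """One gradient step on s, using d/ds_c = 1 - exp(-s_c) * L_c.

    s_c is clamped at 0 so that every class weight stays in (0, 1].
    """
    return [
        max(0.0, s_c - lr * (1.0 - math.exp(-s_c) * l))
        for s_c, l in zip(s, class_losses)
    ]

# Toy batch (hypothetical data): labels for class 0 agree with the predictions,
# labels for class 1 frequently contradict them.
probs = [[0.9, 0.8], [0.1, 0.9], [0.8, 0.2], [0.2, 0.1]]
labels = [[1, 0], [0, 0], [1, 1], [0, 1]]

class_losses = per_class_bce(probs, labels)
s = [0.0, 0.0]
for _ in range(200):  # model predictions are held fixed for illustration
    s = update_unreliability(class_losses, s)
total, weights = weighted_loss(class_losses, s)
print(weights)
```

In this toy setting the class whose labels disagree with the predictions ends up with the smaller weight, while the consistent class keeps full weight. In real training the model parameters and the per-class parameters would be updated jointly, and the per-class parameters would be discarded at inference, which matches the claim that inference is unchanged.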

Key Points

CSU learns per-class unreliability parameters to down-weight problematic labels without changing model architecture or inference.
Addresses three under-explored issues: spurious additions, misassignments between similar classes, and weakened label evidence.
Introduces ESC-FreeGen50 benchmark – 50 manually verified sound classes combining real and generated audio for robust evaluation.

Why It Matters

CSU offers a practical way to keep audio tagging reliable as synthetic data proliferates, which could benefit applications such as voice assistants and surveillance systems.
