Audio & Speech

Audio AI robustness certification flawed by preprocessing choices

Same noise level, different certified guarantees depending on feature representation and preprocessing.

Deep Dive

Randomized smoothing (RS) is a popular technique for certifying neural network robustness against adversarial perturbations, but what its certificates mean for audio classification has been ambiguous. A new paper from researchers at Carnegie Mellon University shows that RS results depend heavily on how audio signals are represented and preprocessed, a factor often overlooked in prior work. The authors demonstrate that standard pipelines, which normalize, range-control, and transform waveforms into log-mel or other spectral features, effectively change the underlying perturbation space. As a result, the certified radius reported by a black-box RS method may not correspond to the perturbation the user expects. At the same smoothing level σ=0.0025, two benchmark datasets (keyword spotting and environmental sound) exhibit an identical median raw radius (0.007996), yet their SNR-equivalent scales differ by nearly 7 dB (83.98 vs. 90.97 dB) because of differences in waveform energy. This discrepancy makes direct comparison of certified accuracy misleading.
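The interaction between a fixed raw radius and varying waveform energy can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: `certified_radius` uses the standard Gaussian-smoothing ℓ₂ radius σ·Φ⁻¹(p), which the summary above does not confirm is the exact variant used; `snr_equivalent_db` assumes the SNR-equivalent scale is 20·log₁₀(‖x‖₂ / radius); and the two sine clips are hypothetical stand-ins, not samples from either benchmark.

```python
import math
from statistics import NormalDist


def certified_radius(sigma, p_lower):
    """Standard Gaussian-smoothing l2 radius: sigma * Phi^{-1}(p_lower).

    Assumption: the article does not state which RS variant the paper uses.
    p_lower is a lower confidence bound on the top-class probability.
    """
    if p_lower <= 0.5:
        return 0.0  # no positive radius can be certified
    return sigma * NormalDist().inv_cdf(p_lower)


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def snr_equivalent_db(waveform, radius):
    """Assumed definition: signal l2 norm relative to the certified radius, in dB."""
    return 20.0 * math.log10(l2_norm(waveform) / radius)


RAW_RADIUS = 0.007996  # median raw radius quoted in the article

# Hypothetical clips: same amplitude and sample rate, one five times longer,
# so the longer clip carries five times the total energy.
SAMPLE_RATE = 16000
short_clip = [0.1 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(16000)]
long_clip = [0.1 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(80000)]

snr_short = snr_equivalent_db(short_clip, RAW_RADIUS)
snr_long = snr_equivalent_db(long_clip, RAW_RADIUS)
snr_gap_db = snr_long - snr_short  # depends only on the energy ratio

print(f"short clip: {snr_short:.2f} dB, long clip: {snr_long:.2f} dB, gap: {snr_gap_db:.2f} dB")
```

The radius function never looks at the waveform, which is why two datasets can share a raw radius. Under the assumed SNR definition, the gap between the two clips depends only on their energy ratio, so the same radius is a relatively smaller perturbation for the higher-energy clip.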

Further diagnostics show that the choice of feature space shifts certification outcomes. When smoothing in the log-mel domain, the model certifies more examples with a positive radius (68.42% on environmental sounds vs. 65.53% for waveform smoothing), but those certificates bound perturbations in feature space rather than in the original waveform, so they do not directly guarantee robustness to waveform-level attacks. Additionally, common operations such as peak normalization or clipping change the effective ℓ₂ norm of the perturbation by a factor of 230× to 351×, substantially altering the threat model. The authors recommend that future audio RS studies explicitly report five things: the certified object (waveform vs. feature), the perturbation location (input or latent), the gain policy (normalization scheme), the raw radius before any geometry change, and any post-noise modifications. These guidelines would support fair comparisons and reproducible robustness certifications across the growing field of audio AI safety.
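To see why a gain policy changes the threat model, consider a minimal Python sketch of peak normalization. Everything here is hypothetical: the quiet clip, its 0.004 peak, and the perturbation are illustrative values, not the paper's data, and the sketch assumes the gain is computed from the clean clip and then held fixed, so the perturbation is scaled by the same factor as the signal.

```python
import math


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def peak_gain(x, target_peak=1.0):
    """Gain that scales the largest absolute sample of x to target_peak."""
    peak = max(abs(v) for v in x)
    return target_peak / peak


# Hypothetical quiet recording: a sine whose peak amplitude is 0.004.
quiet_clip = [0.004 * math.sin(2 * math.pi * n / 40) for n in range(1000)]

# Hypothetical small perturbation added to the raw waveform.
raw_delta = [1e-5 * math.cos(n) for n in range(1000)]

# Assumption: the gain comes from the clean clip and is applied unchanged
# to the perturbed clip, so normalization is a fixed linear rescaling.
gain = peak_gain(quiet_clip)
normalized_delta = [gain * d for d in raw_delta]

# Ratio of perturbation norms after vs. before normalization.
norm_ratio = l2_norm(normalized_delta) / l2_norm(raw_delta)

print(f"gain: {gain:.1f}, perturbation norm grows by: {norm_ratio:.1f}x")
```

Under this assumption the perturbation norm scales by exactly the gain, so a radius certified after normalization corresponds to a raw-waveform radius smaller by that factor. This is why reporting the gain policy alongside the raw radius matters.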

Key Points
  • At σ=0.0025, two datasets share the same median raw radius (0.007996) but different SNR-equivalent scales (83.98 vs 90.97 dB) due to waveform energy differences.
  • Log-mel smoothing achieves higher certified accuracy on environmental sounds (68.42% vs 65.53%) but certifies perturbations in feature space, not waveform space.
  • Clipping or peak normalization changes the effective perturbation norm by 230–351×, substantially altering the certified threat model.

Why It Matters

The proposed reporting checklist would standardize how audio robustness results are described, enabling reproducible certification and fair comparisons between models.