Audio & Speech

Optimal Transport Audio Distance with Learned Riemannian Ground Metrics

OTAD detects audio artifacts FAD misses, with 1.9x to 3.6x better sensitivity.

Deep Dive

Evaluating the quality of generated audio remains a challenge, as traditional metrics like Fréchet Audio Distance (FAD) can miss subtle artifacts. In a new paper, Wonwoo Jeong introduces Optimal Transport Audio Distance (OTAD), which targets two core limitations of FAD: a ground cost fixed by the frozen embedding space and a Gaussian assumption on how the two distributions are coupled. OTAD replaces the fixed cost with a residual Riemannian ground-metric adapter and computes the coupling with entropic Sinkhorn optimal transport.
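To make the contrast concrete, here is a minimal NumPy sketch. It is not the paper's implementation or the otadtk API. It computes the closed-form Fréchet distance between Gaussian fits (the quantity FAD is built on) and an entropic Sinkhorn coupling under a plain squared-Euclidean cost. The learned Riemannian ground-metric adapter is omitted, and the function names, the `reg` value, and the per-sample score (each sample's expected transport cost) are illustrative assumptions.

```python
import numpy as np


def frechet_distance(x, y):
    """Frechet distance between Gaussian fits of two embedding sets (rows = samples)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov_x @ cov_y, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


def pairwise_sq_dists(x, y):
    """Squared Euclidean cost matrix; stands in for a learned ground metric."""
    diff = x[:, None, :] - y[None, :, :]
    return (diff ** 2).sum(axis=-1)


def _logsumexp(z, axis):
    z_max = z.max(axis=axis, keepdims=True)
    out = z_max + np.log(np.exp(z - z_max).sum(axis=axis, keepdims=True))
    return out.squeeze(axis)


def sinkhorn_plan(cost, reg=0.5, n_iters=300):
    """Entropic OT plan between uniform marginals, using log-domain Sinkhorn updates."""
    n, m = cost.shape
    log_a = np.full(n, -np.log(n))  # uniform weights on the first sample set
    log_b = np.full(m, -np.log(m))  # uniform weights on the second sample set
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iters):
        # Alternately rescale the dual potentials to match row, then column, marginals.
        f = reg * (log_a - _logsumexp((g[None, :] - cost) / reg, axis=1))
        g = reg * (log_b - _logsumexp((f[:, None] - cost) / reg, axis=0))
    return np.exp((f[:, None] + g[None, :] - cost) / reg)


def per_sample_cost(plan, cost):
    """Expected transport cost of each row sample under the plan (assumed diagnostic)."""
    return (plan * cost).sum(axis=1) / plan.sum(axis=1)
```

In this sketch the Fréchet distance collapses each set to a mean and a covariance and returns a single number, whereas the Sinkhorn plan assigns mass between individual samples, so a row with an unusually high expected cost can be flagged on its own. The cost matrix is best rescaled (for example divided by its mean) before choosing `reg`, since the regularization strength is only meaningful relative to the cost scale.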

Tested across eight encoders, OTAD’s Sinkhorn-based coupling is 1.9 to 3.6 times more sensitive than FAD to rank-1 contamination at ε=0.05. It also achieves a higher mean Spearman correlation with human mean opinion scores (MOS) on DCASE 2023 Task 7 than baseline metrics. A key advantage is per-sample diagnosis: OTAD reaches AUROC ≥0.86 for artifact detection, which scalar or kernel metrics cannot provide. The open-source otadtk toolkit is available to researchers.
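Both statistics quoted above are standard rank-based measures. The sketch below uses hypothetical helper names and is not the paper's evaluation code. It shows how AUROC can be computed from per-sample scores and binary artifact labels via the Mann-Whitney rank statistic, and how Spearman correlation reduces to the Pearson correlation of ranks.

```python
import numpy as np


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks inside each group of equal values by the group mean.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    group_sums = np.bincount(inverse, weights=ranks)
    return (group_sums / counts)[inverse]


def auroc(scores, is_artifact):
    """Probability that a random artifact sample scores higher than a clean one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_artifact, dtype=bool)
    n_pos = int(labels.sum())
    n_neg = int((~labels).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one artifact and one clean sample")
    ranks = average_ranks(scores)
    # Mann-Whitney U statistic of the positive class, normalized to [0, 1].
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])
```

An AUROC of 0.5 corresponds to scores that carry no information about which samples contain artifacts, and 1.0 to perfect separation. When a distance is compared with MOS, the sign of the Spearman coefficient depends on whether higher metric values mean better or worse audio, so reported correlations are only comparable under a shared sign convention.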

Key Points
  • OTAD improves artifact sensitivity by 1.9x–3.6x over FAD using Sinkhorn transport
  • Higher mean Spearman correlation with human MOS on DCASE 2023 Task 7
  • Per-sample diagnostics with AUROC ≥0.86, unavailable in FAD or similar metrics

Why It Matters

More sensitive audio evaluation gives developers of generation models a more reliable quality signal and reduces the chance that artifacts go undetected in production audio.