Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
OTAD detects audio artifacts that FAD misses, with 1.9x to 3.6x higher sensitivity to rank-1 contamination.
Evaluating the quality of generated audio remains difficult because standard metrics such as Fréchet Audio Distance (FAD) can miss subtle artifacts. In a new paper, Wonwoo Jeong introduces Optimal Transport Audio Distance (OTAD), which targets two core limitations of FAD: a ground cost fixed by a frozen embedding, and a coupling that assumes the embeddings are Gaussian. OTAD replaces the fixed cost with a residual Riemannian ground-metric adapter and replaces the Gaussian coupling with entropic Sinkhorn optimal transport.
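To make the coupling idea concrete, here is a minimal sketch of entropic Sinkhorn optimal transport between two sets of embeddings. It is not the paper's implementation or the otadtk API: the squared Euclidean cost stands in for the learned ground metric, and the function names, the `reg` regularization strength, the `scale` normalizer, and the iteration count are all illustrative assumptions.

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sq_euclidean_cost(x, y):
    # Pairwise squared Euclidean distances between rows of x and rows of y.
    # Assumption: a plain stand-in for a learned ground metric.
    d = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * (x @ y.T)
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off

def sinkhorn_plan(cost, reg=0.1, n_iters=300):
    # Log-domain Sinkhorn iterations between two uniform empirical measures.
    n, m = cost.shape
    log_a = np.full(n, -np.log(n))  # uniform weights on the first set
    log_b = np.full(m, -np.log(m))  # uniform weights on the second set
    f, g = np.zeros(n), np.zeros(m)  # dual potentials
    for _ in range(n_iters):
        f = reg * (log_a - _logsumexp((g[None, :] - cost) / reg, axis=1))
        g = reg * (log_b - _logsumexp((f[:, None] - cost) / reg, axis=0))
    # Transport plan: entry (i, j) is the mass moved from sample i to sample j.
    return np.exp((f[:, None] + g[None, :] - cost) / reg)

def entropic_ot_cost(x, y, reg=0.1, scale=1.0, n_iters=300):
    # Transport cost <plan, cost>, with the cost divided by a shared scale
    # so that reg is comparable across different pairs of sets.
    cost = sq_euclidean_cost(x, y) / scale
    plan = sinkhorn_plan(cost, reg=reg, n_iters=n_iters)
    return float(np.sum(plan * cost))

# Hypothetical embeddings: a reference set, a clean set, and a shifted copy.
rng = np.random.default_rng(0)
ref = rng.normal(size=(64, 8))
clean = rng.normal(size=(64, 8))
shifted = clean + 0.5
scale = sq_euclidean_cost(ref, ref).mean()
print("ref vs clean:  ", entropic_ot_cost(ref, clean, scale=scale))
print("ref vs shifted:", entropic_ot_cost(ref, shifted, scale=scale))
```

Unlike a Fréchet distance, which compares only the mean and covariance of each set, the transport plan matches individual samples, so no Gaussian assumption is needed. The log-domain updates are used here only to keep the iterations numerically stable when `reg` is small relative to the cost.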
Tested across eight encoders, OTAD’s Sinkhorn-based coupling is 1.9 to 3.6 times more sensitive than FAD to rank-1 contamination at ε=0.05. It also achieves a higher mean Spearman correlation with human MOS (DCASE 2023 Task 7) than baseline metrics. A key advantage is per-sample diagnostics: OTAD reaches AUROC ≥0.86 for artifact detection, a capability that scalar or kernel metrics do not offer. The open-source otadtk toolkit is available to researchers.
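For readers unfamiliar with the AUROC figure, the sketch below shows how it is computed from per-sample scores: it is the probability that a randomly chosen artifact sample scores higher than a randomly chosen clean one, with ties counted as half. The scores and labels are made up for illustration, and the sketch says nothing about how OTAD itself derives its per-sample scores.

```python
import numpy as np

def auroc(scores, is_artifact):
    # Area under the ROC curve via pairwise comparison of
    # artifact scores against clean scores.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_artifact, dtype=bool)
    pos = scores[labels]    # scores of samples labeled as artifacts
    neg = scores[~labels]   # scores of samples labeled as clean
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one artifact and one clean sample")
    diff = pos[:, None] - neg[None, :]
    # A pair counts 1 if the artifact scores higher, 0.5 on a tie, else 0.
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Hypothetical per-sample scores (higher = more suspicious) and labels.
example_scores = [0.8, 0.4, 0.6, 0.2]
example_labels = [1, 1, 0, 0]
print(auroc(example_scores, example_labels))  # 3 of 4 pairs ranked correctly: 0.75
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a sense of scale for the reported ≥0.86.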
- OTAD improves artifact sensitivity by 1.9x–3.6x over FAD using Sinkhorn transport
- Higher mean Spearman correlation with human MOS on DCASE 2023 Task 7
- Per-sample diagnostics with AUROC ≥0.86, unavailable in FAD or similar metrics
Why It Matters
Better audio evaluation means more reliable generation models and fewer undetected artifacts in production audio.