An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance
Study reveals the Fréchet Audio Distance metric is fundamentally flawed, inheriting biases from its encoder's training task.
A new study by researcher Wonwoo Jeong reveals a critical flaw in the primary metric used to evaluate AI-generated audio. The paper, "An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance," demonstrates that the widely adopted Fréchet Audio Distance (FAD) inherits systematic biases from the underlying encoder model used to compute it. Because an encoder's training task dictates which acoustic features it prioritizes, the same pair of audio sets can receive very different FAD scores under different encoders, making the metric inconsistent and unreliable and calling into question many published comparisons between AI audio models.
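For context on where that bias enters: FAD fits a Gaussian to the encoder embeddings of a reference set and of a generated set, then measures the Fréchet distance between the two Gaussians, so every score is filtered through the encoder's feature space. Below is a minimal sketch of the standard computation in NumPy/SciPy; the function name and array layout are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """FAD between two embedding sets (rows = clips, cols = encoder features).

    Standard Fréchet distance between Gaussians fitted to each set:
        ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * (S_r S_g)^{1/2})
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Swapping the encoder changes `emb_ref` and `emb_gen`, and with them the score, which is exactly the dependence the study measures.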
The research decomposes evaluation into Recall, Precision, and Alignment, using a novel log-scale normalization to make scores comparable across encoders. Controlled experiments on six encoders (including reconstruction-based AudioMAE, ASR-trained Whisper, and classification-trained VGGish) reveal a clear four-axis trade-off. For instance, Whisper excels at detecting structural issues but is blind to signal degradation, while VGGish maximizes semantic detection but penalizes legitimate acoustic variation within a class. The conclusion is stark: no current encoder serves as a universal evaluator. The authors argue this finding mandates a paradigm shift toward evaluation-native encoders intrinsically aligned with human perception, a shift that could reshape how future text-to-audio models such as AudioGen or Stable Audio are benchmarked and improved.
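The summary does not spell out the paper's log-scale normalization, so the following is a hypothetical sketch of one plausible form, shown purely as an assumption: rescaling each encoder's raw FAD against a per-encoder baseline (for example, the FAD between two disjoint splits of real audio under the same encoder) in log space, so encoders with very different raw score magnitudes land on a comparable scale.

```python
import numpy as np

def log_normalized_fad(fad_score: float, fad_reference: float,
                       eps: float = 1e-12) -> float:
    """Hypothetical log-scale normalization for cross-encoder comparison.

    `fad_reference` is assumed to be a per-encoder baseline, e.g. the FAD
    between two disjoint splits of real audio under the same encoder.
    The log-ratio is 0 when a model matches that baseline and grows as
    the generated set drifts from it. This is an illustrative guess at
    the paper's scheme, not its published formula.
    """
    return float(np.log((fad_score + eps) / (fad_reference + eps)))
```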
- FAD, the standard metric for AI audio, inherits bias from its encoder's training task (e.g., reconstruction, ASR, classification).
- Testing six encoders revealed a four-axis trade-off: AudioMAE leads in precision, Whisper dominates structural detection, VGGish maximizes semantic detection.
- The study shows no single encoder is a universal evaluator, calling many published model comparisons into question and pushing the field toward new metrics (see the multi-encoder sketch below).
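Pending the evaluation-native encoders the paper calls for, one pragmatic mitigation implied by its finding is to report FAD under several encoders and treat disagreement between them as a warning sign. A sketch reusing `frechet_audio_distance` from above; the embedding functions are placeholders, not a real API from any of the cited models.

```python
def multi_encoder_fad(ref_audio, gen_audio, embed_fns):
    """Report FAD under several encoders instead of trusting one.

    `embed_fns` maps an encoder name to a placeholder function that turns
    a list of audio clips into an (n_clips, dim) embedding matrix.
    """
    scores = {}
    for name, embed in embed_fns.items():
        scores[name] = frechet_audio_distance(embed(ref_audio), embed(gen_audio))
    return scores  # rankings that flip across encoders signal task-induced bias
```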
Why It Matters
The finding undermines single-encoder comparisons between major AI audio models, forcing developers to rebuild evaluation from the ground up.