Simple acoustic features catch degraded voice clones with up to 85% accuracy
A three-feature acoustic detector flags failed synthetic speech without the computational cost of deep learning.
A lightweight, interpretable method detects degraded voice clones using three acoustic features: fundamental frequency (f0), harmonics-to-noise ratio (HNR), and vocal tract length (VTL). The method was tested on human-labeled samples from WaveRNN (n=54) and HiFi-GAN (n=40). On WaveRNN, f0 and HNR each reached 85.2% accuracy; on HiFi-GAN, HNR reached 80.0% and f0 77.5%. Designed for clinical AVATAR therapy for patients with schizophrenia, the approach offers a fast, low-cost first-pass screen against synthetic speech failures.
- f0 and HNR each hit 85.2% accuracy on WaveRNN voice clones, far exceeding VTL (64.8%).
- The method uses only 3 low-dimensional features, avoiding computationally expensive deep-learning models.
- Designed for AVATAR therapy, where rapid detection of degraded synthetic speech preserves patient immersion.
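The single-feature results above suggest how simple such a screen can be. As a minimal sketch, assuming each clip has already been reduced to one HNR value and carries a human "degraded" or "acceptable" label, the Python below picks the cut-off that maximizes accuracy on that labeled set. The source does not say which classifier or feature extractor the authors used, so `fit_threshold`, the HNR values, and the labels are all hypothetical and for illustration only.

```python
from typing import List, Sequence, Tuple


def accuracy(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that match the human labels."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def fit_threshold(
    values: Sequence[float], labels: Sequence[bool]
) -> Tuple[float, bool, float]:
    """Find the best single-feature rule for flagging degraded clips.

    Returns (threshold, degraded_if_below, accuracy). Both directions are
    tried, so nothing is assumed about whether degradation raises or
    lowers the feature.
    """
    xs = sorted(set(values))
    # Candidate cut points: one below all values, the midpoints between
    # adjacent distinct values, and one above all values.
    candidates: List[float] = (
        [xs[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
        + [xs[-1] + 1.0]
    )
    best = (candidates[0], True, -1.0)
    for t in candidates:
        for degraded_if_below in (True, False):
            preds = [(v < t) == degraded_if_below for v in values]
            acc = accuracy(preds, labels)
            if acc > best[2]:  # keep the first rule on ties
                best = (t, degraded_if_below, acc)
    return best


# Made-up HNR values (dB) and human labels (True = degraded clone).
hnr_db = [18.2, 17.5, 16.9, 9.1, 8.4, 15.8, 7.7, 10.3]
degraded = [False, False, False, False, True, False, True, True]

threshold, degraded_if_below, acc = fit_threshold(hnr_db, degraded)
side = "below" if degraded_if_below else "at or above"
print(f"flag clips {side} {threshold:.2f} dB (accuracy {acc:.1%})")
```

The same function would work for f0 or VTL. Accuracy measured on the same clips used to pick the threshold is optimistic, so a real evaluation would need held-out samples.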
Why It Matters
A fast, cheap, and transparent detector can catch degraded AI voice clones before they disrupt critical clinical tools such as AVATAR therapy.