Audio & Speech

Simple acoustic features catch degraded voice clones with 85% accuracy

arXiv eess.AS May 12, 2026

⚡ A 3-feature acoustic detector spots failed synthetic speech without computationally expensive deep-learning models.

Deep Dive

A lightweight, interpretable method detects degraded voice clones using three acoustic features: fundamental frequency (f0), harmonics-to-noise ratio (HNR), and vocal tract length (VTL). Tested on human-labeled samples from WaveRNN (n=54) and HiFi-GAN (n=40), f0 and HNR each achieved 85.2% accuracy for WaveRNN; for HiFi-GAN, HNR reached 80.0% and f0 77.5%. Designed for clinical AVATAR therapy with schizophrenia patients, the approach offers a fast, low-cost first-pass screen against synthetic speech failures.
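The summary does not say how each feature is turned into a degraded/acceptable decision. One simple possibility, shown here only as a sketch, is a single-threshold rule per feature: pick the cut point and direction that best separate the human labels. The function names `fit_threshold` and `is_flagged` are hypothetical, and the HNR values below are invented toy numbers, not data from the study.

```python
from typing import List, Tuple


def fit_threshold(values: List[float], degraded: List[bool]) -> Tuple[float, bool, float]:
    """Find the best single-feature rule; return (cut, flag_if_below, accuracy).

    Assumption: a plain threshold rule, which may differ from the paper's classifier.
    """
    ordered = sorted(set(values))
    # Candidate cuts: one below the minimum, then midpoints between neighbours.
    cuts = [ordered[0] - 1.0] + [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = (cuts[0], True, -1.0)
    for cut in cuts:
        for flag_if_below in (True, False):
            # Flag a sample when it falls on the chosen side of the cut.
            predictions = [(v < cut) == flag_if_below for v in values]
            correct = sum(p == d for p, d in zip(predictions, degraded))
            accuracy = correct / len(values)
            if accuracy > best[2]:
                best = (cut, flag_if_below, accuracy)
    return best


def is_flagged(value: float, cut: float, flag_if_below: bool) -> bool:
    """Apply a fitted rule to one new feature value."""
    return (value < cut) == flag_if_below


# Invented toy HNR values in dB, with human labels (True = degraded).
toy_hnr = [4.0, 6.0, 9.0, 14.0, 15.0, 18.0]
toy_labels = [True, True, True, False, False, False]

cut, flag_if_below, accuracy = fit_threshold(toy_hnr, toy_labels)
print(f"cut={cut}, flag_if_below={flag_if_below}, accuracy={accuracy:.3f}")
```

A rule of this kind stays interpretable: the whole model is one number and one direction per feature, which is the sort of transparency the summary attributes to the method.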

Key Points

f0 and HNR each hit 85.2% accuracy on WaveRNN voice clones, far exceeding VTL (64.8%).
The method uses only 3 low-dimensional features, avoiding computationally expensive deep-learning models.
Designed for AVATAR therapy, where rapid detection of degraded synthetic speech preserves patient immersion.
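With samples this small, each percentage maps onto a handful of clips. The sketch below back-computes the nearest whole number of correct decisions for each reported figure, assuming accuracy was calculated as correct decisions divided by the full sample count (the summary does not state this). The helper name `implied_correct` is hypothetical.

```python
def implied_correct(accuracy_pct: float, n: int) -> int:
    """Nearest whole number of correct decisions for a reported accuracy."""
    return round(accuracy_pct / 100 * n)


# Reported accuracies (percent) and sample sizes from the summary above.
reported = {
    ("WaveRNN", "f0"): (85.2, 54),
    ("WaveRNN", "HNR"): (85.2, 54),
    ("WaveRNN", "VTL"): (64.8, 54),
    ("HiFi-GAN", "HNR"): (80.0, 40),
    ("HiFi-GAN", "f0"): (77.5, 40),
}

for (system, feature), (pct, n) in reported.items():
    k = implied_correct(pct, n)
    # Recompute the percentage to confirm it rounds back to the reported value.
    print(f"{system} {feature}: {k}/{n} = {100 * k / n:.1f}%")

# One clip changes accuracy by this many percentage points.
step_wavernn = 100 / 54
step_hifigan = 100 / 40
```

Under that assumption, a single relabeled clip shifts accuracy by roughly 1.9 points for WaveRNN and 2.5 points for HiFi-GAN, so small gaps between features should be read with caution.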

Why It Matters

A fast, cheap, and transparent detector can flag degraded AI voice clones before they disrupt critical clinical tools such as AVATAR therapy.
