Audio & Speech

Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset

An RBF SVM model using interpretable acoustic features achieves ~93% test accuracy on the Fake-or-Real dataset.

Deep Dive

A team of researchers has published a paper establishing a strong, interpretable baseline for detecting AI-generated deepfake audio. Using the Fake-or-Real (FoR) dataset, they extracted classical acoustic features such as pitch variability and spectral richness from two-second audio clips at both 44.1 kHz and 16 kHz sampling rates. Statistical analysis then identified which features differed significantly between real and synthetic speech, and the team trained and compared multiple classical classifiers, including Logistic Regression, SVMs, and Gaussian Mixture Models.
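A minimal sketch of this feature-extraction and significance-testing step, assuming librosa for the acoustic features and SciPy for the statistical test; the file paths, feature choices, and pitch bounds below are illustrative stand-ins, not the authors' exact pipeline:

    import numpy as np
    import librosa
    from scipy.stats import mannwhitneyu

    def acoustic_features(path, sr=16000, duration=2.0):
        # Interpretable features from a two-second clip. The specific
        # choices (pYIN pitch variability, mean spectral centroid and
        # bandwidth) are plausible stand-ins for the paper's feature set.
        y, sr = librosa.load(path, sr=sr, duration=duration)
        f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                fmax=librosa.note_to_hz("C7"), sr=sr)
        return {
            "pitch_std": np.nanstd(f0),  # pitch variability (unvoiced frames are NaN)
            "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
            "spectral_bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr).mean(),
        }

    # Hypothetical file lists standing in for the FoR real/fake splits.
    real_paths = ["for/real/clip0.wav", "for/real/clip1.wav"]
    fake_paths = ["for/fake/clip0.wav", "for/fake/clip1.wav"]
    real = [acoustic_features(p) for p in real_paths]
    fake = [acoustic_features(p) for p in fake_paths]

    # Per-feature significance test between real and synthetic speech.
    for name in real[0]:
        stat, pval = mannwhitneyu([f[name] for f in real], [f[name] for f in fake])
        print(f"{name}: U={stat:.1f}, p={pval:.3g}")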

The best-performing model was a Radial Basis Function (RBF) Support Vector Machine (SVM), which achieved approximately 93% test accuracy and a 7% Equal Error Rate (EER) on both high-fidelity and telephone-quality audio. In contrast, simpler linear models reached only about 75% accuracy. Feature analysis revealed that pitch variability and spectral characteristics (such as spectral centroid and bandwidth) were the most reliable cues for separating real human speech from AI-generated audio. The paper, accepted for oral presentation at the 35th IEEE Microelectronics Design and Test Symposium, provides a transparent benchmark against which more complex, black-box deep learning detectors can be measured.
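A sketch of how such a classifier comparison might be set up in scikit-learn, with the RBF SVM and a linear baseline evaluated in one loop. The feature matrix here is random placeholder data, and the hyperparameters are library defaults rather than the paper's tuned values:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Placeholder feature matrix and labels (1 = real, 0 = synthetic);
    # in practice these would come from the extraction step above.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = rng.integers(0, 2, size=1000)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    models = {
        # RBF SVM: the paper's best performer (~93% accuracy, ~7% EER).
        "rbf_svm": make_pipeline(StandardScaler(),
                                 SVC(kernel="rbf", gamma="scale", probability=True)),
        # Linear baseline, reported at roughly 75% accuracy.
        "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    }

    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")

Scaling the features before an RBF SVM matters because the kernel is distance-based; without it, features on larger numeric scales dominate the decision boundary.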

Key Points
  • The RBF SVM model achieved ~93% test accuracy and ~7% EER on the Fake-or-Real dataset (an EER computation sketch follows this list).
  • Key discriminative features were pitch variability and spectral richness (spectral centroid, bandwidth).
  • The work provides a transparent, interpretable baseline for evaluating future neural audio deepfake detectors.
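For reference, the Equal Error Rate quoted above is the operating point where the false-positive and false-negative rates coincide. A common way to estimate it from classifier scores is via the ROC curve; the interpolation below is a standard convention, not a detail taken from the paper:

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(y_true, scores):
        # EER: point on the ROC curve where FPR == FNR (i.e., 1 - TPR).
        fpr, tpr, _ = roc_curve(y_true, scores)
        fnr = 1.0 - tpr
        idx = np.argmin(np.abs(fpr - fnr))  # closest crossing point
        return (fpr[idx] + fnr[idx]) / 2.0

    # Usage with the fitted pipeline from the previous sketch:
    # scores = models["rbf_svm"].predict_proba(X_te)[:, 1]
    # print(f"EER = {equal_error_rate(y_te, scores):.3f}")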

Why It Matters

Offers a crucial, understandable benchmark for detecting AI voice fraud, moving beyond opaque 'black-box' neural models.