Audio & Speech

Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper

A cost-effective method fine-tunes OpenAI's Whisper to spot synthetic words in speech with next-token prediction.

Deep Dive

A research team led by Hoan My Tran has published a method for detecting deepfake words within audio recordings by repurposing OpenAI's Whisper speech recognition model. Whisper already generates transcripts by next-token prediction; the core idea is to fine-tune it so that the same decoding pass both transcribes the speech and marks which words are likely synthetic. The approach is presented as a cost-effective alternative to building a dedicated detection model from scratch, since it builds on a powerful pre-trained foundation. The researchers also explored using partially vocoded utterances as fine-tuning data to reduce data collection costs.
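To make the idea concrete, here is a minimal sketch of how a partially vocoded training example and its tagged target transcript might be built, and how a decoded string could be split back into a transcript plus per-word flags. The marker tokens `<fake>` and `</fake>`, the `Word` type, and both function names are hypothetical and chosen for illustration; this summary does not specify the paper's actual tagging scheme or data pipeline. The sketch also assumes word boundaries are known and that the vocoded copy is time-aligned with the original.

```python
from dataclasses import dataclass

# Hypothetical marker tokens; the paper's actual tagging scheme may differ.
FAKE_START = "<fake>"
FAKE_END = "</fake>"


@dataclass
class Word:
    text: str
    start: int  # first sample index of the word (inclusive)
    end: int    # sample index just past the word (exclusive)


def make_training_example(bona_fide, vocoded, words, fake_indices):
    """Splice vocoded audio into selected word spans and build the tagged target.

    `vocoded` is assumed to be the same utterance passed through a vocoder,
    so both waveforms have the same length and alignment.
    """
    if len(bona_fide) != len(vocoded):
        raise ValueError("waveforms must be time-aligned and equal in length")
    audio = list(bona_fide)
    target = []
    for i, w in enumerate(words):
        if i in fake_indices:
            # Replace only this word's samples with the vocoded version.
            audio[w.start:w.end] = vocoded[w.start:w.end]
            target.append(f"{FAKE_START}{w.text}{FAKE_END}")
        else:
            target.append(w.text)
    return audio, " ".join(target)


def parse_prediction(decoded):
    """Split a decoded string into (plain transcript, per-word synthetic flags)."""
    words, flags = [], []
    for tok in decoded.split():
        is_fake = tok.startswith(FAKE_START) and tok.endswith(FAKE_END)
        if is_fake:
            tok = tok[len(FAKE_START):-len(FAKE_END)]
        words.append(tok)
        flags.append(is_fake)
    return " ".join(words), flags


words = [Word("send", 0, 2), Word("ten", 2, 4), Word("dollars", 4, 6)]
audio, target = make_training_example([0.0] * 6, [1.0] * 6, words, {1})
print(target)                    # send <fake>ten</fake> dollars
print(parse_prediction(target))  # ('send ten dollars', [False, True, False])
```

In an actual fine-tuning setup, the tagged string would serve as the decoder target for the spliced audio, so the model learns to emit the markers as ordinary next tokens.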

The paper, 'Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper,' reports that on in-domain test data the fine-tuned model achieves low error rates for both synthetic-word detection and transcription. On out-of-domain data containing synthetic words from unseen generative models, it performed on par with a dedicated ResNet-based detector. However, the team notes that overall performance degrades in this out-of-domain setting, highlighting the need for better generalization. The work is a step toward practical multi-task AI tools that can transcribe and authenticate audio content in real time, addressing a critical gap in media integrity.
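The two kinds of error mentioned above can be illustrated with a small sketch. Word error rate is the standard transcription metric (word-level edit distance divided by reference length). The detection function below reports a miss rate and a false-alarm rate over per-word flags; it is only one plausible way to score synthetic-word detection, and this summary does not state which metric the paper uses. It also assumes the reference and predicted words are already aligned one-to-one. Both function names are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def detection_error_rates(true_flags, pred_flags):
    """Return (miss rate, false-alarm rate) over aligned per-word flags.

    Assumes a one-to-one word alignment; not necessarily the paper's metric.
    """
    if len(true_flags) != len(pred_flags):
        raise ValueError("flag lists must have the same length")
    n_fake = sum(true_flags)
    n_real = len(true_flags) - n_fake
    misses = sum(t and not p for t, p in zip(true_flags, pred_flags))
    false_alarms = sum(p and not t for t, p in zip(true_flags, pred_flags))
    miss_rate = misses / n_fake if n_fake else 0.0
    false_alarm_rate = false_alarms / n_real if n_real else 0.0
    return miss_rate, false_alarm_rate


wer = word_error_rate("send ten dollars", "send two dollars")  # one substitution in three words
rates = detection_error_rates([False, True, False, True],
                              [False, True, True, False])      # one miss, one false alarm
```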

Key Points
  • Fine-tunes OpenAI's Whisper model to detect synthetic words via next-token prediction during transcription.
  • Achieves performance on par with dedicated ResNet detectors on out-of-domain data, though generalization needs improvement.
  • Uses partially vocoded data for fine-tuning, reducing the cost and complexity of training data collection.

Why It Matters

Provides a dual-purpose, cost-effective tool for real-time audio transcription and deepfake detection, crucial for media verification.