Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
A new hyperbolic audio language model beats SOTA detectors by 20% on a novel 10,000+ hour benchmark covering 12 Indic languages.
A research team from IIIT Hyderabad and other institutions has published a pivotal paper addressing a critical gap in AI security: detecting neural audio codec-synthesized speech deepfakes (CodecFakes) in Indic languages. Their work introduces two major contributions. First, they created the Indic-CodecFake (ICF) dataset, the first large-scale benchmark comprising over 10,000 hours of real and synthetic speech across 12 major Indic languages, diverse speakers, and multiple codec types. This dataset exposes a significant vulnerability, as their experiments show state-of-the-art detectors trained on English or Chinese data fail to generalize, with performance dropping by over 30% due to the phonetic and prosodic complexity of Indic speech.
To close this detection gap, the team proposed SATYAM, a novel hyperbolic Audio Language Model (ALM) architected specifically for the task. SATYAM's innovation lies in a dual-stage fusion framework. It first integrates semantic representations from OpenAI's Whisper with prosodic features from Google's TRILLsson model using a Bhattacharyya distance metric within a hyperbolic space, which lets it model the hierarchical relationships within the speech signal. It then aligns this fused representation with an input text prompt. This approach enables SATYAM to outperform both existing end-to-end detectors and other ALMs in a zero-shot setting, achieving 20% higher detection accuracy on the challenging ICF benchmark. The paper has been accepted to ACL 2026.
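To make the fusion idea concrete, here is a minimal, hypothetical sketch of the two ingredients the paragraph names: embedding feature vectors into hyperbolic space (a Poincaré ball, via the exponential map at the origin) and weighting the two feature streams by a Bhattacharyya distance between their Gaussian summaries. This is not the authors' implementation; the function names (`expmap0`, `fuse`) and the gating scheme are illustrative assumptions, and real Whisper/TRILLsson features would be high-dimensional tensors rather than short lists.

```python
import math

def expmap0(v, c=1.0):
    # Exponential map at the origin of a Poincare ball with curvature -c:
    # maps a Euclidean (tangent-space) vector into the open unit ball.
    norm = math.sqrt(sum(x * x for x in v)) or 1e-9
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def poincare_dist(u, v, c=1.0):
    # Geodesic distance between two points in the Poincare ball.
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    u2 = sum(a * a for a in u)
    v2 = sum(b * b for b in v)
    arg = 1 + 2 * c * diff2 / ((1 - c * u2) * (1 - c * v2))
    return math.acosh(arg) / math.sqrt(c)

def bhattacharyya_gauss(mu1, var1, mu2, var2):
    # Bhattacharyya distance between two 1-D Gaussians, used here to
    # summarize how much the two feature streams "agree".
    return (0.25 * math.log(0.25 * (var1 / var2 + var2 / var1 + 2))
            + 0.25 * (mu1 - mu2) ** 2 / (var1 + var2))

def fuse(semantic, prosodic, c=1.0):
    # Hypothetical dual-stream fusion: gate the mix of the two streams by
    # their Bhattacharyya distance, then embed the result hyperbolically.
    def summary(v):
        m = sum(v) / len(v)
        var = sum((x - m) ** 2 for x in v) / len(v) + 1e-9
        return m, var
    db = bhattacharyya_gauss(*summary(semantic), *summary(prosodic))
    w = math.exp(-db)  # in (0, 1]; similar streams mix more evenly
    mixed = [w * a + (1 - w) * b for a, b in zip(semantic, prosodic)]
    return expmap0(mixed, c)

# Toy usage: fuse a "semantic" and a "prosodic" vector.
fused = fuse([0.2, 0.5, -0.1], [0.3, 0.4, 0.0])
```

The hyperbolic embedding is the key design choice here: distances near the ball's boundary grow exponentially, which is why such spaces are often argued to represent hierarchical structure (like the paper's claimed speech hierarchies) more compactly than Euclidean ones.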
- Introduced the Indic-CodecFake (ICF) dataset, the first 10,000+ hour benchmark for CodecFakes across 12 Indic languages.
- Found a >30% performance drop for SOTA English-trained detectors, highlighting a major security gap for billions of speakers.
- Proposed SATYAM, a hyperbolic ALM fusing Whisper & TRILLsson features, beating baselines by 20% on the ICF benchmark.
Why It Matters
Secures voice authentication and media integrity for over 1.5 billion speakers in a region vulnerable to AI-powered disinformation.