Research & Papers

[P] Deezer showed CNN detection fails on compressed audio, here's a dual-engine approach that survives MP3

A hybrid method that pairs a CNN with Demucs source separation detects over 80% of AI-generated music and holds up under audio compression.

Deep Dive

A significant challenge in detecting AI-generated music has been the failure of standard methods when audio is compressed. Deezer's research showed that Convolutional Neural Networks (CNNs) like ResNet18, trained on mel-spectrograms from WAV files, break down when analyzing real-world MP3 or AAC files. The compression process destroys the subtle spectral artifacts that these models rely on for identification, rendering them ineffective for monitoring streaming platforms or downloaded content.

To address this, a researcher built a hybrid detection system. Instead of trying to harden the CNN, they added a second, complementary engine based on the Demucs source separation model. The CNN runs first and handles the cases where it is confident. Only when the CNN is uncertain does the system invoke the more computationally expensive separation engine. Demucs splits a track into four stems (vocals, drums, bass, and other), which are then re-mixed and compared against the original. The key insight is that human-recorded music, with its natural acoustic bleed and microphone crosstalk, does not reconstruct cleanly: the re-mix differs measurably from the original. AI-generated music, where stems are synthesized independently, reconstructs nearly perfectly.
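The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the `separate` callable stands in for a Demucs wrapper, and the error metric (relative residual energy) and the thresholds `low`, `high`, and `error_threshold` are all assumptions chosen for readability.

```python
from typing import Callable

import numpy as np


def reconstruction_error(mix: np.ndarray, stems: np.ndarray) -> float:
    """Relative energy of the residual between a mix and the sum of its stems.

    mix:   waveform, shape (samples,) or (channels, samples)
    stems: separated stems, shape (n_stems, *mix.shape)
    This metric is an assumption; the source does not specify one.
    """
    remix = stems.sum(axis=0)
    residual = mix - remix
    return float(np.sum(residual ** 2) / (np.sum(mix ** 2) + 1e-12))


def classify(
    mix: np.ndarray,
    cnn_prob_ai: float,
    separate: Callable[[np.ndarray], np.ndarray],
    low: float = 0.2,              # hypothetical: below this, trust "human"
    high: float = 0.8,             # hypothetical: above this, trust "ai"
    error_threshold: float = 0.01,  # hypothetical cut-off on the residual
) -> str:
    """Use the CNN when it is confident, otherwise fall back to separation."""
    if cnn_prob_ai >= high:
        return "ai"
    if cnn_prob_ai <= low:
        return "human"
    # Borderline case: run the expensive separation engine.
    stems = separate(mix)
    error = reconstruction_error(mix, stems)
    # A near-perfect reconstruction is treated as evidence of AI generation.
    return "ai" if error < error_threshold else "human"
```

In a real pipeline, `separate` would wrap a source separation model and the thresholds would be tuned on labeled data. Keeping the separator as a plain callable also makes the cascade easy to test without loading a model.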

The reported results are promising for practical deployment. The system maintains an AI detection rate of over 80% with a human false positive rate of about 1.1%, and it holds up across audio codecs including MP3, AAC, and OGG. Because the resource-intensive source separation runs only on borderline cases, the method stays computationally efficient. Limitations remain, including variable performance across different AI music generators and the non-deterministic nature of Demucs, but the hybrid framework offers a robust new direction for audio authenticity tools in the age of AI synthesis.
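For readers unfamiliar with the two headline numbers, the sketch below shows how a detection rate and a human false positive rate are typically computed from labeled predictions. The function name and the toy labels are illustrative assumptions and are not the author's evaluation data.

```python
from typing import Sequence, Tuple


def detection_metrics(
    labels: Sequence[str], predictions: Sequence[str]
) -> Tuple[float, float]:
    """Return (AI detection rate, human false positive rate).

    Detection rate:      share of truly AI tracks that were flagged as AI.
    False positive rate: share of truly human tracks that were flagged as AI.
    """
    pairs = list(zip(labels, predictions))
    ai_total = sum(1 for y, _ in pairs if y == "ai")
    human_total = sum(1 for y, _ in pairs if y == "human")
    true_pos = sum(1 for y, p in pairs if y == "ai" and p == "ai")
    false_pos = sum(1 for y, p in pairs if y == "human" and p == "ai")
    detection_rate = true_pos / ai_total if ai_total else 0.0
    false_positive_rate = false_pos / human_total if human_total else 0.0
    return detection_rate, false_positive_rate


# Toy example: 5 AI tracks (4 caught) and 5 human tracks (1 wrongly flagged).
toy_labels = ["ai"] * 5 + ["human"] * 5
toy_predictions = ["ai", "ai", "ai", "ai", "human",
                   "ai", "human", "human", "human", "human"]
rates = detection_metrics(toy_labels, toy_predictions)  # (0.8, 0.2)
```

The two rates are computed over different populations, which is why a system can pair a moderate detection rate with a very low false positive rate.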

Key Points
  • Solves CNN failure on compressed audio: Deezer's paper showed CNNs break on MP3/AAC; new method works across codecs.
  • Uses Demucs for source separation: Analyzes reconstruction error of separated stems (vocals, drums, bass, other) to detect AI.
  • Achieves 80%+ AI detection with ~1.1% human false positives: Hybrid system uses CNN first, then costly separation only when uncertain.

Why It Matters

Provides a practical tool for platforms and labels to identify AI-generated music in the real world of streaming and compressed files.