Echoes: A semantically-aligned music deepfake detection dataset
New 110-hour dataset from ten AI music generators creates the hardest benchmark for detection models.
A research team from the University Politehnica of Bucharest, including Octavian Pascu, Dan Oneata, Horia Cucu, and Nicolas M. Muller, has released Echoes, a dataset designed to train and benchmark AI music deepfake detectors under realistic conditions. The dataset comprises 3,577 tracks totaling 110 hours of audio, spanning genres such as pop, rock, and electronic, and includes content generated by ten different popular AI music generation systems. To force detectors to learn robust, transferable cues rather than superficial artifacts, the team constructed Echoes with semantic-level alignment: each AI-generated "spoof" track is conditioned directly on a real "bona fide" reference track's waveform or song descriptors, so the fake and real versions are musically equivalent in content rather than differing only in surface-level noise patterns.
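One practical consequence of this pairing is that train/test splits can be made at the level of the bona fide reference, so a detector is never tested on musical content it saw during training. The sketch below illustrates that idea with a hypothetical pair schema; the field names and `split_by_reference` helper are illustrative assumptions, not the dataset's actual layout.

```python
from dataclasses import dataclass

@dataclass
class AlignedPair:
    """One semantically aligned example: a real track and a spoof
    generated from it. Field names are illustrative, not Echoes' schema."""
    bonafide_path: str   # real reference recording
    spoof_path: str      # AI-generated counterpart
    provider: str        # which generation system produced the spoof
    conditioning: str    # "waveform" or "descriptors"

def split_by_reference(pairs, test_refs):
    """Split at the song level: every spoof conditioned on a held-out
    bona fide track goes to the test set, preventing content leakage."""
    train = [p for p in pairs if p.bonafide_path not in test_refs]
    test = [p for p in pairs if p.bonafide_path in test_refs]
    return train, test
```

Splitting by reference track (rather than by individual file) is what keeps the same musical content from appearing on both sides of the evaluation.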
In cross-dataset evaluations using state-of-the-art Wav2Vec2 XLS-R 2B models, Echoes proved to be the hardest in-domain benchmark. Detectors trained on existing, less challenging datasets transferred poorly to Echoes, often failing to generalize. Conversely, models trained on the semantically aligned, provider-diverse Echoes dataset showed the strongest generalization when tested on other datasets. This finding suggests that source diversity and semantic alignment are both critical for learning detection cues that remain effective as AI music generators evolve. The work underscores a broader shift in the field: to build reliable deepfake detectors, the training data must mirror the complexity and quality of real-world forgeries.
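Performance in this line of work is conventionally summarized with the equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. Below is a minimal NumPy sketch of computing EER from detector scores; it assumes higher scores mean "more likely bona fide" and is not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """EER: the rate at the threshold where false rejections (real audio
    flagged as fake) equal false acceptances (fakes passed as real)."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection rate: bona fide tracks scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # False acceptance rate: spoof tracks scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2
```

In a cross-dataset protocol, the same trained detector is scored on each test corpus in turn, and a lower EER on an unseen corpus indicates better generalization.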
- Contains 3,577 tracks (110 hours) from ten different AI music generation systems, ensuring provider diversity.
- Uses semantic alignment by conditioning fakes on real song waveforms/descriptors, preventing detectors from learning easy shortcuts.
- Proved to be the hardest in-domain benchmark in cross-dataset tests; training on Echoes yields the strongest generalization to other detection datasets.
Why It Matters
Provides a crucial, realistic benchmark to develop detectors that can keep pace with rapidly improving AI music generators, protecting artists and platforms.