Evaluating Pretrained General-Purpose Audio Representations for Music Genre Classification
New research shows self-supervised BYOL-A embeddings outperform PANNs and VGGish for music genre classification.
A new study presented at the International Conference on Pattern Recognition and Machine Intelligence (PReMI) 2025 provides a comprehensive benchmark for using pretrained, general-purpose audio AI models to classify music genres. Researchers Kashish Rai and Mrinmoy Bhattacharjee systematically evaluated embeddings from self-supervised learning models like BYOL-A against established models such as PANNs and VGGish. Their key finding is that BYOL-A embeddings, when processed by a custom deep neural network classifier, deliver superior performance, achieving 81.5% accuracy on the standard GTZAN dataset and 64.3% on the more challenging FMA-Small dataset.
The proposed DNN architecture was a significant factor, boosting accuracy by 10-16% compared to using basic linear classifiers on the same embeddings. The researchers also tackled the challenge of cross-dataset generalization by creating a unified 18-class label space from GTZAN and FMA-Small for joint training. While this caused a slight performance drop on GTZAN, it yielded comparable results on FMA-Small, demonstrating a more robust model. All scripts from this work are publicly available, offering a practical toolkit for developers and researchers looking to implement state-of-the-art audio classification without training models from scratch.
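The pipeline described above pairs a frozen pretrained embedding model with a trained classifier head. The paper's exact DNN architecture is not specified here, so the sketch below is a minimal illustration with hypothetical layer sizes, assuming 2048-dimensional BYOL-A-style embeddings and the 10 GTZAN genres; randomly initialized weights stand in for a trained head.

```python
import numpy as np

# Illustrative sketch only: layer sizes and the 2048-d embedding dimension are
# assumptions, not the study's reported architecture.
EMB_DIM, HIDDEN, N_CLASSES = 2048, 256, 10  # 10 GTZAN genres

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Randomly initialized weights stand in for a trained classifier head.
W1 = rng.normal(0.0, 0.02, (EMB_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.02, (HIDDEN, N_CLASSES))
b2 = np.zeros(N_CLASSES)

def classify(embedding):
    """Map pretrained audio embeddings to genre probabilities."""
    h = relu(embedding @ W1 + b1)          # hidden layer
    return softmax(h @ W2 + b2)            # class probabilities

# A batch of 4 fake embeddings stands in for real BYOL-A outputs.
probs = classify(rng.normal(size=(4, EMB_DIM)))
print(probs.shape)  # (4, 10)
```

The point of the design is that only this small head is trained per task; the expensive embedding model stays frozen, which is what makes the approach practical without training audio models from scratch.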
- BYOL-A embeddings outperformed PANNs and VGGish, scoring 81.5% accuracy on GTZAN and 64.3% on FMA-Small.
- A custom Deep Neural Network classifier provided a 10-16% performance boost over standard linear classifiers.
- The study addressed cross-dataset challenges by unifying GTZAN and FMA-Small into an 18-class label space for joint training.
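Since GTZAN has 10 genres and FMA-Small has 8 top-level genres, an 18-class space implies the labels were kept distinct rather than merged where names overlap. The study's exact merging code is not shown here; a minimal sketch of one way to build such a unified label space follows (the dataset-prefix scheme is an assumption for illustration).

```python
# Standard genre lists for the two datasets.
GTZAN = ["blues", "classical", "country", "disco", "hiphop",
         "jazz", "metal", "pop", "reggae", "rock"]
FMA_SMALL = ["Electronic", "Experimental", "Folk", "Hip-Hop",
             "Instrumental", "International", "Pop", "Rock"]

# Prefixing each label with its dataset keeps overlapping names
# (pop, rock, hip-hop) as separate classes, yielding 10 + 8 = 18.
unified = [f"gtzan/{g}" for g in GTZAN] + [f"fma/{g}" for g in FMA_SMALL]
label_to_id = {name: i for i, name in enumerate(unified)}

print(len(label_to_id))  # 18
```

With a shared label space like this, examples from both datasets can be batched together for joint training, which is what enables the cross-dataset evaluation reported above.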
Why It Matters
This benchmark enables more accurate AI for music streaming recommendations, content tagging, and audio analysis tools.