Audio & Speech

Evaluating Pretrained General-Purpose Audio Representations for Music Genre Classification

New research shows self-supervised BYOL-A embeddings outperform PANNs and VGGish for music genre classification.

Deep Dive

A new study presented at the International Conference on Pattern Recognition and Machine Intelligence (PReMI) 2025 provides a comprehensive benchmark for using pretrained, general-purpose audio representation models to classify music genres. Researchers Kashish Rai and Mrinmoy Bhattacharjee systematically evaluated embeddings from self-supervised learning models like BYOL-A against established models such as PANNs and VGGish. Their key finding is that BYOL-A embeddings, when fed to a custom deep neural network classifier, deliver superior performance, achieving 81.5% accuracy on the standard GTZAN dataset and 64.3% on the more challenging FMA-Small dataset.
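
To make the embedding-plus-classifier pipeline concrete, here is a minimal sketch: a small PyTorch MLP trained on precomputed audio embeddings. The embedding dimension, layer sizes, and training hyperparameters are illustrative assumptions, not the architecture from the paper, and the embedding extraction step is assumed to have happened upstream.

```python
# Minimal sketch: train an MLP genre classifier on precomputed audio
# embeddings (e.g., from BYOL-A, PANNs, or VGGish). EMB_DIM, the layer
# sizes, and the hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

EMB_DIM = 2048   # assumed embedding size; match your encoder's output
N_CLASSES = 10   # GTZAN has 10 genres

class GenreClassifier(nn.Module):
    def __init__(self, emb_dim: int, n_classes: int):
        super().__init__()
        # A deep MLP head over fixed embeddings; deeper than a single
        # linear probe, which is the comparison the study draws.
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def train(embeddings: torch.Tensor, labels: torch.Tensor, epochs: int = 30):
    model = GenreClassifier(EMB_DIM, N_CLASSES)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    # drop_last avoids a size-1 batch, which BatchNorm cannot handle
    loader = DataLoader(TensorDataset(embeddings, labels),
                        batch_size=64, shuffle=True, drop_last=True)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```

The appeal of this setup is that only the lightweight head is trained; the pretrained encoder stays frozen, so no GPU-heavy fine-tuning is required.
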

The proposed DNN architecture was a significant factor, boosting accuracy by 10-16% compared to using basic linear classifiers on the same embeddings. The researchers also tackled the challenge of cross-dataset generalization by creating a unified 18-class label space from GTZAN and FMA-Small for joint training. While this caused a slight performance drop on GTZAN, it yielded comparable results on FMA-Small, demonstrating a more robust model. All scripts from this work are publicly available, offering a practical toolkit for developers and researchers looking to implement state-of-the-art audio classification without training models from scratch.
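
The 18-class figure matches the disjoint union of GTZAN's 10 genres and FMA-Small's 8, which suggests the unified space keeps every dataset-genre pair distinct rather than merging similarly named genres. The dataset-prefixing scheme below is an assumption for illustration, not the paper's exact mapping.

```python
# Sketch of a unified label space for joint training. GTZAN has 10
# genres and FMA-Small has 8; 10 + 8 = 18 matches the paper's count,
# so this sketch assumes a disjoint union with no genre merging.
GTZAN_GENRES = ["blues", "classical", "country", "disco", "hiphop",
                "jazz", "metal", "pop", "reggae", "rock"]
FMA_SMALL_GENRES = ["Electronic", "Experimental", "Folk", "Hip-Hop",
                    "Instrumental", "International", "Pop", "Rock"]

# Prefix each label with its dataset so, e.g., GTZAN "pop" and
# FMA-Small "Pop" remain distinct classes in the joint space.
UNIFIED = ([f"gtzan/{g}" for g in GTZAN_GENRES] +
           [f"fma/{g}" for g in FMA_SMALL_GENRES])
LABEL_TO_ID = {label: i for i, label in enumerate(UNIFIED)}

def to_unified_id(dataset: str, genre: str) -> int:
    """Map a (dataset, genre) pair to its id in the 18-class space."""
    return LABEL_TO_ID[f"{dataset}/{genre}"]

assert len(UNIFIED) == 18
print(to_unified_id("gtzan", "jazz"))  # 5
print(to_unified_id("fma", "Rock"))    # 17
```
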

Key Points
  • BYOL-A embeddings outperformed PANNs and VGGish, scoring 81.5% accuracy on GTZAN and 64.3% on FMA-Small.
  • A custom deep neural network (DNN) classifier provided a 10-16% accuracy boost over standard linear classifiers.
  • The study addressed cross-dataset challenges by unifying GTZAN and FMA-Small into an 18-class label space for joint training.

Why It Matters

This benchmark enables more accurate AI for music streaming recommendations, content tagging, and audio analysis tools.