Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
New benchmark pushes AI beyond single-track analysis to compare 12,173 music pairs with three question types.
A research team from institutions including KAIST and the University of Illinois Urbana-Champaign has published a new benchmark called Jamendo-MT-QA, designed to push music AI beyond simple track identification. The dataset addresses a critical gap: while current models can answer questions about a single song, real-world music discussion is inherently comparative. The team built upon the existing Jamendo-QA dataset, using an LLM-assisted pipeline to generate and filter 36,519 high-quality comparative question-answer items across 12,173 distinct pairs of Creative Commons-licensed music tracks from Jamendo.
The benchmark systematically evaluates three types of reasoning: yes/no questions, short-answer queries, and complex sentence-level explanations. For example, a model might be asked to compare the tempo or mood of two tracks. The researchers have already used the dataset to benchmark representative audio-language models, employing both automatic metrics and an LLM-as-a-Judge evaluation method. This provides a standardized, challenging testbed for the next generation of music understanding AI.
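The paper's actual data schema is not reproduced here, but a hypothetical sketch helps illustrate how one multi-track item could pair two tracks with the three question types (all field names and example answers below are illustrative assumptions, not the dataset's real format):

```python
# Hypothetical sketch of one Jamendo-MT-QA item; field names and
# answers are illustrative assumptions, not the dataset's schema.
qa_item = {
    "track_pair": ("track_0001", "track_0002"),  # two Jamendo track IDs
    "questions": [
        {"type": "yes_no",
         "question": "Is the first track faster than the second?",
         "answer": "yes"},
        {"type": "short_answer",
         "question": "Which track has a calmer mood?",
         "answer": "the second track"},
        {"type": "sentence",
         "question": "How do the two tracks differ in instrumentation?",
         "answer": "The first track is guitar-driven, while the second "
                   "relies mainly on synthesizers and electronic drums."},
    ],
}

# The three question types map onto increasingly open-ended outputs.
types = [q["type"] for q in qa_item["questions"]]
print(types)  # ['yes_no', 'short_answer', 'sentence']
```

The progression from closed yes/no answers to free-form sentences is what lets the benchmark probe both constrained and open-ended comparative reasoning.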
By moving from single-track to multi-track analysis, Jamendo-MT-QA sets a new direction for the field. It forces models to develop deeper relational understanding, a capability essential for building AI that can curate playlists, recommend music based on nuanced comparisons, or even assist in music production and analysis. The benchmark's scale and structured question types make it a valuable tool for researchers aiming to close the gap between AI's current capabilities and human-like musical discourse.
- Contains 36,519 comparative QA items built from 12,173 pairs of music tracks.
- Systematically tests three question types: yes/no, short-answer, and sentence-level explanations.
- Uses an LLM-assisted pipeline for generation and benchmarks representative audio-language models with LLM-as-a-Judge evaluation.
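The paper does not spell out its exact LLM-as-a-Judge protocol here, but a minimal sketch shows how such evaluation loops are commonly structured: a judge prompt combines the question, reference, and model prediction, and an LLM returns a score. The prompt template, 1-5 scale, and `call_judge_llm` function below are all assumptions standing in for a real LLM API call:

```python
# Minimal LLM-as-a-Judge sketch. `call_judge_llm` is a hypothetical
# stand-in for any real LLM API; the prompt wording and the 1-5 scale
# are illustrative assumptions, not the paper's actual protocol.
def build_judge_prompt(question: str, reference: str, prediction: str) -> str:
    return (
        "Rate the candidate answer against the reference on a 1-5 scale.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {prediction}\n"
        "Reply with a single integer."
    )

def evaluate(items, predict, call_judge_llm) -> float:
    """Return the mean judge score over (question, reference) pairs."""
    scores = []
    for question, reference in items:
        prediction = predict(question)
        reply = call_judge_llm(build_judge_prompt(question, reference, prediction))
        scores.append(int(reply.strip()))
    return sum(scores) / len(scores)

# Toy usage with stubbed model and judge functions:
items = [("Is track A faster than track B?", "yes")]
avg = evaluate(items, predict=lambda q: "yes",
               call_judge_llm=lambda prompt: "5")
print(avg)  # 5.0
```

Judge-based scoring is useful here because the sentence-level answers are open-ended, where exact-match and n-gram metrics alone underrate valid paraphrases.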
Why It Matters
Enables development of AI that understands music relationally, powering better recommendations, curation, and analysis tools.