Research & Papers

UCSD's Reddit2Deezer dataset brings 190k real music chat pairs

190k real Reddit conversations linked to Deezer music metadata for grounded CMR

Deep Dive

Conversational music recommendation (CMR) research has long been stuck between two imperfect options: authentic conversation corpora that are too small to scale, and large synthetic corpora that feel artificial. A new dataset from UCSD researchers Haven Kim and Julian McAuley aims to break that deadlock. Reddit2Deezer extracts 190k unique {thread, leaf-comment} pairs from real Reddit discussions about music, then links each mentioned song or artist to a Deezer identifier. This gives researchers instant access to audio previews and metadata like genre tags, popularity scores, and BPM, making the dataset suitable for content-grounded CMR systems.
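To make the shape of one such pair concrete, here is a minimal Python sketch of how a record might be modeled. The class and field names (ConversationPair, GroundedItem, deezer_id, and so on) are assumptions made for illustration, not the dataset's published schema, and the example values are placeholders rather than real data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GroundedItem:
    """A song or artist mentioned in a reply (hypothetical schema)."""
    mention: str                      # surface text as written in the comment
    deezer_id: int                    # Deezer identifier the mention was linked to
    kind: str                         # "track" or "artist"
    genres: List[str] = field(default_factory=list)
    bpm: Optional[float] = None       # may be missing for some items
    popularity: Optional[int] = None  # may be missing for some items


@dataclass
class ConversationPair:
    """One {thread, leaf-comment} pair with its linked items (hypothetical schema)."""
    thread: str                       # the request or discussion context
    leaf_comment: str                 # the reply that contains the recommendation
    items: List[GroundedItem] = field(default_factory=list)

    def grounded_ids(self) -> List[int]:
        """Return the Deezer IDs of every item linked in the reply."""
        return [item.deezer_id for item in self.items]


# Placeholder values for illustration only; not taken from the dataset.
example = ConversationPair(
    thread="Looking for mellow music for late-night drives",
    leaf_comment="Try 'Example Song' by Example Artist",
    items=[
        GroundedItem(
            mention="Example Song",
            deezer_id=123,
            kind="track",
            genres=["electronic"],
            bpm=98.0,
        )
    ],
)

print(example.grounded_ids())
```

Keeping the linked items separate from the conversation text means a model can be trained on the dialogue alone or on the dialogue plus item metadata, depending on whether content grounding is needed.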

The dataset comes in two versions: a raw version that preserves the original Reddit text for maximum authenticity, and a carefully paraphrased version designed to remove personally identifiable information while keeping musical intent intact, enabling long-term reproducibility without privacy concerns. A human validation study confirmed that the dialogues, the item grounding, and the paraphrases are all high quality. By providing a large-scale, reality-grounded resource, Reddit2Deezer opens the door to training recommendation models that understand how real people talk about music, and could lead to more natural voice assistants and playlist curators.

Key Points
  • 190,000 authentic Reddit thread-comment pairs extracted for conversational music recommendation
  • Each mentioned song or artist linked to a Deezer ID with rich metadata (genre, BPM, popularity) for content grounding
  • Two versions: raw (authentic) and paraphrased (reproducible, privacy-safe); human validation confirms quality

Why It Matters

Brings scalable, authentic training data to conversational AI for music discovery, giving recommendation models real examples of how people ask for and suggest music.