UCSD's Reddit2Deezer dataset brings 190k real music chat pairs

arXiv cs.IR May 12, 2026

⚡190k real Reddit conversations linked to Deezer music metadata for grounded CMR

Deep Dive

Conversational music recommendation (CMR) research has long been stuck between two imperfect options: authentic conversation corpora that are too small to scale, and large synthetic corpora that feel artificial. A new dataset from UCSD researchers Haven Kim and Julian McAuley aims to break that deadlock. Reddit2Deezer extracts 190k unique {thread, leaf-comment} pairs from real Reddit discussions about music, then links each mentioned song or artist to a Deezer identifier. Through that identifier, researchers can retrieve audio previews and metadata such as genre tags, popularity scores, and BPM, which makes the dataset suitable for content-grounded CMR systems.
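To make the {thread, leaf-comment} idea concrete, here is a minimal Python sketch of how such pairs could be pulled out of a comment tree. It is not the authors' pipeline. It assumes that a "leaf comment" is a comment with no replies and that its "thread" is the chain of messages from the original post down to the leaf's parent. The `Comment` class, the `leaf_pairs` function, and the sample texts are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comment:
    """Hypothetical record for one post or comment (not the dataset's schema)."""
    comment_id: str
    parent_id: Optional[str]  # None marks the original post
    text: str


def leaf_pairs(comments):
    """Return (thread, leaf_text) pairs, one per comment that has no replies.

    Assumption: `thread` is the list of texts from the original post down to
    the leaf's direct parent, in top-down order.
    """
    by_id = {c.comment_id: c for c in comments}
    has_replies = {c.parent_id for c in comments if c.parent_id is not None}

    pairs = []
    for c in comments:
        # Skip the post itself and any comment that somebody replied to.
        if c.parent_id is None or c.comment_id in has_replies:
            continue
        # Walk up the parent chain to collect the conversational context.
        thread = []
        node = by_id.get(c.parent_id)
        while node is not None:
            thread.append(node.text)
            node = by_id.get(node.parent_id) if node.parent_id else None
        thread.reverse()
        pairs.append((thread, c.text))
    return pairs


# Invented sample discussion, for illustration only.
example_comments = [
    Comment("p", None, "Looking for mellow jazz to study to"),
    Comment("c1", "p", "Try Kind of Blue by Miles Davis"),
    Comment("c2", "c1", "Seconded, and add some Bill Evans"),
    Comment("c3", "p", "Chet Baker Sings is great for this"),
]

pairs = leaf_pairs(example_comments)
for thread, leaf in pairs:
    print(thread, "->", leaf)
```

In this sketch the post with two reply branches yields two pairs, one per leaf. Linking the songs and artists mentioned in each pair to Deezer identifiers would be a separate step that is not shown here.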

The dataset comes in two versions: a raw version that preserves the original Reddit text for maximum authenticity, and a carefully paraphrased version designed to remove personally identifiable information while keeping musical intent intact, which enables long-term reproducibility without privacy concerns. A human validation study confirmed the quality of the dialogues, the item grounding, and the paraphrases. By providing a large-scale, reality-grounded resource, Reddit2Deezer opens the door to training recommendation models that understand how real people talk about music, and it could lead to more natural voice assistants and playlist curators.

Key Points

190,000 authentic Reddit thread-comment pairs extracted for conversational music recommendation
Each entity linked to Deezer IDs with rich metadata (genre, BPM, popularity) for content grounding
Two versions: raw (authentic) and paraphrased (reproducible, privacy-safe); human validation confirms quality

Why It Matters

Brings scalable, authentic training data for conversational AI in music discovery, advancing real-world recommendation systems.
