Research & Papers

Researchers use LLaMa-3 and Qwen2.5 to boost music recommendation Recall by up to 95%

A new multimodal framework fuses audio, lyrics, and listening data, improving Recall by up to 95% over ID-only baselines.

Deep Dive

Traditional music recommendation systems treat songs as opaque tokens, relying on collaborative filtering and ignoring the actual content. A new paper from researchers at Adobe Research and academic institutions tackles this with a multimodal framework that jointly models audio, lyrics, and user engagement signals within an LLM-based sequential reasoning architecture. The team enriched the LastFM-1K dataset with three complementary inputs: audio and lyric embeddings from pretrained models; semantic metadata following the MGPHot annotation schema, generated by LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in zero-shot and fine-tuned settings; and listening completion ratios. They extended the E4SRec framework, testing multiple item ID encoder backbones including SASRec, BERT4Rec, and GRU4Rec, and integrated LLMs for both feature generation and reasoning.
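To make the three inputs concrete, here is a minimal Python sketch of how a listening completion ratio might be computed and combined with content embeddings. The summary does not say how the authors define the ratio or fuse the features, so everything here is an assumption for illustration: the `ListenEvent` type, the clipping to [0, 1], and the plain concatenation in `fuse_item_features` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ListenEvent:
    """One play of a track by a user (hypothetical schema, not from the paper)."""
    track_id: str
    played_seconds: float
    track_seconds: float


def completion_ratio(event: ListenEvent) -> float:
    """Fraction of the track that was played, clipped to [0, 1].

    Assumption: the ratio is played time over track length. Tracks with
    unknown or zero length are treated as not completed.
    """
    if event.track_seconds <= 0:
        return 0.0
    ratio = event.played_seconds / event.track_seconds
    return max(0.0, min(1.0, ratio))


def fuse_item_features(
    audio_emb: Sequence[float],
    lyric_emb: Sequence[float],
    ratio: float,
) -> List[float]:
    """Naive fusion: concatenate both embeddings and append the ratio.

    This is only the simplest possible baseline, not the paper's method.
    """
    return list(audio_emb) + list(lyric_emb) + [ratio]


event = ListenEvent(track_id="t1", played_seconds=90.0, track_seconds=180.0)
features = fuse_item_features([0.1, 0.2], [0.3], completion_ratio(event))
```

Concatenation is the kind of naive fusion that, according to the paper, does not always yield additive gains, so a real system would likely need a more careful integration step.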

The experimental results are striking: integrating content-based features improved Recall by up to 95% and NDCG by up to 79% over ID-only baselines. However, the team also found that naive multimodal fusion does not always yield additive gains, highlighting the challenges of cross-modal integration. The work demonstrates that combining semantic, acoustic, and behavioral signals can significantly outperform traditional collaborative filtering, especially in cold-start and sparse interaction scenarios. As a contribution to the field, the authors are releasing a large-scale multimodal benchmark for music recommendation, which includes the enriched dataset and evaluation protocols. This research points toward a future where streaming services can recommend songs based on what they actually sound and feel like, not just what similar users listened to.
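For readers unfamiliar with the metrics, the sketch below shows one common way Recall@K and NDCG@K are computed for next-item recommendation, where each test case has a single held-out target, and how a relative improvement figure is derived. The summary does not state the paper's cutoff K or exact evaluation protocol, so these definitions are assumptions, and the numbers in the example are made up for illustration, not taken from the paper.

```python
import math
from typing import Sequence


def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """NDCG with a single relevant item: 1 / log2(rank + 1), rank from 1.

    The ideal DCG is 1.0 (target at rank 1), so no extra normalization
    is needed in this single-target setting.
    """
    for index, item in enumerate(ranked[:k]):
        if item == target:
            return 1.0 / math.log2(index + 2)
    return 0.0


def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Hypothetical ranking and scores, for illustration only.
ranking = ["song_a", "song_b", "song_c"]
hit = recall_at_k(ranking, "song_b", k=2)
gain = ndcg_at_k(ranking, "song_b", k=2)
uplift = relative_improvement(new=0.195, baseline=0.10)
```

With these made-up scores, a baseline Recall of 0.10 rising to 0.195 is a 95% relative improvement, which is how a headline figure like this is typically read: a gain relative to the baseline, not an absolute Recall of 95%.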

Key Points
  • Framework enriches LastFM-1K with audio/lyric embeddings, LLM-generated metadata (LLaMa-2-13B, Qwen2.5-7B-Instruct, LLaMa-3-70B), and listening completion ratios.
  • Achieved up to 95% improvement in Recall and 79% in NDCG over ID-only baselines using E4SRec with SASRec, BERT4Rec, and GRU4Rec backbones.
  • Research team releases a large-scale multimodal benchmark for music recommendation to enable further community work.

Why It Matters

Smarter music recommendations that understand song content could transform streaming personalization and discovery.