Research & Papers

SCALR: Synthetic Data from Cross-Domain Events Boosts Recommendation Systems

New framework generates synthetic user-item interactions to fight data sparsity in recommendations.

Deep Dive

Large-scale recommendation systems that operate across diverse domains face two persistent challenges: data sparsity and noisy implicit feedback. Traditional approaches rely on model-specific knowledge distillation from source to target domains, which often struggles to generalize. In a new paper, an industry research team introduces SCALR (Synthetic Cross-domain Augmentation and Learning for Recommendation), a framework inspired by the success of synthetic data generation in large language models. SCALR generates synthetic user-item interaction events for a target domain from events observed in a source domain, effectively translating cross-domain behavior into training data.

SCALR decomposes cross-domain learning into two modular stages. First, it translates observed source-domain user events into synthetic target-domain events, framing event generation as estimating the probability that a user would interact with a target-domain item, conditioned on that user's source-domain interactions. Second, downstream recommendation models train on these synthetic events in a model-agnostic manner, using them to augment the target domain's training data. The researchers report statistically significant improvements in online A/B tests on an industrial recommendation platform. To their knowledge, this is among the first works to explicitly frame cross-domain event transfer as synthetic data generation for recommendation systems, which opens a new avenue for tackling data scarcity without complex model-specific adaptations.
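
The two stages can be illustrated with a minimal Python sketch. It is not the paper's implementation: the summary does not say how SCALR estimates the interaction probability, so the sketch assumes a simple co-occurrence estimate computed from users seen in both domains. All function names, the threshold, and the toy data are hypothetical.

```python
from collections import defaultdict


def group_by_user(events):
    """Map each user to the set of items they interacted with."""
    grouped = defaultdict(set)
    for user, item in events:
        grouped[user].add(item)
    return grouped


def fit_transfer_scores(source_events, target_events):
    """Assumed estimator: P(target item | source item), counted over
    users who appear in both domains. A stand-in for whatever model
    the actual framework uses."""
    src, tgt = group_by_user(source_events), group_by_user(target_events)
    src_count = defaultdict(int)
    pair_count = defaultdict(int)
    for user in src.keys() & tgt.keys():
        for s in src[user]:
            src_count[s] += 1
            for t in tgt[user]:
                pair_count[(s, t)] += 1
    return {(s, t): c / src_count[s] for (s, t), c in pair_count.items()}


def generate_synthetic_events(source_events, target_events, scores, threshold):
    """Stage 1: emit (user, target_item, score) for unobserved pairs whose
    score, averaged over the user's source items, reaches the threshold."""
    src = group_by_user(source_events)
    observed = set(target_events)
    target_items = sorted({t for _, t in scores})
    synthetic = []
    for user in sorted(src):
        for t in target_items:
            p = sum(scores.get((s, t), 0.0) for s in src[user]) / len(src[user])
            if p >= threshold and (user, t) not in observed:
                synthetic.append((user, t, p))
    return synthetic


# Toy data: u3 has source-domain history but no target-domain history.
source_events = [("u1", "s_a"), ("u2", "s_a"), ("u3", "s_a")]
target_events = [("u1", "t_x"), ("u2", "t_x"), ("u2", "t_y")]

scores = fit_transfer_scores(source_events, target_events)
synthetic = generate_synthetic_events(source_events, target_events, scores, 0.6)

# Stage 2: any downstream recommender can train on the augmented event list;
# observed events get weight 1.0, synthetic ones carry their estimated score.
augmented = [(u, i, 1.0) for u, i in target_events] + synthetic
print(synthetic)
```

Because the synthetic events are plain (user, item, weight) rows, the downstream model needs no knowledge of how they were produced, which is the sense in which the second stage is model-agnostic.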

Key Points
  • SCALR translates source-domain user events into synthetic target-domain interactions by estimating interaction likelihood conditioned on source-domain behavior.
  • It takes a two-stage modular approach: event generation, then model-agnostic training on the synthetic data.
  • The researchers report statistically significant improvements in online A/B tests on an industrial recommendation platform.

Why It Matters

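SCALR tackles data sparsity in large-scale recommender systems with LLM-inspired synthetic data generation, and because downstream models train on the synthetic events in a model-agnostic way, it avoids complex model-specific adaptations.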