Airbnb uses LLMs to generate synthetic search queries for cold-start search, with a reported 7.5x gain over the InPars baseline
Airbnb's new method creates realistic search queries before any real user queries exist.
Deploying natural language search systems faces a severe cold-start challenge: no real user queries exist to learn linguistic patterns, and no relevance labels are available to train ranking models. Airbnb's research team presents a framework that leverages large language models (LLMs) to generate synthetic queries and labels, enabling model training and evaluation from day one. Their approach combines contrastive listing pairs from booking sessions with seed queries from user research to balance realism and diversity, allowing a smooth cold-to-warm start transition as real data becomes available.
For label generation, the team introduced contrastive generation, which produces topicality labels by construction, alongside Virtual Judge (VJ) labeling for broader coverage. Compared against baselines, the seed-guided approach tracked real user queries closely: its query length distribution had a KL divergence of 0.66 from real queries, versus 12.03 for the InPars baseline (reported as a 7.5x improvement), and its attribute type distribution reached a KL divergence of 0.04, lower even than that of the pure seed queries (0.09). The synthetic evaluation examples also proved harder (79% pairwise accuracy) than those from the no-seed baseline (97%), giving a clearer discriminative signal for model improvement. Airbnb now runs production pipelines that generate synthetic examples daily for embedding-based retrieval and ranking evaluation.
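To make the fidelity metric concrete, here is a minimal sketch of how a KL divergence between query length distributions could be computed. Everything in it is an assumption for illustration: the helper names `length_distribution` and `kl_divergence`, the word-count bucketing, the smoothing constant, the direction KL(real || synthetic), and the toy queries. It is not Airbnb's pipeline and will not reproduce the figures above.

```python
import math
from collections import Counter


def length_distribution(queries, max_len=10):
    """Return the share of queries at each word count 1..max_len.

    Lengths are clipped into [1, max_len] so the shares always sum to 1.
    The bucketing scheme is an illustrative assumption.
    """
    counts = Counter(min(max(len(q.split()), 1), max_len) for q in queries)
    total = sum(counts.values())
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats, with additive smoothing to avoid log(0) and division by zero."""
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p_s, q_s)
    )


# Hypothetical toy data, not taken from the study.
real = [
    "beach house in malibu with pool",
    "cabin near lake tahoe",
    "pet friendly apartment paris",
]
synthetic = [
    "oceanfront villa with a pool",
    "quiet cabin by the lake",
    "apartment in paris that allows dogs",
]

p = length_distribution(real)
q = length_distribution(synthetic)
print(f"KL(real || synthetic) = {kl_divergence(p, q):.3f}")
```

A lower value means the synthetic length distribution sits closer to the real one. The same function applies to attribute type distributions if each query is mapped to attribute categories instead of word counts.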
- Seed-guided synthetic queries matched the real query length distribution with a KL divergence of 0.66, versus 12.03 for the InPars baseline (a reported 7.5x improvement).
- Attribute type distributions from synthetic data matched real data with a KL divergence of 0.04, lower even than the seed queries (0.09).
- Generated evaluation examples were harder (79% pairwise accuracy) than those from the no-seed baseline (97%), providing stronger signal for model improvement.
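The pairwise accuracy figures can be read as follows: given a query and two contrasting listings, a model is counted correct when it scores the intended listing above the other. The sketch below assumes that definition, which the summary does not spell out. The `pairwise_accuracy` and `overlap_score` helpers and the toy pairs are hypothetical, and the token-overlap scorer merely stands in for a real ranking model.

```python
from typing import Callable, Iterable, Tuple

# (query, listing the query was written for, contrasting listing)
Pair = Tuple[str, str, str]


def pairwise_accuracy(pairs: Iterable[Pair], score: Callable[[str, str], float]) -> float:
    """Fraction of pairs where the intended listing outscores the contrasting one."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for query, pos, neg in pairs if score(query, pos) > score(query, neg))
    return correct / len(pairs)


def overlap_score(query: str, listing: str) -> float:
    """Stand-in scorer: number of lowercase tokens shared by query and listing."""
    return float(len(set(query.lower().split()) & set(listing.lower().split())))


# Hypothetical toy pairs, not taken from the study.
pairs = [
    # Easy: the query shares several words with the intended listing.
    ("cabin with hot tub", "rustic cabin with private hot tub", "city loft with gym"),
    # Hard: the intended listing is a paraphrase, so token overlap fails.
    ("beach house", "oceanfront cottage steps from the sand", "house in the city"),
]

print(f"pairwise accuracy = {pairwise_accuracy(pairs, overlap_score):.2f}")
```

An evaluation set on which a model already scores near 97% leaves little room to tell a better model from a worse one, whereas a set at 79% separates them more clearly.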
Why It Matters
The framework lets Airbnb train and evaluate natural language search from day one, before any real user queries exist, removing a critical cold-start bottleneck.