Research & Papers

Synthetic Data Powers Product Retrieval for Long-tail Knowledge-Intensive Queries in E-commerce Search

A new framework uses AI-generated training data to improve product retrieval for complex, niche queries by 40%.

Deep Dive

A research team led by Gui Ling has published a paper describing a framework that uses synthetic data to improve e-commerce search for complex, niche queries. The core problem is that existing product retrieval systems struggle with "long-tail, knowledge-intensive" queries: those that use diverse language, lack explicit purchase intent, and require domain knowledge to interpret. Such queries are hard to optimize for because there is little reliable user behavioral data to train on. The team's solution is an efficient data synthesis framework designed to implicitly transfer the capabilities of a powerful, offline large language model (LLM) into a lean, online retrieval system.

The technical approach uses an LLM with strong language understanding as a multi-candidate query rewriting model. This model is trained with multiple reward signals, and its rewriting capability is captured in curated synthetic query-product pairs produced by an offline retrieval pipeline. The design specifically mitigates the "distributional shift" problem common in synthetic data, where generated queries drift from real user intent and introduce irrelevant products. Simply adding this high-quality synthetic data to the retrieval model's training set yields significant performance gains. Online human evaluations (side-by-side, or SBS, tests) show a notable improvement in the user search experience, suggesting the method is effective where traditional log-based training falls short.
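
The offline flow described above (rewrite a hard query into several candidates, retrieve products for each candidate, keep only confident matches, and attach them to the original query) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy catalog, the canned `rewrite_candidates` function standing in for the LLM rewriter, the token-overlap scorer standing in for a real retriever, and the `min_score` threshold standing in for whatever curation the paper applies. It is not the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    pid: str
    title: str


# Hypothetical toy catalog; a real system would query a product index.
CATALOG = [
    Product("p1", "waterproof trail running shoes"),
    Product("p2", "cast iron skillet 12 inch"),
    Product("p3", "blue light blocking glasses"),
    Product("p4", "ergonomic office chair lumbar support"),
]


def rewrite_candidates(query: str) -> list[str]:
    """Stand-in for the LLM rewriter: returns several candidate rewrites."""
    canned = {
        "eyes hurt after staring at screen all day": [
            "blue light blocking glasses",
            "screen glasses",
            "eye drops",  # a rewrite with no good match in the catalog
        ]
    }
    return canned.get(query, [])


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens; a toy relevance score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def offline_retrieve(rewrite: str, catalog: list[Product], top_k: int = 1):
    """Return the top_k (product, score) pairs for one rewritten query."""
    scored = [(p, token_overlap(rewrite, p.title)) for p in catalog]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def synthesize_pairs(query: str, catalog: list[Product], min_score: float = 0.5):
    """Build (original query, product id) pairs from confident retrievals."""
    pairs = set()
    for rewrite in rewrite_candidates(query):
        for product, score in offline_retrieve(rewrite, catalog):
            if score >= min_score:  # assumed curation rule: drop weak matches
                pairs.add((query, product.pid))
    return sorted(pairs)


if __name__ == "__main__":
    log_pairs = [("cast iron pan", "p2")]  # hypothetical behavioral-log pairs
    synthetic = synthesize_pairs("eyes hurt after staring at screen all day", CATALOG)
    training_pairs = log_pairs + synthetic  # synthetic data joins the training set
    print(training_pairs)
```

Note that each pair keeps the original user query rather than the rewrite. That is how the online retrieval model can learn the mapping without the LLM being called at serving time.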

Key Points
  • Framework uses LLMs to generate synthetic training data for e-commerce retrieval models, addressing the data scarcity problem for niche queries.
  • Mitigates 'distributional shift' in synthetic queries so that they neither reduce recall nor introduce irrelevant products.
  • Online human evaluation (SBS) shows a measurable improvement in user search experience without relying on additional behavioral logs.

Why It Matters

This could help e-commerce platforms better serve customers with specific, complex needs, which in turn may improve conversion rates and customer satisfaction.