Microsoft's HARNESS-LM cuts search AI latency by up to 27x while keeping 98% of precision
New recipe distills 8B-parameter models into a 190M-parameter student for Bing Ads.
Microsoft researchers on the Bing Ads team have developed HARNESS-LM (HLM), a three-phase training recipe that transfers the capabilities of large-scale sponsored search retrievers into compact, cost-efficient models. Phase one fine-tunes a billion-parameter SLM (e.g., Qwen3-Embedding-8B) as a high-performance teacher. Phase two trains a student encoder with under 600M parameters to match the teacher's query representations via an L2 distillation objective. Phase three applies contrastive refinement to optimize the student for retrieval performance. The team also ran an empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, and architecture.
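To make phases two and three concrete, here is a minimal NumPy sketch of the two kinds of objectives involved. It is an illustration under stated assumptions, not the team's implementation: the function names, the in-batch InfoNCE form of the contrastive loss, the `temperature` value, and the assumption that student and teacher embeddings share the same dimensionality are all hypothetical choices made for this example.

```python
import numpy as np


def l2_alignment_loss(student_q, teacher_q):
    """Phase-two style objective: mean squared L2 distance between
    student and teacher query embeddings (same shape assumed)."""
    diff = student_q - teacher_q
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def info_nce_loss(query_emb, ad_emb, temperature=0.05):
    """Phase-three style objective (assumed form): in-batch contrastive
    loss where row i of ad_emb is the positive for query i and all
    other rows serve as negatives."""
    # Normalize so the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    a = ad_emb / np.linalg.norm(ad_emb, axis=1, keepdims=True)
    logits = (q @ a.T) / temperature
    # Subtract the row max for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for each query sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8))                  # stand-in teacher query embeddings
    student = teacher + 0.1 * rng.normal(size=(4, 8))  # student that is close to the teacher
    ads = teacher + 0.1 * rng.normal(size=(4, 8))      # stand-in positive ad embeddings
    print("L2 alignment loss:", l2_alignment_loss(student, teacher))
    print("Contrastive loss:", info_nce_loss(student, ads))
```

The alignment loss only needs teacher outputs for queries, which is why it can pull a small query encoder toward the teacher's embedding space before any retrieval-specific tuning; the contrastive loss then sharpens the ranking of matching ads against in-batch negatives.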
On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. An online A/B test on Bing Ads further validated the approach: the deployed 190M-parameter model produced uplifts of +1% in revenue, +0.6% in impressions, and +0.4% in clicks over the existing ensemble of retrievers. The results suggest that relatively lightweight models can match or exceed much larger systems in production, which makes them well suited to high-throughput, latency-sensitive sponsored search.
- HARNESS-LM (HLM) is a three-phase distillation framework: teacher fine-tuning on a billion-parameter SLM, L2-based query alignment to a <600M student, and contrastive refinement.
- The student model recovers 98%+ of teacher precision while achieving 27x lower latency and 20x higher throughput on NVIDIA A100 GPUs compared to the 8B teacher.
- Online A/B test on Bing Ads with a 190M parameter HLM model yielded +1% Revenue, +0.6% Impressions, and +0.4% Click uplift over the production ensemble.
Why It Matters
Shows that a model roughly 40x smaller than its teacher (190M vs. 8B parameters) can recover over 98% of retrieval precision, cutting infrastructure costs and latency for real-time ad search.