Microsoft's HARNESS-LM cuts search AI latency by 27x while keeping 98% of teacher precision

arXiv cs.IR May 25, 2026

⚡ New recipe distills an 8B-parameter teacher into a 190M-parameter student for Bing Ads.

Deep Dive

Microsoft researchers on the Bing Ads team have developed HARNESS-LM (HLM), a three-phase training recipe that transfers the capabilities of large-scale sponsored search retrievers into compact, cost-efficient models. Phase one fine-tunes a billion-parameter-scale SLM (e.g., Qwen3-Embedding-8B) as a high-performance teacher. Phase two aligns the query representations of a student encoder with under 600M parameters to the teacher's via an L2 distillation objective. Phase three applies contrastive refinement to optimize the student for retrieval performance. The team also conducted an empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, and architecture.
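The two student-side objectives named above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: it assumes the alignment loss is a mean squared L2 distance between student and teacher query embeddings of the same dimensionality, and that contrastive refinement uses an in-batch InfoNCE-style loss with a temperature. The function names and the temperature value are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def l2_alignment_loss(student_queries, teacher_queries):
    """Phase-two sketch: mean squared L2 distance between student and
    teacher query embeddings (assumes equal embedding dimensionality)."""
    total = 0.0
    for s, t in zip(student_queries, teacher_queries):
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(student_queries)


def contrastive_loss(queries, ads, temperature=0.05):
    """Phase-three sketch: in-batch InfoNCE. Ad i is treated as the positive
    for query i; every other ad in the batch is a negative.
    The temperature value is an assumption, not taken from the source."""
    q = [l2_normalize(v) for v in queries]
    a = [l2_normalize(v) for v in ads]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(x * y for x, y in zip(qi, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp over all ads in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy two-dimensional embeddings, purely for illustration.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.9, 0.1], [0.1, 0.8]]
print(l2_alignment_loss(student, teacher))
print(contrastive_loss(student, teacher))
```

In this sketch the alignment loss pulls each student query embedding toward the teacher's embedding of the same query, while the contrastive loss only cares about ranking the matching ad above the other ads in the batch.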

On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. An online A/B test on Bing Ads further validated the approach: the deployed 190M-parameter model produced a +1% Revenue, +0.6% Impressions, and +0.4% Click uplift over the existing ensemble of retrievers. The result suggests that a relatively lightweight model can match or exceed much larger systems in production, which makes such models well suited to high-throughput, latency-sensitive sponsored search environments.
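The headline figures are all simple ratios, which the short sketch below makes explicit. The parameter counts are the nominal sizes quoted in the article; the precision values passed to `recovery` are hypothetical placeholders chosen only to show the calculation, not measurements from the paper.

```python
def ratio(larger, smaller):
    """How many times larger one quantity is than another."""
    return larger / smaller


def recovery(student_metric, reference_metric):
    """Fraction of the reference model's metric that the student retains."""
    return student_metric / reference_metric


# Nominal model sizes quoted in the article.
TEACHER_PARAMS = 8e9
STUDENT_PARAMS = 190e6
compression = ratio(TEACHER_PARAMS, STUDENT_PARAMS)
print(f"Student is about {compression:.0f}x smaller than the teacher")

# Hypothetical precision values, for illustration only.
example_recovery = recovery(student_metric=0.49, reference_metric=0.50)
print(f"Example precision recovery: {example_recovery:.0%}")
```

With the nominal sizes, the 190M-parameter student comes out at roughly 42x smaller than the 8B teacher, consistent with the "40x smaller" figure cited for the deployed model.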

Key Points

HARNESS-LM (HLM) is a three-phase distillation framework: teacher fine-tuning on a billion-parameter SLM, L2-based query alignment to a <600M student, and contrastive refinement.
The student model recovers 98%+ of teacher precision while achieving 27x lower latency and 20x higher throughput on NVIDIA A100 GPUs compared to the 8B teacher.
Online A/B test on Bing Ads with a 190M parameter HLM model yielded +1% Revenue, +0.6% Impressions, and +0.4% Click uplift over the production ensemble.

Why It Matters

Shows that a model roughly 40x smaller (190M vs. 8B parameters) can retain over 98% of the teacher's retrieval precision, cutting infrastructure costs and latency for real-time ad search.
