New Benchmark Tests 22 Patent Embedding Models on Key Tasks
113K patents and 46K queries reveal fine-tuning pitfalls across model scales
A new study posted to arXiv (May 2026) evaluates 22 embedding models for patent analysis, ranging from small encoders (22M parameters) to 12B-parameter instruction-tuned LLMs such as KaLM-Gemma3 and Qwen3 variants. The benchmark covers three tasks: citation-based retrieval, multi-label classification over five datasets, and unsupervised clustering. It uses a dataset of 113,148 WIPO assistive-technology patents and 46,069 citation-graph queries, plus the public DAPFAM set for external validation.
Key results: fine-tuning helps in-domain but can hurt performance on external patent landscapes, with a 55–65% drop on out-of-domain queries. Within a model family, scaling up (e.g., Qwen3 0.6B→8B) improves scores, but scaling across families is noisy: Qwen3-0.6B leads in clustering as measured by the Adjusted Rand Index (ARI), while the 12B KaLM-Gemma3 ranks only 8th in retrieval. The best text strategy is Title+Abstract+Claims, and aligning the abstract and claim views boosts retrieval nDCG@10 by up to 7.1%. Hybrid BM25-dense fusion adds only +0.002 to +0.015 nDCG@10, mostly helping weaker zero-shot models. All code and an evaluation framework are open-sourced.
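For readers unfamiliar with the retrieval metric quoted above, here is a minimal sketch of nDCG@10 in its standard textbook form. The function names, the binary relevance labels, and the optional `all_relevances` argument are illustrative assumptions, not taken from the study's evaluation framework.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (rank 1 is undiscounted)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """nDCG@k: DCG of the returned ranking divided by the DCG of an ideal ranking.

    ranked_relevances: relevance of each returned patent, in ranked order.
    all_relevances: relevance of every judged patent for the query (assumption:
        defaults to the returned list when no full judgment list is supplied).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents, so the score is defined as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical query: the only cited patent is returned at rank 2.
score = ndcg_at_k([0, 1, 0])
```

Because the score is bounded between 0 and 1, an absolute gain of +0.002 to +0.015 is a small shift on that scale.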
- Fine-tuning boosts in-domain scores but causes a 55–65% performance drop on out-of-domain patent queries
- Qwen3-0.6B outperforms much larger models (such as the 12B KaLM-Gemma3) in clustering (ARI), showing that scaling across model families is noisy
- Multi-view abstract-claim alignment improves retrieval nDCG@10 by up to 7.1%; hybrid BM25-dense fusion yields only marginal gains (+0.002 to +0.015)
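To illustrate what hybrid BM25-dense fusion means in practice, the sketch below merges a lexical ranking and a dense ranking with reciprocal rank fusion (RRF). This is an assumption for illustration: the summary does not say which fusion method the study uses, and the patent IDs and the constant `k=60` are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in;
    documents are then sorted by their summed score (ties broken by ID).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical top-3 results for one query from each retriever.
bm25_ranking = ["p1", "p2", "p3"]
dense_ranking = ["p2", "p3", "p1"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['p2', 'p1', 'p3']
```

When the dense ranking is already strong, the lexical list mostly reshuffles near-ties, which fits the reported pattern of fusion mainly helping weaker zero-shot models.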
Why It Matters
The benchmark helps IP professionals choose the right embedding model and avoid fine-tuning pitfalls in patent search and classification.