BioELX boosts cross-lingual biomedical entity linking by 19.2 Recall@1 points
New zero-shot framework raises Recall@1 on Turkish, Korean, and Thai biomedical mentions by more than 20 points.
Biomedical entity linking (BEL) is critical for clinical NLP, but annotated training data is scarce for most languages. Existing systems such as SapBERT rely on English-dominated aliases and often fail on non-English mentions. BioELX addresses this with a two-stage zero-shot approach. First, it augments SapBERT training with multilingual aliases from Wikidata, so the retriever can handle mentions in many languages. Second, a frozen LLM ranker (e.g., Llama or GPT variants) scores candidate–context pairs, disambiguating between candidates without supervised fine-tuning. Because neither stage requires annotated examples in the target language, BioELX can in principle be applied to any language that has a biomedical KB.
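The retrieve-then-rank flow can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the BioELX implementation: the character-trigram `embed` function stands in for a Wikidata-augmented SapBERT encoder, `keyword_scorer` stands in for a prompt to a frozen LLM, and the `KB` and `DESCRIPTIONS` entries and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Callable

# Hypothetical mini knowledge base: concept ID -> multilingual aliases.
KB = {
    "C_DIABETES": ["diabetes mellitus", "diyabet", "당뇨병"],
    "C_HYPERTENSION": ["hypertension", "hipertansiyon", "고혈압"],
    "C_DIARRHEA": ["diarrhea", "ishal", "설사"],
}

# Hypothetical descriptions, used only by the placeholder scorer below.
DESCRIPTIONS = {
    "C_DIABETES": "chronic disease with high blood sugar",
    "C_HYPERTENSION": "persistently raised arterial pressure",
    "C_DIARRHEA": "frequent loose or watery stools",
}


def embed(text: str) -> Counter:
    """Stand-in for a neural encoder: a bag of character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(mention: str, k: int = 2) -> list[str]:
    """Stage 1: score each concept by its best-matching alias, keep the top k."""
    query = embed(mention)
    scored = [
        (max(cosine(query, embed(alias)) for alias in aliases), concept_id)
        for concept_id, aliases in KB.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [concept_id for _, concept_id in scored[:k]]


Scorer = Callable[[str, str, str], float]


def keyword_scorer(mention: str, context: str, concept_id: str) -> float:
    """Placeholder for the frozen LLM: word overlap between context and description."""
    context_words = set(context.lower().split())
    return float(len(context_words & set(DESCRIPTIONS[concept_id].split())))


def rerank(mention: str, context: str, candidates: list[str], scorer: Scorer) -> list[str]:
    """Stage 2: reorder the retrieved candidates by the scorer's judgment."""
    return sorted(candidates, key=lambda c: scorer(mention, context, c), reverse=True)


candidates = retrieve("diyabet", k=2)
best = rerank("diyabet", "patient has high blood sugar", candidates, keyword_scorer)[0]
print(candidates, best)
```

Because the ranker is passed in as a plain callable, the placeholder could be swapped for a function that prompts an actual LLM without touching the retrieval stage, which mirrors why the second stage needs no fine-tuning.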
Results are striking. On the XL-BEL benchmark covering 10 languages, BioELX improves Recall@1 by an average of 19.2 points over prior methods. Low-resource languages see the biggest jumps: Thai (+30.8), Korean (+22.1), Turkish (+21.6). Consistent gains also appear on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). The method uses no gold-standard annotated training data, only Wikidata aliases and LLM reasoning. This makes it practical for real-world clinical systems serving multilingual populations. Code and resources are promised upon publication.
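For readers unfamiliar with the metric, Recall@1 is the percentage of mentions whose top-ranked candidate is the correct concept, and the gains quoted above are differences in percentage points. The short sketch below shows the calculation; the function name and the four-mention dataset are invented for illustration and are not taken from the reported experiments.

```python
def recall_at_k(ranked_predictions: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Percentage of mentions whose gold concept appears in the top-k predictions."""
    if not gold or len(ranked_predictions) != len(gold):
        raise ValueError("need one non-empty, equally sized prediction list per gold label")
    hits = sum(1 for preds, answer in zip(ranked_predictions, gold) if answer in preds[:k])
    return 100.0 * hits / len(gold)


# Invented toy data: gold concept IDs for four mentions.
gold = ["C1", "C2", "C3", "C4"]
# Ranked candidate lists from two hypothetical systems.
baseline = [["C1", "C9"], ["C7", "C2"], ["C8", "C3"], ["C5", "C6"]]
improved = [["C1", "C9"], ["C2", "C7"], ["C3", "C8"], ["C5", "C4"]]

base_score = recall_at_k(baseline, gold)       # 1 of 4 correct at rank 1
new_score = recall_at_k(improved, gold)        # 3 of 4 correct at rank 1
gain_points = new_score - base_score           # difference in percentage points
print(base_score, new_score, gain_points)
```

Note that a gain in points is not the same as a relative error reduction: in this toy case the score rises by 50 points, while the error rate falls from 75% to 25%.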
- BioELX uses a two-stage pipeline: Wikidata-enriched SapBERT for candidate retrieval, then a pretrained LLM for context-aware ranking without any task-specific training data.
- Average Recall@1 rises by 19.2 points on the XL-BEL benchmark; low-resource languages see the largest gains: +30.8 (Thai), +22.1 (Korean), and +21.6 (Turkish).
- Consistent improvements across EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8) demonstrate broad cross-domain applicability.
Why It Matters
Enables accurate biomedical entity linking for low-resource languages, unlocking clinical NLP for underserved populations worldwide.