Retrieval-based legal annotation model beats GPT-5.2 with 20-30x less compute

arXiv cs.CL May 19, 2026

⚡ A k-nearest-neighbors alternative to generative LLMs that, by design, cannot assign labels outside the taxonomy.

Deep Dive

A new paper from Li Zhang, Jaromir Savelka, and Kevin Ashley proposes a retrieval-based framework for multi-label legal annotation that sidesteps the high costs and hallucination risks of generative LLMs. The method treats annotation as a retrieval task: both documents and label descriptions are embedded using a frozen Qwen-8B model, and labels are assigned via k-nearest neighbors in embedding space. When the label taxonomy changes, the system simply re-embeds and re-indexes the new descriptions, requiring no gradient-based retraining. Experiments on three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels) show the retrieval approach matches or exceeds the accuracy of fine-tuned encoders and large language models.
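The retrieval step described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the class `KnnLabelIndex`, the majority-style vote, and the `min_vote_share` threshold are assumptions made for the example, and the small hand-written vectors stand in for embeddings that the paper obtains from a frozen Qwen-8B model.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


class KnnLabelIndex:
    """Toy k-NN index over labeled embeddings (hypothetical helper, not from the paper)."""

    def __init__(self, embeddings, label_sets: list[set[str]]):
        # Building the index only stores normalized vectors; nothing is trained.
        self.embeddings = normalize(np.asarray(embeddings, dtype=float))
        self.label_sets = label_sets

    def predict(self, query, k: int = 3, min_vote_share: float = 0.5) -> set[str]:
        """Return labels carried by at least `min_vote_share` of the k nearest entries."""
        q = normalize(np.asarray(query, dtype=float).reshape(1, -1))[0]
        sims = self.embeddings @ q                  # cosine similarity to every entry
        k = min(k, len(self.label_sets))
        nearest = np.argsort(-sims)[:k]             # indices of the k most similar entries
        votes: dict[str, int] = {}
        for i in nearest:
            for label in self.label_sets[i]:
                votes[label] = votes.get(label, 0) + 1
        # Assumed voting rule: keep labels with enough support among the neighbors.
        return {label for label, n in votes.items() if n / k >= min_vote_share}


# Stand-in embeddings and label sets (illustrative values only).
index = KnnLabelIndex(
    embeddings=[[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
    label_sets=[{"Article 6"}, {"Article 6", "Article 8"}, {"Article 3"}, {"Article 3"}],
)
predicted = index.predict([1.0, 0.05, 0.0], k=2, min_vote_share=0.75)
```

Because every returned label is copied from an indexed entry, the output can never contain a label that is absent from the index. Under the same assumptions, a taxonomy change amounts to constructing a new `KnnLabelIndex` from the re-embedded entries, with no gradient updates.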

On Eurlex, retrieval achieved a 49.12 Macro-F1 score, outperforming GPT-5.2 zero-shot (40.41) while reducing estimated compute by 20–30 times compared to fine-tuning. The method is strongest in low-data regimes: with just 100 training examples on ECtHR-A, it nearly doubled Micro-F1 over hierarchical Legal-BERT (48.29 vs. 27.87, roughly 1.7 times). The paper also documents a critical failure mode in generative models: GPT-5.2 hallucinated labels outside the provided taxonomy in 0.12–0.9% of test cases, even under deterministic decoding. Retrieval, in contrast, strictly respects the label set by design, making it a safer, more extensible option for high-cardinality and rapidly evolving legal annotation tasks.
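To make the hallucination figure concrete, the snippet below shows one way such a rate could be measured. It is a sketch under assumptions: it counts a test sample as hallucinated if any of its predicted labels falls outside the taxonomy, which may differ from how the paper counts, and the function name and toy data are invented for illustration.

```python
def out_of_taxonomy_rate(predictions: list[set[str]], taxonomy: set[str]) -> float:
    """Fraction of samples whose predicted label set contains a label outside the taxonomy."""
    if not predictions:
        return 0.0
    # `labels - taxonomy` is non-empty exactly when some predicted label is unknown.
    flagged = sum(1 for labels in predictions if labels - taxonomy)
    return flagged / len(predictions)


# Toy data (illustrative only): the second sample contains the unknown label "D".
taxonomy = {"A", "B", "C"}
generated = [{"A"}, {"B", "D"}, {"C"}, set()]
rate = out_of_taxonomy_rate(generated, taxonomy)  # 1 of 4 samples flagged
```

A retrieval system that only copies labels from its index would score 0.0 on this check whenever the index is built from the same taxonomy.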

Key Points

Retrieval-based method using frozen Qwen-8B achieves 49.12 Macro-F1 on Eurlex, beating GPT-5.2 zero-shot (40.41), with an estimated 20-30x less compute than fine-tuning.
With only 100 training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87).
GPT-5.2 hallucinated labels outside the provided taxonomy in 0.12-0.9% of test samples, even under deterministic decoding; retrieval rules out such labels by design.

Why It Matters

A compute-efficient, hallucination-free alternative for legal annotation that adapts quickly to changing taxonomies.
