Research & Papers

CanLegalRAGBench exposes LLM hallucination risks in Canadian case law

New benchmark finds 8–29% of claims in legal AI answers are unsupported by retrieved documents.

Deep Dive

A team from the University of British Columbia and other institutions has released CanLegalRAGBench, a new benchmark designed to evaluate retrieval-augmented generation (RAG) systems on Canadian case law. Unlike prior benchmarks that rely on synthetic queries, this one uses realistic legal questions and expert-annotated answers grounded in real court decisions. The goal is to measure how well AI legal assistants retrieve and generate accurate, truthful information.

The evaluation yields several key findings. First, retrieval performance is highly sensitive to design choices (e.g., chunking strategy, embedding model). Second, open-source embedding models proved competitive with closed-source alternatives such as those from OpenAI. Third, automatic evaluation metrics sometimes penalize systems for retrieving alternative but still relevant documents, indicating a gap between automated scoring and human judgment.

More worryingly, generated answers frequently diverge from the gold standard, producing hallucinations, overly verbose responses, or irrelevant content. Across models, 8–29% of claims in generated answers were not supported by the retrieved documents. The authors hope CanLegalRAGBench will drive progress in making legal RAG systems safer and more reliable, especially for Canadian law, which has been underrepresented in existing benchmarks.
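
To make the 8–29% figure concrete, here is a minimal Python sketch of how an unsupported-claim rate could be computed once each claim in a generated answer has been labeled as supported or not by the retrieved documents. The summary above does not describe how the authors extract claims or judge support, so the `Claim` class, the `supported` label, and the `unsupported_rate` function are hypothetical names used only for illustration, and the example data is invented.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """One factual claim taken from a generated answer (hypothetical structure)."""
    text: str
    supported: bool  # assumed label: True if a retrieved document backs the claim


def unsupported_rate(claims: list[Claim]) -> float:
    """Return the fraction of claims that no retrieved document supports."""
    if not claims:
        return 0.0  # no claims means nothing to flag
    unsupported = sum(1 for claim in claims if not claim.supported)
    return unsupported / len(claims)


# Invented toy data: four placeholder claims, one of them unsupported.
claims = [
    Claim("placeholder claim A", supported=True),
    Claim("placeholder claim B", supported=True),
    Claim("placeholder claim C", supported=False),
    Claim("placeholder claim D", supported=True),
]

rate = unsupported_rate(claims)
print(f"Unsupported claims: {rate:.0%}")  # prints "Unsupported claims: 25%"
```

In this sketch the rate is computed per answer; a benchmark-level figure would presumably pool claims across all answers for a given model, but that aggregation is also an assumption here.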

Key Points
  • Benchmark uses realistic, expert-annotated queries grounded in Canadian case law
  • Open-source embedding models are competitive with closed-source ones on retrieval
  • 8–29% of claims in AI-generated answers are unsupported by retrieved documents, indicating persistent hallucination risk

Why It Matters

Reliable AI legal assistants depend on both accurate retrieval and answers grounded in the retrieved documents; this benchmark exposes gaps in both for Canadian case law.