Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA
New benchmark shows domain-adapted LLMs are more likely to hallucinate when statutory evidence is missing.
A team of researchers has published a new paper, 'Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA,' introducing the SearchFireSafety benchmark. The benchmark addresses a significant gap in AI legal evaluation by shifting the focus from case law to statutes and regulations. The core challenge is that statutory evidence is often fragmented across a hierarchy of linked documents, creating a 'statutory retrieval gap': standard AI retrievers fail to assemble the relevant provisions, and models frequently hallucinate answers when given the resulting incomplete context.
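The paper does not reproduce its data model here, but the fragmentation problem is easy to picture: answering a question about one provision may require subsections and cross-referenced regulations that a flat retriever never surfaces. Below is a minimal sketch of that idea; the `Provision` structure, the `collect_context` traversal, and the fire-safety text are all hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch: statutory evidence fragmented across a hierarchy
# of cross-referenced provisions. None of this is taken from the paper.
from dataclasses import dataclass, field


@dataclass
class Provision:
    pid: str                                        # e.g. "Act-12(3)"
    text: str                                       # provision body
    children: list = field(default_factory=list)    # hierarchy links (section -> subsection)
    references: list = field(default_factory=list)  # cross-references to other provisions


def collect_context(provisions: dict, start_id: str) -> list:
    """Follow hierarchy and cross-reference links from a starting
    provision, gathering every fragment needed to answer about it."""
    seen, stack, context = set(), [start_id], []
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in provisions:
            continue
        seen.add(pid)
        node = provisions[pid]
        context.append(node.text)
        stack.extend(node.children + node.references)
    return context


provisions = {
    "Act-12(3)": Provision("Act-12(3)",
                           "Fire doors must meet the standards in Reg-7.",
                           references=["Reg-7"]),
    "Reg-7": Provision("Reg-7",
                       "Fire doors must resist fire for 60 minutes.",
                       children=["Reg-7(a)"]),
    "Reg-7(a)": Provision("Reg-7(a)",
                          "Exception: 30 minutes for buildings under 3 storeys."),
}
print(collect_context(provisions, "Act-12(3)"))
```

A retriever that scores each fragment independently against the query would likely return only the first provision; the link-following traversal also recovers the delegated standard and its exception, which is the intuition behind the graph-guided retrieval discussed next.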
SearchFireSafety is instantiated on fire-safety regulations as a representative case study. It employs a dual-source evaluation framework, combining real-world questions that require citation-aware retrieval with synthetic scenarios designed to stress-test a model's ability to refuse to answer when the statutory context is insufficient. Testing multiple large language models (LLMs), the researchers found that graph-guided retrieval methods substantially improve performance; in the process, they also uncovered a critical and counterintuitive safety issue.
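The paper's exact scoring protocol is not reproduced here; the following is a hedged sketch of what such a dual-source harness could look like. The `ask_model` callable, the item format, the substring-based answer check, and the keyword refusal heuristic are all assumptions for illustration.

```python
# Hedged sketch of a dual-source evaluation loop: answerable questions
# are scored for accuracy, while synthetic partial-context questions are
# scored on whether the model refuses instead of fabricating an answer.
# The refusal heuristic and item format below are illustrative assumptions.

REFUSAL_MARKERS = ("cannot answer", "insufficient", "not enough information")


def looks_like_refusal(answer: str) -> bool:
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)


def evaluate(items, ask_model):
    """items: dicts with 'question', 'context', 'answerable', and (for
    answerable items) a 'gold' answer. ask_model(question, context) -> str."""
    correct = answerable = hallucinated = unanswerable = 0
    for item in items:
        answer = ask_model(item["question"], item["context"])
        if item["answerable"]:
            answerable += 1
            # Substring match stands in for whatever answer scoring is used.
            correct += int(item["gold"].lower() in answer.lower())
        else:
            unanswerable += 1
            # A substantive answer on insufficient context counts as a
            # hallucination; a refusal counts as safe behavior.
            hallucinated += int(not looks_like_refusal(answer))
    return {
        "accuracy": correct / max(answerable, 1),
        "hallucination_rate": hallucinated / max(unanswerable, 1),
    }
```

On a harness like this, the paper's central finding would surface as domain-adapted models raising accuracy on the answerable items while also raising the hallucination rate on the synthetic partial-context items.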
The experiments revealed a troubling trade-off: models adapted or fine-tuned for the legal domain answer more accurately when they have the right information, but they also become significantly more likely to hallucinate plausible-sounding yet incorrect answers when key statutory evidence is missing. This finding challenges the assumption that domain specialization inherently improves safety, and it highlights a blind spot in current AI benchmarking for high-stakes fields like law and compliance.
The paper's findings underscore the urgent need for new benchmarks that jointly evaluate both hierarchical retrieval capability and model safety in regulatory settings. The authors argue that as AI is deployed for legal and compliance tasks, systems must be rigorously tested not just on their ability to find the right answer, but on their capacity to know when they don't have enough information to answer safely.
- Introduces SearchFireSafety, a new benchmark for evaluating AI on statute-centric legal questions and answers (QA), moving beyond traditional case-law focus.
- Reveals a critical safety trade-off: domain-adapted LLMs answer more accurately with the right evidence but are more prone to hallucination when key regulatory evidence is missing.
- Uses a dual evaluation framework with real-world questions and synthetic partial-context scenarios to stress-test retrieval and refusal behavior.
Why It Matters
This exposes a major safety flaw in specialized legal AI, showing that making models more knowledgeable can also make them more dangerously confident when wrong.