CBR-to-SQL: Rethinking Retrieval-based Text-to-SQL using Case-based Reasoning in the Healthcare Domain
New AI framework tackles messy medical jargon, achieving state-of-the-art logical form accuracy on the MIMICSQL benchmark.
A team of researchers has published a new paper, 'CBR-to-SQL: Rethinking Retrieval-based Text-to-SQL using Case-based Reasoning in the Healthcare Domain,' proposing a solution to a persistent bottleneck in medical research. Extracting data from complex Electronic Health Record (EHR) databases like MIMIC-IV requires SQL expertise, which most clinicians and researchers lack. Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) offer a promising path by translating plain-English questions into SQL, but they falter on the variability and noise of medical terminology. Standard RAG retrieves once from a static pool of examples, which often produces inaccurate queries or pushes practitioners toward bloated example sets that introduce problems of their own.
To address this, the researchers propose CBR-to-SQL, a framework inspired by Case-Based Reasoning (CBR). Instead of retrieving raw examples, it represents past question-SQL pairs as reusable, abstract case templates. Retrieval then runs in two stages: the first captures the logical structure of the new question, and the second resolves the specific medical entities it mentions. The approach proved more effective than standard RAG: evaluated on the MIMICSQL benchmark, CBR-to-SQL achieved state-of-the-art logical form accuracy and competitive execution accuracy.
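The two-stage idea can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the authors' implementation: the case base, the table and column names, the `[DIAGNOSIS]` and `[DRUG]` slot markers, and every function name are hypothetical, token-overlap (Jaccard) similarity and `difflib` fuzzy matching stand in for whatever retrievers the paper uses, and plain string substitution stands in for LLM-based SQL generation. The sketch also assumes the question has already been masked and its entity mentions extracted.

```python
from difflib import get_close_matches

# Hypothetical case base: question-SQL pairs with entities abstracted into slots.
# Table and column names are illustrative, not the MIMICSQL schema.
CASES = [
    {
        "question": "how many patients were diagnosed with [DIAGNOSIS]",
        "sql": "SELECT COUNT(DISTINCT subject_id) FROM diagnoses "
               "WHERE short_title = '[DIAGNOSIS]'",
    },
    {
        "question": "what is the average age of patients prescribed [DRUG]",
        "sql": "SELECT AVG(age) FROM prescriptions WHERE drug = '[DRUG]'",
    },
]

# Hypothetical canonical values stored in the database, grouped by slot.
ENTITY_VALUES = {
    "[DIAGNOSIS]": ["hypertension", "sepsis", "atrial fibrillation"],
    "[DRUG]": ["aspirin", "heparin", "metoprolol"],
}


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two whitespace-tokenized strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


def retrieve_template(masked_question: str) -> dict:
    """Stage 1: pick the case whose abstract question is structurally closest."""
    return max(CASES, key=lambda c: jaccard(masked_question, c["question"]))


def resolve_entity(mention: str, slot: str):
    """Stage 2: map a noisy mention to a canonical value, or None if no match."""
    matches = get_close_matches(mention.lower(), ENTITY_VALUES[slot], n=1, cutoff=0.6)
    return matches[0] if matches else None


def answer(masked_question: str, mentions: dict) -> str:
    """Fill the retrieved SQL template with resolved entity values."""
    sql = retrieve_template(masked_question)["sql"]
    for slot, mention in mentions.items():
        value = resolve_entity(mention, slot)
        if value is None:
            raise ValueError(f"could not resolve {mention!r} for slot {slot}")
        # Only values from the canonical list are substituted into the query.
        sql = sql.replace(slot, value)
    return sql


# A misspelled mention is still resolved to the canonical value in stage 2.
print(answer("how many patients had a diagnosis of [DIAGNOSIS]",
             {"[DIAGNOSIS]": "hypertention"}))
```

Separating the two stages means a misspelled or unusual clinical term does not affect which template is retrieved, since structure matching sees only the slot marker.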
Crucially, the framework demonstrates higher sample efficiency and robustness, performing well under data scarcity and retrieval perturbations, conditions where traditional methods struggle. This is a step toward democratizing access to vital healthcare data, allowing medical professionals to ask complex, ad-hoc questions of patient databases without writing SQL themselves.
- Replaces standard RAG with a Case-Based Reasoning (CBR) approach, using abstract templates and two-stage retrieval.
- Achieves state-of-the-art logical form accuracy on the MIMICSQL healthcare database benchmark.
- Demonstrates greater robustness and sample efficiency, holding up under limited training data and retrieval perturbations.
Why It Matters
Enables clinicians and researchers to query complex medical databases in plain English, accelerating healthcare insights without SQL expertise.