Deep reflective reasoning in interdependence-constrained structured data extraction from clinical notes for digital health
A new LLM agent framework iteratively self-critiques medical data extraction, fixing inconsistent outputs.
A research team from UT Southwestern and other institutions has introduced a novel AI framework called 'deep reflective reasoning' to address a critical flaw in how Large Language Models (LLMs) are used in healthcare. Current LLM pipelines often produce clinically inconsistent data when extracting structured information from doctors' notes, failing to capture the logical dependencies between variables (e.g., a tumor's size constrains its possible stage). The new method transforms an LLM into an agent that performs iterative self-critique, checking its proposed extractions for consistency with the source text, domain knowledge, and other extracted variables, and revising until the outputs converge.
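The propose-critique-revise loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `propose` and `revise` functions are hypothetical stand-ins for LLM calls (stubbed here with fixed values so the loop is runnable), and the single consistency rule uses a simplified size cutoff for pT1 lung tumors.

```python
# Minimal sketch of a reflective extraction loop: propose an extraction,
# critique it against an interdependence constraint, and revise until
# the outputs converge. LLM calls are stubbed out for illustration.

def consistent(record):
    """Check one interdependence constraint: tumor size bounds the pT stage.

    Simplified rule: pT1 lung tumors are <= 3 cm. Real systems would check
    many such constraints drawn from domain knowledge and the source text.
    """
    if record["pT"] == "pT1" and record["tumor_size_cm"] > 3.0:
        return False
    return True

def propose(note):
    """Hypothetical stand-in for a first-pass LLM extraction.

    Returns a deliberately inconsistent record (4.2 cm tumor staged pT1)
    to exercise the critique-and-revise loop below.
    """
    return {"tumor_size_cm": 4.2, "pT": "pT1"}

def revise(record, critique):
    """Hypothetical stand-in for an LLM revision step guided by a critique."""
    fixed = dict(record)
    fixed["pT"] = "pT2"  # re-derive the stage from the extracted size
    return fixed

def reflective_extract(note, max_rounds=5):
    """Iterate self-critique until the record satisfies all constraints."""
    record = propose(note)
    for _ in range(max_rounds):
        if consistent(record):
            break  # converged: no remaining constraint violations
        record = revise(record, critique="pT inconsistent with tumor size")
    return record

print(reflective_extract("…clinical note text…"))
```

In a real pipeline the critique step would also compare each candidate value against the note's wording and against the other extracted variables, and the loop would terminate either on convergence or after a bounded number of revision rounds.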
The team rigorously evaluated the framework across three complex oncology tasks. For colorectal cancer synoptic reporting, the average F1 score across eight categorical variables jumped from 0.828 to 0.911. In identifying immunostaining patterns for Ewing sarcoma, accuracy rose from 87.0% to 92.7%. Most notably, for the challenging task of lung cancer tumor staging (pT/pN), accuracy improved from 68% to 83%, with the pN (node) staging component rising from 88.5% to 94.8%. These gains demonstrate the system's ability to enforce clinical logic where standard one-pass LLM extraction fails.
This work represents a significant shift from treating LLMs as simple extractors to using them as reasoning agents with a feedback loop. By systematically resolving inconsistencies, the framework moves beyond raw performance metrics to address the reliability and clinical validity of the extracted data. The resulting high-quality, machine-operable datasets are essential for downstream digital health applications, including training more accurate diagnostic models and enabling large-scale clinical research.
- Framework improved lung cancer tumor staging accuracy by 15 percentage points, from 68% to 83%.
- Boosted F1 score for colorectal cancer synoptic reporting from 0.828 to 0.911 across eight variables.
- Uses an LLM agent that performs iterative self-critique against domain knowledge to ensure clinical consistency.
Why It Matters
Enables reliable automation of clinical data entry, creating cleaner datasets for research and AI-driven diagnostics.