The Importance of Prompt Optimization for Error Detection in Medical Notes Using Language Models
New prompt optimization technique GEPA boosts GPT-5's medical error detection accuracy to 78.5%, nearing doctor-level performance.
A new research paper accepted at EACL HeaLing 2026 demonstrates that systematic prompt optimization can dramatically improve language models' ability to detect critical errors in medical documentation. Researchers from the University of St Andrews and the University of Edinburgh applied Genetic-Pareto (GEPA), an automatic prompt optimization method, to boost GPT-5's accuracy on the MEDEC medical error detection benchmark from 66.9% to 78.5%, a 17% relative improvement that approaches human doctor performance. The study tested both frontier models like GPT-5 and open-source alternatives including Qwen3-32B, showing consistent gains across model types and establishing new state-of-the-art results for this safety-critical healthcare application.
The technical advance centers on GEPA's ability to automatically evolve prompts: candidate prompts are mutated and evaluated, and instead of keeping only a single best candidate, the search maintains a Pareto frontier that trades accuracy off against other objectives, preserving diverse prompting strategies. This proved particularly effective for medical error detection, where subtle phrasing changes can mean the difference between catching dangerous documentation mistakes and missing them entirely. The implications are significant for healthcare systems burdened by administrative errors: optimized AI assistants could serve as first-line reviewers for clinical notes, flagging potential issues before they lead to treatment delays or patient harm. The researchers have made their code publicly available on GitHub, enabling further development and deployment in real-world medical settings where accurate documentation directly affects patient safety.
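To make the evolve-and-select idea concrete, here is a minimal sketch of a genetic-Pareto prompt search loop. Everything in it is an assumption for illustration: the seed prompts, the mutation snippets, and especially `evaluate`, which here is a toy stand-in for running the model over a labeled validation set of clinical notes. It is not the paper's implementation, only the general pattern of mutating prompts and keeping the Pareto front over two objectives (an accuracy proxy and a cost proxy).

```python
import random

random.seed(0)

# Toy seed prompts and mutation snippets (hypothetical, for illustration only).
BASE_PROMPTS = [
    "Find the error in this clinical note.",
    "Review the note and flag any medical error.",
    "You are a clinician. Identify the erroneous sentence, if any.",
]
MUTATIONS = [
    " Think step by step.",
    " Answer with the sentence ID or 'CORRECT'.",
    " Consider dosage, diagnosis, and treatment errors.",
]

def evaluate(prompt):
    """Toy multi-objective score: (accuracy proxy, cost proxy).

    A real system would run the LLM with this prompt over a
    validation set of notes and measure detection accuracy.
    """
    accuracy = 0.5 + 0.05 * sum(m in prompt for m in MUTATIONS)
    cost = len(prompt)  # shorter prompts are cheaper to run
    return accuracy, cost

def dominates(a, b):
    """True if score a Pareto-dominates score b."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b

def pareto_front(population):
    """Keep every prompt whose score is not dominated by another's."""
    scored = [(p, evaluate(p)) for p in population]
    return [p for p, s in scored
            if not any(dominates(t, s) for _, t in scored)]

def mutate(prompt):
    return prompt + random.choice(MUTATIONS)

def gepa_sketch(generations=5, pop_size=8):
    population = list(BASE_PROMPTS)
    for _ in range(generations):
        front = pareto_front(population)
        children = [mutate(random.choice(front)) for _ in range(pop_size)]
        population = front + children
    # Report the highest-accuracy survivor on the final front.
    return max(pareto_front(population), key=lambda p: evaluate(p)[0])

best = gepa_sketch()
```

The key design point is that `pareto_front` replaces a single argmax: prompts that are cheaper but slightly less accurate survive alongside the most accurate one, so the search does not collapse onto one phrasing early.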
- Genetic-Pareto (GEPA) optimization boosted GPT-5's medical error detection accuracy by 17% relative (66.9% → 78.5%) on the MEDEC benchmark
- Method also improved Qwen3-32B performance from 57.8% to 69.0%, showing effectiveness across both proprietary and open-source models
- Optimized models now approach human doctor performance for automated clinical text review, with code publicly released on GitHub
Why It Matters
Automated error detection at near-doctor accuracy could prevent treatment mistakes and save lives by catching documentation errors before they harm patients.