MedThink boosts small AI models' diagnostic accuracy by up to 12.7%
A two-stage teacher-student method that teaches small models to reason like doctors.
Accurate clinical diagnosis requires nuanced reasoning that large language models (LLMs) can provide, but their high computational cost limits deployment in resource-constrained settings such as rural clinics or mobile devices. Traditional knowledge distillation (KD) transfers only surface-level answer patterns and misses the structured reasoning steps essential for reliable diagnosis. The new MedThink framework addresses this with a two-stage distillation process. In the first stage, a teacher LLM screens the training data and augments it with domain-knowledge explanations, and the small student model is fine-tuned on the result. In the second stage, the teacher evaluates the student's diagnostic errors, generates explicit reasoning chains that link the relevant knowledge to the correct answer, and the student is refined in a second fine-tuning round. The iterative design is meant to have the student internalize the reasoning process itself, not just the answers.
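The two stages described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the `Example` class and the `keep`, `explain`, `reason`, `fine_tune`, and `predict` callables are hypothetical placeholders for the teacher and student models. Two simplifications are assumptions as well: student errors are detected by exact string match, and the student is evaluated on the same screened examples it was trained on in stage one.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Example:
    question: str
    answer: str
    rationale: str = ""  # teacher-written explanation or reasoning chain


def two_stage_distill(
    data: List[Example],
    keep: Callable[[Example], bool],          # teacher screens training data
    explain: Callable[[Example], str],        # teacher adds a domain explanation
    reason: Callable[[Example, str], str],    # teacher writes a reasoning chain
    fine_tune: Callable[[Any, List[Example]], Any],
    predict: Callable[[Any, str], str],
    student: Any,
) -> Tuple[Any, List[Example], List[Example]]:
    # Stage 1: screen the data, attach explanations, fine-tune the student.
    stage1 = [
        Example(ex.question, ex.answer, explain(ex)) for ex in data if keep(ex)
    ]
    student = fine_tune(student, stage1)

    # Stage 2: collect the student's errors (exact match is an assumption),
    # attach a reasoning chain for each, and fine-tune a second time.
    stage2 = []
    for ex in stage1:
        guess = predict(student, ex.question)
        if guess != ex.answer:
            stage2.append(Example(ex.question, ex.answer, reason(ex, guess)))
    if stage2:
        student = fine_tune(student, stage2)

    return student, stage1, stage2
```

Passing the models in as callables keeps the sketch independent of any particular training library; in practice each callable would wrap a prompt to the teacher or a fine-tuning job for the student.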
The team evaluated MedThink against six distillation strategies on general medical benchmarks and a gastroenterology dataset comprising 955 question-answer pairs. Results show MedThink consistently outperforms the competing strategies, achieving up to a 12.7% improvement over the student baseline on general tasks and a top accuracy of 56.4% on the specialized gastroenterology evaluation. These gains suggest that reasoning-centered iterative distillation can significantly boost diagnostic accuracy and generalization in small language models (SLMs) while preserving their computational efficiency: SLMs require far less memory and compute than their larger counterparts. Code and data are publicly available on GitHub, enabling further research and practical deployment in low-resource healthcare environments.
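For readers reproducing numbers like these, the sketch below shows how accuracy on a set of question-answer pairs and the gain over a baseline might be computed. The summary does not say whether the 12.7% figure is an absolute gain in percentage points or a relative one, so the hypothetical `gains` helper returns both. The function names and the values in the comments are illustrative assumptions, not figures from the paper.

```python
from typing import Sequence, Tuple


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def gains(baseline_acc: float, distilled_acc: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = (distilled_acc - baseline_acc) * 100
    relative = (distilled_acc - baseline_acc) / baseline_acc * 100
    return absolute, relative


# Illustrative values only: a baseline at 0.50 and a distilled student at 0.60
# differ by about 10 percentage points, which is about a 20% relative gain.
abs_gain, rel_gain = gains(0.50, 0.60)
```

The two readings can differ substantially when the baseline accuracy is low, which is why it helps to state which one a headline figure refers to.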
- Two-stage distillation: stage one fine-tunes the student on teacher-written domain explanations; stage two fine-tunes it on reasoning chains the teacher generates after evaluating the student's errors
- Up to 12.7% improvement over baseline student models on general medical benchmarks
- Reaches a top accuracy of 56.4% on a gastroenterology dataset of 955 QA pairs
Why It Matters
Could bring capable clinical AI to devices with limited compute, extending diagnostic support to underserved areas.