Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage
New benchmark shows AI models lag behind junior dentists on complex, multi-domain dental triage decisions.
A research team led by Ziyi He has published Dental-TriageBench, the first expert-annotated benchmark designed to evaluate multimodal reasoning for hierarchical dental triage. Built from authentic outpatient workflows, the benchmark contains 246 de-identified clinical cases. Each case is annotated with expert-authored "golden reasoning trajectories" and hierarchical triage labels, requiring models to integrate patient complaints with radiographic evidence (such as OPG images) to determine complete referral plans.
The researchers benchmarked 19 proprietary, open-source, and medical-domain Multimodal Large Language Models (MLLMs) against a human baseline of three junior dentists. The results revealed a substantial human-model gap, particularly on fine-grained, treatment-level triage decisions. Further analysis showed that accurate triage requires both complaint and OPG information, and that model errors concentrate heavily on cases involving multiple referral domains. In these complex scenarios, MLLMs tend to produce overly narrow referral sets and make omission-heavy errors, failing to recommend the full scope of necessary specialist care.
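The omission-heavy error pattern described above can be quantified with simple set-based coverage metrics over referral domains. A minimal sketch, assuming referral plans are represented as sets of specialty labels (the function, metric names, and example specialties below are illustrative, not taken from the benchmark):

```python
def referral_coverage(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Set-based metrics for one case: how much of the gold referral
    plan the model covered, and how much it over-referred.
    (Hypothetical helper for illustration, not the paper's metric.)"""
    hit = predicted & gold
    recall = len(hit) / len(gold) if gold else 1.0        # coverage of required domains
    precision = len(hit) / len(predicted) if predicted else 1.0
    omission_rate = 1.0 - recall                          # share of required domains missed
    return {"recall": recall, "precision": precision, "omission_rate": omission_rate}

# Illustrative multi-domain case: the gold plan spans three specialties,
# but the model recommends only one -- the "overly narrow" failure mode.
gold = {"endodontics", "periodontics", "oral_surgery"}
pred = {"endodontics"}
print(referral_coverage(pred, gold))
# recall = 1/3, precision = 1.0, omission_rate = 2/3
```

Under this framing, the paper's finding corresponds to models scoring high precision but low recall on multi-domain cases: what they recommend is usually relevant, but much of the required specialist care goes unrecommended.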
Dental-TriageBench establishes a realistic, safety-critical testbed for developing clinical AI. It highlights that current models, despite their general capabilities, are not yet clinically grounded or coverage-aware enough for complex, real-world medical routing tasks. The benchmark's public availability aims to steer future AI development toward systems that are safer for downstream patient care by rigorously testing their reasoning on authentic clinical workflows.
- First expert-annotated benchmark for multimodal dental triage, built from 246 real outpatient cases with expert reasoning paths.
- Tested 19 MLLMs spanning proprietary, open-source, and medical-domain models, all of which showed a significant performance gap versus three junior dentists on complex referrals.
- Model errors concentrated on multi-domain cases, where models produced overly narrow referral sets and omission-heavy errors, failing to cover the full scope of needed care.
Why It Matters
Highlights a critical safety gap in medical AI, pushing development toward more reliable, clinically-aware systems for complex diagnostic routing.