
Study finds LLMs diagnose better in English, except OpenAI's o3

Four of five LLMs scored significantly lower on diagnostic reasoning when prompted in French rather than English.

Deep Dive

The authors of an arXiv preprint evaluated how prompting language affects diagnostic reasoning in five LLMs: o3 (OpenAI), DeepSeek-R1, GPT-4-Turbo (OpenAI), Llama-3.1-405B-Instruct (Meta), and BioMistral-7B. They used 180 clinical vignettes covering 16 medical specialties, with model responses assessed by two physicians on an 18-point scale measuring both diagnostic accuracy and reasoning quality (including differential diagnosis, logical structure, and internal validity).

Results showed that four of five models performed significantly better in English than in French, with mean differences ranging from 0.37 to 0.91 points (adjusted p < 0.05). The gap spanned multiple reasoning dimensions, not just final diagnosis. Notably, OpenAI's o3 was the only model with no significant language effect, suggesting it may process clinical prompts more language-agnostically.
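To make the "mean difference" and "adjusted p" figures concrete, here is a minimal Python sketch of how a per-model language gap and multiplicity-adjusted p-values could be computed. This summary does not say which statistical test or correction the authors used, so the Holm-Bonferroni adjustment is an assumption chosen for illustration. The function names and all numbers are hypothetical and are not study data.

```python
from statistics import mean


def mean_paired_difference(english, french):
    """Mean of per-vignette score differences (English minus French)."""
    if len(english) != len(french):
        raise ValueError("score lists must be the same length")
    return mean(e - f for e, f in zip(english, french))


def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    Assumption: the source does not name its correction method;
    Holm is used here only as a common example.
    """
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Scale by the number of hypotheses still in play, cap at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical scores on the 18-point scale for four vignettes (not study data).
english_scores = [15, 12, 17, 10]
french_scores = [14, 12, 15, 10]
gap = mean_paired_difference(english_scores, french_scores)

# Hypothetical raw p-values, one per model (not study data).
raw_p = [0.001, 0.012, 0.020, 0.030, 0.400]
adjusted_p = holm_adjust(raw_p)
```

A model would count as showing a significant language effect in this sketch when its entry in `adjusted_p` falls below 0.05. Adjusting across all five models guards against reporting a gap that appears only because several comparisons were run at once.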

The study highlights a critical flaw in current LLM clinical decision support: performance degrades in non-English languages, risking inequitable outcomes in multilingual healthcare settings. The authors call for broader multilingual evaluations and language-aware training to ensure fair deployment worldwide.

Key Points
  • Only OpenAI's o3 showed no significant performance gap between English and French prompting.
  • Llama-3.1-405B-Instruct had the largest score drop (-0.91 points on the 18-point scale) when prompted in French.
  • The language effect impacted reasoning quality (differential diagnosis, logic) beyond just final diagnosis accuracy.

Why It Matters

Language bias in diagnostic LLMs could worsen healthcare inequality for non-English-speaking populations worldwide.