
Study finds LLMs diagnose better in English, except OpenAI's o3

arXiv cs.CL May 20, 2026

Four of five AI models scored significantly lower on diagnostic reasoning when prompted in French rather than English.

Deep Dive

The authors of the arXiv preprint evaluated how prompting language affects diagnostic reasoning in five LLMs: o3 (OpenAI), DeepSeek-R1, GPT-4-Turbo (OpenAI), Llama-3.1-405B-Instruct (Meta), and BioMistral-7B. They used 180 clinical vignettes covering 16 medical specialties, with two physicians scoring on an 18-point scale that measured both diagnostic accuracy and reasoning quality (including differential diagnosis, logical structure, and internal validity).

Results showed that four of five models performed significantly better in English than in French, with mean differences ranging from 0.37 to 0.91 points (adjusted p < 0.05). The gap spanned multiple reasoning dimensions, not just final diagnosis. Notably, OpenAI's o3 was the only model with no significant language effect, suggesting it may process clinical prompts more language-agnostically.
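As a rough illustration of the kind of comparison described above, the sketch below computes a paired mean score difference (English minus French) and applies a Holm-Bonferroni correction to a set of p-values. This summary does not say which statistical test or adjustment method the authors used, so the Holm procedure is an assumption here, and the function names `paired_mean_difference` and `holm_adjust` are hypothetical. The numbers are made-up toy values, not data from the study.

```python
from statistics import mean


def paired_mean_difference(english_scores, french_scores):
    """Mean of the per-vignette (English - French) score differences."""
    if len(english_scores) != len(french_scores):
        raise ValueError("scores must be paired per vignette")
    return mean(e - f for e, f in zip(english_scores, french_scores))


def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    Assumption: the source only says "adjusted p"; Holm is one common choice.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Toy, made-up numbers used only to exercise the functions (not study data).
toy_english = [15, 12, 17, 10]
toy_french = [14, 12, 15, 10]
toy_gap = paired_mean_difference(toy_english, toy_french)

# One hypothetical raw p-value per model (five models in the study).
toy_adjusted = holm_adjust([0.01, 0.04, 0.03, 0.20, 0.002])
significant = [p < 0.05 for p in toy_adjusted]
```

A positive `toy_gap` means the English responses scored higher on average. In a real analysis the raw p-values would come from a paired test on the per-vignette scores, which is omitted here because the summary does not name one.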

The study highlights a critical flaw in current LLM clinical decision support: performance degrades in non-English languages, risking inequitable outcomes in multilingual healthcare settings. The authors call for broader multilingual evaluations and language-aware training to ensure fair deployment worldwide.

Key Points

Only OpenAI's o3 showed no significant performance gap between English and French prompting.
Llama-3.1-405B-Instruct had the largest score drop (-0.91 points on the 18-point scale) when prompted in French.
The language effect impacted reasoning quality (differential diagnosis, logic) beyond just final diagnosis accuracy.

Why It Matters

Language bias in diagnostic LLMs could worsen healthcare inequality for non-English-speaking populations worldwide.
