Research & Papers

When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models

New research finds that standard prompt engineering techniques can backfire on medical LLMs, with chain-of-thought prompting reducing accuracy.

Deep Dive

A new study by researchers Binesh Sadanandan and Vahid Behzadan reveals that standard prompt engineering techniques, validated on general-purpose models, can significantly backfire when applied to specialized medical language models. Evaluating Google's MedGemma (4B and 27B parameters) on the MedMCQA and PubMedQA datasets, the researchers discovered that Chain-of-Thought (CoT) prompting—where a model explains its reasoning—actually decreased accuracy by 5.7% compared to direct answering. Even more concerning, providing few-shot examples degraded performance by 11.9% and dramatically increased position bias from 0.14 to 0.47, meaning the model became overly reliant on the order of presented information.
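
To make the comparison concrete, here is a minimal sketch of how direct-answer and chain-of-thought prompts for a multiple-choice medical question might be constructed. The template wording and function names are illustrative assumptions for this newsletter, not the exact prompts evaluated in the study.

```python
# Illustrative sketch only: these templates are assumptions for demonstration,
# not the exact prompt wording used in the paper.

def build_direct_prompt(question: str, options: dict[str, str]) -> str:
    """Direct answering: ask for the option letter with no reasoning."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return f"Question: {question}\n{opts}\nAnswer with a single option letter."

def build_cot_prompt(question: str, options: dict[str, str]) -> str:
    """Chain-of-thought: ask the model to reason before committing to a letter."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return (
        f"Question: {question}\n{opts}\n"
        "Let's think step by step, then state the final option letter."
    )
```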

The research uncovered severe sensitivity to prompt formatting that undermines reliability. Simply shuffling the order of multiple-choice answer options caused the model to change its predictions 59.1% of the time, with accuracy dropping by up to 27.4 percentage points. Context manipulation was equally problematic: removing the first 50% of a medical context caused accuracy to plummet below the no-context baseline, while removing the last 50% preserved 97% of full-context accuracy, revealing an unexpectedly heavy reliance on the opening portion of the context.
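
The shuffling experiment can be pictured with a short sketch like the one below, which measures how often the model's chosen option text changes when the answer letters are reordered. Here `predict` is a placeholder for any call that returns the model's chosen letter, and `shuffle_options` and `flip_rate` are hypothetical helpers; the paper's exact protocol may differ.

```python
import random

# Sketch of an option-shuffling consistency check, under the assumptions above.

def shuffle_options(options: dict[str, str], seed: int) -> dict[str, str]:
    """Re-assign the option texts to the same letters in a random order."""
    letters = list(options)
    texts = list(options.values())
    random.Random(seed).shuffle(texts)
    return dict(zip(letters, texts))

def flip_rate(predict, question: str, options: dict[str, str],
              n_shuffles: int = 20) -> float:
    """Fraction of shuffles on which the selected option *text* differs from
    the text selected under the original ordering."""
    base_text = options[predict(question, options)]
    flips = 0
    for seed in range(n_shuffles):
        shuffled = shuffle_options(options, seed)
        if shuffled[predict(question, shuffled)] != base_text:
            flips += 1
    return flips / n_shuffles
```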

Crucially, the study identified a more reliable alternative. 'Cloze scoring', in which each answer option is scored by the log-probability the model assigns to its single answer token and the highest-scoring option is selected, achieved accuracies of 51.8% (4B model) and 64.5% (27B model), surpassing every generative prompting strategy. This indicates that medical LLMs internally 'know' more correct information than their generated text reveals. The findings mandate a reevaluation of deployment practices for AI in high-stakes domains like healthcare, where unreliable prompting could have serious consequences.
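
As a rough sketch of the cloze-scoring idea, selection reduces to an argmax over option letters rather than free-form generation. The `token_logprob` hook below is a hypothetical stand-in for whatever returns the model's log-probability of a candidate next token (for example, read off a local model's logits); it is not a specific library API, and the paper's exact implementation may differ.

```python
# Sketch of cloze-style answer selection under the assumptions above.

def cloze_select(token_logprob, question: str, options: dict[str, str]) -> str:
    """Pick the option letter whose single answer token scores highest,
    with no free-form generation involved."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    prompt = f"Question: {question}\n{opts}\nAnswer:"
    scores = {letter: token_logprob(prompt, f" {letter}") for letter in options}
    return max(scores, key=scores.get)
```

Because selection happens directly over probabilities rather than generated text, this kind of scoring sidesteps the formatting sensitivity described above, which is consistent with the gap the study reports between what the models 'know' and what they generate.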

Key Points
  • Chain-of-Thought prompting reduced MedGemma's accuracy by 5.7% on medical QA tasks, contradicting its benefit for general models.
  • Shuffling multiple-choice answer options caused the model to change its prediction 59.1% of the time, revealing extreme format sensitivity.
  • Cloze scoring (selecting answers by token log-probability) outperformed every generative prompting strategy, showing the models 'know' more than they generate, with the 27B model reaching 64.5% accuracy.

Why It Matters

For teams deploying medical AI, standard prompt engineering is unreliable; safer, probability-based scoring methods such as cloze evaluation are needed to earn clinical trust.