Research & Papers

ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

Even GPT-5 scores only 41.2% on a new benchmark of 2,437 real patient-doctor dialogues.

Deep Dive

Researchers Monica Munnangi and Saiph Savage have released ThReadMed-QA, a new benchmark designed to stress-test large language models (LLMs) on the complex, iterative nature of real medical consultations. Unlike existing benchmarks that use single-turn Q&A or simulated dialogues, ThReadMed-QA is built from 2,437 authentic conversation threads scraped from the Reddit forum r/AskDocs, comprising 8,204 verified question-answer pairs across up to 9 conversational turns. This dataset captures the natural flow of patient follow-ups and physician clarifications, providing a more realistic and challenging testbed.

When the researchers evaluated five state-of-the-art models—GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B—the results were stark. Even the strongest model, GPT-5, managed only a 41.2% rate of fully correct responses on a test set of 238 conversations. Performance degraded sharply with each turn; wrong-answer rates roughly tripled by the third exchange. The study introduced new metrics like the Error Propagation Rate (EPR), showing that a single incorrect model response makes a subsequent error 1.9 to 6.1 times more likely.
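The paper's exact EPR formula isn't reproduced here, but the "1.9 to 6.1 times more likely" framing reads as a conditional risk ratio: the probability of an error given the previous turn was wrong, divided by the probability of an error given the previous turn was right. A minimal sketch under that assumption (the function name and toy data are illustrative, not from the paper):

```python
# Hedged sketch: Error Propagation Rate (EPR) as a conditional risk ratio.
# Assumption: EPR = P(error at turn t | error at t-1) / P(error at turn t | correct at t-1).

def error_propagation_rate(conversations):
    """conversations: list of per-conversation turn flags, True = correct answer."""
    err_after_err = err_after_ok = 0   # error counts by previous-turn outcome
    n_after_err = n_after_ok = 0       # totals by previous-turn outcome
    for turns in conversations:
        for prev, cur in zip(turns, turns[1:]):
            if prev:                   # previous turn was correct
                n_after_ok += 1
                err_after_ok += (not cur)
            else:                      # previous turn was an error
                n_after_err += 1
                err_after_err += (not cur)
    p_err_given_err = err_after_err / n_after_err
    p_err_given_ok = err_after_ok / n_after_ok
    return p_err_given_err / p_err_given_ok

# Toy data: errors tend to follow errors within a thread.
convs = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
print(error_propagation_rate(convs))  # → 2.0
```

An EPR above 1.0 means errors compound rather than occur independently; the reported 1.9-6.1x range is exactly this kind of compounding across the evaluated models.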

The findings expose a fundamental tension in current LLM design: models excelling at single-turn tasks, like GPT-5, which scored 75.2 out of 100 on the first turn, suffer the steepest declines in multi-turn reliability. For instance, Claude Haiku's score dropped 25 points by turn two, and nearly one in three of its conversations swung between completely correct and completely wrong answers. The benchmark suggests that today's most advanced AI lacks the conversational consistency and reasoning stamina required for high-stakes, multi-turn domains like healthcare.

Key Points
  • Benchmark built from 2,437 real Reddit medical threads, containing 8,204 QA pairs.
  • GPT-5 scored only 41.2% fully correct; error rates tripled by the third conversational turn.
  • A single wrong answer raises the probability of a subsequent error by 1.9x to 6.1x (Error Propagation Rate).

Why It Matters

Reveals a critical weakness in AI assistants for healthcare, therapy, or customer support, where multi-turn reliability is essential.