Media & Culture

Stanford Chair of Medicine: LLMs Are Superhuman Guessers

AI models solved medical image tasks 10% better than experts without seeing a single picture.

Deep Dive

A groundbreaking study from Stanford University, co-authored by AI pioneer Fei-Fei Li, has revealed that large language models (LLMs) can perform at a 'superhuman' level on medical diagnostic tasks without ever seeing the relevant images. The researchers presented models such as the open-source Qwen 2.5 with text-based questions that required analyzing an image to solve, but withheld the images themselves. Astonishingly, the LLMs' answers, generated purely by inferring the likely visual content from the prompt, outperformed practicing radiologists by an average of 10%. This held true even on questions derived from a private dataset published after Qwen 2.5's release, which could not have appeared in the model's training data, suggesting the capability stems from deep pattern recognition learned during training, not simple memorization.
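The paper's exact prompts aren't reproduced here, but the protocol it describes amounts to sending a plain language model the question and answer choices from a visual benchmark while never attaching the image. Below is a minimal sketch of that setup using the Hugging Face transformers library; the Qwen model ID is a real checkpoint, but the sample item, options, and prompt wording are hypothetical illustrations, not the study's materials.

    # Sketch of a "text-only VQA" probe: the benchmark question is sent to a
    # plain language model while the image it refers to is withheld.
    # The sample question and options below are hypothetical stand-ins.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # A VQA-style item: note that item["image"] is never loaded or passed in.
    item = {
        "image": "cxr_0417.png",  # deliberately unused
        "question": "Based on the chest X-ray, which finding is most likely?",
        "options": ["A. Pneumothorax", "B. Cardiomegaly",
                    "C. Pleural effusion", "D. Normal"],
    }

    prompt = (item["question"] + "\n" + "\n".join(item["options"])
              + "\nAnswer with a single letter.")
    messages = [{"role": "user", "content": prompt}]

    # Format with the model's chat template and generate a short answer.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=8)

    # Decode only the newly generated tokens.
    answer = tokenizer.decode(
        output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    print(answer)  # the model must "guess" the image's contents from text alone

Scored across a full benchmark, the fraction of correct letters is the 'no-image' accuracy the study compares against radiologists and image-capable models.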

In one of the most striking results, a model operating without any visual input outperformed all contemporary AI models, including those specifically designed for medical imaging, on a standard chest X-ray diagnostic benchmark, ranking at the top of the public leaderboard. As Stanford's Chair of Medicine, Euan Ashley, noted, the models performed well without images and only a little better with them. This finding challenges fundamental assumptions about how AI should be applied in medicine, indicating that an LLM's textual knowledge base contains a surprisingly robust and accurate model of visual medical concepts, one it can activate through reasoning alone.

Key Points
  • LLMs like Qwen 2.5 solved image-based medical questions 10% better than radiologists without seeing the images.
  • A 'no-image' model topped a chest X-ray diagnostic leaderboard, beating specialized medical imaging AI.
  • Strong performance on a private dataset released after the model's training cutoff indicates the capability reflects reasoning, not data leakage.

Why It Matters

This redefines AI's role in medicine, suggesting diagnostic support could come from pure language models, potentially making advanced care more accessible.