Media & Culture

Stanford Chair of Medicine: LLMs Are Superhuman Guessers

AI models answered questions about medical images 10% better than radiologists without ever seeing the actual pictures.

Deep Dive

A groundbreaking study from Stanford University, co-authored by AI pioneer Fei-Fei Li, reveals that large language models (LLMs) possess a startling ability to solve medical diagnostic tasks without visual input. The research tested models like Alibaba's Qwen 2.5 on benchmarks that typically require analyzing radiology images, such as chest X-rays. Remarkably, the LLMs were given only the text-based questions and answer choices, forcing them to 'guess' the image's contents. Despite this limitation, they outperformed human radiologists by an average of 10%.
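To make the setup concrete, here is a minimal, illustrative sketch of what such a text-only evaluation could look like. The model checkpoint, prompt wording, and sample question below are assumptions for illustration, not the study's actual harness; the key idea is simply that the benchmark's question and answer choices are sent to the model while the accompanying image is withheld.

    # Illustrative sketch only: scores an LLM on multiple-choice questions
    # drawn from an image-based benchmark, with the images deliberately omitted.
    # Model name, prompt, and the example question are assumptions.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "Qwen/Qwen2.5-7B-Instruct"  # any instruction-tuned text-only LLM
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

    # Each benchmark item normally ships with a chest X-ray; here only the text is used.
    questions = [
        {
            "question": "A 67-year-old smoker's chest X-ray shows a 3 cm spiculated "
                        "mass in the right upper lobe. Most likely diagnosis?",
            "choices": {"A": "Pneumonia", "B": "Lung carcinoma",
                        "C": "Tuberculoma", "D": "Hamartoma"},
            "answer": "B",
        },
    ]

    correct = 0
    for item in questions:
        options = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
        prompt = (
            "Answer the following radiology question. The image is NOT provided; "
            "reply with the single most likely option letter.\n\n"
            f"{item['question']}\n{options}\nAnswer:"
        )
        inputs = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True, return_tensors="pt",
        ).to(model.device)
        output = model.generate(inputs, max_new_tokens=5, do_sample=False)
        reply = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
        predicted = reply.strip()[:1].upper()  # take the first letter as the model's choice
        correct += predicted == item["answer"]

    print(f"Text-only accuracy: {correct / len(questions):.0%}")

Scored this way, a text-only model has no pixel information at all; any accuracy above chance must come from statistical regularities between question wording, answer options, and clinical knowledge absorbed during training.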

This 'superhuman guessing' capability held even on a private dataset released only after Qwen 2.5 was published as open source, ruling out the possibility that the model had simply memorized the answers. In one striking result, a text-only LLM outperformed all existing image-analysis models on a major chest X-ray benchmark, claiming the top spot on the leaderboard. As Stanford's Chair of Medicine, Euan Ashley, noted, the models performed well without images and only 'a little better' with them. This challenges fundamental assumptions about how AI interprets medical data and suggests LLMs leverage vast latent knowledge from their training to make clinical inferences.

Key Points
  • LLMs like Qwen 2.5 answered image-based medical questions 10% better than radiologists without seeing the images.
  • A 'no-image' model topped a chest X-ray benchmark leaderboard, beating specialized models on a private post-release dataset.
  • The study, co-authored by Fei-Fei Li, shows LLMs use latent knowledge to infer visual content from text prompts.

Why It Matters

This reveals a hidden, 'superhuman' reasoning layer in LLMs that could reshape AI diagnostics and challenge how we evaluate model capabilities.