Image & Video

MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

Frontier models score 97% on multiple-choice questions but drop to 14% on free recall of GE scanner operations...

Deep Dive

A new benchmark called MRI-Eval is putting LLMs through a rigorous test of MRI physics and GE scanner operations knowledge — and the results reveal a stark gap between multiple-choice bravado and real recall. Developed by Perry E. Radau, the benchmark includes 1,365 scored items across nine categories and three difficulty tiers, sourced from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five models were evaluated: GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, and Llama 3.3 70B.
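
To make the tiered structure concrete, here is a minimal sketch of how such benchmark items might be represented. The field names and example content are illustrative assumptions, not the actual MRI-Eval schema.

```python
from dataclasses import dataclass

# Hypothetical item schema for a tiered MCQ benchmark like MRI-Eval.
# Field names and example values are assumptions for illustration only.
@dataclass
class BenchmarkItem:
    question_id: str
    category: str        # e.g. "MRI physics", "GE scanner operations"
    tier: int            # difficulty tier, 1-3
    source: str          # e.g. "textbook", "GE manual", "expert"
    stem: str            # question text without answer options
    options: list[str]   # answer options for the MCQ setting
    answer_index: int    # index of the correct option

item = BenchmarkItem(
    question_id="phys-0042",
    category="MRI physics",
    tier=2,
    source="textbook",
    stem="Which parameter primarily controls T1 weighting in a spin-echo sequence?",
    options=["Echo time (TE)", "Repetition time (TR)", "Flip angle", "Bandwidth"],
    answer_index=1,
)
```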

In the standard multiple-choice setting, all models performed impressively, with accuracy ranging from 93.2% to 97.1%. But the results flipped when answer options were removed (stem-only testing) and an independent LLM judge graded the free-text responses. Frontier model accuracy plummeted to 58.4%–61.1%, while Llama 3.3 70B fell to 37.1%. The most telling finding came from the GE scanner operations category, where stem-only accuracy ranged from a dismal 13.8% to 29.8%. This shows that high MCQ scores can mask weak free-text recall, especially for vendor-specific operational knowledge. Radau emphasizes that MRI-Eval is most useful as a relative comparison benchmark rather than an absolute competency measure, and cautions against relying on raw LLM outputs for GE-specific protocol guidance.
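
The MCQ-versus-stem-only comparison is straightforward to reproduce in outline. Below is a minimal sketch using the hypothetical item schema above, a generic model-calling callable, and judge-prompt wording of my own; none of this reflects the actual MRI-Eval harness.

```python
def mcq_prompt(item) -> str:
    """Standard MCQ setting: present the stem with lettered options."""
    letters = "ABCD"
    opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(item.options))
    return f"{item.stem}\n{opts}\nAnswer with a single letter."

def stem_only_prompt(item) -> str:
    """Stem-only setting: the model must recall the answer with no options shown."""
    return f"{item.stem}\nAnswer concisely in free text."

def judge_prompt(item, response: str) -> str:
    """Ask an independent LLM judge to grade a free-text response (illustrative wording)."""
    reference = item.options[item.answer_index]
    return (
        "You are grading an exam answer.\n"
        f"Question: {item.stem}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {response}\n"
        "Reply CORRECT or INCORRECT."
    )

def stem_only_accuracy(items, model_ask, judge_ask) -> float:
    """Score stem-only responses with an LLM judge.

    model_ask and judge_ask are placeholder callables that wrap whatever
    chat-completion APIs are actually used (prompt in, text out).
    """
    correct = 0
    for item in items:
        response = model_ask(stem_only_prompt(item))
        verdict = judge_ask(judge_prompt(item, response))
        correct += verdict.strip().upper().startswith("CORRECT")
    return correct / len(items)
```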

Key Points
  • MRI-Eval benchmark: 1,365 items across 9 categories and 3 difficulty tiers from textbooks, GE manuals, and expert questions.
  • MCQ accuracy 93-97% for all models, but stem-only (no options) dropped frontier models to 58-61% and Llama 3.3 70B to 37%.
  • GE scanner operations stem-only accuracy was just 13.8-29.8%, highlighting a major gap in vendor-specific knowledge.

Why It Matters

High LLM MCQ scores can mislead users; free-text recall of vendor-specific knowledge is still dangerously weak.