Research & Papers

MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation

A simple 'open-book' prompt boosts accuracy from 52% to 85%, beating complex RL-trained systems.

Deep Dive

A new audit by researcher Artus Krohn-Grimberghe challenges the fundamental validity of MedCalc-Bench, a standard benchmark for evaluating large language models (LLMs) on clinical calculator tasks. The paper reports that a systematic audit uncovered more than 20 critical errors in the benchmark's calculator implementations, ranging from inaccurate formulas to runtime bugs, in a dataset previously published at NeurIPS. This calls into question the reliability of previously reported scores, where state-of-the-art direct prompting had plateaued around 35% and the best published approach, reinforcement learning (RL) with verifiable rewards, reached 74%.
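To make the failure mode concrete: each benchmark task wraps a clinical formula in code, and a wrong coefficient or missing adjustment factor silently corrupts every ground-truth answer derived from it. The sketch below is illustrative Python only, not the benchmark's actual code; Cockcroft-Gault creatinine clearance is used here simply as a representative example of the kind of calculator such benchmarks cover.

```python
def cockcroft_gault_crcl(age_years: float, weight_kg: float,
                         serum_creatinine_mg_dl: float, female: bool) -> float:
    """Cockcroft-Gault creatinine clearance (mL/min).

    CrCl = (140 - age) * weight * sex_factor / (72 * serum creatinine),
    where sex_factor is 0.85 for female patients and 1.0 otherwise.
    A bug in any of these constants would skew every derived label.
    """
    sex_factor = 0.85 if female else 1.0
    return (140 - age_years) * weight_kg * sex_factor / (72 * serum_creatinine_mg_dl)

# Worked example: a 60-year-old woman, 72 kg, serum creatinine 1.0 mg/dL:
# (140 - 60) * 72 * 0.85 / (72 * 1.0) = 68.0 mL/min
assert abs(cockcroft_gault_crcl(60, 72, 1.0, female=True) - 68.0) < 1e-6
```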

The study's most significant finding is that the benchmark's framing is flawed. Simply providing the model with the calculator specification at inference time, an 'open-book' prompting approach, lifted accuracy on models such as GLM-4.6V and GLM-4.7 from ~52% to 81-85%, outperforming all complex, fine-tuned RL systems without any additional training. GPT-5.2-Thinking established a performance upper bound of 95-97%, with the residual errors traced to dataset issues. The results strongly indicate that MedCalc-Bench primarily tests a model's ability to memorize formulas and perform arithmetic, not its clinical reasoning, and would be more accurately framed as an evaluation of tool-use proficiency.
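The mechanics of open-book prompting are simple: instead of asking the model to recall a formula, the prompt carries the calculator's specification alongside the patient note. The paper's exact prompt wording is not reproduced here; the sketch below is a minimal, hypothetical Python version of the idea, with all names (build_open_book_prompt, the section headers) invented for illustration.

```python
def build_open_book_prompt(patient_note: str, question: str,
                           calculator_spec: str) -> str:
    """Assemble an 'open-book' prompt: the calculator specification is
    supplied in-context, so the model's job reduces to extracting values
    from the note and applying the given formula, rather than recalling
    the formula from memory."""
    return (
        "You are given a clinical calculator specification. Apply it to the "
        "patient note below to answer the question, showing the substituted "
        "values.\n\n"
        f"Calculator specification:\n{calculator_spec}\n\n"
        f"Patient note:\n{patient_note}\n\n"
        f"Question:\n{question}\n\n"
        "Respond with the final numeric value and its unit."
    )
```

The design point is that the specification removes memorization from the task entirely: what remains is value extraction and arithmetic, which is precisely the tool-use skill the audit argues the benchmark actually measures.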

Key Points
  • Audit found over 20 critical errors in formulas and code within the NeurIPS-published MedCalc-Bench dataset.
  • "Open-book" prompting (providing calculator specs) boosted GLM model accuracy from ~52% to 81-85%, beating all prior RL-trained systems.
  • The findings suggest the benchmark measures formula memorization and arithmetic, not clinical reasoning, reframing it as a tool-use evaluation.

Why It Matters

This forces a reevaluation of how we benchmark AI for specialized domains, prioritizing realistic tool use over artificial memorization tasks.