Research & Papers

EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal

New benchmark tests AI models on 1,933 expert-annotated sentences from real patient portal messages, revealing which LLMs excel at medical communication tasks.

Deep Dive

A research team from Yale New Haven Hospital and collaborating institutions has introduced EPPCMinerBen, a benchmark designed to evaluate how effectively Large Language Models (LLMs) can analyze and understand electronic patient-provider communication. As healthcare increasingly shifts to digital platforms such as secure patient portals, the ability to automatically extract insights from these exchanges becomes critical for improving treatment outcomes and adherence. The benchmark addresses this need with a standardized testbed of 1,933 expert-annotated sentences drawn from 752 real secure messages, covering three core tasks that mirror real-world clinical needs: classifying the primary communicative intent (Code Classification), identifying more specific subcategories (Subcode Classification), and pinpointing the exact text evidence supporting those classifications (Evidence Extraction).
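
To make the three tasks concrete, the sketch below shows one way a single annotated sentence might be represented and turned into a zero-shot prompt for the Code Classification task. This is a minimal illustration under stated assumptions: the field names, the intent labels, and the prompt wording are hypothetical, not the benchmark's actual schema or prompt templates.

# Minimal sketch of an EPPCMinerBen-style example and a zero-shot prompt.
# The field names and label set below are illustrative assumptions, not the
# benchmark's actual schema.
from dataclasses import dataclass

# Hypothetical top-level intent labels (Code Classification); the real taxonomy
# comes from the paper's annotation guidelines.
CODES = ["Information Seeking", "Information Sharing", "Logistics", "Social"]

@dataclass
class PortalSentence:
    sentence: str   # one annotated sentence from a secure message
    code: str       # primary communicative intent (Code Classification)
    subcode: str    # finer-grained subcategory (Subcode Classification)
    evidence: str   # text span supporting the label (Evidence Extraction)

def zero_shot_code_prompt(example: PortalSentence) -> str:
    """Build a zero-shot prompt asking the model to pick one intent label."""
    options = ", ".join(CODES)
    return (
        "You are analyzing a sentence from a patient portal message.\n"
        f'Sentence: "{example.sentence}"\n'
        f"Choose the single best intent label from: {options}.\n"
        "Answer with the label only."
    )

sample = PortalSentence(
    sentence="Can I take my blood pressure medication before the fasting lab?",
    code="Information Seeking",
    subcode="Medication Question",
    evidence="Can I take my blood pressure medication",
)
print(zero_shot_code_prompt(sample))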

The initial evaluation of various LLMs under zero-shot and few-shot settings revealed significant performance variation. Meta's Llama-3.1-70B demonstrated strong capability in evidence extraction, achieving an F1 score of 82.84%, while the instruction-tuned Llama-3.3-70B-Instruct led code classification with 67.03% F1. Notably, smaller models consistently underperformed, especially on the nuanced subcode classification task, where F1 scores dropped by more than 30 percentage points, underscoring how challenging fine-grained medical reasoning remains for lower-capacity models. The findings confirm that large, instruction-tuned models are currently best suited to this domain, with few-shot prompting generally improving results. By releasing the benchmark through the NCI Cancer Data Service, the team aims to spur development of more generalized models capable of sophisticated discourse-level understanding in healthcare, potentially leading to tools that automatically flag communication gaps, summarize patient concerns, and support clinical decision-making.
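
For readers who want a feel for how this style of evaluation is typically run, the sketch below shows a generic few-shot prompt builder and F1 scoring for the classification tasks. It is not the paper's evaluation harness; the exemplars, the label set, and the choice of macro-averaged F1 are assumptions made for illustration.

# Generic sketch of few-shot prompting plus F1 scoring for intent classification.
# Not the paper's harness: exemplars, labels, and macro-averaging are assumptions.
from sklearn.metrics import f1_score

CODES = ["Information Seeking", "Information Sharing", "Logistics", "Social"]

def few_shot_prompt(exemplars, query_sentence):
    """Prepend labeled exemplars before the query sentence (few-shot setting)."""
    lines = [
        "Classify each patient-portal sentence by its communicative intent.",
        f"Labels: {', '.join(CODES)}",
        "",
    ]
    for sent, label in exemplars:
        lines.append(f"Sentence: {sent}\nIntent: {label}\n")
    lines.append(f"Sentence: {query_sentence}\nIntent:")
    return "\n".join(lines)

# Toy gold labels and model predictions; in practice predictions come from the LLM.
gold = ["Logistics", "Information Seeking", "Social", "Information Sharing"]
pred = ["Logistics", "Information Seeking", "Information Sharing", "Information Sharing"]

print(f"Macro F1: {f1_score(gold, pred, average='macro'):.2%}")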

Key Points
  • Benchmark uses 1,933 real annotated sentences from 752 Yale patient portal messages for authentic evaluation.
  • Llama-3.1-70B leads in Evidence Extraction with 82.84% F1 score, showing strength in pinpointing supporting text.
  • Smaller models struggle significantly, with F1 drops of more than 30 percentage points in Subcode Classification, highlighting the need for scale in medical AI.

Why It Matters

Provides a crucial yardstick for developing AI that can safely and effectively analyze sensitive patient-doctor digital conversations.