Image & Video

Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading under Distribution Shifts

New benchmark reveals pathology foundation models struggle with cross-hospital data differences.

Deep Dive

A new study by Fredrik K. Gustafsson and Mattias Rantalainen, posted on arXiv, evaluates the robustness of pathology foundation models (PFMs) for prostate cancer grading under clinically relevant distribution shifts. Using the PANDA dataset of whole-slide images (WSIs), the researchers benchmarked PFMs as frozen patch-level feature extractors in weakly supervised slide-level grading models. They tested robustness against two shifts: variations in WSI appearance across collection sites (Radboud vs. Karolinska) and changes in the label distribution over cancer grade groups. In controlled, in-distribution settings, PFMs consistently outperformed a natural-image baseline, demonstrating their value as pretrained encoders.
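The pipeline described above (a frozen foundation-model encoder producing patch embeddings, pooled into a slide-level representation for weakly supervised grading) can be sketched roughly as follows. This is an illustrative, minimal sketch using attention-based MIL pooling as a stand-in; the function names, dimensions, and the random projection "encoder" are assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patch_features(patches, encoder):
    """Frozen encoder: map each patch to a fixed embedding (no fine-tuning)."""
    return np.stack([encoder(p) for p in patches])

def attention_pool(features, w, v):
    """Attention-based MIL pooling: softmax-weight patch embeddings,
    then sum into a single slide-level vector."""
    scores = np.tanh(features @ v) @ w          # one score per patch
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over patches
    return weights @ features                   # (embedding_dim,)

# Toy "frozen encoder": a fixed random projection standing in for a
# pretrained pathology foundation model's patch encoder.
dim_in, dim_emb, dim_att = 32, 16, 8
proj = rng.normal(size=(dim_in, dim_emb))
encoder = lambda patch: patch @ proj

patches = rng.normal(size=(100, dim_in))        # 100 patches from one WSI
feats = extract_patch_features(patches, encoder)
slide_vec = attention_pool(
    feats,
    rng.normal(size=dim_att),
    rng.normal(size=(dim_emb, dim_att)),
)
print(slide_vec.shape)  # (16,)
```

In this setup only the pooling and any downstream classifier head are trained on the target labels, which is why, as the study argues, downstream generalization still depends heavily on the diversity of that training data.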

However, under cross-site transfer, performance dropped substantially for all models, indicating that large-scale pretraining alone does not guarantee downstream generalization. The models were less sensitive to label-distribution shifts, suggesting that visually grounded domain shift is the dominant challenge. Representation analysis confirmed persistent domain separation between sites across all PFMs, with grade-related structure present but comparatively weak. The authors conclude that while PFMs provide strong representations, generalizability remains constrained by the quality and diversity of data used to train downstream prediction models, highlighting a key hurdle for clinical deployment.

Key Points
  • PFMs outperform natural-image baselines in controlled settings but degrade substantially under cross-site transfer (e.g., Radboud to Karolinska).
  • Models are less sensitive to label-distribution shifts than to visual domain shifts.
  • Representation analysis shows persistent domain separation between hospitals, limiting real-world generalizability.

Why It Matters

The study highlights critical robustness gaps in pathology AI and underscores the need for more diverse downstream training data to achieve clinical reliability.