PRIME: Prototype-Driven Multimodal Pretraining for Cancer Prognosis with Missing Modalities
New multimodal framework achieves 0.689 AUROC on 3-year mortality prediction despite missing clinical data.
A research team led by Kai Yu has introduced PRIME (Prototype-Driven Multimodal Pretraining), an AI framework designed to predict cancer prognosis from incomplete clinical data. The system integrates three data types: histopathology whole-slide images, gene expression profiles, and pathology reports. What sets PRIME apart is its handling of real-world cases in which a patient is missing one or more of these modalities, a common problem in clinical practice that limits most existing AI approaches. The framework maps the different data types into a unified token space and uses a shared prototype memory bank for semantic imputation: rather than reconstructing raw signals, it fills in missing information through consensus retrieval from the prototypes.
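The idea of prototype-based semantic imputation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `impute_missing`, the softmax-weighted voting, the `temperature` parameter, and the use of a single embedding vector per modality are all assumptions made for illustration. Each available modality scores every prototype in a shared bank, the scores are averaged into a consensus, and the missing modality receives the consensus-weighted mix of prototypes.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def impute_missing(tokens, prototypes, temperature=0.1):
    """Fill missing modality embeddings from a shared prototype bank.

    tokens: dict mapping modality name to a (d,) embedding, or None if missing.
    prototypes: (K, d) array standing in for the shared memory bank.
    All names and the voting rule are hypothetical, for illustration only.
    """
    available = [v for v in tokens.values() if v is not None]
    if not available:
        raise ValueError("at least one modality must be present")

    # Unit-normalize prototypes so dot products are cosine similarities.
    unit_protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)

    # Each available modality casts a soft vote over the prototypes.
    votes = []
    for v in available:
        query = v / np.linalg.norm(v)
        votes.append(softmax(unit_protos @ query / temperature))

    # Consensus is the average vote; the imputed embedding is the
    # consensus-weighted combination of prototypes.
    consensus = np.mean(votes, axis=0)
    imputed = consensus @ prototypes

    return {m: (v if v is not None else imputed) for m, v in tokens.items()}

# Usage: the pathology report is missing and is filled from the bank.
bank = np.eye(4)
patient = {"wsi": bank[0], "rna": bank[0], "report": None}
filled = impute_missing(patient, bank)
```

The imputation happens entirely in embedding space, which is the contrast the article draws with methods that try to regenerate the raw slide, expression profile, or report text.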
PRIME was pretrained on The Cancer Genome Atlas (TCGA) data across 32 cancer types without using survival labels, then evaluated on five cohorts for three tasks: overall survival prediction, 3-year mortality classification, and 3-year recurrence classification. PRIME achieved a C-index of 0.653 for survival prediction and an AUROC of 0.689 for mortality classification, outperforming existing methods by approximately 10%. The framework remained robust when modalities were deliberately withheld at test time, which supports its practical value in fragmented clinical settings where complete patient records are rare.
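For readers unfamiliar with the C-index reported above, it is the fraction of comparable patient pairs in which the patient with the earlier observed event was also assigned the higher predicted risk, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The following is a minimal sketch of Harrell's concordance index, assuming no tied event times; the function name and toy data are illustrative and unrelated to the study's cohorts.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data (simplified).

    times:  observed follow-up time per patient
    events: 1 if the event (death) was observed, 0 if censored
    risks:  predicted risk score, higher meaning worse prognosis
    Tied event times are not handled specially in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example: risk scores that rank the patients perfectly.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.5, 0.1]
score = concordance_index(times, events, risks)
```

Production code would normally use a tested survival-analysis library rather than this quadratic loop.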
The technical innovation lies in PRIME's two complementary pretraining objectives: inter-modality alignment, and post-fusion consistency under structured missingness augmentation. Together they allow the model to learn representations that remain predictive regardless of which data modalities are available at test time. The researchers also showed that PRIME supports parameter-efficient and label-efficient adaptation, meaning it can be fine-tuned with limited labeled data, another important advantage for medical applications where expert annotations are scarce and expensive.
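The two objectives can be sketched in simplified form. Everything below is an assumption made for illustration rather than the paper's actual losses: mean pooling stands in for the fusion module, cosine distance stands in for whatever alignment and consistency losses the authors use, and "structured missingness" is modeled as dropping whole modalities at random while keeping at least one.

```python
import numpy as np

def random_modality_mask(n_modalities, rng, keep_prob=0.5):
    """Sample which modalities stay available; always keep at least one."""
    mask = rng.random(n_modalities) < keep_prob
    if not mask.any():
        mask[rng.integers(n_modalities)] = True
    return mask

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse(embeddings, mask):
    """Hypothetical fusion: mean-pool the available modality embeddings."""
    return embeddings[mask].mean(axis=0)

def alignment_loss(embeddings):
    """Inter-modality alignment stand-in: one minus the mean pairwise
    cosine similarity between one patient's modality embeddings."""
    n = len(embeddings)
    sims = [cosine(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - float(np.mean(sims))

def consistency_loss(embeddings, mask):
    """Post-fusion consistency stand-in: the fused representation with
    modalities dropped should stay close to the full fused representation."""
    full = fuse(embeddings, np.ones(len(embeddings), dtype=bool))
    partial = fuse(embeddings, mask)
    return 1.0 - cosine(full, partial)

# Usage: one patient with three modality embeddings (rows) of dimension 8.
rng = np.random.default_rng(0)
patient = rng.normal(size=(3, 8))
mask = random_modality_mask(3, rng)
total = alignment_loss(patient) + consistency_loss(patient, mask)
```

Minimizing the second term over many random masks pushes the fused representation to stay stable when modalities are dropped, which is the property the article credits for robustness at test time.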
- Achieves a 0.653 C-index on overall survival prediction and 0.689 AUROC on 3-year mortality classification using incomplete multimodal data
- Handles missing histopathology, gene expression, or pathology reports through semantic imputation rather than reconstruction
- Pretrained on 32 cancer types from The Cancer Genome Atlas without survival labels for scalable adaptation
Why It Matters
Enables accurate cancer prognosis predictions using real-world, incomplete medical records where patients often lack some test results.