Research & Papers

Learning Representations from Incomplete EHR Data with Dual-Masked Autoencoding

New model processes incomplete EHR data without imputation, beating XGBoost and DuETT on clinical tasks.

Deep Dive

Researchers from MIT and Harvard developed AID-MAE, a dual-masked autoencoder that learns directly from incomplete electronic health records. It applies two masks: one marking naturally missing values, and one that deliberately hides observed values to serve as reconstruction targets. The encoder processes only the unmasked tokens. The model outperformed XGBoost and DuETT baselines across multiple clinical prediction tasks on two datasets, and its embeddings naturally stratify patient cohorts without requiring imputation first.
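The dual-masking idea can be sketched in a few lines. This is a minimal illustration of the masking logic only (not the authors' implementation, and the function name and masking fraction are assumptions): naturally missing values are excluded outright, a fraction of the observed values is hidden as reconstruction targets, and the encoder would see only what remains.

```python
import numpy as np

def dual_mask(x, hide_frac=0.25, rng=None):
    """Hypothetical sketch of dual masking for one incomplete record.

    x: 1-D array of features, with NaN marking naturally missing values.
    Returns the indices the encoder may see and the indices of observed
    values hidden for the reconstruction objective.
    """
    rng = np.random.default_rng(rng)
    observed = np.flatnonzero(~np.isnan(x))          # mask 1: natural missingness
    n_hide = int(round(hide_frac * observed.size))   # mask 2: artificial hiding
    hidden = rng.choice(observed, size=n_hide, replace=False)
    visible = np.setdiff1d(observed, hidden)         # encoder processes only these
    return visible, hidden

# Toy record: two measurements naturally missing.
x = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])
visible, hidden = dual_mask(x, hide_frac=0.25, rng=0)

# The encoder never sees NaNs or hidden tokens; the reconstruction loss
# would be computed only on the hidden (observed-but-masked) positions.
assert np.all(~np.isnan(x[visible]))
assert np.all(~np.isnan(x[hidden]))
```

Because the loss is evaluated only on values that were actually observed, the model never needs imputed placeholders for the naturally missing entries.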

Why It Matters

Enables more accurate clinical predictions from real-world, messy patient data where measurements are irregular and often missing.