Learning Representations from Incomplete EHR Data with Dual-Masked Autoencoding
New model processes incomplete EHR data without imputation, beating XGBoost and DuETT on clinical tasks.
Researchers from MIT and Harvard developed AID-MAE, a dual-masked autoencoder that learns directly from incomplete electronic health records. It uses two masks—one marking naturally missing values, and one that additionally hides observed values to serve as reconstruction targets—and its encoder processes only the unmasked tokens. The model outperformed XGBoost and DuETT baselines across multiple clinical prediction tasks on two datasets, producing embeddings that naturally stratify patient cohorts without an imputation step.
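The dual-masking idea can be illustrated with a minimal sketch. This is a hypothetical implementation, not the paper's released code: the function name, the NaN convention for natural missingness, and the masking ratio are all illustrative assumptions. It shows how the two masks partition a feature vector so that only visible tokens reach the encoder, while artificially hidden values become reconstruction targets.

```python
import numpy as np

def dual_mask(x, mask_ratio=0.5, rng=None):
    """Illustrative dual masking (hypothetical; not the authors' code).

    x          : 1-D array of feature values; NaN marks naturally missing entries.
    mask_ratio : fraction of *observed* values to hide as reconstruction targets.
    Returns boolean masks (natural_missing, artificial, visible).
    """
    rng = np.random.default_rng(rng)
    natural_missing = np.isnan(x)            # mask 1: values never recorded
    observed = ~natural_missing
    obs_idx = np.flatnonzero(observed)
    n_hide = int(round(mask_ratio * obs_idx.size))
    hidden = rng.choice(obs_idx, size=n_hide, replace=False)
    artificial = np.zeros_like(natural_missing)
    artificial[hidden] = True                # mask 2: observed values hidden for training
    visible = observed & ~artificial         # encoder sees only these tokens
    return natural_missing, artificial, visible
```

During pretraining, the reconstruction loss would be computed only on the artificially masked positions, since naturally missing values have no ground truth; at inference, the artificial mask is simply empty.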
Why It Matters
Enables more accurate clinical predictions from real-world, messy patient data where measurements are irregular and often missing.