Nationwide EHR-Based Chronic Rhinosinusitis Prediction Using Demographic-Stratified Models
AI model distills ~110K medical codes into 100 features to flag sinus disease from two years of pre-diagnosis records.
A team led by Sicong Chang and Yidan Shen has developed a machine learning framework that predicts chronic rhinosinusitis (CRS) from two years of pre-diagnostic electronic health record (EHR) data in the nationwide *All of Us* Research Program. CRS is a common inflammatory condition that is often misdiagnosed as allergic rhinitis, which delays treatment and raises healthcare costs. Prior models relied on single-institution data with limited generalizability. This work addresses that limitation with a diverse, nationwide cohort and a hybrid feature-selection pipeline that combines prevalence-based screening with model-based importance ranking to reduce ~110,000 medical codes to 100 interpretable features.
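The two-stage selection described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the prevalence threshold, and the toy diagnosis codes are all hypothetical, and the default importance score is a simple stand-in (the gap in code prevalence between cases and controls) rather than importances from a trained model. A model-based ranker could be passed in through `importance_fn` instead.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Set

Scores = Dict[str, float]


def prevalence_screen(patient_codes: Sequence[Set[str]], min_prevalence: float) -> List[str]:
    """Stage 1: keep codes recorded for at least `min_prevalence` of patients."""
    n = len(patient_codes)
    counts = Counter(code for codes in patient_codes for code in codes)
    return sorted(c for c, k in counts.items() if k / n >= min_prevalence)


def prevalence_gap(patient_codes: Sequence[Set[str]], labels: Sequence[int],
                   candidates: List[str]) -> Scores:
    """Stand-in importance (an assumption): |P(code | case) - P(code | control)|."""
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("need at least one case and one control")
    scores: Scores = {}
    for code in candidates:
        in_case = sum(1 for codes, y in zip(patient_codes, labels) if y == 1 and code in codes)
        in_ctrl = sum(1 for codes, y in zip(patient_codes, labels) if y == 0 and code in codes)
        scores[code] = abs(in_case / n_case - in_ctrl / n_ctrl)
    return scores


def hybrid_select(patient_codes: Sequence[Set[str]], labels: Sequence[int],
                  min_prevalence: float = 0.01, top_k: int = 100,
                  importance_fn: Callable[..., Scores] = prevalence_gap) -> List[str]:
    """Stage 1 prevalence screen, then stage 2 ranking by importance; keep the top_k codes."""
    candidates = prevalence_screen(patient_codes, min_prevalence)
    scores = importance_fn(patient_codes, labels, candidates)
    # Sort by descending score; break ties alphabetically so the result is deterministic.
    ranked = sorted(candidates, key=lambda c: (-scores[c], c))
    return ranked[:top_k]


# Toy cohort: one set of codes per patient (illustrative codes, not from the study).
patients = [
    {"J01.90", "R09.81"},
    {"J01.90", "J30.9"},
    {"J30.9"},
    {"I10"},
    {"I10", "J30.9"},
    {"Z00.00", "R09.81"},
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = later diagnosed with CRS, 0 = control

selected = hybrid_select(patients, labels, min_prevalence=0.3, top_k=2)
print(selected)
```

In this toy cohort the screen drops the code seen in only one of six patients, and the ranking then keeps the two codes whose prevalence differs most between cases and controls. At the scale reported in the article, the same structure would start from roughly 110,000 candidates and keep 100.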
To capture demographic heterogeneity, the team trained separate models for six adult subgroups defined by sex and life stage (e.g., younger males, older females), each with subgroup-specific hyperparameter tuning. The overall area under the ROC curve (AUC) reached 0.8461, outperforming the best baseline by 0.0168, which implies a best-baseline AUC of about 0.8293. The study has been accepted at IEEE EMBC 2026 and is available on arXiv. The authors emphasize that routinely collected EHR data can support population-representative CRS risk stratification, enabling earlier triage and referral prioritization in primary care settings without requiring specialized clinical data.
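A rough sketch of demographic stratification follows: each patient is routed to a sex-by-life-stage subgroup, one model is fit per subgroup, and an AUC is computed over the pooled scores. Everything specific here is an assumption for illustration. The article does not give the age bands, so the cutoffs below are hypothetical, as are the field names and the `toy_trainer` placeholder, and the source does not say how the overall AUC was aggregated.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, object]
Model = Callable[[Row], float]

# Hypothetical life-stage cutoffs; the actual age bands are not stated in the source.
LIFE_STAGES = [(18, 39, "younger"), (40, 64, "middle"), (65, 200, "older")]


def subgroup_key(sex: str, age: int) -> Tuple[str, str]:
    """Map a patient to one of 2 sexes x 3 life stages = 6 subgroups."""
    for lo, hi, name in LIFE_STAGES:
        if lo <= age <= hi:
            return (sex, name)
    raise ValueError(f"age {age} is outside the adult range")


def split_by_subgroup(rows: Sequence[Row]) -> Dict[Tuple[str, str], List[Row]]:
    groups: Dict[Tuple[str, str], List[Row]] = defaultdict(list)
    for row in rows:
        groups[subgroup_key(row["sex"], row["age"])].append(row)
    return dict(groups)


def fit_stratified(rows: Sequence[Row], train_fn: Callable[[List[Row]], Model]):
    """Fit one model per subgroup; train_fn is where subgroup-specific tuning would go."""
    return {key: train_fn(group) for key, group in split_by_subgroup(rows).items()}


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def toy_trainer(group: List[Row]) -> Model:
    """Placeholder model: subgroup base rate plus a bump per sinus-related visit."""
    base = sum(r["crs"] for r in group) / len(group)
    return lambda r: base + 0.1 * r["n_sinus_visits"]


rows = [
    {"sex": "male", "age": 25, "n_sinus_visits": 3, "crs": 1},
    {"sex": "male", "age": 30, "n_sinus_visits": 0, "crs": 0},
    {"sex": "female", "age": 70, "n_sinus_visits": 2, "crs": 1},
    {"sex": "female", "age": 68, "n_sinus_visits": 1, "crs": 0},
]

models = fit_stratified(rows, toy_trainer)
# Scored on the training rows only to keep the toy short; a real evaluation needs held-out data.
scores = [models[subgroup_key(r["sex"], r["age"])](r) for r in rows]
overall = auc([r["crs"] for r in rows], scores)
print(sorted(models), overall)
```

Keeping the trainer as a parameter means each subgroup can receive its own hyperparameter search without changing the routing logic.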
- Used nationwide EHR data from the *All of Us* Research Program (not single-institution).
- Hybrid feature selection compressed ~110,000 candidate codes into 100 interpretable features.
- Achieved AUC 0.8461 with demographic-stratified models across six adult subgroups.
Why It Matters
Enables earlier, population-scale CRS screening from routine EHR data, which could reduce misdiagnosis and healthcare costs.