Bayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data
New Bayesian framework reduces imputation errors and quantifies uncertainty for healthcare and environmental data.
A team of researchers has introduced Bayes-MICE, a Bayesian framework that adapts the classic Multiple Imputation by Chained Equations (MICE) method to time-series data. The approach, detailed in a new arXiv paper, tackles the pervasive problem of missing values in sequential data from fields like healthcare (PhysioNet) and environmental monitoring (AirQuality). By performing Bayesian inference with Markov Chain Monte Carlo (MCMC) sampling, Bayes-MICE does more than generate plausible replacements for missing data points: it also quantifies the uncertainty in both the model's parameters and the imputed values themselves. Unlike deterministic or frequentist imputation methods, each imputed value comes with a posterior distribution rather than only a point estimate, which gives analysts a more statistically rigorous measure of imputation error.
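To make the idea of uncertainty-aware imputation concrete, here is a minimal sketch, not the paper's implementation. It assumes a simple linear regression with known noise level `sigma` and a Gaussian prior, samples the coefficients with Random Walk Metropolis, and then draws from the posterior predictive distribution to impute missing targets. All function names, the prior scale, and the step size are illustrative assumptions.

```python
import numpy as np

def log_posterior(beta, X, y, sigma=1.0, prior_scale=10.0):
    """Log posterior (up to a constant): Gaussian likelihood plus Gaussian prior."""
    resid = y - X @ beta
    return -0.5 * resid @ resid / sigma**2 - 0.5 * beta @ beta / prior_scale**2

def rwm_sample(X, y, sigma=1.0, n_iter=4000, step=0.05, seed=0):
    """Random Walk Metropolis draws of the regression coefficients."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    logp = log_posterior(beta, X, y, sigma)
    draws = np.empty((n_iter, X.shape[1]))
    for i in range(n_iter):
        proposal = beta + step * rng.standard_normal(beta.shape)
        logp_prop = log_posterior(proposal, X, y, sigma)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < logp_prop - logp:
            beta, logp = proposal, logp_prop
        draws[i] = beta
    return draws

def impute_with_uncertainty(X_obs, y_obs, X_mis, sigma=1.0, burn_in=1000, seed=0):
    """Return posterior predictive mean and 95% interval for each missing target."""
    draws = rwm_sample(X_obs, y_obs, sigma, seed=seed)[burn_in:]
    rng = np.random.default_rng(seed + 1)
    # Each predictive draw combines parameter uncertainty with observation noise.
    noise = sigma * rng.standard_normal((draws.shape[0], X_mis.shape[0]))
    pred = draws @ X_mis.T + noise
    lower, upper = np.percentile(pred, [2.5, 97.5], axis=0)
    return pred.mean(axis=0), lower, upper

# Illustrative usage on synthetic data (hypothetical, not from the paper).
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 200)
X_obs = np.column_stack([np.ones_like(x), x])
y_obs = 2.0 + 3.0 * x + rng.standard_normal(x.size)
X_mis = np.array([[1.0, 0.5]])  # one row whose target is missing
mean, lower, upper = impute_with_uncertainty(X_obs, y_obs, X_mis)
print(f"imputed {mean[0]:.2f}, 95% interval [{lower[0]:.2f}, {upper[0]:.2f}]")
```

The interval width reflects both sources of uncertainty named above: spread in the sampled coefficients and noise in the imputed value itself.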
Key technical innovations include a temporally informed initialization process and the incorporation of time-lagged features, so that the model respects the sequential dependencies inherent in time series. The researchers evaluated two MCMC samplers, Random Walk Metropolis (RWM) and the Metropolis-Adjusted Langevin Algorithm (MALA), and found that MALA converged faster while achieving comparable accuracy and exploring the posterior distribution more consistently. In the reported experiments, Bayes-MICE reduced imputation errors across all variables compared to baseline methods. The framework balances improved accuracy with practical efficiency, making it a compelling tool for analysts who need both reliable filled datasets and a clear understanding of how much confidence to place in those imputations.
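The ingredients described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: linear interpolation stands in for the temporally informed initialization, the model is a linear regression on lagged values with known noise level, and the names `temporal_init`, `lagged_design`, and `mala_step` are hypothetical.

```python
import numpy as np

def temporal_init(series):
    """Fill NaNs by linear interpolation over time (an assumed stand-in
    for the paper's temporally informed initialization)."""
    series = np.asarray(series, dtype=float)
    idx = np.arange(series.size)
    observed = ~np.isnan(series)
    return np.interp(idx, idx[observed], series[observed])

def lagged_design(series, n_lags=2):
    """Build rows [1, y[t-1], ..., y[t-n_lags]] with targets y[t]."""
    n = series.size
    cols = [series[n_lags - k : n - k] for k in range(1, n_lags + 1)]
    X = np.column_stack([np.ones(n - n_lags)] + cols)
    return X, series[n_lags:]

def log_post_and_grad(beta, X, y, sigma=1.0, prior_scale=10.0):
    """Log posterior and its gradient for Gaussian likelihood and prior."""
    resid = y - X @ beta
    logp = -0.5 * resid @ resid / sigma**2 - 0.5 * beta @ beta / prior_scale**2
    grad = X.T @ resid / sigma**2 - beta / prior_scale**2
    return logp, grad

def mala_step(beta, X, y, eps, rng):
    """One MALA update: gradient-guided proposal plus Metropolis correction.
    Dropping the 0.5 * eps**2 * grad drift terms would give plain RWM."""
    logp, grad = log_post_and_grad(beta, X, y)
    mean_fwd = beta + 0.5 * eps**2 * grad
    prop = mean_fwd + eps * rng.standard_normal(beta.shape)
    logp_prop, grad_prop = log_post_and_grad(prop, X, y)
    mean_rev = prop + 0.5 * eps**2 * grad_prop
    # Proposal is asymmetric, so both directions enter the acceptance ratio.
    log_q_fwd = -np.sum((prop - mean_fwd) ** 2) / (2 * eps**2)
    log_q_rev = -np.sum((beta - mean_rev) ** 2) / (2 * eps**2)
    log_alpha = logp_prop + log_q_rev - logp - log_q_fwd
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return beta, False

def run_mala(X, y, n_iter=3000, eps=0.03, seed=0):
    """Run a MALA chain from zero; return draws and acceptance rate."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    draws = np.empty((n_iter, X.shape[1]))
    accepted = 0
    for i in range(n_iter):
        beta, ok = mala_step(beta, X, y, eps, rng)
        accepted += ok
        draws[i] = beta
    return draws, accepted / n_iter
```

A typical flow would be `filled = temporal_init(raw)`, then `X, y = lagged_design(filled, n_lags=1)`, then `draws, rate = run_mala(X, y)`. The step size `eps` is an assumption that suits series with values of order one; larger-scale data would need rescaling or a smaller step.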
- Extends the standard MICE method with a Bayesian framework using MCMC sampling (RWM & MALA) to quantify uncertainty.
- Reduces imputation errors on real-world datasets (AirQuality, PhysioNet) and respects time-series structure with lagged features.
- Finds that the MALA sampler converges faster than RWM with comparable accuracy, making the framework practical for environmental and clinical data analysis.
Why It Matters
Provides data scientists with more reliable, uncertainty-aware imputations for critical time-series analysis in healthcare and environmental research.