Smart Ensemble Learning Framework for Predicting Groundwater Heavy Metal Pollution
New AI framework gives more reliable maps of heavy metal contamination in groundwater than models fit to raw-scale data.
A team of researchers led by T. Ansah-Narh has published a new machine learning framework for predicting groundwater heavy metal pollution in the Densu Basin, Ghana. According to the authors, traditional methods fail to capture the statistical complexity and spatial heterogeneity of pollution indicators, especially the Heavy Metal Pollution Index (HPI), which is often skewed and driven by correlated contaminants. The study integrates response transformations (raw, log, and Gaussian copula) with nested cross-validated ensemble learning across six algorithms: support vector machine regression (SVM), k-nearest neighbours (k-NN), CART, Elastic Net, kernel ridge regression, and a stacked Lasso ensemble.
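The combination of a response transformation with nested cross-validation and a stacked ensemble can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the authors' code: `QuantileTransformer` with a normal output is used as a stand-in for the Gaussian copula transformation (whose exact construction is not described here), and the hyperparameter grid, fold counts, and base-learner settings are all assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic, right-skewed response standing in for HPI (not study data).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = np.exp(X @ np.array([0.8, 0.5, 0.0, 0.0]) + 0.2 * rng.normal(size=150))

# Five base learners; a Lasso meta-learner stacks their out-of-fold predictions.
base_learners = [
    ("svm", make_pipeline(StandardScaler(), SVR())),
    ("knn", make_pipeline(StandardScaler(), KNeighborsRegressor())),
    ("cart", DecisionTreeRegressor(max_depth=4, random_state=0)),
    ("enet", ElasticNet(alpha=0.01)),
    ("krr", KernelRidge(kernel="rbf", alpha=1.0)),
]
stack = StackingRegressor(
    estimators=base_learners, final_estimator=Lasso(alpha=0.01), cv=3
)

# Fit on a normal-score-transformed response; predictions are mapped back
# to the original scale automatically. (Assumed stand-in for the copula step.)
model = TransformedTargetRegressor(
    regressor=stack,
    transformer=QuantileTransformer(n_quantiles=50, output_distribution="normal"),
    check_inverse=False,
)

# Nested CV: the inner loop tunes the meta-learner, the outer loop estimates
# generalization error on folds never seen during tuning.
inner = KFold(n_splits=3, shuffle=True, random_state=1)
outer = KFold(n_splits=5, shuffle=True, random_state=2)
search = GridSearchCV(
    model,
    param_grid={"regressor__final_estimator__alpha": [0.001, 0.01, 0.1]},
    cv=inner,
    scoring="r2",
)
scores = cross_val_score(search, X, y, cv=outer, scoring="r2")
print("outer-fold R2:", np.round(scores, 3))
```

Because `TransformedTargetRegressor` back-transforms its predictions, the R² values printed here are on the original response scale, which may differ from how the study reports its metrics.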
Raw-scale models produced deceptively high fits (Elastic Net and the stacked ensemble both reached R²≈1.0), a sign of over-optimism rather than genuine predictive skill. The log transformation stabilized variance, with SVM achieving R²=0.93 and RMSE=0.18, and k-NN reaching R²=0.92. The Gaussian copula transformation delivered the most reliable results: the stacked ensemble scored R²=0.96 with an RMSE of 0.19, and the other learners also maintained high accuracy. Copula-based models additionally produced spatially plausible pollution maps and better-behaved residual distributions. DBSCAN clustering indicated that iron (Fe) and manganese (Mn) are the primary contributors to HPI, consistent with regional hydrogeochemistry.
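How the study linked DBSCAN clusters to individual metals is not detailed here. One plausible approach, sketched below, is to cluster samples on their metal concentrations and then rank metals by how strongly their cluster means differ. The data are synthetic and deliberately constructed so that two metals separate the groups; the metal list, `eps`, and `min_samples` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

metals = ["Fe", "Mn", "Pb", "Cd"]
rng = np.random.default_rng(42)

# Synthetic log10 concentrations (NOT study data): a background group and a
# group with elevated Fe and Mn, built in purely to illustrate the diagnostic.
background = rng.normal(loc=-1.0, scale=0.1, size=(60, 4))
elevated = rng.normal(loc=-1.0, scale=0.1, size=(60, 4))
elevated[:, :2] += 1.5  # raise Fe and Mn only

X = np.vstack([background, elevated])

# Density-based clustering; the label -1 marks noise points.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
cluster_ids = sorted(set(labels) - {-1})

# Rank metals by the spread of their per-cluster means: metals whose means
# differ most across clusters are the ones driving the grouping.
means = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])
spread = means.max(axis=0) - means.min(axis=0)
ranked = [metals[i] for i in np.argsort(spread)[::-1]]

print("clusters found:", len(cluster_ids))
print("metals ranked by between-cluster spread:", ranked)
```

On real data the concentrations would typically be standardized first, and `eps` would need tuning (for example from a k-distance plot) rather than being fixed in advance.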
The study acknowledges limitations, including reliance on random (not spatial) cross-validation and basin-specific scope. Future work should explore spatial validation and applications to other geological settings. Accepted for publication in Earth Systems and Environment (2026), this work demonstrates that distribution-aware ensembles combined with clustering diagnostics offer robust, interpretable tools for groundwater contamination monitoring.
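The difference between random and spatial cross-validation can be illustrated with a simple block scheme: samples are assigned to grid cells, and whole cells are held out together so that test points are never immediate neighbours of training points. The sketch below uses scikit-learn's `GroupKFold` on synthetic coordinates; the 10 km extent, 5 km block size, and fold count are hypothetical and not taken from the study.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 100

# Hypothetical sample locations in a 10 km x 10 km area (synthetic).
coords = rng.uniform(0.0, 10.0, size=(n, 2))

# Assign every sample to a square spatial block of side block_size.
block_size = 5.0
n_cols = int(np.ceil(10.0 / block_size))
col = np.floor(coords[:, 0] / block_size).astype(int)
row = np.floor(coords[:, 1] / block_size).astype(int)
blocks = row * n_cols + col

# Placeholder predictors and response; only the fold structure matters here.
X = rng.normal(size=(n, 3))
y = rng.normal(size=n)

# GroupKFold keeps each block entirely in train or entirely in test.
folds = list(GroupKFold(n_splits=4).split(X, y, groups=blocks))
for train_idx, test_idx in folds:
    held_out = set(blocks[test_idx])
    assert held_out.isdisjoint(blocks[train_idx])
    print("held-out blocks:", sorted(int(b) for b in held_out),
          "| test size:", len(test_idx))
```

Passing such a splitter as the outer `cv` of a nested scheme would usually give lower, more conservative scores than random folds when the response is spatially autocorrelated.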
- Gaussian copula stacked ensemble achieved R²=0.96 and RMSE=0.19 for HPI prediction
- Log transformation stabilized variance, with SVM reaching R²=0.93, while raw-scale fits near R²≈1.0 were flagged as over-optimistic
- DBSCAN clustering identified iron and manganese as primary HPI contributors, matching regional hydrogeochemistry
Why It Matters
Distribution-aware machine learning enables accurate, interpretable groundwater contamination monitoring and improves environmental risk assessment.