[R] From Garbage to Gold: A Formal Proof that GIGO Fails for High-Dimensional Data with Latent Structure — with a Connection to Benign Overfitting Prerequisites
A 120-page proof overturns GIGO for data with latent structure, showing that noisy, high-dimensional datasets can outperform meticulously cleaned ones.
In a significant theoretical challenge to the long-held 'Garbage In, Garbage Out' (GIGO) axiom, researcher Terry St. John and colleagues have formally proven that for data generated by a latent hierarchical structure, collecting more variables—even noisy ones—is a superior strategy to meticulously cleaning a smaller set. The 120-page paper, titled 'From Garbage to Gold,' distinguishes between two types of noise: 'Predictor Error' (addressable by cleaning) and 'Structural Uncertainty' (an irreducible ambiguity in the data-generating process). The core finding is that while cleaning hits a hard limit imposed by Structural Uncertainty, expanding the predictor set with distinct proxies for the latent variables does not, making a 'Breadth' strategy asymptotically dominant.
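To make the Depth-versus-Breadth argument concrete, here is a minimal toy simulation. It is not from the paper: the single-latent-variable setup, the noise scales, and the split of each proxy's error into a structural deviation (zeta) and a removable predictor error (eta) are all assumptions chosen for illustration. Cleaning the lone proxy removes eta but leaves zeta, so its test error plateaus; regressing on many uncleaned proxies lets the shared latent signal dominate and pushes the error below that plateau.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 200  # number of observed proxy variables

def simulate(n):
    """One latent variable S drives Y; each observed column is a distinct proxy
    of S carrying a structural deviation (zeta, which survives cleaning) plus
    predictor error (eta, which cleaning removes)."""
    S = rng.normal(size=n)
    Y = S + 0.3 * rng.normal(size=n)              # irreducible outcome noise
    zeta = rng.normal(scale=0.8, size=(n, p))
    eta = rng.normal(scale=0.8, size=(n, p))
    X_clean = S[:, None] + zeta                   # the best that cleaning can deliver
    return Y, X_clean, X_clean + eta              # outcome, cleaned proxies, raw proxies

Y_tr, C_tr, X_tr = simulate(2_000)
Y_te, C_te, X_te = simulate(2_000)

def test_mse(A_tr, A_te):
    """Fit ordinary least squares on the training block, score on the test block."""
    beta, *_ = np.linalg.lstsq(A_tr, Y_tr, rcond=None)
    return np.mean((Y_te - A_te @ beta) ** 2)

depth = test_mse(C_tr[:, [0]], C_te[:, [0]])      # "Depth": one proxy, perfectly cleaned
breadth = test_mse(X_tr, X_te)                    # "Breadth": all p proxies, zero cleaning

print(f"test MSE, one perfectly cleaned proxy: {depth:.3f}")
print(f"test MSE, {p} uncleaned proxies:       {breadth:.3f}")
```

Ordinary least squares here stands in for whatever estimator the paper actually analyzes; the only point is that the error floor of the cleaned single proxy is set by the structural term, which no amount of cleaning touches, while the breadth arm averages that term away.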
The research provides a generative explanation for the conditions behind Benign Overfitting, where models that perfectly fit training data can still generalize well. It shows that the required low-rank covariance structure emerges naturally from the proposed latent hierarchy (Y ← S¹ → S² → S'²). The theory was empirically grounded by a prior clinical result from Cleveland Clinic Abu Dhabi, which achieved a 0.909 AUC for predicting stroke using thousands of uncurated electronic health record variables without manual cleaning—a result existing theory couldn't explain. The authors provide heuristics for assessing if a dataset has the necessary latent structure and are explicit about the framework's limitations, marking a nuanced but powerful shift in how to approach noisy, high-dimensional data problems.
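The benign-overfitting connection can be sanity-checked with a toy version of the hierarchy. The sketch below simulates only the predictor side of Y ← S¹ → S² → S'², and the dimensions, weights, and noise scales are all invented: proxies generated from a few root factors give a covariance spectrum with a small number of dominant eigenvalues over a near-flat noise tail, which is the low-effective-rank shape invoked in benign-overfitting analyses.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m, p = 2_000, 3, 10, 500    # samples, dim(S1), dim(S2), observed proxies

# Hypothetical stand-in for the predictor side of the hierarchy S1 -> S2 -> S'2.
S1 = rng.normal(size=(n, k))                                        # root latent factors
S2 = S1 @ rng.normal(size=(k, m)) + 0.2 * rng.normal(size=(n, m))   # intermediate layer
X = S2 @ rng.normal(size=(m, p)) + rng.normal(size=(n, p))          # observed proxies S'2

# Spectrum of the sample covariance: a few large eigenvalues inherited from the
# low-dimensional latent layers, sitting on a near-flat noise tail (the
# low-effective-rank shape assumed in benign-overfitting analyses).
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print("top 5 eigenvalues :", np.round(eigvals[:5], 1))
print("median eigenvalue :", round(float(np.median(eigvals)), 2))
```

A spectrum without that sharp gap between a few large eigenvalues and the tail would be one plausible warning sign, in the spirit of the authors' heuristics, that the latent-structure assumption, and with it the Breadth strategy, may not apply to a given dataset.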
- Formally proves 'Breadth' (adding predictors) beats 'Depth' (cleaning) for data with latent hierarchical structure by overcoming 'Structural Uncertainty'.
- Connects to Benign Overfitting theory, providing a data-architectural reason for its empirical success beyond abstract math.
- Motivated by a real-world clinical study achieving 0.909 AUC with uncurated EHR data, challenging traditional data cleaning dogma.
Why It Matters
This could revolutionize data collection strategies in fields like healthcare and genomics by prioritizing variable breadth over costly, labor-intensive data cleaning.