A formal proof of when and why "Garbage In, Garbage Out" is wrong
A formal proof, 2.5 years in the making, shows that dirty, uncurated data can beat clean data in complex systems.
A new research paper by Terry St. John provides a formal mathematical proof challenging the long-held 'Garbage In, Garbage Out' (GIGO) principle in data science. The result of 2.5 years of work, the paper explains a modern paradox: why models trained on vast, dirty, uncurated datasets often outperform those built on meticulously cleaned data. The core finding is that for data from complex systems (medical patients, financial markets, sensor networks), the critical factor is not the cleanliness of individual variables but whether a diverse set of observable 'proxies' provides complete coverage of the underlying, unobservable 'latent states' driving the system. The paper introduces the concept of 'Structural Uncertainty', an irreducible ambiguity that arises when proxies are too few, regardless of how clean they are.
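To make the coverage intuition concrete, here is a minimal simulation sketch (my own illustration, not code from the paper): three latent drivers jointly generate an outcome; a "clean" feature set observes only one driver with almost no noise, while a "dirty" feature set observes all three drivers through heavily noisy proxies.

```python
# Minimal sketch (not from the paper): coverage of latent drivers vs. per-variable
# cleanliness. Three latent drivers generate the outcome. The "clean" portfolio
# has two near-noiseless proxies, but both track only driver 0. The "dirty"
# portfolio has twelve noisy proxies, four per driver, covering all drivers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, k = 20_000, 3                                  # samples, latent drivers

Z = rng.normal(size=(n, k))                       # unobservable latent states
y = (Z.sum(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(int)

# Scenario A: two very clean proxies, both observing only the first driver.
X_clean = Z[:, [0]] + 0.05 * rng.normal(size=(n, 2))

# Scenario B: twelve noisy proxies, four per driver (full coverage).
X_dirty = np.repeat(Z, 4, axis=1) + rng.normal(size=(n, 4 * k))

for name, X in [("clean, narrow", X_clean), ("dirty, full coverage", X_dirty)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name:>22}: AUC = {roc_auc_score(y_te, p):.3f}")
```

On a typical run the noisy-but-complete portfolio wins by a clear margin: the two clean columns simply carry no information about drivers 1 and 2, which is the structural uncertainty the paper formalizes.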
This theoretical framework has been validated at scale in a real-world healthcare application. At Cleveland Clinic Abu Dhabi, researchers applied the principle to predict strokes and heart attacks using electronic health records (EHR) from over 558,000 patients across 3.4 million patient-months. Instead of manually cleaning thousands of variables, they used the raw, uncurated data. The resulting model achieved an Area Under the Curve (AUC) of 0.909, substantially outperforming the clinical risk models cardiologists currently use. This points to a profound shift: in complex domains, data quality should be treated as a portfolio-level architectural property, meaning diverse and redundant coverage of the latent drivers, rather than as an item-level cleanliness property. The work suggests that enterprises investing heavily in data cleaning may be misallocating resources when dealing with complex, system-generated data.
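The portfolio-level view can be phrased as a diagnostic. The sketch below (again my own hypothetical illustration, workable only because the simulation exposes the latent drivers) asks, for each latent driver, how much of it the feature portfolio as a whole can linearly recover; drivers with near-zero R² are exactly where structural uncertainty lives, and no amount of column-by-column cleaning removes it.

```python
# Hedged illustration of coverage as a portfolio-level property: per-driver R^2
# of the feature set as a whole. Latent drivers are known here only because this
# is a simulation; in real data they are unobservable by definition.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, k = 20_000, 3
Z = rng.normal(size=(n, k))                       # latent drivers

portfolios = {
    "clean, narrow": Z[:, [0]] + 0.05 * rng.normal(size=(n, 2)),
    "dirty, full coverage": np.repeat(Z, 4, axis=1) + rng.normal(size=(n, 4 * k)),
}

for name, X in portfolios.items():
    r2 = [LinearRegression().fit(X, Z[:, j]).score(X, Z[:, j]) for j in range(k)]
    print(f"{name:>22}: per-driver R^2 =", np.round(r2, 2))
# The narrow portfolio recovers driver 0 almost perfectly and drivers 1-2 not
# at all; the redundant noisy portfolio recovers all three reasonably well.
```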
- Formally proves 'Garbage In, Garbage Out' can fail for data from complex systems, introducing the concept of 'Structural Uncertainty'.
- Validated with a real-world healthcare model using uncurated EHR data from 558k patients, achieving a 0.909 AUC for stroke/heart attack prediction.
- Shifts the data quality paradigm from cleaning individual variables to ensuring diverse proxy coverage of underlying latent states.
Why It Matters
Could save enterprises billions in unnecessary data cleaning costs and enable more accurate models from existing messy data.