Information-Geometric Decomposition of Generalization Error in Unsupervised Learning
A new paper uses information geometry to break down why unsupervised models fail to generalize.
Researcher Gilhan Kim has introduced a novel theoretical framework that provides an exact mathematical decomposition of generalization error in unsupervised machine learning. The work, titled 'Information-Geometric Decomposition of Generalization Error in Unsupervised Learning,' uses concepts from information geometry, specifically the generalized Pythagorean theorem and a dual e-mixture variance identity, to split the Kullback–Leibler (KL) generalization error into three distinct, non-negative components: model error, data bias, and variance. The decomposition is exact for any model class that is 'e-flat' (flat with respect to the exponential connection, a property enjoyed by exponential families), making it a more rigorous diagnostic tool than previous approximations.
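In schematic form, the split mirrors a bias–variance decomposition carried out at the level of distributions. The notation below is illustrative rather than taken from the paper: p* is the true distribution, θ† the best model in the class, θ̄ a dataset-averaged model, and θ̂(D) the model fitted to a particular sample D.

```latex
\mathbb{E}_{D}\left[ D_{\mathrm{KL}}\big(p^{*} \,\|\, q_{\hat\theta(D)}\big) \right]
  = \underbrace{D_{\mathrm{KL}}\big(p^{*} \,\|\, q_{\theta^{\dagger}}\big)}_{\text{model error}}
  + \underbrace{D_{\mathrm{KL}}\big(q_{\theta^{\dagger}} \,\|\, q_{\bar\theta}\big)}_{\text{data bias}}
  + \underbrace{\mathbb{E}_{D}\left[ D_{\mathrm{KL}}\big(q_{\bar\theta} \,\|\, q_{\hat\theta(D)}\big) \right]}_{\text{variance}}
```

For an e-flat model class, the generalized Pythagorean theorem makes the cross terms between these KL 'legs' vanish, which is what allows a split of this shape to hold exactly rather than only approximately.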
As a concrete demonstration, Kim applies the framework to ε-PCA, a regularized variant of principal component analysis in which sample eigenvalues below a noise floor ε are pinned to ε. Although this model is not inherently e-flat, Kim shows it can be reformulated so that the decomposition applies on isotropic Gaussian data. The analysis yields closed-form expressions for each error component and shows that the optimal model retains exactly those empirical eigenvalues exceeding the noise floor (λ*_cut = ε), a cutoff that balances the reduction in model error against the data bias incurred.
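The pinning-and-cutoff mechanism is simple enough to sketch directly. The minimal Python sketch below (function name and test setup are illustrative, not from the paper) builds the ε-PCA covariance estimate by pinning sample eigenvalues below ε to the floor, so the λ*_cut = ε rule means a component is retained exactly when its empirical eigenvalue exceeds ε:

```python
import numpy as np

def epsilon_pca_covariance(X, eps):
    """Illustrative sketch of the eps-PCA covariance estimate.

    Eigenvalues of the sample covariance below the noise floor eps are
    pinned to eps; under the lambda*_cut = eps rule, a component counts
    as 'retained' exactly when its empirical eigenvalue exceeds eps.
    """
    n, d = X.shape
    S = X.T @ X / n                      # sample covariance (data assumed centered)
    evals, evecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    pinned = np.maximum(evals, eps)      # pin small eigenvalues to the floor
    retained = int(np.sum(evals > eps))  # rank kept by the cutoff rule
    Sigma_hat = (evecs * pinned) @ evecs.T
    return Sigma_hat, retained

# Isotropic Gaussian data, matching the paper's test case in spirit
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))       # n=500 samples, d=50, true covariance = I
Sigma_hat, k = epsilon_pca_covariance(X, eps=0.8)
print(f"retained components: {k} of 50")
```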
The framework further predicts a three-regime phase diagram for model behavior ('retain-all,' 'interior,' and 'collapse'), with the regimes separated by statistical thresholds such as the Marchenko–Pastur bulk edge and a computable collapse threshold ε*(α), where α is the dimension-to-sample-size ratio. All theoretical claims are verified numerically. The result moves beyond vague notions of 'overfitting' by providing a precise, component-wise accounting of why a model fails to generalize to new data.
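To see where a given ε lands, one can compare it against the Marchenko–Pastur bulk edges σ²(1 ± √α)². A rough Python sketch follows; note that the paper's actual collapse threshold ε*(α) is not reproduced here, so the upper bulk edge serves only as an illustrative stand-in for the interior/collapse boundary:

```python
import numpy as np

def mp_edges(alpha, sigma2=1.0):
    """Marchenko-Pastur bulk edges for a d/n ratio alpha (true covariance sigma2*I)."""
    return sigma2 * (1 - np.sqrt(alpha)) ** 2, sigma2 * (1 + np.sqrt(alpha)) ** 2

def classify_regime(eps, alpha, sigma2=1.0):
    """Rough regime classification using only the MP bulk edges.

    The paper's computable threshold eps*(alpha) is not reproduced here;
    the upper edge is an illustrative stand-in, so the interior/collapse
    boundary below is approximate.
    """
    lo, hi = mp_edges(alpha, sigma2)
    if eps <= lo:
        return "retain-all"   # floor below the bulk: no eigenvalue is pinned
    if eps >= hi:
        return "collapse"     # floor above the bulk: every eigenvalue is pinned
    return "interior"         # floor inside the bulk: partial truncation

for eps in (0.05, 1.0, 3.5):
    print(eps, classify_regime(eps, alpha=0.5))
```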
- Exact decomposition of KL generalization error into model error, data bias, and variance using information geometry.
- Applied to ε-PCA, proving that optimal rank selection occurs at λ*_cut = ε by balancing model-error reduction against data-bias cost.
- Reveals a three-regime phase diagram (retain-all, interior, collapse) with analytically computable thresholds, validated numerically.
Why It Matters
Provides a precise mathematical blueprint for diagnosing why unsupervised models fail to generalize, moving beyond heuristic explanations of overfitting.