Canonical Diffusion Method Quantifies Mode Separation Beyond PCA and Entropy
Two new scores, SSA and DA, reveal hidden structure in image generation and molecular dynamics
Mode separation, meaning how sharply a probability distribution fragments into barrier-separated clusters, is a fundamental geometric property that existing metrics such as differential entropy and PCA fail to capture in high dimensions. Tolkovsky, Meidler, and Zuk (arXiv:2605.08777) propose a canonical construction: the unique reversible diffusion that has the target density as its stationary distribution and a constant scalar diffusion coefficient. From the autocovariance matrix of this diffusion, they extract two readouts: SSA (Sum of Squared Autocorrelations), a scalar measure sensitive to barriers, and DA (Dominant Autocorrelation directions), which orders directions by metastability rather than variance. Under an isotropic-Gaussian null, they derive a closed-form spectrum that generalizes Marchenko–Pastur and provides an analytic upper edge for selecting the lag used by DA. The framework requires only samples and a score function, so it scales to high dimensions via Tweedie’s identity and pretrained score-based generative models.
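To make the construction concrete, here is a minimal sketch on a toy problem. It assumes the canonical diffusion is overdamped Langevin dynamics, dX = ∇log p(X) dt + √2 dW, which is the reversible diffusion with stationary density p once the constant diffusion coefficient is set to 1. The SSA-like and DA-like readouts below (sum of squared eigenvalues of the whitened lagged autocorrelation matrix, and its eigenvectors sorted by eigenvalue) are illustrative guesses at the definitions, not the authors' estimators. The function names, the hand-picked lag, and the two-component Gaussian mixture are all hypothetical.

```python
import numpy as np


def mixture_score(x, centers, sigma):
    """Score (gradient of log density) of an equal-weight isotropic Gaussian mixture."""
    diff = centers[None, :, :] - x[:, None, :]          # shape (n, k, d)
    logw = -0.5 * np.sum(diff**2, axis=2) / sigma**2    # unnormalized log responsibilities
    logw -= logw.max(axis=1, keepdims=True)             # stabilize the softmax
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkd->nd", w, diff) / sigma**2


def langevin_pairs(centers, sigma, lag, n=5000, dt=0.01, seed=0):
    """Draw stationary samples x0, then evolve them for time `lag` (Euler-Maruyama)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(len(centers), size=n)
    x0 = centers[labels] + sigma * rng.standard_normal((n, centers.shape[1]))
    xt = x0.copy()
    for _ in range(int(round(lag / dt))):
        noise = rng.standard_normal(xt.shape)
        xt = xt + dt * mixture_score(xt, centers, sigma) + np.sqrt(2.0 * dt) * noise
    return x0, xt


def autocorrelation_readouts(x0, xt):
    """Assumed readouts: whitened lagged autocorrelation spectrum and directions."""
    mu = 0.5 * (x0.mean(axis=0) + xt.mean(axis=0))
    a, b = x0 - mu, xt - mu
    cov = 0.5 * (a.T @ a + b.T @ b) / len(a)
    lagged = a.T @ b / len(a)
    lagged = 0.5 * (lagged + lagged.T)                  # reversibility implies symmetry
    evals, evecs = np.linalg.eigh(cov)
    whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T
    rho, vecs = np.linalg.eigh(whitener @ lagged @ whitener)
    order = np.argsort(rho)[::-1]                       # most persistent direction first
    rho, vecs = rho[order], vecs[:, order]
    directions = whitener @ vecs                        # projection vectors in data space
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    return float(np.sum(rho**2)), rho, directions


lag = 1.0                                               # chosen by hand for this toy case
separated = np.array([[-3.0, 0.0], [3.0, 0.0]])         # two modes split along the x-axis
single = np.zeros((1, 2))                               # isotropic Gaussian reference

ssa_mix, rho_mix, dirs_mix = autocorrelation_readouts(*langevin_pairs(separated, 1.0, lag))
ssa_null, rho_null, _ = autocorrelation_readouts(*langevin_pairs(single, 1.0, lag))

print("SSA-like score, separated mixture:", ssa_mix)
print("SSA-like score, single Gaussian:  ", ssa_null)
print("Leading DA-like direction:        ", dirs_mix[:, 0])
```

In this toy case the barrier between the two modes keeps the x-coordinate correlated across the lag, while the y-coordinate relaxes like an ordinary Gaussian. The leading direction therefore lines up with the axis that separates the modes, and the scalar score is larger than for the single Gaussian.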
The authors demonstrate the method in three settings. First, on synthetic Gaussian mixtures, SSA closely tracks mutual information. Second, on SDXL text-to-image generations, SSA and DA identify structural variations (e.g., background vs. object clustering) that entropy and PCA miss. Third, on molecular dynamics of alanine dipeptide, DA recovers the known slow backbone dihedrals from static samples alone, with no trajectory data. The authors present these results as evidence that canonical diffusion metrics offer a principled tool for clustering analysis and outlier detection in high dimensions, and a bridge between generative modeling and scientific discovery.
- SSA (Sum of Squared Autocorrelations) is a scalar barrier-sensitive measure of mode separation that tracks mutual information in Gaussian mixtures.
- DA (Dominant Autocorrelation directions) orders linear projections by metastability instead of variance, recovering slow degrees of freedom in alanine dipeptide from static samples.
- The method scales to high dimensions via Tweedie’s identity and pretrained score-based models; on SDXL text-to-image generations, SSA and DA pick up structural variation that entropy and PCA miss.
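Tweedie's identity is what lets a denoiser stand in for the score. For Gaussian noise of standard deviation σ, the score of the noised density equals (E[x0 | x] − x) / σ², so any model that estimates the posterior mean also yields a score estimate. The sketch below checks this against a case with a closed-form answer. The function name `score_from_denoiser` is hypothetical, and the way a particular pretrained model exposes its denoiser (noise prediction, latent space, noise schedule) is not covered here.

```python
import numpy as np


def score_from_denoiser(x_noisy, x_denoised, sigma):
    """Tweedie's identity: score of the noised density = (E[x0 | x] - x) / sigma^2."""
    return (x_denoised - x_noisy) / sigma**2


# Closed-form check: x0 ~ N(0, s^2 I) observed as x = x0 + sigma * noise.
s, sigma = 2.0, 0.5
rng = np.random.default_rng(0)
x_noisy = np.sqrt(s**2 + sigma**2) * rng.standard_normal((5, 3))

posterior_mean = s**2 / (s**2 + sigma**2) * x_noisy     # exact E[x0 | x] for this prior
exact_score = -x_noisy / (s**2 + sigma**2)              # score of N(0, (s^2 + sigma^2) I)
tweedie_score = score_from_denoiser(x_noisy, posterior_mean, sigma)

print(np.allclose(tweedie_score, exact_score))
```

With a learned denoiser in place of `posterior_mean`, the same one-line conversion would give an approximate score, with accuracy limited by the denoiser.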
Why It Matters
The method offers a principled, scalable way to discover hidden structure in high-dimensional distributions, which could improve clustering and scientific analysis.