Research & Papers

Dependence Fidelity and Downstream Inference Stability in Generative Models

New paper shows AI-generated data can reverse regression signs despite perfect marginal matches.

Deep Dive

A new research paper by Nazia Riasat, presented at MathAI 2026, exposes a fundamental flaw in how we evaluate generative AI models like diffusion models and variational autoencoders. Current metrics focus on whether synthetic data matches each variable's individual distribution (marginal fidelity), but the research proves this is insufficient: a model can achieve perfect marginal matches while failing entirely to preserve the relationships between variables, a property the paper terms 'dependence fidelity.' This hidden failure has serious consequences for statistical inference.
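The failure mode is easy to reproduce in miniature. The sketch below is illustrative only, not a construction from the paper: a stand-in "real" dataset and a marginally perfect "synthetic" dataset both have exact standard-normal marginals, so a per-variable Kolmogorov-Smirnov test cannot tell them apart, yet their correlation structures are opposite in sign.

```python
# Illustrative sketch (not the paper's construction): two bivariate Gaussian
# datasets with identical N(0,1) marginals but opposite dependence structures.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50_000

def sample_bivariate(rho, size):
    """Draw from a bivariate normal with N(0,1) marginals and correlation rho."""
    cov = [[1.0, rho], [rho, 1.0]]
    return rng.multivariate_normal([0.0, 0.0], cov, size=size)

real = sample_bivariate(+0.8, n)    # stand-in for real data
synth = sample_bivariate(-0.8, n)   # marginals identical, dependence flipped

# Marginal fidelity: univariate two-sample tests see no difference.
for j in range(2):
    pval = stats.ks_2samp(real[:, j], synth[:, j]).pvalue
    print(f"variable {j}: KS two-sample p-value = {pval:.3f}")

# Dependence fidelity: the joint structure is opposite in sign.
print("real correlation: ", np.corrcoef(real, rowvar=False)[0, 1])   # ~ +0.8
print("synth correlation:", np.corrcoef(synth, rowvar=False)[0, 1])  # ~ -0.8
```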

Riasat establishes three core findings that should alarm anyone using synthetic data for analysis. First, distributions can match all univariate marginals exactly while exhibiting substantially different dependence structures. Second, this dependence divergence causes quantitative instability in downstream tasks, including sign reversals in regression coefficients—meaning a positive relationship in real data could appear negative in synthetic data. Third, controlling covariance-level dependence divergence is essential for stable performance in dependence-sensitive tasks like principal component analysis. The paper provides synthetic constructions showing how these failures lead to incorrect conclusions despite identical marginal distributions.
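The second and third findings, the regression sign reversal and PCA instability, can be demonstrated with the same marginal-matched pair. Again, this is a hypothetical sketch in the spirit of the paper's constructions, not code from it.

```python
# Illustrative sketch: identical marginals, yet the OLS slope flips sign and
# the leading principal component rotates between "real" and "synthetic" data.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

def sample_bivariate(rho, size):
    """Bivariate normal with N(0,1) marginals and correlation rho."""
    return rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=size)

real = sample_bivariate(+0.8, n)
synth = sample_bivariate(-0.8, n)

def ols_slope(data):
    """Slope from regressing the second column on the first (with intercept)."""
    slope, _intercept = np.polyfit(data[:, 0], data[:, 1], deg=1)
    return slope

print("slope on real data: ", ols_slope(real))   # ~ +0.8
print("slope on synth data:", ols_slope(synth))  # ~ -0.8, a sign reversal

def leading_pc(data):
    """Unit eigenvector of the sample covariance with the largest eigenvalue."""
    vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return vecs[:, np.argmax(vals)]

# The first principal axis rotates by 90 degrees between the two datasets
# (eigenvector signs are arbitrary): roughly [1, 1]/sqrt(2) vs. [1, -1]/sqrt(2).
print("leading PC, real: ", leading_pc(real))
print("leading PC, synth:", leading_pc(synth))
```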

The implications are particularly significant for fields relying on synthetic data for privacy-preserving analytics, data augmentation, or simulation. The research demonstrates that tasks governed by covariance structure, including many common statistical analyses, require a covariance-level diagnostic of this kind. However, the authors note that tasks depending on higher-order dependence, such as tail-event estimation, would need even richer criteria beyond covariance matching. This work establishes dependence fidelity as a necessary practical criterion for evaluating whether generative distributions preserve joint structure beyond univariate marginals.
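The summary does not spell out the paper's exact divergence measure, so the helper below is a hypothetical instantiation of a covariance-level check: the Frobenius distance between real and synthetic sample covariance matrices, reported alongside the usual marginal metrics.

```python
# Hypothetical covariance-level diagnostic (one plausible instantiation;
# the paper's exact metric may differ).
import numpy as np

def covariance_divergence(real: np.ndarray, synth: np.ndarray) -> float:
    """Frobenius distance between sample covariance matrices.

    A value near zero is necessary for dependence-sensitive tasks such as
    regression and PCA to behave similarly on real and synthetic data. It is
    not sufficient for tasks driven by higher-order dependence (e.g.,
    tail-event estimation), which need richer criteria.
    """
    cov_real = np.cov(real, rowvar=False)
    cov_synth = np.cov(synth, rowvar=False)
    return float(np.linalg.norm(cov_real - cov_synth, ord="fro"))
```

Applied to the two datasets in the earlier sketch, this returns roughly 2.26 (the off-diagonal covariances differ by 1.6 in each of two entries) even though every marginal test passes.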

For AI developers, this means current evaluation methods are incomplete. The paper offers both a theoretical framework and practical demonstrations showing why preserving multivariate relationships matters more than previously recognized. As generative models produce increasingly realistic synthetic data, this research gives crucial guidance for ensuring that the data remains useful for scientific inference and decision-making, not just superficially realistic.

Key Points
  • Generative models can match individual variable distributions perfectly while failing to preserve relationships between variables
  • This dependence failure causes sign reversals in regression coefficients and incorrect PCA results despite identical marginal behavior
  • The paper introduces covariance-level dependence fidelity as a necessary diagnostic for diffusion models and VAEs in statistical tasks

Why It Matters

Ensures synthetic data used for research and analytics produces statistically valid conclusions, not just superficially realistic outputs.