Research & Papers

Scalable Kernel-Based Distances for Statistical Inference and Integration

New 'kernel quantile discrepancies' offer a competitive alternative to the widely used MMD for comparing probability distributions.

Deep Dive

A new PhD thesis by Masha Naslidnyk, titled 'Scalable Kernel-Based Distances for Statistical Inference and Integration,' makes significant contributions to the mathematical tools used for comparing probability distributions in machine learning and statistics. The work focuses on kernel-based distances, particularly the widely used Maximum Mean Discrepancy (MMD), which is favored for its computational tractability in tasks such as two-sample testing and generative model evaluation. The thesis has two parts: the first delivers improved estimators for MMD in simulation-based inference and conditional expectation estimation, while the second introduces a novel family of distances designed to overcome MMD's inherent limitations.
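Part of MMD's appeal is that it has a simple, tractable estimator computable directly from samples. As a minimal sketch (not code from the thesis), the standard unbiased estimator of squared MMD with a Gaussian kernel can be written as:

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    sq_dists = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x and y."""
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Drop diagonal terms so the within-sample averages are unbiased.
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    term_xy = kxy.sum() / (m * n)
    return term_xx + term_yy - 2 * term_xy

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
diff = mmd2_unbiased(rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2)))
# `same` fluctuates around zero; `diff` is clearly positive.
```

In a two-sample test, this statistic would be compared against a permutation-based null distribution; the quadratic cost in the sample size is one motivation for the fast approximations the thesis benchmarks against.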

The core technical innovation is the introduction of 'kernel quantile discrepancies' (KQDs), a new class of distances that move beyond comparing only the mean embeddings of distributions, as MMD does. Through both theoretical analysis and empirical study, Naslidnyk demonstrates that these KQDs offer a competitive and often superior alternative to MMD and its fast approximations. This advancement addresses key pitfalls in distribution comparison, potentially leading to more robust statistical inference, better-calibrated models, and more efficient integration methods. The work provides a rigorous foundation for practitioners to choose and implement distances that encode specific desired properties like robustness or smoothness in their machine learning pipelines.
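The exact construction of KQDs is not reproduced in this summary. As a purely illustrative sketch of the underlying idea — comparing entire quantile functions of one-dimensional projections rather than a single mean summary — one can compute a sliced quantile discrepancy; this is a stand-in in the spirit of sliced quantile distances, not the thesis's actual definition:

```python
import numpy as np

def sliced_quantile_discrepancy(x, y, n_dirs=50, n_q=20, seed=0):
    """Illustrative quantile-based discrepancy (NOT the thesis's KQD):
    project both samples onto random unit directions and compare the
    empirical quantile functions of the resulting 1-D distributions."""
    rng = np.random.default_rng(seed)
    qs = np.linspace(0.05, 0.95, n_q)
    total = 0.0
    for _ in range(n_dirs):
        w = rng.normal(size=x.shape[1])
        w /= np.linalg.norm(w)          # random direction on the sphere
        qx = np.quantile(x @ w, qs)     # quantile function of projected x
        qy = np.quantile(y @ w, qs)     # quantile function of projected y
        total += np.mean((qx - qy) ** 2)
    return total / n_dirs

rng = np.random.default_rng(1)
same = sliced_quantile_discrepancy(rng.normal(0, 1, (500, 2)),
                                   rng.normal(0, 1, (500, 2)))
spread = sliced_quantile_discrepancy(rng.normal(0, 1, (500, 2)),
                                     rng.normal(0, 2, (500, 2)))
# Two distributions with identical means but different spread are
# separated by the quantile comparison; `spread` exceeds `same`.
```

The point of the sketch is the contrast with a mean-only summary: comparing whole quantile profiles captures distributional properties (spread, tails, asymmetry) that a single embedding mean compresses away.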

Key Points
  • Introduces 'kernel quantile discrepancies' (KQDs), a new family of distances that compete with the standard Maximum Mean Discrepancy (MMD).
  • Provides improved, theoretically sound estimators for MMD to enhance simulation-based inference and conditional expectation calculations.
  • Demonstrates through theory and experiments that the new methods address known pitfalls of MMD for tasks like statistical integration and calibration.

Why It Matters

Provides better mathematical tools for comparing data distributions, which is foundational for evaluating AI models, robust statistical testing, and simulation.