Research & Papers

A Semi-Supervised Kernel Two-Sample Test

Leverages abundant unlabeled data to dramatically boost hypothesis test power.

Deep Dive

Two-sample testing—determining whether two populations differ—is a fundamental statistical task. Standard tests ignore covariate data (e.g., age, location) that is often abundant and unlabeled. A new paper from Lee, Shekhar, and Kim introduces a kernel-based semi-supervised approach that incorporates these covariates without breaking exchangeability under the null hypothesis. Their test statistic is asymptotically normal, making calibration straightforward via standard normal quantiles. This design avoids the complex permutation or bootstrap procedures required by other covariate-aware tests.
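The paper's own statistic incorporates covariates, but the calibration idea — a studentized kernel statistic that is asymptotically N(0, 1) under the null, so the rejection threshold is just a standard normal quantile — can be illustrated with a simpler classic: the linear-time MMD estimator of Gretton et al. A minimal NumPy sketch (Gaussian kernel and bandwidth are illustrative choices; this is not the paper's semi-supervised estimator):

```python
import math
import numpy as np

def gauss_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel evaluated row-wise on paired samples a, b."""
    sq_dists = np.sum((a - b) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def linear_time_mmd_test(x, y, bandwidth=1.0):
    """Studentized linear-time MMD statistic; asymptotically N(0, 1) under
    H0: P = Q, so the p-value comes straight from the normal tail —
    no permutations or bootstrap needed."""
    m = min(len(x), len(y)) // 2
    x1, x2 = x[0:2 * m:2], x[1:2 * m:2]  # disjoint pairs from sample X
    y1, y2 = y[0:2 * m:2], y[1:2 * m:2]  # disjoint pairs from sample Y
    h = (gauss_kernel(x1, x2, bandwidth) + gauss_kernel(y1, y2, bandwidth)
         - gauss_kernel(x1, y2, bandwidth) - gauss_kernel(x2, y1, bandwidth))
    stat = math.sqrt(m) * h.mean() / h.std(ddof=1)
    p_value = 0.5 * math.erfc(stat / math.sqrt(2.0))  # one-sided N(0,1) tail
    return stat, p_value

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(2000, 1))
y = rng.normal(1.0, 1.0, size=(2000, 1))  # mean-shifted alternative
stat, p = linear_time_mmd_test(x, y)
```

The paper's contribution is, roughly, to build a covariate-aware statistic with this same plug-in-a-quantile calibration property while gaining power from the unlabeled data.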

The method's power advantage is striking: in simulations, its power often far exceeds that of existing kernel tests that disregard covariates. The authors formally prove consistency against both fixed alternatives (any true difference between the populations) and local alternatives (differences that shrink toward the null as sample sizes grow). For practitioners, this means more reliable detection of group differences using readily available unlabeled data—common in fields like genomics, A/B testing, and causal inference. The code and full proofs are available on arXiv.

Key Points
  • Uses abundant unlabeled covariate data to improve two-sample test power, unlike standard tests
  • Achieves asymptotic normality under the null, enabling simple calibration without permutations
  • Formally proven consistent against fixed and local alternatives; simulations show large gains over existing kernel tests

Why It Matters

Lets data scientists detect group differences more accurately using unlabeled covariates, improving hypothesis testing in A/B tests and ML workflows.