Counterfactual Edits Show Why Deep Two-Sample Tests Reject Null

Can editing MRI scans reveal hidden group differences? This new method generates counterfactual edits that show what a deep two-sample test is responding to.

Deep Dive

Two-sample testing is essential for detecting distributional differences, but classical tests falter on high-dimensional data like images. Deep two-sample tests improve sensitivity by learning informative representations, yet they offer little insight into which features drive the rejection of the null hypothesis. This lack of interpretability limits their use in scientific discovery, where understanding the 'why' behind a statistical result is often as important as the result itself.

To bridge this gap, Lai et al. introduce a counterfactual explanation framework that combines a diffusion autoencoder with a pretrained deep two-sample test. The method optimizes a maximum mean discrepancy (MMD) objective in the test's representation space, generating plausible edits that move a sample from a source group toward a target distribution. It quantifies distribution-level effects via changes in the test statistic and p-values. On synthetic shape datasets and two MRI cohorts, the counterfactual transformations consistently increased p-values, indicating the edited source set became statistically closer to the target. The edits are kept minimal using LPIPS distance, and on MRI, the localized changes matched known anatomical differences between cohorts, providing interpretable evidence of the features underlying group differences.
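
To make the "p-values rise after editing" reading concrete, here is a minimal sketch of an MMD two-sample test with a permutation p-value. It is not the authors' implementation: the Gaussian toy features, the fixed RBF bandwidth, the function names, and the mean-shift "edit" are all assumptions chosen for illustration. In the paper the features come from a pretrained deep test and the edits from a diffusion autoencoder.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    # Squared Euclidean distance between every row of a and every row of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    # Biased (V-statistic) estimate of the squared MMD between samples x and y.
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

def permutation_p_value(x, y, n_perm=500, bandwidth=1.0, seed=0):
    # Null hypothesis: x and y come from the same distribution, so group
    # labels are exchangeable and can be shuffled to build a null distribution.
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y, bandwidth)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        stat = mmd2(pooled[idx[:n]], pooled[idx[n:]], bandwidth)
        count += int(stat >= observed)
    # The +1 terms keep the p-value strictly positive.
    return (count + 1) / (n_perm + 1)

# Toy stand-ins for the test's representation space (assumed, not from the paper).
rng = np.random.default_rng(42)
source = rng.normal(loc=1.0, size=(100, 5))
target = rng.normal(loc=0.0, size=(100, 5))

# Hypothetical "counterfactual edit": shift the source so its sample mean
# matches the target's. The real method learns edits instead.
edited = source - (source.mean(axis=0) - target.mean(axis=0))

p_before = permutation_p_value(source, target)
p_after = permutation_p_value(edited, target)
print(f"p-value before edit: {p_before:.3f}, after edit: {p_after:.3f}")
```

A small p-value before the edit means the test rejects the null that source and target share a distribution. A larger p-value after the edit means the test can no longer tell the edited source apart from the target as easily, which is the sense in which the edited set is "statistically closer".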

Key Points
  • Combines a diffusion autoencoder with MMD optimization to generate counterfactual edits in the test model's representation space.
  • Quantifies distribution-level impact through changes in p-values; edits consistently move source samples closer to the target distribution.
  • On MRI data, localized changes align with known anatomical differences between cohorts, supporting the method's interpretability on real-world data.

Why It Matters

Makes deep two-sample tests more transparent, helping researchers identify the features behind group differences in high-stakes fields like medicine.