Mean Testing under Truncation beyond Gaussian
Truncation can hide up to an ε-fraction of the probability mass, creating a sharp detectability limit.
In a new paper on arXiv, researchers Yuhao Wang, Roberto Imbuzeiro Oliveira, and Themis Gouleakis tackle an important but overlooked problem in statistics: how to test the mean of a distribution when some data is missing due to an unknown truncation mechanism. They consider a setting where samples are drawn from the conditional distribution given an unknown truncation set S that may hide up to an ε-fraction of the probability mass. This creates a systematic bias that degrades the ability to distinguish whether the true mean is zero (null) or some nonzero value α (alternative).
The team derives information-theoretic limits: they show that when the signal α falls below a bias threshold of order O(ν ε^(1-1/p)), where ν bounds the p-th directional moments, the null and alternative are fundamentally indistinguishable no matter how many samples are collected. This establishes a sharp "detectability floor." Above that floor, they propose a simple second-order test that achieves near-optimal sample complexity n = O(√d ||Σ_P|| / (α - 4ν ε^(1-1/p))^2), where Σ_P is the covariance matrix of the distribution. The analysis interpolates between the finite-moment, sub-Gaussian, and median-regular regimes.
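To make the detectability floor concrete, here is a minimal, hypothetical sketch of a mean test of this flavor. The paper's actual second-order test is not specified in this summary, so the decision rule below (comparing the norm of the empirical mean to a threshold between the bias floor 4ν ε^(1-1/p) and the signal α) is purely illustrative; the function name `mean_test` and all parameter choices are assumptions.

```python
import numpy as np

def mean_test(X, alpha, nu, eps, p):
    """Illustrative sketch (not the authors' test): reject H0 (mean = 0)
    when the empirical mean norm exceeds a threshold placed between the
    truncation-bias floor 4*nu*eps**(1 - 1/p) and the signal level alpha."""
    bias_floor = 4 * nu * eps ** (1 - 1 / p)
    threshold = (bias_floor + alpha) / 2  # midpoint between floor and signal
    stat = np.linalg.norm(X.mean(axis=0))
    return stat > threshold  # True -> reject the null

# Toy usage: d = 5, n = 2000, alternative shifts the mean by alpha along e1.
rng = np.random.default_rng(0)
d, n, alpha = 5, 2000, 1.0
X_null = rng.normal(0.0, 1.0, size=(n, d))
X_alt = X_null + alpha * np.eye(d)[0]
print(mean_test(X_null, alpha, nu=1.0, eps=0.01, p=2))  # False: mean near 0
print(mean_test(X_alt, alpha, nu=1.0, eps=0.01, p=2))   # True: signal above floor
```

Note that the test only makes sense when α sits strictly above the bias floor; below it, the paper shows no procedure can succeed regardless of sample size.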
Perhaps most striking is their finding of a structural escape: if the distribution satisfies a directional median regularity condition, the truncation bias improves from polynomial to linear order O(ε). In this regime, testing recovers the classical Θ(√d) rate while estimation still requires Θ(d) samples. The work provides a unified framework bridging several theoretical traditions and offers practical guidance for hypothesis testing in settings with missing or truncated data—common in surveys, econometrics, and machine learning pipelines.
- Proves a sharp detectability floor: if signal α < O(ν ε^(1-1/p)), hypotheses are indistinguishable even with infinite data.
- Achieves near-optimal sample complexity n = O(√d ||Σ_P|| / (α - 4ν ε^(1-1/p))^2) above the bias threshold.
- Under median regularity, bias improves to linear O(ε), enabling classical √d testing rates while estimation still requires Θ(d) samples.
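The gap between the two regimes in the bullets above is easy to see numerically. This small illustration (not a computation from the paper) compares the polynomial floor ε^(1-1/p) for p = 2 against the linear floor ε achieved under median regularity:

```python
# Illustration: truncation-bias floor scaling in eps under finite p-th
# moments (p = 2 here, so the floor is sqrt(eps)) versus the linear O(eps)
# floor available under directional median regularity.
p = 2
for eps in [0.1, 0.01, 0.001]:
    poly_floor = eps ** (1 - 1 / p)   # heavy-tailed regime: O(eps^(1-1/p))
    linear_floor = eps                # median-regular regime: O(eps)
    print(f"eps={eps}: polynomial floor {poly_floor:.3f}, linear floor {linear_floor:.3f}")
```

At ε = 0.01 with p = 2, the polynomial floor is 0.1 (ten times larger than the linear floor), so median regularity lets much weaker signals remain detectable.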
Why It Matters
This provides a rigorous theoretical foundation for mean testing with truncated data, directly impacting missing-data pipelines in ML and statistics.