Research & Papers

Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research

A new study reveals decades of flawed statistical practice in IR research.

Deep Dive

In a forthcoming ACM SIGIR 2026 paper, 'Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research,' Julián Urbano challenges the long-standing practice of using the Wilcoxon signed-rank test in Information Retrieval (IR) benchmarking. The test is often portrayed as a safer non-parametric alternative to the t-test because IR effectiveness metrics are not normally distributed, but Urbano argues that this reputation is undeserved and that the test is actually harmful. His systematic literature review reveals that statistics textbooks present the test's assumptions inconsistently, fueling widespread confusion about when the test is valid. Using TREC data, he demonstrates that the Wilcoxon test easily loses control of its Type I error rate in IR settings, giving researchers a false sense of safety while producing misleading significance results.
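The mechanism behind this loss of Type I error control is easy to reproduce in simulation. The signed-rank test assumes the paired differences are symmetrically distributed about zero under the null; when differences have mean zero but a skewed shape, the test rejects far more often than the nominal 5%. The sketch below is illustrative only, using synthetic skewed differences rather than the paper's TREC data:

```python
import numpy as np

rng = np.random.default_rng(0)

def wilcoxon_rejects(d, z_crit=1.96):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.
    Assumes continuous data (no ties, no exact zeros)."""
    n = len(d)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1   # ranks of |d|, 1..n
    w_pos = ranks[d > 0].sum()                      # rank sum of positive diffs
    mu = n * (n + 1) / 4
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return abs((w_pos - mu) / sigma) > z_crit

def t_rejects(d, t_crit=2.01):
    """Paired t-test; t_crit is the two-sided 5% critical value for df=49."""
    n = len(d)
    return abs(d.mean() / (d.std(ddof=1) / np.sqrt(n))) > t_crit

n_topics, reps = 50, 3000
wilcoxon_rate = t_rate = 0.0
for _ in range(reps):
    # mean-zero but right-skewed differences: a true null for the mean
    d = rng.exponential(1.0, n_topics) - 1.0
    wilcoxon_rate += wilcoxon_rejects(d) / reps
    t_rate += t_rejects(d) / reps

print(f"Wilcoxon rejection rate: {wilcoxon_rate:.3f}")  # well above 0.05
print(f"t-test rejection rate:   {t_rate:.3f}")
```

Because the shifted exponential has mean zero but a non-zero pseudomedian, the Wilcoxon test rejects a true null at several times the nominal rate, while the t-test stays close to 5%.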

Urbano's empirical demonstrations show that the Wilcoxon test's error control breaks down in typical IR evaluation scenarios, such as comparisons over small topic sets or with heavily skewed effectiveness-score distributions. The paper concludes that decades of routine misapplication have perpetuated a methodological crisis in IR, with researchers relying on a test that does not control its false-positive rate. Urbano recommends abandoning the Wilcoxon test entirely in favor of more robust alternatives, such as bootstrap or permutation tests, to improve the reliability of system comparisons. The work has immediate implications for IR conferences and publications, urging a shift in statistical practice to strengthen the field's methodological rigor.
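As a concrete illustration of the recommended direction, here is a minimal paired permutation (sign-flip) test sketch. The function name and the per-topic scores are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def paired_permutation_test(scores_a, scores_b, n_perm=10_000, rng=rng):
    """Sign-flip permutation test for a paired system comparison.

    Under H0 (no difference between systems), each per-topic difference
    is equally likely to carry either sign, so we randomly flip signs and
    count how often the permuted mean difference is at least as extreme
    as the observed one."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(d)))
    perm_means = np.abs((signs * d).mean(axis=1))
    # add-one smoothing keeps the estimated p-value away from exactly zero
    return (1 + (perm_means >= observed).sum()) / (n_perm + 1)

# hypothetical per-topic nDCG scores for two systems on 10 topics
a = np.array([0.52, 0.61, 0.48, 0.70, 0.55, 0.66, 0.43, 0.58, 0.62, 0.50])
b = np.array([0.50, 0.59, 0.49, 0.65, 0.54, 0.60, 0.44, 0.55, 0.60, 0.47])
p = paired_permutation_test(a, b)
print(f"permutation p-value: {p:.4f}")
```

Unlike the Wilcoxon test, the sign-flip test makes no symmetry or distributional assumption beyond exchangeability of signs under the null, which is why it is commonly cited as a safer default for paired system comparisons.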

Key Points
  • Urbano's systematic review finds inconsistencies in how statistics textbooks present Wilcoxon assumptions, fueling decades of misuse in IR.
  • Empirical analysis with TREC data shows the Wilcoxon test loses Type I error control, misleading researchers about system significance.
  • Paper recommends abandoning Wilcoxon in favor of bootstrap or permutation tests for more reliable IR benchmarking.
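The bootstrap alternative mentioned above can be sketched in the same spirit: resample topics with replacement, recentre the bootstrap means at zero to emulate the null hypothesis, and count how often a recentred mean is as extreme as the observed one. Function name and scores are again hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

def paired_bootstrap_test(scores_a, scores_b, n_boot=10_000, rng=rng):
    """Paired bootstrap test of the mean per-topic difference.

    The bootstrap distribution of the mean is shifted to zero so that it
    approximates the sampling distribution under the null hypothesis."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(d.mean())
    idx = rng.integers(0, len(d), size=(n_boot, len(d)))  # resampled topic indices
    null_means = d[idx].mean(axis=1) - d.mean()           # recentre at the null
    return (1 + (np.abs(null_means) >= observed).sum()) / (n_boot + 1)

# hypothetical per-topic AP scores for two systems on 10 topics
a = np.array([0.31, 0.45, 0.28, 0.52, 0.39, 0.47, 0.25, 0.41, 0.36, 0.30])
b = np.array([0.29, 0.40, 0.27, 0.49, 0.38, 0.41, 0.26, 0.37, 0.33, 0.28])
p = paired_bootstrap_test(a, b)
print(f"bootstrap p-value: {p:.4f}")
```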

Why It Matters

This challenges a statistical cornerstone in IR, urging a methodological shift to improve research reliability.