Research & Papers

Offline Evaluation Measures of Fairness in Recommender Systems

Fairness measures in AI recommenders may be broken—a new study reveals why.

Deep Dive

A new PhD thesis by Theresia Veronika Rampisela, published on arXiv (2604.25032), examines the theoretical and empirical weaknesses of the offline fairness evaluation measures used in recommender systems. The work addresses a critical gap: although many fairness metrics exist, covering user-, item-, and group-level fairness, they are often deployed without rigorous robustness checks. Rampisela identifies problems such as division-by-zero errors, score distributions that are hard to interpret, and cases where models can game a metric to appear fair. The thesis proposes novel evaluation approaches and measures that overcome these limitations, alongside practical guidelines for selecting appropriate metrics in real-world applications.
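
To make the division-by-zero failure mode concrete, here is a minimal sketch of a hypothetical group-exposure ratio metric (the names `exposure_ratio` and `exposure_ratio_safe` are illustrative, not measures from the thesis): the naive version silently produces an undefined score when neither item group receives any exposure, while the guarded variant forces that edge case to be handled explicitly.

```python
import numpy as np

def exposure_ratio(exp_a: np.ndarray, exp_b: np.ndarray) -> float:
    """Hypothetical fairness metric: ratio of mean exposure between two
    item groups (1.0 = equal exposure, 0.0 = one group invisible).
    Illustration only; not one of the measures analyzed in the thesis."""
    mean_a, mean_b = exp_a.mean(), exp_b.mean()
    # Fails when neither group is ever recommended: 0.0 / 0.0 yields nan
    # (with numpy floats) or ZeroDivisionError with plain Python numbers.
    return min(mean_a, mean_b) / max(mean_a, mean_b)

def exposure_ratio_safe(exp_a: np.ndarray, exp_b: np.ndarray) -> float:
    """Guarded variant: the degenerate case is handled explicitly instead
    of silently propagating nan into an evaluation report."""
    mean_a, mean_b = exp_a.mean(), exp_b.mean()
    if max(mean_a, mean_b) == 0.0:
        # No exposure at all: whether this counts as "fair" (1.0) or
        # "unfair" (0.0) is itself a design decision worth documenting.
        return float("nan")
    return min(mean_a, mean_b) / max(mean_a, mean_b)

print(exposure_ratio_safe(np.array([3.0, 5.0, 2.0]), np.array([0.0, 0.0])))  # 0.0
print(exposure_ratio_safe(np.array([0.0, 0.0]), np.array([0.0, 0.0])))       # nan
```

The point is not this particular guard, but that such edge cases go unnoticed when metrics are deployed without the robustness checks the thesis argues for.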

This research is timely given recent AI fairness legislation and the growing reliance on recommenders in domains such as e-commerce, streaming, and social media. By exposing how current metrics can mislead, the study helps developers and regulators avoid false confidence in system fairness. The proposed solutions aim to make offline evaluation more reliable, ensuring that fairness claims are backed by robust, interpretable measures. For practitioners, the guidelines offer a clear path to choosing metrics that match specific fairness definitions, whether for individual users, protected groups, or item exposure, reducing confusion and improving accountability in AI systems.

Key Points
  • Analyzes fairness measures across user-, item-, and group-level granularities
  • Identifies flaws such as division-by-zero errors, skewed score distributions, and low interpretability
  • Proposes new measures and usage guidelines for robust offline evaluation (illustrated in the sketch below)
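
As a flavor of such a guideline-style check, here is a minimal sketch that estimates a random-recommender baseline so raw scores can be read against a reference point rather than in isolation. It assumes Jain's fairness index as a stand-in metric (the thesis analyzes recommender-specific measures), and all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def jain_index(exposures: np.ndarray) -> float:
    """Jain's fairness index over per-item exposure counts: ranges from
    1/n (all exposure on one item) to 1.0 (perfectly even exposure).
    A stand-in metric for illustration, not one proposed in the thesis."""
    total = exposures.sum()
    if total == 0:
        return float("nan")  # undefined when nothing is recommended
    return total**2 / (len(exposures) * (exposures**2).sum())

# Estimate the score a *random* top-k recommender achieves.
n_items, k, n_users, n_trials = 1000, 10, 500, 50
baseline = []
for _ in range(n_trials):
    counts = np.zeros(n_items)
    for _ in range(n_users):
        counts[rng.choice(n_items, size=k, replace=False)] += 1.0
    baseline.append(jain_index(counts))

# A model's raw score (say, 0.85) is only interpretable against such
# reference points: if random slates already score similarly, the
# metric barely discriminates fair from arbitrary behavior.
print(f"random baseline: mean={np.mean(baseline):.3f} +/- {np.std(baseline):.3f}")
```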

Why It Matters

Helps AI developers and regulators avoid flawed fairness metrics, enabling more trustworthy and accountable recommender systems.