Research & Papers

From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Researchers systematically test 11 counterfactual explanation methods across 3 datasets and 6 recommender models.

Deep Dive

A research team from multiple institutions has published a comprehensive reproducibility study titled 'From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems.' The paper systematically re-implements and evaluates eleven state-of-the-art counterfactual explanation (CE) methods, spanning explainers native to recommender systems, such as LIME-RS, SHAP, PRINCE, ACCENT, LXR, and GREASE, as well as graph-based explainers originally designed for GNNs. The work addresses a critical problem in AI explainability: prior evaluations relied on heterogeneous protocols with different datasets, recommenders, and metrics, making fair comparisons impossible.
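
To make the core idea concrete: an item-level counterfactual explanation is a small set of a user's past interactions whose removal changes the model's recommendation. Below is a minimal sketch of one common greedy search for such a set; `score_items` is a hypothetical stand-in for any trained recommender that scores all items given a binary interaction vector, not an API from the paper or from the methods above.

    import numpy as np

    def top1(score_items, interactions):
        """Index of the highest-scored item the user has not interacted with."""
        scores = np.asarray(score_items(interactions), dtype=float).copy()
        scores[interactions > 0] = -np.inf  # never re-recommend seen items
        return int(np.argmax(scores))

    def greedy_counterfactual(score_items, interactions):
        """Greedily remove past interactions until the Top-1 recommendation flips.

        Returns the removed item indices (the counterfactual explanation), or
        an empty list if exhausting the history never changes the Top-1 item.
        """
        target = top1(score_items, interactions)
        perturbed = interactions.copy()
        removed = []
        while perturbed.sum() > 0:
            candidates = np.flatnonzero(perturbed)

            def target_score_without(i):
                trial = perturbed.copy()
                trial[i] = 0
                return score_items(trial)[target]

            # Drop the interaction whose removal hurts the target's score most.
            best = min(candidates, key=target_score_without)
            perturbed[best] = 0
            removed.append(int(best))
            if top1(score_items, perturbed) != target:
                return removed
        return []

The greedy loop is only illustrative; the benchmarked methods differ precisely in how they search for this set and in what they are allowed to perturb.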

To solve this, the team proposes a unified benchmarking framework that assesses explainers along three key dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Within this framework, they conducted extensive experiments on three real-world datasets and six representative recommender models, measuring effectiveness, sparsity, and computational complexity. The study also extends assessment beyond single-item (Top-1) explanations to more realistic Top-K list-level scenarios.
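
The jump from Top-1 to Top-K is easiest to see as code. Continuing the illustrative `score_items` stand-in from the sketch above (again an assumption, not the paper's exact protocol), a list-level check asks whether removing the explanation changes the Top-K list at all, and sparsity relates the explanation's size to the user's history:

    import numpy as np

    def topk(score_items, interactions, k):
        """The K highest-scored unseen items, as a set of item indices."""
        scores = np.asarray(score_items(interactions), dtype=float).copy()
        scores[interactions > 0] = -np.inf
        return set(np.argsort(-scores)[:k].tolist())

    def list_level_flip(score_items, interactions, explanation, k=10):
        """True if removing the explanation changes the Top-K list at all."""
        before = topk(score_items, interactions, k)
        perturbed = interactions.copy()
        perturbed[list(explanation)] = 0
        return topk(score_items, perturbed, k) != before

    def explanation_sparsity(explanation, interactions):
        """Fraction of the user's interaction history the explanation removes."""
        return len(explanation) / int(interactions.sum())

Averaging these per-user values gives the effectiveness and sparsity numbers the benchmark compares across methods.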

The results challenge earlier assumptions about CE methods. The trade-off between explanation effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under explicit formats. While explainer performance remained largely consistent between item-level and list-level evaluations, several graph-based explainers showed notable scalability limitations on large recommender graphs. These findings refine our understanding of which explanation methods are truly robust and practical for real-world deployment.

Key Points
  • Benchmarked 11 counterfactual explanation methods, including LIME-RS, SHAP, PRINCE, ACCENT, LXR, and GREASE, under a unified evaluation protocol
  • Found that effectiveness-sparsity trade-offs are highly dependent on method and evaluation setting, challenging previous conclusions
  • Identified scalability issues with several graph-based explainers on large recommender graphs

Why It Matters

Provides standardized benchmarks for AI explainability in recommender systems, helping developers choose the right tools for transparent AI.