Research & Papers

From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Researchers systematically test 11 counterfactual explanation methods across 3 datasets and 6 recommender models.

Deep Dive

A research team from multiple institutions has published a comprehensive reproducibility study titled 'From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems.' The paper systematically re-implements and evaluates eleven state-of-the-art counterfactual explanation (CE) methods, spanning explainers native to recommender systems, such as LIME-RS, SHAP, PRINCE, ACCENT, LXR, and GREASE, as well as graph-based explainers originally designed for GNNs. The work addresses a critical problem in AI explainability: prior evaluations relied on heterogeneous protocols with different datasets, recommenders, and metrics, making fair comparisons impossible.
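
To make the core idea concrete: an item-level counterfactual explanation is a small set of a user's past interactions whose removal changes the model's recommendation. Below is a minimal sketch of one common greedy search for such a set; `score_items` is a hypothetical stand-in for any trained recommender that scores all items given a binary interaction vector, not an API from the paper or from the methods above.

    import numpy as np

    def top1(score_items, interactions):
        """Index of the highest-scored item the user has not interacted with."""
        scores = np.asarray(score_items(interactions), dtype=float).copy()
        scores[interactions > 0] = -np.inf  # never re-recommend seen items
        return int(np.argmax(scores))

    def greedy_counterfactual(score_items, interactions):
        """Greedily remove past interactions until the Top-1 recommendation flips.

        Returns the removed item indices (the counterfactual explanation), or
        an empty list if exhausting the history never changes the Top-1 item.
        """
        target = top1(score_items, interactions)
        perturbed = interactions.copy()
        removed = []
        while perturbed.sum() > 0:
            candidates = np.flatnonzero(perturbed)

            def target_score_without(i):
                trial = perturbed.copy()
                trial[i] = 0
                return score_items(trial)[target]

            # Drop the interaction whose removal hurts the target's score most.
            best = min(candidates, key=target_score_without)
            perturbed[best] = 0
            removed.append(int(best))
            if top1(score_items, perturbed) != target:
                return removed
        return []

The greedy loop is only illustrative; the benchmarked methods differ precisely in how they search for this set and in what they are allowed to perturb.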

To solve this, the team proposes a unified benchmarking framework that assesses explainers along three key dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Within this framework, they conducted extensive experiments on three real-world datasets and six representative recommender models, measuring effectiveness, sparsity, and computational complexity. The study also extends assessment beyond single-item (Top-1) explanations to more realistic Top-K list-level scenarios.
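
The jump from Top-1 to Top-K is easiest to see as code. Continuing the illustrative `score_items` stand-in from the sketch above (again an assumption, not the paper's exact protocol), a list-level check asks whether removing the explanation changes the Top-K list at all, and sparsity relates the explanation's size to the user's history:

    import numpy as np

    def topk(score_items, interactions, k):
        """The K highest-scored unseen items, as a set of item indices."""
        scores = np.asarray(score_items(interactions), dtype=float).copy()
        scores[interactions > 0] = -np.inf
        return set(np.argsort(-scores)[:k].tolist())

    def list_level_flip(score_items, interactions, explanation, k=10):
        """True if removing the explanation changes the Top-K list at all."""
        before = topk(score_items, interactions, k)
        perturbed = interactions.copy()
        perturbed[list(explanation)] = 0
        return topk(score_items, perturbed, k) != before

    def explanation_sparsity(explanation, interactions):
        """Fraction of the user's interaction history the explanation removes."""
        return len(explanation) / int(interactions.sum())

Averaging these per-user values gives the effectiveness and sparsity numbers the benchmark compares across methods.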

The results challenge earlier assumptions about CE methods. The trade-off between explanation effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under explicit formats. While explainer performance remained largely consistent between item-level and list-level evaluations, several graph-based explainers showed notable scalability limitations on large recommender graphs. These findings refine our understanding of which explanation methods are truly robust and practical for real-world deployment.

Key Points
  • Benchmarked 11 counterfactual explanation methods, including LIME-RS, SHAP, PRINCE, ACCENT, LXR, and GREASE, under a unified evaluation protocol
  • Found that effectiveness-sparsity trade-offs are highly dependent on method and evaluation setting, challenging previous conclusions
  • Identified scalability issues with several graph-based explainers on large recommender graphs

Why It Matters

Provides standardized benchmarks for AI explainability in recommender systems, helping developers choose the right tools for transparent AI.