ABSTRACT
Negative item sampling in offline top-n recommendation evaluation has become increasingly widespread, but remains controversial. While several studies have warned against using sampled evaluation metrics on the grounds that they poorly approximate the full ranking (i.e., ranking against all negative items), others have highlighted their improved discriminative power and their potential to make evaluation more robust. Unfortunately, existing empirical studies on negative item sampling are based on relatively few methods (between 3 and 12) and therefore lack the statistical power to assess the impact of negative item sampling in practice.
In this article, we present preliminary findings from a comprehensive benchmarking study of negative item sampling based on 52 recommendation algorithms and 3 benchmark data sets. We show how the number of sampled negative items and different sampling strategies affect the consistency and discriminative power of sampled evaluation metrics. Furthermore, we investigate the impact of sparsity bias and popularity bias on the robustness of these metrics. In brief, we show that the optimal parameterization of negative item sampling depends on data set characteristics and the goals of the investigator, suggesting a need for greater transparency in related experimental design decisions.
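To make the evaluation protocol under discussion concrete, the following is a minimal sketch of a sampled metric: instead of ranking the held-out positive item against every non-interacted item (the full ranking), the positive is ranked against only n sampled negatives. The function name, signature, and uniform sampling strategy are illustrative assumptions, not the paper's implementation; popularity-weighted sampling would replace `rng.sample` with a draw proportional to item interaction counts.

```python
import random

def sampled_hit_rate_at_k(score_fn, positive_item, all_items,
                          interacted, n_negatives=100, k=10, rng=None):
    """Illustrative sampled Hit Rate@k (not the paper's code).

    Ranks the held-out positive item against n uniformly sampled
    negative items rather than the full item catalog.
    """
    rng = rng or random.Random(0)
    # Candidate negatives: items the user has not interacted with.
    candidates = [i for i in all_items
                  if i not in interacted and i != positive_item]
    negatives = rng.sample(candidates, n_negatives)
    # Rank positive + sampled negatives by model score (descending).
    ranked = sorted(negatives + [positive_item],
                    key=score_fn, reverse=True)
    rank = ranked.index(positive_item)
    return 1.0 if rank < k else 0.0
```

With n_negatives equal to the catalog size minus the user's interactions, this reduces to the full-ranking metric; shrinking n trades fidelity to the full ranking for lower cost, which is the tension the study examines.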