ABSTRACT
Negative item sampling in offline top-n recommendation evaluation has become increasingly widespread, but remains controversial. While several studies have warned against using sampled evaluation metrics on the grounds that they poorly approximate the full ranking (i.e., ranking against all negative items), others have highlighted their improved discriminative power and their potential to make evaluation more robust. Unfortunately, existing empirical studies on negative item sampling are based on relatively few methods (between 3 and 12) and therefore lack the statistical power to assess the impact of negative item sampling in practice.
In this article, we present preliminary findings from a comprehensive benchmarking study of negative item sampling based on 52 recommendation algorithms and 3 benchmark data sets. We show how the number of sampled negative items and different sampling strategies affect the consistency and discriminative power of sampled evaluation metrics. Furthermore, we investigate the impact of sparsity bias and popularity bias on the robustness of these metrics. In brief, we show that the optimal parameterization of negative item sampling depends on data set characteristics and the goals of the investigator, suggesting a need for greater transparency in related experimental design decisions.
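To make the evaluation protocol under discussion concrete, the following is a minimal sketch of a sampled metric: instead of ranking the held-out positive item against every non-interacted item (the full ranking), the positive is ranked against only n sampled negatives. The function name, signature, and uniform sampling strategy are illustrative assumptions, not the paper's implementation; popularity-weighted sampling would replace `rng.sample` with a draw proportional to item interaction counts.

```python
import random

def sampled_hit_rate_at_k(score_fn, positive_item, all_items,
                          interacted, n_negatives=100, k=10, rng=None):
    """Illustrative sampled Hit Rate@k (not the paper's code).

    Ranks the held-out positive item against n uniformly sampled
    negative items rather than the full item catalog.
    """
    rng = rng or random.Random(0)
    # Candidate negatives: items the user has not interacted with.
    candidates = [i for i in all_items
                  if i not in interacted and i != positive_item]
    negatives = rng.sample(candidates, n_negatives)
    # Rank positive + sampled negatives by model score (descending).
    ranked = sorted(negatives + [positive_item],
                    key=score_fn, reverse=True)
    rank = ranked.index(positive_item)
    return 1.0 if rank < k else 0.0
```

With n_negatives equal to the catalog size minus the user's interactions, this reduces to the full-ranking metric; shrinking n trades fidelity to the full ranking for lower cost, which is the tension the study examines.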