Introduction to the Special Issue on Perspectives on Recommender Systems Evaluation

Published: 07 March 2024


Abstract

Evaluation plays a vital role in recommender systems—in research and practice—whether for confirming algorithmic concepts or assessing the operational validity of designs and applications. It may span the evaluation of early ideas and approaches up to elaborate implementations of systems integrated into everyday product settings; it may target a wide spectrum of different factors being evaluated. In this special issue, we explore recommender systems evaluation—theory and practice—while considering a diverse set of perspectives. These include recommender systems purposes, stakeholders, methodological approaches, and consequences. The collection of articles in this special issue offers insightful analyses of current recommender system evaluation practices, acknowledging their limitations, and setting out future research directions. As recommender systems evolve, the need for adequate evaluation methods and approaches increases. This special issue sheds light on areas undergoing development or requiring added attention from the research and practitioner communities in recommender systems. The compilation serves as a call to the recommender systems research community, motivating continued research and exploration of evaluation metrics, methods, and strategies.


1 INTRODUCTION

Recommender systems simplify our choices, reshape our interactions, and aid us in discovering relevant items. Evaluation is central to recommender systems research and practice, determining the systems’ reliability, effectiveness, business value, and user satisfaction. Given this significant role, careful evaluation is a prerequisite for progress; in effect, recommender systems research revolves around evaluation. Evaluation encompasses the assessment of various aspects, from initial ideas to fully operational systems. This special issue—Perspectives on Recommender Systems Evaluation—aims to present the current state of the art and emerging trends in recommender systems evaluation. The number of articles and the scope of topics highlight the importance of taking diverse viewpoints into account when evaluating recommender systems.

Further, the articles in this special issue collectively address the following question: “Why are diverse evaluation approaches important for recommender systems?” They do so by showcasing the adaptability of recommender systems to various applications and domains, catering to different purposes and stakeholders, and pointing out why a uniform evaluation methodology or set of metrics is insufficient. Recommender systems’ quality differs depending on the use case and the goals set for the specific application; evaluation should thus reflect these varied expectations.

The consequences of the evaluation perspective chosen for recommender systems are significant, affecting user satisfaction, business outcomes, retention, computational performance, and a multitude of technical objectives and stakeholder-related subjective values. Simply put, a one-size-fits-all approach is insufficient, failing to capture the varied values recommender systems are expected to bring. Instead, we need diverse evaluation perspectives to maximize the potential of recommender systems.

This special issue presents articles that examine, develop, and analyze various perspectives on recommender systems evaluation, explore the corresponding methodologies, and present challenges across the application domains of recommender systems. Altogether, the articles offer fresh perspectives on the evaluation landscape, paving the way for new directions in the evaluation efforts of the recommender systems research community. Overall, this special issue captures the big picture of current challenges and trends in recommender systems evaluation, identifying and describing many strengths and weaknesses in current evaluation practices. The articles also show that specific weaknesses require particular attention and improvement. Beyond this, the articles contribute novel ideas and guidance for specific improvements.

While the articles in this special issue give perspective on where we need to head as a research community, driving significant progress in the field requires a collaborative effort and the collective expertise of the entire recommender systems research community. Thus, this special issue serves as a call to the recommender systems research community, motivating continued research and exploration of evaluation metrics, methods, and strategies.


2 PERSPECTIVES ON RECOMMENDER SYSTEMS EVALUATION

This special issue contains a selection of 10 articles on topics revolving around the evaluation of recommender systems. In the following, we provide a brief overview of the accepted articles.

Taking the user perspective, Jin et al. [7] introduce CRS-Que, a user-centered framework for evaluating conversational recommender systems. Extending the well-established ResQue framework [13], CRS-Que enables the evaluation of conversational qualities, including adaptability and understanding. To empirically validate this novel framework, two user studies are conducted, examining music exploration and mobile phone purchasing scenarios. Similarly focusing on the user and the effect of recommendations, Porcaro et al. [12] present a longitudinal user study on the impact of diversity in music recommendation. In particular, this study investigates the impact of exposure to unfamiliar genres via recommendations and the effects on listener attitudes such as openness or willingness to discover. Among further noteworthy findings, the study shows that diversified recommendations have the potential to increase the users’ willingness to explore new genres.

From a rather technical perspective, the work by Michiels et al. [10] introduces a test framework for assessing the correctness of implemented recommender systems algorithms (including RecPack Tests, an open source Python package). Drawing inspiration from software testing methodologies, the framework provides a range of black box and white box tests across all levels of abstraction, including unit, integration, and system testing. Furthermore, Daniil et al. [4] present a comprehensive reproducibility study on the propagation of popularity bias and particularly investigate the differences in results found by previous studies on popularity bias [1, 8, 11]. Beyond mere reproducibility, this work analyzes the properties and implementation of the recommender systems and the evaluation thereof. The findings show that discrepancies in results can be attributed to factors such as the data used, employed algorithms, methods for dividing users into groups, and the employed evaluation strategies.
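
To make the idea of such correctness tests concrete, the following is a minimal, hypothetical sketch of a black-box test in the spirit of the framework by Michiels et al. [10]. It does not use the RecPack Tests API; the toy PopularityRecommender and the tested property (recommendations must never repeat items a user has already consumed) are illustrative assumptions only.

```python
# Hypothetical black-box correctness test for a toy recommender (illustration
# only; this is NOT the RecPack Tests API). Assumed name: PopularityRecommender.
import numpy as np


class PopularityRecommender:
    """Toy baseline: ranks items by their global interaction count."""

    def fit(self, interactions: np.ndarray) -> "PopularityRecommender":
        # interactions: binary user-item matrix of shape (n_users, n_items)
        self.item_scores_ = interactions.sum(axis=0).astype(float)
        return self

    def recommend(self, user_history: np.ndarray, k: int) -> np.ndarray:
        # Mask already-consumed items, then return the top-k remaining items.
        scores = np.where(user_history > 0, -np.inf, self.item_scores_)
        ranked = np.argsort(-scores)
        ranked = ranked[np.isfinite(scores[ranked])]
        return ranked[:k]


def test_recommendations_exclude_consumed_items():
    # Black-box property: no recommended item may appear in the user's history.
    rng = np.random.default_rng(42)
    interactions = (rng.random((50, 20)) < 0.3).astype(int)
    model = PopularityRecommender().fit(interactions)
    for user_history in interactions:
        for item in model.recommend(user_history, k=5):
            assert user_history[item] == 0


if __name__ == "__main__":
    test_recommendations_exclude_consumed_items()
    print("black-box correctness test passed")
```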

Focusing on the evaluation process, the work by Ekstrand et al. [5] advocates for attending to the distributions of evaluation metrics across, for example, user groups or item providers, going beyond measuring pointwise effectiveness, the current evaluation practice. Furthermore, the article presents a variety of tools for performing distributional evaluations and analyzing the results thereof.

Four further articles of this special issue focus on evaluation methods. Li et al. [9] study the impact of performing item sampling-based evaluations (computing a ranking for a target item and a set of randomly sampled further items) compared to global evaluation (ranking the items of a hold-out set); see the illustrative sketch following this paragraph. This work shows that sampling-based recall@K can be mapped to the global recall@K. In addition, notably, the authors propose an adaptive sampling approach that dynamically samples negative items for given users. Al Jurdi et al. [2] propose an evaluation framework that enables the detection of data samples (automatically determined subgroups of users) on which a recommendation algorithm performs poorly. With this approach, the authors address the challenge that the overall metric results across all users may remain unchanged (or even improve) while the experience of coherent groups of users is harmed. Rahdari et al. [14] contribute a click model for 2D carousel-type interfaces. Such interfaces are common in practice, yet prior linear click models cannot be used to evaluate such interactive recommendations in data-driven offline evaluations. The proposed click model simulates how users interact with recommendations in labeled carousel interfaces, providing a low-cost and more accessible alternative to online empirical assessment of the quality of a carousel-type recommender system interface. Ferraro et al. [6] argue that recommendations for cultural content (e.g., music, movies, and literature) should align with principles of cultural citizenship, recognizing the social role of such content. The article introduces a commonality metric designed to gauge the extent to which a recommender system “contributes to the strengthening of cultural citizenship by systematically promoting diversity of source and content within a given type of cultural content.”
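
As a minimal illustration of the gap between sampled and global evaluation that Li et al. [9] study, the sketch below computes recall@K once against the full item catalog and once against a small set of uniformly sampled negatives, under a leave-one-out protocol with random scores. The scoring model, the sampling scheme, and all names are simplifying assumptions; the article’s mapping from sampled to global recall@K and its adaptive sampling strategy are not reproduced here.

```python
# Contrast of global vs. item-sampling evaluation of recall@K (illustrative
# assumptions only: random scores, one held-out target item per user).
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, n_negatives = 1000, 2000, 10, 100

scores = rng.random((n_users, n_items))           # toy model scores
targets = rng.integers(0, n_items, size=n_users)  # held-out item per user


def global_recall_at_k(scores, targets, k):
    # Rank each target among ALL items (0 = best) and count top-k hits.
    target_scores = scores[np.arange(len(targets)), targets, None]
    ranks = (scores > target_scores).sum(axis=1)
    return float(np.mean(ranks < k))


def sampled_recall_at_k(scores, targets, k, n_negatives, rng):
    # Rank each target only against a small sample of negative items.
    hits = 0
    for u, t in enumerate(targets):
        negatives = rng.choice(n_items, size=n_negatives, replace=False)
        negatives = negatives[negatives != t]
        rank = (scores[u, negatives] > scores[u, t]).sum()
        hits += int(rank < k)
    return hits / len(targets)


print("global  recall@10:", global_recall_at_k(scores, targets, k))
print("sampled recall@10:", sampled_recall_at_k(scores, targets, k, n_negatives, rng))
# With purely random scores, global recall@10 is roughly k/n_items = 0.005,
# while sampled recall@10 is roughly k/(n_negatives + 1), about 0.10: the two
# metrics live on different scales, which motivates mapping between them.
```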

Finally, Bauer et al. [3] present a systematic literature study on the current state of recommender systems evaluation practices. Spanning a period of 6 years (2017–2022) and analyzing 57 papers, the study specifically addresses experiment types, datasets, and metrics employed in the evaluation process. The results show that offline studies are predominantly used, and only a few datasets and evaluation metrics are widely used. Conversely, a variety of datasets and metrics is used in only a few papers.


3 MOVING ON

Current evaluation practices in the recommender systems research community tend to focus on offline evaluation, drawing on a wide range of datasets while still relying extensively on the MovieLens datasets [3]. Many contributions to the evaluation landscape propose evaluation models or introduce (or adapt) evaluation metrics to increase validity and yield more informative results. At the same time, however, this also affects comparability across papers [3].

Against this backdrop, it is interesting to consider the wider research and practitioner perspectives used in recommender systems evaluation research. Building on the analysis of the evaluation landscape performed in the work of Bauer et al. [3], we can extrapolate the evaluation perspectives of the analyzed body of literature (57 papers). Bauer et al. [3] classify the type of contribution as one of five types: “Benchmark—Providing an ... evaluation across a ... set of datasets,” “Framework—Introducing a framework for evaluation ... ,” “Metrics—Analyzing ... metrics of evaluation,” “Model—Introducing a ... model,” and “Survey—A literature survey.” One possible extrapolation to a wider set of perspectives would be “End user—Aligning with the perspectives of the end users,” “Service—Aligning with the perspectives of the service (and service provider),” and “Community—Aligning with the perspectives of the research and practitioner communities in recommender systems.” The service perspective may further be split into two separate perspectives: one where there is an inherent profit motive from a for-profit business and one where the service does not involve a financial perspective. Table 1 shows a suggested classification of the contribution types into the three perspectives. This classification is one interpretation of the types of perspectives research contributions may have; unquestionably, other perspectives and classifications could be devised as well.

Table 1. Three Extrapolated, Not Mutually Exclusive, Perspectives and the Five Contribution Types

Perspective | Contribution                  | Datasets                                               | Articles in This Special Issue
End user    | Metrics, model                | MovieLens, Amazon review, LastFM, citeulike            | [6, 7, 12, 14]
Service     | Benchmark, metrics, framework |                                                        | [6, 10]
  Profit    |                               | NetflixPrize, Yahoo R3, Yelp                           |
  Nonprofit |                               | Amazon review, MovieLens, epinions                     |
Community   | Survey, framework, benchmark  | MovieLens, Amazon review, LastFM, citeulike, epinions  | [2, 3, 4, 5, 7, 9, 10, 14]

Using this classification as a background, we can extend it to the works published in the context of this special issue (column “Articles in This Special Issue” in Table 1).

Analogously, we can extrapolate the perspectives of datasets commonly used in recommender systems evaluation research. Considering the most common datasets identified in the work of Bauer et al. [3] (i.e., MovieLens, Amazon, LastFM, citeulike, NetflixPrize, Yelp, Yahoo, and epinions), we can classify the perspectives of the datasets. In this context, we refer to the perspective from which the dataset originates. For instance, the MovieLens dataset comes from a service without underlying financial mechanisms—it is the result of a long-running research project and infrastructure hosted by the University of Minnesota’s GroupLens Lab; the Netflix Prize dataset, in contrast, was created and released by a for-profit organization relying on the quality of its recommendations to increase profit margins and customer satisfaction. Datasets such as Amazon Review and citeulike represent a nonprofit perspective, as they have been crawled and released by third parties (researchers), for example, by scraping the websites of these services. While the services themselves have underlying profit perspectives, these were not instrumental in creating the datasets. The overarching service perspective conveys that the data is nevertheless a result of the recommender systems these services run—that is, the recommender systems employed by the services directly influenced how the data was constructed.

While many perspectives can be taken with regard to data, algorithms, and metrics, it remains important to remember that recommender systems provide decision support for their users and, in doing so, must keep the end user perspective in focus, independent of all other parameters.

ACKNOWLEDGMENTS

We thank all authors who submitted their work to this special issue and all reviewers who devoted a substantial amount of time and effort to providing constructive and insightful feedback to the authors.

REFERENCES

[1] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, and Bamshad Mobasher. 2019. The unfairness of popularity bias in recommendation. CoRR abs/1907.13286 (2019). http://arxiv.org/abs/1907.13286
[2] Wissam Al Jurdi, Jacques Bou Abdo, Jacques Demerjian, and Abdallah Makhoul. 2024. Group validation in recommender systems: Framework for multi-layer performance evaluation. ACM Transactions on Recommender Systems 2, 1 (2024).
[3] Christine Bauer, Eva Zangerle, and Alan Said. 2024. Exploring the landscape of recommender systems evaluation: Practices and perspectives. ACM Transactions on Recommender Systems 2, 1 (2024).
[4] Savvina Daniil, Mirjam Cuper, Cynthia C. S. Liem, Jacco van Ossenbruggen, and Laura Hollink. 2024. Reproducing popularity bias in recommendation: The effect of evaluation strategies. ACM Transactions on Recommender Systems 2, 1 (2024).
[5] Michael D. Ekstrand, Ben Carterette, and Fernando Diaz. 2024. Distributionally-informed recommender system evaluation. ACM Transactions on Recommender Systems 2, 1 (2024).
[6] Andres Ferraro, Gustavo Ferreira, Fernando Diaz, and Georgina Born. 2024. Measuring commonality in recommendation of cultural content: Recommender systems to enhance cultural citizenship. ACM Transactions on Recommender Systems 2, 1 (2024).
[7] Yucheng Jin, Li Chen, Wanling Cai, and Xianglin Zhao. 2024. CRS-Que: A user-centric evaluation framework for conversational recommender systems. ACM Transactions on Recommender Systems 2, 1 (2024).
[8] Dominik Kowald, Markus Schedl, and Elisabeth Lex. 2020. The unfairness of popularity bias in music recommendation: A reproducibility study. In Advances in Information Retrieval. Lecture Notes in Computer Science, Vol. 12036. Springer, 35–42.
[9] Dong Li, Ruoming Jin, Zhenming Liu, Bin Ren, Jing Gao, and Zhi Liu. 2024. On item-sampling evaluation for recommender system. ACM Transactions on Recommender Systems 2, 1 (2024).
[10] Lien Michiels, Robin Verachtert, Andres Ferraro, Kim Falk, and Bart Goethals. 2024. A framework and toolkit for testing the correctness of recommendation algorithms. ACM Transactions on Recommender Systems 2, 1 (2024).
[11] Mohammadmehdi Naghiaei, Hossein A. Rahmani, and Mahdi Dehghan. 2022. The unfairness of popularity bias in book recommendation. In Advances in Bias and Fairness in Information Retrieval. Springer, 69–81.
[12] Lorenzo Porcaro, Emilia Gómez, and Carlos Castillo. 2024. Assessing the impact of music recommendation diversity on listeners: A longitudinal study. ACM Transactions on Recommender Systems 2, 1 (2024).
[13] Pearl Pu, Li Chen, and Rong Hu. 2011. A user-centric evaluation framework for recommender systems. In Proceedings of the 5th ACM Conference on Recommender Systems (RecSys ’11). ACM, New York, NY, USA, 157–164.
[14] Behnam Rahdari, Peter Brusilovsky, and Branislav Kveton. 2024. Towards simulation-based evaluation of recommender systems with carousel interfaces. ACM Transactions on Recommender Systems 2, 1 (2024).
