Information Processing and Management



Introduction
Recommender systems (RSs) play an indisputable role in our lives by taking part in a variety of day-to-day decisions, influencing which content we are exposed to on digital platforms. Such decisions include selecting a book to buy, a movie to watch, a hotel to book, or a song to listen to. While RSs, therefore, offer great support in accessing otherwise barely manageable amounts of data, several studies revealed that their performance may differ between groups of users depending on their characteristics, e.g., gender, race, ethnicity, age, country of origin, or personality (Datta, Tschantz, & Datta, 2015; Lambrecht & Tucker, 2019; Melchiorre, Zangerle, & Schedl, 2020; Schedl, Hauger, Farrahi and Tkalcic, 2015). Some of these performance disparities can disadvantage certain user groups in accessing opportunities, and therefore disregard the principle of fairness, namely the ''absence of any prejudice or favoritism toward an individual or a group based on their intrinsic or acquired traits'' (Mehrabi, Morstatter, Saxena, Lerman, & Galstyan, 2019). In this article, we set out to study group fairness from the point of view of the users of RSs. This form of unfairness in RSs can be traced back to biases resulting from the (interplay between) data, model/algorithm, and users (Chen, Dong, Wang, Feng, Wang, & He, 2020). Here, we focus on biases resulting from an imbalanced number of data points regarding various demographics. These imbalances can be classified as population bias (Olteanu, Castillo, Diaz, & Kiciman, 2019), which is part of a more general data bias (Baeza-Yates, 2018). In the presence of population bias, a machine learning model (such as those created by a RS algorithm) captures the interaction patterns of the majority group more prominently, which can lead to better model performance for the majority group in comparison to that for the minority group (an unfair system) (Hardt, Price, & Srebro, 2016). We refer to the cause of this unfairness as
model/algorithmic bias. Clearly, data is a major reason for model bias, but different models can still lead to different degrees of unfairness or even intensify data bias. As shown in previous studies in the contexts of text classification (De-Arteaga et al., 2019) and passage retrieval (Rekabsaz & Schedl, 2020), a machine learning model may even compound the discrepancies in the collection, such that the distribution of the model's results is even more unfair in comparison with the existing discrepancies in the underlying data. This phenomenon is referred to as compounding imbalances (De-Arteaga et al., 2019; Hellman, 2018). In this case, the system is not only unfair toward minority groups but even intensifies the existing imbalances.

Research questions
In the work at hand, we comprehensively study both population and model bias in the context of unfairness with regard to gender in one important domain of RS research, music recommendation (Schedl, Knees, McFee, Bogdanov and Kaminskas, 2015). Concretely, we explore the following research questions: RQ1: Do recommender algorithms of various categories yield different performance scores (in terms of accuracy and beyond-accuracy metrics) for different user groups with respect to gender? If so, how can these differences be characterized? RQ2: What is the effect of a resampling strategy, commonly used as a debiasing method, on the performance and fairness of algorithms? RQ3: Do RS algorithms compound population bias? If so, how can this be characterized?
These questions can only be approached using a music recommendation dataset that contains gender information about users. Existing (publicly available) datasets of this kind are exclusively composed of data from the music streaming platform Last.fm, and include Last.fm 1K (Celma, 2010), LFM-1b (Schedl, 2016), and the Music Listening Histories Dataset (MLHD) (Vigliensoni & Fujinaga, 2017). Most of them are either small (Celma, 2010) or do not contain up-to-date listening information (Schedl, 2016; Vigliensoni & Fujinaga, 2017). Therefore, we introduce - as an additional contribution - the LFM-2b dataset, a novel up-to-date large-scale real-world collection of music listening records, gathered from Last.fm, which considerably extends LFM-1b. Unlike existing datasets, LFM-2b provides more than two billion listening records for more than 120,000 users who listened to more than 50 million unique tracks in total. Another remarkable difference to LFM-1b is the large temporal coverage of listening records (2005-2020), which allows tracking users' listening behavior over considerable periods of time. LFM-2b contains a dedicated subset, LFM-2b-DemoBias, which we especially developed to study and evaluate fairness and biases in music RSs in terms of users' gender, age, and country of origin. Because of these characteristics, as well as a higher average number of listening records per user in comparison to LFM-1b (Table 1, rows LFM-1b, LFM-2b, and LFM-2b\1b), LFM-2b-DemoBias is well-suited to approach the research questions under investigation. All data is publicly available.
In approaching RQ1, we study gender-related unfairness on a variety of RS algorithms. Using LFM-2b-DemoBias, we train the algorithms and compute evaluation metrics separately on the subsets of male and female users in the test set, using accuracy-based (recall and NDCG) as well as beyond-accuracy metrics (diversity and coverage). We define unfairness of a RS algorithm regarding an evaluation metric as the mean absolute difference of average evaluation scores across pairs of sensitive attributes (genders in our experiments). This notion of fairness closely corresponds to the equality of opportunity metric of Hardt et al. (2016).
To address RQ2, we repeat the above-mentioned experiments considering a debiasing method where - following Geyik, Ambler, and Kenthapadi (2019) - data points of the minority group (female) in the training data are resampled up to the number of data points of the majority group (male).
To answer RQ3, we extend the concept of compounding imbalances, introduced by De-Arteaga et al. (2019) on the true-positive metric, to any arbitrary metric and refer to it as compounding factor. The compounding factor regarding a model and an evaluation metric is defined as the divergence of the distribution of the metric's results over the user groups from the population distribution in the dataset. A higher compounding factor indicates that the model intensifies the existing bias in the data toward the majority group.

In our dataset, users either identify with one of these two genders (female or male), or their gender is not specified. We are, however, fully aware that a gender binary model is not representative of all individuals; yet, working with in-the-wild data (as in our case) entails an unavoidable caveat, derived from the still predominant belief that human beings can be sorted into two discrete categories (Hyde, Bigler, Joel, Tate, & van Anders, 2019). All introduced metrics, however, are defined for generic non-binary settings, and can be applied to gender or any other sensitive attribute.
In the study at hand, we focus our investigation on collaborative filtering (CF) algorithms for several reasons. First, they are more widely adopted than content-based filtering (CBF) approaches. Second, they typically yield better performance than CBF systems, and are therefore used in (or as part of) almost all state-of-the-art systems. Third, investigating hybrid systems instead would render it hard to disentangle unfairness aspects originating from the collaborative information from those originating from content information. Fourth, the output of CF systems is particularly sensitive to data biases, as shown for instance in Melchiorre et al. (2020) for differences in interaction data resulting from different personality traits of users, in Lambrecht and Tucker (2019) for gender-specific differences in job advertisements, in Bauer and Schedl (2019) for differences in terms of users' inclination to listen to mainstream music, and in Abdollahpouri, Mansoury, Burke, and Mobasher (2019) and Kowald, Schedl, and Lex (2020) for differences resulting from varying item popularity in the music and movie domains.
Fairness has been investigated in various domains. For instance, in job recommendation, studies have found that highly paid jobs are more often recommended to men than to women, both on Facebook by Lambrecht and Tucker (2019) and on Google by Datta et al. (2015). Lambrecht and Tucker identify as the reason the cost-minimizing strategy of advertising algorithms. More precisely, platform owners charge for showing a job advertisement to users, and the cost differs across demographic target groups. Since young women belong to a particularly expensive group, well-paid ads that are meant to be gender-neutral are, in fact, more frequently presented to male users by algorithms adopting a cost-minimizing strategy, because the latter group is ''less expensive''. In the domain of books, Ekstrand, Tian, Imran, Mehrpouyan and Kluver (2018) investigate the disparity of book authors' gender distribution in the user profiles and in the recommendation lists. They find that particularly CF algorithms often create recommendation results that are biased toward male authors. In the movie domain, Lin, Sonboli, Mobasher, and Burke (2019b) study how different recommender system algorithms amplify or dampen preferences for specific item categories (e.g., Action versus Romance) for male and female users. They show, for instance, that neighborhood-based models intensify the preferences, for all users, toward the preferred item category of the dominant group (males), while some other algorithms, such as SVD++ and BiasedMF, dampen these preferences. Similarly, Mansoury, Abdollahpouri, Pechenizkiy, Mobasher, and Burke (2020b) assess the amplification of popularity bias in recommender systems due to the feedback loop, i.e., recommending popular items makes the popular items even more popular, showing that the bias amplification is stronger for the minority group (i.e., females). To what extent different recommender system algorithms reflect the user group preferences for item categories in the input has also been investigated by Mansoury et al. (2019), who follow a similar approach to Lin et al. (2019b). In the music domain, which is also the target domain of our study, bias in recommender systems (Ekstrand, Tian, Azpiazu et al., 2018; Melchiorre et al., 2020; Schedl, Hauger et al., 2015; Shakespeare, Porcaro, Gómez, & Castillo, 2020) and gender representation in music streaming and broadcasting services (Epps-Darling, Bouyer, & Cramer, 2020; Watson, 2020) have recently been investigated. In particular, Schedl, Hauger et al. (2015) show that precision and recall obtained by simple CF and CARS algorithms substantially diverge for users of different gender, age, and country. Melchiorre et al. (2020) show that state-of-the-art CF algorithms yield different performance scores (recall and NDCG) for different user groups with respect to their personality traits, in particular for users with high versus low openness and neuroticism. Ekstrand, Tian, Azpiazu et al. (2018) reveal performance disparities (with respect to the NDCG metric) of simple CF algorithms in the music and movie domains, resulting in unfairness with regard to age and gender. They also find that biases do not necessarily correlate with user group size.
Unfairness and bias caused by algorithms are studied in various tasks related to recommendation; indeed, fairness-aware recommendation algorithms have been presented in the literature (Steck, 2018; Yao & Huang, 2017). Rekabsaz and Schedl (2020) show that ranking models based on neural networks increase the gender bias toward male in retrieval results in comparison with classical exact-matching models. In the direction of analyzing algorithmic bias, De-Arteaga et al. (2019) discuss compounding imbalances, a concept related to compounding injustices (Hellman, 2018) in political philosophy. De-Arteaga et al. show that if a classifier performs with a lower sensitivity, i.e., true-positive rate (TPR), on the minority group in comparison with the majority group, the imbalance between the groups in the final TPRs becomes larger than the initial imbalance in the underlying dataset. In this case, the model (classifier) intensifies the existing imbalances in the dataset.
Besides fairness, recent work sheds light on various types of biases involved in RSs. For instance, recent research reveals a popularity bias in current recommendation algorithms. In particular, it was shown that users are recommended items that do not match their preference toward a certain popularity level (niche songs/artists are undervalued) (Abdollahpouri et al., 2019; Kowald et al., 2020).
Debiasing and improving fairness. State-of-the-art debiasing methods applicable to RSs are commonly categorized into four approaches (Chen et al., 2020): (1) rebalancing, (2) regularization, (3) counterfactual intervention, and (4) adversarial training. In the first category, the data or recommendation results are rebalanced in order to satisfy a certain fairness measure (e.g., demographic parity). In such methods, debiasing is approached as a pre- or post-processing step. Common pre-processing approaches are: relabeling training data to achieve an equal number of relevant labels across the groups (Pedreshi et al., 2008), or resampling data to obtain an equal number of training data points (Geyik et al., 2019). Post-processing methods typically aim to change the output list of RSs such that the results of each recommendation, or the expectation of the results, satisfy targeted fairness measures (Biega et al., 2018; Zehlike et al., 2017).
In the second category, debiasing is done by steering the optimization process of the recommendation model during training through a regularization term for fairness. Zemel, Wu, Swersky, Pitassi, and Dwork (2013) propose a general framework which seeks to learn representations that contain sufficient information for the task at hand but are invariant with regard to sensitive attributes. Kamishima et al. adapted this framework to the context of RSs and later generalized it to implicit feedback-based recommender systems (Kamishima & Akaho, 2017; Kamishima, Akaho, Asoh, & Sakuma, 2012). Regularization-based approaches are also studied in the context of biases in RSs, for instance by Abdollahpouri, Burke, and Mobasher (2017) to address popularity bias.
Regarding the third category, Kusner, Loftus, Russell, and Silva (2017) introduce counterfactual fairness to RSs. In this method, fairness criteria are satisfied when the evaluation of an individual in the counterfactual world - where the individual's sensitive attribute is changed by intervention - and in the real world are identical.
Finally, the category of debiasing through adversarial learning approaches the topic by creating fair representations, agnostic to sensitive attributes, through a min-max game (Bose & Hamilton, 2019). In this direction, Beigi et al. (2020) recently proposed an adversarial training method which seeks to protect users' sensitive attributes from an attacker who has access to users' item lists and recommendations.
In the study at hand, we investigate a method from the first category of approaches, namely rebalancing. In particular, we apply a rebalancing technique to the training data of the RS algorithm to achieve statistical parity. For this purpose, we resample the data points of the users of the minority group (female) in the training data to match the number of users of the majority group (male).
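For illustration, this resampling step can be sketched as follows. This is a minimal sketch under our own assumptions (function and variable names are ours, not from any released code): minority-group users are upsampled with replacement until both groups contribute an equal number of users, so that a duplicated user ID implies that this user's data points are sampled again.

```python
import random

def upsample_minority(train_users, groups, seed=42):
    """Upsample users of the minority group (with replacement) until
    all groups contribute an equal number of users to the training data.

    train_users: list of user IDs; groups: dict user_id -> group label.
    Returns an augmented list of user IDs."""
    rng = random.Random(seed)
    by_group = {}
    for u in train_users:
        by_group.setdefault(groups[u], []).append(u)
    # Target size is the size of the largest (majority) group.
    target = max(len(us) for us in by_group.values())
    augmented = []
    for us in by_group.values():
        augmented.extend(us)
        # Draw the missing users with replacement from the same group.
        augmented.extend(rng.choices(us, k=target - len(us)))
    return augmented
```

With three male and two female users, for instance, one female user is drawn again, yielding three users per group in the augmented training set.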

Datasets for music recommendation experiments
Investigating (music) recommendation algorithms in such a way that the insights gained can generalize to real-world applications requires access to suitable datasets containing data obtained in-the-wild. Although many corpora have been publicly released in the last decade for the study of music RSs, the majority of these - unlike the proposed LFM-2b dataset - do not include users' demographic information. This omission of users' demographics is particularly the case for corpora containing data from Spotify (Brost, Mehrotra, & Jehan, 2019; Pichl, Zangerle, & Specht, 2015; Zamani, Schedl, Lamere, & Chen, 2019), Yahoo! (Dror, Koenigstein, Koren, & Weimer, 2011), Echo Nest (Bertin-Mahieux, Ellis, Whitman, & Lamere, 2011), or Art of the Mix (McFee & Lanckriet, 2012).
However, to investigate bias in music RSs in general, and the so-called population bias (Olteanu et al., 2019) in particular, users' demographic information becomes essential. The population bias, an aspect of which we study here, is a type of bias contained in the data itself that arises from the distortion of a given population with respect to a target population. This is typical, for instance, of social media, since some platforms are more frequently used by a specific group (e.g., females on Pinterest) while others by another group (e.g., males on Twitter). To investigate how state-of-the-art recommender algorithms might increase or mitigate a bias already present in the data, a dataset containing users' demographic information becomes indispensable. From the publicly available datasets already presented in the literature, those containing such information are: (1) datasets collected from the music platform Last.fm, i.e., Last.fm 360K and Last.fm 1K (Celma, 2010), LFM-1b (Schedl, 2016, 2019), and the Music Listening Histories Dataset (MLHD) (Vigliensoni & Fujinaga, 2017), which include users' gender, country, and age gathered at the time of their registration; (2) datasets created from data shared on other social media sites, such as Twitter, containing music-related hashtags mapped onto musical metadata through open music encyclopediae such as MusicBrainz, i.e., MusicMicro (Schedl, 2013), the Million Musical Tweets Dataset (MMTD) (Hauger, Schedl, Košir, & Tkalčič, 2013), and #nowplaying-RS (Poddar, Zangerle, & Yang, 2018). Nevertheless, none of the datasets containing demographic information has been developed with the evaluation of population bias in mind, which impairs a clear understanding of their real potential for bias-related research: considering that users lacking demographics would be discarded for such a study, the actual value of these datasets for the assessment of bias in music RSs is unknown. Furthermore, although the use of online music platforms and social media - namely the main sources for retrieving users' demographic information in this context - has particularly increased in recent years, up-to-date datasets of this nature have not been recently presented. Therefore, we introduce the LFM-2b dataset, an up-to-date large-scale corpus containing listening histories from Last.fm users collected over the last 15 years (from 2005 until 2020). In addition, LFM-2b-DemoBias, a subset of the former with demographics, specially tailored to assess population bias in music RSs, is also presented.

LFM-2b Dataset
In this section, the LFM-2b and the LFM-2b-DemoBias datasets are introduced. Aspects such as data acquisition procedures, accessibility, as well as the main characteristics of each collection, are discussed in the following.

Data acquisition and accessibility
The LFM-2b(illion) dataset is a large collection of music listening events (LEs), i.e., users' interactions with the online music platform Last.fm, enriched by users' demographic information (i.e., users' age, country, and gender), music-related metadata (e.g., artist and track names), and timestamps (the specific time when a particular track was listened to by a given user). Following the methodology applied in the acquisition of the LFM-1b dataset (Schedl, 2016), LFM-2b was collected from the web streaming service Last.fm using the Last.fm API. LFM-2b (encompassing more than 2 billion LEs) is an extension of the former, containing the same 120,322 users but with listening histories extended over 15 years: from 14 February 2005 until 20 March 2020, which yields 2,014,164,872 LEs in total. In order to enable reproducibility of our results and to foster further experiments on bias and fairness in the music recommendation domain, the LFM-2b dataset is stored in the form of tabular data encoded in UTF-8. LEs are codified in a single file containing one LE per row, with columns for users' demographics, music metadata, and the timestamp. Users' demographics are: user ID (unique for each user), gender information (female or male), country, and age. Musical attributes are: track ID (unique for each track) and track identifier (combination of track and artist names). Note that, with track, we refer to each unique musical item produced by a specific artist. Furthermore, the user-track-playcount matrix (UTM) and the user-artist-playcount matrix (UAM), i.e., two 2-dimensional matrices containing the interactions between unique user-track and user-artist pairs, respectively, are also provided as sparse matrices in Python NumPy format.
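To illustrate the structure of such a playcount matrix, the following minimal sketch builds a toy UTM in COO-style sparse form with NumPy (the indices and play counts are invented for illustration and are not taken from the dataset; the exact on-disk file layout is not reproduced here):

```python
import numpy as np

# Toy user-track playcount matrix (UTM) in COO-style sparse form:
# three parallel arrays of (user index, track index, play count).
# A real LFM-2b matrix spans ~120k users and ~50M tracks, so a
# sparse representation is essential.
users = np.array([0, 0, 1, 2, 2, 2])
tracks = np.array([0, 2, 1, 0, 1, 3])
plays = np.array([5, 1, 3, 2, 7, 4])

n_users, n_tracks = 3, 4

# Listening events per user, accumulated without densifying the matrix.
les_per_user = np.bincount(users, weights=plays, minlength=n_users)
sparsity = 1.0 - len(plays) / (n_users * n_tracks)

print(les_per_user)        # [ 6.  3. 13.]
print(round(sparsity, 2))  # 0.5
```

The same three-array representation can be converted to any sparse format (e.g., CSR) for training CF models without ever materializing the dense user-track matrix.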

From LFM-1b to LFM-2b
In Table 1, descriptive statistics for LFM-1b, LFM-2b, LFM-2b\1b (the set difference between LFM-2b and LFM-1b), and LFM-2b-DemoBias (the Demographic Bias subset) are displayed. For each dataset, the number of Users, Tracks, Artists, and Listening Events (LEs), as well as the mean and standard deviation (indicated after ±) of users' interactions in terms of (unique) tracks per user, (unique) artists per user, and total LEs per user (see Tracks/User, Artists/User, and LEs/User, respectively) are reported.
Although there is only a difference of 6 years in the length of the listening histories collected for LFM-1b w.r.t. the LFM-2b dataset, the latter represents a considerably larger range of listened tracks and artists: 31,413,999 versus 50,813,373 tracks, and 3,116,790 versus 5,217,014 artists for LFM-1b versus LFM-2b (see Tracks and Artists, respectively, in Table 1). Similarly, LFM-2b contains approximately double the number of LEs of LFM-1b: 1,088,161,692 versus 2,014,164,872 (see LEs in Table 1). When evaluating the set difference between LFM-1b and LFM-2b, i.e., the LEs from LFM-2b collected only during the last 6 years (see LFM-2b\1b), we observe that the number of LEs from LFM-2b\1b is comparable to that of LFM-1b (collected during 9 years), which indicates that the users have increased their music consumption within the platform in the last 6 years: 1,088,161,692 versus 926,003,180, respectively, for LFM-1b versus LFM-2b\1b (see LEs in Table 1). This goes along with the general rise in social media usage in recent years, which emphasizes the importance of using up-to-date datasets in the evaluation of users' music consumption.
By calculating the differences in the coefficient of variation (ΔCV), i.e., the difference in the ratio of the standard deviation to the mean, between LFM-1b and LFM-2b\1b for each type of interaction, a general increment in the variability of the users' consumption habits is revealed. The smallest increment is shown for the interactions in terms of LEs (see LEs/User for LFM-1b versus LFM-2b\1b in Table 1), yielding ΔCV = 20%. The largest increment is found for the interactions in terms of artists (see Artists/User for LFM-1b versus LFM-2b\1b in Table 1), yielding ΔCV = 70%. In between lie the interactions in terms of tracks (see Tracks/User for LFM-1b versus LFM-2b\1b in Table 1), yielding ΔCV = 50%. Overall, this indicates that in the last 6 years, users' listening behavior changed especially concerning artist variability, meaning that many users substantially increased the number of artists they listen to. This is also evidenced by comparing the coefficient of variation (CV) for the interaction Artists/User of each collection: CV = 190% versus CV = 120% for Artists/User in LFM-2b\1b versus LFM-1b, respectively. Differently, the amount of interactions within the platform remained more stable across users: CV = 200% versus CV = 180% for LEs/User in LFM-2b\1b versus LFM-1b, respectively. As expected when working with data collected in-the-wild, both LFM-1b and LFM-2b\1b display great differences across users' consumption behavior, i.e., some users have a much lower number of interactions than others. Yet, LFM-2b\1b indicates that these differences across users have become more salient in terms of artists, which suggests an increased interest of users toward musical diversity. All in all, LFM-2b, being the union of LFM-1b and LFM-2b\1b, presents a considerably higher number of LEs per user w.r.t. LFM-1b. Furthermore, it is more up-to-date and also shows a much higher artist variety; it is thus particularly suitable for assessing the performance of music recommender systems in general and for studying their bias in particular.
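The coefficient-of-variation comparison above can be sketched in a few lines; the per-user counts below are invented toy values, not the actual LFM-1b or LFM-2b\1b statistics:

```python
import numpy as np

def cv(x):
    """Coefficient of variation: ratio of standard deviation to mean."""
    x = np.asarray(x, dtype=float)
    return x.std() / x.mean()

# Toy per-user artist counts for two hypothetical dataset snapshots
# (NOT the real LFM-1b / LFM-2b\1b figures).
old = [100, 120, 90, 110]
new = [50, 400, 80, 300]

delta_cv = cv(new) - cv(old)  # increment in consumption variability
print(round(cv(old), 2))  # 0.11
print(round(cv(new), 2))  # 0.71
```

A positive ΔCV, as in this toy example, indicates that the per-user counts became more dispersed relative to their mean between the two snapshots.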

LFM-2b-DemoBias: a collection for studying fairness
In order to examine data and algorithmic/model bias in music RSs, along with the LFM-2b dataset, we introduce the LFM-2b-DemoBias (Demographic Bias) subset, which contains music LEs for users with valid demographic information in terms of age, gender, and country. Since bias might be independently investigated for gender, age, and country, we intentionally include in LFM-2b-DemoBias users who have at least one of the demographic attributes, instead of all three. Note that all the filtered collections, i.e., the subsets at the intersection of several demographic attributes (e.g., LEs for users with valid gender and age information), as well as the playcount matrices, can be inferred from LFM-2b directly.
LFM-2b-DemoBias encompasses a total of 60,972 users (see All for LFM-2b-DemoBias in Table 1), of which 55,771 provide gender information (see All for Gender in Table 1), 55,190 country information (see All for Country in Table 1), and 46,120 age information (see All for Age in Table 1). Within each demographic group, i.e., gender, country, and age, there is an unbalanced distribution of users between samples: for instance, 15,802 female versus 39,969 male users (see F and M for User in Table 1); 10,255 users from the USA, i.e., the first-ranked country, which comprises more than twice as many users as any other country (see US for User in Table 1); 37,228 users under thirty years versus 8,892 users above (see <30 and ≥30 for User in Table 1). As expected, this is similarly displayed when evaluating the collection at the track and artist level (see differences between samples within the Gender, Country, and Age groups for Track and Artist in Table 1). We observe, indeed, a great diversity between countries concerning the unique artists listened to: despite their differences in number of users, the USA and Russia both show a high number of unique artists (see 1,234,159 and 1,167,304, for Artists in US and RU in Table 1); Germany, the UK, and Poland - while similar in number of users - show a considerably lower diversity (750,767, 773,461, and 662,239 Artists in Table 1). Such unbalanced distributions, characteristic of datasets collected in-the-wild, indicate the population bias of LFM-2b, in which male, US, and young listeners represent the dominant group of users. In other words, the population of LFM-2b-DemoBias is distorted w.r.t. the real-world population, which does not contain such a bias. The population bias shown in LFM-2b-DemoBias makes this collection particularly suited to investigate model/algorithm bias, since it enables assessing to what extent a given recommender algorithm might create a model that reflects, amplifies, or alleviates this data bias.
Since the aim of the study at hand is to investigate gender fairness, we further inspect the interaction of the gender attribute in LFM-2b-DemoBias with the other two demographics, i.e., country and age. Therefore, in Fig. 1, the distribution of listening events across users containing information for at least two demographic attributes, i.e., gender and country (see Fig. 1a) or gender and age (see Fig. 1b), is displayed. In the figure, the median and the quartiles are indicated; countries with more than 2000 users are shown individually, with the remaining countries aggregated (and denoted as ''Other''); for the countries' abbreviations and the number of users per country, see the caption of Table 1. Although the US is the country with the most users, their interactions within the platform are generally lower than in other countries with many users, such as Poland. This is particularly clear for male users: those from the US show the lowest median among male users (16,108), those from Poland the highest (21,223); see the medians for male US and male PL, respectively, in Fig. 1 (upper plot). This higher consumption by Polish users is also clearly displayed for female users, whose median (19,813) is not only almost as high as that displayed by male users, but is as high as the median for all the other countries with fewer than 2000 users (i.e., Other), and overtakes that displayed by male users from many countries, including the US; see the medians for female PL and male Other and US in Fig. 1 (upper plot). Concerning age, similar trends between female and male users can be observed for the two displayed age groups, i.e., above and below thirty: for both female and male users, only one fourth of the users per group (under and over 30) produces more than half of the LEs; see the fourth quartile for each gender and group in Fig. 1 (lower plot). This skewed distribution was already shown in the LFM-1b dataset, and it is also clearly shown for the demographic information of country (see Fig. 1, upper plot).

Measuring user group fairness in recommendation
In this section, we define and formulate our approach to quantify the fairness/bias of a RS, as well as the compounding factor. We first explain RecGap, a metric for measuring unfairness of RSs from the point of view of user groups (Section 4.1). The proposed RecGap metric is closely related to the Gap metric introduced by De-Arteaga et al. (2019), while it expands Gap to any arbitrary evaluation measure, making it suitable for RSs. We then describe the concept of compounding imbalances in RSs, and suggest a metric to capture this concept (Section 4.2). The metrics discussed in this work are formulated for an arbitrary number of user groups (non-binary setting) for the sensitive attribute. The metrics are defined for the set of user groups $G$, which, in the case of the binary setting of the sensitive attribute gender, is equal to $G = \{m, f\}$.

RecGap : Recommendation unfairness metric
Our notion of fairness of a RS closely relates to the equal opportunity metric of Hardt et al. (2016). We consider a RS to be fair if it performs equally well across the groups of users, according to any arbitrary evaluation metric. To measure the fairness of a recommender algorithm, first an RS model is trained - in case the algorithm requires training - and then its recommendation predictions are evaluated separately for each group of users in the test set. The fairness of the RS model regarding the given evaluation metric is defined as the mean absolute difference between the average evaluation scores (across all users) in each pair of groups. We refer to this metric as RecGap, formulated as follows:

$$\mathrm{RecGap}_{M} = \frac{1}{|P|} \sum_{(g, g') \in P} \left| \frac{1}{|U_g|} \sum_{u \in U_g} M(u) - \frac{1}{|U_{g'}|} \sum_{u \in U_{g'}} M(u) \right|$$

where $P$ is the set of group pairs, $U_g$ denotes the set of users in group $g$, and $M(u)$ returns the evaluation result of the metric $M$ for user $u$ (assuming the availability of corresponding ground-truth data). Intuitively, the RecGap metric quantifies the disparity between the average results, in terms of $M(u)$, of different user groups.
The metric M can be any evaluation measure that can be calculated on the user level (and aggregated on the user-group level), such as NDCG, diversity, or recall. According to this formulation, RecGap uses the average values across users (as is common in evaluation metrics), and does not require the groups to have an equal number of users. This characteristic of RecGap is particularly important in real-world scenarios, as the numbers of users across sensitive attributes are typically different. RecGap_M can take zero or positive values, where zero indicates that the recommendation algorithm is fair, while a higher positive value indicates a higher degree of unfairness. Interpreting the magnitude of RecGap_M depends on the considered M function and the number of groups. As a simple example for the recall metric, let us consider a binary gender setting with average recall scores of 0.5 and 0.3 for male and female users, respectively. In this example, RecGap_Recall is equal to 0.2, which implies that the average male user receives in their top-10 results 20 percentage points more of the relevant items than the average female user.
In the particular case of G being a binary set, RecGap_M simplifies to the absolute difference of the groups' mean evaluation scores. In this case, the magnitude of RecGap_M can be assessed by applying hypothesis testing, which, by examining whether the difference between groups is statistically significant, indicates to what extent RecGap_M is meaningful. Finally, the direction of bias/unfairness in RecGap can simply be identified by comparing the groups' mean values: the model is biased toward the group with the higher mean score (considering that higher M values mean higher gains).
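The RecGap computation described above can be sketched as follows. This is an illustrative reimplementation, not the authors' published code; the group labels and per-user scores are hypothetical.

```python
from itertools import combinations

def rec_gap(scores_by_group):
    """RecGap: mean absolute difference of the per-group average
    metric scores, taken over all pairs of user groups.

    scores_by_group maps a group label to the list of per-user
    evaluation results M(u) for the users in that group; groups
    may have different sizes, as in the text.
    """
    means = {g: sum(s) / len(s) for g, s in scores_by_group.items()}
    pairs = list(combinations(means, 2))
    return sum(abs(means[a] - means[b]) for a, b in pairs) / len(pairs)

# Binary-gender example from the text: average recall 0.5 (male)
# vs. 0.3 (female) yields RecGap_Recall = 0.2.
gap = rec_gap({"m": [0.6, 0.4], "f": [0.2, 0.4]})
```

In the binary case the formula reduces to the single absolute difference of the two group means, as noted above.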

Compounding imbalances in recommender systems
Our definition of compounding imbalances follows the work of De-Arteaga et al. (2019) by extending the concept of compounding to RSs according to any arbitrary evaluation metric. Considering a recommendation algorithm with some degree of unfairness across user groups (RecGap_M ≠ 0), we aim to quantify the extent to which the algorithm compounds the initial imbalances in the data, given any arbitrary evaluation metric. More concretely, similar to De-Arteaga et al. (2019), we expect the gain of each group according to an evaluation metric to be proportional to its population; otherwise, the population bias is compounded by the algorithm.
To define such a metric, we first introduce the population distribution P as the distribution of the portions of users in the groups. For instance, in the case of a binary gender setting, P can be P = [0.8, 0.2], meaning that 80% of the users are male and the rest are female. Given a recommendation algorithm and the evaluation metric M, we define the metric score distribution over user groups, denoted by Q_M = {Q_M(g) | g ∈ G}. Each Q_M(g) element of Q_M is a probability, defined as the sum of the evaluation scores of all users in group g divided by the sum for all users in all groups:

$$Q_M(g) = \frac{\sum_{u \in U_g} M(u)}{\sum_{g' \in G} \sum_{u \in U_{g'}} M(u)}$$

The Q_M distribution contains the portion of the score of the metric M attributable to each group. Following the above example of binary genders for a RS biased toward males according to NDCG (RecGap_NDCG ≠ 0), the distribution of NDCG scores might result in Q_NDCG = [0.85, 0.15], namely the value 0.85 for male and 0.15 for female users. Now, comparing Q_NDCG with the population distribution P = [0.8, 0.2], we observe that the bias of the population toward males (80% to 20%) has intensified (toward males) in the results of the recommender (85% to 15%). In other words, the recommender algorithm compounds the population bias.
To characterize the differences between these distributions as a single number, we introduce the compounding factor metric, defined as the Kullback-Leibler (KL) divergence between the distributions. The compounding factor of a recommendation algorithm regarding the metric M (CompFct_M) is therefore formulated as:

$$\mathrm{CompFct}_M = D_{\mathrm{KL}}(Q_M \,\|\, P) = \sum_{g \in G} Q_M(g) \log \frac{Q_M(g)}{P(g)}$$

The value of the CompFct_M metric shows the extent of the divergence of Q_M from P, where higher values indicate higher degrees of bias compounding by the algorithm. In the example above, given P = [0.8, 0.2] and Q_NDCG = [0.85, 0.15], this value is CompFct_NDCG = 0.0101.
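A minimal sketch of the compounding factor follows. Note that the logarithm base behind the 0.0101 figure quoted in the text is not stated; the sketch below uses the natural logarithm, so its numbers differ slightly. The function name and input layout are our own.

```python
import math

def comp_fct(scores_by_group, population):
    """Compounding factor: KL divergence of the metric-score
    distribution Q_M from the population distribution P.

    scores_by_group maps a group label to the per-user metric
    scores M(u); population maps a group label to the portion of
    users in that group.
    """
    total = sum(sum(s) for s in scores_by_group.values())
    q = {g: sum(s) / total for g, s in scores_by_group.items()}
    # D_KL(Q_M || P) with the natural logarithm (base is an assumption).
    return sum(q[g] * math.log(q[g] / population[g]) for g in q)

# Example from the text: P = [0.8, 0.2] and Q_NDCG = [0.85, 0.15].
kl = comp_fct({"m": [0.85], "f": [0.15]}, {"m": 0.8, "f": 0.2})
```

When the score distribution matches the population distribution exactly, the divergence is zero, i.e., the algorithm does not compound the population bias.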

Experiment setup
In this section, we explain our experimental setup. Overall, we conduct all the experiments on the data of the Gender subgroup in LFM-2b-DemoBias, considering two experiment scenarios. In the first one, i.e., Standard, we carry out the experiments on the dataset without any intervention. This scenario reflects how the recommender system algorithms under investigation would behave in the wild and how unfairly they would treat users on the basis of gender. The second scenario, i.e., Resampled, instead considers a debiasing procedure applied to the original dataset. The debiasing attempts to reduce the difference in treatment of males and females and corresponds to a scenario where the acknowledged gender-based differences are addressed. In each scenario, we study a variety of core collaborative filtering (CF) algorithms of a different nature (e.g., matrix factorization (Billsus, Pazzani, et al., 1998) and autoencoders (Zhang, Yao, Sun, & Tay, 2019)), as well as several evaluation metrics (both accuracy-related and beyond-accuracy). In the subsequent subsections, we detail the processing steps performed to obtain the dataset used in the experiments (Section 5.1), all algorithms investigated (Section 5.2), the procedure of training and evaluating the algorithms (Section 5.3), the experiment scenarios (Section 5.4), the metrics used (Section 5.5), the significance tests used (Section 5.6), and our approach to hyper-parameter tuning (Section 5.7). Our code for reproducing the experiments is publicly available at https://github.com/CPJKU/recommendation_systems_fairness.

Data processing and preparation
In our experiments, we focus on the Gender subgroup of LFM-2b-DemoBias (see the part related to Gender in Table 1). We process the data of this subgroup according to the following filtering criteria. First, we consider only user-track interactions with a playcount (PC) > 1. This removes possibly noisy user-track interactions likely introduced by single interactions. Second, we consider only tracks listened to by at least 5 different users and users that listened to at least 5 different tracks. These thresholds, commonly used in previous work (Bauer & Schedl, 2019; Liang, Krishnan, Hoffman, & Jebara, 2018; Ning & Karypis, 2011; Schedl, 2017), are necessary for a meaningful use of collaborative filtering algorithms. Third, we consider only LEs collected within the last 5 years. This focuses our study on the users' most recent listening behaviors, which, as shown in Section 3, have increased considerably in the last years. Finally, we transform the user-track interactions into binary values, namely by setting a user-track interaction to 1 if the user has listened to the track at least once, and to 0 otherwise.
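The filtering steps above can be sketched with pandas. This is an illustrative sketch, not the authors' pipeline; the column names are hypothetical, and only a single core-filtering pass is shown.

```python
import pandas as pd

def prepare_interactions(events: pd.DataFrame) -> pd.DataFrame:
    """Apply the filtering steps described above to a listening-event
    table with (hypothetical) columns user_id, track_id, playcount,
    year, and return binarized user-track interactions."""
    # 1. Keep only user-track interactions with playcount > 1.
    df = events[events["playcount"] > 1]
    # 2. Core filtering: tracks with >= 5 listeners and users with
    #    >= 5 listened tracks (one pass shown; in practice the
    #    thresholds can be re-applied until stable).
    track_users = df.groupby("track_id")["user_id"].nunique()
    df = df[df["track_id"].isin(track_users[track_users >= 5].index)]
    user_tracks = df.groupby("user_id")["track_id"].nunique()
    df = df[df["user_id"].isin(user_tracks[user_tracks >= 5].index)]
    # 3. Keep only listening events from the last five years.
    df = df[df["year"] >= df["year"].max() - 4]
    # 4. Binarize: a single row per (user, track) pair.
    out = df[["user_id", "track_id"]].drop_duplicates().reset_index(drop=True)
    out["interaction"] = 1
    return out
```

The binarization step means repeated plays of the same track contribute a single positive interaction, matching the implicit-feedback setting used throughout the experiments.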
The resulting dataset consists of 23,272 users (a favorable setting for studying fairness from the user perspective), but also a very high number of items/tracks (1,606,686 in total), which makes it impractical for large-scale recommendation experiments. We address this issue by randomly sampling 100,000 tracks. Note that random sampling guarantees that tracks covering different levels of popularity are included in the final dataset. We refer to this final subset as LFM-2b-DemoBias_Sub. The statistics of the dataset are reported in Table 2. We note that even with this reduction, the LFM-2b-DemoBias_Sub dataset is still larger than the Million Song Dataset (Bertin-Mahieux et al., 2011) (containing 41,000 items), which is commonly used in research on music RSs.

Recommender system algorithms
We investigate to what extent different recommendation algorithms for implicit data yield different results, depending on users' demographic traits. The selected algorithms cover different types of collaborative filtering approaches. In particular, we study algorithms based on non-personalized recommendation, matrix factorization, k-nearest neighbors, and autoencoders, which have been central in RS research. All the algorithms are applied to a user-item interaction matrix, where users are represented in the rows and items in the columns. If a user interacted with an item, the corresponding value in the matrix is 1, and 0 otherwise. In the following, we provide a brief explanation of each of the RS algorithms studied in this work:

• Popular Items (POP) provides a simple non-personalized baseline. It recommends to all users the same set of top-K tracks, where the tracks are sorted by overall popularity (how many users listened to that track).

• Item k-Nearest Neighbors (ItemKNN) (Sarwar, Karypis, Konstan, & Riedl, 2001) is a basic memory-based recommendation approach based on computing item-item similarity. In this approach, an item is recommended to a user if the item is similar to the items previously selected by the user. In the case of CF systems, items selected by the same group of users are considered more similar than items with non-overlapping user groups.

• Alternating Least Squares (ALS) (Hu, Koren, & Volinsky, 2008) falls into the category of matrix factorization approaches, a widespread family of algorithms since the Netflix challenge (Billsus et al., 1998). ALS employs an alternating training procedure to obtain a set of user and item embeddings, such that the dot product of the embeddings approximates the original user-item matrix.
• Bayesian Personalized Ranking (BPR) (Rendle, Freudenthaler, Gantner, & Schmidt-Thieme, 2012) provides an optimization function that, instead of predicting the rating for a specific user-item pair, ranks the items consumed by the users according to their preferences (hence, personalized ranking). To this end, BPR defines an implicit order between pairs of items: it maximizes the difference between the rating predictions of items the user has interacted with and those of items with no interaction. We apply the BPR objective function on matrix factorization embeddings.
• Sparse Linear Methods (SLIM) (Ning & Karypis, 2011) is a linear model that aims to compute top-K recommendations by factorizing the item-item co-occurrence matrix under non-negativity, ℓ1, and ℓ2 constraints. The learned item coefficients are used to sparsely aggregate past user interactions and predict the recommended items for the user.

• Variational Autoencoders (MultiVAE) (Liang et al., 2018) is a variational autoencoder architecture that first projects the user's sparse interaction vector into a latent distribution space, which is afterwards used to generate a probability distribution over all items. MultiVAE employs a multinomial likelihood and a different regularization procedure involving linear annealing.
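To make the memory-based family concrete, a minimal ItemKNN-style scorer can be sketched in NumPy. This is a didactic sketch, not the implementation evaluated in the paper: it uses cosine item-item similarity, keeps the k most similar neighbours per item, and aggregates similarities of a user's consumed items to score candidates.

```python
import numpy as np

def item_knn_scores(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Score all items for all users from a binary user-item matrix X
    (users in rows, items in columns) via top-k cosine item-item
    similarity. Illustrative sketch only."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    norms[norms == 0] = 1.0                  # avoid division by zero
    Xn = X / norms
    sim = Xn.T @ Xn                          # item-item cosine similarity
    np.fill_diagonal(sim, 0.0)               # ignore self-similarity
    # Keep only the k most similar neighbours of each item.
    keep = np.argsort(sim, axis=1)[:, -k:]
    mask = np.zeros_like(sim, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    sim = np.where(mask, sim, 0.0)
    # A user's score for item j aggregates the similarities between j
    # and the items the user has already interacted with.
    return X @ sim
```

In a real system, items the user has already consumed would be masked out before producing the top-K recommendation list.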

Experiment procedure
In this section, we describe in detail the procedure for training and evaluating the algorithms, together with the cross-validation method applied.
For the recommendation task at hand, different evaluation objectives and data splits have been proposed in the RS literature (Meng, McCreadie, Macdonald, & Ounis, 2020; Sun et al., 2020). In our experiments, we employ a User Split strategy (Meng et al., 2020) of the dataset, commonly used for autoencoder-like algorithms (Liang et al., 2018; Sachdeva, Manco, Ritacco, & Pudi, 2019; Steck, 2019), among others. The splitting strategy is shown in Fig. 2a. The 19,972 users of the LFM-2b-DemoBias_Sub dataset are partitioned into train, validation, and test sets using a common 60-20-20 ratio split. The users in the training set, along with all their interactions, are used to train the algorithms under analysis (Fig. 2b). The evaluation procedure (either validation or testing) is carried out by feeding 80% of a user's items, sampled uniformly at random, to the models and using the remaining 20% as ground truth for calculating the metrics (Fig. 2c). Intuitively, this evaluation procedure forces the models to learn the broad music tastes of the users instead of only predicting what the user is going to listen to next (e.g., as in leave-k-out strategies, Meng et al., 2020). This experiment setup is also referred to as strong generalization (Liang et al., 2018; Marlin, 2004), since we evaluate the recommender systems on novel users not encountered during training. In order to provide evaluation for all users in the dataset and also to avoid possible biases introduced by the user-sampling strategy mentioned above, we follow standard practice in machine learning and perform 5-fold cross validation, as shown in Fig. 3a. In more detail, we split the users into 5 equal-sized groups and use 3 groups for training (60% of the users), 1 for validation (20% of the users), and 1 for testing (20% of the users). For each fold, we follow the training and evaluation procedure described above. We switch the groups in a round-robin fashion until each one of the user groups has been used as a test set. Applying cross validation provides better estimates for the metrics and also allows testing all the users in our dataset, leading to a better comparison based on the gender attribute.

Fig. 2. As shown in Fig. 2a, 60% of users in the dataset are used as training data, 20% as validation data, and 20% as test data. The users, and all their items, are used to train the model under investigation (cf. Fig. 2b). For the evaluation procedure (cf. Fig. 2c), 80% of a test user's items, randomly selected, are provided as input to the model. Subsequently, using the remaining 20% of the data as holdout set, the metrics are computed on the model output (predicted items) and the holdout set.

Fig. 3. 5-fold cross validation and the resampling procedure (cf. Figs. 3a and 3b, respectively). The 5-fold cross validation cyclic procedure guarantees that every user ends up in a test set once (shown with the darker texture for each of the five iterations) and in a validation set once (shown with the lighter texture). In Fig. 3b, the resampling procedure is illustrated for a specific fold: the female users in the training data are resampled until they match in frequency the male users.
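The user split and the strong-generalization evaluation described in this section can be sketched as follows; function names and the seed are our own, and the exact sampling code is illustrative.

```python
import numpy as np

def user_split_eval(user_items, rng, holdout_frac=0.2):
    """Strong-generalization evaluation sketch: for a held-out user,
    feed 80% of their items to the model and keep 20% as the ground
    truth for computing the metrics."""
    items = np.array(user_items)
    rng.shuffle(items)
    n_hold = max(1, int(round(holdout_frac * len(items))))
    return items[n_hold:], items[:n_hold]   # (fold-in items, holdout)

def five_fold_user_splits(user_ids, rng):
    """Round-robin 5-fold split of users into train (60%),
    validation (20%), and test (20%) sets, so that every user
    appears in a test set exactly once."""
    users = np.array(user_ids)
    rng.shuffle(users)
    folds = np.array_split(users, 5)
    for i in range(5):
        test = folds[i]
        val = folds[(i + 1) % 5]
        train = np.concatenate([folds[j] for j in range(5)
                                if j not in (i, (i + 1) % 5)])
        yield train, val, test
```

Because whole users (not interactions) are held out, the trained model never sees any data of the test users, matching the strong-generalization setup.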

Experiment scenarios
We perform all the experiments with two scenarios in mind: Standard and Resampled. In the Standard scenario, we train the system without any intervention on the data, which corresponds exactly to the procedures described in Section 5.3. For the Resampled scenario, we attempt to debias the recommendation algorithm by intervening on the underlying dataset using resampling, as shown in Fig. 3b. First, following the procedure outlined in Section 5.3, we split the users into training, validation, and test sets, as done in the Standard scenario. Second, we resample the users of the minority group (female) in the training set until they match the number of training data points of the majority group (male). Note that when a user is resampled, her listening history is duplicated and used fully in the training procedure. Following Geyik et al. (2019), by providing a balanced representation of male and female users during training, we aim to promote equally good recommendations at inference time. Furthermore, since the validation and test sets are left untouched, the evaluation results from the Standard and Resampled scenarios are comparable.
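The resampling step can be sketched as below. This sketch assumes a binary sensitive attribute and balances the number of users per group (each duplicate carries the user's full listening history, as described above); matching user counts rather than raw interaction counts is a simplifying assumption of the sketch.

```python
import random

def resample_minority(train_users, genders, seed=0):
    """Debiasing-by-resampling sketch: duplicate users of the
    minority group until the two groups contain equally many
    training users. `genders` maps user id -> group label."""
    groups = {}
    for u in train_users:
        groups.setdefault(genders[u], []).append(u)
    by_size = sorted(groups.values(), key=len, reverse=True)
    majority, minority = by_size[0], by_size[1]
    rng = random.Random(seed)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    # Original training users plus duplicated minority users.
    return list(train_users) + extra
```

Only the training set is modified; validation and test sets stay untouched, which keeps the two scenarios comparable.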

Evaluation metrics
We evaluate the performance of the algorithms using two accuracy-based metrics: recall and Normalized Discounted Cumulative Gain (NDCG). We also evaluate the results using two beyond-accuracy metrics: diversity and coverage. All metrics are calculated over a ranking result up to position K. We provide a brief explanation of the metrics in what follows.
Recall@K for user u is defined as:

$$\mathrm{Recall@}K(u) = \frac{\sum_{k=1}^{K} \mathrm{rel}(k)}{\min(K, |T_u|)}$$

where |T_u| is the number of items in the test set which are relevant to u, and rel(k) is an indicator function signaling whether the recommended track at rank k is relevant to u (i.e., rel(k) = 1) or not (i.e., rel(k) = 0). Recall@K quantifies the ability to retrieve relevant items for the user in analysis. It ranges from 0, where no relevant items for the user are retrieved, to 1, where all relevant items are present in the first K positions. NDCG@K is defined as

$$\mathrm{NDCG@}K(u) = \frac{\mathrm{DCG@}K(u)}{\mathrm{IDCG@}K(u)}$$

where IDCG@K(u) is the ideal DCG@K for user u, obtained when all items in u's test set are ranked at the top K, and DCG@K(u) is the discounted cumulative gain at position K for user u, given by

$$\mathrm{DCG@}K(u) = \sum_{k=1}^{K} \frac{\mathrm{rel}(k)}{\log_2(k+1)}$$

where rel(k) is the same indicator function as above. Compared to Recall@K, NDCG@K is an accuracy metric that not only quantifies the ability to retrieve relevant items, but also the ability to rank them. A recommender system algorithm that provides relevant items at the top of the list will score higher in NDCG@K than an algorithm for which the relevant items are at the bottom. This behavior is enforced by ''discounting'' items according to their position, i.e., computing the Discounted Cumulative Gain (DCG).
To normalize the score between 0 and 1, the DCG is compared to the so-called ''ideal ranking'' obtained by placing all the relevant items at the top of the list. Diversity is calculated for each user as the normalized Shannon entropy on the artist level:

$$\mathrm{D@}K(u) = -\frac{1}{\log K} \sum_{a \in A_u} p(a) \log p(a)$$

where A_u is the set of unique artists whose tracks were recommended (in the top K) to user u, and p(a) is the proportion of tracks by artist a in the top K of the recommendation list. D@K(u) is therefore equal to 1 if every track among the top K tracks recommended to user u has a different artist. D@K(u) becomes 0 if all top K recommended tracks come from the same artist. Finally, Coverage@K is defined as the fraction of tracks in the test set that are included in the top K recommendation list of at least one user.
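The per-user metrics above can be sketched as follows. This is an illustrative reimplementation under the definitions given in the text (binary relevance, log2 discount, recall normalized so that retrieving all relevant items in the top K yields 1); it assumes a non-empty relevant set.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Recall@K: retrieved relevant items over min(K, #relevant)."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / min(k, len(relevant))

def ndcg_at_k(ranked, relevant, k):
    """NDCG@K with binary relevance and a log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / idcg

def diversity_at_k(ranked_artists, k):
    """Normalized Shannon entropy over the artists of the top-K tracks."""
    top = ranked_artists[:k]
    probs = [top.count(a) / len(top) for a in set(top)]
    ent = -sum(p * math.log(p) for p in probs)
    return ent / math.log(len(top)) if len(top) > 1 else 0.0
```

Coverage@K, in contrast, is a system-level metric: it is computed over the union of all users' top-K lists rather than per user.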
In our experiments, we compute the metrics for K = {5, 10, 50}. This aims to model different user needs, ranging from a user interested in only a few top recommendations to a user who inspects a longer list of recommended items. When discussing results in detail (in Section 6), we focus on the setting K = 10, because this is the number of tracks Last.fm's recommender displays to its users by default. Results for K = {5, 50} are provided in Appendix A.

Significance test
We test the significance of the differences of results in two settings. The first setting (considered to examine the results of RecGap) regards the differences of one recommendation algorithm between two groups with different sizes, i.e., females versus males. The second setting concerns the differences of one recommendation algorithm between two application scenarios, i.e., Standard and Resampled (see Section 5.4). In both cases, we perform the Mann-Whitney U test, also known as the Wilcoxon rank-sum test (McKnight & Najab, 2010). In addition, pairwise comparisons across the different models for each scenario are carried out. For these, Dunn's test with Bonferroni correction for p-value adjustment is applied. Considering that we carry out the experiments through 5-fold cross validation, in each experiment five independent statistical tests (one for each fold) are performed. Subsequently, the resulting p-values are combined by applying the weighted Stouffer's Z-method (Mosteller, Bush, & Green, 1954; Stouffer, Suchman, DeVinney, Star, & Williams Jr, 1949). We select the weighted Z-method as it is robust to asymmetry problems (Whitlock, 2005) and less sensitive to a single low p-value, i.e., in order to achieve a low combined p-value, several consistently low p-values are required (Darlington & Hayes, 2000). Furthermore, the weighted Z-method is also suitable when the combined p-values come from multiple tests of the same hypothesis (Whitlock, 2005), as in our study. In all our experiments, we consider results with p < 0.01 as significant.
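The weighted Z-method for combining the five per-fold p-values can be sketched with the standard library; the function name and the one-sided convention are our own.

```python
import math
from statistics import NormalDist

def stouffer_weighted(p_values, weights):
    """Weighted Stouffer's Z-method: combine several one-sided
    p-values (e.g., one per cross-validation fold) into one."""
    nd = NormalDist()
    # Convert each p-value to a z-score.
    z = [nd.inv_cdf(1.0 - p) for p in p_values]
    # Weighted combination, normalized by the weight vector's norm.
    z_comb = (sum(w * zi for w, zi in zip(weights, z))
              / math.sqrt(sum(w * w for w in weights)))
    # Back to a combined p-value.
    return 1.0 - nd.cdf(z_comb)
```

Note the robustness property mentioned above: a single very low p-value among otherwise unremarkable ones does not, by itself, drive the combined p-value below a strict threshold; consistently low per-fold p-values are required.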

Hyper-parameters and training
We select the hyper-parameters of the algorithms under investigation by performing a grid search over different parameter values and choosing the best set of parameters according to the NDCG@50 results on the validation set. After validation, the best-performing model is selected and finally evaluated on the test set. We reselect the hyper-parameters for each fold.
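The per-fold grid search can be sketched generically; here `validate` is a hypothetical stand-in for training a model with the given parameters and computing NDCG@50 on the validation set.

```python
from itertools import product

def grid_search(param_grid, validate):
    """Evaluate every combination in `param_grid` (a dict mapping
    parameter name -> list of candidate values) with the supplied
    `validate` callable, and keep the best-scoring combination."""
    best_params, best_score = None, float("-inf")
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        score = validate(params)  # e.g., NDCG@50 on the validation set
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, the ItemKNN grid described below would be expressed as `{"neighbors": [3, 5, 10], "similarity": ["cosine", "pearson", "jaccard"], "shrink": [0, 10, 100]}`.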
In the following, we report the range of hyper-parameters relevant to each model.
For ItemKNN, we select the number of neighbors from {3, 5, 10}, and the similarity function among the cosine metric, Pearson correlation coefficient, and Jaccard coefficient. We also examine the effect of removing normalization in the above-mentioned similarity functions (by dismissing the denominator). We select the value for shrinkage (Bell & Koren, 2007) from {0, 10, 100}.
For MultiVAE, we explore different (symmetric) architectures and annealing procedures. We set the total number of epochs to 100 and the learning rate to 1e−3. We examine various architectures, namely I-500-I, I-1000-I, and I-1000-500-1000-I. In these architectures, I is the total number of tracks, the numbers in the middle denote the dimensions of the latent embeddings, and the numbers in between (when present) are the intermediary dimensions of the feed-forward networks with hyperbolic tangent non-linearity. Regarding the annealing procedure described in the original paper (Liang et al., 2018), we linearly anneal the regularization parameter by choosing the beta steps from the values {5000, 100,000}. We set the annealing cap to 1, i.e., the regularization is performed up to its maximum value.

Results and discussion
In this section, we report the results of our experiments and discuss the findings. We first present the overall performance of the recommendation algorithms/models (Section 6.1), and then report the results of measuring fairness in recommendations (Section 6.2). In the current section, we only report the results of the ranking up to position K = 10. The results regarding the other positions (5 and 50) are reported in Appendix A. Furthermore, we also carry out experiments on the LFM-1b dataset with the exact same experiment settings, whose results are reported in Appendix B.

Performance evaluation results
Table 3 shows the evaluation results of experiments, averaged over all users, for the two experiment scenarios, namely Standard and Resampled.We conduct significance tests (see Section 5.6) between Standard and Resampled for the results of each algorithm and metric.
Overall, SLIM shows the best performance in terms of the accuracy-based metrics (NDCG and recall) as well as diversity across the two experiment scenarios. For all the pairwise comparisons, the difference between the scores achieved by SLIM and the other models is statistically significant. The lowest difference is observed between SLIM and ItemKNN in the Resampled scenario for diversity (p = .0002). On the other hand, ItemKNN has the highest score in terms of coverage. Except for the comparisons with SLIM and BPR (in both the Resampled and the Standard scenarios), all the other pairwise comparisons between ItemKNN and the other models show statistically significant differences. As expected, the non-personalized approach (POP) shows the lowest performance on the accuracy-based metrics and a value of 0.0 on coverage, as POP recommends the same set of items to all users. Matrix factorization approaches (ALS and BPR) generally perform worse than the memory-based ItemKNN in terms of the accuracy-based metrics, while BPR has a higher diversity than ItemKNN. Finally, MultiVAE generally performs weaker, especially on the accuracy-based metrics. This observation is in contrast to the results reported in previous studies on smaller datasets (Dacrema, Boglio, Cremonesi, & Jannach, 2019). We suspect that the lower performance of MultiVAE is due to the large number of items in our datasets, which makes it harder for the algorithm to provide effective distributions of output predictions.
We consider further analysis of this behavior of MultiVAE as future work.
Comparing the results of Standard with the corresponding ones of Resampled, we observe an overall decrease in performance, while in the majority of cases no significant differences are observed. The cases where debiasing significantly harms the performance are ItemKNN on all metrics, MultiVAE on NDCG and recall, and ALS on diversity and coverage. In the rest of this section, we study the results of fairness and compounding factors on these algorithms.

Fairness gap in recommender systems
In this section, we first discuss the evaluation results of RecGap, and then analyze the outcomes of the compounding factor. We calculate the RecGap metric, as explained in Section 4, on Standard and Resampled by evaluating the test set results separately on the male and female user groups. Tables 4-7 repeat the overall evaluation results, but also report the results on each user group, as well as the calculated RecGap metric. Higher absolute values of RecGap indicate a higher degree of unfairness, while (m) or (f) indicates the direction of favored treatment, depending on whether it is toward (m)ales or (f)emales. We also report the significance of the differences between the evaluation results of the male versus the female group, shown by the dagger sign. The highest absolute values for each scenario and metric are shown in bold.
As shown, the majority of algorithms exhibit a significant gap (RecGap) in performance in favor of the male user group, in particular on NDCG, recall, and coverage. The only exception is POP, which shows counter-bias, namely a RecGap value leaning toward female users; this case is discussed later in the section. The diversity metric in general shows a very low gap with a slight tendency toward the female group. In the case of POP, the number of distinct items in all users' recommendation lists is obviously 10, resulting in a Coverage@10 value of only 10/∼100,000. Notably, the debiasing results of POP on diversity show a significant improvement. Across the algorithms, SLIM has the highest degree of unfairness on the accuracy-based metrics. This is particularly concerning as SLIM performs best across all algorithms, making it a strong candidate for a potential RS when not taking the fairness measure into account. In fact, we can even observe an inverse relationship between the accuracy-based and fairness metrics: RecGap becomes larger for the better performing algorithms such as ItemKNN and SLIM, while it decreases for the algorithms that perform worse in terms of accuracy, reaching the minimum with BPR and POP.
Looking at the effect of the debiasing method, we observe that RecGap only slightly decreases on Resampled in comparison with Standard on NDCG, recall, and coverage. This decrease, while marginal, is still valuable considering that the debiasing method does not deteriorate the performance of the majority of the algorithms, such as SLIM, ALS, and BPR. These results indicate the need for further study of other algorithmic debiasing methods on this dataset, which we consider a future direction.
We now inspect the relationship between the results of a performance measure, NDCG, and its corresponding RecGap. Fig. 4a depicts NDCG@10 and its RecGap for the different recommendation algorithms, evaluated on the LFM-2b-DemoBias_Sub dataset. The darker mark for each recommendation algorithm indicates the result of the Standard scenario, and the lighter mark the one of the Resampled scenario. The dashed line shows the linear fit over the results of Standard, indicating the correlation between NDCG@10 and RecGap. As shown, the performance metric and RecGap highly correlate, such that the algorithms with higher NDCG also show higher unfairness. In other words, the recommendation algorithms achieve better performance by more strongly improving the results of the majority group, and hence increasing the gap. Our results highlight the importance of studying the fairness of recommendation algorithms in parallel and side-by-side with their performance.
For completeness, we also report the results of the same experiments on LFM-1b-DemoBias Sub (see Appendix B for details).As shown, the same pattern of correlation can be observed on this dataset.However, since the NDCG@10 results of the algorithms on LFM-1b-DemoBias Sub are consistently lower than the corresponding ones in LFM-2b-DemoBias Sub , the corresponding correlation coefficient is relatively smaller.
As for the results of the compounding factor, namely the metric score distributions and the corresponding CompFct, they are provided in the last two columns of Tables 4-7, respectively. These results provide a complementary view on unfairness in RSs by quantifying to what extent the data/population bias is intensified by model bias.
The distribution of male and female users in the population distribution is P = [0.779, 0.221]. Based on this population distribution, if the value for males in a metric gain distribution exceeds 77.9%, the corresponding model has compounded the population bias toward the male group, which is accordingly reflected in CompFct. The highest absolute values of CompFct across each algorithm, metric, and dataset are shown in bold. As expected, the majority of algorithms (with the exception of POP) compound the existing data bias in the final results. Similar to RecGap, the debiasing approach generally decreases the absolute values of CompFct (except for POP and ItemKNN), although these decreases are marginal. These results highlight how the existence of unfairness in RS models amplifies the underlying biases in the data, and motivate future work addressing this issue.
Extending the previous description of results, we analyze the differences between the recommendation algorithms in terms of the RecGap measure, and provide possible explanations for these differences. POP expectedly does not perform well because it only considers popular items and ignores the subtleties of personalized recommendation algorithms. Our results imply that female users, on average, consume slightly more popular tracks compared to male users, which is in line with previous findings (Schedl, Hauger et al., 2015). One particular observation about POP is that its RecGap in the Resampled scenario slightly increases, while it decreases for the other algorithms. We argue that this is due to the increase in popularity of the items listened to by female users, resulting from the resampling of female users. Such an increase eventually leads to slight improvements of POP in terms of the accuracy metrics (NDCG@10 and Recall@10) for the female group, and consequently a marginal increase in RecGap toward females.
For the algorithms that create personalized models, the observations show considerably different characteristics. The most unfair algorithms in terms of accuracy are SLIM and ItemKNN (highest RecGap). Both of these algorithms rely on an item-item similarity measure, computed from the user-item interaction matrix. This suggests a possible effect of the item-item similarity measure on reflecting the preferences of the majority group. We consider further in-depth analyses of such an effect a future direction.
Finally, we compare the results of the two investigated matrix factorization approaches, namely ALS and BPR. The two algorithms yield substantially different RecGap values: BPR provides the fairest results among the personalized models (though also the second poorest in terms of NDCG and recall), while ALS's results are considerably more unfair. Based on these experiments, we do not observe any direct effect of the matrix factorization method on the fairness of recommendation algorithms.

Conclusions, limitations, and future work
In this work, we study the effect of population and model/algorithm bias regarding gender in the context of music recommendation. To this end, we first introduce LFM-2b, a novel large-scale real-world dataset of music listening records, which comprises LFM-2b-DemoBias, a subset containing the listening records of users for whom demographic information in terms of gender, age, and country of origin is available. Using LFM-2b-DemoBias, we explore the group fairness of RS algorithms regarding users' gender, according to the discrepancies in the evaluation measures of the algorithms. We study different collaborative filtering algorithms common in the literature, and consider accuracy and beyond-accuracy metrics. In addition, we formulate the compounding factor for RSs, and study to what extent the RS algorithms intensify the underlying biases in the data. Furthermore, we apply a debiasing method to the data, which aims to mitigate the model bias. In the following, we summarize our findings regarding the considered research questions:

RQ1: Do recommender algorithms of various categories yield different performance scores (in terms of accuracy and beyond-accuracy metrics) for different user groups with respect to gender? If so, how can these differences be characterized? Our research outcomes show that most of the collaborative filtering algorithms considered in our study tend to be unfair toward the female group (minority), in terms of the NDCG, recall, and coverage metrics, particularly on shorter recommendation lists, i.e., with ranking lists up to positions 5 and 10. Furthermore, we notice an inverse relation between the accuracy-based and fairness metrics: better performing algorithms, such as SLIM and ItemKNN, show larger degrees of unfairness compared to less accurate algorithms such as BPR and POP.
RQ2: What is the effect of a resampling strategy, commonly used as a debiasing method, on the performance and fairness of algorithms? Overall, the studied debiasing approach marginally improves the fairness of recommendation results (by reducing RecGap) across the various RS algorithms. Applying debiasing only slightly deteriorates the performance of RS algorithms in terms of accuracy and beyond-accuracy metrics (no significant changes are observed in the majority of cases), which indicates the benefit of using the debiasing method.
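The resampling strategy itself is not detailed in this section. A minimal sketch of one common variant, downsampling the majority-gender users so that both groups are equally represented in the training data, could look as follows (the function name and data layout are illustrative assumptions, not the paper's implementation):

```python
import random

def resample_by_gender(interactions, user_gender, seed=42):
    """Downsample users of the majority gender so that both groups
    contribute the same number of users to the training data.
    `interactions` is a list of (user, item) pairs; `user_gender`
    maps each user to "F" or "M". Illustrative sketch only."""
    rng = random.Random(seed)
    females = [u for u in user_gender if user_gender[u] == "F"]
    males = [u for u in user_gender if user_gender[u] == "M"]
    # Identify minority/majority by group size.
    minority, majority = sorted([females, males], key=len)
    # Keep all minority users plus an equally sized random sample
    # of majority users, then filter the interaction log accordingly.
    kept = set(minority) | set(rng.sample(majority, len(minority)))
    return [(u, i) for (u, i) in interactions if u in kept]
```

Downsampling balances group representation at the cost of discarding majority-group data, which is the usual trade-off behind the small accuracy changes reported above.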
RQ3: Do RS algorithms compound data bias? If so, how can this be characterized? The algorithms that lead to unfair results in RSs also compound the bias in the data. In such cases, the distributions of the algorithms' final gains are even more biased than the distribution of genders in the data. We observe that this compounding of imbalances is particularly pronounced for ALS and ItemKNN on the accuracy-based metrics, and for MultiVAE on coverage.
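To make the two fairness quantities concrete, here is a minimal sketch assuming RecGap is the absolute difference between the groups' metric scores and the gain distribution is each group's share of the total metric gain; the exact definitions and any normalization used in the paper may differ:

```python
def rec_gap(metric_m, metric_f):
    """Absolute gap between the male and female groups' scores on
    one metric (sketch of one plausible RecGap definition)."""
    return abs(metric_m - metric_f)

def gain_distribution(total_gain_m, total_gain_f):
    """Share of the total metric gain received by each group.
    Comparing this against the population distribution (here
    [0.779, 0.221]) indicates whether the algorithm compounds
    the imbalance already present in the data."""
    total = total_gain_m + total_gain_f
    return [total_gain_m / total, total_gain_f / total]

# Example of the reasoning: if males make up 77.9% of users but
# receive, say, 85% of the total NDCG gain, the algorithm has
# amplified the imbalance present in the data.
```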
Our findings translate to reusable insights for the music information retrieval (MIR) and music recommender systems (MRS) communities. First, based on our experimental results, we believe that developing, refining, and adopting debiasing strategies is urgently needed for MIR and MRS tasks that involve personalization, to account not only for gender-related differences in performance (as shown in this paper) but also for other user- and data-specific biases (e.g., related to age, experience, or popularity). While several debiasing approaches already exist, to the best of our knowledge, the vast majority of them still lack validation and adoption in the MIR and MRS communities. Second, when conducting evaluation experiments to assess the performance of (newly proposed) algorithms, which is still the focus of the (technically oriented) MIR and MRS communities, results should be reported for different user groups. While this is commonly done in MIR and MRS research that explicitly aims at comparing different group-specific characteristics, it is often neglected in more technically driven work. This typically requires adapting the experimental setup, making it vital for MIR and MRS researchers to internalize the corresponding awareness of gender (and other) biases during experimental design. Similarly, user studies conducted in MIR and MRS should critically reflect on potential effects of unequal gender distribution among participants. Even when studies presented in the MIR and MRS literature mention the gender distribution of subjects, the results are often not discussed from this perspective, in particular the extent to which differences in results are possibly caused by gender-related aspects. Finally, from a user perspective, methods to increase the transparency of recommendation algorithms and the explainability of recommendations should be more widely adopted, not only to improve trust in the MRS (the motivation commonly mentioned by system providers), but also to raise awareness of potential fairness problems, e.g., through explanations of the form ''You are being recommended song  because other female listeners like it''.
As for limitations of the current study, we acknowledge that the assumption of gender as a binary construct is an oversimplification and does not reflect the complexity of fairness and bias regarding gender. This decision, however, enables us to take practical steps. In addition, we have centered our study on the evaluation of gender-related bias, while other demographics, such as age and country of origin, are not considered. Furthermore, the datasets used contain logs of user interactions with the online platform Last.fm. As such, they can only capture the listening events of people using the platform. All contained information (demographics and listening records) is self-reported by the users, which may be prone to errors and may not necessarily reflect the truth.
We will address these limitations in future work by extending our framework to a non-binary setting, which enables the study of the mentioned cases. In addition, we will consider datasets that originate from other platforms, within and beyond the music domain. Finally, a natural future direction of this work is the study of additional algorithmic debiasing methods. In particular, we will explore whether and how fairness of RS algorithms can be achieved without negatively impacting their average performance.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1. Distribution of thousands of listening events per user across countries (a) and ages (b) for female and male users (indicated in red and blue, respectively). The median and the quartiles are indicated. Countries with more than 2000 users are shown individually; those with fewer than 2000 users are aggregated (and denoted as ''Other''). For the countries' abbreviations and the number of users per country, see the caption of Table 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 2 .
Fig. 2. User-based split. As shown in Fig. 2a, 60% of the users in the datasets are used as training data, 20% as validation data, and 20% as test data. The training users, with all their items, are used to train the model under investigation (cf. Fig. 2b). For the evaluation procedure (cf. Fig. 2c), 80% of each test user's items, selected at random, are provided as input to the model. Subsequently, using the remaining 20% of the items as holdout set, the metrics are computed on the model output (the predicted items) and the holdout set.
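The split protocol described in the caption can be sketched as follows; this is a simplified illustration under the stated 60/20/20 and 80/20 proportions, with variable names and the shuffling mechanics as assumptions rather than the authors' code:

```python
import random

def user_based_split(user_items, seed=42):
    """60/20/20 user-based split into train/validation/test users.
    For each test user, a random 80% of their items serve as model
    input and the remaining 20% form the holdout set on which the
    evaluation metrics are computed."""
    rng = random.Random(seed)
    users = sorted(user_items)
    rng.shuffle(users)
    n = len(users)
    train = users[: int(0.6 * n)]
    valid = users[int(0.6 * n): int(0.8 * n)]
    test = users[int(0.8 * n):]
    inputs, holdout = {}, {}
    for u in test:
        items = sorted(user_items[u])
        rng.shuffle(items)
        cut = int(0.8 * len(items))  # 80% input, 20% holdout
        inputs[u], holdout[u] = items[:cut], items[cut:]
    return train, valid, test, inputs, holdout
```

Splitting by user (rather than by interaction) ensures that test users are entirely unseen during training, which matches the evaluation of models on new users described above.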

Fig. 4 .
Fig. 4. Performance (NDCG@10) versus unfairness (RecGap corresponding to NDCG@10) of different recommendation algorithms on the LFM-2b-DemoBias Sub and LFM-1b-DemoBias Sub datasets. The experiments and results related to LFM-1b-DemoBias Sub are explained in Appendix B. For each recommendation algorithm, the darker and lighter marks represent experiments in the Standard and Resampled scenarios, respectively. The linear fit corresponds to the Standard scenario.

Table 1
Descriptive statistics for the LFM-1b, LFM-2b, LFM-2b\1b, and LFM-2b-DemoBias datasets, and for the demographic groups. Gender: F(emale), M(ale); Country: USA (US), Russia (RU), Germany (DE), UK, Poland (PL), Brazil (BR), and Other (countries with fewer than 2000 users); Age: users under 30 years old and users at least 30 years old (<30 and ≥30, respectively). The numbers of Users, Tracks, Artists, and Listening Events (LEs) are given across groups (All) and for each class. The mean and standard deviation (indicated after ±) of the number of Tracks, Artists, and LEs per User are also indicated.

Table 2
Statistics of the LFM-2b-DemoBias Sub dataset. The numbers of Users, Tracks, Artists, and LEs are reported for F(emale) and M(ale) users separately and together (All). The mean and standard deviation (indicated after ±) of the users' interactions in terms of tracks, artists, and listening events are given in the last three columns, respectively.

Table 3
Overall results of accuracy (NDCG and recall) and beyond-accuracy (diversity and coverage) metrics for the RS algorithms POP, ItemKNN, BPR, ALS, SLIM, and MultiVAE, considering the two experiment scenarios, Standard and Resampled, for all users together (female and male). All results are rounded to three digits. Statistically significant differences between the Standard and Resampled datasets for each model and metric are indicated with an asterisk (*) on the higher of the two values. The highest values for each metric are shown in bold.

Table 4
NDCG@10 results for the users of the male/female (M/F) groups in the Standard and Resampled scenarios. The value of RecGap shows the degree of favorable treatment toward (m)ales or (f)emales. The highest values are shown in bold. Statistically significant differences between the results of females and males are indicated with the † symbol. The Score Dist. columns show the metric score distributions across males and females. The value of CompFct shows the effect of compounding imbalances in the data. The population distribution used to calculate CompFct is [0.779, 0.221].

Table 5
Recall@10 results. Details are identical to Table 4.

Table 6
Diversity@10 results. Details are identical to Table 4.

Table 7
Coverage@10 results. Details are identical to Table 4.

Table A.8
Overall results of accuracy (NDCG and recall) and beyond-accuracy (diversity and coverage) metrics for the evaluated models POP, ItemKNN, BPR, ALS, SLIM, and MultiVAE, considering two settings, Standard and Resampled, for all users together (female and male). Statistically significant differences between the Standard and Resampled scenarios for each model and metric are indicated with an asterisk (*) on the higher of the two values. The highest values for each metric are shown in bold.

Table A.10
NDCG@5 results for the users of the male/female (M/F) groups in the Standard and Resampled scenarios. The value of RecGap shows the degree of favorable treatment toward (m)ales or (f)emales. The highest values are shown in bold. Statistically significant differences between the results of females and males are indicated with the † symbol. The Score Dist. columns show the metric gain distributions across males and females. The value of CompFct shows the effect of compounding imbalances in the data. The population distribution used to calculate CompFct is [0.779, 0.221].

Table A.11
Recall@5 results. Details are identical to Table A.10.

Table A.13
Coverage@5 results. Details are identical to Table A.10.

Table A.14
NDCG@50 results for the users of the male/female (M/F) groups in the Standard and Resampled scenarios. The value of RecGap shows the degree of favorable treatment toward (m)ales or (f)emales. The highest values are shown in bold. Statistically significant differences between the results of females and males are indicated with the † symbol. The Score Dist. columns show the metric gain distributions across males and females. The value of CompFct shows the effect of compounding imbalances in the data. The population distribution used to calculate CompFct is [0.779, 0.221].

Table A.15
Recall@50 results. Details are identical to Table A.14.

Table B.18
Statistics of the LFM-1b-DemoBias Sub dataset. The numbers of Users, Tracks, Artists, and LEs are reported for F(emale) and M(ale) users separately and together (All). The mean and standard deviation (indicated after ±) of the users' interactions in terms of tracks, artists, and listening events are given in the last three columns, respectively.

Table B.19
Overall LFM-1b results of accuracy (NDCG and recall) and beyond-accuracy (diversity and coverage) metrics for the evaluated models POP, ItemKNN, BPR, ALS, SLIM, and MultiVAE, considering two scenarios, Standard and Resampled, for all users together (female and male). Statistically significant differences between the Standard and Resampled scenarios for each model and metric are indicated with an asterisk (*) on the higher of the two values. The highest values for each metric are shown in bold.

Table B.20
LFM-1b NDCG@10 results for the users of the male/female (M/F) groups in the Standard and Resampled scenarios. The value of RecGap shows the degree of favorable treatment toward (m)ales or (f)emales. The highest values are shown in bold. Statistically significant differences between the results of females and males are indicated with the † symbol. The Score Dist. columns show the metric gain distributions across males and females. The value of CompFct shows the effect of compounding imbalances in the data. The population distribution used to calculate CompFct is [0.720, 0.280].