GAP: Geometric Aggregation of Popularity Metrics

Estimating and analyzing the popularity of an entity is an important task for professionals in several areas, e.g., music, social media, and cinema. Furthermore, the ample availability of online data should enhance our insights into the collective consumer behavior. However, effectively modeling popularity and integrating diverse data sources are very challenging problems with no consensus on the optimal approach to tackle them. To this end, we propose a non-linear method for popularity metric aggregation based on geometrical shapes derived from the individual metrics’ values, termed Geometric Aggregation of Popularity metrics (GAP). In this work, we particularly focus on the estimation of artist popularity by aggregating web-based artist popularity metrics. Finally, even though the most natural choice for metric aggregation would be a linear model, our approach leads to stronger rank correlation and non-linear correlation scores compared to linear aggregation schemes. More precisely, our approach outperforms the simple average method in five out of seven evaluation measures.


Introduction
Popularity is without a doubt an abstract notion that is used to express how much attention a certain item, person, or concept has received lately. Today, the estimation of an entity's popularity is desirable in many areas such as music [1], social media [2], science [3], cinema [4], and the Internet [5]. The temporal dynamical patterns of popularity gain vary from entity to entity and can exhibit either viral or steady behavior [6]. Additionally, when multiple metrics concerning performance in general are available for each entity, an optimal approach for the aggregation of the metrics or the rankings is certainly of interest [7,8]. Many such methods have been proposed in the multi-criteria decision analysis (MCDA) research literature [9,10].
This study particularly focuses on the estimation of music artist popularity. For music related products, the traditional way to measure their popularity has been through sales and music top charts. Currently, there is an abundance of online sources that we can draw data from, including streams, downloads, and queries related to music tracks, albums, artists, and musical genres. The consideration of these modern sources as popularity metrics is reasonable for a number of reasons. The music consumer interest is directed to online music sources rather than the traditional record stores and the purchasing of physical albums. Furthermore, not all countries release charts, or if they do, they may not be easy to obtain, so the comparability among countries is hard with the traditional methods of popularity determination. Therefore, we consider web-based artist popularity metrics, such as YouTube views, Spotify popularity, and Facebook mentions, for aggregation.
Determining the popularity of a music track, artist, or genre has attracted increased research interest during the last few years. Many ways to define music popularity have been proposed making use of the online available information from posts on microblog websites [11][12][13][14] and in the blogosphere [15], search queries and the number of shared files in peer-to-peer networks [13,16], play counts in social media music sites such as Last.fm [12,17], the amount of time of radio play, the music industry awards that it received [18], and popularity indices provided by streaming platforms such as Spotify [17]. Of course, the traditional ways of determining music popularity such as the Billboard Magazine chart are also used for comparison with the modern web-based popularity indices [16,19]. In [18], the authors claimed that three factors, the music acoustic content, the artist's reputation, and the number of comments regarding the track, in synergy are able to classify a music track as popular or not, with high accuracy. Furthermore, the level of public recognition of a music track has been investigated providing a different aspect in the evaluation of music entities [20].
Although many studies have been conducted on the estimation of artist popularity, the determination of an evaluation method for such popularity scores remains a challenge as no general agreement regarding an acceptable ground truth has been established. This leads researchers to evaluate their scores through comparison with several other existing popularity metrics such as Spotify popularity, page counts, and the charts. In Table 1, we present the evaluation methods (and ground truth) followed by research papers for their proposed popularity scores. Moreover, to the best of our knowledge, all popularity scores that have been proposed in the research literature until today are univariate, while the method that we propose herein is the first to combine several diverse sources and metrics of popularity in order to summarize the whole picture of an entity's popularity. Although the most natural choice for metric aggregation is a simple average, how best to handle many different sources is not obvious, and it is useful to evaluate and compare non-linear methods as well. Furthermore, being popular with regard to one or some of the monitored metrics is sufficient to characterize an entity as popular; hence, robustness against such cases is desirable when using a metric aggregation method. Our method leverages the area of geometrical shapes formed by the metrics' values in a non-linear manner; thus, we name it Geometric Aggregation of Popularity metrics (GAP), with GAP0 and GAP1 being two variations of the same concept. Finally, we conduct a comparative study including the average normalized metric value and two other non-linear metric aggregation methods.
The rest of the paper is structured as follows. In Section 2, the proposed methodology is illustrated. In Section 3, the experimental setup is elaborated and results are presented, and in Section 4, conclusions are given.

Definition
Here, we propose an aggregation method that leverages multi-source web-based information in order to assess the level of an entity's current popularity. In order to determine the popularity of entity e at time t, we first normalize the respective metric values $v_{e,t,i}$ for i = 1, ..., n (where n is the number of monitored metrics for the entity under study) to [0, 1] using a power transformation as in Equation (1):

$$m_{e,i,t} = \frac{v_{e,t,i}^{P}}{T} \tag{1}$$

where T is the chosen maximum power-transformed value (cf. below), $v_{e,t,i}$ is the initial value of metric i at time t for entity e, $P = \frac{\log(T)}{\log(V_{t,i})} = \log_{V_{t,i}}(T)$ with $V_{t,i}$ the maximum of $v_{e,t,i}$ over all entities e, and $m_{e,i,t}$ the normalized metric value. The choice of the exponent P derives from the observation that $V_{t,i}^{P} = T$ and $0^{P} = 0$, which result in $m_{e,i,t} \in [0, 1]$, given that $v_{e,t,i} \in [0, V_{t,i}]$. We did not opt for a simple "divide by maximum" or "min-max" normalization because there are metrics with huge variation, such as YouTube views, that in some cases reach billions, and thus, artists with millions of views would seem unimportant. Furthermore, we did not opt for a log transform because there are metrics with a small range such as Spotify popularity, with values from zero to 100. In this case, after the transformation, all transformed values would be at most ∼4.61, and a significant Spotify popularity increase, e.g., from 50 to 80 (being 3.91 and 4.38 after log transformation), would not affect the aggregated popularity correspondingly. The power transform alleviates both issues with a relatively high T = 100, which could be optimized if one considers an appropriate ground truth.
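As an illustration, the normalization can be sketched in Python as follows. The sketch assumes that Equation (1) has the form m = v^P / T, which is what the properties V^P = T and 0^P = 0 imply; the function name `power_normalize` and the sample view counts are hypothetical and not taken from our dataset.

```python
import math

def power_normalize(values, T=100.0):
    """Map raw metric values in [0, V] to [0, 1] via v**P / T.

    Assumption: Equation (1) is m = v**P / T with P = log(T) / log(V),
    where V is the maximum value of the metric over all entities.
    """
    V = max(values)
    if V <= 1:
        raise ValueError("the maximum must exceed 1 so that log(V) > 0")
    P = math.log(T) / math.log(V)  # chosen so that V**P == T
    return [(v ** P) / T for v in values]

# Hypothetical view counts spanning several orders of magnitude.
views = [0, 1_000, 1_000_000, 2_000_000_000]
print([round(m, 3) for m in power_normalize(views)])
# Divide-by-maximum, for contrast: mid-sized artists collapse towards zero.
print([round(v / max(views), 6) for v in views])
```

Compared with the divide-by-maximum output, the power transform keeps artists with thousands or millions of views visibly apart from zero while still mapping the maximum to one.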
After the normalization, we considered the unit circle and n equidistant points $k_i$ on it. On each radius from $k_i$ to the center, we selected the point $l_i$ with distance $m_{e,i,t}$ from $k_i$. Geometric Aggregation of Popularity metrics (GAP0) is then defined as:

$$\mathrm{GAP0} = 100 \cdot \frac{E_{out} - E_{in}}{E_{out}} \tag{2}$$

where $E_{out}$ is the area of the outer regular n-sided polygon determined by the points $k_i$ and $E_{in}$ is the area of the inner polygon determined by the points $l_i$. If an artist performs best on all metrics, the inner polygon collapses to the circle's center, and the geometric aggregation of popularity metrics is 100; if an artist performs worst on all metrics, the inner polygon coincides with the outer regular polygon, and the geometric aggregation of popularity metrics is zero. All other cases result in intermediate values. Of course, different orders of the metrics result in different popularity scores; thus, for consistency, we first sorted the metric values and then applied the computations to the sorted sequence of metrics. Figure 1a shows an example computation of GAP0 for the artist "The Rasmus" on 2 April 2019, resulting in GAP0 = 62.0. A second approach to Geometric Aggregation of Popularity metrics (GAP1) was to represent the metrics by the sides of the polygon and not by the vertices. Thus, the inner polygon in this case was the aggregate of n isosceles triangles with side length equal to $1 - m_{e,i,t}$, as depicted in Figure 1b. The popularity was then calculated as in the first approach by applying Equation (2), resulting in GAP1 = 60.4. For comparison, the simple average of the normalized metrics multiplied by 100 was 38.9.
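The geometric construction can be sketched directly in Python. The sketch below is illustrative only: it assumes that Equation (2) reads 100 · (E_out − E_in) / E_out, which matches the two boundary cases described above, and that the metric values are sorted in ascending order. The helper names and the input vector are hypothetical.

```python
import math

def polygon_area(points):
    """Shoelace formula for a polygon given as ordered (x, y) vertices."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def gap0(metrics):
    """GAP0: metrics placed on the vertices of the inner polygon."""
    m = sorted(metrics)  # assumption: ascending order
    n = len(m)
    if n < 3:
        raise ValueError("at least three metrics are needed to form a polygon")
    angles = [2 * math.pi * i / n for i in range(n)]
    outer = [(math.cos(a), math.sin(a)) for a in angles]  # points k_i
    # l_i lies on the radius at distance m_i from k_i, i.e. 1 - m_i from the center.
    inner = [((1 - mi) * math.cos(a), (1 - mi) * math.sin(a))
             for mi, a in zip(m, angles)]
    return 100 * (1 - polygon_area(inner) / polygon_area(outer))

def gap1(metrics):
    """GAP1: each metric gives an isosceles triangle with equal sides 1 - m_i."""
    n = len(metrics)
    if n < 3:
        raise ValueError("at least three metrics are needed to form a polygon")
    theta = 2 * math.pi / n  # apex angle at the center
    e_in = sum(0.5 * (1 - mi) ** 2 * math.sin(theta) for mi in metrics)
    e_out = n * 0.5 * math.sin(theta)
    return 100 * (1 - e_in / e_out)

# Hypothetical normalized metric values for one artist.
m = [0.1, 0.1, 0.9, 0.1, 0.2]
print(round(gap0(m), 1), round(gap1(m), 1), round(100 * sum(m) / len(m), 1))
```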

Additional Analytical Results on Geometric Aggregation of Popularity Metrics
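The calculation of GAP0(m) and GAP1(m), although not obvious from the geometric construction, reduces to simple closed forms given the vector of normalized metric values $m = \{m_{e,i,t} \mid i = 1, \dots, n\}$ for entity e at time t:

$$\mathrm{GAP0}(m) = 100 \cdot \left(1 - \frac{1}{n}\sum_{i=1}^{n} (1 - m_{e,i,t})(1 - m_{e,i+1,t})\right), \qquad \mathrm{GAP1}(m) = 100 \cdot \left(1 - \frac{1}{n}\sum_{i=1}^{n} (1 - m_{e,i,t})^{2}\right)$$

where $m_{e,n+1,t} = m_{e,1,t}$.

Proof. Let $\theta = 2\pi/n$ be the central angle between two consecutive points $k_i$. For GAP0(m), the inner polygon's area is the sum of n triangles' areas:

$$E_{in} = \sum_{i=1}^{n} \frac{1}{2}\,(1 - m_{e,i,t})(1 - m_{e,i+1,t})\,\sin\theta$$

The outer polygon's area is the sum of n equal triangles' areas:

$$E_{out} = n \cdot \left(\frac{1}{2} \cdot 1 \cdot 1 \cdot \sin\theta\right) = \frac{n \sin\theta}{2}$$

Hence, according to Equation (2):

$$\mathrm{GAP0}(m) = 100 \cdot \frac{E_{out} - E_{in}}{E_{out}} = 100 \cdot \left(1 - \frac{1}{n}\sum_{i=1}^{n} (1 - m_{e,i,t})(1 - m_{e,i+1,t})\right)$$

For GAP1(m), the inner polygon's area is the sum of n isosceles triangles' areas:

$$E_{in} = \sum_{i=1}^{n} \frac{1}{2}\,(1 - m_{e,i,t})^{2}\,\sin\theta$$

The outer polygon's area is the same as before:

$$E_{out} = \frac{n \sin\theta}{2}$$

Hence,

$$\mathrm{GAP1}(m) = 100 \cdot \left(1 - \frac{1}{n}\sum_{i=1}^{n} (1 - m_{e,i,t})^{2}\right)$$

Furthermore, considering the most natural choice for popularity aggregation, i.e., the average normalized metric values (Average Artist Popularity (AAP)):

$$\mathrm{AAP}(m) = \frac{100}{n}\sum_{i=1}^{n} m_{e,i,t}$$

the following inequality holds:

$$\mathrm{AAP}(m) \le \mathrm{GAP1}(m) \le \mathrm{GAP0}(m)$$

Proof. The first part of the inequality is straightforward:

$$\mathrm{GAP1}(m) - \mathrm{AAP}(m) = \frac{100}{n}\sum_{i=1}^{n}\left(1 - (1 - m_{e,i,t})^{2} - m_{e,i,t}\right) = \frac{100}{n}\sum_{i=1}^{n} m_{e,i,t}\,(1 - m_{e,i,t}) \ge 0$$

since $m_{e,i,t} \in [0, 1]$. For the second part of the inequality, we begin with the assumption that m is sorted:

$$m_{e,1,t} \le m_{e,2,t} \le \dots \le m_{e,n,t}$$

The difference $D_i$ between the methods GAP1 and GAP0 per metric i is:

$$D_i = (1 - m_{e,i,t})(m_{e,i,t} - m_{e,i+1,t}) \le 0$$

for i = 1, ..., n − 1, and the corresponding difference for i = n is $D_n = (1 - m_{e,n,t})(m_{e,n,t} - m_{e,1,t}) \ge 0$. The total difference between the two models then is:

$$\mathrm{GAP1}(m) - \mathrm{GAP0}(m) = \frac{100}{n}\sum_{i=1}^{n} D_i = \frac{100}{n}\left((\mathbf{1} - m)\cdot(\mathbf{1} - m_r) - \lVert \mathbf{1} - m \rVert^{2}\right)$$

where $m_r = [m_{e,2,t}, m_{e,3,t}, \dots, m_{e,n,t}, m_{e,1,t}]$ is m rolled by −1. According to the Cauchy-Schwarz inequality:

$$(\mathbf{1} - m)\cdot(\mathbf{1} - m_r) \le \lVert \mathbf{1} - m \rVert \, \lVert \mathbf{1} - m_r \rVert = \lVert \mathbf{1} - m \rVert^{2}$$

because rolling a vector does not change its norm. Hence, GAP1(m) − GAP0(m) ≤ 0.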
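The triangle-area argument can be checked numerically. The following Python sketch assumes the closed forms GAP0(m) = 100 · (1 − (1/n) Σ (1 − m_i)(1 − m_{i+1})) and GAP1(m) = 100 · (1 − (1/n) Σ (1 − m_i)²), with m sorted in ascending order for GAP0, and tests the ordering AAP ≤ GAP1 ≤ GAP0 on random vectors. The function names are hypothetical, and the ordering is our reading of the inequality, consistent with the example values 38.9, 60.4, and 62.0 reported earlier.

```python
import random

def aap(m):
    """Average of the normalized metrics on a 0-100 scale."""
    return 100 * sum(m) / len(m)

def gap0_closed(m):
    """Assumed closed form of GAP0: products of cyclically adjacent terms."""
    m = sorted(m)
    n = len(m)
    s = sum((1 - m[i]) * (1 - m[(i + 1) % n]) for i in range(n))
    return 100 * (1 - s / n)

def gap1_closed(m):
    """Assumed closed form of GAP1: squared terms, order independent."""
    return 100 * (1 - sum((1 - x) ** 2 for x in m) / len(m))

# Empirical check of AAP <= GAP1 <= GAP0 on random metric vectors.
rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(3, 12)
    m = [rng.random() for _ in range(n)]
    assert aap(m) <= gap1_closed(m) + 1e-9
    assert gap1_closed(m) <= gap0_closed(m) + 1e-9
print("ordering held on all sampled vectors")
```

Such a check does not replace the proof, but it guards against sign or indexing mistakes when the formulas are implemented.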

Data Set
For this study, our starting point was the list of N = 2349 artists provided by a collaborating record label, called Playground Music. Most of the artists were Swedish, yet artists of several nationalities were also included. For each of these artists, we monitored online popularity metrics from social media and streaming platforms, on a daily basis.
In Table 2, we present the sources and metrics that we used as input to the popularity metric aggregation methods. For each artist, we monitored some or all of these 12 metrics since May 2018, and thus, we could compute the corresponding artist popularity timelines. For Last.fm artist play counts and YouTube channel views, we used as input only the number of plays/views during the last 30 days because the total number may be misleading, in terms of current popularity estimation.

Competitive Aggregation Methods
For evaluation and comparison purposes, we employed two non-linear aggregation methods from multi-criteria decision analysis, as well as the simple average method (AAP).
The first non-linear aggregation method was the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) [10], which takes into account the Euclidean distance of the vector containing an entity's metric values from the best and the worst possible alternative. The second was the Preference Ranking Organization method for enrichment evaluation (PRO) [9], which takes into account the number of metrics for which an entity outperforms another entity and finally combines all differences in order to compute each entity's score.
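For readers unfamiliar with TOPSIS, a minimal Python sketch of its core idea is given below. It is not the implementation used in our experiments: it assumes equal weights, metrics that are already normalized to [0, 1], and ideal and anti-ideal alternatives taken as the per-metric maximum and minimum over the entities. The function name and the input matrix are hypothetical.

```python
import math

def topsis_scores(matrix):
    """Relative closeness to the ideal alternative for each row (entity).

    Assumptions: equal weights, higher is better for every metric, and
    inputs already normalized to a comparable scale.
    """
    n_metrics = len(matrix[0])
    best = [max(row[j] for row in matrix) for j in range(n_metrics)]
    worst = [min(row[j] for row in matrix) for j in range(n_metrics)]
    scores = []
    for row in matrix:
        d_best = math.dist(row, best)    # Euclidean distance to the ideal
        d_worst = math.dist(row, worst)  # Euclidean distance to the anti-ideal
        total = d_best + d_worst
        # If all entities are identical, both distances are zero; return 0.0.
        scores.append(d_worst / total if total > 0 else 0.0)
    return scores

# Hypothetical normalized metrics for three entities.
print(topsis_scores([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]))
```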

Evaluation
We evaluated all the aggregation methods by comparing the produced artist rankings and actual values with the ground truth using the following seven measures of similarity: Pearson correlation ($r_P$), Mutual Information (MI), the Spearman correlation coefficient, Kendall's tau, rank overlap, Spearman's footrule (F), and Kendall's tau distance (K) [7]. F and K are distance measures, hence the smaller the value, the better; all other indices are similarity measures, hence the higher the value, the better. As the ground truth, we used the Last.fm artist play counts and YouTube channel views (summed streams over the last 30 days for both metrics).
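The two rank distance measures can be sketched in Python as follows. The sketch is illustrative: it returns unnormalized distances, ranks entities by descending score, and breaks ties by input order, all of which are assumptions rather than details of our evaluation. The function names are hypothetical.

```python
from itertools import combinations

def ranks(scores):
    """Rank positions (0 = highest score); ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for position, idx in enumerate(order):
        r[idx] = position
    return r

def spearman_footrule(a, b):
    """Sum of absolute rank differences between two score vectors."""
    ra, rb = ranks(a), ranks(b)
    return sum(abs(x - y) for x, y in zip(ra, rb))

def kendall_tau_distance(a, b):
    """Number of entity pairs that the two rankings order differently."""
    ra, rb = ranks(a), ranks(b)
    return sum(1 for i, j in combinations(range(len(a)), 2)
               if (ra[i] - ra[j]) * (rb[i] - rb[j]) < 0)

# Two fully reversed rankings of three entities.
print(spearman_footrule([3, 2, 1], [1, 2, 3]))     # 4
print(kendall_tau_distance([3, 2, 1], [1, 2, 3]))  # 3
```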

Results
In the Introduction, we cited many studies that considered already existing popularity metrics as the ground truth in order to evaluate other popularity scores. We accordingly opted for Last.fm play counts and YouTube channel views (summed streams over the last 30 days) as the ground truth for evaluation purposes. We chose these metrics because we believed that streaming activity reflected artist popularity more accurately than fan count (followers are not always committed to the artist), social media mentions (which are not always related to music), or proprietary "black-box" popularity scores (e.g., Spotify popularity). Furthermore, streaming activity is considered by music business stakeholders as more closely related to artist profits than all other metrics. The five aforementioned aggregation methods, GAP0, GAP1, AAP, TOPSIS, and PRO, were compared and the results are presented here.
In Figure 2, we compare the values of all aggregation methods with the normalized Last.fm artist plays and YouTube channel views on a single date, 2 April 2019, using scatter plots. Furthermore, in Table 3, we present the corresponding similarity measures: Pearson correlation ($r_P$) and Mutual Information (MI) for the linear and non-linear interrelationship between the aggregation methods and the target variables. We also investigated whether the best aggregation method differed significantly from the other methods in terms of similarity to the target. The statistical significance of the differences was estimated as proposed in [21] (dependent overlapping variables) for Pearson correlation and using a randomization test for mutual information. To the best of our knowledge, there are many parametric statistical tests for differences in Pearson correlation [23], yet none for differences in mutual information; thus, we opted for the randomization test.
The randomization test proceeds as follows. We denote by $y \in \mathbb{R}^N$ the target variable, by $x_1 \in \mathbb{R}^N$ and $x_2 \in \mathbb{R}^N$ the two aggregation methods under comparison, and by $\theta^* = I(x_1, y) - I(x_2, y)$ the test statistic, where I(·) is the mutual information. Following an approach similar to the permutation test proposed in [22], the test statistic value $\theta_r$ of the rth Monte-Carlo simulation was computed from the permuted data, which were obtained by pooling $x_1$ and $x_2$ and assigning N of the pooled values, randomly sampled without replacement, to the $x_1$ group. The rest were assigned to the $x_2$ group. We considered R = 1000 Monte-Carlo simulations for the computation of the p-values, which were then determined by Equation (3), where | · | denotes the cardinality of a set.
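The randomization test can be sketched in Python as shown below. Two choices in the sketch are assumptions and not details stated in the text: mutual information is estimated with a simple equal-width histogram, and the p-value is taken as the fraction of simulations whose statistic is at least as large in absolute value as the observed one. The function names and the synthetic data are hypothetical.

```python
import math
import random

def mutual_information(x, y, bins=8):
    """Histogram-based mutual information estimate (assumed estimator)."""
    def to_bins(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant input
        return [min(int((a - lo) / width), bins - 1) for a in v]
    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint, px, py = {}, {}, {}
    for a, b in zip(bx, by):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def randomization_test(x1, x2, y, R=1000, seed=0):
    """Return (theta_star, p_value) for the difference I(x1, y) - I(x2, y)."""
    rng = random.Random(seed)
    n = len(y)
    theta_star = mutual_information(x1, y) - mutual_information(x2, y)
    pooled = list(x1) + list(x2)
    count = 0
    for _ in range(R):
        rng.shuffle(pooled)  # the first n values form the permuted x1 group
        theta_r = (mutual_information(pooled[:n], y)
                   - mutual_information(pooled[n:], y))
        if abs(theta_r) >= abs(theta_star):  # assumed two-sided criterion
            count += 1
    return theta_star, count / R

# Synthetic example: x1 follows the target closely, x2 is pure noise.
rng = random.Random(1)
y = [rng.random() for _ in range(60)]
x1 = [v + 0.05 * rng.random() for v in y]
x2 = [rng.random() for _ in y]
print(randomization_test(x1, x2, y, R=200))
```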
It was apparent, from the scatter plots of Figure 2 (where the dots are more concentrated) and from the correlation analysis of Table 3 (where the similarity scores are higher), that all aggregation methods were correlated with Last.fm artist plays to a much higher degree than with YouTube channel views. Thus, we finally chose Last.fm artist plays as the ground truth for our experiments. In Table 4, the similarity of the aggregation methods with Last.fm artist plays on 2 April 2019 is reported using all measures of similarity. The statistical significance of the corresponding differences was estimated by Zoo's method [21] for Pearson correlation and also using the previously described randomization test for all measures of similarity.

Table 3. Pearson correlation ($r_P$) and Mutual Information (MI) measures of similarity between the target variables (Last.fm artist plays, YouTube channel views) and the aggregation methods (AAP, GAP0, GAP1, TOPSIS, PRO) on 2 April 2019. Bold denotes the best value per similarity measure and target variable. Superscripts mark the best similarity scores whose difference from the 2nd, 3rd, 4th, or 5th best, respectively, is statistically significant (95% confidence). For $r_P$, the first number in the superscript concerns Zoo's method, while the second concerns the randomization test.

In Table 5, the average similarity between Last.fm artist plays and the aggregation methods across time (from 1 July 2018 until 31 May 2019) is reported in terms of linear/non-linear correlation and rank correlation/distance. The results showed that GAP1 exhibited the best performance in three out of seven measures of similarity, AAP in two, GAP0 and PRO in one each, and TOPSIS in none.
Furthermore, the statistical significance of the differences in average similarity was investigated for all similarity measures, using Student's t-test and correcting the p-values with the Bonferroni correction for multiple comparisons (α = 0.05). For each similarity measure, we conducted four comparisons (the best aggregation method vs. each of the rest), so 4 × 7 = 28 comparisons were considered, and the 28 corresponding p-values were adjusted through the Bonferroni correction.

Table 5. Average similarity between the target variable (Last.fm artist plays) and the aggregation methods (AAP, GAP0, GAP1, TOPSIS, PRO) across time (from 1 July 2018 until 31 May 2019). Bold denotes the best average performance per measure of similarity (column). Superscripts mark the best similarity scores whose difference from the 2nd, 3rd, 4th, or 5th best, respectively, is statistically significant (95% confidence) after Bonferroni correction for multiple comparisons.

Although the aggregation methods produced similar artist popularities and rankings (few statistically significant differences were observed), the correlation analysis showed that GAP produced popularity values that were closer to the target than the other aggregation methods when considering the non-linear similarity measure of mutual information, but not when considering the linear correlation. This indicated the advantage of GAP in capturing more complex popularity patterns than the simple average, which produced higher values only in linear correlation. In terms of ranking, GAP exhibited less distance from the target's ranking with regard to Spearman's footrule and Kendall's tau distance and more proximity to the target's ranking with regard to Kendall's tau. PRO best approximated the target's artist ranking with regard to the Spearman correlation coefficient, and GAP and AAP showed almost identical results with regard to overall rank overlap.

In Figure 3, we present the aggregation methods' timelines for 10 popular artists with the highest discrepancy among the monitored popularity metrics. We focused on artists that exhibited differences in their popularity among different popularity metrics, because otherwise, the aggregation methods would provide the same information as the individual metrics and the comparison among them would not yield noteworthy conclusions. In order to select them, we first determined the set A of the 100 most popular artists on a certain date, 1 April 2019, by sorting the sums of differences between each artist's metric values and the maximum metric values in our dataset, as shown in Equation (4):

$$A = \{a : s_a \text{ is among the 100 smallest}\}, \qquad s_a = \sum_{i=1}^{n} \left(\max(m_{:,i,t_0}) - m_{a,i,t_0}\right) \tag{4}$$

where n is the number of metrics, $t_0$ = 1 April 2019, $m_{:,i,t_0}$ is the vector of normalized metric values for metric i, time $t_0$, and all artists, and max(v) is the maximum value in vector v. Subsequently, we employed Shannon entropy [24] as a measure of discrepancy on the distribution of normalized metric values per artist and selected the 10 artists of set A that exhibited the highest discrepancy, namely the lowest entropy, as shown in Equation (5):

$$B = \{a \in A : E(m_{a,:,t_0}) \text{ is among the 10 smallest}\} \tag{5}$$

where $m_{a,:,t_0}$ is the vector of normalized metric values regarding artist a at time $t_0$ and E(v) is the Shannon entropy computed on vector v. The vector $m_{a,:,t_0}$ was divided by the sum of its elements in order to sum to one, prior to entropy calculation.

It was observed that these 10 artists retained high aggregated popularity values, in terms of GAP, despite the low level of popularity in some individual metrics, while AAP produced lower popularity values as a result of low popularity in some individual metrics. Furthermore, GAP0, GAP1, and AAP exhibited a more stable trajectory than TOPSIS and PRO, whose higher volatility partly explained their inferior performance.
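The two-step selection described above can be sketched in Python as follows. The sketch is illustrative: the function names, the dictionary layout, and the toy data are hypothetical, and assigning the maximum entropy to an all-zero vector is our own assumption for a case the text does not cover.

```python
import math

def shannon_entropy(values):
    """Shannon entropy of a vector after dividing it by its sum."""
    total = sum(values)
    if total == 0:
        # Assumption: an all-zero vector shows no discrepancy (maximum entropy).
        return math.log(len(values))
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log(p) for p in probs)

def select_discrepant(metrics_by_artist, top_popular=100, top_discrepant=10):
    """Keep the most popular artists, then those with the lowest entropy.

    metrics_by_artist maps an artist name to its list of normalized metrics.
    """
    names = list(metrics_by_artist)
    n = len(metrics_by_artist[names[0]])
    col_max = [max(metrics_by_artist[a][i] for a in names) for i in range(n)]
    # Sum of differences from the per-metric maxima: smaller means more popular.
    gap_to_max = {a: sum(col_max[i] - metrics_by_artist[a][i] for i in range(n))
                  for a in names}
    popular = sorted(names, key=gap_to_max.get)[:top_popular]
    return sorted(popular,
                  key=lambda a: shannon_entropy(metrics_by_artist[a]))[:top_discrepant]

# Hypothetical normalized metrics for four artists.
data = {
    "a": [1.0, 1.0, 1.0],
    "b": [1.0, 0.01, 0.01],
    "c": [0.1, 0.1, 0.1],
    "d": [0.9, 0.9, 0.05],
}
print(select_discrepant(data, top_popular=3, top_discrepant=1))
```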
The fact that GAP produced higher popularity values when the artist was popular in one or some metrics while not popular in the others was considered a major advantage compared with AAP. The reason for that was twofold: (a) it was not common for artists to be popular on all platforms, as they tended to be active mainly on one or some of them; and (b) being popular on one or some platforms was sufficient for an artist to be characterized as popular in general.
In Table 6, we present a simulated example in order to showcase this advantage. Although a low popularity level was exhibited in most metrics, being popular in Metric 4 enabled GAP to produce a relatively high popularity estimate. On the contrary, AAP assigned a relatively low popularity estimate to the same entity. Finally, in Table 7, we present three artists from our dataset whose metric values are distributed as in the simulated case; the same conclusion can be drawn from these examples.

Discussion
In this study, we proposed an aggregation method for popularity metrics that leveraged diverse sources of popularity information such as metrics derived from social media and streaming platforms. This was the first attempt to aggregate multiple popularity sources in the academic literature related to music information retrieval, and it yielded satisfactory results on the useful task of summarizing the whole popularity picture of an artist. Its algorithm used geometrical shapes formed by the individual metrics' values of each entity, and it was found to outperform the most natural choice for metric aggregation, a simple average, with respect to several measures of similarity between the computed metrics and reference data. Furthermore, the proposed aggregation method was robust even when the artist under study was popular only in some of the monitored popularity metrics. Finally, we should mention that our methodology could be extended for use in several other areas such as cinema and football, in which actors and players would serve as entities and their social media accounts and other related factors (e.g., tickets/jerseys sold) as metrics. Future work will include the evaluation of all metric aggregation methods on other tasks, such as the prediction of individual metrics' future values.
Funding: This work is partially funded by the European Commission under Contract Number H2020-761634 FuturePulse.