Network-based ranking in social systems: three challenges

Ranking algorithms are pervasive in our increasingly digitized societies, with important real-world applications including recommender systems, search engines, and influencer marketing practices. From a network science perspective, network-based ranking algorithms solve fundamental problems related to the identification of vital nodes for the stability and dynamics of a complex system. Despite the ubiquitous and successful applications of these algorithms, we argue that our understanding of their performance and their applications to real-world problems face three fundamental challenges: (1) rankings might be biased by various factors; (2) their effectiveness might be limited to specific problems; and (3) agents’ decisions driven by rankings might result in potentially vicious feedback mechanisms and unhealthy systemic consequences. Methods rooted in network science and agent-based modeling can help us to understand and overcome these challenges.


I. INTRODUCTION
The roots of research on ranking in social systems can be traced back to the 40s and 50s, when early studies introduced methods to quantify the social status of the members of a social system [1][2][3]. Today, with the increasing availability of massive datasets on human activity, research on ranking algorithms for social systems is highly interdisciplinary, and it has significant real-world applications. These two features are sharply exemplified by Google's PageRank. Introduced in 1998 by Brin and Page, this ranking algorithm became massively popular because of its implementation in Google's Web search engine [4], which has triggered a large wave of interest among computer scientists in its mathematical properties [5] and applications to information retrieval problems [6]. At the same time, its defining equation had already been formulated in the social science literature in 1991 [7,8], and the algorithm and its variants have found countless scientific applications beyond their original domain [9], including the quantification of scientific impact [10], social leadership [11], species importance for an ecosystem's stability [12], and sports performance [13]. Real-world applications of ranking go far beyond Web search engines: every day, reputation systems estimate the trustworthiness of users and retailers in online marketplaces such as Airbnb [14]; automated recommender systems can be used to suggest new products to users in e-commerce platforms like Alibaba and Amazon [15], or suitable investors for new startups [16]; rankings of prospective influencers can be used by online content creators and companies to detect the best candidates for product endorsements [17].
These examples highlight the relevance of ranking various kinds of agents that compose social systems, including pieces of information (like websites and scientific papers), individuals, and businesses. A data-driven ranking of a given class of agents can be obtained through a variety of techniques, ranging from network-based algorithms (the main focus of this Perspective) to machine-learning algorithms (e.g., matrix factorization for recommender systems [18] and latent semantic models to quantify the relevance of a document to a query [19]) and methods based on dynamical models (e.g., fitness-based models [20]). Among all possible ranking techniques, network-based ranking algorithms play a prominent role [21,22]. This class of algorithms takes as input a network representation of a complex system and assigns each node a score meant to quantify its structural importance (or centrality) in the network. These algorithms can solve important problems related to the structure and dynamics of complex systems. For example, building on optimal percolation theory, the collective influence algorithm finds the minimal set of nodes that, when removed, cause the collapse of the giant component of the network, which has useful implications for information spreading, marketing campaigns, and immunization interventions [23]. Under some assumptions about the topology of the network and the spreading dynamics, the nonbacktracking centrality quantifies the average size of the spreading processes initiated by a node [24], which might be informative for campaigns aimed to maximize the reach of information spreading.
In this Perspective, we argue that although the solution to these problems indicates the importance and usefulness of network-based ranking algorithms, there exist three fundamental challenges faced by researchers interested in the development and application of network-based ranking algorithms for social systems: (1) Bias. Rankings can be biased by various factors. For example, rankings of scientific papers based on citation networks can be biased by the papers' publication date, scientific field, and type of article. (2) Performance variability. A ranking algorithm's good performance in a given problem does not guarantee good performance in another problem (cross-problem performance variability). Even when considering a particular problem, how the problem is specified can radically impact performance (within-problem performance variability). (3) Systemic consequences. In social systems, ranking algorithms influence the behavior of the agents, which results in potentially vicious feedback mechanisms and unhealthy systemic consequences.
The main goal of this Perspective is to provide explicit examples of these three threats to the design and application of ranking algorithms, and to suggest possible ways to overcome them. We anticipate one of the main takeaways of our Perspective: To properly counteract potential biases, performance limits and unhealthy systemic consequences of ranking algorithms, it is vital to understand the mechanisms behind the emergence of these three challenges. This can be achieved by leveraging models that describe how a social system evolves over time, including models of growing social and information networks [25] and agent-based models where the components of the system behave and interact according to a set of rules [26].

II. NETWORK-BASED RANKING ALGORITHMS
In this Section, we provide a brief overview of network-based ranking algorithms [21,22]. For a simple network representation where nodes are connected by links, counting each node's number of connections provides us with the degree centrality, arguably the simplest proxy for the node's centrality [22]. Starting from the degree, one can gradually incorporate higher-order network information by iteratively applying an operator to the vector of nodes' scores, which leads to the H-index and k-core centrality (or coreness) [27]. The H-index is an example of local centrality, meaning that the score of each node can be computed using only information from the node's neighborhood. Other prominent examples of local centralities include the aforementioned collective influence metric [23], and the LocalRank centrality, a heuristic metric to identify influential spreaders of information [28].
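The iterative construction mentioned above can be made concrete in a few lines. The following sketch (pure Python; the adjacency-dict representation and the function names are our own illustrative choices) applies the H operator to the vector of node degrees and iterates until a fixed point, which yields the k-core centrality, as shown in [27]:

```python
def h_operator(values):
    """H-index of a list of numbers: the largest h such that
    at least h of the values are >= h."""
    vals = sorted(values, reverse=True)
    h = 0
    for i, v in enumerate(vals, start=1):
        if v >= i:
            h = i
    return h

def iterated_h_centrality(adj):
    """Iterate the H operator starting from the degrees: the sequence
    degree, H-index, H(H-index), ... converges to each node's coreness."""
    scores = {u: len(neighbors) for u, neighbors in adj.items()}  # degree centrality
    while True:
        new = {u: h_operator([scores[v] for v in adj[u]]) for u in adj}
        if new == scores:  # fixed point reached: these are the coreness values
            return new
        scores = new

# Toy example: a triangle (nodes 0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
coreness = iterated_h_centrality(adj)
```

On this toy graph, the triangle nodes end up with coreness 2 and the pendant node with coreness 1, matching a direct k-core decomposition.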
The eigenvector, Katz, PageRank, LeaderRank, and nonbacktracking centralities are all examples of global ranking algorithms based on the solution of an eigenvalue problem, and most of these "spectral ranking" techniques can also be interpreted as weighted combinations of network paths [3,22]. Another important class of algorithms builds on pairwise network distances. The average distance of a node from the other nodes in the network can be interpreted as its centrality. Different choices of the distance function lead to different centrality metrics: for example, the shortest-path distance leads to the classical closeness centrality, whereas the random-walk effective distance leads to the ViralRank centrality [29]. Another classical algorithm is betweenness centrality, which, together with its variants, builds on the assumption that a node i is central if many shortest paths connecting pairs of nodes pass through i [30].
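As a minimal illustration of a distance-based centrality (pure Python; the breadth-first-search implementation and the toy path graph are our own choices), closeness can be computed as the inverse of the mean shortest-path distance to the other reachable nodes:

```python
from collections import deque

def closeness(adj, u):
    """Closeness centrality of node u on an unweighted graph:
    inverse of the mean shortest-path distance from u to the
    other nodes it can reach."""
    dist = {u: 0}
    queue = deque([u])
    while queue:  # breadth-first search from u
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    others = [d for node, d in dist.items() if node != u]
    return len(others) / sum(others) if others else 0.0

# Path graph 0 - 1 - 2: the middle node is the closest to all the others.
adj = {0: [1], 1: [0, 2], 2: [1]}
```

On the path graph, the middle node attains the maximal closeness of 1, as every other node sits at distance one from it.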
Finally, an important class of network-based ranking algorithms aims to order the nodes of a system based on the results of (potentially incomplete or noisy) observed sets of pairwise interactions [31][32][33][34][35]. These algorithms seek to find a permutation of the nodes that optimizes a predefined cost function [34]. From this perspective, the edges of a network are interpreted as the consequence of hidden hierarchies and, therefore, one can attempt to infer the underlying hierarchies from the observed outcomes. This family of algorithms typically applies to directed weighted networks, with a broad spectrum of potential applications, ranging from inferring a ranking of sport players/teams in tournaments from the results of their matches [32], to inferring a ranking of universities from the mobility of Ph.D. graduates [34]. Different cost functions have been studied in the literature, leading to various algorithms, e.g., SerialRank [32], SyncRank [33], and SpringRank [34].
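The optimization viewpoint can be illustrated with the simplest such cost function, the number of "violations": directed edges that point from a lower-ranked node to a higher-ranked one. The brute-force sketch below (pure Python; the function names and the mini-tournament are our own illustrative choices) is only feasible for a handful of nodes; algorithms such as SpringRank optimize related cost functions efficiently at scale.

```python
from itertools import permutations

def min_violation_ranking(nodes, edges):
    """Exhaustively search for the node ordering that minimizes the
    number of upsets: edges (winner, loser) in which the loser is
    ranked above the winner. Only viable for very small systems."""
    def n_violations(order):
        position = {u: i for i, u in enumerate(order)}
        return sum(1 for winner, loser in edges
                   if position[loser] < position[winner])
    return min(permutations(nodes), key=n_violations)

# A transitive mini-tournament: A beats B and C, and B beats C.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
ranking = min_violation_ranking(["A", "B", "C"], edges)
```

Here the recovered ordering is (A, B, C), the unique permutation with zero violations.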
We refer the interested reader to [21,22] for detailed reviews on ranking in complex networks, including formulations of centrality for more complex network representations (like weighted networks, higher-order networks, multilayer networks). Crucially, each centrality metric builds on a specific assumption about what it means for a node to be important. While the algorithms' assumptions and their translations into mathematical equations are generally plausible, important challenges arise when applying the resulting ranking algorithm to specific problems, which is the focus of the following sections.
FIG. 1. Bias: the case of PageRank's age bias. We focus on growing directed networks of 10,000 nodes where each node can receive incoming edges or create outgoing edges. Nodes are labeled by entrance order (i ∈ {1, ..., 10,000}, where nodes 1 and 10,000 are the first and last to enter the network, respectively). The two panels show the average entrance order of the top 100 nodes in the ranking by PageRank, τ100, in networks generated with two different models. Both models are characterized by two main control parameters: the timescale of relevance decay, ΘR (i.e., the decay of a node's likelihood to attract incoming edges), and of activity decay, ΘA (i.e., the decay of a node's likelihood to create outgoing edges). The ranking by PageRank is (nearly) unbiased only over a narrow region of the parameter space where τ100 ≈ 5,000 = N/2. When the relevance decay is slower than the activity decay (ΘR ≫ ΘA), PageRank tends to be biased toward old nodes (τ100 < N/2). A bias toward recent nodes (τ100 > N/2) emerges instead when the relevance decay is faster than the activity decay (ΘR ≪ ΘA). Reprinted from [36].

III. BIAS
The main hope motivating the use of algorithms for ranking and prediction tasks is that they might provide an objective estimation of the value of an agent (whether the quality of a cultural product, the talent of an individual, or the relevance of a webpage), whereas human or expert judgment might be subjective and influenced by biases and social factors. Besides, algorithms can crunch massive datasets, allowing for the fast retrieval of relevant information. However, blindly accepting the outcomes of a ranking algorithm is potentially misleading, and we argue here that an appropriate dose of caution is necessary when interpreting the results of a given algorithm as a signal of quality or talent. An important shortcoming of ranking algorithms is that, much like machine-learning algorithms [37], they can be biased by multiple confounding factors.
Consider the popular Google PageRank algorithm, which builds on a diffusion process on a directed network where each node can have incoming or outgoing edges. In a citation network, for example, a scientific paper's outgoing and incoming edges represent its references to other papers and its received citations from other papers, respectively. PageRank assigns to each node a score equal to the stationary probability of a stochastic process that combines a random walk along the network's edges with a random teleportation mechanism [9]. The algorithm reflects the plausible assumption that a node is important if other important nodes endorse it -e.g., in a citation network, a paper is important if other important papers cite it. However, consider now the first direct experimental observation of gravitational waves and the Physical Review Letters paper that presented this discovery [38], published in February 2016. Were we to blindly trust Google's PageRank, at the end of 2016, we would conclude that among all the papers in the APS corpus, this letter deserves the 12,482nd place by importance. Instead, even though it is impossible to establish the exact ranking position deserved by the paper, virtually nobody would doubt that the paper's finding constitutes one of the main milestones in the history of physics. The crux of the problem lies in a fundamental bias: to a first approximation, the PageRank score of a paper is expected to be linearly correlated with its number of received citations [39] and, therefore, it strongly favors old nodes over recent ones [36].
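The stochastic process underlying PageRank can be sketched by power iteration (pure Python; the damping value, the toy citation graph, and the treatment of dangling nodes — teleporting their mass uniformly — are standard choices, but the details here are our own illustrative assumptions):

```python
def pagerank(adj, d=0.85, tol=1e-10):
    """PageRank by power iteration on a directed graph given as
    {node: list of out-neighbors}. With probability d the random
    walker follows an outgoing edge; otherwise it teleports to a
    uniformly random node. The mass sitting on dangling nodes
    (no out-links) is teleported as well."""
    nodes = list(adj)
    n = len(nodes)
    p = {u: 1.0 / n for u in nodes}
    while True:
        dangling = sum(p[u] for u in nodes if not adj[u])
        new = {u: (1 - d) / n + d * dangling / n for u in nodes}
        for u in nodes:
            for v in adj[u]:
                new[v] += d * p[u] / len(adj[u])
        if max(abs(new[u] - p[u]) for u in nodes) < tol:
            return new
        p = new

# Toy citation network: papers B, C, D all cite A; A cites nothing (dangling).
adj = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"]}
scores = pagerank(adj)
```

Paper A, endorsed by all the others, collects the highest score, and the scores sum to one, as befits a stationary probability distribution.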
Stochastic models of network growth help us to understand why the biases of ranking algorithms emerge [36]. One can indeed grow synthetic directed networks where the probability that a node receives new incoming edges is proportional to its number of previous links (preferential attachment), to its intrinsic fitness, and to a time-decaying function that represents the node's relevance decay over time [25]. Besides, existing nodes can activate to create outgoing edges at a rate that decays with their age [36]. This framework reveals that the interplay between the timescales of the two time-decay processes leads to PageRank's temporal bias (Fig. 1): when the aging of relevance is slower (faster) than the decay of activity, most directed links point from a node to an older (more recent) node, causing the diffusion process that determines the PageRank scores to favor old (recent) nodes [36].
While based on a simple model, this approach reveals that in growing networks, a large part of the variation in PageRank score is due to temporal effects. This suggests that PageRank's temporal bias can be removed by rescaling the scores with a transformation that ensures that the average score of the nodes and its standard deviation are independent of node age [40]. When such a transformation is applied, the resulting "rescaled" score can detect important nodes much earlier, with useful implications for the early detection of milestone papers [40,41], patents [41,42], and movies [43]. The benefit of this procedure is exemplified, again, by the paper that reported the first direct observation of gravitational waves [38]: the paper is ranked 16th by rescaled PageRank at the end of 2016, which constitutes a substantial improvement compared to the 12,482nd position by the original PageRank, and suggests that the paper deserves a place among the most significant ones in the APS corpus.
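A minimal sketch of such an age rescaling (pure Python; the window size, the moving-window boundary handling, and the toy score sequence are our own illustrative assumptions): each node's score is standardized against the scores of the nodes that entered the system around the same time.

```python
from statistics import mean, stdev

def rescale_by_age(scores, window=5):
    """scores is ordered from the oldest to the newest node. Each score
    is converted into a z-score relative to a moving window of nodes of
    similar age, so that the mean and standard deviation of the rescaled
    scores no longer depend on node age."""
    n, half = len(scores), window // 2
    rescaled = []
    for i in range(n):
        lo = max(0, min(i - half, n - window))  # clamp the window at the boundaries
        peers = scores[lo:lo + window]
        mu, sigma = mean(peers), stdev(peers)
        rescaled.append((scores[i] - mu) / sigma if sigma > 0 else 0.0)
    return rescaled

# Raw scores decay strongly with recency (index 0 = oldest node);
# the node at index 7 is an outlier that outperforms its age peers.
raw = [100, 90, 80, 70, 60, 50, 40, 90, 20, 10]
rescaled = rescale_by_age(raw)
```

The raw ranking is topped by the oldest node, whereas the rescaled ranking promotes the recent outlier to first place, which is precisely the effect exploited for the early detection of milestone papers.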
While we have focused on ranking algorithms' bias by node age, biases can emerge along multiple dimensions. For example, citation-based rankings of papers' scientific impact are biased by paper age and scientific field [44,45], which motivates the question of whether both biases can be removed by a simple rescaling procedure [46]. Existing results indicate that while simple rescaling procedures can dramatically mitigate the biases of a metric, it remains challenging to achieve a ranking that is statistically consistent with a ranking that is unbiased by construction, obtained by uniformly sampling the top papers from different age and field groups [46]. Similar age and field biases, and attempts to mitigate them, exist in rankings of researchers [47].
Subtler forms of bias can exist. In science, the number of citations received by a paper might not simply result from the intrinsic importance of the paper, but heavily depend on social mechanisms. Social influence and reputation factors that may affect a paper's impact metrics include the authors' previous number of citations [48] and centrality in the co-authorship network [49], and reciprocity effects [50]. Therefore, social mechanisms might affect rankings by impact and popularity, potentially creating a mismatch between rankings and the intrinsic value of the papers. Given the importance of impact indicators for academic careers, additional research is needed to understand whether it is necessary to factor out social effects from these metrics, and, if so, how to do it.

IV. PERFORMANCE VARIABILITY
How should one evaluate the performance of a ranking algorithm? In network science, ranking algorithms are routinely evaluated according to their ability to single out vital nodes for the system's structure and dynamics. From a structural perspective, rankings can enable the identification of a set of nodes that maximize a function of the structure of the network [21], e.g., the minimal set of structural influencers whose removal causes the collapse of the network's giant component [23]. From a dynamical perspective, rankings can enable the identification of a set of nodes that maximize a function of the structure and dynamics of the network (functional influence maximization [21]) -e.g., a set of influential spreaders that, when activated, trigger the largest cascade processes under a predefined spreading model [51]. The importance of a set of nodes for dynamical processes on networks can also be assessed for synchronization processes, by quantifying the ability of a set of nodes to drive the system from an initial to a desired state in a short time [52].
One can also evaluate ranking algorithms based on their performance in various kinds of predictive problems. In the machine-learning literature, an important stream of works [32-35] has assessed the rankings' ability to predict the outcomes of pairwise interactions. Such predictive power can be tested in empirical data on sport tournament results, animal dominance, and faculty hiring, among others [34]. In science of science, rankings can be evaluated according to their ability to detect, at an early stage, small sets of scientific papers or patents that have been labeled by experts as groundbreaking or seminal [40,42].
A key challenge is that the performance of an algorithm in one of these problems does not predict its performance in another one (cross-problem variability). For example, by comparing 17 network-based ranking algorithms, a recent study [41] found that time-rescaled versions of PageRank and its variant LeaderRank [11] are the best-performing algorithms in the identification of expert-selected seminal papers and patents. PageRank is also effective in identifying influential researchers [53]. However, PageRank performs poorly in other problems. Other centrality metrics -like the degree centrality, the k-core centrality [51], and ViralRank [29] -substantially outperform PageRank in finding the most influential spreaders in a network [29]. Building on optimal percolation theory, the collective influence metric significantly outperforms PageRank in detecting the structural influencers [23].
But even when considering a single problem, how the problem is specified can substantially alter the results (within-problem variability). For example, in the problem of identifying the influential spreaders based on the SIR model, the algorithms' performance varies widely with the transmission probability, a key parameter of the spreading model [29] (Fig. 2), with the type of dynamics under examination (e.g., contact vs. reaction-diffusion dynamics) [29], and with the timescale at which the dynamics is observed [54]. The main takeaway from these studies is that there might exist no all-weather algorithm. The good performance of an algorithm in a given problem does not guarantee good performance in another problem, and even for the same problem, different problem specifications might require different algorithms.

FIG. 2.
Within-problem performance variability: the case of influential spreaders. The performance of six different ranking algorithms (ViralRank, nonbacktracking centrality, random-walk accessibility, k-core centrality, degree, and LocalRank; see [29] for the definitions) is plotted against the SIR spreading model's transmission probability, β, for six empirical datasets (see [29] for details). The performance of a metric is defined as the linear correlation between the node scores it produces and the nodes' spreading ability. The spreading ability of a given node is defined as the average size of the spreading processes initiated by that node, and it is measured from the results of numerical simulations with the SIR model with a predefined transmission rate β. The results show that the algorithms' performance is highly dependent on β, meaning that a metric that performs well around the critical point of the model (β/βc ≈ 1) can perform poorly well above the critical point (β ≫ βc). Reprinted from [29].
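The spreading ability used as ground truth in this kind of evaluation can be estimated by Monte Carlo simulation. A minimal sketch (pure Python; the discrete-time SIR variant with one-step recovery, the star graph, and all parameter values are our own illustrative assumptions):

```python
import random

def sir_outbreak_size(adj, seed, beta, rng):
    """One discrete-time SIR run: each infected node transmits to each
    susceptible neighbor with probability beta, then recovers."""
    infected, recovered = {seed}, set()
    while infected:
        newly = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and v not in recovered and rng.random() < beta:
                    newly.add(v)
        recovered |= infected
        infected = newly
    return len(recovered)

def spreading_ability(adj, seed, beta, runs=500):
    """Average final outbreak size over many runs seeded at one node."""
    rng = random.Random(0)  # fixed seed for reproducibility
    return sum(sir_outbreak_size(adj, seed, beta, rng) for _ in range(runs)) / runs

# Star graph: hub 0 linked to leaves 1..10.
adj = {0: list(range(1, 11)), **{i: [0] for i in range(1, 11)}}
hub_ability = spreading_ability(adj, seed=0, beta=0.5)
leaf_ability = spreading_ability(adj, seed=1, beta=0.5)
```

As expected, the hub initiates larger average outbreaks than a leaf; repeating the exercise across values of β reproduces the kind of parameter dependence shown in the figure.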
This uncertainty highlights the importance of clearly reporting the scenarios where an algorithm is effective and those where it fails. In the problem of identifying the influential spreaders under a given spreading model, this can be achieved by reporting the algorithms' performance over a broad range of values of the model parameters, and by understanding which parameter ranges are the most relevant ones for various real-world processes.
The choice of the performance evaluation metric is critical as well. In the problem of identifying expert-selected important nodes (papers or patents) in science and technology, the age distribution of the expert-selected nodes can significantly impact the algorithms' performance: if the expert-selected nodes are old ones, performance evaluation metrics that ignore this bias will favor ranking algorithms that are biased in favor of old nodes [40,41,55]. "Corrected" performance evaluation metrics that penalize biased metrics are not affected by this confounding effect [41]. However, there is not yet a unique and universally agreed way to evaluate ranking algorithms for scientific and technological impact. Given the role played by impact metrics in research evaluations [56], we need a deeper understanding of their biases and their ability to quantify productivity, talent, and impact. Toward this direction, the introduction of a "gold standard" for the evaluation of impact metrics for academic actors at various levels -from researchers to departments and universities -is highly desirable.

V. SYSTEMIC CONSEQUENCES
A focus on bias removal and performance alone might miss an important property that makes social systems fundamentally different from physical systems composed of atoms and molecules. Rankings can indeed alter how the members of a social system behave: as ranking algorithms are adopted and used in a social system, they can influence the behavior of the agents, which in turn influences the rankings themselves. For example, experimental studies on cultural markets found that when the members of a cultural market are aware of the rankings by popularity of cultural products, the system exhibits wider popularity inequalities compared to conditions where the members are unaware of rankings [58]. Can we predict how the adoption of an algorithm in a given system will alter the agents' behavior and further influence the evolution of the system? Most studies that aimed to answer this question have focused on agent-based models and network formation models [57,59-62]. Below, we outline three insights that can be gained from stochastic models of network formation.

FIG. 3. Systemic consequences: the case of ranking-driven network growth [57]. Control parameter β determines the relative importance of ranking and fitness, whereas parameter α determines how sensitive the agents are to the other nodes' ranking position. When the nodes follow the ranking by age-rescaled popularity, R(k), the correlation between node popularity and intrinsic fitness, r(k, q), is significantly larger than when the nodes follow the ranking by popularity, k (top panels). A qualitatively similar result holds for the overlap between the top-100 nodes by popularity and quality, P100(k, q) (bottom panels). Surprisingly, when agents are highly sensitive to ranking (α > 1, roughly), all popularity-based metrics lead to networks with a lower popularity-quality correlation than networks where agents follow a random ranking (black lines). Adapted from [57].

First, in a social network, when agents strive to connect to high-ranked agents (i.e., central agents) and delete their links to low-ranked agents (i.e., peripheral agents), highly hierarchical societies emerge, with reduced social mobility [59,63-65]. From a statistical physics perspective, one can formulate a growing network model where, at each step, a selected agent (with probability α) creates a connection to the most central agent they are not connected to, or (with probability 1 − α) deletes its connection to the least central agent among its neighbors. The model features a phase transition at a critical value α = αc that separates a phase where the resulting network is fully connected (α > αc) from a phase where the network exhibits a highly centralized topology (α < αc), namely a perfectly nested one where agents' interactions are hierarchically arranged [65]. Importantly, this result holds regardless of the algorithm employed by the agents to assess the centrality of the other agents [63]. Results on systems composed of a constant number of agents indicate that highly hierarchical structures are associated with reduced social mobility, meaning that it is harder for agents to improve their ranking in society [59].
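A minimal sketch of this class of dynamics (pure Python; degree plays the role of the centrality metric — the result above holds for other choices as well — and the ring initialization, parameter values, and function names are our own illustrative assumptions):

```python
import random

def centrality_driven_dynamics(n, alpha, steps, rng):
    """At each step a randomly selected agent either (prob. alpha) links
    to the most central agent it is not yet connected to, or
    (prob. 1 - alpha) deletes the link to its least central neighbor.
    Degree is used here as the centrality metric."""
    adj = {u: set() for u in range(n)}
    for u in range(n):  # start from a ring so every agent has neighbors
        adj[u].add((u + 1) % n)
        adj[(u + 1) % n].add(u)
    for _ in range(steps):
        u = rng.randrange(n)
        if rng.random() < alpha:
            candidates = [v for v in range(n) if v != u and v not in adj[u]]
            if candidates:
                v = max(candidates, key=lambda w: len(adj[w]))  # most central non-neighbor
                adj[u].add(v)
                adj[v].add(u)
        elif adj[u]:
            v = min(adj[u], key=lambda w: len(adj[w]))  # least central neighbor
            adj[u].discard(v)
            adj[v].discard(u)
    return adj

rng = random.Random(1)
final = centrality_driven_dynamics(n=8, alpha=1.0, steps=2000, rng=rng)
```

In the pure link-creation regime (α = 1 > αc), the dynamics ends in the fully connected phase, as the model predicts; lowering α below the critical point instead produces the centralized, nested phase.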
Second, in a growing network, if the new nodes choose their connections driven by a ranking metric that is biased by node age, the resulting system will exhibit an uneven popularity distribution, and the overall correlation between talent and success will be low [57]. This property can be observed in a growing network model where, when establishing new links, a node is (with probability β) driven by the ranking position of the preexisting nodes, or (with probability 1 − β) by their intrinsic fitness -a parameter that quantifies the inherent appeal of the node. When the node chooses by ranking, the probability that it chooses node j decays with the ranking position r_j of the node as r_j^(-α), where α and β are control parameters of the model. The model can be used to grow synthetic networks under the influence of different ranking algorithms and thus, to compare the long-term implications of the ranking by a metric that is biased by node age (e.g., the total number of links received by the node) against those by a metric that is not biased (e.g., a time-rescaled link count). Numerical simulations indicate that when the nodes follow the unbiased metric, the overall correlation between the final popularity of the nodes and their intrinsic fitness is significantly higher (Fig. 3), and popularity is more evenly distributed [57].
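A simplified variant of such a growth model can be sketched as follows (pure Python; here the comparison is between a purely ranking-driven and a purely fitness-driven regime, the ranking metric is the raw, age-biased degree, and the network sizes, exponents, and function names are our own illustrative assumptions):

```python
import random

def grow_network(n, beta, alpha, seed):
    """Each new node t creates one link to an earlier node: with
    probability beta the target is drawn by ranking position, with
    P(rank r) proportional to r**(-alpha) (ranking by degree, biased
    toward old nodes); otherwise the target is drawn proportionally
    to its intrinsic fitness."""
    rng = random.Random(seed)
    fitness = [rng.random() for _ in range(n)]
    degree = [0] * n
    for t in range(1, n):
        existing = list(range(t))
        if rng.random() < beta:
            ranked = sorted(existing, key=lambda u: -degree[u])
            weights = [(r + 1) ** (-alpha) for r in range(t)]
            target = rng.choices(ranked, weights=weights)[0]
        else:
            target = rng.choices(existing, weights=[fitness[u] for u in existing])[0]
        degree[target] += 1
        degree[t] += 1
    return fitness, degree

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Average the popularity-fitness correlation over a few realizations.
corr_fitness = sum(pearson(*grow_network(400, beta=0.0, alpha=2.0, seed=s))
                   for s in range(5)) / 5
corr_ranking = sum(pearson(*grow_network(400, beta=1.0, alpha=2.0, seed=s))
                   for s in range(5)) / 5
```

In this stylized setting, the fitness-driven regime yields a markedly higher correlation between final popularity and intrinsic fitness than the ranking-driven regime, in which the earliest nodes lock in the top positions and popularity condenses onto a few nodes.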
Third, if the agents attempt to mimic the strategies of the highest-ranked individuals, the overall welfare of the society increases, but so do success inequalities [61]. This has been demonstrated with a model where a given agent's benefit from a given action depends on both the intrinsic payoff associated with the action and her ability to produce a successful outcome from it. Each agent starts with a random action, and at each time step of the dynamics, each agent has the opportunity to change action. With a probability q, the agent copies the action of a randomly-selected agent among those ranked better than her (imitation), whereas with a probability 1 − q, she selects a random action (serendipity). The control parameter q ∈ [0, 1] determines the relative importance of imitation and serendipity for agents' choices. As q increases, the total welfare of the system increases, but so do success inequalities, whereas the correlation between agents' success and talent decreases [61].
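This mechanism can be sketched with a stylized implementation (pure Python; the payoff structure — success defined as skill times the payoff of the current action, with an action's payoff set equal to its value — and all parameter values are our own illustrative assumptions):

```python
import random

def final_welfare(n, q, steps, seed):
    """Agents repeatedly revise their actions: with probability q they
    copy the action of a randomly chosen better-ranked agent (imitation),
    otherwise they draw a fresh random action (serendipity). Success of
    agent i = skill_i * payoff(action_i); total welfare is the sum of
    all successes at the end of the dynamics."""
    rng = random.Random(seed)
    skill = [rng.random() for _ in range(n)]
    action = [rng.random() for _ in range(n)]  # an action's payoff is its value
    for _ in range(steps):
        success = [skill[i] * action[i] for i in range(n)]
        order = sorted(range(n), key=lambda i: -success[i])
        rank = {agent: pos for pos, agent in enumerate(order)}
        new_action = list(action)
        for i in range(n):
            if rng.random() < q and rank[i] > 0:
                better = order[rng.randrange(rank[i])]  # imitate someone ranked above
                new_action[i] = action[better]
            else:
                new_action[i] = rng.random()  # serendipitous exploration
        action = new_action
    return sum(skill[i] * action[i] for i in range(n))

low_q = final_welfare(n=200, q=0.1, steps=60, seed=0)
high_q = final_welfare(n=200, q=0.9, steps=60, seed=0)
```

Consistently with the result of [61], stronger imitation raises total welfare in this sketch; inequality and the success-talent correlation can be probed on the same simulation output.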
Taken together, these studies indicate that agent-based models can shed light on the possible systemic consequences of ranking algorithms on a given system. In particular, agent-level actions motivated by the results of ranking algorithms can result in unhealthy systemic consequences, like reduced social mobility [59] and low correlation between success and merit [57,61]. In future research, we expect investigations of more complex models to deepen our understanding of the impact of rankings on the structure and stability of social systems. Beyond extensive explorations of various theoretical scenarios, we also expect future research to put more emphasis on the calibration of the models on empirical data: once the agents' behavior is understood through laboratory experiments or observational data analysis, the systemic consequences of the observed micro-level behavior can be grasped through agent-based models calibrated on empirical behavior [66,67].

VI. CONCLUSION
The promise that algorithms deliver objective estimations of quality, talent, and importance is challenged by the algorithms' potential biases, performance variability, and systemic consequences. We argue that these challenges need to be carefully examined by researchers who aim to develop new ranking algorithms, researchers interested in applying an existing ranking algorithm to answer a given research question, and by policymakers who aim to apply quantitative methods to assess an individual's or organization's talent and their likelihood of future success. Although we have focused on network-based ranking algorithms, these challenges potentially apply to any ranking method for a social system. This is exemplified by recent cross-disciplinary efforts to understand, quantify, and mitigate the bias of machine-learning algorithms employed by governments and organizations [37], and to predict the long-term consequences of such biases through computer simulations [62].
We have focused on these three challenges because of their potential interest to the physics and complexity science community. Of course, other challenges exist and need attention from the scientific community, including the algorithms' computational complexity, the dependence of their outcomes (and biases) on data quality and completeness, ethical concerns related to their real-world application, and their robustness against malicious attacks and manipulation, among others. We hope that this Perspective will increase the awareness of the potential limitations of ranking algorithms and inspire studies aimed to overcome them.