Corrigendum: Identifying a set of influential spreaders in complex networks

Scientific Reports 6: Article number: 27823; published online: 14 June 2016; updated: 25 August 2016 This Article contains errors in the Acknowledgements section. “This work is partially supported by the National Natural Science Foundation of China under Grant Nos 61433014 and 61300018, by the National High Technology Research and Development Program under Grant No.


Introduction
In the real world, many complex systems can be represented as complex networks [1,2,3,4,5,6]. Many activities, such as advertising over media and word-of-mouth on social networks, can be described as information spreading on complex networks [5,7,8,9,10,11]. Maximizing the scale of spreading is a common target. If a marketing manager wants to advertise a new product on Twitter.com, she or he tries to choose a small number of users and provide them with free products in exchange for posting tweets about the product, so as to influence their friends to buy it. The manager's task is therefore to choose a few users such that the product information reaches more users and, finally, more products are sold. With the topology unchanged or changed only slightly, the location of the source spreaders determines the final scale of spreading to a large degree. The problem of choosing initial nodes as source spreaders to achieve the maximum scale of spreading is defined as the influence maximization problem [12]. In this report, our research focuses on the strategy for choosing a set of critical nodes as source spreaders.
As influential nodes have a strong ability to affect other nodes, selecting the top-ranked influential nodes as source spreaders is a common and classical strategy. Up to now, many ranking methods have been proposed, such as degree, closeness [13] and betweenness [14] centralities, and other heuristic algorithms [15,16,17,18]. Random-walk based methods such as the well-known PageRank [19] and LeaderRank [20] have received great attention and shown significant value in the last few years. Pei et al. [21] presented a direct method to search for influential spreaders by following the real spreading dynamics in a wide range of networks. Some other methods such as HITS [22] and TwitterRank [23] are also useful and effective. Recently, a local method, ClusterRank [24], has also shown good performance in some cases. Ref. [25] shows that the crucial factor in a node's influence is its location in the network, as measured by its k-shell value. Under this measure, nodes with larger k-shell values usually have a greater spreading ability. Wei et al. [26] proposed a weighted k-shell decomposition to identify influential nodes. Liu et al. [27] introduced a measure based on the link diversity of shells to distinguish the true core from core-like groups, so as to find the real influential spreaders. Based on Ref. [27], Liu et al. [28] proposed an improved k-shell method that removes redundant links. Lü et al. [29] unveiled the elegant mathematical relationship among three simple yet important centrality measures of networks, i.e., degree, H-index and coreness. Ref. [25] indicates that the top-ranked nodes obtained by k-shell decomposition are individually of significant influence. However, if they are selected as a group of spreaders, the result is not as good, and is even worse than that of pure degree centrality. Like the k-shell method, other ranking methods such as closeness, PageRank, LeaderRank and ClusterRank suffer from a similar limitation.
Kempe et al. [12] proposed a hill-climbing based greedy algorithm that can find a group of important nodes affecting the widest scope of nodes. Their work overcomes the shortcoming mentioned above. However, it is very time consuming, especially in large scale networks. Based on the greedy strategy, Narayanam and Narahari [30] proposed the SPIN algorithm, which is much faster than the greedy algorithm at the cost of a small decrease in quality. Unfortunately, SPIN is also hard to apply to large scale networks. For example, its CPU running time is 28.25 minutes to find the top-30 important nodes in a network with 1589 nodes. For this reason, some fast heuristic algorithms have been presented in recent years. Chen et al. [31] proposed the degree discount heuristic algorithm, which nearly matches the performance of the greedy methods for the IC model. Tang et al. [32] presented a Two-phase Influence Maximization (TIM) algorithm that aims to bridge theory and practice in influence maximization. In theory, TIM runs in $O((r + \ell)(n + m)\log n / \varepsilon^2)$ expected time and returns a $(1 - 1/e - \varepsilon)$-approximate solution with probability at least $1 - n^{-\ell}$, where $\ell$ and $\varepsilon$ are parameters. Zhao et al. [33] made an attempt to find a set of important spreaders by generalizing the idea of the coloring problem in graph theory [34] to complex networks. Ji et al. [35] proposed an effective method for identifying multiple leaders based on percolation theory. The method exploits the similarities between the pre-percolated state and the average of information propagation in each social cluster to obtain a set of distributed and coordinated spreaders. Very recently, Morone and Makse presented an effective method to find a set of critical nodes by mapping the problem onto optimal percolation in random networks [36]. He et al. [37] proposed a novel method to identify multiple spreaders in complex networks with community structures.
In this report, we propose a simple yet effective iterative method named VoteRank to choose a set of influential spreaders. In our method, influential spreaders are elected one by one according to the voting scores they obtain from their neighbors. At each iteration, the voting ability of the elected spreader is set to zero, while that of its neighbors is decreased by a factor. Our method can be applied to large scale networks with millions of nodes, since it only updates local information after selecting a spreader. Experimental results on real datasets show that our method outperforms traditional methods on both final affected scale and spreading rate. Moreover, VoteRank is also superior to other group-spreader identification methods in computational time.

Spreading Models
In this report, we mainly use the SIR epidemic model with limited contact [38,39] to evaluate methods. In the SIR model, each node is in one of three statuses, i.e., Susceptible (S), Infected (I) and Recovered (R). Initially, all nodes are susceptible except for a set of r infected nodes selected as source spreaders. At each time step, each infected node tries to infect one of its neighbors with probability µ. At the same time, each infected node recovers with probability β; once recovered, it cannot be infected again and no longer infects other susceptible nodes. The process terminates when there is no infected node left in the network. In this report, we use λ = µ/β to represent the infected rate, which is crucial to the infection speed and the final affected scale, the two quantities that are often used to indicate the spreading ability of the r source spreaders. Besides the SIR model with limited contact, the performance of methods can also be evaluated with the SIR model with full contact and with the SI model [40], which is usually used to evaluate the spreading rate of a method, especially in the early stage.
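The limited-contact SIR process described above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulation code used for the experiments: the function name `sir_limited_contact`, the adjacency format (a dict mapping each node to a list of neighbors) and the synchronous update order are assumptions of this sketch. It records the fraction of nodes that are infected or recovered after every time step.

```python
import random

def sir_limited_contact(adj, sources, mu, beta, rng=None):
    """Simulate SIR with limited contact; return the affected fraction per step.

    adj: dict node -> list of neighbours (hypothetical input format).
    Assumption of this sketch: all infections and recoveries of a time step
    are decided first and applied together at the end of the step.
    """
    rng = rng or random.Random()
    n = len(adj)
    state = {v: "S" for v in adj}
    for s in sources:
        state[s] = "I"
    infected = set(sources)
    recovered_count = 0
    history = []
    while infected:
        newly_infected, newly_recovered = set(), set()
        for u in infected:
            # Limited contact: u contacts exactly one randomly chosen neighbour.
            if adj[u]:
                v = rng.choice(adj[u])
                if state[v] == "S" and rng.random() < mu:
                    newly_infected.add(v)
            # Each infected node recovers with probability beta.
            if rng.random() < beta:
                newly_recovered.add(u)
        for v in newly_infected:
            state[v] = "I"
        for u in newly_recovered:
            state[u] = "R"
        infected = (infected | newly_infected) - newly_recovered
        recovered_count += len(newly_recovered)
        # Fraction of nodes that are currently infected or already recovered.
        history.append((len(infected) + recovered_count) / n)
    return history
```

The last entry of the returned list is the final affected fraction of one run; in practice one would average it over many independent runs for a given λ = µ/β.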

VoteRank Algorithm
In the real world, if a person A has supported a person B, A's strength of support for others generally fades. From this perspective, a vote based approach for identifying influential spreaders, named VoteRank, is presented in this report. The main idea of VoteRank is to choose a set of spreaders one by one according to the voting scores that nodes obtain from their neighbors. If we need to select the top-r influential spreaders, every node has to vote for r turns. The node that gets the most votes in each turn is regarded as the most influential node in that turn and is elected as one of the top-r influential spreaders. Once a node has been elected as a spreader, it does not participate in subsequent voting, and the voting ability of its neighbors is also decreased. In fact, once a node u has been elected as a spreader, electing nodes near u as additional spreaders increases the propagation range only a little, since u can already transfer information to these nodes. So it is better to select nodes that are far apart, because together they can affect as many nodes as possible. That is to say, after a node is elected as a spreader, the selection probability of its neighbors and of its neighbors' neighbors decreases. Under this mechanism, the selected nodes are far apart and each is important in its local structure. A similar idea has been reported in the literature. For example, Kitsak et al. [25] pointed out that the propagation range is improved greatly if no two selected spreaders are directly connected, compared with simply selecting the nodes with maximum degree or k-shell value one by one.
In VoteRank, each node u is attached with a tuple $(s_u, va_u)$, where the voting score $s_u$ denotes the number of votes obtained from u's neighbors and the voting ability $va_u$ represents the number of votes that u can give to each of its neighbors. The details of VoteRank are described in the following five steps:
step 1: Initialize. The tuples of all nodes are set to (0, 1).
step 2: Vote. Nodes vote for their neighbors and, at the same time, are voted for by their neighbors. After the voting step, the voting score of each node is calculated. Note that the voting score of a node is set to zero if it has been elected in an earlier turn, so as to avoid electing it again. For example, node $v_0$ has three neighbors $v_1$, $v_2$ and $v_3$. Node $v_0$ votes for $v_1$, $v_2$ and $v_3$ with $va_{v_0}$ votes each, and $v_1$, $v_2$ and $v_3$ vote for their corresponding neighbors with $va_{v_1}$, $va_{v_2}$ and $va_{v_3}$ votes respectively. So the voting score of node $v_0$ is $s_{v_0} = va_{v_1} + va_{v_2} + va_{v_3}$. This voting process differs from political voting because some nodes cast less than one vote in VoteRank.
step 3: Select. According to the voting scores calculated in step 2, select the node $v_{max}$ that gets the most votes. This node does not participate in subsequent voting turns, that is, its voting ability $va_{v_{max}}$ is zero from now on.
step 4: Update. Weaken the voting ability of the nodes that voted for $v_{max}$ in step 2. For example, if node u voted for $v_{max}$, update the voting ability of u to $va_u - f$ unless $va_u$ has already been decreased to zero, where f is a decreasing factor between 0 and 1. In the special case f = 0, only the voting ability of the newly elected node becomes zero, so the only effect is that the degree of each of its neighbors effectively decreases by one. In this report, we mainly focus on the simple form $f = 1/\langle k \rangle$, where $\langle k \rangle$ is the average degree of the network.
step 5: Repeat steps 2 to 4 until r spreaders are elected.
In order to give an intuitive explanation, we use VoteRank to choose the top-2 nodes on a small toy network with 10 nodes, as shown in Fig. 1. Fig. 1(a) represents the first turn of voting. The voting score and voting ability of each node are marked as a tuple (s, va) in Fig. 1(a). In this turn, node 0 is chosen and its voting ability is set to 0. The voting abilities of nodes 1, 2, 3, 4 and 5 are reduced by 1/2.4 ≈ 0.417, i.e., from 1 to 0.583. The updated voting ability of each node is marked in Fig. 1(b). According to the new voting abilities, node 7 is chosen, since it gets the highest voting score, 2.583, in the second turn of voting.
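The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `voterank`, the adjacency format (dict of node to list of neighbors), the tie-breaking rule (first node in dict order) and the early stop when no votes remain are choices of this sketch. For brevity it recomputes every voting score in each turn instead of updating only the neighborhood of the newly elected node.

```python
def voterank(adj, r, f=None):
    """Elect up to r spreaders from an undirected graph.

    adj: dict node -> list of neighbours (hypothetical input format).
    f:   decreasing factor; defaults to 1 / <k> as in the text.
    """
    n = len(adj)
    if f is None:
        avg_degree = sum(len(nb) for nb in adj.values()) / n
        f = 1.0 / avg_degree
    ability = {v: 1.0 for v in adj}          # step 1: every va starts at 1
    elected, chosen = [], set()
    for _ in range(min(r, n)):
        # step 2: each node collects the voting abilities of its neighbours;
        # already elected nodes get score 0 so they are never elected twice.
        score = {
            v: 0.0 if v in chosen else sum(ability[u] for u in adj[v])
            for v in adj
        }
        # step 3: pick the node with the most votes (ties: first in dict order).
        best = max(adj, key=lambda v: score[v])
        if score[best] == 0:
            break                            # assumption: stop when no votes remain
        elected.append(best)
        chosen.add(best)
        ability[best] = 0.0
        # step 4: weaken the voters of the elected node, never below zero.
        for u in adj[best]:
            ability[u] = max(0.0, ability[u] - f)
    return elected                           # step 5 is the loop itself
```

On a graph made of a star (center 0, leaves 1 to 4) and a separate path 5-6-7, the first turn elects the star center and the second turn elects node 6, because the leaves of the star have lost their only remaining source of votes.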
The VoteRank algorithm can be used to choose the top-r spreaders not only in undirected networks but also in directed networks. In a directed network, if there is a link from node u to node v, u is an in-neighbor of v and, correspondingly, v is an out-neighbor of u. In this report, a link from node u to v indicates that v receives information from u. The directed version of VoteRank differs slightly from the undirected one. Firstly, nodes only vote for their in-neighbors, and secondly, only the voting ability of the elected node and of its out-neighbors is updated.
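A directed variant following the two rules above might look like the sketch below. It is an illustration under assumptions, not the authors' code: `voterank_directed` and the `out_adj` format (dict mapping every node to the list of its out-neighbors, with every node present as a key) are hypothetical, and the default f = 1/⟨k^out⟩ is taken from the convention that ⟨k⟩ denotes the average out-degree for directed networks. Since each node votes for its in-neighbors, the score of u is the sum of the voting abilities of u's out-neighbors.

```python
def voterank_directed(out_adj, r, f=None):
    """Elect up to r spreaders from a directed graph.

    out_adj: dict node -> list of out-neighbours; every node must be a key
             (hypothetical input format). An edge u -> v means v receives
             information from u.
    """
    n = len(out_adj)
    if f is None:
        # Assumption: f = 1 / <k_out>, the average out-degree.
        f = n / sum(len(nb) for nb in out_adj.values())
    ability = {v: 1.0 for v in out_adj}
    elected, chosen = [], set()
    for _ in range(min(r, n)):
        # v votes for its in-neighbour u, so u collects va of its out-neighbours.
        score = {
            u: 0.0 if u in chosen else sum(ability[v] for v in out_adj[u])
            for u in out_adj
        }
        best = max(out_adj, key=lambda u: score[u])
        if score[best] == 0:
            break                      # assumption: stop when no votes remain
        elected.append(best)
        chosen.add(best)
        ability[best] = 0.0
        # Only the elected node and its out-neighbours (its voters) are updated.
        for v in out_adj[best]:
            ability[v] = max(0.0, ability[v] - f)
    return elected
```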

Performance Metrics
In this report, three metrics are used to evaluate the performance of methods. The first two are based on the spreading scale under the SIR or SI spreading model, and the third is based on structural properties of the elected spreaders. In order to compare the spreading speed of different methods, we introduce the infected scale F(t) at time t, which is defined as
$$F(t) = \frac{n_{I(t)} + n_{R(t)}}{n},$$
where n is the number of nodes in the network, and $n_{I(t)}$ and $n_{R(t)}$ ($n_{R(t)} = 0$ for the SI model) are the numbers of infected and recovered nodes at time t, respectively. In order to investigate the final scale of affected nodes, the final affected scale $F(t_c)$ is introduced:
$$F(t_c) = \frac{n_{R(t_c)}}{n},$$
where $n_{R(t_c)}$ is the number of recovered nodes when the spreading process reaches its steady state. Besides F(t) and $F(t_c)$, the structural properties of the selected spreaders are also used to evaluate the performance of different methods. In this report, the average shortest path length $L_s$ between each pair of source spreaders in the set S is used as an evaluation metric. It is defined as
$$L_s = \frac{1}{|S|(|S|-1)} \sum_{u, v \in S,\ u \neq v} l_{u,v},$$
where $l_{u,v}$ denotes the length of the shortest path from node u to v.
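As an illustration of the third metric, the average shortest path length between source spreaders can be computed with one breadth-first search per spreader. The sketch below is hypothetical code, not taken from the paper: the name `average_spreader_distance` and the adjacency format are assumptions, and pairs that cannot reach each other are simply skipped, a case the definition above does not address.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def average_spreader_distance(adj, spreaders):
    """Mean shortest path length over ordered pairs of distinct spreaders.

    Assumption of this sketch: unreachable pairs are ignored rather than
    counted with an infinite or penalty distance.
    """
    spreaders = list(spreaders)
    total, pairs = 0, 0
    for u in spreaders:
        dist = bfs_distances(adj, u)
        for v in spreaders:
            if v != u and v in dist:
                total += dist[v]
                pairs += 1
    return total / pairs if pairs else 0.0
```

For a directed network the same function applies if `adj` holds out-neighbors, since the sum runs over ordered pairs.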

Data Description
Four real networks are used to test the performance of VoteRank in this report. The networks YOUTUBE [42] and COND-MAT [43] are undirected, and the networks BERKSTAN [44] and NOTRE DAME [45] are directed. YOUTUBE is a video-sharing web site that includes a social network, in which nodes represent users and edges represent friendships between two users. COND-MAT is a collaboration network, generated from the e-print arXiv, that covers scientific collaborations between authors who submitted papers to the Condensed Matter category. In BERKSTAN, nodes represent pages from the berkeley.edu or stanford.edu domains and directed edges represent hyperlinks between them. In the NOTRE DAME network, nodes represent pages from the University of Notre Dame and directed edges represent hyperlinks between them. Some topological features of these four networks, including the number of nodes n, the number of edges m, the average degree (or average out-degree for directed networks) $\langle k \rangle$, the maximum degree (or maximum out-degree for directed networks) $k_{max}$, the average clustering coefficient $\langle c \rangle$, and the degree heterogeneity H, which is defined as $H = \langle k^2 \rangle / \langle k \rangle^2$ [41], are shown in Table 1.
Table 1: The basic topological features of four real networks. n and m are the total number of nodes and edges, respectively. $\langle k \rangle$ is the average degree for undirected networks or the average out-degree for directed networks. $k_{max}$ is the maximum degree for undirected networks or the maximum out-degree for directed networks. $\langle c \rangle$ is the average clustering coefficient and H is the degree heterogeneity, defined as $H = \langle k^2 \rangle / \langle k \rangle^2$ [41].
Networks    n    m    <k>    k_max    <c>    H
YOUTUBE [42]    1134890    2987624
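The degree heterogeneity H used in Table 1 is straightforward to compute from a degree sequence. The helper below is a small illustrative sketch; the function name `degree_heterogeneity` is an assumption and not part of the original work.

```python
def degree_heterogeneity(degrees):
    """Return H = <k^2> / <k>^2 for a sequence of node degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n                  # <k>
    mean_k2 = sum(k * k for k in degrees) / n  # <k^2>
    return mean_k2 / mean_k ** 2
```

A regular network, in which all degrees are equal, gives H = 1, and H grows as the degree distribution becomes more uneven.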

Results
The performance of VoteRank and other methods is evaluated with the three metrics mentioned before on four real networks. Figure 2 shows the infected scale F(t) on the four networks under different methods with infected rate λ = 1.5 and p = 0.002, where p is the ratio of the number of source spreaders to the number of nodes in the network. From Fig. 2, it can be seen that, starting from the source spreaders obtained by VoteRank, information spreads faster and eventually affects a larger scale than with other methods. Moreover, the deviation of F(t) is generally small, especially for YOUTUBE, BERKSTAN and NOTRE DAME. Figure 3 shows the final affected scale $F(t_c)$ for different numbers of source spreaders. VoteRank achieves a wider final affected scale $F(t_c)$ than the other benchmark methods for the same number of source spreaders, especially when the number of source spreaders is large. Figure 4 shows $F(t_c)$ for different λ and different methods on the four networks. From Fig. 4, it can be seen that VoteRank achieves a wider spreading scale than other methods under different λ, especially in the YOUTUBE, BERKSTAN and NOTRE DAME networks. If λ is too small, information cannot spread effectively no matter how the source spreaders are chosen. If λ is too large, information can spread all over the network. For this reason, λ ranges only from 1 to 2 in this report, so that the differences between methods can be compared clearly. In fact, the final affected scale $F(t_c)$ is determined not only by the influence of the source spreaders but also by their relative locations. For this reason, k-shell decomposition can identify influential single spreaders effectively, but performs poorly at selecting a group of spreaders when it simply selects the nodes with the largest k-shell values. To overcome this limitation to some degree, a reasonable improved strategy is to choose the nodes with the highest voting score or k-shell value such that no two selected spreaders are directly linked.
That is, in the current state, if the node with the highest score is a neighbor of any selected spreader, we skip this node and consider the next one. Under this improved strategy, VoteRank and K-shellRank can be modified into their improved versions, i.e., VoteRank-Non and K-shellRank-Non, respectively. In order to evaluate the performance of VoteRank with this improved selection strategy, we compare K-shellRank and VoteRank with K-shellRank-Non and VoteRank-Non. Figure 5 shows the results of $F(t_c)$ against p ranging from 0.0005 to 0.003 on the two undirected networks. Both K-shellRank-Non and VoteRank-Non improve on K-shellRank and VoteRank. In particular, K-shellRank improves significantly. Even so, VoteRank-Non outperforms K-shellRank-Non when the number of source spreaders is large, especially in YOUTUBE. The results of VoteRank-Non and the original VoteRank are very close, as shown in Fig. 5. This indicates that the source spreaders selected by VoteRank are more dispersed and diverse than those selected by K-shellRank. Interestingly, VoteRank even outperforms K-shellRank-Non when p is larger than 0.0015 in YOUTUBE, as shown in Fig. 5(a). VoteRank has two parameters, i.e., the decreasing factor f and the initial voting ability. The final affected scale $F(t_c)$ under different f is compared in Fig. 6(a). From this figure, it can be seen that the final affected scale $F(t_c)$ for f > 0 is larger than that for f = 0, except for NOTRE DAME. The effect of the initial voting ability on VoteRank is also analyzed. The initial voting ability of node i is set to $k_i^{\alpha}$ ($(k_i^{out})^{\alpha}$ for directed networks), where α is a parameter whose value ranges from zero to one. For a given α, the parameter f of node i is set to $k_i^{\alpha}/\langle k \rangle$ ($(k_i^{out})^{\alpha}/\langle k^{out} \rangle$ for directed networks). The initial voting ability is 1 when α = 0, and it equals the node degree when α = 1. The effect of the initial voting ability is shown in Fig. 6(b).
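The skip rule behind the "-Non" variants can be illustrated for a fixed ranking, such as a list of nodes sorted by decreasing k-shell value. The sketch below is an assumption-laden illustration rather than the authors' code: `select_non_adjacent` and its inputs are hypothetical names. For VoteRank-Non the same neighbor check would be applied inside each voting turn, because the scores change after every election.

```python
def select_non_adjacent(adj, ranked_nodes, r):
    """Pick up to r nodes in rank order so that no two picks are neighbours.

    adj:          dict node -> list of neighbours (hypothetical input format).
    ranked_nodes: nodes sorted from highest to lowest score.
    """
    selected = []
    blocked = set()            # selected nodes and all of their neighbours
    for v in ranked_nodes:
        if len(selected) == r:
            break
        if v in blocked:
            continue           # v is adjacent to an earlier pick: skip it
        selected.append(v)
        blocked.add(v)
        blocked.update(adj[v])
    return selected
```

On the path 0-1-2-3-4 with ranking [2, 1, 3, 0, 4], the rule skips nodes 1 and 3 because they are adjacent to node 2, and returns [2, 0, 4].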
Generally, in undirected networks, the initial voting ability has little effect on $F(t_c)$. In directed networks, however, a smaller initial voting ability is a relatively better choice. Besides the SIR model with limited contact, the performance of the methods is also compared on other spreading models, such as the SI model and the SIR model with full contact, in which an infected node contacts all of its neighbors. The SI model is usually used to evaluate the spreading rate of a method, especially in the early stage. The infected scale F(t) in the early stage of the SI model is compared for different methods in Fig. 7. From this figure, it can be seen that information spreads faster from the source spreaders obtained by VoteRank than from those obtained by other methods. The performance of VoteRank on the SIR spreading model with full contact, with β = 1 and $\lambda = 1.5\lambda_c$, where $\lambda_c$ is the threshold [46,47,48], is compared with that of other methods. The results for the final affected scale $F(t_c)$ of the different methods are shown in Table 2. From this table, it can be seen that VoteRank performs well in most cases.
To verify that the source spreaders selected by VoteRank are more scattered than those selected by other methods, the average shortest path lengths $L_s$ obtained by VoteRank and by other methods are compared. Because calculating shortest path lengths in large scale networks is time consuming, we only analyze the two smaller networks, NOTRE DAME (directed) and COND-MAT (undirected), in this report. Figure 8 shows $L_s$ of the source spreaders selected by different methods for different numbers of source spreaders. From Fig. 8, it can be seen that the spreaders selected by VoteRank have a larger $L_s$ than those selected by other methods, especially when p is large. So, compared with other methods, the source spreaders selected by VoteRank are more decentralized across the whole network. As pointed out in Ref. [49], spreading is more effective when $L_s$ is larger.

Computational complexity analysis
The total computational time includes three parts: the time to initialize the voting ability and voting score, the time to select the node with the highest voting score, and the time to update the voting ability and voting score. For the first part, initializing the voting ability takes O(n) and initializing the voting score takes $O(\langle k \rangle n) = O(m)$, where $\langle k \rangle$ is the average degree of the network and m is the number of edges, so the computational complexity of this step is O(n + m) = O(m). In particular, the computational complexity is O(1) if the initial voting ability is set to 1. For the second part, selecting the node with the highest voting score takes O(n). With an efficient data structure such as a red-black tree, this decreases to O(log n). For the third part, only the information of nodes within a distance of 2 from the newly selected spreader needs updating. Hence, the computational complexity is $O(\langle k \rangle^2) = O(m^2/n^2)$. Selecting r spreaders requires r rounds of steps 2 and 3, so the total computational complexity is $O(m) + O(r \log n) + O(r \langle k \rangle^2) = O(m + r \log n + r m^2/n^2)$. If the network is sparse, so that m = O(n) and $\langle k \rangle$ is bounded, and r ≪ n, the computational complexity of VoteRank approaches O(n). Although the above analysis is based on undirected networks, the case of directed networks is similar.

Discussion
In summary, by utilizing the information of the first r − 1 ranked nodes to rank the r-th node, we obtain an obvious boost in information spreading in complex networks, especially in large scale networks.
However, when r is small, little information can be utilized and the advantage of VoteRank is not significant. When r becomes large, the information accumulated from the r − 1 previous nodes becomes abundant and yields a significant improvement. VoteRank provides a simple yet effective way to determine the next most influential node based on the nodes already selected. It is worth mentioning that VoteRank outperforms K-shellRank on undirected networks, and also outperforms other benchmark algorithms such as PageRank, ClusterRank and IndegreeRank on directed networks. The results also indicate that the performance of VoteRank is fairly stable under different infected rates λ and different scales of initial spreaders p in terms of information spreading. Another interesting question is how to determine the optimal value of r for the best spreading ability. This problem has two equivalent forms. The first is maximizing the spreading ability while fixing the number of initial spreaders. The second is minimizing the number of initial spreaders for a given spreading ability, i.e., fixing the number of finally affected nodes. We only take the first form into account in our work; the other form can be analyzed similarly. Besides, some researchers use the robustness R [50], which is defined as $R = \frac{1}{n}\sum_{i=1}^{n} \sigma(i/n)$, where $\sigma(i/n)$ is the fraction of nodes belonging to the giant component after removing a fraction i/n of the nodes from the network, to evaluate the attacking ability of a method. A method has a higher attacking ability if R is smaller. In recent years, many researchers have turned to the study of temporal networks, including their structure and dynamics [51]. Identifying a set of influential nodes in temporal networks is an important and interesting topic. How to extend our work to temporal networks is worth further study.
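The robustness R mentioned above can be evaluated directly from its definition by removing nodes in the order proposed by a method and measuring the largest connected component after each removal. The following sketch is illustrative only: the function names and the adjacency format are assumptions, component sizes are normalized by the original number of nodes n (an interpretation of the definition), and the recomputation after every removal is simple rather than fast.

```python
from collections import deque

def largest_component_size(adj, removed):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

def robustness(adj, removal_order):
    """R = (1/n) * sum over i of sigma(i/n); removal_order must list all n nodes."""
    n = len(adj)
    removed = set()
    total = 0.0
    for v in removal_order:
        removed.add(v)
        # sigma(i/n): giant-component fraction after i removals, relative to n.
        total += largest_component_size(adj, removed) / n
    return total / n
```

On the path 0-1-2, removing the middle node first gives R = 2/9, while removing an end node first gives R = 1/3, so the first order is the more effective attack.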