Effective spreading from multiple leaders identified by percolation in the susceptible-infected-recovered (SIR) model

Social networks constitute a new platform for information propagation, but their success crucially depends on the choice of spreaders who initiate the spreading of information. In this paper, we remove edges in a network at random so that the network segments into isolated clusters. The most important nodes in each cluster then form a set of influential spreaders, such that news propagating from them leads to extensive coverage and minimal redundancy. The method utilizes the similarities between the segmented network before percolation sets in and the coverage of information propagation in each social cluster to obtain a set of distributed and coordinated spreaders. Our tests, implementing the susceptible-infected-recovered model on Facebook and Enron email networks, show that this method outperforms conventional centrality-based methods in terms of spreadability and coverage redundancy. The suggested way of identifying influential spreaders thus sheds light on a new paradigm of information propagation in social networks.


Introduction
The development of social networks has had a great impact on our lifestyles, from making friends to dating, from working to shopping. They have become ever more essential as we increasingly depend on them to gather information. Compared with search engines, which are based on isolated queries, collecting information by leveraging individual specialties in social networks leads us to useful information given by experts in disparate fields, and thus increases both the quality and the diversity of the acquired information. By the same token, influential individuals can also be used to spread information. The key to success is to identify the most influential spreaders in the network. However, it is difficult to identify them, as there are usually just a few individuals capable of propagating a piece of news to a large number of users [1]. For example, while socially significant users are rare in the Twitter network, their messages and blogs can spread quickly throughout the whole network [2,3].
Some simple methods have been proposed to identify optimal spreaders. For instance, degree centrality suggests that nodes with higher degree are more influential than others [4]. On the other hand, the location of a node in a network and the influence of its neighbors are also important. For instance, a node with a small number of highly influential neighbors located at the center of the network may be more influential than a node having a larger number of less influential neighbors. Kitsak et al [5] thus proposed a coarse-grained method that uses k-core decomposition to quantify the influence of a node, based on the assumption that news initiated at nodes in higher shells is likely to spread more extensively. However, this location-based method is invalid for tree-like networks where all nodes are in the same shell. Recently, Morone and Makse mapped the influencer identification problem onto optimal percolation and proposed a metric called 'collective influence' to find the solution [6]. Their method can find a class of strategic influencers who outrank the hubs in the networks. Some distance-based global metrics, such as betweenness [7] and closeness [8], have also been suggested and can lead to extensive propagation, but due to their high computational complexity, they are not practical for large-scale social networks. Other centralities such as LocalRank have also been suggested [9]. We remark that although the spreaders selected by different methods may be influential in specific spreading models, the results are usually sensitive to the chosen spreading mechanism [10,11]. In this paper, we only study the susceptible-infected-recovered (SIR) model, which well describes some aspects of the information spreading process in social media [12-15].
Simple but sub-optimal protocols have been applied to social media such as QQ, BBS, and blogs to find the key spreaders who can trigger the 'tipping point' in social marketing to promote commercial products. Specifically, if one can convince a set of influential users to adopt a new product, one may induce a large cascade of purchases, as these initial buyers propagate their compliments of the product along the network. Unlike the aforementioned methods, which identify a set of independent spreaders according to their centralities, our goal is to find a coordinated set of individuals such that their combined impact is greatest, leading to much more extensive propagation of information. However, identifying the optimal combination of spreaders is indeed a difficult task, both conceptually and computationally [16].
In this paper, we utilize the similarities [15,17] between bond percolation and information propagation to identify a group of influential spreaders. By removing edges at random until percolation ceases, individual isolated clusters are formed. Due to the correspondence between percolation and information transmission, the emergence of such clusters implies that news can be effectively propagated within the clusters but not across the clusters. Initiating a piece of news on the most influential user identified by degree centrality in each cluster is thus an effective way to distribute the news within the cluster. Since such a process is static and requires much less computation power than the dynamical spreading of news, a lot of segmented states can be generated and averaged to give a more accurate result on the segmentation of social clusters as well as their corresponding influential spreaders.
By testing our method on Facebook and Enron email networks, we show that in addition to a higher computational efficiency, our method outperforms other simple heuristics based on local and global centrality in terms of propagation coverage and coverage redundancy of the selected spreaders. This is consistent with the old saying that the power of a typical group exceeds that of a single most competent individual. Moreover, we find that the average degree of the users selected by our method is lower, which implies a lower cost in identifying the spreaders when compared to other methods. We also identify the different characteristics of spreaders who are most effective in promoting niche or popular items in order to maximize coverage. All these results lead to insights into the design of viral marketing strategies and a new paradigm for information propagation.

The model
Spreading dynamics with the involvement of humans can be classified into two main classes: one is the spreading of infectious diseases, which requires physical contact; the other is the spreading of information, including opinions and rumors, where physical contact is not required [18]. Due to the similarity between epidemic and information spreading, well-established models of epidemic spreading are widely used to describe the propagation of information [12-15, 19, 20].
In particular, the susceptible-infected-recovered (SIR) model is one representative [12-15]. Individuals in this model are classified into three states: susceptible (S, does not carry the disease and will not infect others but can be infected), infected (I, carries the disease and can infect others), and recovered (R, either dead or recovered from the disease and immune to further infection). The simulation runs in discrete time steps. At each time step, an infected individual transmits the disease to each of his or her neighbors with probability β and recovers with probability γ, giving the SIR transmissibility p = β/γ. The process stops when there are no infected nodes anymore. When applying the SIR model to mimic information spreading, a susceptible person (S) in the model is analogous to an individual who is not aware of the information. An infected person (I) is analogous to an individual who is aware of the information and will pass it to his/her neighbors. A recovered person (R) is analogous to an individual who loses interest and will never pass the information on again.
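The dynamics described above can be sketched in a few lines. This is a minimal illustration rather than the exact simulation code used in the paper; the adjacency-list layout and function name are our own:

```python
import random

def sir_spread(adj, seeds, beta, gamma, rng=None):
    """Simulate discrete-time SIR spreading from a set of seed nodes.

    adj   : dict mapping node -> list of neighbours
    seeds : iterable of initially infected nodes
    beta  : per-step infection probability along each I-S contact
    gamma : per-step recovery probability of an infected node
    Returns the set of recovered nodes, i.e. the final outbreak.
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    recovered = set()
    while infected:
        newly_infected = set()
        for i in list(infected):
            # try to infect each susceptible neighbour
            for j in adj[i]:
                if (j not in infected and j not in recovered
                        and j not in newly_infected):
                    if rng.random() < beta:
                        newly_infected.add(j)
            # the infected node may recover this step
            if rng.random() < gamma:
                infected.discard(i)
                recovered.add(i)
        infected |= newly_infected
    return recovered
```

With β = γ = 1 on a chain 0-1-2 seeded at node 0, the infection deterministically sweeps the whole chain before everyone recovers.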
Newman [15] studied in detail the relationship between the static properties of the SIR model and bond percolation phenomena on networks, and remarked that the SIR model with transmissibility p is equivalent to a bond percolation model with bond occupation probability p on the network. After the unoccupied edges are removed, a number of clusters are formed. The ultimate size of the SIR epidemic outbreak triggered by a single initially infected node is precisely the size of the cluster that the initial node belongs to. Accordingly, nodes in the same cluster are expected to have the same coverage. A review of epidemic processes in complex networks can be found in [17].
Our method is then devised in relation to the bond percolation model [15, 21-24] as follows. Given an undirected network G(V, E), where V represents the set of nodes (i.e. users in social networks) and E represents the set of edges (i.e. connections in terms of communication, friendship, or other kinds of interaction), all edges are first removed and each individual edge is then recovered with a probability p. All links are removed when p = 0; as p increases, more links are recovered and clusters start to form and merge with each other. For a network containing N nodes, a giant component of size O(N) emerges only when p is larger than a critical threshold p_c, and this phenomenon is called percolation. In this paper, we will call those states with isolated clusters the segmented states. In the context of information propagation, since an edge between two nodes appears with a probability p, the value p can be considered as the transmissibility of information from one node to another.
To find the most influential group, we identify the W most influential spreaders in the network by utilizing the segmented states where some of the edges are removed. Assume that there are m isolated clusters in a segmented state after one realization of link recovery, and denote by S_i the size of cluster i, for i = 1, 2, …, m. We introduce a tunable parameter L, which is usually equal to or larger than W. If L ≤ m, we choose the top-L largest clusters and assign one unit of 'leader score' to the largest-degree node in each cluster. If there are multiple nodes with the largest degree, we randomly assign the score to one of them. If m < L ≤ 2m, we first choose the highest-degree node in each of the top-m largest clusters, and the remaining L − m nodes are chosen to be those with the second-largest degree, respectively, from the top-(L − m) largest clusters. If L > 2m, we choose the next-largest-degree nodes in each cluster following the same selection rules. After M different trials of link recovery, all nodes are ranked according to their scores in descending order and the W nodes with the highest leader scores are suggested to be the set of initial spreaders. For the sake of simplicity, we set L = W; we have tested and found that the results are not sensitive to L. The dependence of the results on L is shown in supplementary figure S1 available online at stacks.iop.org/NJP/19/073020/mmedia.
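The procedure above can be sketched for the simplest case L = W. This is an illustrative reading of the rules, not the authors' exact implementation: we score only the top-L clusters, measure degree in the original network, and let ties break arbitrarily:

```python
import random
from collections import deque

def percolation_leaders(nodes, edges, p, W, M, L=None, rng=None):
    """Rank nodes by 'leader score' accumulated over M percolation trials
    and return the top-W candidates as initial spreaders."""
    rng = rng or random.Random(1)
    L = W if L is None else L
    degree = {v: 0 for v in nodes}        # degrees in the original network
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    score = {v: 0 for v in nodes}
    for _ in range(M):
        # recover each edge independently with probability p
        adj = {v: [] for v in nodes}
        for u, v in edges:
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
        # find the connected components (clusters) by BFS
        seen, clusters = set(), []
        for s in nodes:
            if s in seen:
                continue
            comp, queue = [], deque([s])
            seen.add(s)
            while queue:
                x = queue.popleft()
                comp.append(x)
                for y in adj[x]:
                    if y not in seen:
                        seen.add(y)
                        queue.append(y)
            clusters.append(comp)
        # one unit of leader score to the largest-degree node
        # of each of the top-L clusters
        clusters.sort(key=len, reverse=True)
        for comp in clusters[:L]:
            leader = max(comp, key=lambda v: degree[v])
            score[leader] += 1
    return sorted(nodes, key=lambda v: score[v], reverse=True)[:W]
```

On a network made of two disjoint triangles with p = 1, every trial selects one leader per triangle, so the two returned spreaders come from different clusters.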
In other words, our suggested method draws an analogy with percolation to identify individual social clusters in the network where news can be effectively propagated within clusters but not across clusters. These isolated clusters in the segmented state thus have a direct correspondence to propagation coverage when one spreads news from an initial spreader in each of the clusters. Unlike most other methods which usually identify a group of influential spreaders that are not evenly distributed to cover the whole network, our procedures segment the network into non-overlapping components such that the identified spreaders are well distributed, enjoying reduced redundancy when compared to a set of uncoordinated spreaders. These differences make our method unique compared to other methods.

Datasets
We consider two social networks, namely the Facebook network and the Enron email network. Their statistical features are shown in table 1.
(i) Facebook: the friendship relations in the New Orleans Facebook social network. It is a directed network whose nodes correspond to Facebook users. Each directed edge represents a post or a comment, connecting the user who writes the comment to the user on whose wall the post appears. Since users may write multiple comments on the same wall, the network allows multiple edges between a node pair, and since users may also comment on their own walls, the network contains self-loops. In our experiment, we treat the network as a typical undirected network by deleting self-loops and merging multiple links into a single link; two nodes are assumed to be connected if there is at least one directed link between them. The data can be freely downloaded at http://levich.engr.ccny.cuny.edu/~hmakse/soft_data.html.
(ii) Enron email network [25]: Enron's email communication network covers roughly 0.5 million email communications between a group of users. This dataset was originally open to the public, and was posted on the internet by the Federal Energy Regulatory Commission during its investigation. Nodes of the network correspond to email addresses in the system, and if an address i sent at least one email to address j, there is an undirected edge between i and j. Note that non-Enron email addresses are considered as sources and sinks in the network, as we only observe their communications with the Enron email addresses, but not the communications between them. The data can be freely downloaded at http://snap.stanford.edu/data/email-Enron.html.

Table 1. The basic characteristics of the Facebook and Enron email networks. We denote by |V| and |E| the number of nodes and edges, respectively, C the clustering coefficient [26], and r the assortativity coefficient [27]. We denote by ⟨k⟩ the average degree, ⟨d⟩ the average shortest distance, and H the degree heterogeneity, defined as H = ⟨k²⟩/⟨k⟩².

Spreadability and coverage redundancy
To quantify the performance of our method, we examine the spreadability, i.e. the propagation coverage of a piece of news from a set of W selected spreaders, by our method as well as by other methods. We use the SIR model to mimic the spreading of news, and the spreadability is defined as the ratio of recovered nodes (i.e. the size of the outbreak, or the number of users who received the information) to the total number of users. We remark that the transmissibility p adopted in the SIR model is the same as the probability p used to recover edges to identify the clusters in the segmented states. As a result, for a single spreader, the maximum size of the SIR outbreak triggered by this spreader is precisely the size of the cluster that it belongs to. Likewise, the maximum size of the SIR outbreak triggered by a group of spreaders in distinct clusters is the sum of the sizes of the clusters that these nodes belong to. For example, if we measure the maximum coverage of three selected nodes on a network with N nodes, and if the first two nodes belong to cluster S_i and the third one belongs to cluster S_j, the maximum coverage of each individual node 1, 2, and 3 is respectively S_i/N, S_i/N, and S_j/N, while the maximum coverage of the three as a group is (S_i + S_j)/N.

We first apply our method on the Facebook network with 59691 nodes. Figure 1(a) shows the coverage obtained from 4000 initial spreaders chosen by our percolation method, compared with a set of 4000 spreaders identified by four other methods, namely the k-shell decomposition, the betweenness centrality, the collective influence (CI) method [6], and the high degree adaptive (HDA) method, where the degrees of nodes are recalculated according to the updated network (see the appendix for the definition of each of these methods). The percolation method yields the highest spreadability for an arbitrary transmissibility p, as shown in figure 1(a), while comparisons with other centrality measures can be found in supplementary figure S2.
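The worked example reduces to a union-of-clusters computation; a minimal sketch, where the data layout and function name are our own:

```python
def max_group_coverage(cluster_of, cluster_size, seeds, n):
    """Maximum SIR coverage of a seed set in one segmented state.

    cluster_of   : dict mapping node -> cluster label
    cluster_size : dict mapping cluster label -> cluster size
    Each cluster is counted once even if it hosts several seeds.
    """
    clusters = {cluster_of[s] for s in seeds}
    return sum(cluster_size[c] for c in clusters) / n
```

For the example in the text, two seeds in a cluster of size S_i and one in a cluster of size S_j give a group coverage of (S_i + S_j)/N, not 2S_i/N + S_j/N.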
We show in section 4 of the supplementary data that our method also outperforms the other methods on weighted networks in terms of spreadability. Figure 1(b) shows the degree distribution of the 4000 spreaders identified by the percolation method. When p < p_c ≈ 0.01, the percolation method yields isolated clusters of similar size [21], and since the set of selected spreaders comes from different clusters, a wide range of degrees is found among the spreaders, as shown in figure 1(b). In this case, the percolation method is more likely to choose high-degree nodes; see supplementary figure S3(b), where the red stars represent the degree distribution of the 4000 selected nodes when p = 0.008. When p > p_c, the distribution becomes narrower as p increases; see the blue squares in supplementary figure S3(b). In this case, the percolation method prefers low-degree spreaders. The average original degree (i.e. the degree in the original network before edge removal) of the 4000 spreaders selected by the percolation method when p < p_c is higher than that of the nodes selected when p > p_c. This implies that to promote and advertise a new niche product, which is regarded as difficult to accept, one can draw an analogy with the case of small transmissibility p, where high-degree initial spreaders are preferred. On the other hand, for popular items which are easy to accept, one can draw an analogy with the case of large p, where low-degree initial spreaders are preferred.
We then examine the cost of initializing the spreading from the selected spreaders. Information propagation sometimes incurs a cost; for instance, one may need to pay star bloggers for posting and passing on an advertisement. We assume that the direct influence of a user is equal to the number of its nearest neighbors, i.e. the degree of the user, and that the difficulty of finding a user with degree k is proportional to 1/p(k), which can be considered the scarcity of the user. Here p(k) is the occurring frequency of nodes with degree equal to k. The cost to initialize (or hire) a spreader i is proportional to his/her impact as well as scarcity, and hence the cost is assumed to be k_i/p(k_i).

Besides spreadability and cost, we also examine the redundancy in coverage, which quantifies the efficiency of propagation. Specifically, the redundancy of a node i is defined as the number of initial spreaders who have the potential to infect node i. A method is inefficient if the chosen initial spreaders pass the same information to the same group more than once. Averaging the redundancy over all infected nodes, we obtain the redundancy of the set of initial spreaders. Figure 1(e) compares the spreading redundancy of our method with that of the other four methods (comparisons with other centrality measures can be found in supplementary figure S4). The highest redundancy is found in the k-shell and degree centrality methods, followed by CI, and then by betweenness centrality. Our percolation method has the lowest redundancy among the five methods, since the spreaders identified by our method are usually located in different regions of the network. We also checked the Enron email network and obtained results similar to those for the Facebook network; see figure 2. We also show in section 4 of the supplementary data that our method outperforms the other methods on weighted networks in reducing the redundancy in coverage.
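The cost k_i/p(k_i) can be computed directly from the empirical degree distribution; a minimal sketch, with names and data layout of our own choosing:

```python
from collections import Counter

def spreader_cost(degrees, i):
    """Cost of hiring node i as a spreader: its impact (degree k_i)
    times its scarcity 1/p(k_i), where p(k) is the empirical frequency
    of degree k in the network.

    degrees : list of node degrees, indexed by node id
    """
    n = len(degrees)
    freq = Counter(degrees)          # count of nodes with each degree
    k = degrees[i]
    p_k = freq[k] / n                # empirical degree frequency p(k)
    return k / p_k
```

High-degree nodes are both more influential and rarer, so their cost grows on both counts.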
To further examine the redundancy in coverage, we applied the five methods to identify four initial spreaders on a simulated network with clear communities. We generate a network with community structure in three steps. In our experiment, we consider a network with 2000 nodes which has four communities, each containing 500 nodes. First, we generate a random network of size 500 with node degrees following a power-law distribution with exponent 2.2, using the configuration model [28]. The minimum degree is 1 and the maximum degree is √500 ≈ 23 [29]. Second, we repeat the above procedure to generate the other three sub-networks independently. Finally, for each pair of sub-networks, we randomly select a fraction of node pairs to connect them. As shown in table 2, the four spreaders identified by the percolation method are likely to be found in different communities. For the other methods, there is a high probability that at least two initial spreaders are in the same community. This result is easy to understand, as our method relies on the segmentation of the network into isolated clusters; in this case, the network separates into four communities and thus one spreader is found in each community.
Although the leader score is aggregated over M realizations, a different set of leaders may be generated if another set of M realizations of percolation is generated, since link recovery in our percolation-based method is stochastic in nature and the highest-degree nodes in small clusters are not unique. While the other methods always suggest the same set of initial spreaders, different spreaders can be generated by reiterating our percolation procedures, especially for large values of p. Figures 3(a) and (b) show the average number of common nodes in two different solutions of the percolation method, i.e. n_c, on the Facebook and Enron email networks, respectively. It is clear that when p increases, the number of common spreaders decreases, indicating that the solutions become more diverse. This result has practical significance: in cases where some initial spreaders are offline, we can use the next best candidates as back-up spreaders without losing much spreadability. Compared with the other four methods, the percolation method provides higher flexibility in the choice of spreaders. We further calculate the Shannon entropy of the obtained solutions, defined as H = −Σ_i q_i ln q_i, where q_i is the fraction of realizations in which node i is found. Figures 3(c) and (d) show the dependence of the Shannon entropy on the parameter p. A non-zero Shannon entropy indicates that various solutions are found and different sets of nodes are identified as the initial spreaders. A more non-trivial trend of the Shannon entropy is observed compared to the entropy computed merely from the number of different solutions. For instance, a peak of the Shannon entropy is observed at intermediate values of p > p_c. In this case, a giant component exists in the network together with many small clusters. Depending on the random recovery of links, the set of smallest clusters is different for different solutions.
When the total number of clusters is roughly equal to the number of identified spreaders, one spreader is identified for each cluster, including the smallest clusters. The different sets of smallest clusters then contribute to the different sets of identified spreaders, and hence a peak in the Shannon entropy at intermediate values of p when the number of clusters is roughly equal to L.
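The entropy of the solution ensemble can be estimated from repeated runs of the selection method; a minimal sketch, taking q_i as the fraction of runs in which node i appears (function name is ours):

```python
import math
from collections import Counter

def solution_entropy(solutions):
    """Shannon entropy H = -sum_i q_i ln q_i over the nodes appearing in
    the spreader sets found over repeated runs.

    solutions : list of sets of selected spreaders, one set per run
    """
    runs = len(solutions)
    counts = Counter(v for sol in solutions for v in sol)
    return -sum((c / runs) * math.log(c / runs)
                for c in counts.values())
```

Identical solutions in every run give zero entropy (every q_i = 1), while runs that select disjoint spreader sets give a positive entropy.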
To further examine the difference between our method and the other methods, figure 4 shows the fraction of common spreaders with the percolation method, i.e. f. Comparisons with other methods can be found in supplementary figure S5. The overlap between the percolation method and the degree centrality method reaches its highest value at the critical point p_c = 0.01 and then sharply decreases to less than 5% when p = 0.03, because as p increases, most high-degree nodes are replaced by nodes with smaller degree. As we can see, the percolation method is the most cost-effective over a large range of spreader cost, i.e. it attains the highest spreadability for a selected spreader of the same cost. On the other hand, betweenness identifies low-cost spreaders with high spreadability, but their spreadability is limited to a small maximum value compared to the other methods. In this case, the spreader with maximum spreadability is always identified by the percolation method. Besides the two real networks, we also investigated scale-free networks. Similar results are found; see supplementary figures S6-S11.

Discussion
As we can see, social networks constitute a new platform for propagating information. Unlike the usual practice where the networks are used by uncoordinated individuals to share their own messages, controlled spreading of information can be implemented via the networks. To quantify its performance, one can measure the coverage, the redundancy in propagation, and the cost of identifying appropriate initial spreaders. Yet these measures of performance largely depend on the choice of the users who start the propagation, and there is no single protocol which achieves optimality in all these dimensions. These difficulties in identifying influential spreaders have kept controlled information propagation via social networks largely theoretical.
To tackle the challenge, we draw an analogy between percolation processes and information propagation to develop a method which gives rise to a low-cost, minimally redundant set of initial spreaders capable of achieving large propagation coverage. Our method was tested on Facebook and Enron email networks, where favorable results over centrality-based methods were obtained. When compared to uncoordinated spreaders identified by these conventional methods, the spreaders identified by our method are evenly distributed within the network which greatly increases the propagation coverage and reduces its redundancy. Such coordination of spreaders is essential and can only be obtained using the suggested percolation procedures.
The success of this method is not a coincidence, since it utilizes the similarities between percolation and information propagation. By removing edges at random until percolation ceases, we identify individual isolated clusters where news can be effectively propagated within the clusters but not across them. Specific spreaders at the center of these clusters are then identified as the influential initial spreaders in the original network. By initiating news propagation from this set of spreaders, coverage is increased and redundancy is reduced compared to conventional centrality methods. Percolation is thus at the center of our method rather than a mere analogy.
The remaining question is practicality. As we show in the appendix, the computational complexity of our method is O(M|V|), with M the number of realizations, which is a favorable characteristic for application to real systems, as the complexity scales linearly with the system size. Once the set of important initial spreaders is identified, a coordinator just has to connect to these users and pass them the news; the information will then propagate quickly throughout the network. Of course, many details and practical issues are omitted in this simple description, but our results shed light on a completely new paradigm of information propagation. Further research along this line may revolutionize our way of spreading and gathering information in the near future.

A.1. Methods for comparison
To identify the most influential spreaders, various centrality measures have been proposed. The simplest, with which we compare our results, is degree centrality, a straightforward and efficient metric which assumes that a node with more nearest neighbors has a higher influence. However, node degree only reflects a node's direct influence and not the indirect influence exerted through its nearest neighbors. For example, a node of small degree but with a few highly influential neighbors may be more influential than a node having a larger number of less influential neighbors. In this paper, we employed the adaptive version of degree centrality as one of the baseline methods, namely the high degree adaptive (HDA) approach, which recalculates node degrees after the removal of links from the network. We compare the high degree (HD) method with the HDA method and find that the adaptive method performs slightly better than the static high-degree strategy (see supplementary figure S2).
The second method we used for comparison is k-shell decomposition. Recent research shows that the location of a node in a network may play a more important role than its degree. A node located at the center of the network may be more influential than a node having a larger number of less influential neighbors. Following this rationale, Kitsak et al [5] proposed a coarse-grained method using k-core decomposition to quantify the influence of a node, based on the assumption that nodes in the same shell have similar influence, and nodes in higher-level shells are likely to infect more nodes.
The third method is betweenness, one of the most popular geodesic-path-based ranking measures. It is defined as the fraction of shortest paths between all node pairs that pass through the node of interest. Betweenness is, in some sense, a measure of the influence of a node in terms of its role in spreading information [31,32]. For a network G = (V, E) with n = |V| nodes and m = |E| edges, the betweenness centrality of node v, denoted by B(v), is [7,33]

B(v) = [2/((n − 1)(n − 2))] Σ_{s ≠ v ≠ t} g_st(v)/g_st,

where g_st is the number of shortest paths between nodes s and t, and g_st(v) denotes the number of shortest paths between nodes s and t which pass through node v.
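For small graphs, B(v) can be computed by brute force from BFS shortest-path counts; a sketch of this definition (not an efficient implementation such as Brandes' algorithm, which the paper does not specify):

```python
from collections import deque

def bfs_paths(adj, s):
    """BFS from s returning (dist, sigma): shortest-path distance and
    the number of distinct shortest paths from s to each reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    q = deque([s])
    while q:
        x = q.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                sigma[y] = 0
                q.append(y)
            if dist[y] == dist[x] + 1:
                sigma[y] += sigma[x]
    return dist, sigma

def betweenness(adj):
    """Normalized betweenness B(v) = 2/((n-1)(n-2)) * sum over pairs
    s != v != t of g_st(v)/g_st, for an unweighted undirected graph."""
    nodes = list(adj)
    n = len(nodes)
    info = {s: bfs_paths(adj, s) for s in nodes}
    B = {}
    for v in nodes:
        total = 0.0
        dist_v, sigma_v = info[v]
        for i, s in enumerate(nodes):
            for t in nodes[i + 1:]:
                if s == v or t == v:
                    continue
                dist_s, sigma_s = info[s]
                # v lies on an s-t shortest path iff distances add up
                if (t in dist_s and v in dist_s and t in dist_v
                        and dist_s[v] + dist_v[t] == dist_s[t]):
                    total += sigma_s[v] * sigma_v[t] / sigma_s[t]
        B[v] = 2.0 * total / ((n - 1) * (n - 2))
    return B
```

On the chain 0-1-2, the middle node lies on the only 0-2 shortest path and gets B = 1, while the endpoints get B = 0.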
The last method we compare with is the 'collective influence' (CI) method proposed by Morone and Makse [6]. Define Ball(i, l) as the set of nodes inside a ball of radius l (in terms of the shortest path) around node i, and ∂Ball(i, l) as the frontier of the ball. Then the CI index of node i at level l is defined as

CI_l(i) = (k_i − 1) Σ_{j ∈ ∂Ball(i, l)} (k_j − 1),

where k_i is the degree of node i. Here we set l = 3.
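The CI index is straightforward to compute with a depth-limited BFS; a minimal sketch (function name is ours):

```python
from collections import deque

def collective_influence(adj, i, l=3):
    """CI_l(i) = (k_i - 1) * sum of (k_j - 1) over nodes j on the
    frontier of the ball of radius l around node i (Morone & Makse)."""
    dist = {i: 0}
    frontier = []
    q = deque([i])
    while q:
        x = q.popleft()
        if dist[x] == l:          # exactly at radius l: on the frontier
            frontier.append(x)
            continue              # do not expand past the frontier
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    ki = len(adj[i])
    return (ki - 1) * sum(len(adj[j]) - 1 for j in frontier)
```

Note that leaves (k = 1) contribute nothing, so CI naturally discounts nodes whose neighborhoods are dead ends.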

A.2. Computational complexity
Given a network G(V, E), there are four steps to find the W influential spreaders by the percolation method. First, all the edges are removed and each is then recovered with a probability p; we then obtain a new network G′ in a segmented state. The required computational complexity is O(|E|). Second, we find the connected components of G′ using Tarjan's algorithm [34], which has a complexity of O(|V| + |E|). Third, we select the node with the highest degree in each of the L largest components and assign one unit of score to the selected nodes. The complexity of this procedure is O(L|V|). Fourth, repeating the above three steps for M realizations, we rank the nodes according to their scores in descending order, and the top-W nodes are chosen to be the most influential spreaders.