Identifying vital nodes by Achlioptas process

The vital nodes are the ones that play an important role in the organization of network structure or the dynamical behaviours of networked systems. Previous studies usually applied the node centralities to quantify the importance of nodes. Realizing that the percolation clusters are dominated by local connections in the subcritical phase and by global connections in the supercritical phase, in this paper we propose a new method to identify the vital nodes via a competitive percolation process that is based on an Achlioptas process. Compared with the existing node centrality indices, the new method performs overall better in identifying the vital nodes that maintain network connectivity and facilitate network synchronization when considering different network structure characteristics, such as link density, degree distribution, assortativity, and clustering. We also find that our method is more tolerant of noisy data and missing data. More importantly, compared with the unique ranking list of nodes given by most centrality methods, the randomness of the percolation process expands the possibility space of the optimal solutions, which is of great significance in practical applications.


Introduction
Identifying and ranking vital nodes in complex networks is one of the important research fields in network science [1][2][3][4]. By identifying and controlling vital nodes, one can facilitate the spread of information [5], promote products [6,7], inhibit the spread of epidemics [8,9], prevent cascading failures [10][11][12], better control the network [10][11][12], and so on. According to the principles underlying the methods, Ref. [2] divided representative vital node identification methods into four categories: (1) structural centralities, such as degree centrality [13] and coreness [14]; (2) iterative refinement centralities, such as eigenvector centrality [15]; (3) node operation, such as the connection-sensitive method [16]; (4) dynamics-sensitive methods, such as dynamics-sensitive centrality [17]. Besides the methods in these four categories, there is another group of methods that consider the network percolation process. Piraveenan et al. [18] first proposed percolation centrality, which quantifies the relative impact of nodes based on their topological connectivity as well as their percolation states, and which can be considered a variant of betweenness centrality weighted by percolated state. Recently, Morone et al. [19] pointed out that the problem of identifying vital nodes can be exactly mapped onto the so-called optimal percolation. In addition, many methods are defined in terms of the percolation process, such as the articulation point-targeted attack strategy, the greedy articulation points removal strategy [20], simulated annealing [21], and the method based on the percolation critical state [5].
In general, percolation considers the behavior of the clusters formed by the occupied nodes/links of a network, which can be realized by inserting the corresponding nodes/links into the network according to its structure [22,23]. Compared with the traditional indices of node importance, the indices extracted from the percolation process have their unique advantages. First, the clusters formed in the percolation process have universal properties, which do not depend on specific network attributes such as the network scale and average degree. Second, the percolation process not only reflects the structural features of networks but also involves the characteristics of dynamics, so it can expose the influence of nodes on both the structure and the dynamics of networks. Third, the randomness of the percolation process allows fluctuations in the ranking list and thus may yield multiple solutions. Note that not all the methods defined in terms of percolation have this last advantage; examples that lack it include k-core decomposition [14], optimal percolation [19], history-dependent percolation [24] and greedy articulation points removal [20]. Systematic reviews of the network percolation process can be found in Refs. [25] and [26].
A form of explosive percolation (EP) is realized by applying a competitive rule for edge addition via what is known as an Achlioptas process [27][28][29][30], which suppresses the growth of the largest cluster when nodes are activated one by one. In the early stage of this percolation process, the size of the largest cluster grows slowly, but once the number of activated nodes approaches a certain threshold, the size of the largest cluster grows rapidly and a giant connected component scaling in the order of the system size emerges [29]. This suggests that the later a node is activated, the more important it is to the network, both structurally and functionally.
In this paper, we propose a universal method of vital node identification based on explosive percolation, which covers all the above advantages of percolation-type methods. Explosive percolation and similar competitive mechanisms are often used to describe the growth of networks [31]. Here we use the growth process to identify vital nodes: the nodes that join later, which can also be seen as the winners of the competitive process, are more likely to be the vital nodes. The results show that the nodes identified by this method have a greater impact on the connectivity and synchronization dynamics of networks than those identified by other classical indices. We further analyze how network structure influences the performance of the new method under different network features, including link density, degree distribution, assortativity, and clustering, and also explore the influence of human intervention (i.e., a node protection strategy) and non-structure factors (i.e., missing or noisy data) on the robustness of the method. The results show that the new method maintains good performance under different conditions. We then demonstrate a practical advantage of the new method, namely how it can be used to identify alternative rankings that achieve similar performance. Finally, we discuss its computational complexity.

Explosive Percolation (EP) method
We propose a new method, named the Explosive Percolation (EP) method, that uses the Achlioptas process described below to quantify node importance. Consider a simple undirected network $G(V, E)$, where $V$ and $E$ represent the sets of nodes and links, respectively. We denote the state of node $i$ by $\sigma_i$. Initially, $\sigma_i = 0$ is assigned to all the nodes, meaning that they are inactive. Then the nodes are activated ($\sigma_i = 1$) one by one according to the following rules:
(a) Two nodes $j$ and $k$ are chosen randomly from all the inactive nodes. Assuming that nodes $j$ and $k$ are each activated in turn, their respective cluster sizes are
$$\xi_j = \sum_{i \in J} \sigma_i, \qquad \xi_k = \sum_{i \in K} \sigma_i, \tag{1}$$
where $J$ ($K$) is the set of all connected active nodes in the cluster to which node $j$ ($k$) belongs.
(b) Activate the node with the smaller cluster size $\xi$, while the other one remains inactive, namely
$$(\sigma_j, \sigma_k) = \begin{cases} (1, 0), & \xi_j < \xi_k, \\ (0, 1), & \xi_j > \xi_k. \end{cases} \tag{2}$$
If $\xi_j = \xi_k$, one of the two nodes is activated at random.
(c) Repeat (a) and (b) until all the nodes in the network are activated.
The activation order of node $i$ is denoted by $s_i$. The later a node is activated, the more important it is. Because the two candidate nodes are randomly selected from those not yet activated, the average over multiple realizations is used to reduce the fluctuation, and thus the final importance of node $i$ is defined as
$$S_i = \frac{1}{T} \sum_{t=1}^{T} s_i^{(t)}, \tag{3}$$
where $s_i^{(t)}$ is the activation order of node $i$ in the $t$-th realization and $T$ is the number of realizations. In this paper, we set $T = 500$. The larger the value of $S_i$, the more important node $i$ is.
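Rules (a)-(c) and the averaging over realizations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the network is assumed to be a dictionary mapping each node to the set of its neighbors, the function names are hypothetical, and the last remaining inactive node is simply activated without competition (a case the rules above do not spell out).

```python
import random
from collections import deque


def cluster_size_if_activated(adj, active, node):
    """Size of the active cluster `node` would belong to if it were activated (BFS)."""
    seen = {node}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen and active[v]:
                seen.add(v)
                queue.append(v)
    return len(seen)


def ep_realization(adj, rng):
    """One realization of rules (a)-(c); returns the activation order (1 = first) of each node."""
    active = {v: False for v in adj}
    inactive = list(adj)
    order = {}
    step = 0
    while inactive:
        if len(inactive) == 1:
            # Assumption: the last node has no competitor and is activated directly.
            chosen = inactive[0]
        else:
            j, k = rng.sample(inactive, 2)  # rule (a): two random inactive nodes
            size_j = cluster_size_if_activated(adj, active, j)
            size_k = cluster_size_if_activated(adj, active, k)
            if size_j < size_k:  # rule (b): the smaller cluster wins
                chosen = j
            elif size_k < size_j:
                chosen = k
            else:  # tie: pick one of the two at random
                chosen = rng.choice([j, k])
        active[chosen] = True
        inactive.remove(chosen)
        step += 1
        order[chosen] = step
    return order


def ep_scores(adj, realizations=500, seed=0):
    """Average activation order over independent realizations; larger means more important."""
    rng = random.Random(seed)
    totals = {v: 0 for v in adj}
    for _ in range(realizations):
        for node, position in ep_realization(adj, rng).items():
            totals[node] += position
    return {v: totals[v] / realizations for v in adj}
```

On a star network, for example, the hub loses almost every competition once a leaf is active, so it tends to be activated last and receives the largest score.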

Performance Evaluation of EP method
Network dismantling.
Network dismantling, or network attack [33,34], means destroying network structure and function by disabling nodes or links. The efficiency of network dismantling can be used to assess the performance of vital node identification methods. Specifically, a method performs better than others if the network collapses faster when nodes are removed one by one according to the ranking given by this method. For comparison, we also consider some well-known methods, including Degree centrality (DC) [1] (i.e., the number of direct neighbors), Closeness centrality (CC) [35,36], Eigenvector centrality (EC) [2,15,37], Collective Influence (CI) [19,38] and Adaptive Degree (AD) [38,39].
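The dismantling test can be sketched as follows: remove nodes in ranked order, record the normalized size of the largest remaining cluster after each removal, and average the resulting curve to obtain a single Robustness-style score. This is a minimal sketch assuming an adjacency dictionary (node to set of neighbors) and a ranking that lists every node once; the function names are hypothetical.

```python
from collections import deque


def largest_cluster_size(adj, removed):
    """Number of nodes in the largest connected cluster once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def dismantling_curve(adj, ranking):
    """Normalized largest-cluster size after removing the first Q nodes, for Q = 1..len(ranking)."""
    n = len(adj)
    removed = set()
    curve = []
    for node in ranking:
        removed.add(node)
        curve.append(largest_cluster_size(adj, removed) / n)
    return curve


def robustness(adj, ranking):
    """Sum of the dismantling curve divided by the number of nodes; smaller means a faster collapse."""
    return sum(dismantling_curve(adj, ranking)) / len(adj)
```

For a path of five nodes, removing the middle node first gives a smaller value than removing nodes from one end, which matches the intuition that the middle node matters most for connectivity.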
To quantitatively evaluate the performance of a method, a metric called Robustness [2][40][41][42] is defined as
$$R = \frac{1}{N} \sum_{Q=1}^{N} \Omega(Q). \tag{4}$$
Here, $\Omega(Q)$ is the normalized size of the largest cluster after removing the first $Q$ nodes, and the normalization factor $1/N$ allows the Robustness values of networks of different sizes to be compared [40]. Note that the vital nodes identified by each method are the first to be removed. Clearly, a smaller $R$ corresponds to a quicker decrease of $\Omega(Q)$ and a better attack effect. Fig. 1 shows the changes of the normalized size of the largest cluster of the six empirical networks when nodes are removed following the order ranked by different methods. The results generally show that the EP method dismantles all these networks more rapidly. Notably, in Fig. 1(b), since Jazz is a dense network with a high average degree, it is difficult to break it down into small clusters at the early stage of the dismantling procedure, leading to a worse performance initially. The Robustness values of the six indices on the six networks are shown in Table 2, where in each line the best-performing index is in bold, with EP performing best in all cases. This indicates that the EP method performs well in dismantling networks. In other words, the vital nodes identified by the EP method are more important for maintaining network connectivity. This is mainly because some specific structures of the network may limit the efficacy of other methods that consider only local or only global information, while the EP method captures both local and global information of the network. For instance, degree centrality may underestimate the importance of bridge nodes (i.e., nodes that connect two clusters) with low degrees.
Pinning control.
In pinning control [43,44], all nodes in the whole system reach the same state within a finite time by selectively pinning a fraction of nodes in the system.
Specifically, the state of each node in the initial system is given randomly, and pinning a node means driving it to the target state by exerting a control action. With feedback control, the other nodes will reach the same target state after a fraction of nodes are pinned. We next verify the validity of the vital nodes identified by the EP method in synchronization dynamics. For a simple connected network $G(V, E)$ consisting of $N$ linearly and diffusively coupled nodes, the system state equation is
$$\dot{x}_i = f(x_i) - \varepsilon \sum_{j=1}^{N} l_{ij} \Gamma(x_j) + U_i(x_1, \ldots, x_N), \tag{5}$$
where $x_i \in \mathbb{R}^n$ is the state vector of node $i$, $f(\cdot)$ denotes the self-dynamics of node $i$, $\varepsilon$ is the coupling strength, $\Gamma(\cdot): \mathbb{R}^n \to \mathbb{R}^n$ is the inner coupling matrix connecting different components of a state vector, and $U_i$ describes the control on node $i$. $L = [l_{ij}]_{N \times N}$ is the Laplacian matrix [45] of the network, which satisfies the following conditions: if $(i, j) \in E$, then $l_{ij} = -1$; if $(i, j) \notin E$ and $i \neq j$, then $l_{ij} = 0$; if $i = j$, then $l_{ii} = -\sum_{j \neq i} l_{ij}$.
We successively pin nodes one by one according to the given ranking and quantify the synchronizability of the pinned network by the reciprocal of the smallest nonzero eigenvalue of the principal submatrix [46] of the network, namely $1/\lambda_1(L_{-Q})$, where $Q$ is the number of pinned nodes and $L_{-Q}$ is the principal submatrix obtained by deleting the $Q$ rows and columns corresponding to the $Q$ pinned nodes from the original Laplacian matrix $L$. A smaller value of $1/\lambda_1(L_{-Q})$ corresponds to a higher synchronizability of the network after pinning $Q$ nodes.
To compare the EP method with other methods, we pin nodes successively according to different centrality indices and calculate $1/\lambda_1(L_{-Q})$. Similar to the definition of Robustness, we introduce an indicator, named Pinning Efficiency [46], to evaluate the performance of different indices on pinning control, which is defined as
$$P = \frac{1}{Q_{\max}} \sum_{Q=1}^{Q_{\max}} \frac{1}{\lambda_1(L_{-Q})}, \tag{6}$$
where $Q_{\max} = 0.3N$ is the maximum number of pinned nodes. A smaller $P$ means a higher synchronizability and thus a better performance of the pinning order given by an index. We compare the EP method with DC, CC, EC, CI, and AD in pinning control, and the results are shown in Fig. 2 and Table 3, where the best-performing index in each line is in bold. The EP method ranks first in two of the six networks and second in the other four, and thus performs best overall, along with AD.
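As a small worked illustration of the synchronizability measure (an example added here for clarity, not taken from the experiments), consider a path of three nodes 1-2-3. Write $L_{-\{v\}}$ for the Laplacian with the row and column of the pinned node $v$ deleted and $\lambda_1$ for its smallest eigenvalue. Pinning the middle node yields a smaller reciprocal eigenvalue than pinning an end node, so under this measure the middle node is the better single node to pin:

```latex
\begin{align*}
L &= \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}, \\
% pin the middle node (node 2): the remaining nodes decouple
L_{-\{2\}} &= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},
& \lambda_1 &= 1,
& 1/\lambda_1 &= 1, \\
% pin an end node (node 1): characteristic polynomial \lambda^2 - 3\lambda + 1
L_{-\{1\}} &= \begin{pmatrix} 2 & -1 \\ -1 & 1 \end{pmatrix},
& \lambda_1 &= \tfrac{3 - \sqrt{5}}{2} \approx 0.382,
& 1/\lambda_1 &= \tfrac{3 + \sqrt{5}}{2} \approx 2.618.
\end{align*}
```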

The influence of network features on the performance of EP method
The universality of a method on different networks is an important factor related to performance and practicability. This section studies the impact of different network statistical features on EP method performance, including link density, degree distribution, assortativity, and clustering.
All three model networks (ER, WS, and BA) have 1000 nodes and 5000 links. The link rewiring probability of WS is 0.1. The results shown in Fig. 3b clearly illustrate that the EP method has the best performance in all three model networks.

Assortativity.
Network assortativity is another property that greatly affects method performance [3,49,50]. A network is assortative if nodes with a high degree tend to connect to other nodes with a high degree; otherwise, the network is disassortative. Here, we calculate the degree assortativity of a network [1,45,49] using the Pearson correlation coefficient [45,49,51,52], namely
$$r = \frac{M^{-1} \sum_{(i,j) \in E} k_i k_j - \left[ M^{-1} \sum_{(i,j) \in E} \frac{1}{2} (k_i + k_j) \right]^2}{M^{-1} \sum_{(i,j) \in E} \frac{1}{2} (k_i^2 + k_j^2) - \left[ M^{-1} \sum_{(i,j) \in E} \frac{1}{2} (k_i + k_j) \right]^2}, \tag{7}$$
where $M$ is the number of links in the network, $k_i$ and $k_j$ are the degrees of node $i$ and node $j$, respectively, and the sums run over all the links. If $r = 1$, the network has the largest assortativity, while $r = -1$ indicates the largest disassortativity. The networks used in the experiments are constructed from BA networks through the following reconnection rules [49]:
(a) Randomly select two links without common nodes, then delete the two links.
(b) Rank the four end nodes of these links according to their degrees.
(c) To make the network more assortative, connect the two nodes with the highest degrees, and then connect the remaining two nodes; to make it more disassortative, connect the node with the highest degree to the node with the lowest degree, and then connect the remaining two nodes.
(d) Repeat steps (a)-(c) until the assortativity coefficient $r$ reaches the target value.
Robustness values for the six networks with different assortativity coefficients are listed in Table 4, where A1-A3 denote assortative networks and D1 and D2 denote disassortative networks. In each line, the best-performing index is in bold, and all the networks have the same number of nodes $N = 1000$ and links $M = 4000$. According to the results, networks with lower assortativity collapse more easily, and the EP method performs best in most networks. In fact, in the previous section, the assortativity of the three networks ER, WS, and BA decreases in turn, and the Robustness of each index in these three networks also decreases in turn, which is consistent with the conclusion drawn from Table 4.
This is because the nodes with large degrees are usually connected and form a network core in assortative networks, which makes the network robust against destruction; while the disassortative network is more tree-like and shows obvious vulnerability at the outset.
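The assortativity coefficient and one pass of reconnection rules (a)-(c) can be sketched as below. This is a minimal illustration, assuming the network is given as a list of undirected links `(u, v)`; the function names are hypothetical, and rejecting a move that would create a duplicate link is an added assumption that the rules above do not state.

```python
import random


def degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def assortativity(edges):
    """Pearson degree correlation over the links (Newman's assortativity coefficient)."""
    deg = degrees(edges)
    m = len(edges)
    prod = sum(deg[u] * deg[v] for u, v in edges) / m
    mean = sum(0.5 * (deg[u] + deg[v]) for u, v in edges) / m
    mean_sq = sum(0.5 * (deg[u] ** 2 + deg[v] ** 2) for u, v in edges) / m
    denom = mean_sq - mean ** 2
    if denom == 0:
        raise ValueError("undefined when all link end nodes have the same degree")
    return (prod - mean ** 2) / denom


def rewire_step(edges, assortative, rng):
    """Apply rules (a)-(c) once; returns a new edge list, or the input if the move is rejected."""
    deg = degrees(edges)
    i, j = rng.sample(range(len(edges)), 2)
    nodes = list(edges[i]) + list(edges[j])
    if len(set(nodes)) < 4:
        return edges  # rule (a): the two links must not share a node
    a, b, c, d = sorted(nodes, key=lambda x: deg[x], reverse=True)  # rule (b)
    new_pairs = [(a, b), (c, d)] if assortative else [(a, d), (b, c)]  # rule (c)
    kept = [e for idx, e in enumerate(edges) if idx not in (i, j)]
    existing = {frozenset(e) for e in kept}
    if any(frozenset(p) in existing for p in new_pairs):
        return edges  # assumption: reject moves that would duplicate an existing link
    return kept + new_pairs
```

Every node loses one link and gains one in an accepted move, so the degree sequence is preserved while the degree correlations change. A star network, where every link joins the hub to a leaf, gives $r = -1$.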

Clustering coefficient.
The clustering coefficient [1] of a node assesses the density of triangles in its neighborhood. It can also be understood as the probability that two neighbors of the node are connected, namely
$$C_i = \frac{2 e_i}{k_i (k_i - 1)}, \tag{8}$$
where $k_i$ is the degree of node $i$ and $e_i$ is the number of links between the neighbors of node $i$. Moreover, the clustering coefficient of a network is the average clustering coefficient of all nodes in the network,
$$C = \frac{1}{N} \sum_{i=1}^{N} C_i. \tag{9}$$
When increasing the clustering of a network with a fixed number of links, the local connections of the network become denser. However, the compactness of the connections between neighbors comes at the expense of diluting other connections. Many previous studies have already shown that this makes the network easier to destroy [49][53][54][55][56], meaning that all the methods evaluated by the Robustness will perform better as the clustering coefficient increases. The experiments are carried out on scale-free networks (whose degree distribution follows $p(k) \sim k^{-3}$) with different clustering coefficients and a fixed number of nodes $N = 1000$ and links $M = 4000$. The networks are constructed by the Holme and Kim algorithm [57]. The Robustness for six networks is shown in Table 5, where C1~C5 denote networks with different clustering coefficients, and the best-performing index is in bold. The results illustrate that the EP method is superior to the other three indices in both high-clustering and low-clustering networks.
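The node-level and network-level clustering coefficients described above can be computed directly from an adjacency structure. The sketch below assumes a dictionary mapping each node to the set of its neighbors; the function names are hypothetical, and assigning a coefficient of 0 to nodes with fewer than two neighbors is a convention chosen here.

```python
def local_clustering(adj, i):
    """Fraction of pairs of neighbors of node i that are themselves linked."""
    neighbors = list(adj[i])
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: nodes with fewer than two neighbors get coefficient 0
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if neighbors[b] in adj[neighbors[a]]
    )
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Network clustering coefficient: the mean of the local coefficients."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)
```

For a triangle with one extra node attached to a corner, the corner with three neighbors has coefficient 1/3, the other two triangle nodes have 1, and the pendant node has 0, giving a network average of 7/12.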

The influence of non-structure factors on the performance of EP method
In practice, the performance of a method is not only affected by network structure, but also by some non-structure factors, such as protection strategy, or system noise, which are studied in this section.

Protection strategy.
In practice, some nodes are protected and therefore cannot be attacked. To model these situations, n nodes are randomly selected for protection, so that attacks on them have no effect. All the results are the average of 100 realizations. The results in Fig. 4 show that the Robustness of all four indices increases as more nodes are protected, indicating that the protection strategy degrades the performance of network attacks. However, the EP method is always the best of the four regardless of how many nodes are protected.

Noise tolerance.
Noisy data commonly exist in real networks, in the form of fake links or missing links [58,59]. To assess the tolerance of the EP method to noisy data, for a given network G, we randomly remove or add $m$ ($m \in [0, 0.1M]$) links, and the resulting network is labeled G'. Then the original network G is attacked according to the ranking calculated on G'. The results are shown in Fig. 5. All four indices respond similarly to the noise. Specifically, the noise has the least influence on DC and the most influence on CC, while EC and EP are in between. In all cases, however, the EP method performs best.
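The perturbation step of this test can be sketched as follows: build a noisy copy of the link list by randomly removing or adding a given number of links, compute a ranking on the noisy copy, and then evaluate that ranking on the original network. The sketch below covers only the perturbation; the function name, the `mode` argument and the count `m` are hypothetical, and candidate links for addition are enumerated exhaustively, which is only practical for small networks.

```python
import random
from itertools import combinations


def perturb_links(nodes, edges, m, mode, rng):
    """Return a noisy copy of an undirected link list with m links removed or added at random."""
    current = sorted({tuple(sorted(e)) for e in edges})  # normalize each link as (small, large)
    if mode == "remove":
        dropped = set(rng.sample(current, m))
        return [e for e in current if e not in dropped]
    if mode == "add":
        present = set(current)
        absent = [p for p in combinations(sorted(nodes), 2) if p not in present]
        return current + rng.sample(absent, m)
    raise ValueError("mode must be 'remove' or 'add'")
```

A ranking computed on the returned link list can then be passed, unchanged, to whatever dismantling routine is applied to the original network.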

The multi-solution advantages of EP method
Due to the randomness of the percolation process, the ranking obtained by the EP method may differ in each realization, and thus the method may provide multiple optimal solutions. This property of the EP method is of great practical significance. For example, when some of the nodes cannot be protected or attacked, the EP method can find alternative rankings that achieve similar performance. Here, we take the Powergrid network as an example to show this property.
We used the Kendall correlation coefficient $\tau$ (see Method section for details) to evaluate the difference between the rankings obtained from different realizations. The larger $\tau$ is, the more similar the two rankings are. The results presented in Table 6 suggest that the Robustness values obtained from four different realizations (S1-S4) are very close to each other (all fall in the range [0.0458, 0.0480]), which indicates that these rankings have similar attack effectiveness. However, the values of $\tau$ between any two of them are not very high (0.752~0.758), indicating that there are still differences between the rankings obtained from different realizations. Therefore, the EP method allows us to get different rankings with almost the same performance, which can supplement each other in practical applications. This property has also been used in the analysis of network robustness [60][61][62].

Discussion
In this paper, we proposed a vital node identification method based on the Achlioptas process. Compared with classical methods, our method performs better on networks with different structural features. This is mainly because the information extracted from percolation clusters captures the local and global properties of a network simultaneously, so that the new method can use the combined features to identify important nodes. Besides the high accuracy, the new method can also produce multiple optimal solutions due to the randomness of the percolation process. This advantage suggests that our method provides a variety of options for many practical problems, and thus has a wide range of applications, for instance, prevention and control of infectious diseases, power network maintenance, and transportation or communication optimization.
For a given network $G(V, E)$, each realization of the EP method can be regarded as a random node activation procedure with $O(|V|)$ activation steps. The time complexity of activating a single node in the procedure, which includes the detection of cluster sizes using breadth-first search (BFS) [63], is $O(|V| + |E|)$. Therefore, the time complexity of each realization is $O(|V|^2 + |V||E|)$. To achieve high accuracy, the EP method needs to perform many realizations to reduce the fluctuation due to randomness, leading to high computational complexity, especially in very large networks. Fortunately, since the realizations are independent, parallel computation can be applied to improve the computational efficiency. In future work, we will further improve the EP method, optimize its computational efficiency, and seek more practical application scenarios. Although the proposed method is designed to identify important nodes, its idea and framework may inspire work on other problems in complex network research, such as link prediction and community decomposition, which we leave for future research.

Method
Eigenvector centrality, EC [2,15,37]. Eigenvector centrality assumes that the importance of a node depends not only on the number of its neighbors, but also on the importance of its neighbors. It is defined as
$$EC(i) = x_i = c \sum_{j=1}^{N} a_{ij} x_j, \tag{10}$$
where $x_i$ describes the importance of node $i$ and $a_{ij}$ is the element of the network adjacency matrix $A$. For all the nodes, this equation can be written in the matrix form $x = cAx$. Here, $c$ is a proportionality constant, defined as $c = 1/\lambda$, where $\lambda$ is the largest eigenvalue of $A$.
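One common way to obtain the leading eigenvector is power iteration, sketched below as an illustration rather than as the procedure used in this paper. The sketch assumes a connected network given as a dictionary mapping each node to the set of its neighbors. It iterates with $A + I$ instead of $A$, an added choice that leaves the eigenvectors unchanged but avoids the oscillation that plain power iteration shows on bipartite networks; the function name and the stopping parameters are hypothetical.

```python
def eigenvector_centrality(adj, iterations=1000, tol=1e-10):
    """Leading eigenvector of the adjacency matrix, scaled so the largest entry is 1."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # Multiply by (A + I): same eigenvectors as A, but converges on bipartite networks too.
        new = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = max(new.values())
        new = {v: value / norm for v, value in new.items()}
        if max(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x
```

On a star with four leaves, the largest eigenvalue of the adjacency matrix is 2, so the hub scores 1 and each leaf scores 0.5 after scaling.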
Closeness centrality, CC [35,36]. Closeness centrality assumes that a node with a smaller average distance to the other nodes in the network is more important. Closeness centrality is defined as
$$CC(i) = \frac{1}{N-1} \sum_{j \neq i} \frac{1}{d_{ij}}, \tag{11}$$
where $d_{ij}$ is the shortest path length between node $i$ and node $j$. If the network is not connected, $d_{ij}$ may be $\infty$, in which case $1/\infty = 0$.
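Collective Influence, CI [19,38]. The collective influence of node $i$ is defined as
$$CI_l(i) = (k_i - 1) \sum_{j \in \partial Ball(i, l)} (k_j - 1), \tag{12}$$
where $k_i$ is the degree of node $i$, $\partial Ball(i, l)$ is the set of nodes whose shortest path distance from node $i$ is exactly $l$, and $l$ is a predefined nonnegative integer which does not exceed the network diameter for a finite network. In this paper, we set $l$ as 2.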
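A minimal sketch of this index, assuming the standard definition of Collective Influence from Ref. [19]: the reduced degree $k_i - 1$ of node $i$ multiplied by the sum of the reduced degrees $k_j - 1$ of all nodes $j$ at shortest path distance exactly $l$ from $i$. The network is assumed to be a dictionary mapping each node to the set of its neighbors, and the function name is hypothetical.

```python
from collections import deque


def collective_influence(adj, i, l=2):
    """(k_i - 1) times the sum of (k_j - 1) over nodes j at distance exactly l from i."""
    dist = {i: 0}
    queue = deque([i])
    frontier = []
    while queue:
        u = queue.popleft()
        if dist[u] == l:
            frontier.append(u)  # reached the sphere of radius l; do not expand further
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (len(adj[i]) - 1) * sum(len(adj[j]) - 1 for j in frontier)
```

On a path of seven nodes with $l = 2$, the middle node has two nodes of degree 2 at distance 2, giving a value of 2, while an end node has reduced degree 0 and therefore a value of 0.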
Adaptive Degree, AD [38,39]. Adaptive degree is a variant of the degree method. AD recomputes the degrees of all remaining nodes after each removal of the node with the highest degree in the current network.
Kendall correlation coefficient [64]. Consider two sequences $X$ and $Y$ with $N$ elements, where $x_i$ and $y_i$ represent the $i$-th elements of $X$ and $Y$, respectively. If $i < j$ and $x_i - x_j$ has the same sign as $y_i - y_j$, then we call the pair $(i, j)$ consistent; if the two differences have opposite signs, the pair is inconsistent. The Kendall correlation coefficient is
$$\tau = \frac{2 (n_c - n_d)}{N (N - 1)}, \tag{13}$$
where $n_c$ and $n_d$ are the numbers of consistent and inconsistent pairs, respectively. Obviously, $\tau \in [-1, 1]$. $\tau = -1$ indicates that X and Y are two completely opposite rankings, while $\tau = 1$ means the two rankings are identical.
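The pair counting behind the Kendall coefficient can be written out directly. The sketch below assumes the common form in which the difference between the numbers of consistent and inconsistent pairs is divided by the total number of pairs $N(N-1)/2$, with tied pairs counted as neither; the function name is hypothetical.

```python
def kendall_tau(x, y):
    """Kendall correlation between two equally long score sequences."""
    n = len(x)
    consistent = 0
    inconsistent = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                consistent += 1
            elif sign < 0:
                inconsistent += 1
            # Ties (sign == 0) count as neither consistent nor inconsistent.
    return 2.0 * (consistent - inconsistent) / (n * (n - 1))
```

For example, the sequences [1, 2, 3] and [1, 3, 2] have two consistent pairs and one inconsistent pair out of three, which gives a coefficient of 1/3.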

Author contributions
Zhihao Qiu, Ming Li, and Linyuan Lü designed the study. Zhihao Qiu performed the research and Tianlong Fan executed the experiment validation. All authors analyzed the data and discussed the results. Zhihao Qiu and Tianlong Fan drafted the manuscript. Ming Li and Linyuan Lü revised the manuscript and all authors read and approved the final manuscript.