Statistical significance of rich-club phenomena in complex networks

We propose that the rich-club phenomenon in complex networks should be defined in the spirit of bootstrapping, in which a null model is adopted to assess the statistical significance of the rich club detected. Our method can serve as a definition of the rich-club phenomenon and is applied to three real networks and three model networks. The results improve significantly on previously reported results. We also report a dilemma through an exceptional example, showing that no universally valid definition of the rich-club phenomenon exists.

Almost all social and natural systems are composed of a huge number of interacting components. Many self-organized features that are absent at the microscopic level emerge in complex systems due to the dynamics. The topological properties of the underlying network of interacting constituents have a great impact on the dynamics of the system evolving on it [1,2,3,4]. Most complex networks exhibit small-world properties [5] and are scale free in the sense that the distribution of degrees has power-law tails [6]. In addition, many real networks have modular structures or communities that express their underlying functional modules [7], and their topology is self-similar and scale invariant [8,9,10,11,12,13,14]. The modular and hierarchical structure of social networks may partly account for the log-periodic power-law patterns observed extensively in financial bubbles and antibubbles [15,16]. A closely related feature, termed the rich-club phenomenon, has recently been reported in some complex networks, although there is no consensus on its definition [17,18,19,20,21,22].
* Electronic address: wxzhou@ecust.edu.cn

The rich-club phenomenon in complex networks refers to the observation that the nodes with high degree (called rich nodes) tend to be densely connected with each other. The average hop distance within this tight group is between one and two [18]. Intuitively, rich nodes are much more likely than low-degree nodes to organize into tight and highly interconnected groups (clubs). It is therefore reasonable to accept that there is a rich-club phenomenon in the topology of the Internet [17,18,19,20]. This rationale can be characterized quantitatively by the rich-club coefficient φ, which is expressed as follows [18]:

$$\phi(k) = \frac{2E_{>k}}{N_{>k}(N_{>k}-1)},$$

where N_{>k} refers to the number of nodes with degree higher than a given value k and E_{>k} stands for the number of edges among the N_{>k} nodes. The rich-club coefficient φ(k) is the ratio of the actual number of edges to the maximum possible number of edges linking the N_{>k} nodes, and it measures how well the rich nodes 'know' each other. For example, φ = 1 means that the members of the club form a fully connected network. Indeed, φ(k) is nothing but the well-known clustering coefficient of the rich club. Zhou and Mondragón argue that a φ(k) that increases with k provides evidence for the presence of rich-club structure [18]. However, Colizza et al. point out that a monotonic increase of φ(k) is not enough to infer the presence of the rich-club phenomenon, since even random networks generated from the ER model, the MR model and the BA model have a φ(k) that increases with k [21]. Instead, the rich-club coefficient φ(k) should be normalized by a reference, and the correct null model to serve as that reference is the ensemble of maximally random networks with the same degree sequence as the network under investigation [21,22]. The maximally random networks can be generated with the chain-switching method [23,24].
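The rich-club coefficient φ(k) described above can be computed directly from an edge list. The following is a minimal sketch, assuming an undirected simple graph given as a list of node pairs; the function name `rich_club_coefficient` and the toy graph are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def rich_club_coefficient(edges, k):
    """Return phi(k) = 2 E_{>k} / (N_{>k} (N_{>k} - 1)).

    `edges` is a list of (u, v) pairs of an undirected simple graph
    (no self-loops, no duplicate edges). This input format is an
    assumption made for this sketch.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Rich nodes: degree strictly larger than k.
    rich = {node for node, d in degree.items() if d > k}
    n_rich = len(rich)
    if n_rich < 2:
        return float("nan")  # phi is undefined for fewer than two rich nodes

    # Count the edges whose two endpoints are both rich.
    e_rich = sum(1 for u, v in edges if u in rich and v in rich)
    return 2.0 * e_rich / (n_rich * (n_rich - 1))


# Toy graph: a triangle (0, 1, 2) with one pendant node on each corner.
edges = [(0, 1), (1, 2), (0, 2), (0, 3), (1, 4), (2, 5)]
phi_0 = rich_club_coefficient(edges, 0)  # all six nodes are in the club
phi_1 = rich_club_coefficient(edges, 1)  # only the triangle remains
```

In this toy graph the three nodes of degree 3 are fully connected, so φ(k = 1) reaches its maximum, while the club of all nodes at k = 0 is much sparser.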
The normalized rich-club coefficient is defined by

$$\rho(k) = \frac{\phi(k)}{\phi_{\rm ran}(k)},$$

where φ_ran(k) is the average rich-club coefficient of the maximally random networks [21]. The actual presence of the rich-club phenomenon in a network is confirmed if ρ(k) > 1 [21,22]. In this framework, there is no rich-club ordering in the Internet network.
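The normalization can be sketched as below. This is not the authors' implementation: it uses a plain degree-preserving edge swap as a stand-in for the chain-switching method of [23,24], it does not enforce connectedness of the randomized network, and the names `phi`, `randomize` and `rho`, the number of swaps, and the example graph are all assumptions made for illustration.

```python
import random
from collections import Counter


def phi(edges, k):
    """Rich-club coefficient of the nodes with degree > k (simple graph)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    rich = {node for node, d in degree.items() if d > k}
    n_rich = len(rich)
    if n_rich < 2:
        return float("nan")
    e_rich = sum(1 for u, v in edges if u in rich and v in rich)
    return 2.0 * e_rich / (n_rich * (n_rich - 1))


def randomize(edges, n_swaps, rng):
    """Degree-preserving rewiring: (a,b),(c,d) -> (a,d),(c,b).

    Swaps that would create a self-loop or a duplicate edge are rejected,
    so the graph stays simple and every node keeps its degree.
    """
    edge_list = list(edges)
    present = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible rewirings
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if len(new_1) < 2 or len(new_2) < 2:
            continue  # would create a self-loop
        if new_1 in present or new_2 in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new_1, new_2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def rho(edges, k, n_random=200, seed=1):
    """Normalized coefficient rho(k) = phi(k) / <phi_ran(k)>."""
    rng = random.Random(seed)
    phi_ran = [phi(randomize(edges, 10 * len(edges), rng), k)
               for _ in range(n_random)]
    mean_ran = sum(phi_ran) / n_random
    return phi(edges, k) / mean_ran if mean_ran > 0 else float("nan")


# Hypothetical example: a 12-node ring with a few chords among its hubs.
edges = [(i, (i + 1) % 12) for i in range(12)] + [(0, 4), (0, 8), (4, 8), (2, 6)]
rho_2 = rho(edges, 2)
```

Because every accepted swap leaves all degrees unchanged, the set of rich nodes is identical in the real and randomized networks; only the number of edges among them varies.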
We have repeated the analysis of Colizza et al. [21] for three model networks, namely the Erdős-Rényi (ER) model [25], the Molloy-Reed (MR) model [26], and the Barabási-Albert (BA) model [6], and for three real-world networks: the protein interaction network [23] of the yeast Saccharomyces cerevisiae, the scientific collaboration network collected by Newman [27], and the Internet network at the autonomous system level collected by the Oregon Route Views project [28,29,30]. The rich-club coefficients φ of the six networks under investigation are presented in Fig. 1 with black circles as a function of the percentage g of the richest nodes included in the rich club. The φ_ran functions of the corresponding maximally random networks are also shown. We note that when we plot φ(k) versus k, our results for the investigated networks are the same as those shown in Fig. 1 of Colizza et al. [21]. Figure 2 shows the ρ functions, which are not the same as those in Fig. 2 of Colizza et al. [21]. Specifically, we find that the normalized coefficients of the protein interaction network, the scientific collaboration network, and the ER model are qualitatively the same as those reported by Colizza et al. [21], while those of the remaining three networks are not. We note that the AS-Internet data was created by Mark Newman from data for July 22, 2006. Figure 2 shows that the normalized coefficient ρ is not less than 1 for the Internet, the MR model, and the BA model. For the Internet, we notice that φ is close to 1 for the richest nodes. Intuitively, the corresponding φ_ran should be less than 1, which is observed in our analysis but not in that of Colizza et al. [21].
The importance of the null model has been emphasized in the assessment of several properties claimed to be present in complex networks [22,23,31]. Rather than simply normalizing the rich-club coefficient, we argue that the correct way to assess the presence of the rich-club phenomenon is to perform a statistical test, which amounts to determining the probability that the identified rich-club phenomenon emerges by chance. The null hypothesis is the following: H_0: ρ(g) is not larger than 1. The alternative hypothesis is that ρ(g) > 1. We can compute the p-value, which is the probability of obtaining, under the null hypothesis, a rich-club ordering at least as strong as the one observed. The smaller the p-value, the stronger the evidence against the null hypothesis and in favor of the alternative hypothesis that the presence of rich-club ordering is statistically significant. The p-value is 100% when g = 1. Adopting the conventional significance level of α = 5%, the rich-club phenomenon is statistically significant if p < α. Figure 3 shows the p-values as a function of the percentage g of rich nodes for the networks investigated. For the protein interaction network and the ER network, the p-values are larger than α when g < 10%. Therefore, there is no rich-club ordering in these two networks. For the Internet, except for the point at the smallest g and the point with g = 0, all p-values are well below α = 5%, indicating significant rich-club ordering in the Internet. For the scientific collaboration network, the p-values are less than α = 5% for most values of g. However, the most connected scientists, corresponding to small g, do not form a rich club, even though, according to the top-right panel of Fig. 2, this group has a relatively large normalized rich-club coefficient. Most surprisingly, the MR network and the BA network exhibit significant rich-club phenomena.
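One common way to carry out such a test is an empirical p-value taken over the ensemble of randomized networks. The sketch below assumes that the p-value is estimated as the fraction of null-model networks whose rich-club coefficient is at least as large as the observed one; the text does not spell out the estimator, so this choice, the function name `empirical_p_value`, and the numbers in the example are illustrative assumptions and not data from the paper.

```python
def empirical_p_value(phi_obs, phi_random):
    """Fraction of null-model values that are >= the observed value.

    phi_obs    : rich-club coefficient of the real network at a given g
    phi_random : rich-club coefficients of the randomized networks at the
                 same g (one value per randomized network)
    """
    if not phi_random:
        raise ValueError("need at least one randomized network")
    hits = sum(1 for value in phi_random if value >= phi_obs)
    return hits / len(phi_random)


# Made-up null-model values, for illustration only.
phi_random = [0.30, 0.35, 0.40, 0.28, 0.33, 0.31, 0.29, 0.36, 0.32, 0.34]

p_strong = empirical_p_value(0.45, phi_random)  # no null value reaches 0.45
p_weak = empirical_p_value(0.33, phi_random)    # half of the null values do

# When the club contains every node, phi is fixed by the degree sequence,
# so every randomized network ties with the observation.
p_all_nodes = empirical_p_value(0.02, [0.02] * 10)
```

With this estimator the case where all nodes are included gives p = 1, which agrees with the statement that the p-value is 100% when g = 1. The smallest resolvable p-value is limited by the number of randomized networks, so testing at α = 5% needs well over twenty of them.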
Among these cases, the presence of a rich club in the Internet has stirred considerable debate. In a recent work [32], Zhou and Mondragón find a clique of rich nodes that are completely connected, which is an indisputable hallmark of the presence of a rich club. We can provide further evidence for this argument. As illustrated in Fig. 1, the rich-club coefficients φ are close to 1 when g is small for the Internet, the MR model, and the BA model. This means that the richest nodes in these networks are almost fully connected, which validates the intuitive definition of a rich club as a group of high-degree nodes that are densely linked. The statistical test thus lends further support to the claim of Zhou and Mondragón that a rich club is present in the Internet.
A missing ingredient in the discussions of the rich-club phenomenon is the connectedness of the rich club. When we define rich nodes as those with, for example, g > 1% and start to investigate whether these nodes form a rich club, we should scrutinize whether this "club" contains several disconnected sub-clubs. As illustrated in the upper panel of Fig. 4, the rich nodes of the scientific collaboration network are not fully connected for small g; they fall into several separated clusters. According to Fig. 3, all three of these subgraphs are rich clubs, which however contradicts the common intuition that the members of a club are aware of each other, at least through other members of the club. For g = 0.141%, there are two rich clubs (1,4,5,9,11,12) and (2,3,6,10,16,20,21,14,15,17). With increasing richness (smaller g or larger k), the rich club (1,4,5,9,11,12) remains unchanged. The second rich club (2,3,6,10,16,20,21,14,15,17) splits into two clubs (2,3,6,10,16) and (14,15,17) when node 20 and node 21 are removed for g = 0.111%. When g = 0.080%, the rich club (14,15,17) disappears and (2,3,6,10,16) degenerates to (2,3,6). Therefore, when there is more than one isolated cluster of nodes for a given g, we should investigate the statistical significance of each cluster separately, except for the trivial cases of isolated nodes and pairs of nodes. The lower panel of Fig. 4 shows the results for the scientific collaboration network. One observes that p < 5% for all clusters.
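The connectedness check described above amounts to finding the connected components of the subgraph induced by the rich nodes. The following sketch assumes an edge-list input with sortable node labels; the function name `rich_sub_clubs`, the `min_size` parameter, and the toy graph are hypothetical and not taken from the paper.

```python
from collections import defaultdict, deque


def rich_sub_clubs(edges, k, min_size=3):
    """Connected components of the subgraph induced by nodes of degree > k.

    Components with fewer than `min_size` nodes (isolated nodes and pairs)
    are dropped, following the rule that such trivial cases are skipped.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    rich = {node for node, d in degree.items() if d > k}

    # Adjacency restricted to edges between two rich nodes.
    adjacency = {node: set() for node in rich}
    for u, v in edges:
        if u in rich and v in rich:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, clubs = set(), []
    for start in sorted(rich):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search within the rich subgraph
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if len(component) >= min_size:
            clubs.append(sorted(component))
    return clubs


# Toy graph: two triangles joined only through node 6, which has degree 2.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5),
         (0, 6), (6, 3), (1, 7), (2, 8), (4, 9), (5, 10)]
clubs_k2 = rich_sub_clubs(edges, 2)  # node 6 is excluded: two sub-clubs
clubs_k1 = rich_sub_clubs(edges, 1)  # node 6 is included: one club
```

Each returned sub-club can then be tested for significance on its own, in the same way as the whole club.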
So far, we have shown that a statistical test is necessary and that it does a good job of detecting rich clubs in complex networks. However, a story always has two sides. Consider the toy network shown in Fig. 5. The graph consists of two kinds of nodes identified with different colors: the degree of each white node is k = 1, while the red nodes are very "rich" and fully connected. It is evident that the rich-club coefficient of the red nodes is φ(k = 1) = 1, and one would say without any doubt that they form a rich club. Indeed, a qualitatively identical figure was used as an example of the presence of a rich club [21]. Surprisingly, this observation of φ(k = 1) = 1 does not ensure that the red nodes form a rich club in either the framework of ρ > 1 adopted by Colizza et al. [21] or the statistical test proposed in this work, since φ_i(k = 1) ≡ 1 for all maximally random networks. Hence, we have ρ(k = 1) = 1, which means that there is no rich-club ordering when k = 1. This conclusion contradicts our intuition.
We can generalize the discussion above by considering a network consisting of m rich nodes, which are linked to k_1, k_2, ..., k_m nodes of degree k = 1, respectively. Since each node with k = 1 has to be linked to a node with k > 1 to ensure the connectedness of the randomized network, the group of the m rich nodes has Σ_{i=1}^{m} k_i out-edges and E_{>1} edges among its members. The value of E_{>1} is therefore the same in all randomized networks. In other words, φ_ran(k = 1) = φ(k = 1) and ρ(k = 1) = 1. This class of artificial networks invalidates the sophisticated approach based on statistical tests.
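The counting argument behind this invariance can be spelled out as a short derivation. It is a sketch under the assumptions already stated in the text (degrees are preserved and the randomized network is connected). The symbols d_i (total degree of rich node i) and E'_{>1} (number of edges among the rich nodes after randomization) are introduced here for illustration and do not appear in the original.

```latex
\begin{align}
\sum_{i=1}^{m} d_i &= \sum_{i=1}^{m} k_i + 2E_{>1}
  && \text{(original network)} \\
\sum_{i=1}^{m} d_i &= \sum_{i=1}^{m} k_i + 2E'_{>1}
  && \text{(randomized network)} \\
\Rightarrow\quad E'_{>1} &= E_{>1}, \qquad
  \phi_{\rm ran}(k=1) = \frac{2E'_{>1}}{m(m-1)} = \phi(k=1), \qquad
  \rho(k=1) = 1 .
\end{align}
```

The first line counts the edge ends of the rich nodes: one for each attached degree-1 node and two for each edge inside the club. The second line holds because every degree-1 node must still attach to a rich node after randomization, so the same number of edge ends remains available for edges inside the club.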
The analysis presented here provides a more rigorous methodology for detecting rich clubs in complex networks, which allows us to understand this phenomenon on a solid basis. However, there exists a class of artificial networks with rich clubs on which the methods based on null models of maximally random networks fail. In this sense, the definition of the rich-club phenomenon remains an open problem.

Fig. 1. Rich-club coefficient φ as a function of the percentage g of nodes whose degree is larger than k, used for detecting rich-club ordering. The black circles are for the networks under investigation and the red squares are for the null models. For each simulated model network, the total number of nodes is 10^4 and the average degree is ⟨k⟩ = 6. The percentage of nodes with k > 1 is g = 77.1% for the protein interaction network, g = 65.9% for the Internet network, g = 86.6% for the scientific collaboration network, and g = 98.1% for the ER model; there is no node with k = 1 in the BA and MR networks.

Fig. 4 (beginning of caption missing). [...] g = 0.111% (k = 57), and g = 0.141% (k = 54). The lower panel shows the statistical analysis of the sub-clubs for different g. The blue markers in the left panel show the rich-club coefficient φ for all isolated sub-clubs with more than two nodes, while the red ones in the same panel are the associated φ_ran. The middle panel presents the ρ function and the right panel shows the corresponding p-values. It is observed that p < α for all g.