Anti‐triangle centrality‐based community detection in complex networks

Community detection has been studied extensively over the past decades, largely because communities exist in many kinds of networks, such as technological, social and biological networks. Most available algorithms, however, focus only on the properties of the vertices and ignore the roles of the edges. To explore the roles of the edges in community discovery, the authors introduce a novel edge centrality based on the antitriangle property. To investigate how this edge centrality characterises community structure, they develop an approach for community detection based on the edge antitriangle centrality with an isolated vertex handling strategy (EACH). EACH first calculates the edge antitriangle centrality scores of all the edges of a given network and then removes the edge with the highest score in each iteration until the scores of the remaining edges are all zero. Furthermore, EACH is parameter-free and does not depend on any additional measure to determine the community structure. To demonstrate the effectiveness of EACH, they compare it with state-of-the-art algorithms on both synthetic and real-world networks. The experimental results show that EACH is more accurate and has lower complexity for community discovery and, in particular, that it yields inherent and consistent communities with a maximal diameter of four jumps.


Introduction
A graph, or network, is a powerful tool for characterising the complex relations among a set of instances: each instance is taken as a vertex and each interaction between a pair of instances as an edge. Many complex systems can be modelled and analysed as complex networks, such as technological networks [1], social networks [2,3] and biological networks [4,5]. It has been shown that many real-world networks exhibit modules, or communities, that is, subgraphs with many edges connecting vertices of the same group and comparatively few edges joining them to outside vertices. The modules or communities reflect the topological relations between the elements of the underlying system and its functional entities. For example, genes belonging to the same group tend to share a homogeneous biological function, and people in the same social group have the same or similar backgrounds or hobbies. Thus, accurately extracting communities has considerable practical merit because it allows us to infer special and hidden relations among the vertices.
However, designing an efficient algorithm for identifying the communities in complex networks is still highly non-trivial, and each family of available algorithms has its drawbacks. The most popular algorithms, which maximise the modularity function [6,7], are criticised for a serious resolution limit problem [8]. The modularity density function solves the resolution limit problem very well [9]; however, it is still an additional measure used to determine the community structure. The methods based on non-negative matrix factorisation (NMF) [10,11] and spectral clustering (SC) [12,13] are supported by matrix theory, but both depend on a set of parameters. Among these parameters, the number of expected communities is the most important, since its choice directly affects the results on real-world networks. For other community detection algorithms, the reader can refer to the literature [14]. Among these algorithms, the centrality-based ones can make use of both vertex and edge information. Centrality can be thought of as a measure of the importance of the vertices or the edges in a complex network: the more important a vertex or an edge is, the larger its centrality. The essence of these approaches is to discriminate between the different roles of the vertices or the edges. For convenience, we call the edges connecting different communities outer links and the edges within the same community inner links.
One of the most famous centralities is edge betweenness [5,15], defined as the number of shortest paths between all pairs of vertices in a network that run through the given edge. However, the GN algorithm [5,15], which is based on edge betweenness, is criticised for two reasons: (i) computing the shortest paths between all pairs of vertices is expensive; and (ii) edge betweenness is sensitive to perturbations of the network. The edge clustering coefficient [16] was later proposed; it is defined as the ratio of the number of triangles to which a given edge belongs to the number of triangles that might potentially include it. The edge clustering coefficient decreases the complexity dramatically, but at the cost of accuracy. There are also several other centralities, including information centrality [17], closeness centrality [18] and k-path centrality [19].
However, none of them strikes a good balance between complexity and accuracy. This is the major motivation of this paper. We introduce a novel local edge centrality, called the edge antitriangle centrality, for community detection, and build on it an approach that combines the edge antitriangle centrality with an isolated vertex handling strategy (EACH). EACH can be used on large networks since it relies only on the local edge antitriangle centrality. It is parameter-free and does not depend on any prior measure to determine the community structure. To investigate the performance of the proposed centrality thoroughly, we make comparisons from three aspects: (i) we show the correlation between the edge antitriangle centrality and the edge betweenness, and the anticorrelation between the edge antitriangle centrality and the edge clustering coefficient; (ii) we compare the edge betweenness, the edge clustering coefficient and the proposed centrality on the accuracy of characterising the roles of the edges; and (iii) we compare EACH with the algorithm proposed by Girvan and Newman (GN), the algorithm based on the edge clustering coefficient (ECCA) [16], NMF, SC, the algorithm proposed by Clauset, Newman and Moore (CNM) [6] and the algorithm based on spectral maximisation of the modularity density (SpeMD) [20] on both synthetic and real-world networks.
The paper is organised as follows: Section 2 introduces the edge antitriangle centrality, Section 3 presents the details of the EACH algorithm, Section 4 shows the experimental results, and Section 5 presents the conclusions and discussions.

Edge antitriangle centrality
Before defining the edge antitriangle centrality, we introduce three terms that are used in the following sections: $P_4$ [21], the potential $P_4$ and the triangle.
A $P_4$, shown in Fig. 1a, is a simple path consisting of four vertices and three consecutive edges; most importantly, there is no circle among the four vertices. The potential $P_4$, shown in Fig. 1b, also consists of four vertices and three consecutive edges, but it is not necessarily simple: there may be circles among the four vertices. We emphasise that the potential $P_4$ in Fig. 1b is just one example and is not unique. By definition, every $P_4$ is a potential $P_4$, but not vice versa. A triangle, shown in Fig. 1c, consists of three vertices and three consecutive edges; it is therefore the simplest and most basic circle in complex networks.
The edge antitriangle centrality is defined as the ratio of the number of $P_4$ to which a given edge belongs to the number of potential $P_4$ that might include it. The definition rests on the observation that inner links belong to more potential $P_4$ but fewer $P_4$, whereas outer links belong to fewer potential $P_4$ but more $P_4$. The reasoning is as follows. The denser the edges are, the more circles they belong to. Intracommunity edges are denser than intercommunity ones in complex networks, so more triangles include the inner links than the outer links, since the triangle is the simplest circle. An edge $e_{ij}$ that is included in many triangles tends to be included in fewer $P_4$ for given degrees of its endpoints $i$ and $j$. Hence, we can regard $P_4$ as having the antitriangle property, as shown in Fig. 1d. Thus, more $P_4$ include the outer links than the inner links. Conversely, more potential $P_4$ include the inner links than the outer links, since a triangle is a potential $P_4$ by definition.
The edge antitriangle centrality can therefore be used to discriminate the outer links from the inner links for community detection. By definition, the score measures how likely an edge is to be an outer or an inner link: the larger the score of an edge, the more likely it is an outer link, and the lower the score, the more likely it is an inner link.
The antitriangle centrality contains two elements: the number of $P_4$ and the number of potential $P_4$. Given an edge $e_{ij}$, the centrality is

$$\frac{\mathrm{PN}_{ij}}{\mathrm{PPN}_{ij}}$$

where $\mathrm{PN}_{ij}$ is the number of $P_4$ and $\mathrm{PPN}_{ij}$ is the number of potential $P_4$. To get rid of the degeneracy, we slightly modify the centrality [the modified formula is missing from the source text].

To facilitate calculation, we denote the three consecutive edges of the potential $P_4$ as the left, the central and the right edge, respectively. Correspondingly, when we calculate $\mathrm{PPN}_{ij}$ and $\mathrm{PN}_{ij}$ we consider three cases, in which the given edge occupies the left, the central and the right position of the potential $P_4$, respectively. Let $\mathrm{PPN}^{l}_{ij}$, $\mathrm{PPN}^{c}_{ij}$ and $\mathrm{PPN}^{r}_{ij}$ be the numbers of potential $P_4$ with $e_{ij}$ as the left, the central and the right edge, respectively. Similarly, the counterparts for $P_4$ are denoted by $\mathrm{PN}^{l}_{ij}$, $\mathrm{PN}^{c}_{ij}$ and $\mathrm{PN}^{r}_{ij}$. The quantities $\mathrm{PPN}^{l}_{ij}$, $\mathrm{PPN}^{c}_{ij}$ and $\mathrm{PPN}^{r}_{ij}$ are defined [the three defining formulas are missing from the source text] in terms of the following: $NS(j)$ is the direct neighbourhood of $j$ minus $i$, $NS(i)$ is the direct neighbourhood of $i$ minus $j$, $l_n$ is an arbitrary vertex of $NS(j)$ or $NS(i)$, and $k_{l_n}$ denotes the degree of $l_n$. The essence of the calculation of $\mathrm{PN}^{l}_{ij}$, $\mathrm{PN}^{c}_{ij}$ and $\mathrm{PN}^{r}_{ij}$ is to distinguish $P_4$ from the potential $P_4$ in each case.
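The counts of potential $P_4$ can be written in closed form under one reading of the definitions above. The formulas below are a reconstruction offered as a sketch, not the authors' stated equations. They assume that a potential $P_4$ is any sequence of three consecutive distinct edges, that its two end vertices may coincide (so that a triangle counts as a potential $P_4$, as stated earlier), and that the total is the sum of the three cases.

```latex
\begin{aligned}
\mathrm{PPN}^{l}_{ij} &= \sum_{l_n \in NS(j)} \left( k_{l_n} - 1 \right), \\
\mathrm{PPN}^{c}_{ij} &= |NS(i)| \, |NS(j)| = (k_i - 1)(k_j - 1), \\
\mathrm{PPN}^{r}_{ij} &= \sum_{l_n \in NS(i)} \left( k_{l_n} - 1 \right), \\
\mathrm{PPN}_{ij} &= \mathrm{PPN}^{l}_{ij} + \mathrm{PPN}^{c}_{ij} + \mathrm{PPN}^{r}_{ij}.
\end{aligned}
```

As a check under these assumptions, take a bridge $e_{ij}$ joining two triangles: $k_i = k_j = 3$ and every vertex of $NS(i)$ and $NS(j)$ has degree 2, so the three counts are 2, 4 and 2, giving $\mathrm{PPN}_{ij} = 8$.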

EACH and complexity analysis
Without loss of generality, we only consider connected, undirected and unweighted networks, denoted by G = (V, E), where V is the set of all the vertices of the graph G and E is the set of all the edges. EACH repeatedly removes the edge with the highest edge antitriangle centrality score, one per iteration, until the scores of the remaining edges are all zero. The pseudocode of EACH is as follows:

    Input: the network G = (V, E)
    Output: the result communities
    Calculate the antitriangle centrality score for each available edge
    While the highest score ≠ 0 do
        Remove the edge with the highest score
        Recalculate the scores of those edges affected by the removal
    End
    Implement the isolated vertex handling strategy
    Output the vertices inside the non-trivial components as those of the result communities

Let us now analyse the complexity of EACH. First, we focus on the space complexity. The network G = (V, E) with |V| = N vertices and |E| = M edges can be stored as an M × 2 matrix. The edge antitriangle centrality of the M edges can be stored as an M × 1 matrix. Hence, the total space complexity of EACH is O(M).
Second, the time complexity of computing the edge antitriangle centrality of $e_{ij}$ is, for simplicity, $O(k^2)$, where $k$ is the average degree of the network G. At the first step of EACH, we calculate the scores of the M edges, and hence the cost is $O(k^2 M)$. Then, we recalculate the scores of the affected edges in each iteration, for at most T iterations, where T is the maximum number of iterations, and hence the cost is $O(k^4 T)$. The overall time complexity of EACH is therefore $O(k^2 M + k^4 T)$; the complexity of the isolated vertex handling strategy can be neglected since there are few isolated vertices in general. On sparse networks with a very low average degree, EACH is more efficient than the others. The space and the time complexities of the other state-of-the-art algorithms are listed in Table 1, where K is the number of communities, $T_1$ is the number of iterations used to search for the parameter of the SC and d is the depth of the hierarchy.
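The removal loop of the pseudocode can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the score is taken as the plain ratio PN/PPN found by brute-force enumeration, a $P_4$ is read as a path on four distinct vertices with no additional edge among them, an edge with no potential $P_4$ is given score 0, all scores are recomputed in every iteration instead of only the affected ones, and ties are broken arbitrarily. The names `score`, `remove_edges` and `components` are hypothetical.

```python
def score(adj, i, j):
    """Ratio PN/PPN for edge (i, j); adj maps each vertex to its neighbour set."""
    def induced(a, b, c, d):
        # a-b-c-d is a P4 only if its ends differ and no chord joins its vertices
        return a != d and c not in adj[a] and d not in adj[b] and d not in adj[a]

    pn = ppn = 0
    for a in adj[i] - {j}:              # central case: a - i - j - d
        for d in adj[j] - {i}:
            ppn += 1
            pn += induced(a, i, j, d)
    for u, v in ((i, j), (j, i)):       # left and right cases: u - v - l - m
        for l in adj[v] - {u}:
            for m in adj[l] - {v}:
                ppn += 1
                pn += induced(u, v, l, m)
    return pn / ppn if ppn else 0.0     # assumption: no potential P4 means score 0


def remove_edges(adj):
    """Remove the highest-scoring edge until every remaining score is zero."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    removed = []
    while True:
        scored = [(score(adj, u, v), (u, v))
                  for u in adj for v in adj[u] if u < v]
        if not scored:
            break
        best, (u, v) = max(scored, key=lambda t: t[0])
        if best == 0:
            break
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append((u, v))
    return adj, removed


def components(adj):
    """Connected components of the graph, as a list of vertex sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        result.append(comp)
    return result


# Two triangles (0,1,2) and (3,4,5) joined by the bridge (2,3).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
final, removed = remove_edges(graph)
```

On this toy graph the bridge has the highest score and is the only edge removed, after which every edge inside the two triangles scores zero and the loop stops.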

Details of EACH
EACH keeps removing edges until the edge antitriangle centrality scores of the remaining edges are all zero, which may leave isolated vertices. We emphasise that EACH does not need the number of expected communities to be fixed in advance, precisely because of this stopping rule. In effect, the condition that the scores of all available edges are zero takes the place of an additional measure for deciding the community structure; in other words, the edge antitriangle centrality itself plays the deciding role during the edge removal process. To deal with the isolated vertices, we use a very simple isolated vertex handling strategy.
Let $N_v$ be the direct neighbourhood of an arbitrary isolated vertex $v$ and $V_{NC}$ be the set of all the vertices of a non-trivial component NC. Then, we define the ratio $|N_v \cap V_{NC}| / |V_{NC}|$ as the measure [22] to quantify the closeness between $v$ and NC, where $|N_v \cap V_{NC}|$ is the number of vertices in NC connected with $v$ and $|V_{NC}|$ is the number of vertices in NC. If the closeness between $v$ and NC is larger than that between $v$ and every other non-trivial component, we select NC as the candidate component of $v$.
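The closeness ratio can be illustrated with a short Python sketch. It assumes that the neighbourhood of an isolated vertex is taken from the original network (before edge removal), that the non-trivial components are held fixed while the isolated vertices are assigned, and that a vertex whose closeness to every component is zero stays unassigned; none of these details is stated in the text. The function name `assign_isolated` is hypothetical.

```python
def assign_isolated(original_adj, comps):
    """Attach each isolated vertex to the non-trivial component it is closest to.

    original_adj: vertex -> set of neighbours in the network before edge removal.
    comps: list of vertex sets (the components after edge removal).
    Returns (communities, unassigned).
    """
    nontrivial = [set(c) for c in comps if len(c) > 1]
    isolated = [next(iter(c)) for c in comps if len(c) == 1]
    communities = [set(c) for c in nontrivial]
    unassigned = []
    for v in isolated:
        # closeness |N_v ∩ V_NC| / |V_NC| against each fixed non-trivial component
        closeness = [len(original_adj[v] & nc) / len(nc) for nc in nontrivial]
        best = max(range(len(nontrivial)), key=closeness.__getitem__, default=None)
        if best is None or closeness[best] == 0:
            unassigned.append(v)          # assumption: no link to any component
        else:
            communities[best].add(v)
    return communities, unassigned


# Vertex 6 was linked to 0 and 1 in the first component and to 3 in the second.
original = {0: {1, 2, 6}, 1: {0, 2, 6}, 2: {0, 1},
            3: {4, 5, 6}, 4: {3, 5}, 5: {3, 4}, 6: {0, 1, 3}}
comms, left = assign_isolated(original, [{0, 1, 2}, {3, 4, 5}, {6}])
```

Here vertex 6 has closeness 2/3 to the first component and 1/3 to the second, so it joins the first.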
In addition, we recalculate the edge antitriangle centrality scores of only a few edges in each iteration. For instance, after removing $e_{ij}$ we just need to recalculate the scores of the edges with at least one endpoint in the vertex set $N_i \cup N_j$.
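The set of edges to rescore after a removal can be collected as in the following sketch. It assumes that $N_i$ and $N_j$ are the neighbourhoods taken just before $e_{ij}$ is removed, so that $i$ and $j$ themselves fall inside $N_i \cup N_j$; the text does not say which moment is meant. The function name `affected_edges` is hypothetical.

```python
def affected_edges(adj, i, j):
    """Edges with at least one endpoint in N_i ∪ N_j, excluding (i, j) itself.

    adj is the adjacency (vertex -> neighbour set) before removing edge (i, j).
    Edges are returned as frozensets so that (u, v) and (v, u) coincide.
    """
    region = adj[i] | adj[j]
    touched = {frozenset((u, v)) for u in region for v in adj[u]}
    return touched - {frozenset((i, j))}


# Path 0-1-2-3-4-5-6-7: removing (3, 4) affects only the four nearby edges.
path = {v: {w for w in (v - 1, v + 1) if 0 <= w <= 7} for v in range(8)}
near = affected_edges(path, 3, 4)
```

On the path graph only the edges (1,2), (2,3), (4,5) and (5,6) need rescoring, which shows how local the update is on a sparse network.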

Experiments and analyses
We choose some widely used algorithms, namely GN, ECCA, NMF, SC, CNM and SpeMD, for comparison with EACH. The GN and the ECCA are selected because they are edge centrality-based algorithms. The NMF and the SC are based on matrix theory, and the CNM and the SpeMD are based on optimising additional measures to obtain the expected communities. To evaluate the proposed centrality thoroughly, we carry out three types of experiments: first, we investigate the relations among the edge betweenness, the edge clustering coefficient and the proposed centrality; then, we compare the three centralities on the accuracy of characterising the roles of the edges; finally, we compare the algorithms on community discovery. For convenience, we first list the details of the networks used in the experiments in Table 2: the LFR synthetic networks (SNs) [23], the Zachary karate club network (ZKCN) [24], the political blog network (PBN) [25], the gene regulatory network (GRN) [26], the bottlenose dolphins network (BDN) [27] and the football network (FN) [5,28]. The parameters of the LFR synthetic network are: average degree k = 15, mixing parameter mu = 0.5, minimum community size minc = 20 and maximum community size maxc = 50. Here, we set mu = 0.5 because 0.5 is the median of its range. In fact, except for mu, the other parameters are all the defaults of an example inside the original code (http://www.santo.fortunato.googlepages.com/inthepress2).
To quantify the accuracy of the algorithms on community discovery, we adopt three widely used criteria: the normalised mutual information (NMI) [29], the modularity function (Q value) [15] and the partition density (D value) [30].
Given two partitions $p_1$ and $p_2$ of a network, let A be the confusion matrix whose element $A_{ij}$ is the number of vertices inside community $i$ of the partition $p_1$ that are also inside community $j$ of the partition $p_2$. The NMI value $I(p_1, p_2)$ is defined as

$$I(p_1, p_2) = \frac{-2 \sum_{i=1}^{n_{p_1}} \sum_{j=1}^{n_{p_2}} A_{ij} \log\left( \frac{A_{ij} N}{A_{i\cdot} A_{\cdot j}} \right)}{\sum_{i=1}^{n_{p_1}} A_{i\cdot} \log\left( \frac{A_{i\cdot}}{N} \right) + \sum_{j=1}^{n_{p_2}} A_{\cdot j} \log\left( \frac{A_{\cdot j}}{N} \right)}$$

where $n_{p_1}$ ($n_{p_2}$) is the number of communities in the partition $p_1$ ($p_2$), $A_{i\cdot}$ ($A_{\cdot j}$) is the sum of the elements of A in row $i$ (column $j$), and N is the number of vertices. A larger value of NMI represents a greater similarity between $p_1$ and $p_2$.

The modularity [15] is defined as

$$Q = \sum_{i=1}^{K} \left[ \frac{l_i}{M} - \left( \frac{d_i}{2M} \right)^2 \right]$$

where K is the number of communities, $l_i$ is the total number of edges joining the vertices inside community $i$, M is the total number of edges in the network and $d_i$ is the sum of the degrees of all the vertices inside community $i$. The D value is defined in [30]; the higher the D value of a partition, the stronger its community structure.

The test networks for community detection consist of ten LFR networks and four practical networks. Here, the GN and the CNM are based on the tool NodeXL (http://www.nodexl.codeplex.com/). The ECCA is implemented by us; the NMF and the SC are based on the R packages NMFN [31] and clusterSim [32], respectively. SpeMD is based on the original code. For convenience, ECCA_Q denotes the ECCA with the Q value as the additional measure and ECCA_D denotes the ECCA with the D value as the additional measure. EAC denotes the same algorithm as EACH without its last step, that is, EAC has no isolated vertex handling strategy. The parameters of the LFR networks are the same as those of the synthetic network listed in Table 2, except for the mixing parameter, which here ranges over the ten networks from 0.1 to 1.0 in steps of 0.1. In Tables 3-6, we list the D value, the Q value, the NMI, the edge removal ratio (RR) and the number of obtained communities (NOC); there is no NMI in Table 6.
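The NMI and the Q value can be computed as in the Python sketch below. It assumes the standard forms of the two measures: the NMI built from the confusion matrix A with its row sums, column sums and N, and the modularity as the sum over communities of $l_i/M - (d_i/2M)^2$. Treating two single-community partitions as having NMI 1 is a convention chosen here to avoid a zero denominator. The function names `nmi` and `modularity` are hypothetical.

```python
from collections import Counter
from math import log


def nmi(labels1, labels2):
    """Normalised mutual information of two partitions given as label lists."""
    n = len(labels1)
    joint = Counter(zip(labels1, labels2))   # non-zero entries A_ij
    rows = Counter(labels1)                  # row sums A_i.
    cols = Counter(labels2)                  # column sums A_.j
    num = -2.0 * sum(a * log(a * n / (rows[i] * cols[j]))
                     for (i, j), a in joint.items())
    den = (sum(r * log(r / n) for r in rows.values())
           + sum(c * log(c / n) for c in cols.values()))
    return num / den if den else 1.0         # convention for two trivial partitions


def modularity(edges, labels):
    """Q value of a partition; labels[v] is the community of vertex v."""
    m = len(edges)
    inner = Counter()    # l_i: edges with both endpoints inside community i
    degree = Counter()   # d_i: sum of the degrees of the vertices in community i
    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            inner[labels[u]] += 1
    return sum(inner[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by one edge, split into the two triangles.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
q = modularity(edges, [0, 0, 0, 1, 1, 1])
same = nmi([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
```

For this partition each community has $l_i = 3$ and $d_i = 7$ with $M = 7$, so Q is 2(3/7 − 1/4); the NMI of a partition with a relabelled copy of itself is 1.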

Relations with the edge betweenness and the edge clustering coefficient
To explore the relations of the edge antitriangle centrality with the edge betweenness and the edge clustering coefficient, we calculate the correlation coefficients and the corresponding P-values on the synthetic and the real-world networks, as described in Table 7.
As shown in Fig. 3a, we plot the edge antitriangle centrality against the logarithm of the edge betweenness on the SN, a typical artificial network. The two centralities are positively correlated: the Pearson correlation coefficient is 0.6795 and the two corresponding types of P-values are both zero. This means that the edges with higher edge antitriangle centrality scores tend to have higher edge betweenness. As shown in Fig. 4a, we plot the edge antitriangle centrality against the edge clustering coefficient on the same network. There is a clear anticorrelation between these two centralities: the Pearson correlation coefficient is −0.8794 and the two corresponding types of P-values are also zero. Hence, the edges with higher edge antitriangle centrality scores tend to have lower edge clustering coefficient scores. The correlation between the edge antitriangle centrality and the edge betweenness, and the anticorrelation between the edge antitriangle centrality and the edge clustering coefficient, are inherent on various networks. Thus, the edge antitriangle centrality can be used for community detection in the same way as the edge betweenness and the edge clustering coefficient.

Accuracy on characterising the roles of the edges
Here, to compare the three centralities on the accuracy of characterising the roles of the edges, we use two important quantities. The first is the fraction of the vertices contained in the giant component, denoted by RGC [33]. A sudden decline of the RGC is observed if the network disintegrates after the deletion of a certain fraction of the edges. The second is the so-called normalised susceptibility [33], defined as

$$S = \sum_{s < s_{\max}} \frac{n_s s^2}{N} \qquad (8)$$

where $n_s$ is the number of components of size $s$, N is the size of the whole network and the sum runs over all the components except the largest one. When S is plotted as a function of the fraction $f$ of removed edges, an obvious peak can usually be observed, which corresponds to the precise point at which the network disintegrates [33,34]. We compare the three centralities on the networks used in Section 4.1.
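Both quantities follow directly from the component sizes, as the Python sketch below illustrates. It assumes that the graph is given as an adjacency dictionary and that, when several components tie for the largest size, only one of them is excluded from the sum; the text says only that the largest component is left out. The function names are hypothetical.

```python
def component_sizes(adj):
    """Sizes of the connected components, largest first."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def rgc(adj):
    """Fraction of the vertices contained in the giant component."""
    return component_sizes(adj)[0] / len(adj)


def susceptibility(adj):
    """Sum of s^2 over all components except one largest, divided by N."""
    sizes = component_sizes(adj)
    return sum(s * s for s in sizes[1:]) / len(adj)


# Components of sizes 4, 2 and 1 on N = 7 vertices.
g = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
```

For this graph the RGC is 4/7 and the normalised susceptibility is (2² + 1²)/7 = 5/7. Evaluating both after each edge removal gives the curves as a function of the removed fraction.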
As shown in Fig. 5, we compare the three centralities from the point of view of the RGC. The edge antitriangle centrality shows accuracy comparable with that of the edge betweenness and higher accuracy than the edge clustering coefficient on the four typical networks. As shown in Fig. 6, we compare them from the point of view of the normalised susceptibility. The results again demonstrate that the edge antitriangle centrality is comparable in accuracy with the edge betweenness, which in turn is more accurate than the edge clustering coefficient.

Community detection results
Owing to the length limit, the analyses of the synthetic networks and the social networks are given in the Supplementary Materials. Here, we show the main results on the GRN.

Gene regulatory network:
For the GRN taken from the literature [26], we remove the genes with no official name and ignore all the edge directions. A vertex indicates a gene and an edge indicates a regulatory relation between two genes. As described in Table 6, the D value and the Q value of EACH are 0.1285 and 0.7024, respectively. The D value of EACH is higher than that of the GN, and its Q value is close to that of the GN. The edge RR is just 37.42%, much lower than that of the GN. The isolated vertex handling strategy improves the Q value from 0.5676 to 0.7024 and reduces the number of communities (the modules in biological networks) from 714 to 72, which is the closest to the number obtained by the GN. As shown in Fig. 7, the largest module obtained by EACH, GN, EAC and SpeMD is the same one, comprising 353 genes. We analyse these 353 genes with the web tool Gene Trail Express [35]. Among them, 352 genes belong to the subcategory olfactory transduction, and the corresponding P-value is 0. These 352 genes are shown in green in Fig. 7; only the gene OR1D4 is not a member of the subcategory olfactory transduction. As shown in Fig. s3 (Supplementary Materials), the largest module obtained by the ECCA_D consists of 410 genes, of which only 352 (the green ones) belong to the subcategory olfactory transduction. The remaining 58 genes (the pink ones) evidently belong to modules different from that of the 352 genes, but the ECCA_D fails to separate them from the largest module. As shown in Fig. s4, the largest module obtained by the ECCA_Q consists of 932 genes, of which again only 352 (the green ones) belong to the subcategory olfactory transduction. The remaining 580 genes (the pink ones) evidently belong to different modules, but the ECCA_Q fails to separate them from the largest module.
In addition, there are intuitively obvious module structures inside the 580 genes, but the ECCA_Q cannot detect them further. For the NMF, we set the prior number of expected modules to 71, the same as that obtained by the GN; the largest module obtained by the NMF then consists of 123 genes. However, there are no regulatory relations among these genes. The largest module obtained by the CNM consists of 516 genes; however, there is no significant biological function among them. For the SC, we also set the prior number of expected modules to 71. Since the SC ran for over 180 h on this network without outputting any results, we stopped the R package. Regarding the overall results, the GN obtains 71 modules, 12 of which include only one gene. The EAC obtains 714 modules, too many of which include only one gene. EACH obtains 72 modules, the ECCA_Q 68, the ECCA_D 232, the SpeMD 69 and the CNM 25. To compare the results obtained by EACH with those of the other algorithms, we use the neighbourhood affinity score [36] to decide whether one module matches another. Among the 72 modules obtained by EACH, 53 match and 19 do not match those of the GN; 41 match and 31 do not match those of the ECCA_Q; 64 match and 8 do not match those of the ECCA_D; 27 match and 45 do not match those of the NMF and the CNM; and 59 match and 13 do not match those of the SpeMD. The common modules reveal the robustness of EACH and the particular ones reveal its novelty. We emphasise that two modules obtained by EACH do not match any module obtained by the other algorithms in this paper.
One of them consists of 115 genes and reveals no significant biological function, whereas the other consists of 27 genes, 17 of which belong to the subcategory Wnt signalling pathway with a P-value of $9.0 \times 10^{-22}$, as shown in Fig. 8. Hence, in general, EACH can obtain more meaningful and more compact communities in this network.

Advantages of EACH:
The systematic comparisons reveal several advantages of EACH. Firstly, the isolated vertex handling strategy within EACH improves the results significantly. Secondly, on most networks EACH is more accurate than the algorithms that do not depend on the prior number of communities. Thirdly, unlike the NMF, the SC and the SpeMD, EACH is free of parameters. In particular, it does not need the number of expected communities to be fixed in advance; the number is determined automatically during the edge removal process. Fourthly, unlike the ECCA, EACH does not depend on any additional measure to decide the community structure and, more importantly, it can obtain inherent and consistent communities. Fifthly, the complexity of EACH is significantly lower than that of the others. Finally, the communities obtained by EACH are more compact than those of the others, and the diameters of the communities are four jumps at most. Thus, EACH is more appropriate for networks with compact community structures.

Conclusions and discussions
In this paper, we propose a novel local edge antitriangle centrality and, based on it, an approach (EACH) for community detection. EACH is free of any parameters, including the prior number of expected communities, and does not depend on any additional measure to decide the community structure. We demonstrate that the novel local edge antitriangle centrality is appropriate for community detection, as the edge betweenness and the edge clustering coefficient are. We then test EACH and the other state-of-the-art algorithms on several synthetic and practical networks. The experimental results show that EACH is more efficient and accurate and, in particular, yields inherent and consistent communities with a maximal diameter of four jumps. Thus, EACH is more appropriate for networks possessing compact community structures.
Although EACH has outstanding properties, some problems still require further investigation. Firstly, the isolated vertex handling strategy used in this paper reduces the performance of EACH on the LFR networks when the mixing parameter mu ≥ 0.6. On the LFR networks, more isolated vertices are left as mu increases, and the isolated vertex handling strategy used in this paper cannot handle them very effectively. Therefore, seeking a better isolated vertex handling strategy deserves further research. Secondly, the edge antitriangle centrality is designed for undirected and unweighted networks. Next, we want to extend this centrality to directed and weighted networks. Finally, although the edge antitriangle centrality is developed for community detection, other uses can be sought.