Community Detection with Self-Adapting Switching Based on Affinity

Community structures in complex networks play an important role in understanding network function. Although various detection algorithms based on affinity or similarity exist, they share an obvious drawback: they perform well on strong communities but poorly on weak ones. Experiments show that algorithms based on a single affinity sometimes fail, especially for weak communities. We therefore design a self-adapting switching (SAS) algorithm in which weak communities are detected by a combination of two affinities. Compared with some state-of-the-art algorithms, the algorithm has competitive accuracy, and its time complexity is near linear. Our algorithm also provides a new framework for combining algorithms in community detection. Extensive computational simulations on both artificial and real-world networks confirm the potential capability of our algorithm.


Introduction
The continuing advance of network science plays a prominent role in deepening the understanding of complex systems in the real world [1][2][3]. Among others, one salient property commonly observed in many complex networks is the community structure, i.e., the organization of nodes in different groups, with many edges connecting nodes of the same group and comparatively fewer connections among nodes of different groups [4][5][6][7]. For instance, in a scientific citation network, communities are sets of scientific papers on the same topic or in a similar research field [8], while in protein-protein interaction networks, proteins working in the same biological process (or being in the same cellular component) interact with each other. Moreover, the community structure has been shown to have strong impacts on epidemic dynamics [9,10] and link prediction. Therefore, with the acquisition of real network data, one should pay careful attention to the community structure, which is of value to further investigations of complex networks.
For a deep understanding of the community structure, it is necessary to define what a community is. In general, there are three types of definitions: local definitions, global definitions, and definitions based on vertex similarity [6], including the definition based on modularity and the topological structure, such as the self-referring definition and the comparative definition [11]. However, few definitions describe the community structure quantitatively. In 2003, Radicchi et al. provided community definitions in both the strong and the weak sense with a quantitative description [12]: the subgraph $C$ is a community in a strong sense if

$$k_i^{\mathrm{in}}(C) > k_i^{\mathrm{out}}(C), \quad \forall i \in C, \tag{1}$$

and in a weak sense if

$$\sum_{i \in C} k_i^{\mathrm{in}}(C) > \sum_{i \in C} k_i^{\mathrm{out}}(C). \tag{2}$$

The above quantitative definitions mean that the inside degree of all, or most, nodes exceeds the outside degree, where the inside degree $k_i^{\mathrm{in}}(C)$ is the number of neighbors of node $i$ in the same community and the outside degree $k_i^{\mathrm{out}}(C)$ is the number of its neighbors in other communities. Thereafter, another quantitative definition was given by Hu et al. [11]: subnetworks (or subgraphs) $C_1, C_2, \ldots, C_m$ are said to be $m$ communities of a network (or graph) $G$ if and only if they satisfy $\bigcup_{l=1}^{m} C_l = G$ and, for any node $j \in C_{l_0}$, $l_0 \in \{1, 2, \ldots, m\}$, one has

$$\sum_{i \in C_{l_0}} A_{ij} \ge \sum_{i \in C_l} A_{ij}, \quad l = 1, 2, \ldots, m,$$

where $A$ is the adjacency matrix of the graph $G$. Unlike the consideration by Zhan et al. [13], we regard this definition as the generalized definition, since it allows the outside degree of a node to exceed its inside degree and only requires that each node has the largest number of neighbors in its own community. In this paper, we use this definition as our standard for community detection. Note that overlapping nodes are not considered: each node belongs to exactly one community in the detection result.
In order to accurately describe the quantitative relation between the inside and outside degrees of communities, Lancichinetti et al. introduced a mixing parameter $\mu_i$ for each node $i$ to denote that node $i$ shares a fraction $\mu_i$ of its links with external nodes and a fraction $1 - \mu_i$ with internal nodes, i.e., $k_i^{\mathrm{out}} = \mu_i k_i$ and $k_i^{\mathrm{in}} = (1 - \mu_i) k_i$, where $k_i$ is the degree of node $i$ [14,15]. In this paper, we consider the mixing parameter of each node to be less than 0.5 in strong communities and more than 0.5 in weak communities; both kinds of communities satisfy the definition of Hu et al.
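The community definitions and the mixing parameter discussed above can be illustrated with a minimal Python sketch. The function names, the adjacency-set representation, and the toy graph are illustrative assumptions, not part of the original work; the generalized check also assumes a non-strict comparison between a node's own community and the others.

```python
def inside_outside_degrees(adj, community_of):
    """Map each node to (k_in, k_out) under the given partition."""
    degrees = {}
    for node, neighbors in adj.items():
        k_in = sum(1 for nb in neighbors if community_of[nb] == community_of[node])
        degrees[node] = (k_in, len(neighbors) - k_in)
    return degrees


def mixing_parameter(k_in, k_out):
    """Fraction of a node's links that leave its community (mu_i)."""
    total = k_in + k_out
    return k_out / total if total else 0.0


def is_strong_community(members, degrees):
    """Strong sense: every member has more inside than outside links."""
    return all(degrees[v][0] > degrees[v][1] for v in members)


def is_weak_community(members, degrees):
    """Weak sense: total inside degree exceeds total outside degree."""
    return sum(degrees[v][0] for v in members) > sum(degrees[v][1] for v in members)


def satisfies_generalized_definition(adj, community_of):
    """Each node has at least as many neighbors in its own community
    as in any other community (non-strict comparison is an assumption)."""
    for node, neighbors in adj.items():
        counts = {}
        for nb in neighbors:
            counts[community_of[nb]] = counts.get(community_of[nb], 0) + 1
        own = counts.get(community_of[node], 0)
        if any(c > own for c in counts.values()):
            return False
    return True


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
degrees = inside_outside_degrees(adj, community_of)
mu_2 = mixing_parameter(*degrees[2])  # node 2 has one of its three links outside
```

In this toy graph every node has a mixing parameter below 0.5, so both triangles are strong communities and the partition also satisfies the generalized definition.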
There have been various kinds of algorithms designed for community detection. For example, the Kernighan-Lin algorithm, the spectral bisection method, the k-means clustering method, and the spectral clustering algorithm are traditional algorithms derived from graph theory or statistics. With the development of computers, large-scale computing is becoming widely available, so it is feasible to increase the computational complexity and the network scale. These advances enable researchers to develop many optimized algorithms, including the greedy algorithms based on modularity [16] and betweenness [4,17]. Meanwhile, some algorithms are based on dynamical methods [18][19][20][21][22][23] and on similarity or affinity [24,25]. However, ignoring the difference between strong and weak communities is a major drawback of some algorithms based on node affinity or similarity, which lowers their detection accuracy for weak communities. Thus, we design a self-adapting switching (SAS) algorithm based on a single affinity and a combination of two affinities.
The evaluation criteria for the performance of community detection can be determined by two kinds of approaches. One is to compute topology-based metrics, including the coverage, conductance, and modularity metrics. The other is to calculate knowledge-driven measurements, such as the precision metrics, the Jaccard index, and the normalized mutual information (NMI) [26]. We adopt the NMI index as the evaluation criterion for the performance of algorithms on some real-world networks, the Lancichinetti-Fortunato-Radicchi (LFR) benchmark networks (heterogeneous networks) [14], the Girvan-Newman (GN) benchmark networks (homogeneous networks) [4], and the nonuniform popularity similarity optimization (nPSO) benchmark networks (heterogeneous networks) [27]. Based on the results, we find that our algorithm has an advantage over some state-of-the-art algorithms and is more suitable for heterogeneous networks with a larger power-law exponent.
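As an illustration of the NMI criterion mentioned above, the following Python sketch computes NMI between a ground-truth partition and a detected partition. It assumes the common variant that normalizes the mutual information by the arithmetic mean of the two entropies; the text does not state which normalization is used, and the function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information, 2*I(A;B) / (H(A) + H(B)).

    The arithmetic-mean normalization is an assumption; other variants exist.
    """
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    h_true = -sum(c / n * math.log(c / n) for c in count_true.values())
    h_pred = -sum(c / n * math.log(c / n) for c in count_pred.values())
    mutual = sum(
        c / n * math.log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    if h_true + h_pred == 0:
        return 1.0  # both partitions consist of a single community
    return 2 * mutual / (h_true + h_pred)


same_up_to_relabeling = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Two partitions that differ only in label names score 1, while statistically independent partitions score 0.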
This paper is outlined as follows. In Section 2, we present the principle of our algorithm and discuss its complexity. Tests and results are presented in Section 3. Conclusions are summarized in Section 4.

Structural Analysis and Algorithm
In this section, we analyze the community structure, design the affinity-based SAS algorithm for community detection, and finally discuss its complexity.

2.1. The Analysis of Community Structure.
Some studies indicate that node degrees in real-world networks generally obey a power-law distribution [28,29] or a log-normal distribution [30], where the nodes with large degree are known as hub nodes and have strong degree centrality, as in the network in Figure 1. Although the number of hub nodes in real-world networks is relatively small, their vital roles in communities and networks have been repeatedly noted in the literature [13,31,32]. The identification of the hub nodes is usually considered the breakthrough point for heuristic algorithms. In these algorithms, a single affinity is often insufficient for community detection, especially for weak communities. Therefore, we design a new algorithm that combines two affinities in the detection of weak communities.
As is well known, the ultimate aim of community detection algorithms based on affinity or modularity is to find the global maximum of such indices and to guarantee the minimum number of connections between different communities. Both are NP-hard (nondeterministic polynomial-time hard) problems. Putting these problems aside, our algorithm is heuristic, and its detection process is based on the affinity between the node under detection and the set of nodes already detected, rather than between two single nodes. Motivated by different affinity indices, i.e., the common neighbors (CN), hub depressed (HD), and hub promoted (HP) indices summarized by Zhou et al. [33], we provide two definitions of affinity between a node $j$ and a node set $P$ as follows; some important notations are shown in Table 1.
The first affinity $s^{(j)}_P$ and the second affinity $S^{(j)}_P$ are both defined between any node $j$ and node set $P$, the latter in terms of $k_j$, the degree of node $j$. These two affinities have different emphases: the first focuses on the absolute number of common neighbors, whereas the second is the relative affinity. Our heuristic algorithm starts from the hub node and then detects other nodes belonging to the same community based on these affinities.
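The two affinities can be sketched in Python under an explicit assumption: the first affinity is taken to be the number of neighbors of node $j$ that lie in $P$, and the second affinity is that count divided by $k_j$. These forms are a plausible reading of the description above (absolute versus relative, with $k_j$ the degree), not formulas confirmed by the text, and the function names are hypothetical.

```python
def first_affinity(adj, j, P):
    """Assumed form of s_P^(j): number of neighbors of j that belong to P."""
    return len(adj[j] & P)


def second_affinity(adj, j, P):
    """Assumed form of S_P^(j): the first affinity divided by the degree k_j."""
    k_j = len(adj[j])
    return len(adj[j] & P) / k_j if k_j else 0.0


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
hub = 2
P = {hub} | adj[hub]  # the hub node together with its neighbors

s_0, S_0 = first_affinity(adj, 0, P), second_affinity(adj, 0, P)
s_3, S_3 = first_affinity(adj, 3, P), second_affinity(adj, 3, P)
```

Under this reading, node 0 (inside the hub's triangle) has all of its links inside $P$, whereas node 3 (in the other triangle) has only one of its three links inside $P$, so a threshold of 0.5 on the second affinity separates them.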
Generally, the affinities between nodes in one community are larger than those between nodes in different communities, but this is sometimes hard to satisfy, especially for weak communities. To illustrate this point, we calculate the first and second affinities between the hub node and its neighbors in LFR benchmark graphs. First, the second affinity of the hub node's neighbors in the same community and in other communities is shown in Figure 2(a). We find that the second affinities of these two groups differ clearly in strong communities, but they are mixed together in weak communities when $\mu > 0.5$. We conduct a similar experiment on the weak communities with the first affinity to observe its ability to distinguish the two groups. Since the first affinity is the absolute number of common neighbors, we normalize it and consider only its normalized form $c^{(j)}_{N_i}$ in Figure 2(b), where $i$ denotes the hub node, $N_i$ its neighbor set, and node $j$ a neighbor of the hub node.
From the statistical results, we find that the second affinity distinguishes the two groups effectively for strong communities. However, it is not sufficient to detect weak communities and needs to work together with the first affinity. Moreover, the detection method for strong communities is not suitable for weak communities and may detect many communities composed of several nodes or even a single node, which can serve as the trigger for the switching condition in our SAS algorithm. Our algorithm is therefore divided into two parts, which we call SAS-1 and SAS-2 for short. Next, we describe the algorithm and its principle in detail.

The Algorithm.
Here, we introduce the two parts of our algorithm, including their core principles and pseudocode, and then analyze its complexity. Some important notations are also shown in Table 1.

The Strong Community Method SAS-1.
In this method, each detected community, together with its nodes and their edges, is deleted from the network once its detection ends. We therefore denote the network by $G_m = (V_m, E_m)$ after the $(m-1)$th ($m > 1$) community has been detected, where $V_m$ and $E_m$ are the sets of nodes and edges, respectively. To describe the algorithm generally, we take the detection of the $m$th community as an example.
The first step: at step $t = 1$, the method selects as the hub node the node $i \in V_m$ whose degree is maximal in $G_m$. At this step, the hub node $i$ and those of its neighbors $j \in N_i$ that satisfy $S^{(j)}_P \ge 0.5$ are the detected nodes belonging to $C_m(1)$, where the node set $P$ consists of the node $i$ and its neighbors.
The second step: at step $t = 2$, the method searches the nodes in $C_m(1)$, and a neighbor $j$ of these nodes is added to the community if and only if it satisfies condition (7), in which the threshold value 0.5 follows from the definition of the strong community. The community is then updated from $C_m(1)$ to $C_m(2)$.
The $t$th step: similarly, when $t \ge 3$, in order to reduce the complexity, the method searches only the nodes in $C^{\mathrm{new}}_m(t)$ and examines only their undetected neighbors. Then, neighbor $j$ is added to $C_m(t)$ if and only if it satisfies condition (8), and the community $C_m(t)$ is updated.
The detection process of the $m$th community finishes when no nodes satisfy condition (8).

Table 1: Notation.
Notation | Meaning
$|A|$ | The number of elements in set $A$
$C_m(t)$ | The set of nodes detected at the $t$th step that belong to the $m$th community, where $t$ is the detection step
$C^{*}_m$ | The set of nodes that belong to the $m$th community when $C_m(t)$ no longer changes
$C^{\mathrm{new}}_m(t)$ | The newly detected nodes that belong to the $m$th community at step $t$
$s^{(j)}_P$ | The first affinity between node $j$ and $P$, where $P$ is a set of nodes
$S^{(j)}_P$ | The second affinity between node $j$ and $P$
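The SAS-1 procedure described in this subsection can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the second affinity is taken to be the fraction of a node's neighbors lying in the reference set, conditions (7) and (8) are read as "second affinity with the current community is at least 0.5", degrees are measured in the current (reduced) network, and all function names are hypothetical.

```python
def detect_strong_community(adj, threshold=0.5):
    """Grow one community from the maximal-degree node of the current network."""
    hub = max(adj, key=lambda v: len(adj[v]))
    P = {hub} | adj[hub]
    # Step 1: the hub plus its neighbors whose assumed second affinity to P is >= threshold.
    community = {hub} | {
        j for j in adj[hub] if len(adj[j] & P) / len(adj[j]) >= threshold
    }
    # Step 2 scans all current members; later steps scan only newly added nodes.
    frontier = set(community)
    while frontier:
        candidates = set().union(*(adj[v] for v in frontier)) - community
        new_nodes = {
            j for j in candidates
            if len(adj[j] & community) / len(adj[j]) >= threshold
        }
        community |= new_nodes
        frontier = new_nodes
    return community


def remove_community(adj, community):
    """Delete the detected nodes and their edges from the network."""
    return {v: nbrs - community for v, nbrs in adj.items() if v not in community}


def sas1(adj):
    """Detect communities one by one, shrinking the network after each detection."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    communities = []
    while adj:
        community = detect_strong_community(adj)
        communities.append(community)
        adj = remove_community(adj, community)
    return communities


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
communities = sas1(adj)
```

On this toy graph the sketch recovers the two triangles. The switching test that would interrupt this loop for weak communities is deliberately left out here.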

The Weak Community Method SAS-2.
From the results in Figure 2, we can infer that the method SAS-1 may detect many communities composed of several nodes or even a single node when the communities are weak. Hence, the algorithm needs a self-adapting switching condition that reflects this phenomenon and makes it switch from SAS-1 to SAS-2. Our method is to calculate the average scale of the communities that have been detected, and the switching condition between the two methods is given by

$$\frac{\left|\bigcup_{m=1}^{n_c} C^{*}_m\right|}{n_c} \le \beta,$$

where $\beta = O(p\langle k \rangle)$, in which $p = 0.05$ and $\langle k \rangle$ is the average degree, and $n_c$ ($\ge 1$) is the current number of detected communities. In fact, few neighbors of the hub node in a weak community can satisfy condition (7), so the average scale of the communities detected by SAS-1 is of the same order as $p\langle k \rangle$; the parameter $p$, derived from a hypothesis test, is a small incidence rate. Once SAS-1 triggers the switching condition, the algorithm switches to SAS-2 and redetects the network. Unlike SAS-1, this method does not delete any nodes or edges from the network, because the recognition of weak communities depends on the whole structure of the network. In the following, we again introduce the method by taking the detection of the $m$th community as an example.
The first step: at step $t = 1$, the method selects as the starting node the node $i$ with the maximal degree among the nodes that do not belong to the other communities $C^{*}_1, \ldots, C^{*}_{m-1}$. Obviously, we have $C_m(1) = \{i\}$ after confirming the starting node.
The second step: at step $t = 2$, the method chooses a neighbor $j$ of the hub node that does not belong to the other communities $C^{*}_1, \ldots, C^{*}_{m-1}$ as a member of the $m$th community if and only if it satisfies a condition with threshold $c$, where $c$ is based on the average value of $c^{(j)}_{N_i}$ in Figure 2; then $C_m(2)$ is updated.
The $t$th step: when $t \ge 3$, similar to the method SAS-1, this method searches only the nodes in $C^{\mathrm{new}}_m(t)$ and examines only their undetected neighbors. Then neighbor $j$ is added to $C_m(t)$ if and only if it satisfies the corresponding condition.
The termination condition of the $m$th community is meant to separate the undetected nodes with lower affinity from the detected nodes, which have higher affinity with each other. We assume that the detection of the $m$th community stops when, at step $t = t_0$, no node $j$ satisfies the termination condition, in which node $j$ is one of the undetected neighbors of the nodes in $C^{\mathrm{new}}_m(t_0 - 1)$, node $j'$ belongs to $C_m(t_0 - 1)$, and the parameter $\rho \in (0, 1)$ is used to cut the community from the network. The algorithm pseudocode is shown in Algorithm 1, and its process structure is shown in Figure 3.
Last, we analyze the algorithm complexity. In the method SAS-1, the detection process is conducted in every community, so we take the average number of steps per community to be $t_a$. The complexity of searching and filtering for each node by condition (8) scales as $O(\langle k \rangle^2)$. As communities are detected, the number of nodes is reduced, so the extreme complexity is about $O(\langle k \rangle^2 t_a n_c)$, where $n_c$ is the number of detected communities and $\langle k \rangle$ is the average degree. In the method SAS-

Algorithm 1: Community detection with self-adapting switching.
(1) Input: adjacency matrix of the network.
(2) while $\left|\bigcup_{m=1}^{n_c} C^{*}_m\right| / n_c \le \beta$
(3) Select a node as the hub node such that its degree is maximal in the current network.
(9) Repeat step 8 until no nodes satisfy the condition on $S^{(j)}$, then start the next detection.
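The switching test between SAS-1 and SAS-2 can be illustrated with a short Python sketch. It reads the condition as "the average size of the communities detected so far is at most β", which is how the prose describes the trigger; the direction of the inequality, the function name `should_switch`, and the example values are assumptions for illustration.

```python
def should_switch(communities, beta):
    """Return True when the average detected community size is at most beta.

    `communities` is a list of node sets found so far by SAS-1.
    """
    n_c = len(communities)
    if n_c == 0:
        return False  # nothing detected yet, so there is no average to test
    covered = set().union(*communities)
    return len(covered) / n_c <= beta


# Many tiny communities suggest that SAS-1 is fragmenting weak communities.
fragmented = [{1}, {2, 3}, {4}]
well_formed = [{0, 1, 2}, {3, 4, 5}]
switch_on_fragmented = should_switch(fragmented, beta=2)
switch_on_well_formed = should_switch(well_formed, beta=2)
```

With β = 2, the fragmented result (average size 4/3) triggers the switch, while the two triangles (average size 3) do not.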

Results
Figure 3: Process structure of the SAS algorithm (caption fragment). After the end of one community detection, the nodes and edges of this community are deleted from the network, and new network data are provided for the following detection. The red area on the right is part of the algorithm SAS-2. (c-d) Once the algorithm satisfies the switch condition, SAS-2 uses the combination of two affinities to detect communities one by one. (e) Unlike SAS-1, SAS-2 does not delete communities that have been detected.
In this section, some experiments are performed on both real-world networks (the karate club network, the dolphin network, the football team network, and the political books network) and synthetic networks (LFR, GN, and nPSO benchmark graphs). First, we use the GN benchmark to estimate suitable parameter ranges and analyze parameter sensitivity. The SAS algorithm relies on three parameters $\beta$, $c$, and $\rho$, where the choice of $\beta$ is related to the average degree $\langle k \rangle$. The parameters $\rho$ and $c$ can be freely selected in the range (0, 1). The GN benchmark has 128 nodes and a relatively concentrated degree distribution, so it is suitable for parameter sensitivity analysis. Because its average degree $\langle k \rangle$ is 16, we set $\beta = 2, 3, 4, 5$ and mainly study the sensitivity to the parameters $\rho$ and $c$. Based on the results in Figure 4, we find that the results are insensitive to the parameters $\beta$ and $c$. However, changes in $\rho$ have an obvious influence on the results when $Z_{\mathrm{out}} > 4$. Fortunately, when $0.3 < \rho < 1$, all the detected results are stable and do not fluctuate widely.
In practice, the parameter $c$ should be close to 1 to ensure the accuracy of the initially detected nodes. The parameter $\rho$, with $0.3 < \rho < 1$, should be increased appropriately as the clustering coefficient increases. Then, we evaluate the advantages and disadvantages of our algorithm compared with other state-of-the-art algorithms: Infomap, LPA, Louvain, Walktrap, Fast greedy, EM, and Blondel. The performance comparison on real-world networks, shown in Table 2 and Figure 5, confirms its potential capability. It is worth mentioning that some community divisions differ slightly from the ground truth. A possible reason is that a more detailed division of communities increases the number of communities; nevertheless, the results at least satisfy the quantitative definition used in this article and have a good accuracy rate.

The LFR Benchmark.
In this part, LFR networks have two different scales, 1000 and 5000 nodes, as presented in Figure 6. For each kind of network, we consider two different community sizes, indicated by the letters S and B, where S stands for "small" communities that have about 10 to 50 nodes and B stands for "big" communities that have about 20 to 100 nodes [15]. In Figure 6, our algorithm is tested on the four types of networks by NMI with $\mu \in [0.1, 0.8]$. For both strong and weak communities, our algorithm performs better than some of the algorithms in Table 2.

The GN Benchmark.
Beyond that, we test the SAS algorithm on the GN benchmark network, with the results shown in Figure 7, where each point is again tested on 100 networks of the same kind. The performance of the SAS algorithm is as good as that of the other algorithms in Table 2. It is well known that the LFR benchmark is a kind of heterogeneous network whose degree distribution follows a power law. For the GN benchmark, however, the degree distribution follows the normal distribution, and the role of hub nodes is weakened. The heterogeneity of the network structure may therefore affect the accuracy of our algorithm. Next, we use the nPSO benchmark to further analyze the performance of our algorithm.
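For readers unfamiliar with the GN benchmark, the following Python sketch generates one instance. It assumes the standard Girvan-Newman construction, which the text does not spell out: 128 nodes in 4 equal groups of 32, with edges placed independently so that each node has on average $Z_{\mathrm{in}}$ links inside its group and $Z_{\mathrm{out}}$ links outside, where $Z_{\mathrm{in}} + Z_{\mathrm{out}} = 16$. The function name and its parameters are hypothetical.

```python
import random


def gn_benchmark(z_out, n_groups=4, group_size=32, avg_degree=16, seed=None):
    """Generate a GN-style planted-partition graph as adjacency sets.

    Standard construction (an assumption here): edge probabilities are chosen
    so that the expected inside degree is avg_degree - z_out and the expected
    outside degree is z_out.
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    z_in = avg_degree - z_out
    p_in = z_in / (group_size - 1)
    p_out = z_out / (n - group_size)
    community_of = {v: v // group_size for v in range(n)}
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community_of[u] == community_of[v] else p_out
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj, community_of


adj, community_of = gn_benchmark(z_out=4, seed=1)
mean_degree = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
```

Because all nodes have the same expected degree, no node stands out as a hub, which is consistent with the remark above that the role of hub nodes is weakened in this benchmark.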

The nPSO Benchmark.
Recently, a new network generative model named nonuniform popularity similarity optimization (nPSO) has been proposed for the evaluation of community detection and link prediction. It can create synthetic networks with controlled parameters: network scale, average degree, community number, power-law exponent, and temperature. It allows one to tune the mixing property of networks through the temperature. In particular, this model simulates how random geometric graphs grow in the hyperbolic space, generating realistic networks with clustering, small-worldness, scale-freeness, and rich-clubness. In this part, we generate nPSO hyperbolic networks with communities using these parameters: $N = [100, 500, 1000]$ (network size), $\langle k \rangle_{0.5} = [4, 8, 10]$ (half of the average degree), $T = [0.1, 0.3, 0.5, 0.7]$ (temperature, inversely related to the clustering coefficient), $n_c = [3, 6, 9]$ (number of communities), and $\gamma_{\mathrm{nPSO}} = [2, 3]$ (power-law degree distribution exponent). We also compare the SAS algorithm with state-of-the-art community detection algorithms. From the results in Figures 8-10, we find that the performance of the SAS algorithm is not sensitive to changes in the parameters $N$, $\langle k \rangle_{0.5}$, and $n_c$. However, it performs well in the heterogeneous network with $\gamma_{\mathrm{nPSO}} = 3$ but only moderately with $\gamma_{\mathrm{nPSO}} = 2$. This indicates that our algorithm may be more suitable for heterogeneous networks with a larger power-law exponent. Combining all the detection results, we can see that the SAS algorithm has some advantages over other state-of-the-art algorithms, and its accuracy ranks high among those algorithms in some benchmarks. The near-linear time complexity is also an advantage of our algorithm.

Conclusions
In this paper, the performance of the SAS algorithm is evaluated against some state-of-the-art algorithms on real-world networks as well as on three benchmark graphs traditionally used in the literature. First, experimental results show that it is feasible to use different affinities for strong and weak communities. Our algorithm improves the detection accuracy for weak communities compared with some algorithms based on a single affinity, and it has the same reliability as some state-of-the-art algorithms. Second, some heuristic algorithms based on hub nodes may need to analyze the network degree distribution or clustering coefficient in advance to improve their accuracy. The weakening of the role of hub nodes may be the reason why our algorithm performs poorly on the nPSO benchmark with power-law exponent 2 but performs well on the LFR benchmark and on the nPSO benchmark with power-law exponent 3. This is also an important direction for improving the algorithm in the future. Last, our definitions of affinity are based on the concept of common neighbours. Recently, a new paradigm for defining affinities has emerged that not only uses the information associated with the number of common neighbours but also considers (and integrates) the information associated with the links that occur between the common neighbours. The union of common neighbours and their cross-links is named the local community, and the redefinition of common-neighbour affinities in terms of local communities has been shown to significantly boost link prediction in both monopartite and bipartite networks. If the SAS algorithm adopts affinities based on the local community paradigm instead of the simple common neighbours paradigm, we conjecture that this innovation may make our algorithm more suitable for heterogeneous networks with a smaller power-law exponent.

Data Availability
Previously reported data were used to support this study and are available at Mark Newman's network data page (see http://www-personal.umich.edu/~mejn/netdata/), and the LFR benchmark procedure is available at https://github.com/eXascaleInfolab/LFR-Benchmark_UndirWeightOvp#changelog. The original authors have already made the data freely available. These prior studies (and datasets) are cited at relevant places within the text as references [4,15].

Conflicts of Interest
The authors declare that no conflicts of interest exist in the publication of this paper.