A Method for Community Detection of Complex Networks Based on Hierarchical Clustering

Due to the development and popularization of Internet, there is more and more research focusing on complex networks. Research shows that there exists community structure in complex networks. Finding out community structure helps to extract useful information in complex networks, so the research on community detection is becoming a hotspot in recent years. There are two remarkable problems in detecting communities. Firstly, the detection accuracy is normally not very high; Secondly, the assessment criteria are not very effective when real communities are unknown. This paper proposes an algorithm for community detection based on hierarchical clustering (CDHC Algorithm). CDHC Algorithm firstly creates initial communities from global central nodes, then expands the initial communities layer by layer according to the link strength between nodes and communities, and at last merges some very small communities into large communities. This paper also proposes the concept of extensive modularity, overcoming some weakness of modularity. The extensive modularity can better evaluate the effectiveness of algorithms for community detection. This paper verifies the advantage of extensive modularity through experiments and compares CDHC Algorithm and some other representative algorithms for community detection on some frequently used datasets, so as to verify the effectiveness and advantages of CDHC Algorithm.


Introduction
Complex networks are generally networks having lots of nodes and connections, such as Internet, citation network, and social network [1][2][3]. Due to Internet's development and popularization, there is more and more research on complex networks in recent years. Research shows that there exists community structure [4] in complex networks. Connections within a community are dense, while connections between different communities are rare [5]. Generally, a community consists of nodes having similar properties, so finding communities helps to mine some useful information of complex networks, for example, finding a group of people having common interests. How to detect communities in complex networks is a hotspot in recent years.
Newman and Girvan [6] put forward the conception of modularity to measure the quality of algorithms for community detection. The modularity is a number less than 1.
The bigger is, the better quality is. Modularity has become the most widely used assessment criterion for community detection due to two reasons when we do not know the real communities in networks. Firstly, modularity is very convenient to use. Secondly, it seems that the result is better when the modularity is bigger in most cases. Many algorithms [7][8][9] detect communities by maximizing modularity. As it is impossible to calculate the modularity for every possible case, such kind of algorithms usually adopts greedy strategy to find the case almost maximizing modularity.
As modularity becomes more and more popular, people begin to doubt its effectiveness. Fortunato and Barthélemy [10] said algorithms based on modularity are not able to detect communities whose size is less than an inherent size depending on the number of networks' edges. It is called by Fortunato the resolution limit of modularity. We find out by experiment that the quality is not necessarily better when the modularity is bigger. Maximizing modularity tends to 2 International Journal of Distributed Sensor Networks detect more communities than the real situation, because the upper boundary of modularity is bigger when we detect more communities. In order to overcome this problem, this paper proposes the conception of extensive modularity on the basis of modularity. The upper boundary of extensive modularity is independent of the number of communities that we detect. We find out that the result maximizing extensive modularity is better than that maximizing modularity.
For now, there are many algorithms having advantages and disadvantages for community detection. Algorithms based on graph theory appear early, and their main idea is to divide the network into such parts of almost the same size satisfying that connections in the same part are dense and connections between different parts are rare. Kernighan-Lin Algorithm [11] and bisection of spectrum based on Laplace eigenvalues [12] are representative algorithms based on graph theory. Generally speaking, it is hard to know the number of communities ahead of time, but algorithms based on graph theory need the number, so the accuracy is very low. Algorithms based on hierarchical clustering are common to see for now. This kind of algorithms decomposes datasets hierarchically according to a certain method until it satisfies some certain conditions. According to the principle of classification, algorithms based on hierarchical clustering can be divided into two parts, which are aggregation algorithm and splitting algorithm. GN Algorithm [6] is a representative splitting algorithm, while CNM Algorithm [13] and Newman Fast Algorithm [14] are representative aggregation algorithms. GN and CNM can usually obtain large value of modularity, while the accuracy is not necessarily high. LCV Algorithm [15] detects communities by finding local central vertices. EACH Algorithm [16] detects communities by calculating the edge antitriangle centrality with the isolated vertex handling strategy. LCV and EACH are both algorithms based on hierarchical clustering. Besides, there are also other kinds of algorithms.
The rest of this paper is organized as follows: Section 2 introduces some definitions including central nodes, nodes' similarity, link strength between a node and a community, modularity, and extensive modularity; Section 3 describes detailed steps of the algorithm proposed by this paper; Section 4 states experiment results; the last section is our conclusion.

Definitions for Community Detection
Notation of Main Symbols gives the meaning of the main symbols that we will use in the whole paper.

Central Node.
Central nodes include global central nodes and local central nodes. In a network, nodes with maximal degree are called its global central nodes, in which the value of can be determined according to our needs. If a node's degree is bigger than the degree of all its adjacent nodes, we call it local central nodes.
Global central nodes play an important role in the network and are probably important nodes of communities. So global central nodes help to find communities.
in which ( ) = Adj( ) ∪ { }. The bigger Sim( , ) is, the bigger similarity of node and node is. If two nodes have a large similarity, they are probably in the same community. Paper [16] defines the similarity of node and node as This definition has some disadvantages. Firstly, calculating the union set consumes more time than calculating the sum of two numbers. Secondly, we can know intuitively that the similarity 1 of node 2 and node 4 in Figure 1 should be bigger than the similarity 2 of node 2 and node 4 in Figure 2.

Link Strength between a Node and a Community.
We define the link strength between node and community ( ∉ ) as International Journal of Distributed Sensor Networks 3 We multiply [ ][ ] by 2, because a direct link between node and community is stronger than an indirect link. If the link strength of node and community is large, node should be probably in community .

Modularity .
The modularity is defined as is a number less than 1. The bigger is, the better algorithm's quality is. The quality is generally good when is between 0.3 and 0.7. is difficult to exceed 0.7 in the real situation.

Extensive Modularity .
We define extensive modularity as Let us analyze the upper boundary of modularity: Through (6), we can know that 1 − 1/ is a more accurate estimation of the upper boundary of , which depends on the number of communities that we detect. If we detect only 2 communities, the upper boundary of is 0.5, so is difficult to exceed 0.4. If we detect 5 communities, the upper boundary of is 0.8, so is relatively easy to reach 0.4 or even 0.5. Even if is larger in the latter case than that in the former case, the quality is not always better. Therefore, we define the extensive modularity to eliminate this problem. According to our definition, the upper boundary of extensive modularity is 1, which does not depend on the number of communities that we detect. The bigger is, the better quality is. We will explain the advantages of extensive modularity through experiments in the following paragraphs.

Algorithm Description
One city has one or several centers and is expanded layer by layer around the centers. The closer to the center the layer is, the denser the connections to the layer are. The outer layer is connected mainly by itself and inner layers. Inspired by the hierarchical structure of cities, we consider that a community should have one or several global central nodes and should be expanded layer by layer around the global central nodes. The nodes in layer are mainly connected by nodes in layer and layer − 1. According to such community structure, we propose a method for community detection based on hierarchical clustering (CDHC). The process of CDHC Algorithm includes initializing communities, expanding communities, merging small communities and choosing the best result.
We suppose the network is connected, which means we can reach any node from any node through some edges in the network.

Initialize Communities.
The process of initializing communities can be divided into three steps, which are sorting nodes, choosing global central nodes, and merging global central nodes into communities.
Step 1. Sort all the nodes by degree in descending order.
Step 2. Choose nodes with maximal degree as global central nodes of the network. The value of is determined by the following equations: The idea is that we want to choose a small number of global central nodes to reduce the quantity of calculation and the global central nodes include the important nodes of each community. If there are only a small number of nodes with large degree and the degree of most nodes is small, we generally have 1 > 2 , so = 1 . However, it is probable that the degree of all nodes is very close, and we have to choose most nodes as global central nodes. In this case, we should take = 2 .
Step 3. Initialize the first community, assign the node with maximal degree to the first community, and then mark it as the first community's central node. For each node V of the remaining −1 global central nodes, we calculate its similarity with each initialized community's central node. If there exists a similarity bigger than some threshold , assign the node V to the community maximizing the similarity; otherwise, initialize a new community, assign the node V to the new community, and mark the node V as the new community's central node.
Algorithm 1 shows the process of initializing communities.

Expand Communities.
After finishing the process of initializing communities, we need expand these communities. The process of expanding communities includes marking nodes' level and calculating link strength.
Step 1. Mark all the global central nodes of the network as the first level. If the node V is connected to a node of the first level and its level is not marked, mark its level as two. node V is connected to a node of level and its level is not marked, mark its level as + 1. As we suppose the network is connected, all the nodes' level can be marked if we repeat the previous operation.
Step 2. Nodes of the first level have already been assigned to communities. For each node V of level two, we calculate the link strength between the node V and each community . Choose the community maximizing the link strength and then assign the node V to it. By parity of reasoning, for each node V of level, we repeat the previous operation, until every level's nodes have been assigned to communities. Algorithm 2 shows the process of expanding communities.

Merge Small Communities.
After the previous two processes, every node is assigned to a community. Some communities are probably of very small size, and we need to merge these communities into the other communities. This process includes determining small communities and calculating the number of common nodes.
Step 1. If a community is of size smaller than , we call it small community; otherwise, we call it large community. The value of is determined by the following equation: We determine the value of on the basis of two reasons. Firstly, we deem that communities of size smaller than 5 make no sense. Secondly, when the size of the biggest community is very large, communities of size slightly bigger than 5 have no meaning to exist, just like it is meaningless to compare the size of a city with that of a village. Step 2. When merging communities, we will not merge a whole community into another one, because not all the nodes are strongly connected to the same community. We will merge small communities node by node into large communities. For each node V in small communities, calculate the number of common nodes of the node V's adjacent nodes and each large community. If the node V's adjacent nodes have the most common nodes with the community , reassign the node V to community .
Algorithm 3 shows the process of merging little communities.

Choose the Best Result.
Given a network, we will detect its communities after the previous three processes. In the process of initializing communities, we give the threshold of similarity. Different threshold will lead to different result. The smaller the threshold is, the more likely the two global central nodes are assigned to the same community, so there tend to be fewer communities. We need to choose an appropriate value of threshold so as to detect the real number of communities and get the best result.
For the threshold , take at regular interval ten values from 0.1 to 1, and repeat the previous three processes for each value. Then we obtain ten partitions of the network. Generally, we are not able to know the real communities, so we have to choose the best result through an assessment criterion. For International Journal of Distributed Sensor Networks 5 now, modularity is the most widely used criterion. However, we find that the result maximizing the extensive modularity is usually better than the result maximizing the modularity and conforms to the real situation very well. So we choose the partition with maximal value of extensive modularity as the best result. Now we analyze the time complexity of CDHC. The first process costs ( ln ) + ( ). The second process costs ( ), and the third process costs no more than ( ). So the time complexity of CDHC is ( ln ) + ( ) + ( ). In the worst case, and are close to , and the time complexity is ( 2 ).

Experiments
Firstly, we choose three most widely used datasets, which are Zachary Karate Club [17], Dolphin Network [18], and American College Football [19]. Secondly, we introduce some assessment criteria used in the case that the real communities are known. Thirdly, we verify the advantage of extensive modularity on the three datasets. Lastly, we choose some classical or new algorithms and compare them with CDHC on the three datasets.

Zachary Karate Club.
It is a social network of friendships between 34 members of a karate club at a US university in the 1970s. A factional division leads to a formal separation of the club into two organizations of nearly equal size.

Dolphin Network.
It is an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound in New Zealand. It contains totally 62 dolphins in two groups.

American College Football.
It is a network of American football games between Division IA colleges during regular season fall 2000. It contains 115 teams, which are divided into 12 conferences, each containing 8 to 12 teams. Table 1 is a summary of the three datasets' basic information, including NOC (number of communities), number of nodes, and number of edges. Criteria. Given a network , = { 1 , 2 , 3 , . . . , } is its real communities, while = { 1 , 2 , 3 , . . . , } is its detected communities. Here we introduce four assessment criteria to evaluate the overlap ratio of the real communities and the detected communities. [20]. Precision is defined as

Precision
Precision is a number between 0 and 1. It means the ration between the retrieved correct items and the retrieved items. It is not a good criterion for community detection. For example, if we consider every node as a community, the precision will be 1. However, it is obviously not a good result. Generally, if we detect more communities, the precision tends to be larger. [20]. Recall is defined as

Recall
Recall is a number between 0 and 1. It means the ratio between the retrieved correct items and the correct items. It is also not a good criterion for community detection. For example, if we only detect one community, which is the whole network, the precision will be 1. However, it is obviously not a good result. Generally, if we detect fewer communities, the recall tends to be larger. [21]. -measure is defined as -measure ( , )

-Measure
which is -measure ( , ) -measure is a number between 0 and 1. We can see that -measure is the harmonic average of precision and recall. If -measure is large, precision and recall need to have a relatively large value. According to what we discuss about precision and recall, if we want a large value of -measure, the number of communities should not be too large or too small, and the detected communities must be very close to the real communities.

NMI [22] (Normalized Mutual Information). NMI is defined as
NMI is a number between 0 and 1. The larger NMI is, the better result is. NMI is a frequently used criterion, but the calculation is relatively complex.
According to our discussion about the four criteria, we will take -measure and NMI to evaluate the result of community detection.

Experimental Results on Extensive Modularity.
For the threshold , we take at regular interval ten values from 0.1 to 1, and we obtain ten different partitions, respectively, on the datasets Zachary, Dolphin, and Football. For each partition, we use NOC (number of communities), modularity , extensive modularity , -measure, and NMI to evaluate the result. Table 2 is the ten experimental results under different threshold on the dataset Zachary. Seen from the modularity, the best result should be the case when the threshold is 0.6. We can see that the maximal value of modularity is 0.373. In this case, we find that the detected communities differ much from the real communities. The detected NOC is 3, while real NOC should be 2. Besides, the value of -measure and NMI is not very high. Seen from extensive modularity, the best result is the case when the threshold is 0.2. The maximal value of extensive modularity is 0.744. In this case, the values ofmeasure and NMI are both 1, which signifies the detected communities are exactly the same with the real communities. Table 3 is the ten experimental results under different threshold on the dataset Dolphin. Seen from the modularity, the best result is the case when the threshold is 0.3. We can see that the maximal value of modularity is 0.477. In this case, we find that the detected communities also differ much from the real communities. The detected NOC is 4, while Though maximizing extensive modularity does not find the best result, it is very close to the best result and is far better than the case maximizing modularity. Table 4 is the ten experimental results under different threshold on the dataset Dolphin. Seen from modularity or extensive modularity, the best result is the case when the threshold is 0.5. In this case, the detected NOC is 11, which is very close to the real NOC 12. We see that maximizing modularity or extensive modularity does not find the best result, but it has only a slight gap with the best result.
From the discussion on the experimental results on the three datasets, the result maximizing extensive modularity usually is the best result or very close to the best result, while the result maximizing modularity is usually not very good. We can thus consider that extensive modularity is a better criterion compared to modularity.

Experimental Results on Comparison of Algorithms.
As we have discussed, the extensive modularity is a better criterion than modularity, so we choose the case maximizing extensive modularity as the best result and the final result. We have chosen four algorithms for comparison, which are GN Algorithm [6], CNM Algorithm [13], EACH Algorithm [16], and LCV Algorithm [15]. Table 5 is the experimental results of the five algorithms on the dataset Zachary. GN and CNM get large value of modularity, but they detect more communities than the real situation, and their value of NMI is very low. EACH, LCV, and CDHC have detected the real communities. So EACH, LCV, and CDHC are far better than GN and CNM on Zachary. Table 6 is the experimental results of the five algorithms on the dataset Dolphin. GN, CNM, and EACH get large value of modularity, but they detect more communities than the real situation, and their value of NMI is very low. LCV and CDHC get large value of extensive modularity, and the   LCV and CDHC are far better than GN, CNM, and EACH on Dolphin. Furthermore, CDHC is slightly better than LCV. Table 7 is the experimental results of the five algorithms on the dataset Football. The modularity of the five algorithms differs a little. The NOC detected by GN and CNM is far less than the real NOC 12. GN gets the largest value of extensive modularity, but its NMI is relatively low. CDHC gets the second largest value of extensive modularity, and its NMI is very close to the best result. On the whole, EACH and CDHC are better on Football.
From the experimental results on the three datasets, we find that GN and CNM usually get large modularity. However, the detected communities usually differ much from the real communities. Large NOC makes it easy to get a large modularity. CDHC usually gets large extensive modularity, and the detected communities are very close to the real communities. On the whole, CDHC is far better than GN and CNM and is slightly better than EACH and LCV. Figures 3, 4, and 5 show the communities that CDHC has detected on Zachary, Dolphin, and Football. We can see that the community structure is evident.

Conclusion
This paper proposes the conception of extensive modularity on the basis of modularity. The upper boundary of extensive  modularity is independent on the number of communities that we detect, so it overcomes the weakness of modularity that maximizing modularity tends to detect more communities than the real situation. We verify through experiments that extensive modularity is a better criterion than modularity when the real communities are unknown. This paper proposes CDHC Algorithm. CDHC Algorithm includes four processes, which are initializing communities, expanding communities, merging little communities, and choosing the best result. We choose four algorithms to compare with CDHC on three datasets, which are Zachary, Dolphin, and Football. We find that CDHC is far better than GN and CNM and is slightly better than LCV and EACH. According to -measure and NMI, the communities detected by CDHC are very close to the real communities.