An overlapping community detection algorithm based on local community and information flow expansion (LCFE) in weighted directed networks

a MSc student in Industrial Engineering, Iran University of Science and Technology, Tehran, Iran
b Professor of Industrial Engineering, Iran University of Science and Technology, Tehran, Iran
Article history: Received July 14, 2020; received in revised format July 25, 2020; accepted August 1


Introduction
With the development of information technology and the vast amounts of data it carries, the importance of social networks is felt more than ever. Biological networks such as protein-protein interaction networks, online social networks, and collaboration networks such as author citation networks are examples of social networks (Girvan & Newman, 2002). Social network analysis reveals fundamental insights about the modern world, which is why the amount of research conducted in network science has increased. One can simply describe a network as a set of nodes and a set of edges depicting the interactions between them (Badiee & Ghazanfari, 2018). It is an established fact in network science that the structure of a system inevitably affects the system itself (Kermani, Badiee, Aliahmadi, Ghazanfari, & Kalantari, 2016). Some nodes in a network show a particular similarity to others. Such nodes form groups that are densely connected internally while having sparse connections to the other components of the network. The procedure for finding these groups is called community detection, clustering or graph partitioning, and the groups are called clusters, partitions or communities. However, there is still no clear definition of the graph clustering problem, in either undirected or directed graphs (Malliaros & Vazirgiannis, 2013).
In this paper we introduce an overlapping community detection algorithm for weighted and directed networks. The algorithm starts by calculating the betweenness of every edge. Then, based on the higher betweenness values, the ends of an edge are selected as a local community, and the local communities are expanded by optimizing a modularity function called WFM. In the next step an overlapping score based on the similarity concept is introduced, and communities with a higher overlapping score are merged. At the end, the homeless nodes are added to communities based on the value of their fitness. After the last phase, if a node is still homeless, we assign it to the outlier group.
This paper is organized as follows. The second section reviews the literature on community detection algorithms in weighted directed networks. The third section presents the model and the algorithm. The fourth section gives numerical examples, and the fifth section presents the conclusion and suggestions for future work.

Literature review
Dealing with edge directionality has long been an issue, and various methods have been proposed to tackle it. In many works the directed graph is, for convenience, transformed into its undirected version so that all algorithms for undirected community detection can be applied. This approach is called naïve graph transformation (Malliaros & Vazirgiannis, 2013). It discards the direction of every edge, so much vital information can be lost. In other works the graph is transformed into an undirected one but the directionality is partly retained. There are two different approaches to retaining directionality in this context. The first approach converts the asymmetric adjacency matrix into a symmetric one, encoding the directionality in edge weights and keeping the graph unipartite. For example, a directed network can be symmetrized through a two-stage process (Satuluri & Parthasarathy, 2011). The idea behind the two-stage transformation is that a clustering algorithm should depend not solely on the density of the nodes but also on the similarity of their incoming and outgoing edges. In the first stage, several ways of symmetrizing an asymmetric adjacency matrix are proposed, and in the second stage an ordinary community detection approach can be used. A network can also be symmetrized based on its embedding, which can be regarded indirectly as a transformation to an undirected weighted network; the Laplacian matrix can be considered such an embedding (Lai, Lu, & Nardini, 2010b). Edge directionality can be extracted using a PageRank random walk and replaced with edge weights (Lai, Lu, & Nardini, 2010a). The community detection process can also start from core nodes in the network, which are then expanded within their neighborhood to extract the final community structure of the network (Long & Li, 2017).
In the second approach the directionality is modeled by converting the graph to a bipartite network: a scheme is used to transform the directed graph into an undirected bipartite graph (Guimerà, Sales-Pardo, & Amaral, 2007; D. Zhou, Schölkopf, & Hofmann, 2005). In other works an objective function is developed to handle directionality in graph clustering. Whereas the previous methods change the nature of the directed network, these methods develop objective functions that deal directly with the clustering problem in directed networks. In the first category of such methods, a modularity function is developed for directed networks. Modularity is a criterion for assessing the quality of clusters (Newman & Girvan, 2004). The modularity function has been generalized to directed networks by reducing the initial size of the network while keeping the modularity intact (Arenas, Duch, Fernández, & Gómez, 2007). A modularity function for directed networks based on the original modularity was introduced, on the basis that modularity can be expressed through the eigenvalues and eigenvectors of a specific matrix called the modularity matrix (Leicht & Newman, 2008). The LinkRank algorithm emphasizes edges rather than nodes in the process of community detection (Kim, Son, & Jeong, 2010). The scalable Louvain algorithm, which maximizes modularity, was extended to yield a community detection algorithm for directed networks (Dugué & Perez, 2015). Regularized asymmetric non-negative matrix factorization (RANMF) was developed based on an objective function with pairwise comparison of nodes (Tosyali, Kim, Choi, & Jeong, 2019). A consensus clustering algorithm for directed networks called ConClus, which comprises three sub-algorithms, was also developed. The algorithm mostly relies on a fitness function and a neural network providing intervals as resolution parameters (Santos, Carvalho, & Nascimento, 2016).
An overlapping community detection algorithm for directed networks based on edge betweenness, modularity and PageRank was proposed (Sathiyakumari & Vijaya, 2018). Another overlapping community detection algorithm for directed networks, which uses a Gamma-Poisson block model, was introduced. The model can also be generalized to undirected networks by making the block model matrix symmetric (Gao, Liu, & Miao, 2018). A multi-objective optimization model for clustering heterogeneous weighted networks through key node identification with overlapping communities was introduced (Kalantari, Ghazanfari, Fathian, & Shahanaghi, 2020). Some approaches are based on nature-inspired algorithms. A consensus genetic-based algorithm was used to detect communities in directed networks (Mathias, Rosset, & Nascimento, 2016). In another work, bio-inspired algorithms were used for detecting communities in weighted directed networks (Osaba et al., 2018). An ant colony based algorithm for overlapping community detection was introduced (X. Zhou, Liu, Zhang, Liu, & Zhang, 2015). Although optimization of an objective function and nature-inspired algorithms could be classified in one category, we prefer to split them into two categories because of the distinct problem-solving approach of nature-inspired algorithms. Most of the algorithms reviewed so far do not take the effect of edge weights into account. A new local clustering coefficient was proposed for weighted and directed networks, which captures the presence of triangles as well as weights (Clemente & Grassi, 2018). Another algorithm considers the impact factors of in-degree and out-degree and uses the directed weighted degree to measure the importance of a node (Liu, Qin, Yun, & Wu, 2011). In Table 1 we summarize the algorithms according to the method they use to tackle the community detection problem in weighted and directed networks.
As can be seen, very few algorithms take edge weights into consideration. It is important to mention that our algorithm is an extension of the algorithm proposed in (Xing, Fanrong, Yong, & Ranran, 2015), which is designed for undirected and unweighted networks. We extended that algorithm to weighted and directed networks. In addition, we developed a new modularity function called Weighted Flow Modularity (WFM), suited to weighted and directed graphs, and a new overlapping score that considers the similarity between the edges and the nodes of two given communities at the same time.

Model
We use the graph $G = (V, E)$ to model the weighted and directed network, in which $V$ is the set of nodes and $E$ is the set of weighted directed edges between the nodes. As can be seen from Fig. 1, in this model we normalize the edge weights by dividing each edge weight by the maximum edge weight.
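The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the network is stored as a dictionary that maps each directed edge `(source, target)` to its weight; the function name `normalize_weights` and the sample edges are hypothetical and not taken from the paper.

```python
def normalize_weights(edges):
    """Divide every edge weight by the maximum weight, so weights lie in (0, 1]."""
    if not edges:
        return {}
    max_w = max(edges.values())
    if max_w <= 0:
        raise ValueError("maximum edge weight must be positive")
    return {edge: w / max_w for edge, w in edges.items()}


# Hypothetical directed, weighted edges: (source, target) -> weight.
edges = {("u", "v"): 4.0, ("v", "a"): 2.0, ("a", "u"): 1.0}
normalized = normalize_weights(edges)
```

After this step the heaviest edge has weight 1 and every other weight is expressed as a fraction of it.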

Algorithm
In this section we present the algorithm. The algorithm is designed for overlapping community detection in weighted and directed networks and is called LCFE. First we present the notations used in the algorithm pseudocode.

LCFE pseudocode notations
The notations and the parameter definition of the LCFE can be seen in Table 2.

LCFE Steps
LCFE begins by ranking the edges and setting the two ends of a selected edge as a local community. In the second step the local communities are expanded by optimizing a modularity function. In the third step the local communities are merged based on an overlapping score. Finally, in the fourth step the communities are refined by assigning the homeless nodes to the detected communities.

Step 1: Calculating the edge betweenness centrality
The first step is to calculate the edge betweenness of all edges in the graph, taking the weights into account. The more central an edge, the stronger the communities developed from it. This step is a new contribution relative to (Xing et al., 2015), because the edges used to start community expansion are not chosen randomly. The centrality criterion is the one developed in (Girvan & Newman, 2002). As we know, the weights affect the number of shortest paths between two nodes, so in order to relax this negative effect on the intensity concept, i.e. the edge weights, we sort the edge betweenness values in ascending order. Here is the pseudocode of this step:

Pseudocode 1: Edge Betweenness matrix
Step 1: Calculating the edge betweenness centrality matrix
After calculating the edge betweenness of each edge, the algorithm starts with the first edge in BM and assigns its ends as the local community; the common neighbors of the two ends are then extracted. After that, a new modularity function determines whether a common neighbor node joins the community or not.
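Step 1 can be sketched as follows. This is not the authors' implementation: it is a minimal pure-Python version of weighted, directed edge betweenness using Brandes-style accumulation over Dijkstra shortest paths. Treating each edge weight as a path length is an assumption made here for illustration, as are the names `edge_betweenness` and `bm` (the latter mirrors the sorted list BM mentioned in the text, in ascending order as the step describes).

```python
import heapq
from collections import defaultdict


def edge_betweenness(edges):
    """Weighted, directed edge betweenness.

    `edges` maps (source, target) -> positive weight, read as a path length
    (an assumption of this sketch).
    """
    adj = defaultdict(list)
    nodes = set()
    for (u, v), w in edges.items():
        adj[u].append((v, w))
        nodes.update((u, v))
    score = {e: 0.0 for e in edges}
    for s in nodes:
        dist, seen = {}, {s: 0.0}          # final and tentative distances
        sigma = defaultdict(float)          # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)           # shortest-path predecessors
        order = []                          # nodes in order of finalization
        heap = [(0.0, s)]
        while heap:
            d, v = heapq.heappop(heap)
            if v in dist:
                continue                    # stale heap entry
            dist[v] = d
            order.append(v)
            for w, weight in adj[v]:
                if w in dist:
                    continue
                nd = d + weight
                if w not in seen or nd < seen[w]:
                    seen[w] = nd
                    sigma[w] = sigma[v]
                    preds[w] = [v]
                    heapq.heappush(heap, (nd, w))
                elif nd == seen[w]:         # another equally short path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):           # accumulate dependencies backwards
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[(v, w)] += c
                delta[v] += c
    return score


# Hypothetical example: the direct edge a->c is longer than the path a->b->c.
score = edge_betweenness({("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 3.0})
bm = sorted(score, key=lambda e: (score[e], e))  # ascending, ties broken by name
```

In this toy graph the edge `("a", "c")` lies on no shortest path, so it receives betweenness 0 and comes first in the ascending list.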

Definition 1. Weighted Flow Modularity (WFM):
This modularity function is mostly based on the popular M function. Because of directionality, and on the assumption that stronger communities in directed graphs have stronger information cycles, we extended the M function. Under the extended function, a common neighbor that forms a closed cycle starting from and ending at itself has a greater chance of joining the local community, because it increases the flow of information to the existing nodes of the local community and to itself. Having clarified the WFM, we now turn to the pseudocode for the local community detection step of the algorithm. Here is the pseudocode:

Pseudocode 2: Local Community Detection
Step 2: Detecting Local Communities
Input: Network

Fig. 2. Local Community Expansion
To clarify this step, consider the following example based on Fig. 2. Suppose $e_{uv}$ is the starting edge selected in Step 1. Nodes $u$ and $v$ make up the first local community, whose WFM is equal to 0.13. The set of neighbor nodes is $\{a, b, c, d\}$. Each node in this set is tentatively added to the local community and is kept only if it increases the value of WFM. The first candidate is node $a$: the WFM of the local community with $a$ is 2.33, which is larger than the value before adding $a$, so the local community becomes $\{u, v, a\}$. The significant increase in WFM is due to the existence of a closed cycle starting from $a$. If $b$ is added, the WFM becomes 5.82, which is larger than the value before adding $b$, so node $b$ is added to the local community. Adding node $d$ would give a WFM of 5.78, which is smaller than the previous value, so $d$ is not added; node $d$ does not initiate a closed cycle of information with the current nodes of the local community. Finally, adding node $c$ raises the WFM to 6.24, so node $c$ is added to the local community. At the end of this process the local community has expanded to $\{u, v, a, b, c\}$. This step of the algorithm continues until all of the edges have been investigated.
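The greedy acceptance rule in this example can be replayed with a short Python sketch. The WFM formula itself is not reproduced here; instead, a lookup table holding the five WFM values quoted in the example stands in for it. The function name `expand_local_community` and the table `example_wfm` are hypothetical.

```python
def expand_local_community(seed, candidates, quality):
    """Greedily grow `seed`: keep a candidate only if it raises `quality`."""
    community = set(seed)
    best = quality(frozenset(community))
    for node in candidates:
        trial = frozenset(community | {node})
        value = quality(trial)
        if value > best:                 # strict improvement required
            community, best = set(trial), value
    return community, best


# WFM values quoted in the worked example; a stand-in for the real function.
example_wfm = {
    frozenset({"u", "v"}): 0.13,
    frozenset({"u", "v", "a"}): 2.33,
    frozenset({"u", "v", "a", "b"}): 5.82,
    frozenset({"u", "v", "a", "b", "d"}): 5.78,
    frozenset({"u", "v", "a", "b", "c"}): 6.24,
}

# Candidates are tried in the order used in the example: a, b, d, c.
community, wfm = expand_local_community(
    ["u", "v"], ["a", "b", "d", "c"], example_wfm.__getitem__
)
```

Node `d` is rejected because 5.78 is below the current value 5.82, and the run ends with the community `{u, v, a, b, c}` at a WFM of 6.24, matching the example.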

Step 3: Local Community Merging
The local communities extracted in Step 2 are relatively small and cannot be considered the final community structure of the network. Moreover, they do not overlap much. In this step we therefore develop a novel overlapping score that considers the overlap in nodes and in edges at the same time. Before that, we review the base overlapping scores on which ours is built.
Definition 2. The overlapping score (OS) was first introduced in (Nguyen, Dinh, Nguyen, & Thai, 2011). This score is parameter free and requires only the local topological information of the network.
Definition 3. Later, Eq. (3) was extended in (Xing et al., 2015). In the extended form, a parameter is added to the fraction, because some networks overlap more in nodes than in edges.
Definition 4. A similarity index was introduced in (Carley, 1991). The main idea behind this similarity index is that "friends tend to be similar".
Having defined LOS, the pseudocode of Step 3 is as follows:

end for
In the pseudocode above, the merging threshold is a tunable parameter. The larger its value, the fewer communities are merged.
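The effect of the tunable parameter can be illustrated with a small Python sketch. It does not implement LOS, whose formula is not reproduced here. As an assumption for illustration only, the score below adds the fraction of shared nodes to the fraction of shared internal edges, and the tunable parameter is treated as a merging threshold. All names (`internal_edges`, `overlap_score`, `merge_communities`) and the sample data are hypothetical.

```python
def internal_edges(community, edges):
    """Directed edges with both endpoints inside `community`."""
    return {(u, v) for (u, v) in edges if u in community and v in community}


def overlap_score(c1, c2, edges):
    """Illustrative score: shared-node ratio plus shared-internal-edge ratio."""
    node_part = len(c1 & c2) / min(len(c1), len(c2))
    e1, e2 = internal_edges(c1, edges), internal_edges(c2, edges)
    smallest = min(len(e1), len(e2))
    edge_part = len(e1 & e2) / smallest if smallest else 0.0
    return node_part + edge_part


def merge_communities(communities, edges, threshold):
    """Repeatedly merge the first pair whose score reaches `threshold`."""
    merged = [set(c) for c in communities]   # communities assumed non-empty
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlap_score(merged[i], merged[j], edges) >= threshold:
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


edges = {("u", "v"), ("v", "a"), ("a", "u"), ("a", "b"), ("b", "v"),
         ("x", "y"), ("y", "x")}
local = [{"u", "v", "a"}, {"v", "a", "b"}, {"x", "y"}]
loose = merge_communities(local, edges, threshold=0.8)
strict = merge_communities(local, edges, threshold=1.5)
```

With the lower threshold the first two local communities, which share two of their three nodes and one internal edge, are merged; with the higher threshold nothing is merged, which is the behaviour the text describes for larger parameter values.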

Step 4: Community Refinement
After merging, some nodes may be left without a community. Now that the communities are large enough, these nodes are more likely to join one, because they have a better chance of forming closed cycles. At the end of this step, every node that has not joined a community is called an outlier. To decide whether a node outside the communities joins a community, we define a criterion called node fitness.
Definition 6. The fitness of a node is evaluated for joining the node to a community C; if the fitness value is strictly larger than 0, the node is joined to the community. The pseudocode of this step is as follows:

 
Outlier ← Outlier ∪ {u}
end if
end for
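The refinement step can be sketched in Python under stated assumptions. The fitness formula of Definition 6 is not reproduced here, so the sketch assumes that the fitness of a node is the change in a community quality function when the node is added. The quality function used below (internal weight divided by internal plus boundary weight) is a toy stand-in, not WFM, and the names `make_quality` and `refine` are hypothetical.

```python
def make_quality(edges):
    """Toy quality: internal weight / (internal + boundary weight)."""
    def quality(community):
        inside = sum(w for (u, v), w in edges.items()
                     if u in community and v in community)
        boundary = sum(w for (u, v), w in edges.items()
                       if (u in community) != (v in community))
        total = inside + boundary
        return inside / total if total else 0.0
    return quality


def refine(communities, all_nodes, quality):
    """Add homeless nodes where fitness > 0; the rest become outliers."""
    communities = [set(c) for c in communities]
    covered = set().union(*communities) if communities else set()
    outliers = set()
    for node in (n for n in all_nodes if n not in covered):
        joined = False
        for community in communities:
            # Assumed fitness: quality gain from adding the node.
            fitness = quality(community | {node}) - quality(community)
            if fitness > 0:              # strictly positive, as in the text
                community.add(node)
                joined = True
        if not joined:
            outliers.add(node)
    return communities, outliers


edges = {("u", "v"): 1.0, ("v", "u"): 1.0, ("v", "h"): 1.0, ("h", "u"): 1.0}
refined, outliers = refine([{"u", "v"}], ["u", "v", "h", "z"], make_quality(edges))
```

Node `h` closes a cycle with `u` and `v` and raises the toy quality from 0.5 to 1.0, so it joins; the isolated node `z` yields a fitness of 0 and is placed in the outlier set.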

Algorithm for Generating Benchmark networks
For the purpose of benchmarking we used the algorithm introduced in (Lancichinetti & Fortunato, 2009). The algorithm is specifically designed for testing overlapping community detection algorithms in weighted and directed networks.

Run time settings and environment
The simulations were carried out on a laptop with an Intel(R) Core(TM) i5 m48 @ 2.67GHz 2.66 GHz processor and 3.87 GB of memory under the Windows 8 operating system. The source code of the algorithm in this article is written in Python 3.7. The benchmark generator comes with its own software package.

Evaluation Criteria of the LCFE
We evaluate the performance of the algorithm with normalized mutual information (NMI) on the benchmark networks, since the community structure of the LFR networks is known in advance, and with the EQ measure on the real-world networks, since the true community structure of most real networks is unknown. The EQ measure is calculated through Eq. (8), in which A denotes the adjacency matrix. The greater the EQ value, the better the community detection result.
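For orientation, the sketch below computes the widely used form of extended modularity for overlapping communities on an undirected, unweighted graph, in which each pair term is divided by the number of communities its two nodes belong to. This is an assumption about the general shape of the measure: Eq. (8) of this paper may use a weighted or directed variant, which is not reproduced here. The function name `extended_modularity` and the sample graph are hypothetical.

```python
from collections import Counter


def extended_modularity(edges, communities):
    """EQ for an undirected, unweighted graph with overlapping communities."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    two_m = sum(len(nbrs) for nbrs in adj.values())   # twice the edge count
    if two_m == 0:
        raise ValueError("graph must contain at least one edge")
    # O_v: number of communities each node belongs to.
    membership = Counter(node for c in communities for node in c)
    total = 0.0
    for community in communities:
        for v in community:
            for w in community:
                a_vw = 1.0 if w in adj.get(v, ()) else 0.0
                expected = len(adj.get(v, ())) * len(adj.get(w, ())) / two_m
                total += (a_vw - expected) / (membership[v] * membership[w])
    return total / two_m


# Two disconnected edges, each forming its own community.
eq = extended_modularity([(0, 1), (2, 3)], [{0, 1}, {2, 3}])
```

When no node belongs to more than one community, this expression reduces to ordinary modularity, which is 0.5 for the toy graph above.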

Parameters used in benchmark networks
The mentioned algorithm for generating benchmark networks gets the parameters shown in Table 3 as inputs and generates a network based on the input values.

Experimental results on synthetic networks
In this section, we evaluate the algorithm from two points of view: accuracy and run time.

Experiments for accuracy evaluation
In this section, we generated 64 synthetic networks divided into 16 groups, each containing 4 networks. All groups share the common parameters N, k, maxk, beta, t1 and t2. The networks in each group share all parameters except the number of overlapping nodes. We extracted the community structures with the LCFE algorithm and with the Order Statistics Local Optimization Method (OSLOM) (Lancichinetti, Radicchi, Ramasco, & Fortunato, 2011). The synthetic network groups used for the accuracy evaluation of LCFE are listed in Table 6. The comparison results are shown in Fig. 3 to Fig. 18, from which it can be seen that LCFE outperforms OSLOM in most cases.

LCFE run time evaluation
In this part of the evaluation, we generated 10 different benchmark networks, ranging from 100 to 1000 nodes. It can be seen from Fig. 19 that LCFE performs better than OSLOM in most cases and is competitive in the others.

Experimental results on Real world networks
We applied LCFE to 10 real-world weighted and directed networks from the KONECT project. A brief description of these networks is given in Table 4. We used the EQ measure to evaluate the quality of the community detection carried out by LCFE and OSLOM. The higher the EQ value, the better the community detection result. The EQ results on the real-world networks are listed in Table 5.

Discussion and conclusion
In this article, we developed an overlapping community detection algorithm for weighted and directed networks. The main contributions of this work are: using sorted edges as initiators of the community detection process; developing a new modularity function called weighted flow modularity, based on the M function and weighted flow cycles; and a new overlapping score that considers the overlap between nodes and edges at the same time. We generated 81 LFR benchmarks in order to evaluate various aspects of the developed algorithm in terms of accuracy, run time and parameter selection. We evaluated the community detection results on the LFR benchmarks using normalized mutual information. The performance of the algorithm on real-world networks was then evaluated using the EQ measure. In all cases, the performance of the algorithm was compared with the Order Statistics Local Optimization Method (OSLOM). LCFE outperformed OSLOM in most cases and was competitive in the others.

Parameter  N1   N2   N3   N4   N5   N6   N7   N8   N9   N10  N11  N12  N13  N14  N15  N16
N          200  200  200  200  200  200  200  200  200  200  200  200  200  200  200  200
K          5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5
MAXK       25   25   25   25   25   25   25   25   25   25   25   25   25   25   25