Unsupervised community detection in attributed networks based on mutual information maximization

Community detection is of great significance for understanding network functions and behaviors. With the successful application of deep learning in network-based analyses, recent studies have turned to graph convolutional networks (GCNs) for this problem because of their capability to capture network attributes. Nevertheless, most existing GCN-based community detection approaches are semi-supervised and aware only of local structure, even though community detection is essentially an unsupervised learning problem. In this paper, we develop a novel GCN method, called UCDMI, for unsupervised community detection under the framework of mutual information (MI) maximization. Specifically, a novel MI maximization mechanism is developed to capture more fine-grained information of the global network structure in an unsupervised manner. Moreover, a new aggregation function is proposed for GCN to distinguish the importance of different neighboring nodes, which enables our method to learn higher-quality node representations and improves community detection performance. Our extensive experiments demonstrate the effectiveness of the proposed UCDMI compared with several state-of-the-art community detection methods.


Introduction
Community structures are common in complex networks and can be described as groups in which nodes within the same group are closely connected [1][2][3]. The purpose of community detection is to detect such community structures in complex networks, which helps reveal the hidden information of complex systems, e.g. the functions and units of a social group [4][5][6] and urban traffic systems [7,8]. In practice, community detection has many valuable applications and has attracted wide attention in fields ranging from medicine and engineering to social science and biology [9][10][11]. For example, detecting community structure in World Wide Web networks can reveal different topics and facilitate social recommendations [12].
Up to now, numerous algorithms have been developed to detect communities by utilizing network structures, including generative models [13,14] and metric-based methods [15,16]. Besides the topology of networks, the effective utilization of attribute features also plays a very important role in improving the accuracy of community detection [17]. Such attributes provide additional rich information about networks and indicate possible states of nodes in a network [18]. For example, in a citation network, papers carry attributes such as titles and keyword areas [19]. Since the topology structure and node attributes are two different types of information, detecting communities in such attributed networks is challenging.
Several methods have been proposed to detect communities in attributed networks by considering node attributes, including methods based on nonnegative matrix factorization [6], spectral clustering [20], graph convolutional networks (GCNs) [21], etc. Among them, GCN-based methods (e.g. DGI [22]) have recently gained a lot of attention due to their ability to effectively integrate network topology information and attribute information. However, some issues require further consideration. First, these methods typically update node features by aggregating the features of neighboring nodes, which treats all neighbors equally [23]. Since different neighboring nodes occupy different positions in a network, their importance to the target node may also differ. Although graph attention network-based methods have considered this issue, their weighting is implicit and may raise interpretability issues [24]. Second, most existing methods require some labeled data to train the GCN. Nevertheless, label information is often expensive or even unavailable due to privacy policies [25]. Hence, it is crucial to develop a method that can detect high-quality community structures of networks in an unsupervised fashion. Third, most of these methods pay little attention to the global information of networks, because they are based on a local convolutional strategy [22]. Nevertheless, the global information may indicate the global status of nodes in a network, e.g. revealing the similarity between two nodes that are far away from each other but have similar connection patterns [26].
To fully exploit such global information and attribute features in an unsupervised manner, we perform community detection in attributed networks based on mutual information (MI), inspired by the success of the Deep InfoMax method [27] in image processing. Deep InfoMax exploits global representations of an image by maximizing the MI between images (i.e. the inputs) and hidden vectors (i.e. the outputs) [27]. Some recent works transfer Deep InfoMax to the network domain, such as deep graph infomax (DGI), for node classification and link prediction [22]. More specifically, they assume that the learned representation of each node should contain information of the entire graph. For this purpose, as in image processing, these methods discover useful node representations by implementing MI maximization based on the representation of the whole network. Nevertheless, learning node representations by capturing information from the entire network is crude and cannot reflect the intrinsic structural information. As mentioned before, networks contain community structures, where nodes within the same community tend to share more similar information than nodes from different communities. In fact, for a specific node, the information of its corresponding subnetwork (e.g. its community) can better reflect its status and function in the network than the information of the whole network. For example, information about a student's class, such as major and grades, reflects the status of that student better than broader information about the whole school. However, existing methods that implement MI maximization based on the whole graph cannot make effective use of this structural characteristic, which leads to coarse node representations and degraded community detection performance.
Considering these limitations, this paper proposes a new method for unsupervised community detection in attributed networks based on MI maximization, called UCDMI. Specifically, we first identify the potential cluster structure, namely fine-grained subnetworks, based on the attribute features of networks. Then, we develop a new MI maximization mechanism to maximize the MI between node representations and the corresponding fine-grained subnetwork representations. Based on this mechanism, more fine-grained information of the global network structure and attributes can be effectively captured in an unsupervised manner. Moreover, we design a new aggregation function for GCN that aggregates the features of neighbors preferentially according to the importance between nodes. Finally, the community structure of networks is obtained by applying a clustering algorithm to the node embeddings. The main contributions of this paper are as follows: (a) A novel aggregation function for GCN is proposed in which more important neighboring nodes contribute more in the process of feature aggregation. (b) A new MI maximization mechanism is designed in which more fine-grained global information of networks is captured in an unsupervised fashion, which helps obtain high-quality community structures in attributed networks.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the proposed UCDMI, which describes the formulations of the proposed aggregation strategy and the new MI maximization mechanism. Section 4 presents comprehensive experiments to demonstrate the effectiveness of UCDMI. Finally, section 5 draws the conclusions of this paper.

Related work
Existing community detection methods related to this work fall into two categories: structural community detection and attributed community detection [28]. Structural community detection only utilizes the structure of networks (i.e. node connectivity [29,30]). Specifically, the graph Laplacian eigenmaps-based methods assume that similar nodes should be mapped closer [31]. To exploit more structural relationships, the matrix factorization-based methods factorize the adjacency matrix and other relationship matrices of networks into node representations [32]. For example, a recent NMF-based method detects community structures by an optimization based on the majorization-minimization principle [2]. However, matrix factorization has high time complexity due to the frequent matrix operations. Therefore, the random walk-based methods are used instead, which maximize the probability of the neighborhood of each node [33]. Moreover, although the autoencoder-based methods can also exploit relationships of the network structure by using the learned node representations to reconstruct the adjacency matrix, these methods only exploit network structures [34]. Besides, some methods, e.g. NSGAMOF, apply modularity-based techniques for community detection in order to further exploit the structures [35]. However, attributes that can provide extra profiles for users should also be taken into account.
Figure 1. (a) Traditional solutions for community detection in attributed networks, which separately detect the community structure based on the structure information and the attributes, and finally fuse the results. (b) Our joint optimization solution for community detection in attributed networks, which learns these two types of information simultaneously. As the outputs, two communities $c_1$ and $c_2$ are detected by these two kinds of solutions from the original network $G$, where nodes within the same community are closely connected.
Different from structural community detection methods, attributed community detection methods consider both the structures and attributes of networks [36]. Specifically, some existing methods fuse the information of structure and attributes after the process of community detection. As shown in figure 1(a), these methods first separately detect communities based on the attributes (e.g. by k-means [37,38]) and the structure (e.g. by Louvain [39]). Then, the resulting communities are merged into structure- and attribute-aware communities. For example, CFOND utilizes the consensus factorization principle to preserve the information of the structure and attributes for co-clustering network data. Similarly, based on consensus clustering [40], FCCCN can be applied to networks with millions of nodes by calculating the consensus matrix and additional node pairs [41]. In fact, the information of the structure and the attributes is interrelated, and considering the two types of information separately during community detection may sever the connection between the two [6].
In order to jointly optimize both kinds of information, as shown in figure 1(b), some recent methods have applied deep learning to attributed and large-scale networks to detect the underlying community structure [42]. Specifically, vGraph detects communities in attributed networks by utilizing a generative model to jointly learn node attributes and network structures [43]. However, although vGraph is able to capture node attributes, it mainly relies on the structural information of networks when computing the scores of the community distribution over nodes [25]. To make full use of the attribute information, some methods (e.g. ARVGE [44] and DAEGC [25]) use a graph autoencoder to encode the node attributes and topological structure of a network into a compact representation. To some extent, these methods rely on adjacency matrix reconstruction to reveal community structures, which splits the intrinsic relationship between node attributes and network structure [45]. To overcome this problem, some recent GCN-based methods (e.g. GUCD [42]) use the structural information (often denoted as the adjacency matrix) to guide the aggregation of node attributes, thus effectively unifying these two kinds of information to better detect communities in networks.
However, existing GCN-based methods do not address the global information of networks because they adopt the local aggregation strategy [21]. To overcome this, DGI proposes a MI maximization strategy to embed the global information into node representations [22]. Nevertheless, the global information captured by DGI is coarse, since it maximizes MI based on the whole network. Therefore, how to develop a novel method that can capture both the fine-grained global information and node attributes is still an open problem.
Figure 2. The high-level overview of (a) deep graph infomax (DGI [22]), (b) the idea of MI maximization, and (c) subnetwork MI maximization (ours). Note that DGI maximizes the MI between the representations of nodes and the whole network, so the global information it learns is coarse and cannot reflect the status of nodes. Thus, our UCDMI aims to better exploit the global information by maximizing the MI between local node representations and more fine-grained subnetworks. Moreover, we also design a new aggregation function of GCN for learning better node representations. Based on our proposed aggregation function, a node can also directly receive the information from nodes of great value but without direct edges between them (as shown by the red message propagation lines).

The formulation of UCDMI
In this section, we first give the notations and problem definition, and then present our proposed UCDMI. Specifically, as shown in figure 2(c), there are two key parts of UCDMI: the new aggregation strategy of GCN and the novel MI maximization mechanism. The new aggregation strategy can adjust the contributions of nodes in the process of feature aggregation according to the importance between nodes. As shown in the red message propagation lines of figure 2(c), the information of features can even pass directly between valuable nodes where no direct edges exist between them. Moreover, different from existing MI maximization-based methods (e.g. DGI, which is shown in figure 2(a)), our proposed MI maximization mechanism can much better capture the global information of networks by maximizing the MI between local node representations and more fine-grained subnetworks. More differences between our approach and DGI will be described in detail later.

Notations and problem definition
We formally define an undirected and attributed network as $G = (V, E, X)$, where $V = \{v_1, \ldots, v_n\}$ is a set of nodes and $E = \{e_{ij}\}_{i,j=1}^{|V|}$ represents a set of edges. $X \in \mathbb{R}^{n \times f}$ is a feature matrix for all nodes, where $f$ is the number of features of a node. Given an attributed network $G$, the aim of community detection is to find groups in which nodes within the same group are closely connected. Formally, such groups are referred to as communities, i.e. $C = \{c_1, \ldots, c_m\}$.

Graph convolutional layer
To realize the MI maximization, we should capture the node-level information, i.e. node representations. In this work, our encoder $E_{our}$ is based on GCNs. The original propagation rule $E_{GCN}$ of GCN is defined as

$$H = E_{GCN}(X, A \mid W) = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X W\right), \tag{1}$$

where $A$ is the adjacency matrix of a network $G$ and $\hat{A} = A + I_n$ is the sum of the adjacency matrix and the identity matrix. $\hat{D}$ is a degree matrix with $\hat{D}_{i,i} = \sum_j \hat{A}_{i,j}$. $\sigma$ is the ReLU function, and $W$ is a learnable linear transformation. $H = \{h_1, \ldots, h_n\}$ denotes the matrix of node representations, in which $h_i \in \mathbb{R}^d$ is a low-dimensional vector of $v_i$ and $d$ is the number of embedding dimensions. Based on equation (1), the propagation rule of GCN essentially averages the features of neighboring nodes, since it uses $A + I_n$ to guide the aggregation, i.e. all neighboring nodes are treated equally. Besides, valuable nodes that are important to each other but not directly connected cannot share information directly. However, in the real world, the importance of different friends to a person often varies according to the positions of these friends in society, and some indirect friends may also be important. This motivates us to design an effective propagation rule that guides the aggregation according to the importance between nodes.
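As a minimal sketch of the propagation rule in equation (1), assuming small dense matrices stored as Python lists, one step could be written as follows. The helper names `matmul` and `gcn_layer` are ours for illustration and do not come from the paper; a practical implementation would use sparse tensors.

```python
import math


def matmul(P, Q):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def gcn_layer(A, X, W):
    """One GCN step: ReLU(D_hat^{-1/2} (A + I) D_hat^{-1/2} X W)."""
    n = len(A)
    # A_hat = A + I_n adds a self-loop to every node.
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # D_hat is diagonal; its entries are the row sums of A_hat.
    deg = [sum(row) for row in A_hat]
    # Symmetric normalization: entry (i, j) is divided by sqrt(d_i * d_j).
    P = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    H = matmul(matmul(P, X), W)
    # ReLU nonlinearity.
    return [[max(0.0, v) for v in row] for row in H]
```

With identity features and weights, the output is the normalized aggregation matrix itself, which makes it easy to see that the weight of a neighbor depends only on node degrees and not on how similar the two nodes are.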
To this end, we quantify the importance between nodes by measuring their Jaccard similarity, because of its empirical success in graph clustering and node classification. Specifically, given two nodes $v_i$ and $v_j$, the Jaccard similarity between them, i.e. $S_{v_i,v_j} \in S$, is calculated as

$$S_{v_i,v_j} = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}, \tag{2}$$

where $N(v_i)$ and $N(v_j)$ represent the sets of neighbors of $v_i$ and $v_j$, respectively. This similarity indicates how important two nodes are to each other, and it is used to guide the aggregation. Therefore, based on the Jaccard similarity matrix $S$, we design our encoder $E_{our}$ as

$$H = E_{our}(X, S, A \mid W) = \sigma\left(\hat{D}^{-1/2} \hat{S} \hat{D}^{-1/2} X W\right), \tag{3}$$

where $\hat{S} = A + I_n + \alpha S$, and $\alpha$ is a parameter used to balance the contribution of the Jaccard similarity.
According to the encoder we designed, a node no longer aggregates the features of neighboring nodes equally, but adopts an unequal aggregation based on the values of $\hat{S}$ instead. The information about nodes that are important to each other but not directly connected is also taken into account.
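The construction of the Jaccard similarity matrix $S$ and of $\hat{S} = A + I_n + \alpha S$ can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: neighbor sets exclude the node itself, the diagonal of $S$ is left at zero, similarities are computed for all node pairs, and the function names are hypothetical.

```python
def jaccard_matrix(A):
    """Pairwise Jaccard similarity of neighbor sets for a dense 0/1 adjacency matrix."""
    n = len(A)
    # Assumption: N(v) contains the direct neighbors of v, not v itself.
    nbrs = [{j for j in range(n) if A[i][j]} for i in range(n)]
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: self-similarity is not used
            union = nbrs[i] | nbrs[j]
            if union:
                S[i][j] = len(nbrs[i] & nbrs[j]) / len(union)
    return S


def aggregation_matrix(A, alpha):
    """Build S_hat = A + I_n + alpha * S, the matrix that guides aggregation."""
    n = len(A)
    S = jaccard_matrix(A)
    return [
        [A[i][j] + (1.0 if i == j else 0.0) + alpha * S[i][j] for j in range(n)]
        for i in range(n)
    ]
```

For a 4-node graph with edges 0-2, 0-3, 1-2 and 1-3, nodes 0 and 1 are not connected but share both neighbors, so `aggregation_matrix(A, 0.5)` gives them a nonzero weight of 0.5, while the connected pair (0, 2), which shares no neighbors, keeps weight 1.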

Mutual information maximization mechanism
Since GCN-based frameworks utilize the local propagation rule, the node representations learned by equation (3) retain only the local information of networks. Thus, we propose a new MI maximization mechanism that enables node embeddings to capture the global information of networks. It is realized in three steps: (1) finding fine-grained subnetworks that indicate potential clusters; (2) computing the global representations of these subnetworks; (3) maximizing the MI between the global representation of each subnetwork and the local representations of the nodes residing in that subnetwork. As mentioned before, unlike existing MI maximization-based methods (e.g. DGI), which define MI on the representation of the whole graph, we use more fine-grained subnetworks instead. Given a graph $G$ with the attribute matrix $X$, we aim to design a function $P(X)$ that finds subnetworks indicating potential clusters. In this paper, we adopt k-means as this function because it works best in our experiments, which can be defined as

$$P(X) = \operatorname{k-means}(X) = \{g_1, \ldots, g_m\}, \tag{4}$$

where $g_i$ is a fine-grained subnetwork. Having obtained the local node representations from equation (3) and a set of subnetworks $\{g_1, \ldots, g_m\}$, we introduce a subnetwork-level representation encoder, i.e. a readout function $\mathcal{R}: \mathbb{R}^{p \times d} \to \mathbb{R}^{d}$. To avoid notational clutter, we assume that there are $p$ nodes in each subnetwork, although the number of nodes may differ between subnetworks. Specifically, for each subnetwork $g_r$, we aim to obtain its global representation $s_r$, where $1 \le r \le m$. Formally, we define the readout function as

$$s_r = \mathcal{R}(H_r) = \sigma\left(\frac{1}{p} \sum_{i=1}^{p} h_i^r\right), \tag{5}$$

where $H_r \in \mathbb{R}^{p \times d}$ represents the node representation matrix of subnetwork $g_r$ extracted from $H$, and $h_i^r$ is the $i$th row of $H_r$. $\sigma$ denotes the logistic sigmoid nonlinearity.
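The first two steps, partitioning nodes with k-means on the attributes and summarizing each subnetwork with a readout, can be sketched as follows. This is a simplified illustration: the k-means routine is plain Lloyd iteration with caller-supplied initial centroids (`init_idx` is a hypothetical parameter), and the readout is assumed to be a sigmoid of the mean node representation, in the style of DGI.

```python
import math


def kmeans(X, init_idx, iters=20):
    """Plain Lloyd's k-means; init_idx selects the rows used as starting centroids."""
    centers = [list(X[i]) for i in init_idx]
    labels = [0] * len(X)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
            for x in X
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centers)):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def readout(H_r):
    """Summary vector of one subnetwork: sigmoid of the mean node representation."""
    p = len(H_r)
    return [1.0 / (1.0 + math.exp(-sum(col) / p)) for col in zip(*H_r)]


def subnetwork_summaries(H, labels):
    """Map each subnetwork label r to its summary vector s_r."""
    groups = {}
    for h, l in zip(H, labels):
        groups.setdefault(l, []).append(h)
    return {r: readout(rows) for r, rows in groups.items()}
```

Because the readout is computed per group, subnetworks of different sizes are handled naturally, which matches the remark that the assumption of $p$ nodes per subnetwork is only for notational convenience.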
Finally, to enable our model to capture more fine-grained global information, for each subnetwork $g_r$, $1 \le r \le m$, we maximize the MI between the local node representation patches $h_1^r, h_2^r, \ldots, h_p^r$ and the corresponding subnetwork summary representation $s_r$. For this purpose, we use a discriminator $D$ that discriminates true samples, i.e. $(h_i^r, s_r)$, from negative samples, i.e. $(\tilde{h}_j^r, s_r)$, and the final loss is defined as

$$\mathcal{L} = -\sum_{r=1}^{m} \frac{1}{2p} \left( \sum_{i=1}^{p} \log D\left(h_i^r, s_r\right) + \sum_{j=1}^{p} \log\left(1 - D\left(\tilde{h}_j^r, s_r\right)\right) \right), \tag{6}$$

where the negative node representations $\tilde{h}_j^r$ correspond to the $j$th row of $\tilde{H}_r$. $D$ is a discriminator used to score the pairs $(h_i^r, s_r)$. In this paper, a simple bilinear scoring function is utilized:

$$D\left(h_i^r, s_r\right) = \sigma\left((h_i^r)^{\top} B\, s_r\right), \tag{7}$$
where $B$ denotes a trainable scoring matrix, and $\sigma$ represents the logistic sigmoid nonlinearity. $\tilde{H}_r$ is extracted from $\tilde{H}$ in the same way as $H_r$ is extracted from $H$. To construct the negative node representation matrix $\tilde{H}$, we first shuffle the original attribute matrix $X$ row-wise to generate the corrupted network, i.e. $X \to \tilde{X}$. Then, we reuse the encoder defined in equation (3) to generate $\tilde{H}$, that is, $\tilde{H} = E_{our}(\tilde{X}, S, A \mid W)$. Notably, our MI maximization mechanism amounts to the binary cross-entropy loss shown in equation (6). This equivalence has been proved in previous work [22] and also holds for our mechanism, although the details of our model differ. By minimizing the loss of equation (6) with the designed GCN encoder, i.e. equation (3), we obtain node representations that capture the fine-grained global information of networks. The community structures are then identified by directly applying a clustering algorithm, such as k-means, to these node representations. Algorithm 1 shows the pseudocode of UCDMI.
Algorithm 1. The UCDMI algorithm.
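The corruption step, the bilinear discriminator of equation (7), and a binary cross-entropy loss over positive and negative pairs can be sketched as follows. This is a hedged illustration and not the authors' code: the function names are hypothetical, the loss is averaged uniformly over all pairs (the exact normalization is an assumption), and gradient-based training of $W$ and $B$ is omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discriminator(h, B, s):
    """Bilinear score sigma(h^T B s) for a node vector h and a summary vector s."""
    Bs = [sum(B[a][b] * s[b] for b in range(len(s))) for a in range(len(h))]
    return sigmoid(sum(h[a] * Bs[a] for a in range(len(h))))


def corrupt(X, rng):
    """Row-wise shuffle of the attribute matrix; the graph structure is untouched."""
    X_tilde = [list(row) for row in X]
    rng.shuffle(X_tilde)
    return X_tilde


def subnetwork_bce_loss(H, H_tilde, labels, summaries, B):
    """Binary cross-entropy over (node, own-subnetwork summary) pairs.

    H holds the true node representations, H_tilde those encoded from the
    corrupted attributes, labels[i] is the subnetwork of node i, and
    summaries[r] is the summary vector s_r.
    Assumption: uniform averaging over all positive and negative pairs.
    """
    total, count = 0.0, 0
    for h, h_neg, r in zip(H, H_tilde, labels):
        s = summaries[r]
        total -= math.log(discriminator(h, B, s))            # positive pair
        total -= math.log(1.0 - discriminator(h_neg, B, s))  # negative pair
        count += 2
    return total / count
```

With an all-zero scoring matrix every score is 0.5 and the loss equals $\log 2$, the value of an uninformative discriminator; training should push the loss below this level.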

Complexity analysis
The time complexity of UCDMI depends on the convolution operation in GCN and the calculation of the Jaccard similarity. Concretely, according to [46], the time complexity of the convolution in GCNs is $O(|E|)$, i.e. it increases linearly with the number of edges. As for the Jaccard similarity, comparing two sets with $n$ elements each has complexity $O(n \log n)$. To calculate the similarity of nodes in complex networks, we measure the similarity between each target node and its neighboring nodes. Let $d_{avg}$ denote the average degree of a node. Each such pair then costs $O(d_{avg} \log d_{avg})$, and there are $|E|$ pairs, so the calculation of the Jaccard similarity has complexity $O(|E|\, d_{avg} \log d_{avg})$, where $|E|$ represents the number of edges. Thus, the overall time complexity of UCDMI is $O(|E|\, d_{avg} \log d_{avg})$.
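To make the counting argument concrete, the following sketch computes the Jaccard similarity only for adjacent node pairs, using sorted neighbor lists and a merge-style intersection. It assumes a simple undirected graph given as an edge list without duplicates or self-loops, and the function names are illustrative.

```python
def sorted_intersection_size(a, b):
    """Size of the intersection of two sorted lists via a linear merge."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def edge_jaccard(edges, n):
    """Jaccard similarity of neighbor sets for every edge (u, v)."""
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    # Sorting a neighbor list costs O(d log d) for a node of degree d.
    for lst in nbrs:
        lst.sort()
    sims = {}
    # One pass over the |E| edges; each intersection is linear in the degrees.
    for u, v in edges:
        inter = sorted_intersection_size(nbrs[u], nbrs[v])
        union = len(nbrs[u]) + len(nbrs[v]) - inter
        sims[(u, v)] = inter / union
    return sims
```

Sorting is done once per node here, so this variant is no worse than the $O(|E|\, d_{avg} \log d_{avg})$ bound stated above.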

Experiments
In this section, we first introduce the datasets, baseline methods, and parameter setting involved in this paper. Then, we compare the performance of our proposed UCDMI with other baseline methods on community detection. Moreover, we also provide some further investigation, i.e. ablation study, parameter analysis, and visualization to understand the effectiveness of the proposed strategies.

Datasets and evaluation metrics
Datasets
Several widely used and standard networks are analyzed to verify the effectiveness of our proposed method. A summary of these networks is presented in table 1, and the detailed information of these networks is introduced as follows.
Polbooks network: The Polbooks network reflects the sales relationship of political books on Amazon, where the nodes represent the books sold on Amazon, and an edge between two nodes means that the two books were purchased by the same customer. Since the attributes of nodes in this network are not provided, we use vectors extracted from the adjacency matrix as the attributes.
Cora network: The Cora network is built on the citation relationship between machine learning papers, where the nodes represent published papers, and edges denote the citation relationship between two papers.
Citeseer network: The Citeseer network is also built on the citation relationship between scientific papers, where the nodes represent published papers, and edges denote the citation relationship between two papers.
Pubmed network: The Pubmed network describes the citation relationships between diabetes-related scientific papers in the Pubmed database, where the nodes represent published papers, and edges denote the citation relationship between two papers.
Parliament network: The Parliament network is a bill co-signer network where the nodes denote French parliament members. If two parliament members jointly sign a bill, there is an edge between them.
Wiki network: The Wiki network describes the links between webpages, where the nodes represent webpages and edges denote the links between webpages.
Synthetic network: The Synthetic network is the benchmark generated by Elhadi and Agam [47], and all the parameter settings come from [48].

Evaluation metrics
To evaluate the performance of UCDMI and the baselines in community detection, we adopt three typical and widely used metrics: normalized mutual information (NMI), clustering accuracy (ACC), and macro F1-score (F1). The higher the value of these metrics, the better the performance of community detection.
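For reference, the three metrics can be computed as in the sketch below. Several choices here are assumptions, since the text does not fix them: NMI is normalized by the arithmetic mean of the two entropies, predicted cluster labels are matched to ground-truth labels by brute-force search over permutations (practical only for a handful of communities, and requiring no more predicted clusters than true classes), and macro F1 is computed after that matching.

```python
import math
from collections import Counter
from itertools import permutations


def nmi(true, pred):
    """Normalized mutual information with arithmetic-mean normalization."""
    n = len(true)
    ct, cp, joint = Counter(true), Counter(pred), Counter(zip(true, pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    h_true = -sum(c / n * math.log(c / n) for c in ct.values())
    h_pred = -sum(c / n * math.log(c / n) for c in cp.values())
    return mi / ((h_true + h_pred) / 2) if h_true + h_pred > 0 else 1.0


def best_mapping(true, pred):
    """Map predicted cluster ids to true labels so that the number of matches is maximal."""
    t_labels, p_labels = sorted(set(true)), sorted(set(pred))
    best, best_hits = None, -1
    for perm in permutations(t_labels, len(p_labels)):
        mapping = dict(zip(p_labels, perm))
        hits = sum(mapping[p] == t for t, p in zip(true, pred))
        if hits > best_hits:
            best, best_hits = mapping, hits
    return best


def acc_and_macro_f1(true, pred):
    """Clustering accuracy and macro F1 after the best label matching."""
    mapping = best_mapping(true, pred)
    mapped = [mapping[p] for p in pred]
    acc = sum(t == m for t, m in zip(true, mapped)) / len(true)
    f1_scores = []
    for c in sorted(set(true)):
        tp = sum(t == c and m == c for t, m in zip(true, mapped))
        fp = sum(t != c and m == c for t, m in zip(true, mapped))
        fn = sum(t == c and m != c for t, m in zip(true, mapped))
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return acc, sum(f1_scores) / len(f1_scores)
```

All three scores are invariant to how the detected communities are numbered, which is why the label matching step is needed before ACC and F1 can be computed.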

Parameter setting and baselines
In UCDMI, taking efficiency into account, the node representation dimension $d$ is set to 128, and the number of layers in the GCN is set to 1. The parameter $\alpha$, which balances the contribution of the proposed aggregation strategy, is chosen from $\{0.2, 0.3, 0.5\}$. The parameter analysis is given in section 4.5.
To fairly evaluate the performance of UCDMI on community detection in attributed networks, we compare our proposed method with three kinds of community detection methods, and the parameters of these methods are consistent with the source papers. Concretely, there are methods that only use node attributes, methods that only use network structure, and methods that use both. We list these baseline methods as follows.

Results of community detection
The NMI, ACC, and F1-score of UCDMI and the baseline methods on benchmark networks are reported in tables 2 and 3. There are several key observations. Compared with baseline methods, UCDMI achieves relatively good results on all experimental datasets. Specifically, compared to methods that use either node attributes or network structure alone, UCDMI achieves an obvious improvement. This demonstrates the usefulness of both kinds of network information in community detection, and the rationality of UCDMI in using them simultaneously to learn high-quality node representations. Moreover, as for methods that use both the network structure and attributes, we find that UCDMI consistently outperforms them. The main reason is that UCDMI can better utilize the information of networks. In particular, ARVGE only captures the information of 2-hop neighboring nodes, while UCDMI can exploit the global information without being restricted to k-order neighbors. Besides, UCDMI achieves better performance than DGI, which also considers the global information by utilizing the strategy of MI maximization. This is because DGI implements such maximization based on the whole graph, which ignores the finer substructures of networks and learns relatively coarse node representations. In contrast, UCDMI deals with finer substructures by implementing the MI maximization based on more fine-grained subnetworks, which enables it to learn better node representations and improves the performance of community detection.
Figure 4. The comparison of community structures detected by our UCDMI and DGI on Polbooks. Different shapes of nodes denote different ground-truth communities, and communities detected by UCDMI and DGI are colored differently. UCDMI is able to better detect the community structure in the network, while DGI additionally divides $v_5$, $v_{50}$, $v_{67}$, and $v_{85}$ into incorrect communities.

Ablation study
Two components, the Jaccard similarity and the new MI maximization mechanism, play significant roles in helping our model learn better node representations and detect high-quality community structures of networks. Thus, we carry out an ablation study to evaluate the contributions of these two components. In particular, 'basic' means the model that implements MI optimization based on the whole network, '+Ja' denotes adding the Jaccard similarity to the model, and '+Fg' represents the adoption of the new MI maximization mechanism based on the fine-grained subnetworks.
As shown in table 4, we find that using the Jaccard similarity to evaluate the importance between nodes for guiding the aggregation clearly improves the performance of community detection. This reveals that it is necessary to adjust the weight of nodes in the aggregation of features according to their importance. Moreover, by comparing the penultimate row and the second row of table 4, we find that the proposed new MI maximization mechanism performs better than the existing MI maximization strategy. This demonstrates the effectiveness of maximizing the MI between local node representations and more fine-grained subnetwork representation.

Parameters analysis
Next, we analyze the influence of two key parameters in our proposed method: the embedding dimension (controlled by $d$) and the coefficient of the Jaccard similarity (controlled by $\alpha$). As shown in figures 3(a)-(c), the experimental performance improves as the dimension $d$ increases. Moreover, from figures 3(d)-(f), we observe that the performance obtained for $\alpha > 0$ is always better than that for $\alpha = 0$. This further demonstrates the effectiveness of using the Jaccard similarity to guide the aggregation. We also find a slight decline in performance when $\alpha$ is increased beyond a certain value. After a careful investigation, we found that increasing the value of $\alpha$ magnifies the differences between neighboring nodes. When the difference is magnified to a certain extent, the information of some neighboring nodes is ignored, which leads to a performance decline. This indicates that our proposed aggregation strategy can effectively distinguish the importance of different neighboring nodes.

Visualization
To better understand the effectiveness of our proposed UCDMI, we visualize the community detection results of UCDMI and DGI on Polbooks, respectively. Based on the comparison of results between UCDMI and DGI, we expect to demonstrate the effectiveness of the proposed aggregation strategy and the new MI maximization mechanism. The results of visualizations are presented in figure 4, where the ground-truth communities are denoted by different shapes, while the communities detected by experiments are represented by different colors.
As we can see from figure 4, UCDMI detects three communities in the Polbooks network. It is worth noticing that two communities, colored red and yellow, are detected almost correctly by UCDMI. In comparison, four nodes, i.e. $v_5$, $v_{50}$, $v_{67}$, and $v_{85}$, are assigned to incorrect communities by DGI. After careful investigation, we find that these nodes reside at the edge of the communities (called bridge nodes). Such nodes, e.g. $v_{67}$, often have higher Jaccard similarity to nodes in the community to which they belong. Thus, our proposed UCDMI, which uses the Jaccard similarity to guide the aggregation, can effectively utilize this characteristic and detect high-quality communities. Besides, we find that UCDMI detects communities in which nodes within the same community are closely connected, which is consistent with the definition of the community structure. We can conclude that the proposed aggregation strategy and the new MI maximization mechanism are able to preserve more fine-grained global information, e.g. the information of community structures, and are suitable for community detection.

Conclusion
In this paper, we propose a novel method based on MI maximization for community detection in attributed networks. To increase the contribution of more important nodes in the aggregation process, we design a new aggregation function for GCN that distinguishes the importance between nodes by measuring their Jaccard similarity. To exploit the fine-grained global information of networks, we develop a novel MI maximization mechanism that maximizes the MI between the local node representations and global subnetwork representations. Experiments demonstrate the effectiveness of our proposed method in community detection, and it achieves better results than state-of-the-art baseline methods.