Optimization of identifiability for efficient community detection

Many physical and social systems are best described by networks. And the structural properties of these networks often critically determine the properties and function of the resulting mathematical models. An important method to infer the correlations between topology and function is the detection of community structure, which plays a key role in the analysis, design, and optimization of many complex systems. The nonnegative matrix factorization has been used prolifically to that effect in recent years, although it cannot guarantee balanced partitions, and it also does not allow a proactive computation of the number of communities in a network. This indicates that the nonnegative matrix factorization does not satisfy all the nonnegative low-rank approximation conditions. Here we show how to resolve this important open problem by optimizing the identifiability of community structure. We propose a new form of nonnegative matrix decomposition and a probabilistic surrogate learning function that can be solved according to the majorization–minimization principle. Extensive in silico tests on artificial and real-world data demonstrate the efficient performance in community detection, regardless of the size and complexity of the network.

Existing methods for detecting network communities mainly rely on the optimization of quality functions that are defined according to the similarity between different nodes. Nonnegative matrix factorization is a mainstream technology for detecting network communities [39]. The basic protocol of the nonnegative matrix factorization methods uses nonnegative low-rank approximations to factorize a node-to-node similarity matrix into node-to-cluster low-rank matrices. Ding et al [40] verified the approximate equivalence between the nonnegative matrix factorization and the seminal k-means clustering.
However, the nonnegative matrix factorization methods do not satisfy all of the nonnegative low-rank approximation conditions, so that they may lead to unbalanced network partitions in which some communities are assigned with much more nodes than others [39]. Besides, nonnegative matrix factorization methods are limited to account for curved manifolds and to proactively compute the optimal number of communities [41].
To resolve these limitations, we introduce the identifiability of network communities as the objective of optimization, which is defined as the discrepancy between the network similarity matrix and the doubly-stochastically normalized community incidence matrix. This objective function is solvable for all possible community partitions, and the optimal number of communities can also be determined proactively. To deal with sparse similarity inputs, we use the Kullback-Leibler divergence instead of the conventional distance-based measures. We also propose a new form of nonnegative matrix factorization to explore the solution space, which yields a probabilistic objective function that can be solved by an efficient algorithm using the majorization-minimization principle. We perform extensive experiments on multiple types of datasets, showing that our approach substantially improves the accuracy and computational efficiency of community detection, regardless of the size and complexity of the network.

Identifiability
Consider a network with N nodes that form r communities. The operation of assigning nodes into each community is described by an N × r binary community indicator matrixW, in which each entryW ik = 1 (or 0) indicates that the node i is (not) assigned to the community k. The dot product of the community indicator matrix is defined as the community incidence matrixB =WW T , in which each entryB ij = 1 (or 0) indicates that the nodes i and j are (not) assigned to the same community. To reduce the unbalanced partitions, it is more convenient to use the normalized community indicator matrix W with each entry [40,43]. Let S be the similarity matrix, in which each entry S ij accounts for the probability of similarity between nodes i, j. For example, the similarity of two nodes in the protein-protein interaction networks can be quantified by the Hamming distance between their amino acid sequences [1]. The probability of similarity is expected to be higher for the nodes within the same community than those part in different communities, so that the normalized community incidence matrix B converges to the similarity matrix S when nodes are accurately assigned to each community.
We can also interpret matrix B in a probabilistic form. Denote W ik = P(i|k) as the conditional probability that any node i belongs to a given community k, and P(k|i) the conditional probability that any community k has a given node i. Considering an uninformative prior for selecting nodes, i.e. P(i) = 1/N, we have according to the Bayes' rule. The probability that two nodes i, j mutually share at least one community is k is: The identifiability of communities D(S B) is defined as the discrepancy between the similarity matrix S and the normalized community incidence matrix B, so that the reduction in the identifiability indicates the improvement of the community partition (figure 1). Let C(S|r) = min (B|r) D(S B) be the optimized (or minimized) identifiability, given that the original network is partitioned into r communities. Evaluating C(S|r) over all possible number of communities r (1 r N), we can jointly estimate the optimal number and partition of network communities. The optimization of C(S|r) often need to search a sparse solution space [43][44][45], so we propose a new framework for minimizing the identifiability.

Optimization of identifiability via matrix decomposition
In this section, we first formulate the identifiability using the Kullback-Leibler divergence, and then introduce a multiplicative minimization algorithm that efficiently optimizes the identifiability. Illustration of optimizing the identifiability for detecting network communities. Each element of the similarity matrix S accounts for the probability of similarity between two nodes, and each element of the community incidence matrix B accounts for the probability that two nodes i, j mutually share at least one community. The identifiability of communities D KL (S B) is defined as the Kullback-Leibler divergence between the similarity matrix S and the community incidence matrix B. The reduction in the identifiability indicates the improvement of the community partition. Denote r as the number of partitioned communities for the original network, the optimal number and partition of network communities are jointly estimated by evaluating C(S|r) = min D KL (S B) across all possible values of r (1 r N).

Kullback-Leibler divergence
The use of simple discrepancy measures such as the Euclidean distance or Hamming distance does not work well for many real-world problems that only contain weakly informative data. Inspired by the seminal work of probabilistic latent semantic indexing [46], we resort to use the Kullback-Leibler divergence to measure the discrepancy (or approximation error) between the dense normalized community incidence matrix B and the sparse node similarity matrix S. As such, the optimization problem is summarized as follows: where the normalized community incidence B ij is estimated with the relative probability of randomly selecting two nodes i, j that belong to the same community k, and nonnegative W ik = P(k|i) denotes the probability that a given node i belongs to the community k. We use −S ij + B ij to measure the discrepancy between these two different matrices, since we can naturally consider that the community incidence matrix B should be close to the similarly matrix S for a good community partition. By eliminating all the constant terms (e.g. N i=1 N j=1 B ij = N) from the objective function of equation (3), the optimization is transformed to the maximization of N i=1 N j=1 S ij log B ij . It offers several advantages to improve the efficiency of optimization: (1) only non-zero elements in the similarity matrix S need to be evaluated by the objective and gradient functions; and (2) the logarithmic functions in the low-rank community incidence matrix B only need additive calculations. Therefore, we use a Dirichlet prior to lower the computational complexity (section 3.2) and use a convex-concave procedure [47] for the optimization (section 3.3).

Regularization
In equation (3), each row of matrix W is normalized. Suppose that the elements of each row in W follow a Dirichlet distribution [46], the complexity of matrix W can be constrained with log-Dirichlet prior, i.e., N i=1 r k=1 log W ik . Here, the complexity of a matrix also means the sparsity and regularity, and less the log-Dirichlet prior, less the complexity of a matrix. Therefore, we use the following regularized cost function: If all similarities S ij are integer-valued, the objective function of equation (3) can be the log-likelihood of generative models as follows: (1) drawing the rows of W according to a uniform Dirichlet distribution with parameter α; and (2) for t = 1, . . . , T, add one to each entry W ij ∼ multinomial ( 1  N B, 1). If α = 1, the Dirichlet prior disappears; whereas if α > 1, by smoothing each entry of matrix W, the prior provides a wider relaxation that is usually desired in preliminary stages of W learning. In this paper, we employ the Dirichlet prior only to simplify the optimization steps, without change the control of KL divergence.

Optimization
We implement the optimization using nonnegative matrix factorization with multiplicative updating rule [40]. We first split the gradient of the objective function J with respect to the nonnegative matrix W into two nonnegative components: We then iterate the optimization using the following multiplicative updating rule: which retains the nonnegativity of matrix W and does not need to adjust the step size per iteration. Since the optimization monotonically reduces the objective function J , the multiplicative updating rule can warrant the convergence of W. To deal with the row-based probability constraint on W, we introduce a Lagrangian multiplier {λ i , i = 1, . . . , N} which is subject to the following constraint: This improves the multiplicative update rule as follows: in which the gradient of the objective function J is split as: where s k = N v=1 W vk , and the ratio Z ij = S ij /B ij only calculates non-zero entries of S, without requirement to specify the entire matrix B. Given k W ik = 1, we have where Combining equation (11) and equation (8), we have To retain the positivity of W, we add the term b i to both the denominator and numerator of equation (12), which gives the following updating rule: This algorithm warrants the convergence of the optimization because the cost function J can be monotonically decreased in each iteration. We validate this point using the majorization-minimization principle in the following. W and W distinguish the current estimate and variable, respectively.
(1) Majorization are constants that are independent of the variable W. The first two inequalities depend on the convexity and concavity of the logarithmic functions. By further adding the same constant 1 a i + α W ik to both the denominator and numerator, the third inequality warrants the positivity of each entry in the renewed matrix, because x 1 + log x when x > 0. The upper boundaries mentioned above are tight at W = W, i.e. G(W, W) = L(W, λ).
(2) Minimization Since the gradient is reduced to zero when the cost function converges, we have where W new ik denotes the updated result. The above discussions verify that L(W new , λ) G(W new , W) L(W, λ). This indicates that our framework reduces each row of W as a probabilistic simplex, in which b i accounts for the sum of the rows of unconstrained multiplicative results, and a i accounts for the balance between the probabilistic simplex attraction and the gradient learning force.

Initialization and optimal number of communities
Our framework can start from any initial partition of communities. Given a randomly initialized community indicator matrix W, we provide a tiny positive perturbation (e.g. 0.05) to each entry of W before the optimization (i.e. with α = 1). The regularized condition (i.e. with various α = 1) can also be used to offer non-regularized initialization (i.e. with α = 1). Therefore, the parameter α is only used at the initialization, with its optimal value determined by the minimum of D(S B) (i.e. the optimal partition of communities). In addition, we set the similarity matrix S as the adjacent matrix A, which is sufficient to extract the reasonable community partitions.
The number of partitioned communities r can be treated as an independent variable, whose optimal value can be calculated proactively. Specifically, we evaluate the residual D(S B) with W discretized to the community indicator matrix against each possible value of r, and then choose the optimal number of communities r that lead to the lowest residual D(S B). We provide the pseudo-code of this algorithm inthe Supplementary Material.

Results
In this section, extensive experiments on multiple types of dateset are applied to demonstrate the high-level performance and efficiency of the proposed method.
We first consider the seminal Lancichinetti-Fortunato-Radicchi benchmark networks, which comprises a series of synthesized network models that maintain several major structural features observed in real-world networks [42,48]. We use the following set of parameters to generate the synthesized network models: the average node degree k = 10, maximum node degree k max = 50, minimum community size min c = 100, and maximum community size max c = 300. We examine several typical community configurations by varying the network size and mixing parameter, as specified in figure 2 We apply our method and 5 different well-known algorithms to the benchmark networks and evaluate the resultant quality of detected communities using normalized mutual information [48]. The normalized mutual information computes the accuracy in detecting the true communities: the higher the normalized mutual information, the better the performance of community partition. Figure 2 demonstrates the improved performance of our method.

Case study with political book co-purchasing network
We further apply our method to the famous co-purchasing network of Amazon political book [49] to show the detection of community correlation patterns. In the political book co-purchasing network, the nodes represent the books sold by the Amazon online website, and edges represent the pairs of books frequently co-purchased. More detailed data description can be found at [49]. As in figure 3(a), our method partitions the network into a 2 level hierarchy, with seven nonparametric and two parametric communities. Figure 3(b) shows the community correlation graph at the first level, which reveals two hubs (communities 2 and 7), an outlier (community 4), and two wings (communities Table 1. Structural properties and community partitions of six real-world large-scale networks obtained from the Stanford Network Analysis Platform [50]. The quality of community partitions is compared using the normalized mutual information (NMI), the modularity (Q) and the variation of information (VOI) with perturbation probability 0.3. Here, N and M represent the number of nodes and edges, R g and R m represent the number of communities obtained from the ground-truth and from our method, respectively. 1-3 vs communities 5-7).The two wings of books are linked by the two hub communities.The books in each wing share the same labels, so that they have a strong correlation even if they may never be purchased together. Figure 3(c) shows the two macroscopic modules at the second level, which is obtained according to the types of books. This indicates that the customers of Amazon are more likely to purchase the books attached with the same labels (about 24%), and they rarely or never purchase books attached with different labels (less than 1%). Based on the above analysis, we can build a correlation strength matrix C, in which each entry C ks calculates the coarse-grained relationship (i.e. the ratio of interconnect edges) between first-level communities k and s. The correlation strengths of each community can be ranked to guide the strategic planning for booksellers. Because of the high correlation among communities 1-3, the storage of books in these communities can be increased together if the purchasing of books in community 2 is increasing. In addition, since the community 2 is segregated from communities 5 and 7, the storage of books in the communities 5 and 7 can be reduced, because they are never co-purchased with the books in community 2.

Scalability of the algorithm
To further validate the efficiency of our method, we apply it to six large-scale real-world data-sets obtained from the Stanford Network Analysis Platform [50]: (1) the Amazon product network in which two products (nodes) are connected together if they are co-purchased frequently; (2) the coauthor cooperation network Dblp, in which two scientists are connected by an edge if they have coauthored a paper; and 4 social networks from the following sources (3) Youtube, (4) Livejournal, (5) Orkut, (6) Friendster. The ground truth partition is known for each of these networks. Table 1 presents the key structural properties and the community partition using our method for these networks. We use the modularity [1], the normalized mutual information [42] and the variation of information (VOI) to evaluate the performance. Since our results are very close to the ground-truth results (real communities) for almost all tested networks, our method is satisfactory to deal with large-scale networks with millions of nodes and edges.

Conclusions
In summary, we have developed a new community detection method, using nonnegative low-rank approximations. We have two major contributions: (1) an efficient definition of identifiability is introduced, which can be used to estimate the optimal number and partition of communities simultaneously; and (2) a new relaxed formulation with low-rank doubly stochastic matrix decomposition and corresponding multiplicative majorization-minimization algorithm are proposed, which allows high efficiency optimization. We use extensive in multiple types of experiments to demonstrate the near optimal performance of our method in dealing with a wide range of networks. We also show the scalability of our method using real world large-scale datasets with billions of nodes.