Bayesian Community Detection

We introduce a Bayesian estimator of the underlying class structure in the stochastic block model, when the number of classes is known. The estimator is the posterior mode corresponding to a Dirichlet prior on the class proportions, a generalized Bernoulli prior on the class labels, and a beta prior on the edge probabilities. We show that this estimator is strongly consistent when the expected degree is at least of order $\log^2{n}$, where $n$ is the number of nodes in the network.


Introduction
The stochastic block model (SBM) (Holland, Laskey and Leinhardt, 1983) is a model for network data in which individual nodes are considered members of classes or communities, and the probability of a connection occurring between two individuals depends solely on their class membership. It has been applied to social, biological and communication networks, for example in Park and Bader (2012), Bickel and Chen (2009) and Snijders and Nowicki (1997) amongst many others. There are many extensions of the SBM for various applications, including the degree-corrected SBM (Karrer and Newman, 2011;Zhao, Levina and Zhu, 2012) which accounts for possible heterogeneity among nodes within the same class, and the mixed-membership SBM (Airoldi et al., 2008), in which the assumption that the classes are disjoint is removed. These extensions allow for additional modelling flexibility.
Two main SBM research directions are the recovery of the class labels (community detection) and recovery of the remaining model parameters, consisting of the probability vector generating the class labels, and the class-dependent probabilities of creating an edge between nodes. In this paper, we focus on community detection, noting that once strong consistency of a community detection method has been established, consistency of the natural plug-in estimators for the remaining parameters follows directly by results in (Channarond, Daudin and Robin, 2012).
A large number of methods for recovering the class labels has been proposed. Those most closely related to this work are the modularities. Newman and Girvan (2004) introduced the term modularity for 'a measure of the quality of a particular division of a network'. They described one such measure for models in which edges are more likely to occur within classes than between classes, in which case there is a community structure in the colloquial sense, although the SBM does not require this assumption. Bickel and Chen (2009) studied more general modularities, defining them as functions of the number of connections between all combinations of classes and the proportion of nodes placed in each class. They introduced the likelihood modularity, and provided general conditions under which modularities are consistent. Their method and theory was extended to the degree-corrected SBM by Zhao, Levina and Zhu (2012).
2. Given Z = (Z 1 , . . . , Z n ), the edges are independently generated as Bernoulli variables with P(A ij = 1 | Z) = P Zi,Zj , for i < j, for a given K × K symmetric matrix P = (P ab ).
The probability vector π is considered fixed, but unknown. Although this is not visible in the notation, the matrix P may change with n, a case of particular interest being that P tends to zero, which gives a sparse graph. The order of magnitude of P ∞ = max a,b P ab is the same as the order of magnitude of ρ n = a,b π a π b P ab , the probability of there being an edge between two randomly selected nodes. The expected degree of a randomly selected node is λ n = (n − 1)ρ n , and twice the expected total number of edges in the network is µ n = n(n − 1)ρ n .
The likelihood for the model is given by where O ab (Z) is the number of edges between nodes labelled a and b by the labelling Z, n ab (Z) is the maximum number of edges that can be created between nodes labelled a and b, and n a (Z) is the number of nodes labelled a, and a and b range over {1, 2, . . . , K}.
More formally, for a given labelling e = (e 1 , . . . , e n ) ∈ {1, . . . , K} n of nodes, and class labels a, b ∈ {1, . . . , K}, we define Since the matrix A is symmetric with zero diagonal by assumption, for a = b the variable O ab (e) can also be written as i<j A ij [1 {ei=a,ej =b} + 1 {ej =a,ei=b} ], which explains the different appearances of the diagonal and off-diagonal entries. The numbers n ab (e) are equal to the numbers O ab (e) when all A ij are equal to 1. We collect the variables O ab (e) and n ab (e) in K × K matrices O(e) and n(e). Now consider the K × K probability matrix R(e, c) and K probability vector f (e) with entries R ab (e, c) = 1 n n i=1 1 {ei=a,ci=b} , f a (e) = n a (e) n . (2) The row sums of R(e, c) are equal to R(e, c)1 = f (e), while the column sums are equal to 1 T R(e, c) = f (c) T . Thus, the matrix R(e, c) can be seen as a coupling of the marginal probability vectors f (e) and f (c). If e = c, then it is diagonal with diagonal f (c) = f (e). More generally, the matrix can be viewed as measuring the discrepancy between labellings e and c. This can be precisely measured as half the L 1 -distance of R(e, c) to its diagonal, as evidenced by Lemma 2.1, which is noted in Bickel and Chen (2009). For a vector v we denote by Diag(v) the diagonal matrix with diagonal v, and for a matrix M we denote its diagonal by diag (M ).
Lemma 2.1. For every labelling c, e in the K-class stochastic block model: Proof. The diagonal of R(e, c) gives the fractions of labels on which c and e agree. Hence the left side of the lemma is 1 − a R aa (e, c) = a (f a (c) − R aa (c)) . The elements of both K × K matrices Diag(f (c)) and R(e, c) can be viewed as probabilities that add up to 1. Thus the sum of the differences of the diagonal elements is minus the sum of the differences of the offdiagonal elements. Because f a (c) ≥ R aa (e, c) for every a, we have a (f a (c) − R aa (e, c)) = a |f a (c) − R aa (e, c)|. Similarly the off-diagonal elements of Diag(f (c)), which are zero, are smaller than the off-diagonal elements of R(e, c) and hence we can add absolute values. Thus the sum over the diagonal is half the sum of the absolute values of all terms in Diag(f (c))−R(e, c).

Bayesian Approach to Community Detection
Our main results are presented in this section. We first discuss the choice of prior in Section 3.1, and define the estimator, in Section 3.2. The resulting Bayesian modularity is closely related to the likelihood modularity of Bickel and Chen (2009). The relationship is clarified in Section 3.3. We briefly consider the issue of identifiability in the SBM in Section 3.4, and conclude with our main theorem on the strong consistency of the Bayesian modularity in Section 3.5.

The prior
We adopt the Bayesian approach of Nowicki and Snijders (2001). We put prior distributions on the parameters of the stochastic block model with K known, the vector π and the matrix P , yielding a joint probability distribution of (A, Z, π, P ). Next we marginalize over π and P as in McDaid et al. (2013), leading to a joint distribution of (A, Z). Finally we "estimate" the unobserved vector Z by the posterior mode of the conditional distribution of Z given A. From a frequentist point of view this means that Z is treated as a parameter of the problem, equipped with a hierarchical prior that chooses first π and then Z. Accordingly we shall change notation from Z to e, reserving Z for the frequentist description of the stochastic block model in Section 2.
The prior on π is a Dirichlet, and independently the P ab for a ≤ b receive independent beta priors: π ∼ Dir(α, . . . , α), This is essentially the same set-up as in Nowicki and Snijders (2001) and McDaid et al. (2013), except that we use a more flexible Beta(β 1 , β 2 ) instead of a uniform prior on the P ab . We assume α, β 1 , β 2 > 0.
Abusing notation we write p(e), p(A | e) and p(e | A) for marginal and conditional probability density functions.

The Bayesian modularity
The Bayesian estimator of the class labels will be the posterior mode, that is: The posterior mode can be interpreted as a modularity-based estimator in the sense of Bickel and Chen (2009), in that it maximizes a function that only depends on the O ab (e) and the n a (e). This can be seen from the joint density of (A, e), which is found by marginalizing the likelihood (1) over π and P . The conjugacy between the multinomial and Dirichlet distributions gives the marginal density of the class assignment e as: Here the integral is relative to the Lebesgue measure on the K-dimensional unit simplex and D(α) = Γ(α) K /Γ(Kα) is the norming constant for the Dirichlet density. Similarly the conjugacy between the Bernoulli and Beta distributions gives the marginal conditional density of A given e as: where B(x, y) = Γ(x)Γ(y)/Γ(x + y) is the beta-function. The joint density of A and e is given by the product of (3) and (4), and n −2 times its logarithm is up to a constant that is free of e equal to This is a modularity in the sense of Bickel and Chen (2009), which we define as the Bayesian modularity. As p(e | A) is proportional to p(e, A), the posterior mode is equal to the class assignment that maximizes the Bayesian modularity, so the Bayesian estimator is equal to:

Similarity to the likelihood modularity
The Bayesian modularity Q B (e) consists of a two parts, originating from the likelihood and the prior on the classification, respectively. The first part is close to the likelihood modularity given by . This criterion, obtained in Bickel and Chen (2009), results from replacing in the log conditional likelihood of A given e (the logarithm of (1) with Z replaced by e and discarding the term involving the parameters π a ) the parameters P ab by their maximum likelihood estimatorsP ab = O ab (e)/n ab (e). In other words, the parameters are profiled out rather than integrated out as for the Bayesian modularity. The corresponding estimator is consistent, and hence one may hope that the Bayesian estimator can be proved consistent by showing that the Bayesian and likelihood modularities are close. This will indeed be our line of approach, but the execution must be done with care. For instance, the second, prior part of the Bayesian modularity does play a role in the proof of strong consistency, although it is negligible when proving weak consistency.
The following lemma links the Bayesian and likelihood modularities.
Consequently max e∈E Q B (e) − Q M L (e) = O log n/n .

Identifiability and consistency
A classification e is said to be weakly consistent if the fraction of misclassified nodes tends to zero (partial recovery), and strongly consistent if the probability of misclassifying any of the nodes tends to zero (exact recovery). In defining consistency in a precise manner, the complication of the possible unidentifiability of the labels needs to be dealt with. From the observed data A we can at best recover the partition of the n nodes in the K classes with equal labels Z i , but not the values Z 1 , . . . , Z n of the labels, in the set {1, 2, . . . , K}, attached to the classes. Thus consistency will be up to a permutation of labels.
To make this precise define, for a given permutation (1, . . . , K) → (σ(1), . . . , σ(K)), the permutation matrix P σ as the matrix with rows for e 1 , . . . , e K the unit vectors in R K . Then pre-multiplication of a matrix by P σ permutes the rows, and post-multiplication by P T σ the columns: P σ R is the matrix with jth row equal to the σ(j)th row of R, and RP T σ is the matrix with jth column the σ(j)th column of R. Thus P σ R(e, Z) is the matrix that would result if we would permute the labels of the classes of the assignment e, and P σ P P T σ and P σ R(e, Z)P T σ are the matrices that would result if we would relabel the classes throughout. Since we cannot recover the labels, the matrix P σ R(e, Z) is just as good or bad as R(e, Z) for measuring discrepancy between a labelling e and the true labelling Z; furthermore, nothing should change if we choose different names for the classes.
Thus, taking into account the unidentifiability of the labels, by Lemma 2.1, an estimator e is weakly consistent if P σ R( e, Z) − Diag(f (Z)) 1 → 0, for some permutation matrix P σ . The classification e is said to be strongly consistent if P(P σ R( e, Z) = Diag(f (Z))) → 1, for some permutation matrix P σ . The permutation matrix P σ is for large n uniquely defined: if (P σ ) j R − Diag(π) 1 ≤ min a π a , for j = 1, 2, then (P σ ) 1 = (P σ ) 2 . This follows because the assumption implies that (P σ ) −1 1 Diag(π)− (P σ ) −1 2 Diag(π) 1 ≤ 2 min a π a , by the triangle inequality and the fact that the L 1 -norm is invariant under permutations. Furthermore, for P σ = (P σ ) 2 (P σ ) −1 1 the left side is P σ Diag(π) − Diag(π) 1 , which is at least two times the sum of the two smallest coordinates of π if P σ = I.
A necessary requirement for consistency is that the classes can be recovered from the likelihood, i.e. the model parameters must be identifiable. If π has strictly positive coordinates, so that all labels will appear in the data eventually, then as explained in Bickel and Chen (2009) an appropriate condition is that P does not have two identical rows. If π a = 0 for some a, then class a will never be consumed; the identifiability condition should then be imposed after deleting the ath column from P . Thus, we call the pair (P, π) identifiable if the rows of P are different after removing the columns corresponding to zero coordinates of π. Throughout we assume that P is symmetric.

Consistency results and assumptions
We are now ready to present our results on consistency for the Bayesian maximum a posteriori (MAP) estimator (5). Theorem 3.2 shows strong consistency of the Bayesian estimator if λ n (log n) 2 . The proof rests on a proof of weak consistency under similar conditions, stated in the appendix as Theorem A.1.
Recall that ρ n = a,b π a π b P ab is the probability of a new edge, and λ n = (n − 1)ρ n is the expected degree of a node.
Theorem 3.2 (strong consistency). (i) If (P, π) is fixed and identifiable with 0 < P < 1 and π > 0 then the MAP classifier e = arg max e Q B (e) is strongly consistent. (ii) If P = ρ n S, where (S, π) is fixed and identifiable with S > 0 and π > 0, then the MAP classifier e = arg max e Q B (e) is strongly consistent if λ n (log n) 2 .
The theorem distinguishes two cases: (i) is the dense case, while (ii) is the sparse case. The second is the most interesting of the two, as it touches on the question how much information is required to recover the underlying community structure. Much recent research effort has gone into determining detection and computational boundaries, in particular for special cases of the SBM with K = 2 (see e.g. Mossel, Neeman and Sly (2012), Chen and Xu (2014), Abbe, Bandeira and Hall (2014) and Zhang and Zhou (2015)).
Weakly consistent estimation of the class labels for an arbitrary, but known, number of classes is possible under the assumption λ n log n, as this was shown to hold for spectral clustering by Lei and Rinaldo (2015). Strong consistency of maximum likelihood was shown to hold in the special cases of planted bisection and planted clustering if K = 2 by Abbe, Bandeira and Hall (2014); Chen and Xu (2014), again under the assumption λ n log n. Gao et al. (2015) and Gao et al. (2016) achieve optimality in different senses, under assumptions on the average withincommunity and between-community edge probabilities; Gao et al. (2015) introduce a two-stage procedure which achieves the optimal proportion of misclassified nodes in a special case where P ab can only take two values, while Gao et al. (2016) obtain minimax rates for the proportion of misclassified nodes in the degree corrected SBM.
Strong consistency of the likelihood modularity for an arbitrary number of classes K has been claimed under the same assumption λ n log n (Bickel and Chen, 2009), and those results have been extended to the degree-corrected SBM (Zhao, Levina and Zhu, 2012). However, these results were obtained by application of an abstract theorem to the special case of the likelihood modularity, which would require the function τ (x) = x log x + (1 − x) log(1 − x), or the function σ(x) = x log x, to be globally Lipschitz. As τ and σ are only locally Lipschitz, it is still unclear whether λ n log n is a sufficient condition for either weakly or strongly consistent estimation by maximum likelihood. From our proof of Theorem 3.2, which proceeds by comparing the Bayesian modularity the likelihood modularity, it immediately follows that λ n (log n) 2 is certainly sufficient. Given weak consistency the problem can be reduced to a neighbourhood of the true parameter on which the Lipschitz condition is reasonable. However, it is precisely our proof of weak consistency that needs the additional log n factor.
The Largest Gaps algorithm of Channarond, Daudin and Robin (2012) is strongly consistent provided that min a =b | K k=1 α k (P ak − P bk )| is at least of order log n/n, implying that at least one of the P ab is of the same order, and thus λ n √ n log n. This much stronger condition is not surprising, as the Largest Gaps algorithm only uses the degree of a node and does not take into account any finer information on the group structure, such as the information contained in the O ab .
To the best of our knowledge, for K > 2, it remains to be shown that λ log n is sufficient for strong consistency of any community detection method for the general SBM. For the minimax rate for the proportion of misclustered nodes in community detection, when only classes of sizes proportional to n are considered, a phase transition when going from the case K = 2 to K ≥ 3 was observed by Zhang and Zhou (2015). Their results show that if K = 2, communities of the same size are most difficult to distinguish, while if K ≥ 3, small communities are harder to discover. This shift in the nature of the communities that are harder to detect may be what has been preventing a general strong consistency result under the assumption λ n log n so far.

Application
Some options for implementing the Bayesian modularity are given in Section 4.1, after which the results of applying the Bayesian and likelihood modularities to the well-studied karate club data of Zachary (1977) are discussed in Section 4.2.

Implementation
Two recent works explicitly discuss implementation of Bayesian methods for the SBM. McDaid et al. (2013) followed the approach of Nowicki and Snijders (2001) and added a Poisson prior on K. After marginalizing over π and P , they employ an allocation sampler to sample from the joint density of K and z given A, and use the posterior mode to estimate K. Their algorithm can scale to networks with approximately ten thousand nodes and ten million edges. Côme and Latouche (2014), claiming that the algorithm of McDaid et al. (2013) suffers from poor mixing properties, propose a greedy inference algorithm for the same problem. For the karate club data in Section 4.2, the network was small enough that a tabu search (Glover, 1989), run for a number of different initial configurations, yielded good results. We used α = 1/2 for the Dirichlet prior, and β 1 = β 2 = 1/2 for the beta prior.

Karate club
Zachary (1977) described a karate club which split into two clubs after a conflict over the price of the karate lessons. The new club was led by Mr. Hi, the karate teacher of the original club, while the remainder of the old club stayed under the former Officers' rule. The data consists of • • Figure 1. Communities detected by the Bayesian modularity when K = 2 (left) and K = 4 (right), with α = β 1 = β 2 = 1/2. The polygons contain the two groups the karate club was split into; the left one is Mr. Hi's club, the right one is the Officers' club. The shapes of the nodes represent the communities selected by the modularities. Figure made using the igraph package (Csardi and Nepusz, 2006).
an adjacency matrix for those 34 individuals who interacted with other club members outside club meetings and classes. Each of these individuals' affiliations after the conflict is known. The communities selected by the Bayesian modularity for K = 2 and K = 4 are given in Figure  1. In both instances, the tabu search led to nearly the same solution for both the Bayesian and likelihood modularities, only differing at one node for K = 4, which is not surprising in light of Lemma 3.1. For K = 2, the results of Bickel and Chen (2009) for this data set are recovered. For K = 4, the partition in Figure 1 yields a higher value of the likelihood modularity than the partition into four classes found by Bickel and Chen (2009), and an even higher value is obtained by switching club member 20 to the second-largest class. This discrepancy is likely due to the heuristic nature of the tabu search algorithm, and for the same reason, it may be the case that improvement over the partitions found by the Bayesian modularity in Figure 1 are possible.
For K = 2, the communities found by the algorithms do not correspond in the slightest to the two karate clubs, instead grouping the nodes with the highest degrees, corresponding to Mr. Hi, the president of the original club, and their closest supporters, together. Incidentally, this partition is the same as the one returned by the Largest Gaps algorithm of Channarond, Daudin and Robin (2012), which solely uses the degrees of the nodes and discards all other information.
These bad results are no reason to shelve the Bayesian and likelihood modularities, as there is no reason to believe that the two karate clubs form communities in the sense of the stochastic block model. Mr. Hi and the club's president are clear outliers within their groups, and neither of the algorithms were designed to be robust to such a phenomenon. The communities selected by the modularities are communities in the sense that they form connections within and between the groups in a similar fashion. This sense does not correspond to the social notion of a community in this setting.
The results for four classes unify the social and stochastic senses of community. The prominent members of each of the new clubs are placed into two separate, small, communities. The other members are classified nearly perfectly, with two exceptions. However, one of those exceptional individuals is the only person described by Zachary (1977) as being a supporter of the club's president before the split, who joined Mr. Hi's club, making this person's affiliation up for debate. The second is described as only a weak supporter of Mr. Hi. The increased number of communities allows for some outliers within the social communities, and leads to a more detailed understanding of the dynamics within both of the groups. We essentially recover the two communities, each with a core that is more connective than the remainder of te nodes.

Discussion
An advantage of Bayesian modelling is that it does not solely result in an estimator, but in a full posterior distribution. The posterior mode studied in this paper is but one aspect of the posterior, and its good behaviour in terms of consistency is encouraging. Further study into other aspects in the posterior may prove to be fruitful. One possible research direction would be to use the posterior to quantify uncertainty in the estimate of the class labels. A second issue that may be resolved by the Bayesian approach is the question of estimating the number of classes, K. This remains an important open question, as noted by Bickel and Chen (2009), despite recent attempts (e.g. Saldana, Yu and Feng (2014), Chen and Lei (2014) and Wang and Bickel (2015)). By introducing a prior on K, such as the Poisson-prior suggested by McDaid et al. (2013), the number of communities K can be detected by the posterior.

Appendix A: Proofs
After stating some repeatedly used notation, this appendix starts with the proof of Theorem A.1, which is a theorem on weak consistency of the Bayesian modularity. It is followed by a number of supporting Lemmas, after which we proceed to the proof of Theorem 3.2, and some additional supporting Lemmas.
We write diag (P ) for the diagonal of P if P is a matrix, and Diag(f ) for the diagonal matrix with diagonal f if f is a vector.

A.1. Weak consistency
The following quantities will be used in the course of multiple proofs. The function H P , with domain K × K probability matrices, is given by, for τ (u) = u log u + (1 − u) log(1 − u), For τ 0 (u) = u log(u) − u, define The sums defining these functions are over all pairs (a, b) with 1 ≤ a, b ≤ K, unlike the sums defining the modularities Q B and Q M L , which are restricted to a ≤ b.
Theorem A.1 (weak consistency). (i) If (P, π) is fixed and identifiable, then the MAP classifier e = arg max z Q B (e) is weakly consistent.
(ii) If P = ρ n S for ρ n → 0, and (S, π) is fixed and identifiable, then the MAP classifier e = arg max z Q B (e) is weakly consistent provided nρ n (log n) 2 .
Proof. By Lemma 3.1 the Bayesian modularity Q B is equivalent to the likelihood modularity Q M L up to order (log n)/n. With the notation O ab (e) = O ab (e) if a = b, and O ab (e) = 2O ab (e) if a = b, the likelihood modularity is in turn equivalent up to the same order to Indeed the terms of Q M L (e) for a < b are identical to the sums of the terms of L(e) for a < b and a > b, while for a = b the terms of Q M L (e) and L(e) differ only subtly: the first uses n aa (e) = 1 2 n a (e)(n a (e) − 1), where the second uses 1 2 n a (e) 2 . Thus the difference is bounded in absolute value by the sum over a of (where e is suppressed from the notation) .
where l(x) = x(1∨log(1/x)), in view of Lemma A.4. We now use that n a l(u/n a ) log n a ≤ log n, for 0 ≤ u ≤ 1.
Combining the preceding, we conclude that Combining this with the preceding paragraph, we conclude that L( e) ≥ L(Z) − 2(η n,1 + η n,2 ). Proof of (i). For given δ > 0, let R δ be the set of all probability matrices R with min Pσ P σ R − Diag(R T 1) 1 ≥ δ, and min a:πa>0 Here the minimum is taken over the (finite) set of all permutation matrices P σ on K labels. Furthermore, set where H P is as defined in (6). Because R δ is compact and the maps R → H P (R) and R → Diag(R T 1) are continuous, the infimum in the display is assumed for some R ∈ R δ . Because no R ∈ R δ can be transformed into a diagonal element by permuting rows and every R ∈ R δ has a nonzero element in every column a with π a > 0, Lemma A.5 shows that η n > 0. Because L(e) = H P (R(e, Z)) for every e, and R(Z, Z) = Diag(f (Z)) = Diag(R( e, Z) T 1), we conclude that H P (Diag(R( e, Z) T 1)) − H P (R( e, Z)) ≤ 2(η n,1 + η n,2 ).
If 2(η n,1 + η n,2 ) is smaller than η n , then it follows that R( e, Z) cannot be contained in R δ . Since R( e, Z) T 1 = f (Z) P → π, by the law of large numbers, for sufficiently small δ > 0 this must be because R( e, Z) fails the first requirement defining R δ . That is, P σ R( e, Z) − Diag(f (Z)) 1 ≤ δ for some permutation matrix P σ . As this is true eventually for any δ > 0, it follows that min Pσ P σ R( e, Z) − Diag(π) 1 P → 0. Proof of (ii). In view of Lemma A.6, the number η = η n , which now depends on n, is now bounded below by ρ n times a positive number that depends on (S, π). The preceding argument goes through provided η n,1 +η n,2 is of smaller order than η n . This leads to l ρ n /n +log(n)/n ρ n , or (ρ n /n) log 2 n/(ρ n S ∞ ) ρ 2 n .
Proof. This Lemma is adapted from Lemma 1.1 in Bickel and Chen (2009). There are K n possible values of e and · ∞ is the maximum of the K 2 entries in the matrix. We use the union bound to pull these maxima out of the probability, giving the factor K n+2 on the right. Next it suffices to bound the tail probability of each variable The n ab (e) variables in this sum are conditionally independent given Z, take values in [−2, 2], and have conditional mean zero given Z and conditional variance bounded by 4 var(A ij | Z) ≤ 4P ZiZj (1 − P ZiZj ) ≤ 4 P ∞ . Thus we can apply Bernstein's inequality to find that P O ab (e) − E O ab (e) | Z > xn 2 ≤ 2e −x 2 n 4 /(8n ab (e) P ∞+4xn 2 /3) .
Finally we use the crude bound n ab (e) ≤ n 2 and cancel one factor n 2 . Proof. A similar expression, not taking into account the absence of self-loops, appears in Bickel and Chen (2009).
Proof. Write the difference between x log x and y log y as | y x (1 + log s) ds|. The function s → 1 + log s is strictly increasing on [0, 1] from −∞ to 1 and changes sign at s = e −1 . Therefore the absolute integral is bounded above by the maximum of (1 + log s) ds ≤ |x − y|.

Proof of Lemma 3.1
Proof. The second assertion of the lemma follows from the first and the fact that max e Q P (e) (log n)/n. It suffices to prove the first assertion.
Recall that the Bayesian modularity is given by We shall show that the first sum on the right is equivalent to Q M L (e), and the second sum is equivalent to Q P (e). We show this by comparing the sums defining the various modularities term by term. For clarity we shall suppress the argument e. We will repeatedly use the following bound from (Robbins, 1955): for n ∈ N ≥1 , with (12n+1) −1 ≤ a n ≤ (12n) −1 , as well as the fact that Γ(s) is monotone increasing for s ≥ 3/2. In addition, we will bound remainder terms by using the inequality x log((x + c)/x) ≤ c for c ≥ 0 and the fact that x log((x − 1)/x) is bounded for x > 1.
First sum of (8). Upper bound, case 1: O ab = 0 and n ab = O ab We apply (9): where α ab , β ab and γ ab are bounded by constants. By the inequality x log((x + c)/x) ≤ c for c ≥ 0, and the fact that x log((x − 1)/x) is bounded for x > 1, we find the upper bound: Upper bound, case 2: n ab = 1 and O ab = 0 or n ab = O ab , or n ab = 0 In both cases, the corresponding term of the likelihood modularity vanishes, whereas the contribution of the Bayesian modularity is either log B(1 + β 1 , β 2 ), log(β 1 , 1 + β 2 ), or log B(β 1 , β 2 ).
Lower bound The computations for the lower bound are completely analogous, except that we require O ab + β 1 ≥ 2 and n ab − O ab + β 2 ≥ 2. We study four cases. The cases (1) O ab ≥ 2 and n ab − O ab ≥ 2, (2) n ab = 0 and (3) n ab > 0 and n ab = O ab or O ab = 0 are similar to cases 1, 2 and 3 respectively of the upper bound. The fourth case is n ab − O ab = 1 and O ab ≥ 2, or O ab = 1 and n ab − O ab ≥ 1. In both instances, the likelihood modularity is equality to a bounded term minus log n ab . By similar calculations as before, the Bayesian modularity is of the order log n ab as well.
Lemma A.5. For any probability matrix R, Furthermore, if (P, π) is identifiable and the columns of R corresponding to positive coordinates of π are not identically zero, then the inequality is strict unless P σ R is a diagonal matrix for some permutation matrix P σ .
Proof. This Lemma is related to the proof that the likelihood modularity is consistent given in Bickel and Chen (2009). This proof however rests on their incorrect Lemma 3.1, and thus we provide full details on how the argument can be adapted to avoid the use of their Lemma 3.1 altogether.
For R a diagonal matrix the numbers (RP R T ) ab /(R1) a (R1) b reduce to P ab . Consequently, by the definition of H P , For a general matrix R, by inserting the definition of τ , Because (R1) a (R1) b − (RP R T ) ab = (R(1 − P )R T ) ab , with 1 the (K × K)-matrix with all coordinates equal to 1, we can rewrite this as By the information inequality for two-point measures, the expressions in square brackets becomes bigger when (RP R T ) ab /(R1) a (R1) b is replaced by P a b , with a strict increase unless these two numbers are equal. After making this substitution the terms in square brackets becomes τ (P a b ), and we can exchange the order of the two (double) sums and perform the sum on (a, b) to write the resulting expression as This proves the first assertion (10) of the lemma. If R attains equality, then also for every permutation matrix P σ , by the equality H P (P σ R) = H P (R) and the fact that (P σ R) T 1 = R T 1, we have We shall show that if R satisfies this equality and P σ R has a positive diagonal, then P σ R is in fact diagonal. Furthermore, we shall show that there exists P σ such that P σ R has a positive diagonal. Fix some (P σ ) m that maximizes the number of positive diagonal elements of P σ R over all permutation matrices P σ , and denoteR = (P σ ) m R. Because the information inequality is strict, the preceding argument shows that (12) can be true for P σ = (P σ ) m (giving P σ R =R) only if Denote the matrix on the right of the equality by Q.
IfR has a completely positive diagonal, then we can choose a = a and b = b and find from equation (13), that P ab = Q ab , for every a, b. If alsoR aa > 0, then we can also choose b = b and find that P a b = Q ab , for every b. Thus the ath and a th rows of P are identical. Since all rows of P are different by assumption, it follows that no a = a withR aa > 0 exists.
IfR does not have a fully positive diagonal, then the submatrix ofR obtained by deleting the rows and columns corresponding to positive diagonal elements must be the zero matrix, since otherwise we might permute the remaining rows and create an additional nonzero diagonal element, contradicting that (P σ ) m already maximized this number. If I and I c are the sets of indices of zero and nonzero diagonal elements, then the preceding observation is thatR ij is zero for every i, j ∈ I. If π > 0, then we need to consider only R with nonzero columns. For i ∈ I a nonzero element in the ith column ofR must be located in the rows with label in I c : for every i ∈ I there exists k i ∈ I c withR kii > 0. Then, for i, j ∈ I, We combine these three assertions to conclude that, for a, i ∈ I and b ∈ I c , Together these imply that the ath and the k a th row of P are equal. Since by assumption they are not (if π > 0), this case can actually not exist (i.e. k = 0).
Finally if π a = 0 for some a, then we follow the same argument, but we match only every column i ∈ I with π i > 0 to a row k i ∈ I c . By the assumption on R such k i exist, and the construction results in two rows of P that are identical in the coordinates with π a > 0.
Lemma A.6. For any fixed (K × K)-matrix P with elements in [0, 1], uniformly in probability matrices R, as ρ n → 0, Furthermore, if (P, π) is identifiable and the columns of R corresponding to positive coordinates of π are not identically zero, then the right side is strictly positive unless SR is a diagonal matrix for some permutation matrix S.
Proof. From the fact that |(1 − u) log(1 − u) + u| ≤ u 2 , for 0 ≤ u ≤ 1, it can be verified that, ρ −1 n τ (ρ n u) − u log ρ n + τ 0 (u) ≤ ρ n → 0, uniformly in 0 ≤ u ≤ 1. It follows that, uniformly in The first term on the right is equal to log ρ n (R T 1) T P (R T 1), and hence is the same for R and Diag(R T 1). Thus this term cancels on taking the difference to form the left side of (14), and hence (14) follows. The right side of (14) is nonnegative, because the left side is, by Lemma A.5. This fact can also be proved directly along the lines of the proof of Lemma A.5, as follows. Write By the information inequality for two Poisson distributions the term in square brackets becomes bigger if (RP R T ) ab /(R1) a (R1) b is replaced by P a b . It then becomes τ 0 (P a b ) and the double sum on (a, b) can be executed to see that the resulting bound is G P Diag(R T 1) . Furthermore, the inequality is strictly unless (13) holds, withR = R. Since also G P (P σ R) = G P (R), for every permutation matrix P σ , the final assertion of the lemma is proved by copying the proof of Lemma A.5.

A.2. Strong consistency
We need slightly adapted versions of the function H P , given by, with δ ab equal to 1 or 0 if a = b or not, For given functions t ab : [0, 1] → R, let X(e) be the K × K matrix with entries

Proof of Theorem 3.2 [strong consistency]
Proof. (i). By Theorem A.1, e is weakly consistent, and hence with probability tending to one it belongs to the set of classifications e such that the fractions f (e) are close to π, and the matrices R(e, Z) are close to Diag(π) after the appropriate permutation of the labels (that is, of rows of R(e, Z)). Therefore, it is no loss of generality to assume that e is restricted to this set. By Lemmas A.2 and A.3, the matrices O(e)/n 2 are then close to R(e, Z)P R(e, Z) T → Diag(π)P Diag(π), and hence are bounded away from zero and one if P has this property. If e and Z differ at m nodes, then e belongs to the set of e with R(Z, Z)−R(e, Z) 1 = m(2/n), by Lemma 2.1. In that case Q B (e) ≥ Q B (Z), for some e in this set, and hence by Lemma 3.1 Q M L (e) − Q M L (Z) + Q P (e) − Q P (Z) ≥ −η n , for some η n of order (log n)/n 2 . It follows that: The first term on the right is bounded below by a multiple of m/n, by Lemmas A.7 and 2.1.
Because (x + α) log x − (y + α) log y = y x (log s + (s + α)/s) ds is bounded in absolute value by a multiple of |x − y| log(x ∨ y), if α ≥ 0 and x, y > 0, the second term −|Q P (e) − Q P (Z)| is bounded below by a multiple of m(log n)/n 2 , for some positive constant C 2 , which is of smaller order than m/n. We conclude that the left side of (17) is bounded below by C 1 m/n. The left side is a,b X ab (e) − X ab (Z) , for X defined in (16)  The sum of the right side over m = 1, . . . , n tends to zero.
(ii). We follow the proof for (i), but in (17) use that H P,n R(Z, Z) − H P,n R(e, Z) ≥ ρ n C R(Z, Z) − R(e, Z) 1 ≥ ρ n C2m/n, by Lemma A.9. Since ρ n (log n)/n by assumption, we have that the contribution m(log n)/n 2 of Q P (e) − Q P (Z) is still negligible and hence ρ n C2m/n is a lower bound for the left side of (17). As a bound on the left side of the preceding display, we then obtain This sum tends to zero provided that nρ n log n.
Lemma A.7. If P is fixed and symmetric and every pair of rows of P is different and 0 < P < 1 and π > 0, then, for sufficiently small δ > 0, Proof. We can reparametrize the K×K matrices R by the pairs (R T 1, R−Diag(R T 1)), consisting of the K vector f = R T 1 and the K ×K matrix R−Diag(R T 1). The latter matrix is characterized by having nonnegative off-diagonal elements and zero column sums, and can be represented in the basis consisting of all K × K matrices ∆ bb , for b = b , defined by: (∆ bb ) b b = −1, (∆ bb ) bb = 1 and (∆ bb ) aa = 0, for all other entries (a, a ), i.e. the b th column of ∆ bb has a 1 in the bth coordinate and a −1 on the b th coordinate and all its other columns are zero. Given any matrix R ≥ 0 the matrix R − Diag(R T 1) can be decomposed as for λ bb = R bb ≥ 0. Since every ∆ bb has exactly one nonzero off-diagonal element, which is equal to 1, and in a different location for each b = b, the sum of the off-diagonal elements of the matrix on the right side is b,b λ bb . Because the sum of all its elements is zero, it follows that its sum of absolute elements is given by R − Diag(R T 1) 1 = 2 b =b λ bb . Thus we obtain a further reparametrization R ↔ (f, λ), in which R = Diag(f )+ b =b λ bb ∆ bb . For given P , f and n, define the function Then we would like to show that there exists C such that for every f in a neighbourhood of π, λ in a neighbourhood of 0 intersected with {λ : λ ≥ 0}, and every sufficiently large n. The numerator in the quotient is f (0) − f (1) for the function f (s) = G(sλ). Writing this difference in the form −f (0) − 1 0 f (s) − f (0) ds gives that the numerator is equal to It suffices to show that the first term is bounded below by a multiple of λ 1 and that the second is negligible relative to the first, as n → ∞, uniformly in f in a neighbourhood of π and λ in a neighbourhood of 0 intersected with {λ : λ ≥ 0}. Thus it is sufficient to show first that for every coordinate λ bb of λ minus the partial derivative of G at λ = 0 with respect to λ bb is bounded away from 0, as n → ∞ uniformly in f , and second that every partial derivative is equicontinuous at λ = 0 uniformly in f and large n.
We have for f (λ) = f + bb λ bb (∆ bb 1), By a lengthy calculation, given in Lemma A.8, for K(p q) = p log(p/q) + (1 − p) log (1 − p)/(1 − q) the Kullback-Leibler divergence between the Bernoulli distributions with success probabilities p and q. The numbers f a are bounded away from zero for f sufficiently close to π, and hence so is a f a K(P ab P ab ), unless the bth and b th column of P are identical. The whole expression is bounded below by the minimum over (b, b ) of these numbers minus (2n) −1 times the maximum of the numbers K(P b b P bb ), and hence is positive and bounded away from zero for sufficiently large n.
To verify the equicontinuity of the partial derivatives we can compute these explicitly at λ and take their limit as n → ∞. We omit the details of this calculation. However, we note that every term of G(λ) is a fixed function of the quadratic forms in λ f a + bb λ bb (∆ bb 1) a f a + bb λ bb (∆ bb 1) a − δ aa /n , These forms are obviously smooth in λ, and their dependence and that of their derivatives on n is seen to vanish as n → ∞. For f and λ restricted to neighbourhoods of π and 0, the values of the quadratic forms are restricted to a domain in which the transformation mapping them into G(λ) is continuously differentiable. Thus the desired equicontinuity follows by the chain rule.
Lemma A.8. The partial derivatives of the function G at 0 defined by (20) are given by (21).
Proof. For given differentiable functions u and v the map → u( . We apply this for every given pair (a, a ) to the functions u and v obtained by taking λ bb in (22) and (23) equal to and all other coordinates of λ equal to zero. Then It follows that v(0)/(u(0) − v(0)) = P aa /(1 − P aa ), and u(0)/(u(0) − v(0)) = 1/(1 − P aa ). Hence in view of (15) the partial derivative in (21) is equal to We combine this with the equalities Lemma A.9. If S is fixed and symmetric, every pair of rows of S is different and S > 0 and π > 0 coordinatewise, then there exists C > 0 such that, for sufficiently small δ > 0 and any Proof. In the notation of the proof of Lemma A.7 we must now show that G(0)−G(λ) ≥ Cρ n λ 1 , as n → ∞, uniformly in f in a neighbourhood of π, and λ in a positive neighbourhood of 0. As in that proof we write G(0) − G(λ) in the form (19) and see that it suffices that the partial derivatives of G at 0 divided by ρ n tend to negative limits, and that ∇G(λ) − ∇G(0) /ρ n becomes uniformly small as λ is close enough to zero. The partial derivative at 0 with respect to λ bb is given in (21), where we must replace P by ρ n S. Since the scaled Kullback-Leibler divergence ρ −1 n K(ρ n s ρ n t) of two Bernoulli laws converges to the Kullback-Leibler divergence K 0 (s t) = s log(s/t)+t−s between two Poisson laws of means s and t, as ρ n → 0, it follows that for ρ n → 0, uniformly in f , The right side is strictly negative by the assumption that every pair of rows of S differ in at least one coordinate. If P = ρ n S, then the function λ → v(λ) given in (23) takes the form v = ρ n v S , for v S defined in the same way but with S replacing P . The function u given in (22) does not depend on P or S. Using again that the derivative of the map → u( )τ v( )/u( ) is given by v log v/(u − v) − u log u/(u − v) , we see that the partial derivative with respect to λ bb of the (a, a ) term in the sum defining G takes the form Here u and V S are as in (22) and (23) (with P replaced by S), and depend on (a, a ). From the fact that the column sums of the matrices R(λ) do not depend on λ, we have that a,a (R(λ)SR(λ) T ) aa − δ aa n k P kk R(λ) ak = R(λ) T 1SR(λ) T 1 − k P kk a R(λ) ak is constant in λ. This shows that a,a v S = 0 and hence the contribution of the term ρ n v S log ρ n to the partial derivatives of G vanishes. The term −(ρ n v S − u ) log(1 − ρ n v S /u) can be expanded as (ρ n v S −u )ρ n v S /u up to O(ρ 2 n ), uniformly in f and λ. Since these are equicontinuous functions of λ, it follows that ρ −1 n ∇G(λ) − ∇G(0) becomes arbitrarily small if λ varies in a sufficiently small neighbourhood of 0.
Proof. Given Z there are at most n m groups of m candidate nodes that can be assigned to have e i = Z i , and the label of each node can be chosen in at most K − 1 ways. Thus conditioning the probability on Z, we can use the union bound to pull out the maximum over e, giving a sum of fewer than n m K m terms. Next we pull out the norm giving another factor K 2 . It suffices to combine this with a tail bound for a single variable X a,b (e) − X a,b (Z). Write t for t a,b . Assume for simplicity of notation that e i = Z i , for i > m, and decompose 1 n 2 O ab (e) = 1 n 2 i≤m or j≤m A ij 1 ei=a,ej =b + i>m and j>m A ij 1 ei=a,ej =b =: S 1 + S 2 .
Since the first and second derivatives of t are uniformly bounded by 1, it follows that The variable S 1 − ES 1 is a sum of fewer than 2mn independent variables, each with conditional mean zero, bounded above by 1/n 2 and of variance bounded above by P ∞ /n 4 . Therefore Bernstein's inequality gives that P |S 1 − ES 1 | > x ≤ e − 1 2 x 2 /(2mn P ∞/n 4 +x/(3n 2 )) .
The exponent has a similar form as before, except for an additional factor n/m ≥ 1.