Rate optimal Chernoff bound and application to community detection in the stochastic block models

The Chernoff coefficient is known to be an upper bound on the Bayes error probability in classification problems. In this paper, we develop a rate optimal Chernoff bound on the Bayes error probability. The new bound is not only an upper bound but also a lower bound of the Bayes error probability, up to a constant factor. Moreover, we apply this result to community detection in the stochastic block models. As a clustering problem, the optimal misclassification rate of the community detection problem can be characterized by our rate optimal Chernoff bound. This is formalized by deriving a minimax error rate over a certain parameter space of stochastic block models, and then achieving that error rate with a feasible algorithm employing multiple steps of EM type updates. MSC 2010 subject classifications: Primary 62F03; secondary 60G05.


Introduction
Many classification and clustering problems in the statistical literature can be reduced to symmetric hypothesis testing. In a classical setting, given two hypotheses H_0 and H_1, where H_i assumes that the data are observed from a measurable space with distribution P_i, one discriminates between them according to a certain decision rule. A type-I error occurs if one accepts H_0 while the data are generated from distribution P_1, and vice versa for a type-II error. Symmetric hypothesis testing means that the hypotheses are equiprobable and that the loss function weighs type-I and type-II errors equally. Therefore, we focus on the Bayes error probability, which averages the two kinds of error probabilities.
The asymptotic behavior of the Bayes error probability is an essential problem in symmetric hypothesis testing. Given probability density functions (PDFs) or probability mass functions (PMFs) ϕ_0 and ϕ_1 of distributions P_0 and P_1 respectively, the Chernoff information, defined as

D_{α*}(ϕ_0 ∥ ϕ_1) := max_{α∈(0,1)} −log ∫ ϕ_0^{1−α}(x) ϕ_1^α(x) dμ(x),   (1.1)

is known as the best exponent of the Bayes error probability [1]. A Chernoff type lower bound was investigated in [2,3]. It is a powerful tool in many areas of research, such as community detection [4,5,6] and quantum information theory [7,8]. However, the ratio between the Chernoff coefficient, defined as exp(−D_{α*}(ϕ_0 ∥ ϕ_1)), and the Bayes error probability has not been investigated in previous literature.
In this paper, we propose a rate optimal Chernoff bound for the Bayes error probability. Observing i.i.d. samples with distribution either ϕ_0 or ϕ_1, we show that the Bayes error probability is equivalent to n^{−1/2} exp(−n D_{α*}(ϕ_0 ∥ ϕ_1)) up to a constant factor. This result can be generalized to the non-i.i.d. case. Although comparable second-order asymptotics for asymmetric hypothesis testing were investigated in [9,10], there is no direct application to our situation. This paper also applies the rate optimal Chernoff upper and lower bounds to one of the popular clustering problems in statistics, namely community detection. In particular, we focus on the stochastic block models (SBM). Many effective algorithms and related theories have been proposed for solving community detection in SBMs, including global approaches such as spectral clustering [11,12,13,14,15,16,17,18,19,20,21,22,23] and convex relaxations via semidefinite programs (SDPs) [24,25,26,27,28,29,30,31,32,33]. Global approaches usually involve a single optimization step (either spectral clustering or SDP after convex relaxation) and do not require good initialization. However, these algorithms are usually not optimal on their own, because both spectral clustering and SDP lose the block structure in SBM. The pseudo-likelihood approach [34] fills in the gap with local refinement and makes optimal clustering possible. The general idea was summarized as "Good Initialization followed by Fast Local Updates" (GI-FLU) by [35]. Since the minimax error rate proposed in [36], algorithms in the manner of GI-FLU have been developed in [37,35,4,6]. However, as rate optimal Chernoff upper and lower bounds were not used in these papers, the minimax rates there are not sufficiently accurate and very few algorithms have been proven to be optimal. Details can be found in Table 1 below. Here, n is the number of nodes and d is the average degree of a node in the network, K denotes the number of communities, I indicates the Chernoff information in (1.1) (the specific one for community detection will be defined in (3.7)), and o(1) is some eventually positive sequence converging to 0.
Table 1. Comparison with existing results.

paper | density    | symmetry | minimax error      | algorithmic error
[37]  | not needed | yes      | not derived        | exp(−CnI), C < 1
[4]   | Θ(log n)   | yes      | not derived        | o(1/n)
[35]  | not needed | yes      | exp(−(1 + o(1))nI) | exp(−(1 − o(1))nI)
[6]   | O(…)       | …        | …                  | …

Some features and assumptions of the problem are as follows. Density indicates the average degree of a node. Symmetry means the paper assumes that the network is an undirected graph; community detection on a symmetric network is usually more difficult since half of the entries of the adjacency matrix are duplicates. The minimax error rate can be considered the fundamental limit of the community detection problem. Algorithmic error rates are the theoretical guarantees of feasible algorithms.
The block partitioning technique introduced in [37] generates enough independence between different steps of their algorithm. However, the last local update can only be applied to half of the dataset, so the error rate is much higher than exp(−nI). The algorithm derived in [35] has an error rate similar to the minimax error rate in [36], but the ratio between the upper and lower bounds has order exp(o(1)nI), which can be an arbitrarily divergent sequence. The analysis in [4] focuses on the density regime Θ(log n), but it does not generalize to other densities. To achieve an optimal error rate, the algorithm in [6] performs two local updates. However, their approach does not extend to undirected networks. We will combine different existing techniques and propose a new algorithm that achieves the minimax error rate (up to a constant).
We summarize the contributions of this paper as follows:
1. We investigate rate optimal Chernoff upper and lower bounds for the Bayes error probability.
2. Considering a certain parameter space, we propose second-order asymptotics for the minimax lower bound for community detection in SBM.
3. We provide a feasible algorithm which provably achieves the minimax lower bound up to a constant factor.
The rest of the paper is organized as follows. We introduce the Chernoff type upper and lower bounds in Section 2; we then present our minimax lower bounds and the provable community detection algorithm with its analysis in Section 3. Simulations appear in Section 5. Proofs of the theorems and corollaries in Section 2 appear in Section 6. Proofs about the minimax error rate and the consistency of community detection can be found in Section 7.
Here, we briefly introduce the notation that will be used in this paper.
[n] is the set of integers from 1 to n, i.e., [n] = {1, 2, . . . , n}. A random variable X ∼ f means X has probability mass function or density function f. a_n ≲ b_n, or equivalently a_n = O(b_n), holds if there exists a constant C such that a_n ≤ C b_n for sufficiently large n. If a_n ≲ b_n and b_n ≲ a_n, then a_n ≍ b_n. a_n = o(1) means a_n converges to 0 and is nonnegative for sufficiently large n. Furthermore, we use a ∨ b and a ∧ b to denote max(a, b) and min(a, b) respectively.

Rate optimal Chernoff upper and lower bounds
We will introduce a fundamental testing problem under a Bayes setting, and then present a new Chernoff type upper and lower bound on the Bayes error probability. We will also introduce its application to exponential families.

Symmetric hypothesis testing
We will define a symmetric hypothesis testing problem and its Bayes error probability. Let ϕ_{0j} and ϕ_{1j} for j ∈ [n] be two sequences of measurable PDFs for one-dimensional real random variables. The same results hold if they are PMFs, but we only consider PDFs for brevity. We assume that for every j ∈ [n], ϕ_{0j} and ϕ_{1j} are defined on the same measure space (Ω_j, Σ_j, μ_j). Let us consider the product measure space (Ω, Σ, μ) defined by

Ω := Ω_1 × · · · × Ω_n,  Σ := Σ_1 ⊗ · · · ⊗ Σ_n,  μ := μ_1 × · · · × μ_n,   (2.1)

and the measurable PDFs

ϕ_z(x) := ∏_{j=1}^n ϕ_{zj}(x_j),  z ∈ {0, 1}.   (2.2)

We assume both D_KL(ϕ_0 ∥ ϕ_1) and D_KL(ϕ_1 ∥ ϕ_0) exist, which implies ∫_Ω (ϕ_0 + ϕ_1) |log(ϕ_1/ϕ_0)| dμ < ∞. In particular, it requires ϕ_0 and ϕ_1 to have the same support and to take different values on a set with nonzero measure. For a pair of density functions satisfying these conditions, we say

(ϕ_0, ϕ_1) ∈ F(Ω, Σ, μ, n).   (2.3)

Now we randomly draw a number z ∈ {0, 1} with equal prior probability 1/2. We note that the following arguments about the rate optimal Chernoff bound still hold as long as the prior probability is nondegenerate, but we assume equiprobable priors for simplicity. Then we draw a random sample X = (X_1, . . . , X_n) where X_j ∼ ϕ_{zj} independently. We aim to recover the label z given the observation X = x = (x_1, . . . , x_n). For any estimator ẑ := ẑ(x) of z, we define the Bayes error probability, also called the Bayes risk, as R(ẑ, z) := P(ẑ(X) ≠ z). Its minimum over all estimators is characterized by the total variation affinity η(ϕ_0, ϕ_1), which is defined as follows:

η(ϕ_0, ϕ_1) := ∫_Ω min(ϕ_0, ϕ_1) dμ = 2 min_{ẑ} R(ẑ, z).   (2.4)

The naming of total variation affinity comes from the fact that η(ϕ_0, ϕ_1) = 1 − D_TV(ϕ_0 ∥ ϕ_1), where D_TV(ϕ_0 ∥ ϕ_1) is the total variation distance. Now we can focus on the total variation affinity and express it as

η(ϕ_0, ϕ_1) = ∫_Ω ϕ_0^{1−α} ϕ_1^α min(l^α, l^{α−1}) dμ,   (2.5)

where l = ϕ_0/ϕ_1 is the likelihood ratio defined pointwise on Ω. This ratio is well defined since we assume that ϕ_0 and ϕ_1 have the same support. We observe that ϕ_0^{1−α} ϕ_1^α is a PDF on Ω up to a normalizer and min(l^α, l^{α−1}) is a real valued function on Ω, so it is convenient to express η(ϕ_0, ϕ_1) as an expectation. For α ∈ (0, 1), we define the PDF

ϕ_α := ϕ_0^{1−α} ϕ_1^α exp(D_α(ϕ_0 ∥ ϕ_1)),  where D_α(ϕ_0 ∥ ϕ_1) := −log ∫_Ω ϕ_0^{1−α} ϕ_1^α dμ.   (2.6)

We call D_α(ϕ_0 ∥ ϕ_1) the Chernoff α-divergence between ϕ_0 and ϕ_1. We also define a real valued function, which will play an important role in the analysis of the higher-order term:

g_α := min(l^α, l^{α−1}).   (2.7)

Then by direct calculation from (2.5), we have

η(ϕ_0, ϕ_1) = exp(−D_α(ϕ_0 ∥ ϕ_1)) E[g_α(Y)],   (2.8)

where Y = (Y_1, . . . , Y_n) ∼ ϕ_α is a random vector with independent elements on the product space Ω, and one can observe that

ϕ_α = ∏_{j=1}^n ϕ_{αj},  D_α(ϕ_0 ∥ ϕ_1) = Σ_{j=1}^n D_α(ϕ_{0j} ∥ ϕ_{1j}).   (2.9)

Let l_j = ϕ_{0j}/ϕ_{1j} and Z_j := log l_j(Y_j); then we can decompose log l(Y) as

log l(Y) = Σ_{j=1}^n Z_j.   (2.10)

Thus log l(Y) is indeed a sum of independent random variables, so it is approximately normally distributed under some regularity conditions, which will be specified in the following theorem. Obtaining a normal approximation from the Berry-Esseen theorem (Theorem 6.1), we have the following result.
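To make these quantities concrete, here is a minimal numerical sketch (ours, not from the paper) that evaluates the total variation affinity η(ϕ_0, ϕ_1) by brute-force summation and the Chernoff α-divergence for a product of Bernoulli coordinates; the function names are hypothetical.

```python
import numpy as np
from itertools import product

def chernoff_alpha_div(p0, p1, alpha):
    # D_alpha(phi_0 || phi_1) for a product of Bernoulli coordinates:
    # by (2.9) it is the sum of the coordinatewise alpha-divergences.
    p0, p1 = np.asarray(p0), np.asarray(p1)
    inner = p0**(1 - alpha) * p1**alpha + (1 - p0)**(1 - alpha) * (1 - p1)**alpha
    return -np.log(inner).sum()

def tv_affinity(p0, p1):
    # eta(phi_0, phi_1) = sum_x min(phi_0(x), phi_1(x)), brute force over {0,1}^n.
    p0, p1 = np.asarray(p0), np.asarray(p1)
    eta = 0.0
    for x in product([0, 1], repeat=len(p0)):
        x = np.asarray(x)
        f0 = np.prod(np.where(x == 1, p0, 1 - p0))
        f1 = np.prod(np.where(x == 1, p1, 1 - p1))
        eta += min(f0, f1)
    return eta

p0, p1 = [0.55] * 8, [0.45] * 8
d_star = max(chernoff_alpha_div(p0, p1, a) for a in np.linspace(0.01, 0.99, 99))
# Chernoff upper bound: eta <= exp(-D_{alpha*}); the theme of this section is
# that exp(-D_{alpha*}) is also tight up to a polynomial prefactor.
print(tv_affinity(p0, p1), np.exp(-d_star))
```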

Remark 1.
A possible (but not necessarily optimal) choice of C_2, C_3 and C_4 can be made explicit. The gap between C_2 and C_4 vanishes when we observe samples with a normal distribution, but a positive gap exists in general, e.g., for the Bernoulli distribution. We will demonstrate this fact empirically by simulation in Section 5.

Remark 2.
The names Chernoff information/coefficient/divergence in this paper follow the survey [38]. The Chernoff information is the Chernoff α-divergence with α = α*, which maximizes D_α(ϕ_0 ∥ ϕ_1). We are not going to calculate the exact value of α* in this work. D_{α*}(· ∥ ·) represents the Chernoff information, and α* might be different for different inputs. The α* in Theorem 2.1 is unique since we assume ϕ_0 and ϕ_1 differ on a set with positive measure.
To gain a better understanding of Theorem 2.1, we introduce a corollary for the i.i.d. case. Under the assumption of i.i.d. sampling, some quantities in the theorem become constants. Since many existing results only consider i.i.d. cases, the following corollary will be helpful for comparison.

Comparison with existing results
The Chernoff type lower bound can be traced back to early literature. [2, Theorem 5] provides a lower bound for the Bayes risk, stated as (2.11), which is strictly weaker than the result in Theorem 2.1. This lower bound has been applied to the Bayes risk of quantum hypothesis testing, as in [8]. For the i.i.d. case, D_{α*}(ϕ_0 ∥ ϕ_1) has been shown to be the best achievable exponent [39, Theorem 11.9.1], and it is restated in [7, Theorem 2.1] under the same conditions as Corollary 1. This result can be obtained from Corollary 1, since the exponent of η(ϕ_0, ϕ_1) has the form log η(ϕ_0, ϕ_1) = −n D_{α*}(ϕ_0 ∥ ϕ_1) − (1/2) log n + O(1). The log n term was investigated in [40], and applied to a hypothesis testing problem in [4, Lemma 11]. However, that result only applies to the Poisson distribution when the samples are i.i.d. in a fixed density regime. If the samples are not identically distributed, their lower bound is not valid. The authors of [6] generalize the result to other settings; however, their bounds cannot be applied when the observed data are not i.i.d. Poisson. Therefore, neither of them proposed a minimax lower bound that matches their algorithmic error rate in community detection problems.

Application to exponential families
With concrete expressions for ϕ_0 and ϕ_1, we can write the Chernoff type bound in Theorem 2.1 in closed form up to a constant factor. In this section, we are interested in exponential families with PMFs or PDFs of the form

ϕ_θ(x) = h(x) exp(θ^⊤ T(x) − A(θ))   (2.13)

for x ∈ Ω and θ ∈ Θ. We assume the parameter space Θ is a convex subset of a Euclidean space and A is a smooth function on Θ. Let {ϕ_{zj} : z ∈ {0, 1}, j ∈ [n]} belong to an exponential family. To be more specific, we assume that there exist parameters θ_{zj} ∈ Θ, z ∈ {0, 1}, j ∈ [n], such that ϕ_{zj} = ϕ_{θ_{zj}}. We still define ϕ_0 and ϕ_1 as in (2.2) on some measure space such that (ϕ_0, ϕ_1) ∈ F(Ω, Σ, μ, n) (see (2.3)). Let us define θ_{αj} := (1 − α)θ_{0j} + αθ_{1j}; then θ_{αj} is a valid parameter since we assume the parameter space Θ is convex. The Chernoff α-divergence has a closed form:

D_α(ϕ_{0j} ∥ ϕ_{1j}) = (1 − α)A(θ_{0j}) + αA(θ_{1j}) − A(θ_{αj}).   (2.14)

See Section 6.2 for the derivation of the last equation. Suppose Y_j ∼ ϕ_{α*j}; then, using the definition of Z_j in (2.10),

Var[Z_j] = (θ_{0j} − θ_{1j})^⊤ ∇²A(θ_{α*j}) (θ_{0j} − θ_{1j}),   (2.15)

where ∇²A(θ) is the Hessian matrix of A evaluated at θ. Now we can establish a corollary for the case when the ϕ_{zj}'s belong to exponential families. Suppose Σ_{j=1}^n E|Z_j|³ ≤ C_1 n σ̄_n²; then, using the same constants C_2, C_3 and C_4 as in Theorem 2.1, the conclusion of Theorem 2.1 holds.
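As a quick numerical check of (2.14), the following sketch (ours, not from the paper) compares the closed form against direct summation for the Bernoulli family; the helper names are hypothetical.

```python
import numpy as np

A = lambda t: np.log1p(np.exp(t))      # log-partition of Bernoulli: A(theta) = log(1 + e^theta)
logit = lambda p: np.log(p / (1 - p))

def d_alpha_closed(p0, p1, alpha):
    # Closed form (2.14): (1-a) A(th0) + a A(th1) - A((1-a) th0 + a th1).
    th0, th1 = logit(p0), logit(p1)
    return (1 - alpha) * A(th0) + alpha * A(th1) - A((1 - alpha) * th0 + alpha * th1)

def d_alpha_direct(p0, p1, alpha):
    # Definition (2.6): -log sum_x phi_0(x)^{1-a} phi_1(x)^a over x in {0, 1}.
    s = p0**(1 - alpha) * p1**alpha + (1 - p0)**(1 - alpha) * (1 - p1)**alpha
    return -np.log(s)

print(d_alpha_closed(0.3, 0.7, 0.4), d_alpha_direct(0.3, 0.7, 0.4))  # the two agree
```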

Application to Bernoulli distribution
We are going to investigate a specific exponential family. Let p_{zj} ∈ (0, 1) and θ_{zj} = log(p_{zj}/(1 − p_{zj})) for z ∈ {0, 1}, j ∈ [n], and define the PMFs of Bern(p_{zj}):

ϕ_{zj}(x) = p_{zj}^x (1 − p_{zj})^{1−x},  x ∈ {0, 1}.

This coincides with (2.13) if we let A(θ) = log(1 + e^θ). Let us briefly recall the testing problem in Section 2.1. We randomly draw a number z ∈ {0, 1} with equal probability 1/2, and draw a random sample X = (X_1, . . . , X_n) where X_j ∼ Bern(p_{zj}) independently. As usual, we want to recover z given X = x ∈ {0, 1}^n. Then we have

D_α(ϕ_{0j} ∥ ϕ_{1j}) = −log(p_{0j}^{1−α} p_{1j}^α + (1 − p_{0j})^{1−α}(1 − p_{1j})^α),   (2.16)

and, recalling the definition of α* and σ̄_n from Theorem 2.1, we have

σ̄_n² = (1/n) Σ_{j=1}^n (θ_{0j} − θ_{1j})² p_{α*j}(1 − p_{α*j}),   (2.17)

where p_{αj} denotes the parameter of the Bernoulli distribution with PMF ϕ_{αj}. Provided Σ_{j=1}^n E|Z_j|³ ≤ C_1 n σ̄_n², the upper and lower bounds of η(ϕ_0, ϕ_1) = 2R(ẑ, z) in the theorem hold.
Finally, it is worth mentioning the special case when p_{z1} = · · · = p_{zn} := p̄_z for z ∈ {0, 1}. Let ψ_z(x) be the PMF of Bin(n, p̄_z). Then one can check that

η(ϕ_0, ϕ_1) = η(ψ_0, ψ_1).

This is due to the fact that, given the observed data x, the optimal test only relies on the minimal sufficient statistic Σ_{j=1}^n x_j. This observation generalizes Corollary 1 to the cases when only the sufficient statistic, which is a sum of i.i.d. random variables, is observed. For example, one can apply Corollary 1 to the Bayes error probability of Poisson parameter testing by the fact that a Poisson variable is a sum of multiple i.i.d. Poisson variables.
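Under this reduction to the sufficient statistic, η can be computed exactly from two binomial PMFs; the following sketch (ours, assuming scipy) does so and checks the −(1/2) log n second-order behavior.

```python
import numpy as np
from scipy.stats import binom

def eta_binomial(n, pbar0, pbar1):
    # eta(psi_0, psi_1) = sum_x min(Bin(n, pbar0).pmf(x), Bin(n, pbar1).pmf(x));
    # by sufficiency this equals eta for n i.i.d. Bernoulli coordinates.
    x = np.arange(n + 1)
    return np.minimum(binom.pmf(x, n, pbar0), binom.pmf(x, n, pbar1)).sum()

n = 200
eta = eta_binomial(n, 0.3, 0.7)
d_half = -n * np.log(2 * np.sqrt(0.3 * 0.7))  # Chernoff information; alpha* = 1/2 by symmetry
# Rate optimality: log(eta) + D_{1/2} should be -0.5*log(n) + O(1).
print(np.log(eta) + d_half, -0.5 * np.log(n))
```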
Calculation details for this section appear in Sections 6.4 to 6.7.

Community detection in the stochastic block models
The results in Section 2 apply to many clustering and classification problems in statistics. A typical example is community detection in the stochastic block models (SBM). Given good estimates of the parameters, community detection is indeed a classification problem. Hence the clustering error rate of the label estimates depends heavily on the Bayes error probability.

Background of stochastic block models
We will focus on a network that can be represented by a symmetric adjacency matrix A ∈ {0, 1}^{n×n}, where the nodes are indexed by [n]. We assume that there are K communities on [n], and the memberships of the nodes are given by a label vector z ∈ [K]^n. We let n_k := |{i : z_i = k}| be the size of the kth community. Under the assumptions of SBM, given a symmetric connectivity matrix P ∈ [0, 1]^{K×K},

A_{ij} = A_{ji} ∼ Bern(P_{z_i z_j}) independently for 1 ≤ i < j ≤ n, and A_{ii} = 0.   (3.1)

That is, the connectivity of nodes only depends on their memberships, and there are no self-loops. A fundamental task of community detection on SBM is to recover z given A and K. For consistency of notation with Section 2.3, we define

p_{kj} := P_{k z_j} for k ∈ [K], j ∈ [n].   (3.2)

Note that the parameter vectors computed under the two hypotheses z_i = k and z_i = ℓ are the same at all entries but the ith one; when n is large, the effect of one entry is merely a constant factor. Here p_{k*} = (p_{k1}, . . . , p_{kn}), and similarly A_{i*} is the ith row of A. This notation will be used in the rest of this paper. We will consider parameters satisfying

n/(βK) ≤ n_k ≤ βn/K for all k ∈ [K],  max_{k,ℓ} P_{kℓ} ≤ 1 − ε,  max_{k,ℓ} P_{kℓ} ≤ ω min_{k,ℓ} P_{kℓ},   (3.3)

for some fixed constants K, β > 1, ε ∈ (0, 1), and ω > 1. β controls the balance between different communities, so there are no overly small or large communities. All connectivity probabilities are bounded above by 1 − ε, which is a mild sparsity assumption.
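For concreteness, here is a minimal sampler for the model (3.1) (our sketch; sample_sbm is a hypothetical name):

```python
import numpy as np

def sample_sbm(z, P, rng=None):
    # Symmetric adjacency matrix with no self-loops:
    # A_ij = A_ji ~ Bern(P[z_i, z_j]) independently for i < j, A_ii = 0, as in (3.1).
    rng = np.random.default_rng() if rng is None else rng
    n = len(z)
    probs = P[np.ix_(z, z)]              # probs[i, j] = P[z_i, z_j]
    upper = rng.random((n, n)) < probs
    A = np.triu(upper, k=1).astype(int)  # keep i < j only
    return A + A.T                       # symmetrize; diagonal stays 0

z = np.repeat([0, 1, 2], 50)             # K = 3 balanced communities, n = 150
P = np.full((3, 3), 0.05) + 0.10 * np.eye(3)
A = sample_sbm(z, P)
```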
For an estimate ẑ of z, we are interested in the misclassification rate defined as

Mis(ẑ, z) := min_{π ∈ S_K} (1/n) Σ_{i=1}^n 1{π(ẑ_i) ≠ z_i},   (3.4)

where S_K is the symmetric group containing all permutations of [K], and the permutation π is applied entrywise to the label vector ẑ.
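For small K, the minimum in (3.4) can be evaluated by enumerating S_K, as in this sketch of ours (the function name mis is hypothetical):

```python
import numpy as np
from itertools import permutations

def mis(z_hat, z, K):
    # Misclassification rate (3.4): minimize the Hamming error over all
    # label permutations pi in S_K, applied entrywise to z_hat.
    z_hat, z = np.asarray(z_hat), np.asarray(z)
    best = 1.0
    for pi in permutations(range(K)):
        pi = np.array(pi)
        best = min(best, np.mean(pi[z_hat] != z))
    return best

print(mis([0, 0, 1, 1], [1, 1, 0, 0], K=2))  # 0.0: labels agree up to a swap
```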

Fundamental limit
Let us first consider a simplified symmetric hypothesis testing problem in SBM.
In the community detection problem described in the previous section, only the adjacency matrix A and the number of communities K are given. Now suppose that, additionally, we know z_{−i}, i.e., all the labels but the ith one, as well as the connectivity matrix P; our goal is to recover z_i. To further simplify the problem, we assume z_i ∈ {k, ℓ}; then the hypothesis testing problem becomes a comparison between the parameters p_{k*} and p_{ℓ*} defined in (3.2), since the distribution of the Bernoulli vector A_{i*} is a product of Bern(p_{kj})'s or a product of Bern(p_{ℓj})'s accordingly.
For the vector parameters we write

D_α(p_{k*} ∥ p_{ℓ*}) and η(p_{k*}, p_{ℓ*}),   (3.5)

which also denote the same quantities if we input the corresponding PMFs. Substituting p_{0*} and p_{1*} with p_{k*} and p_{ℓ*} in Section 2.3, and using the assumptions about SBM in Section 3.1, we have the following lemma.

Lemma 1.
Given the adjacency matrix A and the parameters K, z_{−i} and P, and knowing that z_i = k or ℓ with probability 1/2, the Bayes estimator ẑ_i attains the rate optimal Chernoff bounds of Theorem 2.1.

Now we will derive a minimax lower bound for the community detection problem. We will consider the following parameter space:

Θ(n, K, p, q) := {(z, P) : P ∈ (0, 1)^{K×K}, P = P^⊤, P_{kk} ≥ p, max_{k≠ℓ} P_{kℓ} ≤ q, n/(βK) ≤ n_k ≤ βn/K}.   (3.6)

Theorem 3.1 (minimax lower bound). Define the Chernoff information I_K for community detection as in (3.7). If p > q and nI_K² ≥ Cp for some C only depending on β, ε, K and ω, then the minimax lower bound (3.8) holds. The proof appears in Section 7.3 and is inspired by [36]. Compared with the minimax lower bound in [36], which is of the form exp(−(1 + o(1))nI), our bound makes the higher-order term explicit. Considering the case p/q → c > 1 and np → ∞, the higher-order term (np)^{−1/2} converges to 0 as the average degree increases. It requires extra effort to find an algorithm achieving this sharp lower bound.

Remark 3.
The condition nI_K² ≥ Cp is not required in [36], but it is needed in this theorem for a technical reason. Essentially, the lower bound in Theorem 2.1 requires the condition √n σ̄_n (1 − α*)α* ≥ C_3 for some sufficiently large C_3. The corresponding condition nI_K² ≥ Cp in the community detection scenario is needed due to Lemma 4. In other words, the prefactor in the lower bound is only controlled when √n σ̄_n is sufficiently large.

Algorithm achieving the minimax lower bound
Our algorithm is inspired by the pseudo-likelihood approach in [34]. We define an operator to estimate P given the adjacency matrix A and estimated labels z̃:

B(A, z̃)_{kℓ} := (Σ_{i≠j} A_{ij} 1{z̃_i = k, z̃_j = ℓ}) / |{(i, j) : i ≠ j, z̃_i = k, z̃_j = ℓ}|.   (3.10)

We will also use the likelihood ratio classifier defined as follows:

L(A, P̂, z̃)_i := argmax_{k∈[K]} Σ_{j≠i} [A_{ij} log P̂_{k z̃_j} + (1 − A_{ij}) log(1 − P̂_{k z̃_j})].   (3.11)

Note that we can apply these two operators to submatrices of A with the corresponding indices if needed. This becomes an EM-type algorithm if we repeat (3.10) and (3.11) iteratively, i.e., (3.10) is the expectation step and (3.11) is the maximization step. As pointed out in [6], it requires at least two iterations of EM-type updates to achieve the optimal error rate up to a constant. To generate enough independence between iterations, we combine the block partition method in [37] and the "leave-one-out" trick in [35]. It is worth noting that, besides the dependence between A and z̃ in L(A, P̂, z̃), other dependence can be handled by uniform bounds. The steps of Algorithm 1 are described as follows (a code sketch of the two operators appears after these step descriptions):

Steps 3 to 4: We apply spectral clustering on the whole adjacency matrix. However, we will only use its output in the matching step (step 9) and to approximate an initial estimate P̂ of P. The dependence between P̂ and A can be handled by uniform bounds.
Step 5: This is the block partitioning trick. Data in different blocks will be used in different steps to acquire independence.
Steps 6 to 7: This is the "leave-one-out" trick. In each iteration, we only use the data of the jth node in step 12, so the last likelihood ratio classifier will be independent of the other steps in the for loop.
Steps 8 to 9: We apply spectral clustering on two of the subblocks. Although the labels from spectral clustering are consistent, the corresponding optimal permutations in (3.4) are not necessarily the same in general. This issue is resolved by step 9: after the matching step, the new label vector z̃ has the same permutation as before when computing the misclassification rate. This fact will be clarified in the proof. Note that although z̃ depends on A, z̃_I and z̃_J only depend on the corresponding subblocks as long as the spectral clustering algorithm outputs good enough labels. For reference, the subroutines at the end of Algorithm 1 read:

14: function SpectralClustering(A, K)
15:   Apply degree-truncation to A to obtain A_re.
16:   Apply SVD on A_re so that A_re = UΣU^⊤. Let Σ̂ contain the top K singular values on the diagonal and Û contain the corresponding singular vectors.
17:   Output the K-means clustering result on the rows of ÛΣ̂.
18: end function
19: function Match(· · ·)
      ⋮
21:   Output z̃.
22: end function
Step 10: We apply the first likelihood ratio classifier on a different subblock, using the estimated connectivity matrix from step 4 and the labels from step 9.
Step 11: We obtain the updated labels and estimate the connectivity matrix P̂ again according to the new labels.
Step 12: We update the labels once more by the likelihood ratio classifier, according to the new P̂ and z̃ obtained in step 11.
Steps 14 to 18: The spectral clustering algorithm proposed in [22]. Details of the degree-truncation step appear in Section 8.
Steps 19 to 22: A matching algorithm finding the optimal permutation between labels. A linear assignment algorithm with computational complexity O(K³) can find the exact solution of z̃ [41].
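The following is a sketch of our reading of the two operators (3.10) and (3.11); the paper's exact handling of diagonal blocks may differ in details, and the names B and L mirror the operators above only informally.

```python
import numpy as np

def B(A, z, K, eps=1e-9):
    # Connectivity estimation as in (3.10): P_hat[k, l] is the average of A
    # over the (k, l) block, excluding the (zero) diagonal entries.
    P_hat = np.empty((K, K))
    for k in range(K):
        for l in range(K):
            block = A[np.ix_(z == k, z == l)]
            cnt = block.size - (np.sum(z == k) if k == l else 0)  # drop A_ii
            P_hat[k, l] = block.sum() / max(cnt, 1)
    return np.clip(P_hat, eps, 1 - eps)

def L(A, P_hat, z):
    # Likelihood ratio classifier as in (3.11): node i gets the label k maximizing
    # sum_{j != i} A_ij log P_hat[k, z_j] + (1 - A_ij) log(1 - P_hat[k, z_j]).
    logp, logq = np.log(P_hat), np.log1p(-P_hat)
    scores = A @ logp[:, z].T + (1 - A) @ logq[:, z].T   # n x K
    scores -= logq[:, z].T                               # remove self-term j = i (A_ii = 0)
    return np.argmax(scores, axis=1)

# One EM-type sweep, given any consistent initializer z0 (hypothetical):
# P0 = B(A, z0, K); z1 = L(A, P0, z0)
```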
The following block partition of the adjacency matrix A might help in understanding the algorithm; in the partition, "· · ·" represents the block A_{j×(I∪J)}. We can see that the second spectral clustering and both likelihood ratio tests are applied on different blocks of the adjacency matrix, so we do not need to worry about dependence between steps. Now we present the theoretical guarantees of the output of Algorithm 1.
Theorem 3.2. There exists a constant C only depending on β, ε, K and ω such that, if Cnp* ≤ (D*)², then the output ẑ of Algorithm 1 satisfies:
(a) if D* ≤ 2 log n, then Mis(ẑ, z) ≲ η* with high probability;
(b) if nη* = o(1), then ẑ achieves exact recovery with high probability, i.e., Mis(ẑ, z) = 0 with probability tending to 1.
Moreover, η* in (a) can be replaced by an explicit expression whose second-order term is of the form (3.9). Cases (a) and (b) in the theorem describe all situations in the model: case (a) assumes D* ≤ 2 log n, and if D* > 2 log n, it is easy to check that nη* = o(1) with the help of Lemma 1.
Theorem 3.2 immediately implies the error rate on the parameter space Θ(n, K, p, q) defined in (3.6) by considering the least favorable submodel. It shows that the misclassification error of our algorithm is rate optimal in this parameter space. We summarize this result in the following corollary.
Remark 4 (Comparison with existing results). We have already compared some results from the literature. Here, we summarize the novelty in detail.
1. Existing papers consider the asymptotic behavior of optimal community detection either in general directed or bipartite SBM [6] or in symmetric assortative SBM [37,35,42]. We extend the algorithms to general SBM.
2. We apply two local updates (likelihood ratio tests) on the symmetric adjacency matrix in our algorithm. It is also possible to apply more updates by partitioning into more blocks. Although multiple steps of local updates are allowed in [43] via variational inference, a data splitting method for initialization is required there and is lacking in their algorithm.
3. By the new Chernoff bound introduced in Theorem 2.1, we provide a sharpened minimax error rate and a tight misclassification rate for our algorithm. In particular, we replace the uncertain term exp(o(1)nI_K) in [36] with an explicit expression. Although a higher-order term is also discovered in [42], their result only applies to assortative SBM with K = 2, P_11 = P_22 = p > P_12 = P_21 = q and p ≍ q ≍ log n / n. Our algorithm applies to general SBM.
4. In a more general setting than [42], the authors of [21,32,33] consider K = 2, P_11 = P_22 = p > P_12 = P_21 = q, and their algorithms achieve error rate exp(−(1 − o(1))nI_2), where I_2 is defined in (3.7). In particular, the o(1) term can behave like O((nI_2)^{−1/2}) in [32,33]. The error rates in these works are not as sharp as the result in Corollary 3, because D* = nI_2 in this setting and our theorem shows the error rate can be as sharp as exp(−(1 + o(1))nI_2), where the o(1) term is of the form (3.9).
5. When considering the error rate in Corollary 3, the condition Cnp* ≤ (D*)² is not required in [21,32,33]. However, this is because they consider a simpler model. Under the same setting, if we used the error rates proposed in these works for our initialization steps, e.g., step 8 in Algorithm 1, then the condition Cnp* ≤ (D*)² could be removed in Corollary 3. However, we have not found generalizations of the results in [21,32,33] to the cases K > 2 with sharp enough error rates, so we still require the condition Cnp* ≤ (D*)² in our theorem.

Discussion
We discuss some possible extensions and future work in this section.

Rate optimal Chernoff bound for quantum hypothesis testing
The quantum Chernoff bound has been shown to be both the upper [44] and lower [7] bound for symmetric quantum hypothesis testing. In the proof of the Chernoff lower bound, the authors reduce the problem from the quantum setting to a classical probability space and apply the classical Chernoff bound. Hence, the second-order term in Theorem 2.1 applies to the lower bound immediately. However, its application to the Chernoff upper bound is more technical.

Simplified feasible algorithm for community detection
It was pointed out in [35] that the "leave-one-out" trick is not necessary in practice, so only a single spectral clustering is required; see Algorithm 3 in their paper. However, they could not provide a theoretical guarantee for this simplified algorithm. To the best of our knowledge, for general SBM there is no algorithm with a finite number of global steps (either spectral clustering or semidefinite programming) that can achieve the error rate in Theorem 3.2. The idea of "leave-one-out" can be generalized to leaving more than one out [42], but the minimum number of global steps still grows as the number of nodes increases. With the assortativity assumption, a concurrent work [32] shows that in the simplest SBM setting, i.e., K = 2, P_11 = P_22 = p > P_12 = P_21 = q, a semidefinite programming approach can achieve a sharp error rate of the form exp(−(1 − o(1))nI_2), where I_2 is defined in (3.7). With such a tight error rate in the initialization step, only one further update step is required to achieve the minimax lower bound in Theorem 3.1. However, the "leave-one-out" step is still required, and their method does not apply to general SBM, e.g., when q > p.

Simulation
We will show by simulation that, in some asymptotic settings, the Bayes error probability converges at the rate expected from Theorem 2.1. Since the exponent of the error rate is well known, we focus on the second-order asymptotics in our experiments. Let us consider the Bernoulli distributions analyzed in Section 2.3. Let p_{01} = p_{02} = · · · = p_{0n} = p_{1(n+1)} = p_{1(n+2)} = · · · = p_{1(2n)} = 0.55, and p_{11} = p_{12} = · · · = p_{1n} = p_{0(n+1)} = p_{0(n+2)} = · · · = p_{0(2n)} = 0.45, i.e., p_{0*} and p_{1*} each consist of n entries equal to 0.55 and n entries equal to 0.45, with the two halves swapped. Now we consider the Chernoff α-divergence. The optimal α* in Theorem 2.1 is 1/2 by symmetry. Using the notation in (3.5), by (2.16), we have

D_{1/2}(p_{0*} ∥ p_{1*}) = −2n log(2√(0.55 · 0.45)).

By (2.17), with some details in Section 6.4, σ̄_n is a fixed constant. By Theorem 2.1, we expect

η(p_{0*}, p_{1*}) ≍ n^{−1/2} exp(−D_{1/2}(p_{0*} ∥ p_{1*})),

or equivalently, a_n := log η(p_{0*}, p_{1*}) − 2n log(2√(0.55 · 0.45)) asymptotically behaves like −(1/2) log n + C. We can also think of p_{0*} and p_{1*} as the parameters in (3.2) associated with an SBM with community sizes n_0 = n_1 = n and connectivity matrix with 0.55 on the diagonal and 0.45 off the diagonal, so that b_n := log(2 · misclassification rate of Algorithm 1) − 2n log(2√(0.55 · 0.45)) also tends to −(1/2) log n + C. Note that the factor 2 comes from the fact that η(p_{0*}, p_{1*}) = 2 · (Bayes error probability), which helps the simulation scale better. We use the true Bernoulli PMFs to compute η(p_{0*}, p_{1*}), then find the misclassification rate of Algorithm 1 and compute b_n. From Figure 1, we observe that as n increases, both a_n and b_n behave like −(1/2) log n + C for the same constant C. For smaller n, the misclassification rate is large since the initialization in Algorithm 1 is not accurate enough; however, b_n becomes stable as n gets larger.
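The quantity a_n can be computed exactly through the pair of block sums, which are jointly sufficient here; the following sketch (ours, assuming scipy) does this with products of two binomial PMFs.

```python
import numpy as np
from scipy.stats import binom

def eta_two_blocks(n, p=0.55, q=0.45):
    # Exact eta(p_0*, p_1*) for p_0* = (p 1_n, q 1_n), p_1* = (q 1_n, p 1_n):
    # the sufficient statistic is the pair of block sums (S1, S2).
    x = np.arange(n + 1)
    f_p, f_q = binom.pmf(x, n, p), binom.pmf(x, n, q)
    phi0 = np.outer(f_p, f_q)   # joint PMF of (S1, S2) under z = 0
    phi1 = np.outer(f_q, f_p)   # ... under z = 1
    return np.minimum(phi0, phi1).sum()

for n in [50, 100, 200, 400]:
    a_n = np.log(eta_two_blocks(n)) - 2 * n * np.log(2 * np.sqrt(0.55 * 0.45))
    print(n, a_n + 0.5 * np.log(n))   # should stabilize near a constant C
```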
Another interesting empirical result we want to show by simulation is that, in general, there is a gap between the constants C_2 and C_4 in Theorem 2.1. We let p_{0*} = 0.3 · 1_n and p_{1*} = 0.7 · 1_n, i.e., the coordinates are i.i.d. Bern(0.3) under ϕ_0 and i.i.d. Bern(0.7) under ϕ_1. By symmetry, we have α* = 1/2, so

D_{1/2}(p_{0*} ∥ p_{1*}) = −n log(2√(0.3 · 0.7)).

Again, we let a_n = log η(p_{0*}, p_{1*}) − n log(2√(0.3 · 0.7)). The plots show the behavior of a_n over different ranges of n. Although a_n asymptotically behaves like −(1/2) log n, it oscillates up and down indefinitely. This simulation result empirically shows that a_n + (1/2) log n does not converge to any constant for such p_{0*} and p_{1*}.
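A sketch (ours, assuming scipy) of this oscillation experiment, computing a_n exactly via the binomial sufficient statistic:

```python
import numpy as np
from scipy.stats import binom

def a_n(n, p=0.3, q=0.7):
    x = np.arange(n + 1)
    eta = np.minimum(binom.pmf(x, n, p), binom.pmf(x, n, q)).sum()
    return np.log(eta) - n * np.log(2 * np.sqrt(p * q))

# a_n + 0.5*log(n) does not settle: the lattice nature of the binomial
# makes the constant term oscillate indefinitely, matching the plots above.
vals = [a_n(n) + 0.5 * np.log(n) for n in range(50, 400)]
print(min(vals), max(vals))   # a persistent gap rather than convergence
```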

Proposition 1. Let Φ be the cumulative distribution function of the standard normal distribution; then for x > 0 the estimates (6.2)-(6.4) hold.

Proof. For x > 0, we have a divergent series expansion at ∞, which implies (6.2). We also have the power series expansion at 0: letting m!! = 1 · 3 · · · m for odd integers m, we obtain an expansion of Φ, which implies (6.3). Then we expand e^{−x²/2}(x + x³/3) and obtain (6.4).
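For reference, the standard expansions presumably underlying (6.2)-(6.4) are the following (a reconstruction, not quoted from the source):

```latex
% Divergent asymptotic expansion at infinity (x -> \infty), giving (6.2):
1 - \Phi(x) \sim \frac{e^{-x^2/2}}{\sqrt{2\pi}}
  \left( \frac{1}{x} - \frac{1}{x^3} + \frac{3}{x^5} - \cdots \right),
\qquad x > 0.

% Power series at 0, with m!! = 1 \cdot 3 \cdots m for odd m, giving (6.3):
\Phi(x) = \frac{1}{2} + \frac{e^{-x^2/2}}{\sqrt{2\pi}}
  \sum_{m \ge 0} \frac{x^{2m+1}}{(2m+1)!!}.

% Truncating the series at m = 1 and expanding e^{-x^2/2}\,(x + x^3/3)
% yields the small-x estimate (6.4).
```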
On the other hand, for any t ∈ [0, 1], we have the decomposition (6.6). The first integral in the last step can be evaluated in closed form. For the second integral in the last step of (6.6), we proceed similarly.

By (6.2) and the monotonicity of the integrand, the integral in (6.6) admits the desired lower bound.

Proof of the variance formula (2.15)
The variance of Z_j can be derived directly from the following proposition, whose proof is skipped for brevity.
By (2.14), we can write the Chernoff coefficient for p_{0j} and p_{1j} in terms of the p_{αj}'s, from which the claim follows.

The rest of the proof for the Bernoulli example in Section 2.3
To establish the upper and lower bounds, it remains to verify the sufficient condition of Theorem 2.1. A direct computation shows that Σ_{j=1}^n E|Z_j|³ ≤ C_1 n σ̄_n².

Proof of Lemma 1
It is required to check that the assumptions in Theorem 2.1 are satisfied. Firstly, we need to control Var[Z_j]. We use the notation in Section 2.3, replacing p_{0*} and p_{1*} by p_{k*} and p_{ℓ*}; under (3.3), Var[Z_j] can be bounded. Secondly, by Lemma 5, we can remove α*(1 − α*), since it is bounded below by a constant and bounded above by 1/4. Thirdly, by Lemma 4, nσ̄_n² in (2.17) is sufficiently large under the assumption np* ≤ C(D_{α*}(p_{k*} ∥ p_{ℓ*}))². Furthermore, we need to check that nσ̄_n² can be replaced by D_{α*}(p_{k*} ∥ p_{ℓ*}) up to a constant factor. For all k, ℓ and j, we have the upper bound (7.1). Under the assumption of the block structure and (3.3), for at least n/(βK) many i ∈ [n], we have the matching lower bound (7.2). Combining (7.1) and (7.2), we obtain the claimed equivalence for Σ_{j=1}^n Var[Z_j]. Finally, by Lemma 5, we can again remove α*(1 − α*), since it is bounded below by a constant and bounded above by 1/4.

Auxiliary lemmas for proof of Lemma 1
Lemma 4. Under the assumption (3.3), for any C_1 there exists C_2, depending only on β, ε, K and ω, such that if D_{α*}(p_{k*} ∥ p_{ℓ*})² ≥ C_2 np*, then √n σ̄_n α*(1 − α*) ≥ C_1, where σ̄_n is defined in (2.17). In particular, we can choose C_2 large enough so that C_1 is also sufficiently large.
Proof. We briefly write α = α* in this proof, and denote p_0 := p_{kj}, p_1 := p_{ℓj} and a := D_α(p_{k*} ∥ p_{ℓ*})/n. Without loss of generality, we assume p_1 ≥ p_0. Then the divergence bound implies an inequality which, by straightforward rearrangement, yields a lower bound; here the second inequality is due to the fact that the derivative of log(p_1 + x) is at least 1/p_1 on (−p_1, 0], and the last inequality uses the fact that 1 − e^{−x} ≥ x − x²/2 for x ≥ 0. We recall that we assume p_1 ≥ p_0, so log((1 − p_0)/(1 − p_1)) ≥ 0. By (7.2), and since we choose α = α* to be optimal, Lemma 5 implies that α* and 1 − α* are bounded below by a constant. Under the assumption D_{α*}(p_{k*} ∥ p_{ℓ*})² ≥ C_2 np*, by Lemma 10 and using the constant C_ε in that lemma, we have D_{α*}(p_{k*} ∥ p_{ℓ*}) ≤ C_ε np*, which implies np* ≥ C_2/C_ε². Now we consider the first term: by choosing a sufficiently large C_2, the RHS is sufficiently large. For the second term, we recall a = D_α(p_{k*} ∥ p_{ℓ*})/n and the assumption that p_1 ≤ p*, so this term is also sufficiently large by choosing a big enough C_2.
Proof. We first consider the case n = 1 and briefly denote p_{01} := p and p_{11} := q (in this proof only). Let f(α) := p^{1−α}q^α + (1 − p)^{1−α}(1 − q)^α. Since f is smooth and convex, α* minimizes f(α) if and only if f′(α*) = 0. Let us define x := log(q/p) and y := log((1 − q)/(1 − p)); then p = (1 − e^y)/(e^x − e^y) and 1 − p = (e^x − 1)/(e^x − e^y). Without loss of generality, we assume p > q, so x < 0 and y > 0. Hence the stationarity condition can be rewritten in terms of a function g, and we can observe that g is a strictly increasing smooth function on ℝ with g′ ∈ (0, 1). Moreover, α* is the slope of a secant line intersecting the graph of g at x and y, so α* can only take the value g′(z) for some z ∈ [x, y]. Since x ∈ [−log ω, log ω] and y ∈ [1 − ε, 1/(1 − ε)], there exists δ, which only depends on ω and ε, such that α* ∈ [δ, 1 − δ]. Now we can generalize the conclusion to n > 1. Let f(α) := ∏_{j=1}^n f_j(α), where each f_j is defined as above with (p_{0j}, p_{1j}) in place of (p, q). Since each positive convex function f_j is decreasing on [0, δ] and increasing on [1 − δ, 1], so is their pointwise product f. Therefore, f achieves its minimum on [δ, 1 − δ].

Proof of Theorem 3.1
It suffices to replace Lemma 5.2 in [36] with the Chernoff lower bound in Section 2.3. Let n′ = ⌊n/K⌋, and let X_i ∼ Bern(p) and Y_i ∼ Bern(q) for i ∈ [n′]. For sufficiently large np, the Chernoff lower bound applies to the corresponding testing problem. This inequality provides a lower bound of B_τ(σ(1)) in [36]. The rest of the proof follows from the arguments in [36].

Auxiliary lemmas for proof of Theorem 3.2
In this section, we will use the following concentration inequality [47, p. 118]:

The same bound holds for P(S < −vt).
Lemma 6 (Uniform Parameter Estimation). For P̂ obtained from the operation B(A, z̃), assuming Mis(z̃, z) ≤ γ for 1/n ≤ γ ≤ 1/(2βK) with optimal permutation π* = id, the deviation probability P(sup{‖P̂ − P‖_∞ : · }) can be bounded for every τ > 0. If γ < 1/n, we can replace nγ log γ by 0.

Proof. We only consider the case k ≠ ℓ; if k = ℓ, the arguments follow similarly. By the definition of P̂_{kℓ} from (3.10), we obtain the upper bound, and the lower bound follows analogously. Therefore, using n̄_k ≥ n/(2βK) again, we can control the deviation of P̂_{kℓ}. There are at most a bounded number of candidate label vectors at each misclassification level; the second term on the RHS is a geometric sum, and using 1 ≤ nγ ≤ n/(2βK) ≤ n/4, we can bound it. If γ < 1/n, then z̃ = z is unique. Taking the union bound, we obtain the desired probability.
Let ϕ(x; p) be the PMF, evaluated at x, of a Poisson-Binomial variable with parameters p = (p_1, . . . , p_n). In particular, if p = p̄1_n, then ϕ(x; p) is the PMF of a binomial distribution with parameters n and p̄.
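A Poisson-Binomial PMF can be computed by iterated convolution, as in this short sketch of ours (poisson_binomial_pmf is a hypothetical name):

```python
import numpy as np

def poisson_binomial_pmf(p):
    # PMF of S = sum_j Bern(p_j), by dynamic programming over coordinates.
    # pmf[x] = P(S = x); reduces to Binomial(n, pbar) when all p_j = pbar.
    pmf = np.array([1.0])
    for pj in p:
        pmf = np.convolve(pmf, [1 - pj, pj])
    return pmf

print(poisson_binomial_pmf([0.5] * 4))  # Binomial(4, 1/2): 1/16, 4/16, 6/16, 4/16, 1/16
```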
By the inequality 1/(1 − x) ≤ exp(x/(1 − x)) for x ∈ (0, 1), we can bound the first factor. For the other term, 1 + x ≤ e^x implies the corresponding bound. Therefore, the claim follows.

Lemma 9. Let Σ_{j∈[n]} A_{ij} be the degree of node i, where A is the adjacency matrix in SBM (see (3.1)), and assume max_{j∈[n]} E[A_{ij}] ≤ p* ≤ 1 − ε. Then there exists C_ε > 0, depending only on ε, such that the stated tail bound holds.

Proof of Lemma 9. We choose C_ε large enough. Now we want to find an upper bound for the probability that a degree is large; the last inequality holds by the choice of C_ε.

Lemma 11 (Perturbed Likelihood Ratio Test). Assume the parameters satisfy (3.3). Given the ith row of the adjacency matrix, the perturbed likelihood ratio test satisfies the stated error bound.
Proof. Firstly, we define the following probability mass functions: ψ_0 as the PMF of ∏_{r=1}^K Bin(n̄_r, P_{kr}), ψ̃_0 as that of ∏_{r=1}^K Bin(n̄_r, P_{kr} + ρ), and ψ_1 and ψ̃_1 analogously with P_{ℓr} and P_{ℓr} + ρ. A direct computation shows that log((a + x)/(b − x)) is a (4/ε)-Lipschitz function of x on the relevant range. As a result, the perturbation by ρ changes the log-likelihood ratio by a controlled amount. Now we consider the tail bound of Y_{ik}. The only random variables on the RHS of the equation above are the b_{ir} for r ∈ [K]. Let ψ̃_0 be the probability mass function of (b_{ir}) ∈ Z_+^K. If z_i ≠ r, then b_{ir} follows a Poisson binomial distribution with at least n̄_r − nγ parameters equal to P_{kr}; if z_i = r, then n̄_r − nγ needs to be replaced by n̄_r − nγ − 1 in the previous sentence. Since γ ≤ 1/(2βK), we have n̄_r ≥ n/(βK) − n/(2βK) = n/(2βK). The proportion of parameters different from P_{kr} is therefore at most 2βKγ. Since ψ̃_0 is the joint probability mass function of a Poisson binomial distribution, by Lemma 8 we obtain the desired comparison bound.
Proof of Lemma 12. We have n_k^I ∼ Hypergeometric(⌊n/2⌋, n_k, n). For any fixed k ∈ [K], the concentration of the hypergeometric distribution [48] gives |n_k^I − n_k/2| ≤ nξ with probability at least 1 − 2 exp(−nξ²/3) when n is sufficiently large. Taking the union bound over all k ∈ [K] gives the desired result.

Proof of Theorem 3.2
Under the assumption (D*)² ≥ C_2 np* for sufficiently large C_2, by Lemma 4 we have √n σ̄_n α*(1 − α*) ≥ C_1; choosing C_2 sufficiently large, C_1 is also large enough. To simplify the notation, we assume np* ≥ C_1. We will analyze the algorithm step by step. Each step fails with some probability, and these failure probabilities will be summed up before calculating the error rate.
Spectral clustering and matching. Assuming D* := min_{k≠ℓ} D_{α*}(p_{k*} ∥ p_{ℓ*}) is sufficiently large, and letting r = 4, by Lemma 13 we have Mis(ẑ, z) ≤ C_3/D*; after relabeling we may take the optimal permutation to be the identity, that is, n Mis(ẑ, z) = Σ_{i=1}^n 1{ẑ_i ≠ z_i}. Now we consider the spectral clustering in the for loop. Using Lemma 12 with ξ = 1/(6βK), when n is sufficiently large,

n_k/3 ≤ n_k/2 − n/(6βK) ≤ n_k^I := |{i ∈ I : z_i = k}|.
For sufficiently large n, we have n/(4βK) ≤ n_k/4 ≤ n_k^I. A similar bound holds for n_k^J, i.e., n_k/4 ≤ n_k^J. This guarantees that the community sizes in each partitioned subgraph are sufficiently large. Hence for α ∈ (0, 1),

−Σ_{r=1}^K (n_r/4) log(P_{kr}^{1−α} P_{ℓr}^α + (1 − P_{kr})^{1−α}(1 − P_{ℓr})^α) = D_α(p_{k*} ∥ p_{ℓ*})/4.   (7.14)

Let α* = argmax_{α∈(0,1)} D_α(p_{k*} ∥ p_{ℓ*}). Then the output z̃_I of the first spectral clustering in step 8 satisfies Σ_{i∈I} 1{z_i ≠ π*(z̃_i)} ≤ |I|/(8βK) ≤ n/(16βK) when D* is sufficiently large, with probability at least 1 − (n/2 − 1)^{−4} ≥ 1 − (n/3)^{−4}. Now we consider the first matching algorithm in step 9, with π* = argmax_{π∈S_K} Σ_{i∈I} 1{z_i = π(z̃_i)}. On the other hand, since n_k^I ≥ n/(4βK), we must have |{i ∈ I : z̃_i = k}| ≥ n/(4βK) − n/(16βK) = 3n/(16βK). Hence for every k ∈ [K],

|{i ∈ I : π*(z̃_i) = z_i = k}| ≥ 3n/(16βK) − n/(16βK) = n/(8βK).

On the other hand, for any π ≠ π*,

Σ_{i∈I} 1{z_i ≠ π(z̃_i)} ≥ 2 · n/(8βK) = n/(4βK),

because at least two labels have been permuted and at least n/(8βK) nodes per label match z under the permutation π*. Then, by the triangle inequality of the Hamming distance, π* is the unique permutation such that Σ_{i∈I} 1{z_i ≠ π*(z̃_i)} ≤ n/(16βK). In other words, the matching algorithm succeeds in finding the optimal permutation between z̃_I and z_I. The second matching algorithm works similarly. Therefore, the updated z̃_I and z̃_J are consistent with z.
First estimated parameters. Thus we have the stated chain of inequalities, where the second and the third inequalities hold when D* is sufficiently large, and the last inequality is due to the fact that −x log x is increasing on [0, 1/e]. Therefore, with probability at most exp(−(n/2) p* h_1(τ_1)), the bound ‖P̂ − P‖_∞ ≤ C_5(8βKγ_1 + τ_1)p* fails, where C_5 corresponds to the constants in Lemma 6, and the last inequality holds for sufficiently large D*.
Second estimated parameters. As we have obtained labels z̃ with higher accuracy, we would like to update P̂ as well. The proof is similar to that for the first estimated parameters, but with τ_2 and γ_2 different from τ_1 and γ_1. Let τ_2 := 16βK(1 ∨ √(p* D*))/(np*). Since np* is sufficiently large and, by Lemma 10, √(D*)/(np*) ≲ 1/√(D*), τ_2 is arbitrarily small. Using h_1(t) ≥ t²/8 for t ∈ [0, 1] again, we obtain the analogous concentration bound.