A Simple Spectral Algorithm for Recovering Planted Partitions

In this paper, we consider the planted partition model, in which $n = ks$ vertices of a random graph are partitioned into $k$ "clusters," each of size $s$. Edges between vertices in the same cluster and in different clusters are included with constant probability $p$ and $q$, respectively (where $0 \le q < p \le 1$). We give an efficient algorithm that, with high probability, recovers the clusters as long as the cluster sizes are at least $\Omega(\sqrt{n})$. Informally, our algorithm constructs the projection operator onto the dominant $k$-dimensional eigenspace of the graph's adjacency matrix and uses it to recover one cluster at a time. To our knowledge, our algorithm is the first purely spectral algorithm which runs in polynomial time and works even when $s = \Theta(\sqrt n)$, though there have been several non-spectral algorithms which accomplish this. Our algorithm is also among the simplest of these spectral algorithms, and its proof of correctness illustrates the usefulness of the Cauchy integral formula in this domain.


Introduction and previous work
In the Erdős-Rényi random graph model [14], graphs G(n, p) on n vertices are generated by including each of the (n choose 2) possible edges independently at random with probability 0 < p < 1. A classical conjecture of Karp [25] states that there is no efficient algorithm for finding cliques of size (1 + ε) log_{1/p} n, though cliques of size at least 2 log_{1/p} n will almost surely exist [7].
Jerrum [24] and Kučera [27] introduced a potentially easier variant called the planted clique problem. In this model, one starts with a random graph, but additionally, edges are added deterministically to an unknown set of s vertices (known as the "plant") to make them form a clique. The goal then is to determine a.s. exactly which vertices belong to the planted clique, which should be easier when s becomes large.
When s = Ω( √ n log n), the clique can be found by simply taking the s vertices with the largest degrees [27]. This bound was improved using spectral methods to Ω( √ n) by Alon et al. [2] and then others [5,10,13,15,16,29]. These methods also handle a generalization of this problem in which edges within the plant are added merely with higher probability rather than deterministically. A more general version of the problem is to allow for planting multiple disjoint cliques, sometimes called a planted clustering. In the most basic version, known as the planted partition model (also called the stochastic block model), n nodes are partitioned into k disjoint clusters of size s = n/k, which are "planted" in a random graph. Two nodes u and v get an edge with probability p if they are in the same cluster and with probability q if they reside in different clusters (with p > q constant). The goal is now to recover the unknown clustering from the random graph generated according to the model, i.e., to determine exactly the vertices in each cluster a.s.
As in the planted clique case, a relatively simple algorithm can recover the clustering when the cluster sizes are Ω(√n log n): pairs of vertices with the most common neighbors can be placed in the same cluster [9]. However, when the cluster sizes are only required to be Ω(√n), the problem, as in the planted clique case, becomes more difficult because a simple application of the Azuma-Hoeffding inequality no longer suffices. Our main result is that recovery can, in fact, be done when the clusters have size s = Ω(√n): there exists a deterministic, polynomial-time algorithm which, for sufficiently large n, with probability 1 − o(1) correctly recovers planted partitions in which all clusters have size s ≥ c√n, where c is a constant depending only on p and q (given explicitly in Section 7.3). Note that in this paper we consider only the setting in which p and q are constant and all clusters are the same size s = n/k. We discuss more general settings in [11].
Our algorithm is, to our knowledge, the first purely spectral algorithm which runs in polynomial time and recovers the planted partition a.s. even when all clusters have size Θ(√n), though there have been several non-spectral algorithms which work in this setting [4,8,31]. In particular, the well-known spectral algorithms [29,34] require that k = o(√n) and hence do not work when all clusters have size Θ(√n) (though they work in considerably more general settings). On the other hand, Giesen and Mitsche's algorithm [20] works when all clusters have size Θ(√n) but has running time exponential in k. See Appendix A for a comparison with previous work. Efficient algorithms for planted clustering typically rely on either convex optimization [4,8,31] or spectral techniques [20,29,34]. The latter, including ours, often involve looking at the projection operator onto the vector space spanned by the k eigenvectors corresponding to the k largest eigenvalues of the adjacency matrix Â of the randomly generated graph Ĝ and showing that it is "not too far" from the projection operator of the expectation matrix E[Â] onto its own k largest eigenvalues. Our algorithm is among the simplest of these spectral algorithms: we don't randomly partition the vertices beforehand, and hence there is no messy "cleanup" step at the end.
A natural approach for identifying all the clusters would be to identify a single cluster, remove it, and recurse on the remaining vertices. This is hard to make work because the randomness of the instance Ĝ is "used up" in the first iteration, and subsequent iterations cannot then be handled independently of the first. Existing spectral approaches bypass these difficulties by randomly splitting the input graph into parts, thus forcing independence in the randomness on the parts [20,29,34]. This partitioning trick works at the cost of complicating the algorithm. We, however, are able to make the natural recursive approach work by "preprocessing the randomness": we show that certain (exponentially many) events all occur simultaneously with high probability, and as long as they all occur our algorithm definitely works.

Ω(√n) cluster size is generally accepted to be the barrier for efficient algorithms for "planted" problems. Evidence for the difficulty of beating the √n barrier dates back to Jerrum [24], who showed that a specific Markov chain approach will fail to find smaller cliques. Feige and Krauthgamer [15] showed that Lovász-Schrijver SDP relaxations run into the same barrier, while Feldman et al. [17] showed that all "statistical algorithms" also provably fail to efficiently find smaller cliques in a distributional version of the planted clique problem. Recently, Ailon et al. [1] were able to recover planted clusterings in which some of the cluster sizes are o(√n), but their algorithm's success depends on the simultaneous presence of clusters of size Ω(√n log² n).

Outline
In Section 2 we formally define the planted partition model. In Section 3 we present our algorithm for identifying the clusters, and we briefly discuss its running time in Section 3.1. We prove its correctness in Section 7. Sections 4-6 are dedicated to developing the linear algebra tools necessary for the proof: in Section 4 we introduce tools from random matrix theory which we use in Section 5 to characterize the eigenvalues of the (unknown) expectation matrix A and the randomly generated adjacency matrix Â. This, in turn, allows us to bound the difference of their projections in Section 6. Showing that the projection operators of A and Â are "close" is the key ingredient in our proof.

The planted partition problem
We now formally define the planted partition problem.
Definition 2 (Planted partition model). Let C = {C_1, ..., C_k} be a partition of the set [n] := {1, ..., n} into k sets of size s = n/k, called clusters (assume s | n). For constants 0 ≤ q < p ≤ 1, we define the planted partition model G(n, C, p, q) to be the probability space of graphs with vertex set [n], with edges ij (for i ≠ j) included independently with probability p if i and j are in the same cluster in C and probability q otherwise.
See Figure 1. Note that the case k = 1 gives the standard Erdős-Rényi model G(n, p) [14], and the case k = n gives G(n, q).
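For experimentation, the model of Definition 2 is straightforward to sample. The sketch below is illustrative only (the function and variable names are ours, and clusters are taken as consecutive blocks of vertices):

```python
import numpy as np

def planted_partition(n, k, p, q, seed=None):
    """Sample an adjacency matrix from G(n, C, p, q) with k equal clusters.

    Clusters are the consecutive blocks C_1 = {0,...,s-1}, C_2 = {s,...,2s-1},
    and so on. Illustrative helper; names are not from the paper.
    """
    assert n % k == 0, "cluster size s = n/k must be an integer"
    s = n // k
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), s)              # cluster label of each vertex
    same = labels[:, None] == labels[None, :]        # True iff same cluster
    probs = np.where(same, p, q)                     # entrywise edge probabilities
    upper = np.triu(rng.random((n, n)) < probs, 1)   # sample the strict upper triangle
    A_hat = (upper | upper.T).astype(float)          # symmetrize; diagonal stays zero
    return A_hat, labels
```

Setting k = 1 or k = n recovers samples from G(n, p) and G(n, q), respectively, matching the remark above.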
We will use the following notation for the main quantities considered in this paper.
• Â - the adjacency matrix of the random graph Ĝ drawn from G(n, C, p, q).
• A = (a_ij)_{i,j=1}^n := E[Â] + pI_n - the expectation of the adjacency matrix Â with p added to the diagonal (to make A a rank-k matrix and simplify the proofs).

In this paper we give an algorithm to recover the clusters which is based on the k largest eigenvalues of Â and the corresponding eigenspaces.

Graph and matrix notation
We will use the following notation throughout this paper:
• N_G(v) - the neighborhood of vertex v in a graph G. We will omit the subscript G when the meaning is clear.
• A[S] - the principal submatrix of A with row and column indices restricted to S.
• λ_i(A) - the ith largest eigenvalue of a symmetric matrix A (recall that symmetric matrices have real eigenvalues).
• λ_i(G) - the ith largest eigenvalue of G's adjacency matrix.
• P_k(A) - the orthogonal projection operator onto the subspace of R^n spanned by eigenvectors corresponding to the largest k eigenvalues of an n × n symmetric matrix A, represented in the standard basis for R^n.
• I_n - the n × n identity matrix.
• J_n - the n × n all-ones matrix.
• E[X] - the expectation of a random variable X. If X is matrix- or vector-valued, then the expectation is taken entrywise.

The cluster identification algorithm
The main result of this paper is that Algorithm 1 below recovers clusters of size c √ n:

2. Let P_k(Â) =: (p̂_ij)_{i,j∈V} be the orthogonal projection operator onto the subspace of R^n spanned by eigenvectors corresponding to the largest k eigenvalues of Â.
3. For each column j of P_k(Â), let p̂_{i_1 j} ≥ ... ≥ p̂_{i_{n−1} j} be the entries other than p̂_{jj}, in nonincreasing order, and let W_j := {j, i_1, ..., i_{s−1}}, i.e., the indices of the s − 1 greatest entries of column j of P_k(Â), along with j itself.
4. Let j* be a column maximizing ||P_k(Â)1_{W_j}||_2 over j ∈ V. It will be shown that W_{j*} has large intersection with a single cluster C_i ∈ C a.s.
5. Let C be the set of s vertices in Ĝ with the most neighbors in W_{j*}. It will be shown that C = C_i a.s.

The overview of Algorithm 1 is as follows. The algorithm gets a random graph Ĝ generated according to G(n, C, p, q). We first construct the projection operator which projects onto the subspace of R^n spanned by the eigenvectors corresponding to the largest k eigenvalues of Ĝ's adjacency matrix. This, we will argue, gives a fairly good approximation of at least one of the clusters, which we can then find and "fix up." Then we remove the cluster and repeat the algorithm.
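One iteration of the procedure above (steps 2-5) can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's formal Algorithm 1: the names are ours, and a full eigendecomposition stands in for the subspace-iteration methods discussed in Section 3.1.

```python
import numpy as np

def recover_one_cluster(A_hat, k, s):
    """Sketch of one iteration: return a candidate cluster of size s.

    A_hat is the adjacency matrix of the current (sub)graph, k the number
    of remaining clusters. Illustrative; names are not from the paper.
    """
    n = A_hat.shape[0]
    # Step 2: orthonormal basis U for the top-k eigenspace, so P_k = U U^T.
    _, eigvecs = np.linalg.eigh(A_hat)            # eigenvalues in ascending order
    U = eigvecs[:, -k:]
    P = U @ U.T
    # Step 3: W_j = {j} plus the indices of the s-1 largest off-diagonal
    # entries of column j of P.
    W = []
    for j in range(n):
        col = P[:, j].copy()
        col[j] = -np.inf                          # exclude the diagonal entry
        top = np.argpartition(col, -(s - 1))[-(s - 1):]
        W.append(np.append(top, j))
    # Step 4: pick the column whose indicator 1_{W_j} has the largest image
    # under the projector.
    norms = [np.linalg.norm(P[:, Wj].sum(axis=1)) for Wj in W]
    j_star = int(np.argmax(norms))
    # Step 5: the s vertices with the most neighbors in W_{j*}.
    neighbors_in_W = A_hat[:, W[j_star]].sum(axis=1)
    return set(np.argsort(-neighbors_in_W)[:s].tolist())
```

On a noiseless instance (p = 1, q = 0) one can check that the returned set is exactly one cluster; Section 7 proves the analogous statement a.s. for random instances with s ≥ c√n.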
Note that we ensure that Algorithm 1 works in every iteration a.s. by "preprocessing the randomness"; more precisely, we will show that a.s. certain events occur simultaneously on all (exponentially many) subgraphs of Ĝ induced on a subset of the clusters, and that as long as they all hold, Algorithm 1 will definitely succeed. See Section 7.

Running time
Let us analyze the running time of one iteration of Algorithm 1. Steps 2 and 4 are the most costly.
• In step 2, computing P_k(Â) can be done via classical subspace iteration methods in time O(n²k) [21,22]. Alternatively, one may utilize one of several recent randomized algorithms [22,23,26,30] which allow this to be done faster, e.g., in time O(n² log k) [23].
• Step 4 can be done naïvely in O(n³) time. However, this can be improved to O(n²k) by instead computing P_k(Â)Ĥ and taking the norm of each column, where Ĥ is defined as in Section 7.1. From step 2 we get an orthonormal decomposition of P_k(Â), i.e., an n × k matrix U with orthonormal columns such that UU^⊤ = P_k(Â). Thus, we can compute P_k(Â)Ĥ = UU^⊤Ĥ in O(n²k) time by first multiplying a k × n matrix and an n × n matrix, then an n × k matrix and a k × n matrix.
In theory, this step can be sped up further using a fast matrix multiplication algorithm [12,28], but such algorithms are rarely used in practice due to numerical instability and large constants hidden in their asymptotic running times.
Thus, each iteration of Algorithm 1 can be done in O(n²k) time. Since there are k iterations, the overall running time is O(n²k²). In particular, as k ≤ √n, this is O(n³).
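The O(n²k) computation of step 4 described above (forming P_k(Â)Ĥ as U(U^⊤Ĥ) and taking column norms) can be sketched as follows; the function name is ours:

```python
import numpy as np

def wj_column_norms(U, H_hat):
    """Compute ||P_k 1_{W_j}||_2 for every column 1_{W_j} of H_hat at once,
    where P_k = U U^T and U has orthonormal columns.

    Multiplying as U @ (U.T @ H_hat) costs O(n^2 k), instead of the O(n^3)
    needed to form the n x n projector explicitly. Illustrative helper.
    """
    return np.linalg.norm(U @ (U.T @ H_hat), axis=0)
```

Since U has orthonormal columns, ||UU^⊤x||_2 = ||U^⊤x||_2, so the final multiplication by U could even be skipped when only the norms are needed.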

Eigenvalues of random symmetric matrices
In Section 5 we will show that the eigenvalues of the random matrix Â are close to those of its expectation matrix A. To do so, we will need the following well-known result of Füredi and Komlós about the concentration of eigenvalues of random symmetric matrices [19, Theorem 2]:

Theorem 5. Let X = [x_ij] ∈ R^{n×n} be a random symmetric matrix where the x_ij are independent random variables for 1 ≤ i ≤ j ≤ n. Assume that there exist K, σ > 0, independent of n, such that E[x_ij] = 0, |x_ij| ≤ K, and E[x_ij²] ≤ σ² for all i, j. Then

max_i |λ_i(X)| ≤ 2(σ + 0.1K)√n    (4.1)

with probability ≥ 1 − n^{−10} for n ≥ n_0.
Note that the original paper by Füredi and Komlós assumes that E[x_ij²] = σ² for all i, j, which in turn makes the bound (4.1) tight. However, if all we need is the upper bound in (4.1), as is the case in this paper, then the proof in [19] goes through with E[x_ij²] ≤ σ². (Actually, it was pointed out by Vu [33] that the proof in [19] contains a minor mistake, so we follow the corrected proof in [33].) Unfortunately, the n^{−10} failure probability isn't small enough to guarantee our algorithm's success in every iteration, as we will need to apply Theorem 5 simultaneously to 2^{O(√n)} submatrices of Â (see Section 7.3); however, we may combine it with the following concentration result to get exponentially small failure probability [3, Theorem 1]:

Theorem 6. Let X = [x_ij] ∈ R^{n×n} be a random symmetric matrix whose entries x_ij, 1 ≤ i ≤ j ≤ n, are independent random variables with |x_ij| ≤ 1. Then for every 1 ≤ j ≤ n, the probability that λ_j(X) deviates from its median by more than t is at most 4e^{−t²/32j²}.
Combining Theorems 5 and 6, we get the following:

Theorem 7. Under the assumptions of Theorem 5, there is a constant c_0 = c_0(σ, K) such that max_i |λ_i(X)| ≤ c_0 √n with probability ≥ 1 − e^{−n} for n ≥ n_0.

Proof. By Theorem 5, for n ≥ n_0 we have Pr[λ_1(X) ≥ 2(σ + 0.1K)√n] ≤ n^{−10}. Let λ be the median of the random variable λ_1(X). We claim that

|λ| ≤ 2(σ + 0.1K)√n.    (4.3)

Indeed, the upper bound on λ follows from the tail bound above and the definition of the median. For the lower bound, consider the random matrix −X. It satisfies the assumptions of Theorem 5, so Pr[max_i |λ_i(−X)| ≥ 2(σ + 0.1K)√n] ≤ n^{−10}. As λ_n(−X) = −λ_1(X), this in particular bounds Pr[λ_1(X) ≤ −2(σ + 0.1K)√n], and hence (4.3) follows by the definition of the median. We are now ready to apply Theorem 6. Let Y := (1/K)X, so that each entry of Y lies in [−1, 1]. Clearly the median of the random variable λ_1(Y) is λ/K, so (4.3) and Theorem 6 bound the upper tail of λ_1(X). Similarly, we may apply the entire argument above to −X to bound the upper tail of λ_1(−X). Noting that max_i |λ_i(X)| is either λ_1(X) or −λ_n(X) = λ_1(−X), we get the claimed bound.
Note that there have been some recent results which give tight bounds on the spectra of more general random matrices [6], but the above are sufficient for our purposes.
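As an illustration of the scale in Theorem 5 (not part of any proof in this paper), one can sample a symmetric matrix with independent zero-mean ±1 entries, so that σ = K = 1, and check that its spectral norm stays below the 2(σ + 0.1K)√n bound:

```python
import numpy as np

def noise_spectral_norm(n, seed=0):
    """Largest |eigenvalue| of a symmetric matrix with independent,
    zero-mean +/-1 entries (sigma = K = 1 in Theorem 5's notation).
    Illustrative empirical check only; names are ours."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.choice([-1.0, 1.0], size=(n, n)))
    X = upper + np.triu(upper, 1).T   # reflect the strict upper triangle
    return float(np.max(np.abs(np.linalg.eigvalsh(X))))
```

By the semicircle law the value concentrates near 2√n, so the bound 2(σ + 0.1K)√n = 2.2√n leaves only a small margin.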
Finally, we will need the following fact from linear algebra (Weyl's inequality): if X, Y ∈ R^{n×n} are symmetric matrices, then |λ_i(X) − λ_i(Y)| ≤ ||X − Y||_2 for i = 1, ..., n. See, e.g., [18, Theorem 4.4.6] for a proof.

Eigenvalues of A and Â
The goal of this section is to prove a separation of the first k eigenvalues of both A and Â from the remaining n − k. We begin by examining the eigenvalues of A.
The following lemma is easily verified: the eigenvalues of A are qn + (p − q)s with multiplicity 1, (p − q)s with multiplicity k − 1, and 0 with multiplicity n − k. So we see that the smallest positive eigenvalue is proportional to the size of the clusters. We continue by bounding the spectral norm of Â − A (recall that the spectral norm of a symmetric matrix X ∈ R^{n×n} is ||X||_2 = max_{i=1,...,n} |λ_i(X)|; see [18, Corollary 4.11.13]).

Lemma 10. For sufficiently large n, ||Â − A||_2 ≤ 8√n with probability > 1 − e^{−n}.

Proof. Let X := Â − A = (x_ij), let σ_ij be the standard deviation of x_ij, and let σ ≥ σ_ij for all i, j ∈ [n]. The entries x_ij are centered Bernoulli variables, so X satisfies the conditions of Theorem 7 with K = 1 and σ ≤ 1/2. Thus ||Â − A||_2 ≤ 8√n with probability > 1 − e^{−n} for n > n_0.
We can now use the lemma above to characterize the eigenvalues of Â (and A) as follows. The lemma thus follows by the definition of c′.
Note that the upper bound of n follows from the fact that for any X = (x_ij) ∈ R^{n×n} we have λ_1(X) ≤ max_i Σ_j |x_ij|.
Lemma 11 shows that a.s. we have a separation between the largest k eigenvalues and the remaining eigenvalues of both A and Â, provided that c′ > 8, or equivalently that c is sufficiently large. We will assume this is the case from now on.
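The eigenvalue structure of A used above can be checked numerically: with equal clusters, A = (p − q)H + qJ_n, where H is the "true" cluster matrix, so its spectrum consists of qn + (p − q)s once, (p − q)s with multiplicity k − 1, and zeros. A small illustrative check, with our own variable names:

```python
import numpy as np

# Small instance: n = 12 vertices, k = 3 clusters of size s = 4.
n, k, p, q = 12, 3, 0.8, 0.2
s = n // k
labels = np.repeat(np.arange(k), s)
H = (labels[:, None] == labels[None, :]).astype(float)  # cluster matrix
A = (p - q) * H + q * np.ones((n, n))                   # A = E[A_hat] + p*I_n
eigs = np.sort(np.linalg.eigvalsh(A))[::-1]             # descending order
# Expected spectrum: qn + (p-q)s, then (p-q)s repeated k-1 times, then zeros;
# in particular the smallest positive eigenvalue is proportional to s.
```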
Deviations between the projectors P_k(Â) and P_k(A)
In this section, we will prove bounds on ||P_k(Â) − P_k(A)||_2 and ||P_k(Â) − P_k(A)||_F, where ||·||_2 and ||·||_F are the spectral and the Frobenius matrix norms, respectively. The following lemma characterizes P_k(A):

Lemma 12. P_k(A) = (1/s)H,    (6.1)

where H ∈ {0, 1}^{n×n} is the "true" cluster matrix whose (i, j)th entry is 1 if and only if i and j are in the same cluster.
Proof. Let u_i := (1/√s) 1_{C_i} ∈ R^n for i = 1, ..., k, and let U be the subspace of R^n spanned by eigenvectors corresponding to λ_1(A), ..., λ_k(A). It is easily verified that u_1, ..., u_k are an orthonormal basis for U. Thus, letting P_U denote the orthogonal projection operator onto U, we get P_k(A) = P_U = Σ_{i=1}^k u_i u_i^⊤ = (1/s)H. If we assume C_1 = {1, ..., s}, C_2 = {s + 1, ..., 2s}, ..., C_k = {n − s + 1, ..., n} as in Section 5, then P_k(A), represented in the standard basis for R^n, is the block-diagonal matrix with k diagonal blocks equal to (1/s)J_s. So we see that the columns of P_k(A) are essentially the indicator vectors of the unknown clusters C_1, ..., C_k. The central idea behind Algorithm 1 is that if ||P_k(A) − P_k(Â)||_F is sufficiently small, then some column of P_k(Â) is a good approximation to the corresponding column of P_k(A) and can thus be used to recover the corresponding cluster.
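The identity P_k(A) = (1/s)H can likewise be verified numerically on a small instance; this is an illustrative sketch with our own names:

```python
import numpy as np

def top_k_projector(M, k):
    """Orthogonal projector onto the span of eigenvectors for the k
    largest eigenvalues of the symmetric matrix M."""
    _, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    U = vecs[:, -k:]
    return U @ U.T

n, k, p, q = 20, 4, 0.7, 0.3
s = n // k
labels = np.repeat(np.arange(k), s)
H = (labels[:, None] == labels[None, :]).astype(float)  # "true" cluster matrix
A = (p - q) * H + q * np.ones((n, n))                   # expectation matrix A
# Lemma 12: the projector equals H / s, a block-diagonal matrix of (1/s)J_s.
```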

The Cauchy integral formula for projections
To prove such bounds on ||P_k(A) − P_k(Â)||_2 and ||P_k(A) − P_k(Â)||_F we will employ the Cauchy integral formula. Similar applications of the Cauchy integral formula are studied in matrix perturbation theory [21,32] and could be adapted to obtain our bound on ||P_k(A) − P_k(Â)||_2, but we include the full proof for the sake of exposition.
Recall that an analytic function f : C → C can be extended to a function of matrices via its Taylor series [18]. In particular, if Z is diagonalizable as Z = PDP^{−1} (as is any symmetric matrix), then f(Z) = P f(D) P^{−1}, where f(D) is evaluated by simply applying f to each diagonal entry. Accordingly, we also get an extension of the Cauchy integral formula to matrices [18, Theorem 3.4.2]:

Theorem 13. Let Ω be an open set in C. Assume that Γ is a finite set of disjoint simple, closed curves such that Γ is the boundary of an open set D, and Γ ∪ D ⊂ Ω. Assume that Z ∈ C^{n×n} and λ_i(Z) ∈ D for i = 1, ..., n. Then for any φ : C → C analytic on Ω,

φ(Z) = (1/2πi) ∮_Γ φ(z)(zI_n − Z)^{−1} dz.

We get the following as a corollary (Theorem 14).

A bound on ||P_k(Â) − P_k(A)||_2
As P_k(Â) and P_k(A) are orthogonal projection operators, we have ||P_k(Â) − P_k(A)||_2 ≤ 1. In fact, we can make this difference arbitrarily small by increasing the cluster size appropriately, as shown in the following lemma.
Lemma 15. Assume Â satisfies (5.1) and s ≥ c√n. Then ||P_k(Â) − P_k(A)||_2 ≤ ε if c is sufficiently large.
Proof. Define γ to be a square in the complex plane with side length 2M, where M ≫ n. Its sides are parallel to the x- and y-axes, and the center of the square lies on the x-axis. The left and right sides of the square are on the lines x = x_0 := (c′ + 8)√n / 2 and x = x_0 + 2M, respectively, where c′ is defined in (5.3). The upper and lower sides of the square are on the lines y = ±M. Note that by Lemma 11, the interior of γ contains the k largest eigenvalues of A and Â, and the exterior of γ contains the other n − k eigenvalues of A and Â (see Figure 3). To get our estimate (6.2) we will let M → ∞.
Applying Theorem 14 to both Â and A and subtracting, we get the contour-integral representation (6.3) of P_k(Â) − P_k(A). Observe that for each z ∈ C the matrices zI_n − Â and zI_n − A are normal; hence ||(zI_n − Â)^{−1}||_2 = 1/min_i |z − λ_i(Â)|, and similarly for A. Let us first estimate the contribution to the integral (6.3) on the left side of γ. Let z = x_0 + yi, y ∈ R, i.e., z lies on the line x = x_0. Then by Lemma 11, every eigenvalue of Â and of A is at distance Ω(√n) from z. Also recall from (5.1) that ||Â − A||_2 ≤ 8√n. Hence for z = x_0 + yi one obtains the corresponding estimate on the integrand.
Next we estimate the contribution of the integral (6.3) on the other three sides. Consider first the side on the line y = M. Since the eigenvalues of Â and A are real, every point on this side is at distance at least M from the spectra of both matrices, and it follows that the contribution of (6.3) on the upper side of the square is bounded above by 8M√n/(2πM²) = 4√n/(πM). The same upper estimate holds for the lower side of the square on the line y = −M. We now estimate from above the contribution of (6.3) on the right side of the square, on the line x = x_0 + 2M. Since the eigenvalues of Â and A are real and at most n, every point on this side is at distance at least 2M − n from the spectra, so the contribution of (6.3) on the right-hand side of the square is bounded above by 4M√n/(π(2M − n)²). Letting M → ∞, the contributions of these three sides vanish, completing the proof.

A bound on ||P_k(Â) − P_k(A)||_F
Now we estimate the Frobenius norm of P_k(Â) − P_k(A). Recall that for any matrix B = (b_ij) ∈ R^{m×n}, ||B||_F² = Σ_{i,j} b_ij². Moreover, if B is symmetric, then ||B||_F² = Σ_i λ_i(B)². (6.5) Therefore we obtain the following lemma:

Lemma 16. ||P_k(Â) − P_k(A)||_F ≤ √(2k) ||P_k(Â) − P_k(A)||_2.

Proof. Recall that P_k(Â) and P_k(A) have rank k. Hence P_k(Â) − P_k(A) has rank at most 2k, and so has at most 2k nonzero eigenvalues. The lemma thus follows from (6.5).
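The rank argument above (P_k(Â) − P_k(A) has rank at most 2k, so its Frobenius norm is at most √(2k) times its spectral norm) can be observed directly on a sampled instance; the sketch below uses our own names:

```python
import numpy as np

def top_k_projector(M, k):
    """Projector onto the top-k eigenspace of the symmetric matrix M."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -k:] @ vecs[:, -k:].T

rng = np.random.default_rng(1)
n, k, p, q = 100, 2, 0.9, 0.1
s = n // k
labels = np.repeat(np.arange(k), s)
H = (labels[:, None] == labels[None, :]).astype(float)  # true cluster matrix
probs = np.where(H == 1, p, q)
upper = np.triu(rng.random((n, n)) < probs, 1)
A_hat = (upper | upper.T).astype(float)                 # sampled adjacency matrix
D = top_k_projector(A_hat, k) - H / s                   # P_k(A_hat) - P_k(A)
# D is a difference of two rank-k projectors, hence has rank at most 2k.
```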

Proof of algorithm's correctness
The proof of Algorithm 1's correctness goes roughly as follows. We will prove using the spectral analysis in Sections 5-6 that a.s. there is a column j for which ||P_k(Â)1_{W_j}||_2 is "large" (Lemma 17). Next, we will show that for any such j, the set W_j consists mostly of vertices from a single cluster (Lemma 18). Finally, we show how to recover this cluster exactly by looking at how many neighbors each vertex has in W_j (Lemmas 19-20). This argument shows that Algorithm 1 succeeds in iteration 1 a.s. To show that it succeeds in every iteration, we will apply the same argument to all "cluster subgraphs" of Ĝ, i.e., those subgraphs induced on a subset of the clusters. We will prove that all 2^k such subgraphs have certain desirable properties a.s., in which case our algorithm deterministically succeeds in identifying a cluster. Therefore, when we remove it we are considering another cluster subgraph, so the algorithm again succeeds, and so on. Thus, we are able to restrict our analysis to these cluster subgraphs, bounding the number of events that need to occur in order to ensure the algorithm's success. This is how we avoid the need to randomly split the graph into parts, as in [20,29,34]. The details of this approach, which we call "preprocessing the randomness," are presented in Section 7.3.

Notation
We will use the following notation in our proof:
• H = (h_ij)_{i,j=1}^n - the "true" cluster matrix as defined in (6.1), i.e., h_ij = 1 if i and j are in the same cluster, and 0 otherwise.
• Ĥ = (ĥ_ij)_{i,j=1}^n := (1_{W_1}, ..., 1_{W_n}) - the "estimated" cluster matrix. The idea is that at least one column of Ĥ will be a good approximation of the corresponding column of H, and we will give a way to find such a column. Note that each column of Ĥ has exactly s 1s and that Ĥ need not be symmetric.

Technical lemmas
The proof of Theorem 4 relies on several additional lemmas. Lemmas 17-20 fit together roughly as follows:
• Lemma 17 says that a.s. there is a column j for which ||P_k(Â)1_{W_j}||_2 is large.
• Lemma 18 says that for such a column j, W_j consists mostly of vertices from a single cluster C_i.
• Lemmas 19 and 20 say that a.s. vertices in C_i will have many neighbors in W_j, while vertices outside C_i will have relatively few neighbors in W_j; hence, we can recover C_i by taking the s vertices with the most neighbors in W_j.
Recall also from (6.1) that P_k(A) = (1/s)H. Therefore, the triangle inequality and an averaging argument yield a column j* satisfying (7.1). Finally, note that by Lemma 15 we have ||P_k(Â) − P_k(A)||_2 ≤ ε, so the triangle inequality yields the desired result.

Lemma 18. Assume Â satisfies (5.1) and j satisfies (7.1).
where the minimum is taken over all W ⊆ [n] such that |W| = s and W satisfies (7.3). We will argue a chain of inequalities which proves the lemma. The first inequality is easy: for W = W_j, it follows from the triangle inequality.
We will argue that W must have a special structure: namely, it is split between two clusters. By relabeling the clusters, we may assume without loss of generality that t_1(W) ≥ t_2(W) ≥ ... ≥ t_k(W). It is easy to show that the maximum of Σ_{i=1}^k t_i(W)² occurs when x_1 = x_2 = s/2 and x_3 = ... = x_k = 0; hence Σ_{i=1}^k t_i(W)² ≤ s²/2. By (7.3), choosing ε sufficiently small (ε ≤ 0.1 works), we get a contradiction.
In particular, C_3 ∩ W and C_2 \ W are both nonempty. Now construct W̃ from W by replacing a vertex from C_3 ∩ W with one from C_2 \ W. Clearly |W̃| = s, and t(W̃) = τ since only t_2 increases and t_2(W) < t_1(W) = τ. But then ||P_k(A)1_{W̃}||_2 > ||P_k(A)1_W||_2, contradicting the maximality of ||P_k(A)1_W||_2.
Thus, W is split between two clusters C_1 and C_2; i.e., W = U ∪ V, where U := W ∩ C_1 and V := W ∩ C_2. So (7.3) gives an inequality in |U|; solving it yields the claimed bound on |U|, provided ε is small enough (again ε ≤ 0.1 suffices). This completes the proof.
Lemma 19. Consider a cluster C_i and a vertex j ∈ [n]. If j ∈ C_i, then (7.4) holds with probability ≥ 1 − e^{−ε²s}, and if j ∉ C_i, then (7.5) holds with probability ≥ 1 − e^{−ε²s}.

Figure 4: If W has large overlap with C_i, then a.s. vertices in C_i will have many neighbors in W, while vertices not in C_i will have relatively few neighbors in W.

Part b) follows by a similar argument.
This lemma gives us a way to differentiate between vertices j ∈ C_i and vertices j ∉ C_i, as shown in Figure 4, provided p − 4ε ≥ q + 4ε. (7.6)

Main proof
To prove Theorem 4, we will define certain (exponentially many) events on the probability space G(n, C, p, q) and show that:
1. As long as they all occur, Algorithm 1 definitely succeeds.
2. They all occur simultaneously a.s.

Therefore, Algorithm 1 succeeds a.s. Before we define the events, let us introduce some notation:
• For J ⊆ [k], define Ĝ_J to be the subgraph of Ĝ induced by the clusters C_i, i ∈ J, i.e., Ĝ_J := Ĝ[⋃_{i∈J} C_i]. Then for any fixed J we have Ĝ_J ∼ G(|J|s, {C_i : i ∈ J}, p, q). (7.7)
• For an n × n matrix B, define B_J to be the principal submatrix of B with row and column indices in the clusters C_i, i ∈ J, i.e., B_J := B[⋃_{i∈J} C_i].
We will refer to these subgraphs and submatrices as cluster subgraphs and cluster submatrices. Now we define two types of events in G(n, C, p, q):
• Spectral events: for J ⊆ [k], let E_J be the event that ||Â_J − A_J||_2 ≤ 8√(|J|s).
• Degree events: for 1 ≤ i ≤ k, 1 ≤ j ≤ n, let D_{i,j} be the event that |N_Ĝ(j) ∩ C_i| ≥ (p − ε)s if j ∈ C_i, or the event that |N_Ĝ(j) ∩ C_i| ≤ (q + ε)s if j ∉ C_i.

Thus, we have defined a total of 2^k + nk events. Essentially, these are the events that every Ĝ_J satisfies (5.1) and that (7.4) and (7.5) are satisfied for all i ∈ [k], j ∈ [n]. Note that the events are well-defined, as their definitions depend only on the underlying probability space G(n, C, p, q) and not on the random graph Ĝ sampled from the space. Now we are finally ready to prove the theorem.

Proof of Theorem 4. Assume E_J and D_{i,j} hold for all J ⊆ [k], i ∈ [k], j ∈ [n]. We will prove by induction that Algorithm 1 succeeds in every iteration. For the base case, take the original graph Ĝ = Ĝ_{[k]} considered in the first iteration. Since E_{[k]} is assumed to hold, (5.1) is satisfied. Thus, by Lemma 17, the column j = j* identified in step 4 satisfies (7.1). Then by Lemma 18 we have |W_{j*} ∩ C_i| ≥ (1 − 3ε)s for some i ∈ [k]. Finally, since D_{i,j} is assumed to hold for all j ∈ [n], step 5 correctly identifies C = C_i by Lemma 20. Now assume Algorithm 1 succeeds in the first t iterations, i.e., it correctly identifies a cluster and removes it in each of these iterations. Then the graph considered in the (t + 1)st iteration is a cluster subgraph Ĝ_J for some J ⊆ [k], |J| = k − t. Note that Ĝ_J has |J|s = (k − t)s vertices. Now we apply Lemmas 11-18 with Â_J instead of Â, A_J instead of A, k − t instead of k, and (k − t)s instead of n.
Since E_J is assumed to hold, by Lemma 17 the column j = j* identified in step 4 of Algorithm 1 satisfies ||P_{k−t}(Â_J)1_{W_j}||_2 ≥ (1 − 8ε² − ε)√s. Note that Ĥ and the W_j (Sections 7.1-7.2) are constructed from Ĝ_J, not the original graph Ĝ. Now by Lemma 18 we have |W_{j*} ∩ C_i| ≥ (1 − 3ε)s for some i ∈ J. Finally, since D_{i,j} is assumed to hold for all j ∈ [n], step 5 once again correctly identifies C = C_i by Lemma 20.
We have thus proved that Algorithm 1 succeeds as long as E_J and D_{i,j} hold for all J ⊆ [k], i ∈ [k], j ∈ [n]. Now, for any fixed nonempty J ⊆ [k] we have Ĝ_J ∼ G(|J|s, {C_i : i ∈ J}, p, q), so by Lemma 10, Pr[E_J] ≥ 1 − e^{−|J|s} ≥ 1 − e^{−s}.
Taking a union bound over all J, i, j, the probability that all E_J and D_{i,j} hold is ≥ 1 − 2^k e^{−s} − nk e^{−ε²s}. Therefore, as ε is constant and k ≤ √n ≤ s, Algorithm 1 succeeds with probability ≥ 1 − e^{−Ω(√n)} = 1 − o(1).
Note that we require (7.6) in order for step 5 of Algorithm 1 to correctly recover a cluster according to Lemma 20. In addition, the proof of Lemma 18 requires ε ≤ 0.1. By (6.4), we can satisfy both of these conditions by setting c := max{88/(p − q), 72/(p − q)²}.
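The constant c := max{88/(p − q), 72/(p − q)²} above can be evaluated directly; the helper name below is ours:

```python
def cluster_size_constant(p, q):
    """The constant c from the end of Section 7: cluster sizes s >= c*sqrt(n)
    suffice for recovery. Illustrative helper; the name is not from the paper."""
    assert 0 <= q < p <= 1
    return max(88 / (p - q), 72 / (p - q) ** 2)
```

For example, with p = 0.9 and q = 0.1 the quadratic term dominates and c = 72/0.64 = 112.5.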