Statistical Convergence of the EM Algorithm on Gaussian Mixture Models

We study the convergence behavior of the Expectation Maximization (EM) algorithm on Gaussian mixture models with an arbitrary number of mixture components and mixing weights. We show that as long as the means of the components are separated by at least $\Omega(\sqrt{\min\{M,d\}})$, where $M$ is the number of components and $d$ is the dimension, the EM algorithm converges locally to the global optimum of the log-likelihood. Further, we show that the convergence rate is linear and characterize the size of the basin of attraction to the global optimum.


Introduction
The EM algorithm [12] is an instrumental tool for computing the maximum likelihood estimator (MLE) of latent variable models. It is a Majorization-Minimization (MM) algorithm that minimizes a surrogate objective function to avoid evaluating the intractable marginal (negative) log-likelihood of the latent variable model. However, as shown by Wu [29] and Tseng [25], the EM algorithm may not converge to a global minimizer of the negative log-likelihood. Instead, it may converge to a local minimizer or a stationary point. For a method that aims to compute the MLE, this is somewhat disappointing.
There is a recent line of research showing that the EM algorithm, when initialized in a neighborhood of the data generating parameters, converges to the global minimizer. Unfortunately, this line of work does not encompass the EM algorithm for fitting mixtures of more than two Gaussians. This paper fills this gap in the literature by providing conditions under which the EM algorithm for fitting Gaussian mixture models with an arbitrary number of well-separated components and arbitrary mixing weights converges to the global minimizer. We show that the EM algorithm converges linearly as long as it is initialized in a neighborhood of the true centers. Our results are of the same flavor as those by Yan, Yin and Sarkar [31] and Balakrishnan, Wainwright and Yu [3], and our proofs follow the same general route. This paper is organized as follows. The rest of this section briefly reviews related work on the EM algorithm. Section 2 describes the EM algorithm for fitting mixtures of Gaussians and introduces a population version of the algorithm that appears in our study. Section 3 states our main results on the convergence of EM. In Section 4, we present simulation results that validate some of our theoretical results. In Section 5, we prove the main results. Finally, in Section 6, we discuss our results and compare them to similar results in the literature.

Related work
Most closely related to our work is the line of work on the convergence of the EM algorithm for fitting Gaussian mixture models (GMM). On a mixture of two equally-weighted Gaussians, Balakrishnan, Wainwright and Yu [3] first derived statistical convergence results by specializing the general framework they proposed to this model. Their framework also applies to a variant of the EM algorithm known as gradient EM, and they used their framework to obtain similar results for gradient EM. Klusowski and Brinda [20] later obtained results of a similar flavor, but showed that there is a larger neighborhood of the true centers within which the EM algorithm converges. Finally, Xu, Hsu and Maleki [30] and Daskalakis, Tzamos and Zampetakis [11] completely characterized the global convergence behavior of the EM algorithm for fitting two equally weighted Gaussians. When there are more than two components, a result by Jin et al. [17] showed that bad local minima exist even in the idealized case of equally weighted mixtures of well-separated spherical Gaussians. Yan, Yin and Sarkar [31] proved local convergence results for the gradient EM algorithm for fitting mixtures of an arbitrary number of well-separated spherical Gaussians. Despite the recent progress, we are not aware of any results that characterize the local convergence behavior of the EM algorithm itself on mixtures of more than two Gaussians. For variants of the EM algorithm for fitting high-dimensional mixture models, we refer readers to Dasgupta and Schulman [10], Cai, Ma and Zhang [6], Wang et al. [28], and Yi and Caramanis [32].
There is also a large body of work on other methods for learning mixtures of Gaussians and, more generally, finite mixtures. One major line of work [9,26,1,2,19,5] is based on dimension reduction techniques (such as spectral embeddings). Like the EM algorithm, these methods require the centers of the mixture components to be well-separated. Another more recent line of work [4,18,23,16,14] employs the method of moments and allows the centers of the mixture components to be arbitrarily close (as long as the sample size is large enough). Other important algorithms and theoretical works include Brubaker and Vempala [5], Chaudhuri and Rao [7], Chaudhuri et al. [8], and Lu and Zhou [21]. For statistical properties such as the rate of convergence of the MLE or the rate of convergence of the estimated mixing distribution, we refer readers to Ghosal and van der Vaart [13], Nguyen [24], Heinrich and Kahn [15], and the references therein.

The EM algorithm on Gaussian mixture models
We consider Gaussian mixture models with known mixture weights and known common covariance structure. Formally, suppose there are $M$ isotropic Gaussian distributions, $N(\mu_1^*, I_d), \ldots, N(\mu_M^*, I_d)$, and mixture weights $\pi_1, \ldots, \pi_M \ge 0$ summing to one. The Gaussian mixture model we consider is the set of densities
$$p(x; \mu) = \sum_{i=1}^M \pi_i \phi(x - \mu_i), \tag{1}$$
where $\phi(z) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{1}{2}\|z\|_2^2}$ is the standard Gaussian density in $\mathbb{R}^d$. The assumption that the components are isotropic leads to no loss of generality as long as (i) the mixture components share a common covariance structure and (ii) this structure is known. Without loss of generality, we also assume the mixture is centered; i.e., it has mean zero. The task of fitting (1) boils down to estimating $\mu^* = (\mu_1^*, \ldots, \mu_M^*)$ from observations $X_1, \ldots, X_n \overset{\text{iid}}{\sim} p(\cdot; \mu^*)$.
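As an illustration, the data generating process just described can be sampled in a few lines. This is a sketch with arbitrary example centers and weights, not code from the paper:

```python
import numpy as np

def sample_gmm(n, centers, weights, rng=None):
    """Draw n samples from the mixture sum_i pi_i * N(mu_i, I_d).

    `centers` is an (M, d) array of component means mu_i and `weights`
    an (M,) array of mixing weights pi_i (both known, as in the model).
    Returns the samples and the latent component labels.
    """
    rng = np.random.default_rng(rng)
    labels = rng.choice(len(weights), size=n, p=weights)  # latent labels Z_j
    noise = rng.standard_normal((n, centers.shape[1]))    # N(0, I_d) noise
    return centers[labels] + noise, labels

# Example: M = 2 components in R^3 with a zero mixture mean.
mu = np.array([[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0]])
X, Z = sample_gmm(5000, mu, np.array([0.5, 0.5]), rng=0)
```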
The EM algorithm for fitting a Gaussian mixture model alternates between evaluating the posterior probabilities of the labels (E-step) and updating the estimates of the parameters (M-step). We combine the two steps to arrive at the update rule
$$\mu_i^+ = \frac{\sum_{j=1}^n w_i(X_j; \mu) X_j}{\sum_{j=1}^n w_i(X_j; \mu)}, \quad i \in [M], \tag{2}$$
where the weights $w_i(X_j; \mu)$ are defined as
$$w_i(X; \mu) = \frac{\pi_i \exp(-\frac{1}{2}\|X - \mu_i\|_2^2)}{\sum_{k=1}^M \pi_k \exp(-\frac{1}{2}\|X - \mu_k\|_2^2)}. \tag{3}$$
We see that $w_i(X; \mu)$ is the posterior probability that $X$ comes from component $i$ and $\mu_i^+$ is a weighted average of the samples, where the weights are the $w_i(X_j; \mu)$'s. For this reason, the EM algorithm is known as a soft version of the K-means algorithm. In our analysis, we work with a population version of the EM algorithm. Its update rule is
$$\mu_i^+ = \frac{\mathbb{E}[w_i(X; \mu) X]}{\mathbb{E}[w_i(X; \mu)]}, \quad i \in [M]. \tag{4}$$
We emphasize that the expectations in (4) are with respect to the true data generating process. We first derive convergence results for this population version of the EM algorithm and then extend them to (the sample version of) the EM algorithm using concentration results.

Notation: We summarize the notation used in this paper: $d$ is the dimension of the observations $X_1, \ldots, X_n$ and $M$ is the number of mixture components. Define $R_{\max}$ and $R_{\min}$ as the largest and smallest distances between the centers of any pair of mixture components:
$$R_{\max} = \max_{i \neq j} \|\mu_i^* - \mu_j^*\|_2, \qquad R_{\min} = \min_{i \neq j} \|\mu_i^* - \mu_j^*\|_2.$$
Define $\kappa$ as the smallest mixture weight: $\kappa = \min\{\pi_1, \ldots, \pi_M\}$. Given two positive sequences $\{a_n\}, \{b_n\}$, $a_n = O(b_n)$ means there exists an absolute constant $C$ such that $a_n \le C b_n$ for all $n$; $a_n = \Omega(b_n)$ means there exists an absolute constant $C$ such that $a_n \ge C b_n$ for all $n$. We write $a_n = \Theta(b_n)$ if $a_n = O(b_n)$ and $a_n = \Omega(b_n)$; we write $a_n = o(b_n)$ if $a_n / b_n \to 0$ as $n \to \infty$.
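For concreteness, the combined E/M update (2)-(3) can be sketched as a minimal NumPy routine. The log-sum-exp stabilization is a floating-point precaution, not part of the formulas:

```python
import numpy as np

def em_step(X, mu, weights):
    """One EM update for an isotropic GMM with known mixing weights.

    E-step: w[j, i] is the posterior probability that X_j came from
    component i, as in (3). M-step: each new center is the w_i-weighted
    average of the samples, as in (2).
    """
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (n, M)
    logw = np.log(weights)[None, :] - 0.5 * sq
    logw -= logw.max(axis=1, keepdims=True)    # stabilize before exp
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)          # posteriors w_i(X_j; mu)
    return (w.T @ X) / w.sum(axis=0)[:, None]  # weighted averages mu_i^+
```

Iterating `em_step` from a warm start near $\mu^*$ is the sample version of the algorithm; replacing the sample averages with expectations gives the population update (4).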

Statement of the main results
In this section, we state our main results for the convergence of the EM algorithm on Gaussian mixture models. First, we show that the population version of the EM algorithm converges linearly to µ * as long as (i) certain signal strength conditions are met and (ii) the algorithm is initialized in a neighborhood of µ * . We also characterize the size of this neighborhood in terms of the properties of the data generating process.
Then for any iterate $\mu$ satisfying $\max_{i \in [M]} \|\mu_i - \mu_i^*\|_2 \le a$, the next iterate $\mu^+$ given by (4) satisfies
$$\max_{i \in [M]} \|\mu_i^+ - \mu_i^*\|_2 \le \zeta \max_{i \in [M]} \|\mu_i - \mu_i^*\|_2.$$
We remark that Theorem 3.1 does not imply cluster-wise convergence; i.e., it does not imply $\|\mu_i^+ - \mu_i^*\|_2 \le \zeta \|\mu_i - \mu_i^*\|_2$ for each $i \in [M]$. This is hardly surprising: if we initialize the EM algorithm at $(\mu_1^*, \mu_2, \ldots, \mu_M)$, where $\mu_2, \ldots, \mu_M$ are arbitrary points in $\mathbb{R}^d$, there is no guarantee that $\mu_1^+$ remains at $\mu_1^*$. Recall the mixture components are isotropic, so $R_{\min}$ is the signal strength. Theorem 3.1 requires the signal strength to grow as $\Omega(\min\{M, d\}^{1/2})$. Theorem 3.1 also shows that the contraction coefficient $\zeta$ is decreasing in $R_{\min}$ and $\kappa$, and that the contraction radius $a$ approaches $\frac{1}{2} R_{\min}$ as $R_{\min}$ goes to infinity. This contraction radius is essentially optimal because there are examples of the EM algorithm converging to non-global local minima when $\|\mu_i - \mu_i^*\|_2 = \frac{1}{2} R_{\min}$. For example, consider the task of fitting a mixture of two Gaussians. If we initialize the EM algorithm with both centers at the midpoint $\frac{1}{2}(\mu_1^* + \mu_2^*)$, the algorithm never separates the centers and converges to a stationary point with identical estimates of the two centers. It is not clear whether the $\min\{d, 2M\}^{1/2}$ term in (5) is optimal. We suspect not, because in the experiments we have run, we have never seen EM fail to converge to the truth when the initializer satisfies $\max_i \|\mu_i - \mu_i^*\|_2 < R_{\min}/2$. The improved dependence of the contraction radius on $\min\{d, 2M\}$ instead of $d$ is intuitive. Informally, if $d > 2M$, the components of the noise orthogonal to the subspace spanned by the true and estimated centers cancel out when the expectation is taken, which reduces the effective dimension of the problem from $d$ to $\min\{d, 2M\}$. The details are in the proof of Theorem 3.1. We remark that this improvement is not present in studies of the convergence of the EM algorithm that do not go through a population-level analysis (cf. the result of Dasgupta and Schulman [10]).
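The midpoint failure mode discussed above is easy to reproduce numerically. The sketch below (an arbitrary two-component, equal-weight example, not from the paper) starts both center estimates at the same point; the posteriors are then identically $1/2$, so every update maps both estimates to the sample mean and they never separate:

```python
import numpy as np

rng = np.random.default_rng(0)
# Mixture of two well-separated isotropic Gaussians in R^2, equal weights.
mu_star = np.array([[-3.0, 0.0], [3.0, 0.0]])
labels = rng.integers(0, 2, size=4000)
X = mu_star[labels] + rng.standard_normal((4000, 2))

def em_step(X, mu):
    """One EM update with equal known weights (pi_1 = pi_2 = 1/2)."""
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    logw = -0.5 * sq
    logw -= logw.max(axis=1, keepdims=True)
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return (w.T @ X) / w.sum(axis=0)[:, None]

# Initialize both centers at the same point (here, the sample mean, which
# is close to the midpoint of the two true centers).
mu = np.vstack([X.mean(0), X.mean(0)])
for _ in range(50):
    mu = em_step(X, mu)
print(np.linalg.norm(mu[0] - mu[1]))  # stays at 0: the centers never separate
```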
In the statement of Theorem 3.1, there is no provision that $\zeta < 1$. Solving $\zeta < 1$ for conditions on $a$ leads to Theorem 3.2, which is easier to parse but obscures the dependence of the contraction coefficient on the properties of the data generating process.
where $C_0, C_1 > 0$ are universal constants.

Second, we carry out a perturbation analysis to extend Theorem 3.2 to (the sample version of) the EM algorithm. At a high level, our result states that the iterates of the EM algorithm converge linearly to $\mu^*$ up to the statistical precision of the estimation task. Our proof hinges on recent concentration results by Mei, Bai and Montanari [22]. Compared to the proof of an analogous result for the gradient EM algorithm by Yan, Yin and Sarkar [31], our proof is considerably simpler.
imsart-ejs ver. 2014/10/16 file: main.tex date: October 10, 2018

where $C_0, C_1 > 0$ are universal constants. As long as the sample size $n$ is large enough that (9) holds, the bound (10) holds with probability at least $1 - \frac{2M}{n}$. We see that the first term on the right side of (10) converges to zero linearly, while the second term does not depend on $t$. Initially, the first term on the right side dominates the second, and the right side decreases linearly. After sufficiently many iterations, however, the second term dominates the first, and the right side settles down to a limit that is $O\big((\frac{Md}{\kappa^2 n})^{1/2}\big)$. We recognize this limit as the statistical precision (modulo constant and logarithmic factors).
Convergence to the maximum likelihood estimator. We remark that Theorem 3.3 implies EM converges to a stationary point within a ball of radius $O\big((\frac{Md}{\kappa^2 n})^{1/2}\big)$ of the true centers. It is known that the maximum likelihood estimator (MLE) falls inside this ball with high probability. As long as there are no other stationary points in this ball, Theorem 3.3 implies EM converges to the MLE. This is a consequence of the gradient stability result by Yan, Yin and Sarkar [31], which implies $\mu^*$ is the only stationary point of the expected log-likelihood function in $\otimes_i B(\mu_i^*, a)$, and concentration results by Mei, Bai and Montanari [22], which imply that, with high probability, the sample log-likelihood function has only one stationary point in this ball.

Simulation results
In this section, we present some simulation results to validate our theoretical results. In the first set of experiments, we check the prediction of Theorem 3.1 that the statistical error decreases linearly initially and eventually reaches a plateau. The data generating process is a mixture of $M = 5$ isotropic Gaussians in $\mathbb{R}^{10}$ with one mixture component centered at the origin and the remaining four centers at vertices of $R_{\min} \Delta_9$, where $\Delta_9$ is the probability simplex in $\mathbb{R}^{10}$. The components are equally weighted ($\pi_i = \frac{1}{5}$). The top left panel of Figure 1 shows the decrease of the optimization error $\max_i \|\hat{\mu}_i - \mu_i^t\|_2$ and the statistical error $\max_i \|\mu_i^* - \mu_i^t\|_2$ over 10 runs of the EM algorithm. We set $R_{\min} = 2$ and generate $n = 8000$ samples from the Gaussian mixture model. We initialize the EM algorithm at $\mu_i^0 = \mu_i^* + \delta_i$, where $\delta_i$ is uniformly distributed on the sphere of radius $0.4 \cdot R_{\min}$. We see that the statistical error decreases linearly initially and reaches a plateau after four iterations. This agrees with the implications of Theorem 3.3. The top right panel of Figure 1 shows the average log statistical errors over 10 trials as $R_{\min}$ varies. We see that larger $R_{\min}$ values lead to faster convergence, which agrees with the implications of Theorem 3.1.
The plots in the bottom panels are analogous to the plots above. We keep all simulation parameters the same, except we change the mixture weights from uniform to non-uniform. We see that non-uniform mixture weights hurt the performance of the EM algorithm. Comparing the two plots on the left, we see that non-uniform weights cause the algorithm to converge more slowly and reduce the statistical precision of the output. The slower convergence agrees with the implications of Theorem 3.1, which shows that the contraction coefficient is inversely proportional to the minimum mixture weight $\kappa$. The reduced statistical precision is due to the fact that the centers of mixture components with smaller weights are estimated less accurately, because those components have smaller effective sample sizes. We also see greater variation across the ten runs of the algorithm.
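A scaled-down version of the first experiment can be run in a few lines; the dimensions, separation, and sample size below are arbitrary choices for illustration, not those of Figure 1:

```python
import numpy as np

rng = np.random.default_rng(1)
# M = 3 equally weighted isotropic Gaussians in R^5, pairwise separation 6.
d, M, n, R = 5, 3, 6000, 6.0
mu_star = np.zeros((M, d))
mu_star[1, 0] = R
mu_star[2, 1] = R
pi = np.full(M, 1 / M)

labels = rng.choice(M, size=n, p=pi)
X = mu_star[labels] + rng.standard_normal((n, d))

def em_step(X, mu, pi):
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    logw = np.log(pi)[None, :] - 0.5 * sq
    logw -= logw.max(axis=1, keepdims=True)
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return (w.T @ X) / w.sum(axis=0)[:, None]

# Initialize at mu* plus a perturbation of norm 0.4 * R_min, as in the text.
delta = rng.standard_normal((M, d))
delta *= 0.4 * R / np.linalg.norm(delta, axis=1, keepdims=True)
mu = mu_star + delta

errs = []
for _ in range(15):
    mu = em_step(X, mu, pi)
    errs.append(np.linalg.norm(mu - mu_star, axis=1).max())
# The statistical error drops quickly, then plateaus near the precision floor.
print(errs[0], errs[-1])
```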

Proofs of the main results
We prove Theorems 3.1, 3.2, and 3.3 in this section, deferring the proofs of the technical lemmas to the appendices.

Proof of Theorem 3.1
We make a few observations before proceeding to the proof. Without loss of generality, we focus on the update rule for the first center $\mu_1$:
$$\mu_1^+ = \frac{\mathbb{E}[w_1(X; \mu) X]}{\mathbb{E}[w_1(X; \mu)]}. \tag{11}$$
The vector of true centers $\mu^*$ is a fixed point of (4), which implies
$$\mu_1^* = \frac{\mathbb{E}[w_1(X; \mu^*) X]}{\mathbb{E}[w_1(X; \mu^*)]}.$$
In the first step of the proof, we establish an upper bound on the norm of the numerator. In the second step, we establish a lower bound on the denominator in (11). Finally, in the third step, we combine the upper and lower bounds to show that the EM update rule is a contraction.
Step 1 (upper bounding the numerator). Define $\mu_t := \mu^* + t(\mu - \mu^*)$ and $g_X(t) := w_1(X; \mu_t)$. Expanding along this path decomposes the numerator into the terms $V_1, \ldots, V_M$. The rest of the first step consists of establishing bounds on the $V_i$'s. We state the result here and defer the details to Appendix A.
then for any $\mu$ such that $\mu_i \in B(\mu_i^*, a)$ for all $i \in [M]$, the stated bounds on the $V_i$'s hold.

Step 2 (lower bounding the denominator). We state the lower bound and describe the underlying intuition, deferring the proof to the appendix. Let $Z$ be the label of $X$. We observe that $\mathbb{E}\, w_1(X; \mu^*) = \mathbb{E}\, \mathbb{P}_{\mu^*}(Z = 1 \mid X) = \pi_1 \ge \kappa$.
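The observation that $\mathbb{E}\, w_1(X; \mu^*) = \pi_1$ can be sanity-checked by Monte Carlo. The sketch below (the three-component mixture and its weights are arbitrary illustrative choices) averages the posterior weights evaluated at the true centers:

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary 3-component isotropic mixture in R^4 (illustrative choice).
mu_star = np.array([[4.0, 0, 0, 0], [0, 4.0, 0, 0], [0, 0, 4.0, 0]])
pi = np.array([0.5, 0.3, 0.2])

n = 200_000
labels = rng.choice(3, size=n, p=pi)
X = mu_star[labels] + rng.standard_normal((n, 4))

# Posterior weights w_i(X; mu*) evaluated at the true centers.
sq = ((X[:, None, :] - mu_star[None, :, :]) ** 2).sum(-1)
logw = np.log(pi)[None, :] - 0.5 * sq
logw -= logw.max(axis=1, keepdims=True)
w = np.exp(logw)
w /= w.sum(axis=1, keepdims=True)

# E[w_i(X; mu*)] = P(Z = i) = pi_i, so the averages recover the weights.
print(w.mean(axis=0))
```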
then for any $\mu$ such that $\mu_i \in B(\mu_i^*, a)$ for all $i \in [M]$, the stated lower bound on $\mathbb{E}\, w_1(X; \mu)$ holds.

Step 3 (combining the bounds). We combine the bounds for $\max_i V_i$ and $\mathbb{E}\, w_1(X; \mu)$ to show that the EM update rule is a contraction. Without loss of generality, we focus on the update rule for the first cluster. By (11), (13), and Lemma 5.2, and plugging in (17), we arrive at the bound (20). We recognize that the factor in front of $\max_i \|\mu_i - \mu_i^*\|_2$ in (20) is smaller than the contraction factor $\zeta$ in Theorem 3.1. We can also check that (5) implies (16) and (18), so the conditions of Lemmas 5.1 and 5.2 are satisfied. Putting the two parts together completes the proof of Theorem 3.1.

Proof of Theorem 3.2
To prove Theorem 3.2, we start from (20) and solve for the contraction radius $a$ that implies $\zeta < \frac{1}{2}$. By Lemma 5.1, the resulting condition on $a$ is implied by a simpler one, which rearranges to (22). Finally, we check that there is a universal constant $C_1$ such that the separation condition of Theorem 3.2 implies (16), (18), and (22).

Proof of Theorem 3.3
The intuition underlying the proof of Theorem 3.3 is that the sample and population update rules ((2) and (4), respectively) are similar. In the proof, we appeal to the following technical lemmas on the uniform convergence of $\frac{1}{n}\sum_{j=1}^n w_i(X_j; \mu)(X_j - \mu_1^*)$ and $\frac{1}{n}\sum_{j=1}^n w_i(X_j; \mu)$ to their population counterparts.

Lemma 5.3. Define the event $E_{1,i}$ as above, where $C_3$ is a universal constant. We have $\mathbb{P}(E_{1,i}) \le \frac{1}{n}$.

Lemma 5.4. Define the corresponding event for $\frac{1}{n}\sum_{j=1}^n w_i(X_j; \mu)$, where $C_2$ is a universal constant. The same bound holds for its probability.
for all $i \in [M]$. By Lemmas 5.3 and 5.4, this event occurs with probability at least $1 - \frac{2M}{n}$. The minimum sample size condition (9) implies the second inequality (24), which in turn implies the iterates stay in the contraction region. Let $\mu^0$ be the initial iterate. We have (25), where the second step is a consequence of (23). This implies $\mu^1$ is also in the contraction region. Applying (25) iteratively completes the proof.

Summary and discussion
Initialization. We emphasize that our convergence results are local: they assume the EM algorithm is initialized in a neighborhood of the true centers. To obtain such an initial iterate, we appeal to approaches based on the method of moments, such as the method proposed by Hsu and Kakade [16]. These methods are consistent; i.e., as long as the sample size is large enough, they output a good initial iterate for the EM algorithm with high probability. Formally, as long as the sample size $n$ is greater than a polynomial in $d$, $M$, $1/\epsilon$, $\log(1/\delta)$, and $1/\kappa$, the method proposed by Hsu and Kakade [16] outputs $\epsilon$-accurate estimates of the centers with probability at least $1 - \delta$, where $\lambda_{\max}$ is the largest eigenvalue of the covariance matrix of the samples. The main benefit of combining a spectral method with the EM algorithm is that the EM algorithm usually outputs a more accurate estimator. This is reflected in the smaller sample complexity of the EM algorithm.
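A crude initializer in the spirit of this discussion can be sketched as follows. This is only an illustrative sketch, not the Hsu-Kakade moment estimator: it projects the data onto the top-$M$ principal subspace, runs a few Lloyd iterations there with farthest-point seeding, and lifts the centers back:

```python
import numpy as np

def spectral_init(X, M, iters=20, rng=None):
    """Rough initial centers for EM via PCA + Lloyd's algorithm.

    A sketch in the spirit of spectral initialization; NOT the
    Hsu-Kakade moment estimator discussed in the text.
    """
    rng = np.random.default_rng(rng)
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    P = Vt[:M].T                     # basis of the top-M principal subspace
    Z = X @ P                        # projected samples, shape (n, M)
    idx = [int(rng.integers(len(Z)))]
    for _ in range(M - 1):           # farthest-point seeding
        d2 = ((Z[:, None] - Z[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d2.argmax()))
    centers = Z[idx]
    for _ in range(iters):           # plain Lloyd iterations in the subspace
        lab = ((Z[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([Z[lab == i].mean(0) if np.any(lab == i)
                            else centers[i] for i in range(M)])
    return centers @ P.T             # lift the centers back to R^d

# Example: two well-separated components in R^6 (arbitrary test setup).
rng = np.random.default_rng(0)
mu_star = np.zeros((2, 6))
mu_star[0, 0], mu_star[1, 0] = 4.0, -4.0
X = mu_star[rng.integers(0, 2, 2000)] + rng.standard_normal((2000, 6))
mu0 = spectral_init(X, 2, rng=0)
```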
Minimum separation between centers. Theorems 3.1 and 3.3 require the minimum separation between centers to grow as $\Omega(\min\{M, d\}^{1/2})$. Compared to other methods for fitting mixtures of isotropic Gaussians, this dependence is suboptimal. For example, Vempala and Wang [26] showed that spectral clustering can accurately recover the labels in a mixture of spherical Gaussians provided that the minimum separation is at least $\Omega((M \log d)^{1/4})$. Some approaches based on the method of moments are able to learn mixtures of Gaussians in which the centers are arbitrarily close together (as long as the sample size is large enough). However, the sample complexity of such methods is usually worse than that of the EM algorithm.
If we restrict attention to studies of the EM algorithm and its variants (including gradient EM and the K-means algorithm), our requirement on the minimum separation between centers is optimal. Yan, Yin and Sarkar [31] impose the same condition in their study of the convergence of the gradient EM algorithm. Lu and Zhou [21] require the minimum separation to grow proportionally to $M$ in their study of the convergence of the K-means algorithm.
Contraction radius and convergence rate. The contraction radius in (5) is optimal in the sense that it is approximately $\frac{1}{2} R_{\min}$ when $R_{\min}$ is large, and we can find examples of the EM algorithm converging to non-global local minima when $\|\mu_i - \mu_i^*\|_2 = \frac{1}{2} R_{\min}$ (see the remarks after Theorem 3.1). By comparison, the contraction radius for the gradient EM algorithm is very similar to (5) [31], and the contraction radius for K-means is roughly $\frac{1}{2} R_{\min} - C M^{3/4}$ [21, Theorem 6.2]. We show that the EM algorithm converges linearly up to statistical precision. This agrees with our simulation results and previous studies on the convergence of EM [3, 25]. We also see that the convergence slows as $\kappa$ decreases, which agrees with the folklore that the EM algorithm converges slowly on imbalanced mixture models.
Minimum sample size. Our result is valid as long as $\frac{n}{\log n}$ exceeds a constant depending on the model parameters (including $\frac{R_{\max}}{R_{\min}}$), while Lu and Zhou [21] established linear convergence of the K-means algorithm as long as $\frac{n}{\log n} \ge C \frac{M}{\kappa^2}$. The variations in the minimum sample size are due to differences in the concentration results that appear in the proofs. We believe it is possible to avoid the $\log n$ factor and improve the dependence on $\frac{R_{\max}}{R_{\min}}$ in the minimum sample size by refining the concentration results in our proofs.

Appendix A
In Appendix A, we prove Lemmas 5.1 and 5.2.

A.1. Preliminaries
Before jumping into the proof, we need the following preliminary result, which combines Lemmas 12 and 13 in Yan, Yin and Sarkar [31].
Lemma A.1. Suppose the minimum separation $R_{\min}$ and the radius $a$ satisfy the stated conditions. Then for any $\mu$ such that $\mu_i \in B(\mu_i^*, a)$, $\forall i \in [M]$, we have the stated inequalities for any $p = 0, 1, 2$ and any $i \neq j \in [M]$.

A.2. Proof of Lemma 5.1

We start by bounding $V_1$, where $a$ is the radius of the contraction region $\otimes_i B(\mu_i^*, a)$. After rotating, the component $X_{d-2M}$ is independent of $\tilde{X}_{2M}$. Since $w_1(X; \mu_t)$ depends only on $\tilde{X}_{2M}$ (the part involving $X_{d-2M}$ cancels out), we can reduce to dimension $\min\{2M, d\}$. Applying Lemma A.1 with dimension $\min\{2M, d\}$ and $p = 2$ yields (29), and applying it with $p = 0$ yields (30). Combining (29) and (30) with (28) gives the desired bound on $V_1$. Next we move to $V_i$, $i \neq 1$. Using the same decomposition and the same rotation trick, we obtain the analogous bound. The proof is now complete.

A.3. Proof of Lemma 5.2

A similar rotation argument implies $w_1(X; \mu) = w_1(\tilde{X}_M; \tilde{\mu})$ and $\mathbb{E}\, w_1(X; \mu) = \mathbb{E}\, w_1(\tilde{X}_M; \tilde{\mu})$, where $\tilde{\mu}$ denotes the rotated centers. We have thus reduced the effective dimension to $\min\{M, d\}$.
The rotation step is optional and only reduces the dimension when $d > M$. For ease of notation, let us assume $M \ge d$ and opt not to perform it. The next step in bounding $\mathbb{E}\, w_1(X; \mu)$ is to restrict ourselves to the event on which (a) $X$ is generated by the first cluster, and (b) $X$ lies in a ball $B(\mu_1^*, r)$ for some radius $r$ to be selected later. On $B(\mu_1^*, r)$, the triangle inequality gives the stated bounds. Also, since $w_1(x; \mu)$ is decreasing in $\|x - \mu_1\|_2$ and increasing in $\|x - \mu_i\|_2$, we obtain the stated lower bound, where in the last step we used the numerical inequality $\frac{a}{a+b} \ge 1 - \frac{b}{a}$. It thus follows that the quantity of interest is lower bounded in terms of $\mathbb{P}(\|\varepsilon\|_2 \le r)$, where $\varepsilon \sim N(0, I_d)$. Moving forward, we naturally want to lower bound $\mathbb{P}(\|\varepsilon\|_2 \le r)$, and the following lemma (Lemma 8 in [31]) allows us to do so.
Suppose for now that the condition of Lemma A.2 holds. Then, chaining (34) and (35) and applying Lemma A.2 yields a lower bound on $\mathbb{E}\, w_1(X; \mu)$. Therefore, to ensure $\mathbb{E}\, w_1(X; \mu) \ge \frac{3\kappa}{4}$, it suffices to choose $r$ appropriately. We now collect all the conditions needed for the inequalities above to go through. If we replace $d$ by $\min\{M, d\}$ throughout, the proof goes through with only notational changes. Finally, we can readily check that the conditions on $R_{\min}$ and $a$ in Lemma 5.2 imply the three conditions above. The proof is complete.

Appendix B
In Appendix B, we prove Lemmas 5.3 and 5.4, which facilitate the proof of Theorem 3.3. We first introduce some preliminary results on sub-gaussian random variables and then prove the two lemmas.

B.1. Preliminaries
Lemma B.1. Let X be a random variable.
1. (Sub-gaussian random variable). $X$ is called sub-gaussian if there exists a finite $t$ such that $\mathbb{E} \exp(X^2/t^2) \le 2$. For a sub-gaussian $X$, its sub-gaussian norm $\|X\|_{\psi_2}$ is defined as $\|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E} \exp(X^2/t^2) \le 2\}$.
2. (Hoeffding's inequality). Let $X_1, \ldots, X_N$ be independent, mean-zero, sub-gaussian random variables. Then, for every $t \ge 0$, we have $\mathbb{P}\big(\big|\sum_{i=1}^N X_i\big| \ge t\big) \le 2 \exp\big(-\frac{c t^2}{\sum_{i=1}^N \|X_i\|_{\psi_2}^2}\big)$, where $c$ is an absolute constant.
3. (Centering). If $X$ is a sub-gaussian random variable, then $X - \mathbb{E}X$ is sub-gaussian too, and $\|X - \mathbb{E}X\|_{\psi_2} \le C \|X\|_{\psi_2}$, where $C$ is an absolute constant.
4. (Bounded random variable is sub-gaussian). Any bounded random variable $X$ is sub-gaussian, with $\|X\|_{\psi_2} \le C \|X\|_\infty$.
5. (Gaussian mixtures are sub-gaussian). Let $Y$ be a random draw from centers $\{\theta_i\}$ with $\mathbb{P}(Y = \theta_i) = \pi_i$ and $\max_i |\theta_i| \le R$. Then $Y$ is a bounded random variable and $\|Y\|_{\psi_2} \le C_1 R$. Let $\varepsilon \sim N(0, 1)$; from standard results we know $\|\varepsilon\|_{\psi_2} \le C_2$. Since $X$ from the corresponding one-dimensional Gaussian mixture has the same distribution as $Y + \varepsilon$, we have $\|X\|_{\psi_2} \le C(R + 1)$.
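Hoeffding's inequality in part 2 can be checked empirically for bounded variables. The sketch below uses Rademacher signs, for which the classical bound has the explicit constant shown (a textbook fact, not a result of this paper):

```python
import numpy as np

rng = np.random.default_rng(1)
# For averages of n independent +/-1 signs, Hoeffding's inequality gives
# P(|S_n / n| >= t) <= 2 exp(-n t^2 / 2).
n, reps, t = 400, 20000, 0.1
means = rng.choice([-1.0, 1.0], size=(reps, n)).mean(axis=1)
empirical = (np.abs(means) >= t).mean()   # empirical tail probability
bound = 2 * np.exp(-n * t ** 2 / 2)       # Hoeffding upper bound
print(empirical, bound)  # the empirical tail sits below the bound
```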

B.2. Proof of lemma 5.3
Define $L := 1.5 R_{\max}$. Then for $a \le \frac{1}{2} R_{\min}$, we have $\otimes_i B(\mu_i^*, a) \subseteq \otimes_i B(0, L)$. This is a natural consequence of $\|\mu_i^*\|_2 \le R_{\max}$ for all $i$. To see why $\|\mu_i^*\|_2 \le R_{\max}$, suppose the opposite and, without loss of generality, let $\|\mu_1^*\|_2 > R_{\max}$; then all $\mu_i^* \in B(\mu_1^*, R_{\max})$. Since $\mathbb{E}X = \sum_i \pi_i \mu_i^* = 0$ but $B(\mu_1^*, R_{\max})$ does not contain the origin, we get a contradiction. Also note that since Theorem 3.1 requires $R_{\min} \ge C_0 \min\{d, 2M\}^{1/2}$ for a large $C_0$, we can work under the premise that $R_{\max} \ge 1$, because otherwise even the population-level convergence result does not apply.
Denote $\mathcal{U} = \otimes_i B(0, L)$; we establish all uniform convergence results on $\mathcal{U}$. Let $n_\varepsilon$ be the $\varepsilon$-covering number of $B(0, L)$; standard results [27] give $\log(n_\varepsilon) \le d \log(3L/\varepsilon)$. Taking the Cartesian product of such covers, we obtain a cover of $\mathcal{U}$. Start by decomposing the supremum over $\mathcal{U}$ using the events $A_t$, $B_t$, $C_t$ defined above. For some $\delta > 0$, we next derive conditions on $t$ that suffice to ensure $\mathbb{P}(A_t) \le \frac{\delta}{2}$, $\mathbb{P}(B_t) \le \frac{\delta}{2}$, and $\mathbb{P}(C_t) = 0$; replacing $\delta$ with $\frac{1}{n}$ then completes the proof.

Upper bounding $\mathbb{P}(B_t)$: Let $V_{1/2}$ be a $(1/2)$-cover of $B^d(0, 1)$ with $\log |V_{1/2}| \le d \log 6$. Taking a union bound and using part 5 of Lemma B.1, note that $\langle X - \mu_1^*, v \rangle$ follows a one-dimensional Gaussian mixture model. Consequently, by Hoeffding's inequality, to ensure $\mathbb{P}(B_t) \le \frac{\delta}{2}$, it suffices to choose $t$ as stated.

Upper bounding $\mathbb{P}(C_t)$: Using the same integration expression (13) as in Section 4.1, and since $C_t$ is deterministic, as long as $\varepsilon \sum_{i=1}^M U_i < \frac{t}{3}$, $C_t$ never happens.

Upper bounding $\mathbb{P}(A_t)$: Using Markov's inequality, to ensure $\mathbb{P}(A_t) \le \frac{\delta}{2}$, it suffices to let $t \ge 6\varepsilon (\sum_i U_i)/\delta$. Note that whenever this holds, the condition that ensures $\mathbb{P}(C_t) = 0$ is implied.
Knowing that all $w \in [0, 1]$ and $\varepsilon \sim N(0, I_d)$, a bit more calculation shows $U_1 \le C'(L^2 + d)$, and the same bound also holds for the other $U_i$. It thus follows that $\sum_{i=1}^M U_i \le C' M (L^2 + d)$.

Conclusion:
We set $\varepsilon = \frac{\delta L}{6 C' n M (L^2 + d)}$ and $\delta = \frac{1}{n}$; then any $t$ satisfying the following bound ensures that the bad events happen with probability less than $\delta$. The second argument in the maximum dominates the first.
After carefully checking the constants, we conclude there exists a universal constant $C_3$ such that
$$\mathbb{P}\Big(I_{n,1}(\mu) \ge R_{\max} \tilde{C}_3 \sqrt{\tfrac{Md \log n}{n}}\Big) \le \frac{1}{n},$$
where $\tilde{C}_3 = C_3 \log(M(3R_{\max}^2 + d))$.
B.3. Proof of Lemma 5.4

For some $\delta > 0$, we next derive conditions on $t$ that suffice to ensure $\mathbb{P}(A_t) \le \frac{\delta}{2}$, $\mathbb{P}(B_t) \le \frac{\delta}{2}$, and $\mathbb{P}(C_t) = 0$.
Then replacing $\delta$ with $\frac{1}{n}$ completes the proof.

Upper bounding $\mathbb{P}(B_t)$: Since $w_1(X, \mu)$ is bounded between $[0, 1]$, it is sub-gaussian with $\|w_1(X, \mu)\|_{\psi_2} \le C$ for some absolute constant $C$. We can thus directly apply a union bound and Hoeffding's inequality. To ensure $\mathbb{P}(B_t) \le \frac{\delta}{2}$, it suffices to let $t \ge C \sqrt{\frac{Md \log(3L/\varepsilon) + \log(4/\delta)}{n}}$.

Upper bounding $\mathbb{P}(C_t)$: Using the same integration expression (13) as in Section 4.1, we have $W_i = \mathbb{E} \sup_{\mu \in \mathcal{U}} \|w_1(X; \mu) w_i(X; \mu)(X - \mu_i)\|_2$ for $i \neq 1$.
Since $C_t$ is deterministic, as long as $\varepsilon \sum_{i=1}^M W_i < \frac{t}{3}$, $C_t$ never happens.

Upper bounding $\mathbb{P}(A_t)$: Using Markov's inequality, we bound the expected supremum by $\varepsilon \sum_i W_i$ (via the mean value theorem). To ensure $\mathbb{P}(A_t) \le \frac{\delta}{2}$, it suffices to let $t \ge 6\varepsilon (\sum_i W_i)/\delta$. Note that whenever this holds, the condition that ensures $\mathbb{P}(C_t) = 0$ is implied.
Knowing that all $w \in [0, 1]$, where $Y$ is a random draw from the centers $\{\mu_i^*\}_{i \in [M]}$ according to the probabilities $\{\pi_i\}_{i \in [M]}$ and $\varepsilon \sim N(0, I_d)$, a bit more calculation shows $W_1 \le C'(L + \sqrt{d})$, and the same bound also holds for the other $W_i$. It thus follows that $\sum_{i=1}^M W_i \le C' M (L + \sqrt{d})$.

Conclusion:
We set $\varepsilon$ and $\delta$ as in the proof of Lemma 5.3; the second argument in the maximum dominates the first. After carefully checking the constants, we conclude there exists a universal constant $C_2$ such that
$$\mathbb{P}\Big(\sup_{\mu \in \mathcal{U}} \Big|\frac{1}{n}\sum_{i=1}^n w_1(X_i, \mu) - \mathbb{E}\, w_1(X; \mu)\Big| \ge C_2 \sqrt{\tfrac{Md \log n}{n}}\Big) \le \frac{1}{n}.$$