Finite mixture regression: A sparse variable selection by model selection for clustering

We consider a finite mixture of Gaussian regression models for high-dimensional data, where the number of covariates may be much larger than the sample size. We propose to estimate the unknown conditional mixture density by a maximum likelihood estimator, restricted to the relevant variables selected by an $\ell_1$-penalized maximum likelihood estimator. We obtain an oracle inequality satisfied by this estimator for a Jensen-Kullback-Leibler type loss. This oracle inequality is deduced from a general model selection theorem for maximum likelihood estimators with a random model collection. From it we derive the shape of the penalty in the criterion, which depends on the complexity of the random model collection.


Introduction
With the growth of high-dimensional data, where the number of observations need not be large, new statistical methods have been needed to deal with the underlying identifiability problem. A classical assumption is sparsity: if the number of parameters to estimate is larger than the sample size, we assume that only a few of the parameters are nonzero. The Lasso estimator, introduced by Tibshirani in [20], is a classical tool in this context. Since it works well in practice, much recent effort has gone into establishing theoretical results for this estimator. Let us define the model and the estimator before stating some of the theoretical results already obtained. We consider a linear model, $Y = X\beta + \epsilon$, with random variables $(X, Y) \in \mathbb{R}^p \times \mathbb{R}^q$, an unknown regression matrix $\beta$ to estimate, and a white noise $\epsilon \sim \mathcal{N}(0, \Sigma)$. The dimensions $p$ and $q$ may be large. We observe the sample $((X_i, Y_i))_{i \in \{1,\ldots,n\}}$. The Lasso estimator is defined by $$\hat{\beta}(\lambda) = \operatorname{argmin}_{\beta} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\},$$ with $\lambda > 0$ to be specified. Under a variety of assumptions on the design matrix, oracle inequalities hold for the Lasso estimator. One example is the restricted eigenvalue condition, introduced by Bickel, Ritov and Tsybakov in [4].
Under this assumption, they obtain an oracle inequality, which shows that the distance between the prediction losses of the Lasso estimators is of the same order as the distance between it and its oracle approximation. For an overview of existing results, see for example [21], which presents various conditions and their consequences.
Another type of result concerns variable selection. Beyond estimation, the Lasso can be used to select variables, and for this goal many results have been proved without strong assumptions. The first result in this direction is due to Meinshausen and Bühlmann [14], who show that, for neighborhood selection in Gaussian graphical models and under a neighborhood stability condition, the Lasso is consistent, even if the number of variables is of larger order than the sample size. Other assumptions, such as the irrepresentable condition described in [22], follow the same idea: the true variables are selected consistently.
Another approach consists in refitting the estimate, after the variable selection, with an estimator that has better properties. This is the approach taken in this article: we study the maximum likelihood estimator on the estimated active set. This idea is used, for instance, by Massart and Meynet [12], Belloni and Chernozhukov [3], and Tingni Sun and Cun-Hui Zhang [19]. Here, however, we study this estimator in a finite mixture regression model, with clustering as the final goal, which, to our knowledge, has not been studied before.
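The selection-then-refit idea can be sketched in the simplest setting of a single linear model with scalar response. The sketch below is an illustration under stated assumptions, not the procedure studied in this article: it uses a plain proximal gradient (ISTA) solver for the Lasso, with the objective scaled as $\frac{1}{2}\|y - Xb\|_2^2 + \lambda\|b\|_1$, followed by an ordinary least squares refit on the selected support. The function names, the simulated data and the value of $\lambda$ are all hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # inverse Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - step * grad, step * lam)
    return b

def lasso_then_refit(X, y, lam):
    """Select the active set with the Lasso, then refit by least squares on it."""
    b_lasso = lasso_ista(X, y, lam)
    active = np.flatnonzero(b_lasso != 0.0)
    b_refit = np.zeros_like(b_lasso)
    if active.size > 0:
        b_refit[active] = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
    return active, b_lasso, b_refit

# Hypothetical simulated data: n = 50 observations, p = 100 covariates, 3 of them relevant.
rng = np.random.default_rng(0)
n, p = 50, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [5.0, -4.0, 3.0]
y = X @ beta + 0.1 * rng.standard_normal(n)

active, b_lasso, b_refit = lasso_then_refit(X, y, lam=20.0)
print("selected variables:", active)
print("residual norm, Lasso:", np.linalg.norm(y - X @ b_lasso))
print("residual norm, refit:", np.linalg.norm(y - X @ b_refit))
```

By construction, the refitted coefficients are supported on the Lasso active set and have a residual norm that is no larger, since least squares minimizes the residual over all vectors supported on that set.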
The goal of clustering methods is to discover structure among individuals described by several variables. Specifically, in the regression case, given $n$ observations $(x, y) = ((x_1, y_1), \ldots, (x_n, y_n))$ which are realizations of random variables $(X, Y)$ with $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$, one aims at grouping the data into a few clusters such that the conditional observations $Y|X$ in the same cluster are more similar to each other than to those from the other clusters. Different methods can be considered, some more geometric and others more statistical. We deal with model-based clustering, in order to have a rigorous statistical framework in which to assess the number of clusters and the role of each variable. Datasets are increasingly high-dimensional, and not all of the information is relevant for the clustering. To address this problem, we propose a procedure which provides a data clustering based on variable selection. This procedure relies on a modeling that recasts the variable selection and clustering problems as a model selection problem in a regression framework. A global selection criterion that simultaneously chooses the best number of clusters and the set of relevant variables is required. We use a penalized criterion to select a model from a non-asymptotic point of view. Penalizing the empirical contrast is an idea that emerged in the seventies. Akaike [1] proposed the Akaike Information Criterion (AIC) in 1973, and Schwarz [17] suggested the Bayesian Information Criterion (BIC) in 1978. Those criteria are based on asymptotic heuristics. To deal with non-asymptotic observations, Birgé and Massart [6] and Barron et al. [2] define a penalized data-driven criterion, which leads to oracle inequalities for model selection. Cohen and Le Pennec [8] generalize this result to the case of regression data. The aim of our approach is to define a penalized data-driven criterion which leads to an oracle inequality for our procedure.
In our regression context, Cohen and Le Pennec [8] proposed a general model selection theorem for maximum likelihood estimation, adapted from Massart's theorem in [11]. Nevertheless, we cannot apply it directly, because it is stated for a deterministic model collection, whereas our data-driven model collection is random, being constructed by the Lasso. As Meynet did in [16] to generalize Massart's theorem, we extend the theorem to cope with the randomness of our model collection. By applying this general theorem to the random collection of finite mixture regression models constructed by our procedure, we derive a convenient theoretical penalty, together with an associated non-asymptotic penalized criterion and an oracle inequality fulfilled by our Lasso-MLE estimator. The advantage of this procedure is that it does not need any restrictive assumption.
Let us state the main result of this paper. Let $(x_i, y_i)_{i=1,\ldots,n}$ be the observations, with unknown conditional density $s_0$. Let $(S_m)_{m \in \mathcal{M}}$ be the model collection constructed by our procedure. We construct a collection of finite mixtures of Gaussian regressions with various numbers of clusters and different sets of relevant variables. Then, we estimate the conditional density by the maximum likelihood estimator in each model. This leads to a collection of estimators of the density. A final estimator has to be selected among this collection, which is equivalent to selecting a model among the model collection. Under some weak assumptions, we obtain a condition on the penalty $\mathrm{pen}(m)$ such that the estimator $\hat{s}_{\hat{m}}$, where $\hat{s}_m$ is the maximum likelihood estimator in $S_m$ and $\hat{m}$ is the model which minimizes the penalized log-likelihood, satisfies an oracle inequality stated in terms of the divergences JKL and KL, which we define later. The idea of this theorem is that the model chosen by our procedure is as good as the best model in our collection, the one we would pick if we knew the true density.
Before concluding the introduction, let us fix some notation. In this general setting, we assume that the observations $(x_i, y_i)_{i=1,\ldots,n}$ are a sample of random variables $(X, Y)$ where $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$. Let $S_m$ be a set of candidate conditional densities, in which we estimate $s_0$ by the maximum likelihood estimator $$\hat{s}_m = \operatorname{argmin}_{s_m \in S_m} \left( -\frac{1}{n} \sum_{i=1}^n \log s_m(y_i|x_i) \right).$$ To avoid existence issues, we work with almost minimizers of this quantity and define an $\eta$-log-likelihood minimizer as any $\hat{s}_m$ that satisfies $$-\frac{1}{n} \sum_{i=1}^n \log \hat{s}_m(y_i|x_i) \leq \inf_{s_m \in S_m} \left( -\frac{1}{n} \sum_{i=1}^n \log s_m(y_i|x_i) \right) + \eta.$$ The best model in this collection is the one with the smallest risk. However, because we do not have access to the true density $s_0$, we cannot select the best model, which we call the oracle. There is a trade-off between a bias term measuring the closeness of $s_0$ to the set $S_m$ and a variance term depending on the complexity of the set $S_m$ and on the sample size. A good set $S_m$ is thus one for which this trade-off leads to a small risk bound. Since we work with a maximum likelihood approach, the most natural quality measure is the Kullback-Leibler divergence, denoted by KL. As we consider laws with densities with respect to the Lebesgue measure $d\lambda$, we use the notation $$\mathrm{KL}_\lambda(s, t) = \begin{cases} \int \log\left(\frac{s}{t}\right) s \, d\lambda & \text{if } s\,d\lambda \ll t\,d\lambda, \\ +\infty & \text{otherwise.} \end{cases}$$ Note that, contrary to the quadratic loss, this divergence is an intrinsic quality measure between probability laws: it does not depend on the reference measure $d\lambda$. However, the densities depend on this reference measure, and this is stressed by the index $\lambda$. As we deal with conditional densities and not classical densities, the previous divergence should be adapted.
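We define the tensorized Kullback-Leibler divergence by $$\mathrm{KL}_\lambda^{\otimes n}(s, t) = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \mathrm{KL}_\lambda\big(s(\cdot|x_i), t(\cdot|x_i)\big) \right].$$ This divergence, used in [8], appears as the natural one in this regression setting. More precisely, we use the Jensen-Kullback-Leibler divergence $\mathrm{JKL}_\rho$ with $\rho \in (0, 1)$, defined by $$\mathrm{JKL}_\rho(s, t) = \frac{1}{\rho} \mathrm{KL}_\lambda\big(s, (1-\rho)s + \rho t\big),$$ and the tensorized one $$\mathrm{JKL}_\rho^{\otimes n}(s, t) = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \mathrm{JKL}_\rho\big(s(\cdot|x_i), t(\cdot|x_i)\big) \right].$$ This divergence is studied in [8]. We prefer it to the Kullback-Leibler divergence because we need a boundedness assumption on the controlled functions that is not satisfied by the log-likelihood differences $-\log(s_m/s_0)$. With the Jensen-Kullback-Leibler divergence, those ratios are replaced by the ratios $-\frac{1}{\rho}\log\left(\frac{(1-\rho)s_0 + \rho s_m}{s_0}\right)$, which are close to the log-likelihood differences when the $s_m$ are close to $s_0$ and are always bounded above by $-\frac{\log(1-\rho)}{\rho}$. This boundedness is needed to use deviation inequalities for sums of random variables and their suprema, which is the key to the proof of oracle-type inequalities.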
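To make the boundedness property concrete, the following sketch evaluates both divergences for distributions on a finite set, where the integrals reduce to sums. The finite setting, the function names and the numerical values are assumptions made for illustration only; the article works with densities with respect to the Lebesgue measure.

```python
import numpy as np

def kl(s, t):
    """Kullback-Leibler divergence KL(s, t) between two distributions on a finite set."""
    s, t = np.asarray(s, float), np.asarray(t, float)
    if np.any((t == 0) & (s > 0)):
        return np.inf  # s is not absolutely continuous with respect to t
    mask = s > 0
    return float(np.sum(s[mask] * np.log(s[mask] / t[mask])))

def jkl(s, t, rho):
    """Jensen-Kullback-Leibler divergence: KL(s, (1 - rho) * s + rho * t) / rho."""
    s, t = np.asarray(s, float), np.asarray(t, float)
    return kl(s, (1.0 - rho) * s + rho * t) / rho

# Hypothetical example in which KL is infinite but JKL stays bounded.
s = np.array([0.5, 0.5, 0.0])
t = np.array([0.1, 0.0, 0.9])
rho = 0.5
upper_bound = -np.log(1.0 - rho) / rho

print("KL(s, t)      =", kl(s, t))
print("JKL_rho(s, t) =", jkl(s, t, rho))
print("upper bound   =", upper_bound)
```

In this example the mixture $(1-\rho)s + \rho t$ charges every point charged by $s$, so the Jensen-Kullback-Leibler divergence remains finite even though the Kullback-Leibler divergence does not.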
The aim of model selection is to construct a data-driven criterion to select a model of proper dimension from a given list. A general theory on this topic is developed in the works of Birgé and Massart [5]. Besides, Massart [11] proposed a general theorem which gives the form of the penalty and the associated oracle inequality in terms of the Kullback-Leibler and Hellinger losses. In our regression case, Cohen and Le Pennec [8] proposed a general theorem which gives the form of the penalty and the associated oracle inequality in terms of the Kullback-Leibler and Jensen-Kullback-Leibler losses. These theorems are based on the control of a centred process through the bracketing entropy, which allows one to evaluate the size of the models. We compare the risk of the penalized maximum likelihood estimator $\hat{s}_{\hat{m}}$ with the benchmark $\inf_{m \in \mathcal{M}} \mathbb{E}(\mathrm{KL}_\lambda^{\otimes n}(s_0, \hat{s}_m))$. Our setting is more general, because we work with a random family, denoted by $\hat{\mathcal{M}}$. We have to control the centred process by means of Bernstein's inequality.
The rest of the article is organized as follows. In Section 2, we recall the multivariate Gaussian mixture regression model and describe the main steps of the procedure we propose. We also illustrate the need for refitting with some simulations. We present our oracle inequality in Section 3. Finally, in Section 4, we give some tools to understand the proof of the oracle inequality, with a general model selection theorem for a random collection in Section 4.1 and sketches of the proofs afterwards. All the details are given in the Appendix.

The Lasso-MLE procedure
In order to cluster high-dimensional regression data, we work with the multivariate Gaussian mixture regression model. This model is developed in [18] in the scalar response case. We generalize it in Section 2.1. Moreover, we want to construct a model collection. We propose, in Section 2.2, a procedure called Lasso-MLE which constructs a collection of Gaussian mixture regression models with various levels of sparsity. The different sparsity levels address the high-dimensional problem. We conclude this section with a simulation which illustrates the advantage of refitting.
2.1. Gaussian mixture regression model. We observe $n$ independent couples $(x_i, y_i)_{1 \le i \le n}$, realizations of random variables $(X, Y)$ with $Y \in \mathbb{R}^q$ and $X \in \mathbb{R}^p$, coming from a probability distribution with unknown conditional density denoted by $s_0$. To solve a clustering problem, we use a finite mixture model in regression. In particular, we approximate the density of $Y|X$ with a multivariate Gaussian mixture regression model. If observation $i$ belongs to cluster $r$, we assume that there exists $\beta_r \in \mathbb{R}^{p \times q}$ such that $$y_i = \beta_r^t x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}_q(0, \Sigma_r).$$ Thus, the random response variable $Y \in \mathbb{R}^q$ depends on a set of explanatory variables, written $X \in \mathbb{R}^p$, through a regression-type model. Let us state the assumptions more precisely.
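• The variables $Y_i|X_i$ are independent, for all $i = 1, \ldots, n$;
• $Y_i | X_i = x_i \sim s_\xi(y|x_i)\,dy$, with $$s_\xi(y|x) = \sum_{r=1}^k \frac{\pi_r}{(2\pi)^{q/2} \det(\Sigma_r)^{1/2}} \exp\left( -\frac{(y - \beta_r^t x)^t \Sigma_r^{-1} (y - \beta_r^t x)}{2} \right),$$ where $\xi = (\pi_1, \ldots, \pi_k, \beta_1, \ldots, \beta_k, \Sigma_1, \ldots, \Sigma_k)$, the $\pi_r$ are positive and sum to one, $\beta_r \in \mathbb{R}^{p \times q}$ and $\Sigma_r \in \mathbb{S}_q^{++}$; here $\mathbb{S}_q^{++}$ is the set of symmetric positive definite matrices on $\mathbb{R}^q$.
We want to estimate the conditional density function $s_\xi$ from the observations. For all $r \in \{1, \ldots, k\}$, $\beta_r$ is the matrix of regression coefficients and $\Sigma_r$ is the covariance matrix of mixture component $r$. The $\pi_r$ are the mixture proportions. For all $r \in \{1, \ldots, k\}$ and all $z \in \{1, \ldots, q\}$, $\beta_{r,z}^t x = \sum_{j=1}^p \beta_{r,j,z} x_j$ is the $z$th component of the mean of mixture component $r$ for the conditional density $s_\xi(\cdot|x)$.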
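As an illustration, the conditional density of such a Gaussian mixture regression model can be evaluated numerically as follows. This is a minimal sketch assuming the convention $\beta_r \in \mathbb{R}^{p \times q}$ with component mean $\beta_r^t x$; the function names and the parameter values are hypothetical.

```python
import numpy as np

def gaussian_pdf(y, mean, cov):
    """Density of the q-dimensional Gaussian N(mean, cov) evaluated at y."""
    q = y.shape[0]
    diff = y - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** q * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def mixture_regression_density(y, x, pis, betas, sigmas):
    """Evaluate sum_r pi_r * N(y; beta_r^t x, Sigma_r), with each beta_r of shape (p, q)."""
    return sum(pi * gaussian_pdf(y, beta.T @ x, sigma)
               for pi, beta, sigma in zip(pis, betas, sigmas))

# Hypothetical two-component model with p = 2 covariates and q = 2 responses.
pis = [0.4, 0.6]
betas = [np.array([[1.0, 0.0], [0.0, 1.0]]),
         np.array([[-1.0, 0.0], [0.0, -1.0]])]
sigmas = [np.eye(2), 0.5 * np.eye(2)]
x = np.array([1.0, 2.0])
y = np.array([1.0, 2.0])
print(mixture_regression_density(y, x, pis, betas, sigmas))
```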
A variable indexed by $(j, z) \in \{1, \ldots, p\} \times \{1, \ldots, q\}$ is said to be irrelevant if, for each $r \in \{1, \ldots, k\}$, $\beta_{r,j,z} = 0$. A variable is relevant if it is not irrelevant. A model is said to be sparse if it has only a few relevant variables.
We denote by $x^{[J]}$ the restriction of $x$ to $J$, and by $S_{(k,J)}$ the model with $k$ components and with $J$ as its set of relevant variables; this is the model referred to as (2). It is the main model used in this paper. Nevertheless, to deal with high-dimensional data, we use the Lasso estimator to construct the set of relevant variables, and the choice of the regularization parameter is known to be a difficult problem. We propose to construct a model collection to address this problem.

2.2. The Lasso-MLE procedure. The procedure we propose, which is particularly interesting in high dimension, can be decomposed into three main steps.
The first step consists of constructing a collection of models $\{S_{(k,J)}\}_{(k,J) \in \mathcal{M}}$ in which the model $S_{(k,J)}$ is defined by equation (2), and the model collection is indexed by $\mathcal{M} = \mathcal{K} \times \mathcal{J}$. Here $\mathcal{K} \subset \mathbb{N}^*$ denotes the possible numbers of components, and $\mathcal{J}$ denotes a collection of subsets of $\{1, \ldots, p\} \times \{1, \ldots, q\}$.
To detect the relevant variables and construct the set $J$ in each model, we penalize the empirical contrast by an $\ell_1$-penalty on the mean parameters, proportional to $\|P_r \beta_r\|_1 = \sum_{j=1}^p \sum_{z=1}^q |(P_r \beta_r)_{j,z}|$, where $P_r^t P_r = \Sigma_r^{-1}$. This amounts to penalizing simultaneously the $\ell_1$-norm of the mean coefficients and small variances. Computing these estimators yields the sets of relevant variables. For a fixed number of mixture components $k \in \mathcal{K}$, denote by $G_k$ a grid of candidate regularization parameters. For a fixed parameter $\lambda \in G_k$, we can use an EM algorithm to compute the set of relevant variables. Then, varying $k \in \mathcal{K}$ and $\lambda \in G_k$, we construct the relevant variable sets $J_{(k,\lambda)}$. We denote by $\mathcal{J}$ the random collection of all these sets, $\mathcal{J} = \bigcup_{k \in \mathcal{K}} \bigcup_{\lambda \in G_k} J_{(k,\lambda)}$.
The second step consists of approximating the MLE $$\hat{s}^{(k,J)} = \operatorname{argmin}_{t \in S_{(k,J)}} \left\{ -\frac{1}{n} \sum_{i=1}^n \log t(y_i|x_i) \right\}$$ using an EM algorithm for each model $(k, J) \in \mathcal{M}$.
The third step is devoted to model selection. We obtain a model collection, and we need to choose the best model in it. Because we do not have access to $s_0$, we cannot take the one which minimizes the risk. Theorem 4.1 solves this problem: it provides a penalty that leads to an oracle inequality. Thus, even without access to $s_0$, we know that we can do almost as well as the oracle.
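The first step, collecting the active sets obtained along a grid of regularization parameters, can be sketched in a deliberately simplified setting. The sketch below is not the EM-based computation used for mixtures: it assumes a single component, a scalar response and an orthonormal design, for which the Lasso solution of $\frac{1}{2}\|y - Xb\|_2^2 + \lambda\|b\|_1$ is the soft-thresholding of $X^t y$, so that variable $j$ is active exactly when $|(X^t y)_j| > \lambda$. The function names, the data and the grid are hypothetical.

```python
import numpy as np

def active_set_orthonormal(X, y, lam):
    """Active set of the Lasso when X has orthonormal columns (soft-thresholding of X^t y)."""
    z = X.T @ y
    return frozenset(np.flatnonzero(np.abs(z) > lam).tolist())

def collect_active_sets(X, y, grid):
    """Collection of the distinct active sets obtained along a grid of penalties."""
    return {active_set_orthonormal(X, y, lam) for lam in grid}

# Hypothetical example: identity design, so that X^t y = y.
X = np.eye(4)
y = np.array([3.0, -2.0, 0.5, 0.0])
grid = [0.1, 1.0, 2.5, 4.0]
collection = collect_active_sets(X, y, grid)
for J in sorted(collection, key=len):
    print(sorted(J))
```

Each distinct active set then indexes one model of the collection, on which the maximum likelihood estimator is computed in the second step.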

2.3. Why refit the Lasso estimator? In order to illustrate our procedure, we simulate multivariate data for which the restricted eigenvalue condition is not satisfied, and run our procedure on them. We consider an extension of the model studied in Section 6.3 of the article by Giraud et al. [10]. That model is a linear regression with a scalar response which does not satisfy the restricted eigenvalue condition. We then define different classes, to get a finite mixture regression model which does not satisfy the restricted eigenvalue condition, and extend the dimension to handle a multivariate response. We can then compare the result of our procedure with the Lasso, to illustrate the oracle inequality we have obtained. Let us make the model precise. Let $x^{(1)}, x^{(2)}, x^{(3)}$ be three vectors of $\mathbb{R}^n$ defined by […] and, for $4 \le j \le n$, let $x^{(j)}$ be the $j$th vector of the canonical basis of $\mathbb{R}^n$. We take a sample of size $n = 20$, and vectors of size $p = m = 10$. We consider two classes, each of them defined by $\beta_{j,z,1} = 10$ and $\beta_{j,z,2} = -10$ for $j \in \{1, 2\}$, $z \in \{1, \ldots, 10\}$. Moreover, the covariance of the noise in each class is a diagonal matrix with diagonal coefficients equal to 0.01. We run our procedure on this model, and compare it with the Lasso procedure without refitting. We compute the model selected by the slope heuristic over the model collection constructed by the Lasso estimator. Figure 1 shows the boxplots of each procedure over 20 runs. The Kullback-Leibler divergence is computed over a sample of size 5000.
We can see that refitting after variable selection by the Lasso leads to a better estimation, according to the Kullback-Leibler loss.
3. An oracle inequality for the Lasso-MLE estimator
Let us denote the model collection constructed by the Lasso-MLE procedure by $\mathcal{S} = (S_{(k,J)})_{(k,J) \in \mathcal{M}^L}$. The model $S_{(k,J)}$ is defined in (2), and we write $\mathcal{M}^L = \mathcal{K} \times \mathcal{J}^L$, with $\mathcal{J}^L$ a random subcollection of $\mathcal{P}(\{1, \ldots, p\} \times \{1, \ldots, q\})$ constructed by the Lasso.
Remark 3.1. Note that in this paper the set of active variables is determined by the Lasso. Nevertheless, analogous results could be obtained with any other tool used to construct this set. We could work with any random subcollection of $\mathcal{P}(\{1, \ldots, p\} \times \{1, \ldots, q\})$, a controlled size being required in the high-dimensional case.
Theorem 3.2. Let $(x_i, y_i)_{i=1,\ldots,n}$ be the observations, with unknown conditional density $s_0$. Let $S_{(k,J)}$ be as defined in (2). We denote by $\mathcal{M}^L$ a random subcollection of $\mathcal{M}$. For $(k, J) \in \mathcal{M}^L$, denote by $S^B_{(k,J)}$ the model defined in (3).
Consider the maximum likelihood estimator […] and let $\tau > 0$ be such that $\bar{s} \ge e^{-\tau} s_0$. Let $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$, and suppose that there exists an absolute constant $\kappa > 0$ such that, for all $(k, J) \in \mathcal{M}$, […]. Then the estimator $\hat{s}^{(\hat{k},\hat{J})}$, with […], satisfies […] for some absolute positive constants $C_1$ and $C_2$.
This result can be compared with the oracle inequality obtained in [18]. Indeed, under the restricted eigenvalue condition (this assumption is explained in detail in Bühlmann and van de Geer's book [7]) and with a fixed design, they obtain an oracle inequality for the Lasso estimator in the finite mixture regression model, with scalar response and high-dimensional regressors. We obtain a similar result for the Lasso-MLE estimator: the same type of inequality as for comparable estimators. Moreover, our procedure works in a more general framework, without any assumption on the design.

Tools for proof
In this section, we present the tools needed to understand the proof. First, we present a general theorem for model selection in regression among a random collection. Then, in Section 4.2, we present the proof of this theorem, and in the next subsection we explain how to use the main theorem to obtain the oracle inequality. All details are available in the Appendix.
4.1. General theory of model selection with the maximum likelihood estimator. To obtain an oracle inequality for our clustering procedure, we have to use a general model selection theorem. Because the model collection constructed by our procedure is random, owing to the Lasso estimator which selects variables in a data-dependent way, we have to generalize existing theorems. We begin with some general theory of model selection.
Before stating the general theorem, we discuss the assumptions. First, we impose a structural assumption. It is a bracketing entropy condition on the model $S_m$ with respect to the tensorized Hellinger divergence $d_H^{2\otimes n}(s, t) = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n d_H^2\big(s(\cdot|x_i), t(\cdot|x_i)\big)\right]$. A bracket $[t^-, t^+]$ containing $s$ is a pair of functions such that, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $t^-(y, x) \le s(y|x) \le t^+(y, x)$. The bracketing entropy $H_{[\cdot]}(\delta, S, d_H^{\otimes n})$ of a set $S$ is defined as the logarithm of the minimum number of brackets $[t^-, t^+]$ of width $d_H^{\otimes n}(t^-, t^+)$ smaller than $\delta$ such that every function of $S$ belongs to one of these brackets.
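Assumption (H$_m$). There is a non-decreasing function $\phi_m$ such that $\delta \mapsto \frac{1}{\delta}\phi_m(\delta)$ is non-increasing on $(0, +\infty)$ and, for every $\sigma \in \mathbb{R}_+$ and every $s_m \in S_m$, $$\int_0^\sigma \sqrt{H_{[\cdot]}\big(\delta, S_m(s_m, \sigma), d_H^{\otimes n}\big)}\, d\delta \le \phi_m(\sigma),$$ where $S_m(s_m, \sigma) = \{t \in S_m : d_H^{\otimes n}(t, s_m) \le \sigma\}$. The model complexity $D_m$ is then defined as $n\sigma_m^2$, with $\sigma_m$ the unique root of $\frac{1}{\sigma}\phi_m(\sigma) = \sqrt{n}\,\sigma$ (4). Note that the model complexity depends on the bracketing entropies not of the global models $S_m$ but of smaller localized sets. This is a weaker assumption.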
For technical reasons, a separability assumption is also required.
Assumption (Sep$_m$). […], $\log(t_k(y|x))$ goes to $\log(t(y|x))$ as $k$ goes to infinity.
We also need an information-theoretic assumption on our collection. We assume the existence of a Kraft-type inequality for the collection.
Assumption (K). There is a family $(x_m)_{m \in \mathcal{M}}$ of non-negative numbers such that $\sum_{m \in \mathcal{M}} e^{-x_m} \le \Sigma < +\infty$.
The difference with Cohen and Le Pennec's theorem is that we consider a random collection of models $\hat{\mathcal{M}}$, included in the whole collection $\mathcal{M}$. In our procedure, we deal with high-dimensional models, and we cannot consider all the models: we have to restrict ourselves to a smaller subcollection of models.
We can now state our main general theorem.
Theorem 4.1. […] and let $\tau > 0$ be such that […]. Introduce $(S_m)_{m \in \hat{\mathcal{M}}}$, some random subcollection of $(S_m)_{m \in \mathcal{M}}$. Consider the collection $(\hat{s}_m)_{m \in \hat{\mathcal{M}}}$ of $\eta$-log-likelihood minimizers in $S_m$, satisfying, for all $m \in \hat{\mathcal{M}}$, $$-\frac{1}{n} \sum_{i=1}^n \log \hat{s}_m(y_i|x_i) \leq \inf_{s_m \in S_m} \left( -\frac{1}{n} \sum_{i=1}^n \log s_m(y_i|x_i) \right) + \eta.$$ Then, for any $\rho \in (0, 1)$ and any $C_1 > 1$, there are two constants $\kappa_0$ and $C_2$ depending only on $\rho$ and $C_1$ such that, as soon as, for every index $m \in \mathcal{M}$, […] with $\kappa > \kappa_0$, where the model complexity $D_m$ is defined in (4), the penalized likelihood estimate $\hat{s}_{\hat{m}}$, with $\hat{m} \in \hat{\mathcal{M}}$ such that […], satisfies […].
Obviously, one of the models minimizes the right-hand side. Unfortunately, there is no way to know which one without knowing $s_0$. Hence, this oracle model cannot be used to estimate $s_0$. We nevertheless propose a data-driven strategy to select an estimate among the collection of estimates $\{\hat{s}_m\}_{m \in \hat{\mathcal{M}}}$ according to a selection rule that performs almost as well as if we had known this oracle, up to the absolute constant $C_1$. Using simply the log-likelihood of the estimate in each model as a criterion is not sufficient: it underestimates the true risk of the estimate, and this leads to choosing models that are too complex. By adding an adapted penalty $\mathrm{pen}(m)$, one hopes to compensate for both the variance term and the bias term between $\frac{1}{n}\sum_{i=1}^n -\log\frac{\hat{s}_m(y_i|x_i)}{s_0(y_i|x_i)}$ and $\inf_{s_m \in S_m} \mathrm{KL}_\lambda^{\otimes n}(s_0, s_m)$. For a given choice of $\mathrm{pen}(m)$, the best model $S_{\hat{m}}$ is chosen as the one whose index is an almost minimizer of the penalized $\eta$-log-likelihood.
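The selection rule can be sketched numerically. The example below is only an illustration: it assumes that the minimized negative log-likelihood and a complexity $D_m$ are already available for each model, and it uses the hypothetical penalty shape $\mathrm{pen}(m) = \kappa D_m$ with a hypothetical constant $\kappa$, which is a simplification of the penalty condition of the theorem. All names and numbers are made up.

```python
import numpy as np

def select_model(neg_loglik, dims, n, kappa):
    """Index minimizing (negative log-likelihood + pen(m)) / n, with pen(m) = kappa * D_m."""
    neg_loglik = np.asarray(neg_loglik, dtype=float)
    dims = np.asarray(dims, dtype=float)
    crit = neg_loglik / n + kappa * dims / n
    return int(np.argmin(crit)), crit

# Hypothetical collection of three models of increasing complexity.
neg_loglik = [120.0, 100.0, 98.0]   # minimized negative log-likelihoods (sums over the sample)
dims = [2, 5, 20]                   # model complexities D_m
n = 50

m_pen, crit_pen = select_model(neg_loglik, dims, n, kappa=1.0)
m_raw, crit_raw = select_model(neg_loglik, dims, n, kappa=0.0)
print("selected with penalty:   ", m_pen)
print("selected without penalty:", m_raw)
```

Without the penalty, the most complex model is always favoured when the models are nested, which is the overfitting behaviour the penalty is meant to correct.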
Let us comment on assumption (5). If $s_0$ is bounded, with compact support, this assumption is satisfied. It is also satisfied in other, more specific cases. It is therefore not a strong assumption, and it is needed to control the random family.
This theorem holds for any model collection, as long as assumptions (H$_m$), (K) and (Sep$_m$) are satisfied. In the following, we specify the procedure we propose to cluster high-dimensional data, and check that these assumptions are satisfied. Nevertheless, this theorem is not specific to our context, and could be used for other problems.
Let $\nu_n^{\otimes n}(g)$ denote the recentred process $P_n^{\otimes n}(g) - P^{\otimes n}(g)$. By concavity of the logarithm, $kl(\hat{s}_{m'}) \ge jkl(\hat{s}_{m'})$, and then $$P^{\otimes n}(jkl(\hat{s}_{m'})) - \nu_n^{\otimes n}(kl(\bar{s}_m)) \le P^{\otimes n}(kl(\bar{s}_m)) + \frac{\mathrm{pen}(m)}{n} - \nu_n^{\otimes n}(jkl(\hat{s}_{m'})) + […] \qquad (8)$$ which is equivalent to […]. Mimicking the proof of Cohen and Le Pennec [8], we obtain that, except on a set of probability less than $e^{-x_{m'} - x}$, for all $x$ and all $y_{m'} > \sigma_{m'}$, under assumption (H$_m$), there exist absolute constants $\kappa$ […]. To obtain this inequality we use assumptions (Sep$_m$) and (H$_m$). This control is derived from maximal inequalities, described in [11].
Our purpose is now to control $\nu_n^{\otimes n}(kl(\bar{s}_m))$. This is the difference with the theorem of Cohen and Le Pennec: we work with a random subcollection $\mathcal{M}^L$ of $\mathcal{M}$.
By definition of $kl$ and $\nu_n^{\otimes n}$, […]. We want to apply Bernstein's inequality, which is recalled in the Appendix. If we denote by $Z_i$ the random variable […], we need to control the moments of $Z_i$ to apply Bernstein's inequality.
We prove this lemma in Appendix 6.2. Because $e^{-\tau} + \tau - 1 \le 2\tau$ for all $\tau \ge A$ […]. For $\tau \in (0, A]$, because this function is continuous and equivalent to 2 at 0, there exists $B > 0$ such that […]; then, for all $u > 0$, except on a set with probability less than $e^{-u}$, $$\nu_n^{\otimes n}(kl(\bar{s}_m)) \le \sqrt{2vu} + cu. \qquad (9)$$ Thus, for all $z > 0$ and all $u > 0$, except on a set with probability less than $e^{-u}$, […]. We apply this bound to $u = x + x_m + x_{m'}$. We get that, except on a set with probability less than $e^{-(x + x_m + x_{m'})}$, using that $a^2 + b^2 \ge a^2$, from inequality (9), […] and, from inequality (10), […] where we have chosen […] with $\theta > 1$ to be fixed later, and […] with $\beta > 0$ to be fixed later.
Coming back to inequality (8), […]. Recall that $\bar{s}_m$ is chosen such that […]. Put $\kappa(\beta) = 1 + (\beta + \beta^2)$ and let $\epsilon_1 > 0$; we define […], and put $\kappa_2 = \frac{C_\rho \epsilon_1}{\kappa_0}$. We get that […]. Since $\tau \le 1 \vee \tau$, if we choose $\beta$ such that $(\beta + \beta^2)(\delta/2 + 1) = \alpha\theta_1^{-2}\beta^{-2}$, and putting $\kappa_1 = \alpha\gamma^{-2}(\beta^{-2} + 1)$, since $1 \le 1 \vee \tau$, using the expressions of $y_{m'}$ and $z_{m,m'}$, we get that […]. Now, assuming that $\kappa_1 \ge \kappa$ in condition (6), we get […].
We then have, simultaneously for all $m \in \mathcal{M}$ and all $m' \in \mathcal{M}(m)$, except on a set with probability less than $\Sigma^2 e^{-x}$, […]. It is in particular satisfied for all $m \in \hat{\mathcal{M}}$ and $m' \in \hat{\mathcal{M}}(m)$, and, since $\hat{m} \in \hat{\mathcal{M}}(m)$ for all $m \in \hat{\mathcal{M}}$, we deduce that, except on a set with probability less than $\Sigma^2 e^{-x}$, […]. By integrating over all $x > 0$, because for any non-negative random variable $Z$ and any $a > 0$, $\mathbb{E}(Z) = a \int_{z \ge 0} P(Z > az)\,dz$, we obtain that […]. As $\delta_{KL}$ can be chosen arbitrarily small, this implies that […].

4.3. Sketch of the proof of the oracle inequality (Theorem 3.2). To prove Theorem 3.2, we have to apply Theorem 4.1. Our model therefore has to satisfy all of its assumptions. Assumption (Sep$_m$) holds when we consider Gaussian densities. If $s_0$ is bounded, with compact support, assumption (5) is satisfied. It is also true in other particular cases. It remains to check assumption (H$_m$) and assumption (K). Here we present only the main steps of the proof of these assumptions. All the details are in the Appendix.
[…] $d\epsilon$ for all $\sigma > 0$. It might be better to consider a more local version of the integrated square root entropy, but the global one is enough in this case to define the penalty. As done in Cohen and Le Pennec [8], we can decompose the entropy as […].
Computation for the proportions. We can apply a result proved by Wasserman and Genovese in [9] to bound the entropy for the proportions. We get that […].
Computation for the Gaussians. The family […] is an $\epsilon$-bracket covering for $\mathcal{F}_J$, where $u_{j,z}$ is a net for the mean, $R$ is the number of parameters needed to cover the whole variance set, and $\delta =$ […]. We obtain that […] and then we get […].
Determination of a function $\phi$. We can take […]. This function is non-decreasing, and $\sigma \mapsto \frac{1}{\sigma}\phi_{(k,J)}(\sigma)$ is non-increasing. The root $\sigma_{(k,J)}$ is the solution of $\phi_{(k,J)}(\sigma_{(k,J)}) = \sqrt{n}\,\sigma_{(k,J)}^2$. With the expression of $\phi_{(k,J)}$, we get […]. Nevertheless, we know that $\sigma^* =$ […].

Assumption (K). We want to group models by their dimension. Then we have $\sum_{(k,J)} e^{-x_{(k,J)}} \le 2$.
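Grouping models by dimension can be illustrated numerically on subsets of a set of $p$ variables. The sketch below is an assumption-laden illustration, not the weights used in the article: it takes the hypothetical weights $x_J = D\log(ep/D) + D$ for every subset $J$ of size $D$, and relies on the classical bound $\binom{p}{D} \le (ep/D)^D$, so that the subsets of size $D$ contribute at most $e^{-D}$ and the whole sum is at most $1/(e-1)$.

```python
import math

def log_binom(p, d):
    """Logarithm of the binomial coefficient C(p, d), computed through log-gamma."""
    return math.lgamma(p + 1) - math.lgamma(d + 1) - math.lgamma(p - d + 1)

def kraft_sum(p):
    """Sum of exp(-x_J) over non-empty subsets J of {1, ..., p}, grouped by size D.

    Hypothetical weights: x_J = D * log(e * p / D) + D for every subset of size D.
    """
    total = 0.0
    for d in range(1, p + 1):
        x_d = d * math.log(math.e * p / d) + d
        total += math.exp(log_binom(p, d) - x_d)  # C(p, d) subsets share the weight x_d
    return total

for p in (5, 50, 500):
    print(p, kraft_sum(p), "<=", 1.0 / (math.e - 1.0))
```

The point of the grouping is that the weight only depends on the dimension, so the number of models of each dimension is what has to be controlled.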

Acknowledgment
I am grateful to Pascal Massart for suggesting that I study this problem, and for stimulating discussions.

Appendix: technical results
In this appendix, we give more details for the proofs.
6.1. Bernstein's lemma.
Lemma 6.1 (Bernstein's inequality). Let $(X_1, \ldots, X_n)$ be independent real-valued random variables. Assume that there exist some positive numbers $v$ and $c$ such that […]. Then, for every positive $x$, […].
6.2. Proof of Lemma 4.2. This proof is adapted from Meynet's thesis [15]. First, let us give some bounds on functions: […]. Then, for all $0 < x < e^\tau$, we get […]. To prove this, we have to show that $y \mapsto \phi(y)/y^2$ is non-decreasing. We omit the proof here. We want to apply this inequality in order to derive Lemma 4.2. As $\log$ […] and we can apply the previous inequality to $s_0/\bar{s}_m$. Indeed, […]. Integrating with respect to the density $\bar{s}_m$, we get that […]. This concludes the proof.
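The type of deviation bound provided by Bernstein's inequality can be checked by simulation. The sketch below uses the classical form for bounded variables, which is an assumption with respect to the moment condition of Lemma 6.1: if the $X_i$ are independent, centred and satisfy $|X_i| \le b$, then with $v = \sum_i \mathbb{E}(X_i^2)$ and $c = b/3$, the sum exceeds $\sqrt{2vx} + cx$ with probability at most $e^{-x}$. The uniform distribution, the sample sizes and the seed are hypothetical choices.

```python
import numpy as np

def bernstein_threshold(v, c, x):
    """Deviation level sqrt(2 v x) + c x appearing in Bernstein's inequality."""
    return np.sqrt(2.0 * v * x) + c * x

rng = np.random.default_rng(0)
n, n_rep, x = 100, 20000, 3.0

# Independent Uniform(-1, 1) variables: centred, bounded by b = 1, variance 1/3.
samples = rng.uniform(-1.0, 1.0, size=(n_rep, n))
sums = samples.sum(axis=1)

v = n / 3.0       # sum of the variances
c = 1.0 / 3.0     # c = b / 3 with b = 1
threshold = bernstein_threshold(v, c, x)
freq = float(np.mean(sums >= threshold))

print("empirical exceedance frequency:", freq)
print("Bernstein bound exp(-x):       ", np.exp(-x))
```

The empirical frequency is typically far below the bound here, since the inequality is not tight for sums that are close to Gaussian.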
6.3. Determination of a net for the mean and the variance.
• Step 1: construction of a net for the variance
Let $\epsilon \in (0, 1]$, and $\delta =$ […], where $R$ is chosen to cover the whole range. We want that […]. We want $R$ to be an integer, so $R =$ […]. We get a net for the variance. We […], close to $\Sigma$ (and deterministic, independent of the values of $\Sigma$), where $i$ is a permutation such that […], and that if $\Sigma$ is fixed, $\Sigma = \mathrm{diag}(\Sigma_1^2, \ldots, \Sigma_q^2)$.

• Step 2: construction of a net for the mean vectors
We select only the active variables detected by the Lasso.
Then, we get a net for the mean vectors.
- Proof that $[l, u]$ is an $\epsilon$-bracket
We will work with the Hellinger distance.