Simultaneous estimation of the mean and the variance in heteroscedastic Gaussian regression

Let $Y$ be a Gaussian vector of $\mathbb{R}^n$ with mean $s$ and diagonal covariance matrix $\Gamma$. Our aim is to estimate both $s$ and the entries $\sigma_i=\Gamma_{i,i}$, $i=1,\dots,n$, on the basis of two independent copies of $Y$. Our approach is free of any prior assumption on $s$ but requires that we know some upper bound $\gamma$ on the ratio $\max_i\sigma_i/\min_i\sigma_i$. For example, the choice $\gamma=1$ corresponds to the homoscedastic case, where the components of $Y$ are assumed to share a common (unknown) variance. Conversely, the choice $\gamma>1$ corresponds to the heteroscedastic case, where the variances of the components of $Y$ are allowed to vary within some range. Our estimation strategy is based on model selection. We consider a family $\{S_m\times\Sigma_m,\ m\in\mathcal{M}\}$ of parameter sets where $S_m$ and $\Sigma_m$ are linear spaces. To each $m\in\mathcal{M}$, we associate a pair of estimators $(\hat{s}_m,\hat{\sigma}_m)$ of $(s,\sigma)$ with values in $S_m\times\Sigma_m$. We then design a model selection procedure that selects some $\hat{m}$ in $\mathcal{M}$ such that the Kullback risk of $(\hat{s}_{\hat{m}},\hat{\sigma}_{\hat{m}})$ is as close as possible to the minimum of the Kullback risks over the family of estimators $\{(\hat{s}_m,\hat{\sigma}_m),\ m\in\mathcal{M}\}$. From this, we derive uniform rates of convergence for the estimator $(\hat{s}_{\hat{m}},\hat{\sigma}_{\hat{m}})$ over H\"{o}lderian balls. Finally, we carry out a simulation study to illustrate the performance of our estimators in practice.

For this, we introduce a collection $\mathcal{F} = \{S_m \times \Sigma_m,\ m \in \mathcal{M}\}$ of products of linear subspaces of $\mathbb{R}^n$ indexed by a finite or countable set $\mathcal{M}$. In the sequel, these products are called models and, for any $m \in \mathcal{M}$, we denote by $D_m$ the dimension of $S_m \times \Sigma_m$. To each $m \in \mathcal{M}$, we can associate a pair of independent estimators $(\hat{s}_m, \hat{\sigma}_m) = (\hat{s}_m(Y^{[1]}), \hat{\sigma}_m(Y^{[2]}))$ with values in $S_m \times \Sigma_m$, made precise later. The Kullback risk of $(\hat{s}_m, \hat{\sigma}_m)$ is given by $\mathbb{E}[K(P_{s,\sigma}, P_{\hat{s}_m,\hat{\sigma}_m})]$ and is of the order of the sum of two terms,
$$\inf_{(t,\tau) \in S_m \times \Sigma_m} K(P_{s,\sigma}, P_{t,\tau}) + D_m. \qquad (1.2)$$
The first term, called the bias term, measures the capacity of $S_m \times \Sigma_m$ to approximate the true value of $(s, \sigma)$. The second, called the variance term, is proportional to the dimension of the model and corresponds to the amount of noise that has to be controlled. To guarantee a small risk, these two terms have to be small simultaneously. Indeed, using the Kullback risk as a quality criterion, a good model is one that minimizes (1.2) over $\mathcal{F}$. Clearly, the choice of such a model depends on the unknown pair $(s, \sigma)$, which makes good models unavailable to us. So, we have to construct a procedure that selects an index $\hat{m} = \hat{m}(Y^{[1]}, Y^{[2]}) \in \mathcal{M}$ depending on the data only, such that $\mathbb{E}[K(P_{s,\sigma}, P_{\hat{s}_{\hat{m}},\hat{\sigma}_{\hat{m}}})]$ is close to the minimal risk
$$R(s,\sigma,\mathcal{F}) = \inf_{m \in \mathcal{M}} \mathbb{E}\left[K(P_{s,\sigma}, P_{\hat{s}_m,\hat{\sigma}_m})\right].$$
The art of model selection is precisely to provide a procedure based solely on the observations that achieves this. The classical way consists in minimizing a penalized empirical criterion that is stochastically close to the risk. Considering the likelihood function with respect to $Y^{[1]}$, we choose $\hat{m}$ as a minimizer over $\mathcal{M}$ of the penalized likelihood criterion
$$-L(\hat{s}_m, \hat{\sigma}_m) + \mathrm{pen}(m), \qquad (1.3)$$
where $L(\hat{s}_m, \hat{\sigma}_m)$ denotes the log-likelihood of the pair $(\hat{s}_m, \hat{\sigma}_m)$ with respect to $Y^{[1]}$ and pen is a penalty function mapping $\mathcal{M}$ into $\mathbb{R}_+ = [0, \infty)$. In this work, we give a form for the penalty that yields a pair of estimators $(\hat{s}_{\hat{m}}, \hat{\sigma}_{\hat{m}})$ with a Kullback risk close to $R(s, \sigma, \mathcal{F})$.

Our approach is free of any prior assumption on $s$ but requires that we know some upper bound $\gamma \geq 1$ on the ratio $\sigma^*/\sigma_*$, where $\sigma^*$ (resp. $\sigma_*$) is the maximum (resp. minimum) of the $\sigma_i$'s. The knowledge of $\gamma$ allows us to deal with two different cases in a unified way. First, "$\gamma = 1$" corresponds to the homoscedastic case where the components of $Y^{[1]}$ and $Y^{[2]}$ are independent with a common variance (i.e. $\sigma_i \equiv \sigma$), which may be unknown. On the other side, "$\gamma > 1$" means that the $\sigma_i$'s can be distinct and are allowed to vary within some range. Such non-constancy of the variances of the observations is known as the heteroscedastic case. Heteroscedasticity arises in many practical situations in which the assumption that the variances of the data are equal is debatable.
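Throughout, $K(P_{s,\sigma}, P_{t,\tau})$ denotes the Kullback-Leibler divergence between the Gaussian distributions with means $s$, $t$ and diagonal covariance matrices $\mathrm{diag}(\sigma)$, $\mathrm{diag}(\tau)$. For convenience, recall its standard closed form (a classical identity, not a formula specific to this paper):
$$K(P_{s,\sigma}, P_{t,\tau}) = \frac{1}{2} \sum_{i=1}^{n} \left[ \log\frac{\tau_i}{\sigma_i} + \frac{\sigma_i}{\tau_i} - 1 + \frac{(s_i - t_i)^2}{\tau_i} \right].$$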
The field of model selection has developed considerably over the last decades and it is beyond the scope of this paper to give an exhaustive historical review of the domain. The interested reader can find a good introduction to model selection in the first chapters of [16]. The first heuristics in the domain are due to Mallows [15] for the estimation of the mean in homoscedastic Gaussian regression with known variance. In a more general Gaussian framework with common known variance, Barron et al. [7] and Birgé and Massart ([9] and [10]) have designed adaptive model selection procedures to estimate the mean under quadratic risk. They provide non-asymptotic upper bounds for the risk of the selected estimator. When such a bound is of the order of the minimal risk among the model collection, the result is called an oracle inequality. Baraud [5] has generalized their results to homoscedastic statistical models with non-Gaussian noise admitting moments of order larger than 2 and a known variance. All these results remain true for a common unknown variance if some upper bound on it is assumed to be known. Of course, the larger this bound, the worse the results. Assuming that $\gamma$ is known does not imply the knowledge of such an upper bound.
In the homoscedastic Gaussian framework with unknown variance, Akaike proposed penalties for estimating the mean under quadratic risk (see [1], [2] and [3]). Replacing the variance by a particular estimator in his penalty term, Baraud [5] obtained oracle inequalities for more general noise than Gaussian and for polynomial model collections. Recently, Baraud, Giraud and Huet [6] have constructed penalties able to take into account the complexity of the model collection for estimating the mean under quadratic risk in the Gaussian homoscedastic model with unknown variance. They have also proved results for the estimation of the mean and the variance factor under Kullback risk. This problem is close to ours and corresponds to the case "$\gamma = 1$". A motivation for the present work was to extend their results to the heteroscedastic case "$\gamma > 1$" in order to get oracle inequalities by minimization of a penalized criterion such as (1.3). Assuming that the model collection is not too large, we obtain inequalities of the same flavor, up to a logarithmic factor, of the form
$$\mathbb{E}\left[K(P_{s,\sigma}, P_{\hat{s}_{\hat{m}},\hat{\sigma}_{\hat{m}}})\right] \leq C \inf_{m \in \mathcal{M}} \left\{ \inf_{(t,\tau) \in S_m \times \Sigma_m} K(P_{s,\sigma}, P_{t,\tau}) + D_m \log^{1+\epsilon} D_m \right\} + R, \qquad (1.4)$$
where $C$ and $R$ are positive constants depending in particular on $\gamma$, and $\epsilon$ is a positive parameter.
A non-asymptotic model selection approach to estimation problems in the heteroscedastic Gaussian model has been studied in only a few papers. In Chapter 6 of [4], Arlot estimates the mean in a heteroscedastic regression framework, but for bounded data. For polynomial model collections, he uses resampling penalties to get oracle inequalities for the quadratic risk. Closer to our problem, Comte and Rozenholc [12] have estimated the pair $(s, \sigma)$. Their estimation procedure is different from ours, which makes the theoretical results difficult to compare. For instance, they proceed in two steps (one for the mean and one for the variance) and give risk bounds separately for each parameter in $L^2$-norm, while we estimate the pair $(s, \sigma)$ directly under Kullback risk.
As described in [8], one of the main advantages of inequalities such as (1.4) is that they allow us to derive uniform convergence rates for the risk of the selected estimator over many classes of smoothness. Considering a collection of histogram models, we provide convergence rates over Hölderian balls. Indeed, for $\alpha_1, \alpha_2 \in (0, 1]$, if $s$ is $\alpha_1$-Hölderian and $\sigma$ is $\alpha_2$-Hölderian, we prove that the risk of $(\hat{s}_{\hat{m}}, \hat{\sigma}_{\hat{m}})$ converges at a rate of order
$$\left( \frac{n}{\log^{1+\epsilon} n} \right)^{-2\alpha/(2\alpha+1)},$$
where $\alpha = \min\{\alpha_1, \alpha_2\}$ is the worst regularity. To put this rate into perspective, we can consider the homoscedastic case with only one observation of $Y$. In this case, the optimal rate of convergence in the minimax sense is $n^{-2\alpha/(2\alpha+1)}$ and, up to a logarithmic loss, our rate is comparable to this one. To our knowledge, our results on non-asymptotic estimation of the mean and the variance in the heteroscedastic Gaussian model are new.
The paper is organized as follows. The main results are presented in section 2. In section 3, we carry out a simulation study in order to illustrate the performances of our estimators in practice. The last sections are devoted to the proofs and to some technical results.

Main results
First, we introduce the model collection, the estimators and the procedure. Next, we present the main results, whose proofs can be found in section 4. In the sequel, we consider the framework (1.1) and, for the sake of simplicity, we suppose that there exists an integer $k_n \geq 0$ such that $n = 2^{k_n}$.

Model collection and estimators
In order to estimate the mean and the variance, we consider linear subspaces of $\mathbb{R}^n$ constructed as follows. Let $\mathcal{M}$ be a countable or finite set. To each $m \in \mathcal{M}$, we associate a regular partition $p_m$ of $\{1, \dots, 2^{k_n}\}$ given by $|p_m| = 2^{k_m}$ consecutive blocks of equal size $2^{k_n - k_m}$. For any $I \in p_m$ and any $x \in \mathbb{R}^n$, we denote by $x|_I$ the vector of $\mathbb{R}^{n/|p_m|}$ with coordinates $(x_i)_{i \in I}$. Then, to each $m \in \mathcal{M}$, we also associate a linear subspace $E_m$ of $\mathbb{R}^{n/|p_m|}$ with dimension $1 \leq d_m \leq 2^{k_n - k_m}$. This set of pairs $(p_m, E_m)$ allows us to construct a model collection. Hereafter, we identify each $m \in \mathcal{M}$ with its corresponding pair $(p_m, E_m)$.
For any $m = (p_m, E_m) \in \mathcal{M}$, we introduce the subspace $S_m \subset \mathbb{R}^n$ of the $E_m$-piecewise vectors,
$$S_m = \{ t \in \mathbb{R}^n \ : \ t|_I \in E_m \text{ for all } I \in p_m \},$$
and the subspace $\Sigma_m \subset \mathbb{R}^n$ of the piecewise constant vectors,
$$\Sigma_m = \{ \tau \in \mathbb{R}^n \ : \ \tau|_I \text{ is constant for all } I \in p_m \}.$$
The dimension of $S_m \times \Sigma_m$ is denoted by $D_m = |p_m|(d_m + 1)$. To estimate the pair $(s, \sigma)$, we only deal with models $S_m \times \Sigma_m$ constructed in such a way. More precisely, we consider a collection of products of linear subspaces $\mathcal{F} = \{S_m \times \Sigma_m,\ m \in \mathcal{M}\}$, where $\mathcal{M}$ is a set of pairs $(p_m, E_m)$ as above. In the paper, we will often work under a hypothesis on the model collection, denoted by $(\mathrm{H}_\theta)$, which requires that, for some constant $\theta$, the dimension $d_m$ of $E_m$ remains small compared with the common block size $2^{k_n-k_m}$; in particular, it ensures $d_m \leq |I|/2$ for every $I \in p_m$. This hypothesis avoids handling models whose dimension is too large compared with the number of observations. Let $m \in \mathcal{M}$; we denote by $\pi_m$ the orthogonal projection onto $S_m$. We estimate $(s, \sigma)$ by a pair of independent estimators $(\hat{s}_m, \hat{\sigma}_m) \in S_m \times \Sigma_m$, with $\hat{s}_m = \pi_m Y^{[1]}$ and $\hat{\sigma}_m$ computed blockwise from $Y^{[2]}$. Thus, we get a collection of estimators $\{(\hat{s}_m, \hat{\sigma}_m),\ m \in \mathcal{M}\}$.
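To fix ideas, here is a minimal Python sketch of this construction in the histogram case ($E_m$ reduced to the constant vectors, $d_m = 1$). The function names are ours, and the normalization used for the variance estimator is an assumption rather than the paper's exact definition of $\hat{\sigma}_m$.

    import numpy as np

    def dyadic_blocks(n, k):
        # Split {0, ..., n-1} into 2**k consecutive blocks of equal size (n must be a power of 2).
        size = n // 2**k
        return [np.arange(j * size, (j + 1) * size) for j in range(2**k)]

    def fit_model(y1, y2, blocks):
        # Blockwise estimators on one model: the mean is fitted from y1 and the variance
        # from the independent copy y2.  The variance formula below is illustrative only.
        s_hat = np.empty_like(y1, dtype=float)
        sigma_hat = np.empty_like(y1, dtype=float)
        for I in blocks:
            s_hat[I] = y1[I].mean()                    # pi_m Y^[1] restricted to the block
            resid = y2[I] - y2[I].mean()               # residuals of Y^[2] on the block
            sigma_hat[I] = np.sum(resid**2) / len(I)   # hypothetical normalization by |I|
        return s_hat, sigma_hat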

Risk upper bound
We first study the risk on a single model to understand its order. Take an arbitrary $m \in \mathcal{M}$. We define a pair $(s_m, \sigma_m) \in S_m \times \Sigma_m$ for which easy computations show that the minimum of the Kullback-Leibler divergence $K(P_{s,\sigma}, P_{t,\tau})$ over $(t, \tau) \in S_m \times \Sigma_m$ is reached at $(s_m, \sigma_m)$. The next proposition allows us to compare this quantity with the Kullback risk of $(\hat{s}_m, \hat{\sigma}_m)$.
As announced in (1.2), this result shows that the Kullback risk of the pair $(\hat{s}_m, \hat{\sigma}_m)$ is of the order of the sum of a bias term $K(P_{s,\sigma}, P_{s_m,\sigma_m})$ and a variance term proportional to $D_m$. Thus, minimizing the Kullback risk $\mathbb{E}[K(P_{s,\sigma}, P_{\hat{s}_m,\hat{\sigma}_m})]$ over $m \in \mathcal{M}$ amounts to finding a model that realizes a compromise between these two terms.
Let pen be a non-negative function on $\mathcal{M}$. We choose $\hat{m} \in \mathcal{M}$ as a minimizer of the penalized criterion
$$\hat{m} \in \operatorname*{argmin}_{m \in \mathcal{M}} \left\{ -L(\hat{s}_m, \hat{\sigma}_m) + \mathrm{pen}(m) \right\}, \qquad (2.2)$$
where $L(\hat{s}_m, \hat{\sigma}_m)$ is the log-likelihood of $(\hat{s}_m, \hat{\sigma}_m)$ computed from $Y^{[1]}$. In the sequel, we denote by $(\tilde{s}, \tilde{\sigma}) = (\hat{s}_{\hat{m}}, \hat{\sigma}_{\hat{m}})$ the selected pair of estimators. It satisfies the following result: assuming that there exist $\delta, \epsilon > 0$ such that the penalty satisfies the lower bound (2.5), whose order is slightly larger than the dimension of the model (namely $D_m \log^{1+\epsilon} D_m$ up to constants), the risk bound (2.6) holds, where $R = R(\gamma, \theta, A, B, \epsilon, \delta)$ is a positive constant and $C$ is a positive constant that can be taken explicitly. The inequality (2.6) is close to an oracle inequality up to a logarithmic factor. Thus, with the penalty (2.5), the risk of the estimator provided by the criterion (1.3) is comparable to the minimal risk among the model collection $\mathcal{F}$.
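A compact Python sketch of the selection step follows, reusing the fit_model sketch given above. The penalty follows the order $D_m \log^{1+\epsilon} D_m$ announced for (2.5), but its multiplicative constant is illustrative only.

    import numpy as np

    def gaussian_log_likelihood(y, s_hat, sigma_hat):
        # Log-likelihood of y under N(s_hat, diag(sigma_hat)), sigma_hat being variances.
        return -0.5 * np.sum(np.log(2 * np.pi * sigma_hat) + (y - s_hat) ** 2 / sigma_hat)

    def select_model(y1, y2, models, eps=0.1):
        # models: dict mapping an index m to (blocks, D_m); fit_model is the sketch above.
        # Minimizes -L(s_hat_m, sigma_hat_m) + pen(m) over m, as in (2.2); the penalty
        # constant (here 1) is hypothetical.
        best_m, best_crit = None, np.inf
        for m, (blocks, D_m) in models.items():
            s_hat, sigma_hat = fit_model(y1, y2, blocks)
            pen = D_m * np.log(max(D_m, 2)) ** (1 + eps)
            crit = -gaussian_log_likelihood(y1, s_hat, sigma_hat) + pen
            if crit < best_crit:
                best_m, best_crit = m, crit
        return best_m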

Convergence rate
One of the main advantages of an inequality such as (2.6) is that it gives uniform convergence rates with respect to many well-known classes of smoothness. To illustrate this, we consider the particular case of regression on a fixed design. More precisely, in the framework (1.1), we suppose that
$$\forall\, 1 \leq i \leq n, \quad s_i = s_r(i/n) \quad \text{and} \quad \sigma_i = \sigma_r(i/n),$$
where $s_r$ and $\sigma_r$ are two unknown functions that map $[0,1]$ to $\mathbb{R}$.
In this section, we work with the normalized Kullback-Leibler divergence and, for any $\alpha \in (0,1]$ and any $L > 0$, we denote by $\mathcal{H}_\alpha(L)$ the space of the $\alpha$-Hölderian functions with constant $L$ on $[0,1]$,
$$\mathcal{H}_\alpha(L) = \left\{ f : [0,1] \to \mathbb{R} \ : \ |f(x) - f(y)| \leq L |x - y|^\alpha \ \text{for all } x, y \in [0,1] \right\}.$$
Moreover, we consider a model collection $\mathcal{F}_{PC}$ as described in section 2.
For each $m \in \mathcal{M}$, we take $S_m$ and $\Sigma_m$ as spaces of piecewise constant vectors on the partition $p_m$ (histogram-type models). The model collection that we consider is then
$$\mathcal{F}_{PC} = \{ S_m \times \Sigma_m,\ m \in \mathcal{M} \}.$$
Note that this collection satisfies (2.3) with $A = 1$ and $B = 0$. The following result gives a uniform convergence rate for $(\tilde{s}, \tilde{\sigma})$ over Hölderian balls.
The exponent $\alpha$ corresponds to the worst regularity between $s$ and $\sigma$. As we estimate the two functions simultaneously, it is not paradoxical to obtain a rate of convergence driven by the function that is the hardest to estimate. In order to put the rate given by (2.7) into perspective, we can consider the homoscedastic case with only one observation of $Y$. In this framework, it is known that, if $s$ belongs to $\mathcal{H}_{\alpha}(L)$, the optimal rate of convergence in the minimax sense is $n^{-2\alpha/(2\alpha+1)}$ for the estimation of $s$ under quadratic risk. Moreover, in the heteroscedastic case with one observation of $Y$, Wang et al. [18] have proved that the minimax rate of convergence for the estimation of the variance function is of order
$$\max\left\{ n^{-4\alpha_1},\ n^{-2\alpha_2/(2\alpha_2+1)} \right\}$$
when $s \in \mathcal{H}_{\alpha_1}(L_1)$ and $\sigma \in \mathcal{H}_{\alpha_2}(L_2)$. Thus, up to a logarithmic factor, the rate of convergence given by the previous result can be compared to these optimal rates.
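As a heuristic check of where this rate comes from (a sketch under the usual approximation bound for Hölderian functions by piecewise constant ones, not a computation reproduced from the paper), consider a regular partition into $K$ blocks. The bias term of (1.2) and the penalty (2.5), both normalized by $n$, then behave as
$$\inf_{(t,\tau) \in S_m \times \Sigma_m} \frac{1}{n} K(P_{s,\sigma}, P_{t,\tau}) \lesssim L^2 K^{-2\alpha} \qquad \text{and} \qquad \frac{\mathrm{pen}(m)}{n} \asymp \frac{K \log^{1+\epsilon} K}{n},$$
and the choice $K \asymp (n / \log^{1+\epsilon} n)^{1/(2\alpha+1)}$ balances the two terms, each being then of order $(n / \log^{1+\epsilon} n)^{-2\alpha/(2\alpha+1)}$.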
Simulation study

Throughout this section, we consider the model collection $\mathcal{F}_{PC}$ and we take $k_n = 10$. In the case of M1, we can note that the procedure chooses the "good" model, in the sense that if the pair $(s_r, \sigma_r)$ belongs to a model of $\mathcal{F}_{PC}$, this model is generally chosen by our procedure. Repeating the simulation 25,000 times with the framework of M1 shows that, with confidence higher than 99.9%, the probability of making this "good" choice is about $0.9985$ ($\pm 3 \times 10^{-5}$). Even if the mean does not belong to one of the $S_m$'s, the procedure recovers the homoscedastic nature of the observations in the case M2. Over 25,000 simulations with the framework induced by M2, the probability of choosing a homoscedastic model is around $0.99996$ ($\pm 1 \times 10^{-5}$) with a confidence of 99.9%. For more general frameworks such as M3 and M4, the estimators perform visually well and detect the changes in the behaviour of the mean and the variance functions.
The parameter $\gamma$ is supposed to be known and appears in the definition of the penalty. So we may naturally ask how important it is to the procedure.
In particular, what happens if we do not use the right value? The following table presents some estimates of the ratio between the Kullback risk of $(\tilde{s}, \tilde{\sigma})$ and the risk of the oracle, for several values of $\gamma$. Each estimated value has been obtained from 500 repetitions. The main part of the computation time is devoted to the estimation of the oracle's risk.

Table 1: Ratio between the Kullback risk of $(\tilde{s}, \tilde{\sigma})$ and the one of the oracle.
As expected, the best results are obtained for the correct value of $\gamma$. But it is interesting to observe that the ratio does not suffer too much from small errors in the knowledge of $\gamma$. For example, in the homoscedastic case M2, even if we had supposed that $\gamma = 3$, the ratio stays reasonably small.
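As an illustration of how such ratios can be estimated, here is a simple Python sketch of the Monte Carlo evaluation of the Kullback risk over 500 repetitions. The frameworks M1 to M4 are not specified above, so the mean and variance vectors are left as inputs, and the estimator argument stands for any selection procedure such as the select_model sketch given earlier.

    import numpy as np

    rng = np.random.default_rng(0)

    def kullback(s, sigma, t, tau):
        # K(P_{s,sigma}, P_{t,tau}) for Gaussian vectors with diagonal covariances.
        return 0.5 * np.sum(np.log(tau / sigma) + sigma / tau - 1.0 + (s - t) ** 2 / tau)

    def monte_carlo_risk(s, sigma, estimator, n_rep=500):
        # Estimates E[K(P_{s,sigma}, P_{s_hat,sigma_hat})] over n_rep independent draws
        # of the two copies Y^[1] and Y^[2]; s and sigma are the true mean and variances.
        risks = []
        for _ in range(n_rep):
            y1 = s + rng.normal(size=s.size) * np.sqrt(sigma)
            y2 = s + rng.normal(size=s.size) * np.sqrt(sigma)
            s_hat, sigma_hat = estimator(y1, y2)
            risks.append(kullback(s, sigma, s_hat, sigma_hat))
        return float(np.mean(risks))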

Proofs
For any $I \subset \{1, \dots, n\}$ and any $x, y \in \mathbb{R}^n$, we introduce the notations
$$\langle x, y \rangle_I = \sum_{i \in I} x_i y_i \quad \text{and} \quad \|x\|_I^2 = \langle x, x \rangle_I.$$
Let $m \in \mathcal{M}$; we will use several times in the proofs a fact, referred to as (4.1), valid for any $I \in p_m$.

Proof of Proposition 1
Let us expand the Kullback risk of $(\hat{s}_m, \hat{\sigma}_m)$. To upper bound the expectation, we apply Lemmas 10 and 11 to each block $I \in p_m$ and, by concavity of the logarithm, we get a first bound. Using $(\mathrm{H}_\theta)$ and the fact that $\rho_I \leq \gamma d_m/|I|$, we obtain the announced upper bound. For the lower bound, the positivity of $\phi$ in (4.2) and the independence between $\hat{s}_m$ and $\hat{\sigma}_m$ give us an intermediate inequality. The hypothesis $(\mathrm{H}_\theta)$ clearly ensures $d_m \leq |I|/2$; thus we get $\sigma_* d_m \leq (|I| - d_m)\sigma^*$ and the corresponding bound follows. To conclude, we know that $(\hat{s}_m, \hat{\sigma}_m) \in S_m \times \Sigma_m$ and, by definition of $(s_m, \sigma_m)$, this implies
$$\mathbb{E}\left[K(P_{s,\sigma}, P_{\hat{s}_m,\hat{\sigma}_m})\right] \geq \mathbb{E}\left[K(P_{s,\sigma}, P_{s_m,\sigma_m})\right].$$

Proof of Theorem 2
We prove the following more general result (Theorem 4), in which the quantities $R_1(\mathcal{M})$ and $R_2(\mathcal{M})$ are defined by explicit sums over the collection. In these expressions, $\lfloor \cdot \rfloor$ is the integer part and $C$ is a positive constant that can be taken equal to $12\sqrt{2e}/(\sqrt{e}-1)$.
Before proving this result, let us see how it implies Theorem 2. The choice (2.5) for the penalty function corresponds to $x_m = D_m \log^{1+\epsilon} D_m$ in (4.2). Applying the previous theorem with $\alpha = 1/2$ and using the upper bound on the risk given in Proposition 1, we easily obtain the coefficient of the infimum in (2.6). Thus, it remains to prove that the two quantities $R_1$ and $R_2$ can be upper bounded independently of $n$. For this, we set $B' = B + 2\log(2C\theta^2\gamma) + 1$ and compute $R_1$, splitting the sum into two parts: the first one, $R_1'$, corresponds to $d = 1$, and the other part, $R_1''$, corresponds to $d \geq 2$. We now handle $R_2$, for which (2.4) gives a first bound. For any positive $t$, $1 + t^{1+\epsilon} \leq (1+t)^{1+\epsilon}$, and we finally obtain the desired bound.

We now have to prove Theorem 4. For an arbitrary $m \in \mathcal{M}$, we begin by expanding the Kullback-Leibler divergence of $(\tilde{s}, \tilde{\sigma})$. The difference between the divergence and the likelihood can be expressed through $K(P_{s,\sigma}, P_{\hat{s}_m,\hat{\sigma}_m}) - L(\hat{s}_m, \hat{\sigma}_m)$. By the definition (2.2) of $\hat{m}$, for any $\alpha \in (0,1)$, we can write the inequality (4.3), which bounds $(1-\alpha) K(P_{s,\sigma}, P_{\tilde{s},\tilde{\sigma}})$ by $K(P_{s,\sigma}, P_{\hat{s}_m,\hat{\sigma}_m}) + \mathrm{pen}(m) + G(m)$ and further terms, where, for any $m \in \mathcal{M}$, the quantities $G(m)$, $Z(m)$, $W_1(m)$ and $W_2(m)$ used below are defined. We subdivide the proof of Theorem 4 into several lemmas.

Lemma 5.
For any $m \in \mathcal{M}$, we have $\mathbb{E}[G(m)] \leq 0$.

Proof. Let us compute this expectation to obtain the inequality. By independence between $\varepsilon^{[1]}$ and $\varepsilon^{[2]}$, we can condition on $\varepsilon^{[2]}$, which leads to
$$\mathbb{E}[G(m)] = \mathbb{E}\big[\mathbb{E}[G(m) \mid \varepsilon^{[2]}]\big] \leq 0.$$
In order to control $Z(m)$, we split it into two terms, $Z_+(m)$ and $Z_-(m)$, that we study separately.

Lemma 6. Let $m \in \mathcal{M}$ and let $x$ be a positive number. Under the hypothesis $(\mathrm{H}_\theta)$, we get a deviation bound for $Z_+(m)$ at level $x$.

Proof. We begin by setting, for all $1 \leq i \leq n$, suitable auxiliary quantities and notations. We lower bound the function $\phi$; this yields an inequality that we use to reduce the problem to the control of a quantity $S(m)$. To control $S(m)$, we use the inequality (4.2) in [14], conditionally on $\varepsilon^{[2]}$. Let $u > 0$. By the remark (4.1), we can upper bound the resulting term by an expression involving variables $X_I$, where the $X_I$'s are i.i.d. random variables with law $\chi^2(|I| - d_m - 1)/|I|$. Let $t > 0$; the first expectation is then dominated exponentially. Using $(\mathrm{H}_\theta)$, we roughly upper bound the maximum by the sum. Take $t = 2\alpha x/\gamma$ to conclude.
Lemma 7. Let $m \in \mathcal{M}$ and let $x$ be a positive number; then a deviation bound holds for $Z_-(m)$.

Proof. Note that for all $u > 1$, we have a lower bound on $\phi$. Let $t > 0$; we handle $Z_-(m)$ conditionally on $\varepsilon^{[2]}$ and, using this lower bound on $\phi$, we obtain an intermediate inequality. We can then apply the inequality (4.1) from [14], which leads to the desired bound. Take $t = \alpha x$ to get the announced result.
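For context, the inequalities (4.1) and (4.2) of [14] invoked in these proofs appear to be deviation bounds for Gaussian quadratic forms; their exact statements are not reproduced above. A classical bound of this type, due to Laurent and Massart (which may or may not coincide exactly with the one cited as [14]), reads: for i.i.d. standard Gaussian variables $\varepsilon_1, \dots, \varepsilon_n$, non-negative weights $a_1, \dots, a_n$ and any $x > 0$,
$$\mathbb{P}\left( \sum_{i=1}^{n} a_i (\varepsilon_i^2 - 1) \geq 2 \|a\|_2 \sqrt{x} + 2 \|a\|_\infty x \right) \leq e^{-x}, \qquad \mathbb{P}\left( \sum_{i=1}^{n} a_i (\varepsilon_i^2 - 1) \leq -2 \|a\|_2 \sqrt{x} \right) \leq e^{-x}.$$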
It remains to control $W_1(m)$ and $W_2(m)$. For the first one, we now prove a Rosenthal-type inequality.
Lemma 8. Consider any $m \in \mathcal{M}$. Under the hypothesis $(\mathrm{H}_\theta)$, for any $x > 0$, we have a Rosenthal-type bound for $W_1(m)$, in which $\lfloor \cdot \rfloor$ denotes the integer part and $C$ is a positive constant that can be taken explicitly.

Proof. Using Lemma 10 and the remark (4.1), we dominate $W_1(m)$. Take $x > 0$ and an integer $q > 1$. The quantity to bound is then the sum of independent centered random variables. To dominate $\mathbb{E}[V_+^q]$, we use Theorem 9 in [11]: we compute the relevant moments
and we roughly upper bound the maximum by the sum. Thus, it gives a bound which, injected into (4.4), leads to the announced inequality.

Lemma 9. Consider any $m \in \mathcal{M}$ and let $x$ be a positive number. Under the hypothesis $(\mathrm{H}_\theta)$, we have a deviation bound for $W_2(m)$.

Proof. Let us define the relevant quantities. We apply the Gaussian inequality to $W_2(m)$ conditionally on $\varepsilon^{[2]}$. It leads to a bound with deviation probability at most $e^{-t}$ and thus, by the remark (4.1), to an expression involving variables $X_I$, where the $X_I$'s are i.i.d. random variables with law $\chi^2(|I| - d_m - 1)/|I|$. Finally, we integrate with respect to $\varepsilon^{[2]}$ and we finish as we did for $Z_+(m)$.

In order to end the proof of Theorem 4, we need to put together the results of the previous lemmas. Because $\gamma \geq 1$, for any $x > 0$, we can combine the previous deviation bounds. We now come back to (4.3) and apply the preceding results to each model. Let $m \in \mathcal{M}$; we take $x = x_m / (2(2 + \alpha))$ and, recalling (4.2), we get the inequalities (4.5), where $R_1(\mathcal{M})$ and $R_2(\mathcal{M})$ are the sums defined in Theorem 4. As the choice of $m$ is arbitrary, we can take the infimum over $m \in \mathcal{M}$ in the right-hand side of (4.5).
In both situations, we obtain the announced result.

Technical results
This section is devoted to some useful technical results. Some notations previously introduced can have a different meaning here.
Lemma 10. Let $\Sigma$ be a positive symmetric $n \times n$ matrix and let $\sigma_1, \dots, \sigma_n > 0$ be its eigenvalues. Let $P$ be an orthogonal projection of rank $D \geq 1$. If we set $M = P\Sigma P$, then $M$ is a non-negative symmetric matrix of rank $D$ and, if $\tau_1, \dots, \tau_D$ are its positive eigenvalues, these can be compared with the $\sigma_i$'s.

Proof. We denote by $\Sigma^{1/2}$ the symmetric square root of $\Sigma$. By a classical result, $M$ has the same rank as $P\Sigma^{1/2}$, equal to $D$. On one side, we have
$$\sup_{x \neq 0} \frac{\langle P\Sigma P x, x \rangle}{\|x\|^2} = \sup_{\substack{(x_1,x_2) \in \ker(P) \times \mathrm{im}(P) \\ (x_1,x_2) \neq (0,0)}} \frac{\langle P\Sigma x_2, x_2 \rangle}{\|x_1\|^2 + \|x_2\|^2}.$$
On the other side, we can write the corresponding expression for the smallest eigenvalue.

Lemma 11. Let $\varepsilon$ be a standard Gaussian vector in $\mathbb{R}^n$, $a = (a_1, \dots, a_n)' \in \mathbb{R}^n$ and $b_1, \dots, b_n > 0$. We denote by $b^*$ (resp. $b_*$) the maximum (resp. minimum) of the $b_i$'s, and we introduce a quantity $\theta$.

Proof. For all $x > 0$, we define $r(x) = \log(1+x) - x/(1+x)$; $r$ is non-negative and increasing. Consider $\lambda > 0$; we upper bound the Laplace transform of $Z$, obtaining an expression involving $r(2\lambda b^*)$.