Spatial adaptation in heteroscedastic regression: Propagation approach

The paper concerns the problem of pointwise adaptive estimation in regression when the noise is heteroscedastic and incorrectly known. The use of the local approximation method, which includes local polynomial smoothing as a particular case, leads to a finite family of estimators corresponding to different degrees of smoothing. A data-driven choice of the localization degree can then be understood as the problem of selection from this family. This task can be performed by the fitted log-likelihood (FLL) technique suggested in Katkovnik and Spokoiny (2008), which is based on Lepski's method. An important issue with procedures of this type, namely the choice of certain tuning parameters, was addressed in Spokoiny and Vial (2009). The authors called their approach to the parameter calibration "propagation". In the present paper the propagation approach is developed and justified for the heteroscedastic case in the presence of noise misspecification. Our analysis shows that the adaptive procedure allows a misspecification of the covariance matrix with a relative error of order 1/log n, where n is the sample size.

The idea is to replace model (1.2) by a local parametric model where U_h(x) := {t : ‖t − x‖ ≤ h} and θ ∈ Θ ⊂ R^p is an unknown finite-dimensional parameter. Employing one of the well-developed parametric methods, we can estimate θ by θ̂(y_1, …, y_d; x), and then use the estimator f_θ̂(Y_1,…,Y_d)(x) based on the observations from the "true" model (1.2) for the estimation of f(x). Therefore we have to choose the local model (the collection of estimators {f_θ(·), θ ∈ Θ}) and the appropriate degree of locality h. This method of local approximation originated from [14], [15], [16], [17]. In what follows we consider approximation by local linear models of the following type: (1.4) where Ψ_i = Ψ(X_i) = (ψ_1(X_i − x), …, ψ_p(X_i − x))^⊤ is a vector of basis functions {ψ_j(·)} fixed in advance. The main issue then is the choice of an appropriate bandwidth h such that the estimator built on the localized data is a relevant estimator of f(x). For this purpose the bandwidth selection should be done in a data-driven way, and for a fixed basis the adaptive selection from the family {f_θ̂_h(·)}_{h>0} is equivalent to the adaptive choice of the bandwidth. Notice also that the coefficients θ^(j)(x), as well as their estimators, depend on x and should be calculated for every particular point of interest x. On the other hand, the localization reduces the influence of the choice of the functions {ψ_j(·)}, allowing the use of simple collections. Moreover, in our set-up the covariance matrix Σ_0 is not assumed to be known exactly, and the approximate model used instead of the true one reads as follows: where Ψ = (Ψ_1, …, Ψ_n) is a p × n design matrix and Σ = diag(σ²_1, …, σ²_n), min_i σ²_i > 0, is available to the statistician. Thus the model is misspecified in two places: in the form of the regression function and in the error distribution.
Nonparametric estimation in heteroscedastic regression under L_2 losses was studied in [12], [13] and in a series of papers [8], [9], [10]. For estimation of the mean with L_2-risk in the Gaussian homoscedastic model with unknown variance, penalties allowing to deal with the complexity of such a collection of models were proposed in [2]. However, the problem of "local model selection" addressed in the present paper is quite different from model selection in the sense of [3] and [25], which is related to estimation with a global risk. Minimax pointwise estimation in heteroscedastic regression is the focus of [4].

2.1 Local parametric estimation
We shall perform the adaptive selection from a collection of K estimators corresponding to model (1.4) with different sizes h of the neighborhood U_h(x). Fix a point x ∈ R^d as the center of localization and a basis {ψ_j}. Let the localizing operator be identified with the corresponding matrix. For the subsequent nonparametric step we will need a sequence of nested windows. Thus for every x the sequence of localizing schemes (scales) W_k(x), k = 1, …, K, is given by the matrices W_k(x) = diag(w_{k,1}(x), …, w_{k,n}(x)), where the weights w_{k,i}(x) ∈ [0, 1] can be understood, for instance, as smoothing kernels w_{k,i}(x) = W((X_i − x)h_k^{−1}). Let a particular localizing function w_{(·,·)}(x) be fixed; the aim is to choose, on the basis of the available data, the index k of the optimal bandwidth h_k. To simplify the notation we sometimes suppress the dependence on the reference point x. Let Θ be a compact subset of R^p. Inside any "window" given by W_k, k = 1, …, K, we calculate the quasi-maximum likelihood estimator (QMLE) where R stands for the terms not depending on θ. Recall that in the case of the polynomial basis the estimator θ̂_k(x) is a local polynomial estimator of θ(x) corresponding to the k-th scale. In what follows we assume that n > p and that det B_k > 0 for any k = 1, …, K. Because rank(B_k) ≤ min{p, rank(W_k(x))}, this requires the following conditions on the design matrix Ψ and the minimal localizing scheme W_1(x): (A1) The p × n design matrix Ψ is supposed to have full row rank. (A2) The smallest localizing scheme W_1(x) is chosen to contain at least p design points such that w_{1,i}(x) > 0, i.e., p ≤ #{i : w_{1,i}(x) > 0}. Assumption (A2) is automatically fulfilled in practice: for example, in R^1 it means that for the local constant fit we need at least one observation, and so on.
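In the Gaussian quasi-likelihood setting, the QMLE in a window reduces to weighted least squares with weights combining the localizing scheme and the (possibly misspecified) variances. A minimal numerical sketch; the Epanechnikov-type kernel, the monomial basis, and all concrete numbers are illustrative choices of ours, not the paper's:

```python
import numpy as np

def local_poly_qmle(X, y, x0, h, p, sigma2):
    """Sketch of the windowed QMLE: weighted least squares around x0 with
    kernel weights w_i in [0, 1] and known (possibly wrong) variances."""
    u = (X - x0) / h
    w = np.clip(1.0 - u**2, 0.0, None)               # Epanechnikov-type weights
    Psi = np.vander(X - x0, N=p, increasing=True).T  # p x n basis matrix
    ws = w / sigma2                                  # localization + variance weighting
    B = (Psi * ws) @ Psi.T                           # local "information" matrix
    return np.linalg.solve(B, Psi @ (ws * y))

rng = np.random.default_rng(0)
n = 200
X = np.sort(rng.uniform(-1, 1, n))
y = np.sin(3 * X) + 0.1 * rng.standard_normal(n)
theta = local_poly_qmle(X, y, x0=0.0, h=0.3, p=3, sigma2=np.full(n, 0.01))
print(theta.shape)    # (3,): intercept close to f(0) = 0
```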
Usually it is implicitly assumed that, starting from the smallest window, at every step of the procedure every new window contains at least p new design points. The formulas (2.2) give a sequence of estimators {θ̂_k(x)}_{k=1}^K. It was noticed in [1] that when the true data distribution is unknown, the QMLE is a natural estimator for the parameter maximizing the expected log-likelihood. That is, for every k = 1, …, K, the estimator θ̂_k(x) can be considered as an estimator of the target θ*_k(x). Recall that we do not assume f = Ψ^⊤θ even locally. It is known from [29] that in the presence of model misspecification, for every k the QMLE θ̂_k is a strongly consistent estimator of θ*_k(x), which is also the minimizer of the localized Kullback–Leibler [19] information criterion: with KL(P, P_θ) := E_P log(dP/dP_θ). For the properties of the Kullback–Leibler divergence see, for example, [28].
It follows from the above definition of θ*_k(x) and from (2.2) that the QMLE θ̂_k admits a decomposition into a deterministic and a stochastic part: where ε ∼ N(0, I_n). Notice that if f ≡ Ψ^⊤θ, then θ*_k ≡ θ for any k, and the classical parametric set-up takes place.

Adaptive bandwidth selection
Let a point x ∈ X ⊂ R^d, a basis {ψ_j} and the method of localization w_{(·,·)}(x) be fixed. The crucial assumption for the procedure under consideration to work is that the localizing schemes (scales) W_k(x) = diag(w_{k,1}, …, w_{k,n}) are nested, in the sense that the corresponding matrices satisfy the following ordering condition: (A3) For any fixed x and the method of localization w_{(·,·)}(x) the following relation holds: For kernel smoothing this condition means the following. Let the sequence of bandwidths {h_k} be ordered from the smallest to the largest, i.e., h_1 < … < h_K, and let W_k(x) = diag(w_{k,1}, …, w_{k,n}) be the localizing matrix corresponding to the bandwidth h_k. Here the weights satisfy the required ordering for any 0 < h_l < h_k < 1, provided W(u) → 0 as |u| → ∞, or W is even compactly supported.
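The nesting condition (A3) is automatic for kernel weights whenever the kernel is decreasing in |u|, since enlarging the bandwidth can only increase every weight. A small numeric check of this ordering; the Gaussian kernel, the design, and the geometric bandwidth grid are illustrative assumptions:

```python
import numpy as np

# w_{k,i} = W((X_i - x0)/h_k) with a kernel decreasing in |u| gives
# pointwise larger weights for larger bandwidths, so W_1 <= ... <= W_K.
W = lambda u: np.exp(-u**2)                     # a monotone (Gaussian) kernel
X = np.linspace(-1, 1, 50)
x0 = 0.2
bandwidths = [0.1 * 1.5**k for k in range(6)]   # geometric grid, cf. (A5)
weights = [W((X - x0) / h) for h in bandwidths]
nested = all(np.all(weights[k] <= weights[k + 1] + 1e-12)
             for k in range(len(weights) - 1))
print(nested)   # True: the localizing schemes are nested
```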
Recall that, given x ∈ X, a basis {ψ_j}, and the method of localization w_{(·,·)}(x), we look for the estimator f_θ̂(x) of f(x) of the form (1.5), where the coefficients θ̂^(j)(x) are the components of the estimator θ̂_k̂(x) for an adaptively chosen index k̂. The idea is to apply Lepski's [20] method to the comparison of the maximized log-likelihoods L(W_k, θ̂_k). This is the idea of the FLL technique suggested in [18]. More precisely, to describe the test statistic, define for any θ, θ′ ∈ Θ the corresponding log-likelihood ratio: (2.11) Then, using the approach suggested in [18], for every l = 1, …, K, the fitted log-likelihood (FLL) ratio is defined as follows: By Theorem 4.2, for any l and θ, the FLL is a quadratic form: This prompts the use, see [18], of the FLL statistics: In the algorithm the smallest bandwidth, corresponding to k = 1, is always accepted; then the adaptive index k̂ is selected by Lepski's selection rule with the FLL test statistics Finally we set θ̂ = θ̂_k̂. The procedure (2.13) involves parameters z_1, …, z_{K−1} related to the large deviations of {T_{lm}}, 1 ≤ l < m ≤ K. As the classical Lepski procedure, (2.13) controls the risk of the estimators in the case of dominating bias. The opposite case, when the bias is negligible w.r.t. the noise, is usually handled by employing advanced empirical process techniques, which, however, sometimes provide constants far from optimal. Notice also that the Wilks-type Theorem 4.3 below gives the bound for the expected fitted log-likelihood ratio: where the constant C(p, r) does not depend on the degree of localization and is given by the r-th moment of the χ² distribution with p degrees of freedom: Therefore we follow the practical idea from [27] and [18], which allows one to avoid a hard large deviations analysis and to compute rather sharp thresholds numerically. We assume at this step that the critical values z_1, …, z_{K−1} are already fixed and satisfy the following set of K − 1 inequalities: Let θ̆_k denote the last accepted estimate after the first k steps of the procedure: The critical values z_1, …, z_{K−1} satisfy where C(p, r) is defined by (2.15), α ∈ (0, 1] is an additional "free" tuning parameter which can be taken equal to 1, and E_{0,Σ} stands for the expectation w.r.t. the measure N(0, Σ).
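The selection rule can be sketched as follows; the quadratic form used for T_{l,k} mirrors the FLL statistics of Theorem 4.2, but the function name and the toy inputs are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def lepski_select(theta_hats, Bs, z):
    """Sketch of rule (2.13): the smallest scale is always accepted, and
    scale k is accepted while T_{l,k} = (th_l - th_k)' B_l (th_l - th_k)/2
    stays below the critical value z_l for every l < k."""
    k_hat = 0                                  # smallest bandwidth always accepted
    for k in range(1, len(theta_hats)):
        ok = all(
            0.5 * (theta_hats[l] - theta_hats[k]) @ Bs[l] @ (theta_hats[l] - theta_hats[k]) <= z[l]
            for l in range(k))
        if not ok:
            break
        k_hat = k
    return k_hat

# toy illustration: estimates drift away at larger scales as bias kicks in
theta_hats = [np.array([0.0]), np.array([0.05]), np.array([0.4]), np.array([1.0])]
Bs = [np.array([[100.0]])] * 4
z = [1.0, 1.0, 1.0]
print(lepski_select(theta_hats, Bs, z))   # 1: scale 3 is rejected, last accepted is 2nd
```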
Remark 2.1. Lemma 4.1 from Section 4 shows that in the "no bias" situation the Gaussian distribution provides a convenient pivotality property: the actual value of the parameter θ is not important for the risk of the adaptive estimate, so one can put θ = 0 in (2.17).
Remark 2.2. Clearly, at any step k ≤ K of the algorithm the "current value" of the adaptive estimator θ̆_k depends on the thresholds z_1, …, z_{k−1}. The theoretical aspects related to the heteroscedasticity of the model and to the incorrectly known variance are the focus of the present paper. Thus we do not detail the practical aspects of the threshold calibration, mentioning only that in practice this can be done by Monte Carlo simulations under the known "parametric" measure N(0, Σ). Moreover, one needs to calculate the thresholds only once. For a detailed consideration of the practical aspects of the calibration, as well as for computational results, see [27] or [18], which focus on image denoising by local constant fitting. Demo versions of the software are available on the web page http://www.cs.tut.fi/˜lasip/.
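A toy version of such a Monte Carlo check, in a drastically simplified scalar model: nested sample means in place of the local estimators, one common threshold, r = 1/2 and p = 1 (so that C(p, r) = E|N(0,1)|). All concrete numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

K, z, n_mc = 4, 6.0, 4000
n_k = np.array([10, 20, 40, 80])            # effective local sample sizes, B_k = n_k

C_pr = np.sqrt(2 / np.pi)                   # E|chi^2_1|^{1/2} = E|N(0,1)|
losses = np.zeros((n_mc, K))
for m in range(n_mc):
    eps = rng.standard_normal(n_k[-1])      # one noise sample; nested means
    th = np.array([eps[:nk].mean() for nk in n_k])
    k_hat = 0
    for k in range(1, K):                   # accept k while all statistics are small
        if all(0.5 * n_k[l] * (th[l] - th[k]) ** 2 <= z for l in range(k)):
            k_hat = k
        else:
            break
    for l in range(K):                      # loss of theta_breve_l = th[min(k_hat, l)]
        losses[m, l] = np.sqrt(n_k[l]) * abs(th[min(k_hat, l)] - th[l])

print(bool(losses.mean(axis=0).max() <= C_pr))   # True: a (2.17)-type bound holds for this z
```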

Upper bound for the critical values
Let us at this step recall the notion of the Löwner partial ordering: for real symmetric matrices A and B we write A ⪯ B if and only if ϑ^⊤Aϑ ≤ ϑ^⊤Bϑ for all vectors ϑ, or, equivalently, if and only if the matrix B − A is nonnegative definite. Assuming (A4), the true covariance matrix satisfies Σ_0 ⪯ (1 + δ)Σ, and the variance of the estimate θ̂_k is bounded by B_k^{−1}: The last inequality follows from the observation that all the entries of the "weight" matrix W_k do not exceed one, implying W_k² ⪯ W_k. Equality takes place if the {w_{k,i}} are boxcar (rectangular) kernels and the noise is known, i.e., if δ = 0. To justify the procedure one needs to show that the critical values chosen by (PC) are finite. This is obtained under the following assumption: (A5) Let for some constants u_0 and u such that In the "one-dimensional case" p = 1, that is, for the local constant fit, the "matrix" B_k is just a weighted "local design size". Assume for simplicity that σ²_i ≡ σ², the weights are rectangular kernels w_{k,i}(x) = I{|X_i − x| ≤ h_k/2}, and the design is equidistant. Then for n sufficiently large the condition (A5) means that the bandwidths grow geometrically: h_k = u h_{k−1}.
Now we are able to formulate a theorem on finiteness of the critical values.
3.2 Quality of estimation in the nearly parametric case: The small modeling bias condition and the propagation property

The critical values z_1, …, z_{K−1} of the procedure were selected by the propagation conditions (2.17) under the measure N(θ, Σ). Suppose now that θ*_1 ≈ ··· ≈ θ*_k ≈ θ up to some k ≤ K, while the covariance matrix is Σ_0. The aim is to formalize the meaning of "≈" and to justify the use of the critical values in this situation. For this purpose we will take into account the discrepancy between the joint distributions of the linear estimates θ̂_1, …, θ̂_k for k = 1, …, K under the "no bias" assumption, corresponding to the distributions with the mean θ*_1 = ··· = θ*_k = θ and the incorrectly specified covariance matrix Σ, and in the general situation with means θ*_1, …, θ*_k and the covariance Σ_0. Denote the expectations w.r.t. these measures by E_{θ,Σ} := E_{k,θ,Σ} and E_{f,Σ_0} := E_{k,f,Σ_0}, respectively. Denote the p × k matrix of the first k estimators and the corresponding expectations by Let A ⊗ B stand for the Kronecker product of A and B, defined as Denote the pk × pk covariance matrices of vec Θ_k as follows: where the matrix J_k is a k × k matrix with all its elements equal to 1, and the pk × nk matrix D_k is defined as follows: By Lemma 4.8 from Section 4, under Assumption (A4) with the same δ, a similar relation holds for the covariance matrices Σ_k and Σ_{k,0} of the linear estimators: Although the moment generating function of vec Θ_K has the form corresponding to the multivariate normal distribution, see Lemma 4.10 in Section 4, this representation makes sense only if Σ_K is nonsingular. Notice that rank(J_K ⊗ Σ) = n. From J_K ⊗ Σ ⪰ 0 it follows only that Σ_K ⪰ 0; similarly, Σ_{K,0} ⪰ 0. However, without any additional assumptions it is easy to show, see Lemma 4.9 in Section 4, that for rectangular kernels Σ_K ≻ 0.
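The rank statement rests on the identity rank(A ⊗ B) = rank(A) rank(B), so rank(J_K ⊗ Σ) = 1 · n = n < nK and the Kronecker product is necessarily singular. A small numeric confirmation, with dimensions chosen arbitrarily:

```python
import numpy as np

K, n = 3, 4
J = np.ones((K, K))                        # rank-one matrix of ones
Sigma = np.diag(np.arange(1.0, n + 1))     # a nonsingular diagonal Sigma
kron = np.kron(J, Sigma)                   # (nK) x (nK) but rank only n
print(np.linalg.matrix_rank(kron))         # 4, i.e. n
```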
On the other hand, due to (3.8), it is enough to require nonsingularity only for the matrix Σ_K corresponding to the approximate model (1.6), and its choice belongs to the statistician. In what follows we assume that Σ_K ≻ 0. Denote by IP^k_{θ,Σ} and IP^k_{f,Σ_0}, k = 1, …, K, the distributions of vec Θ_k under the null and under the alternative, and denote the Radon–Nikodym derivative by Z_k := dIP^k_{f,Σ_0}/dIP^k_{θ,Σ}. Then, by Lemma 4.11 from Section 4, the Kullback–Leibler divergence between these measures has the following form: If there were no "noise misspecification", i.e., if δ ≡ 0 implying Σ = Σ_0, then the divergence would reduce to Δ(k). Therefore, this quantity can be used to indicate the deviation between the mean values in the true (1.1) and the approximate (1.6) models. Clearly, under (W), the quantity Δ(k) grows with k, so, following the terminology suggested in [27], we introduce the small modeling bias condition: (SMB) Let for some k ≤ K and some θ there exist a constant Δ ≥ 0 such that Monotonicity of Δ(k) and Assumption (SMB) immediately imply that the modeling bias stays bounded at all smaller scales. Furthermore, (4.19) implies the bound for the Kullback–Leibler divergence in terms of δ: This means that, if for some k Assumption (SMB) is fulfilled and δ = O(1/K), then the Kullback–Leibler divergence between the measures IP^k_{θ,Σ} and IP^k_{f,Σ_0} is bounded by a small constant. Now one can state the crucial property for obtaining the final oracle result.
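For nonsingular Gaussian measures the Kullback–Leibler divergence has a closed form, which makes the roles of the mean shift and of the covariance mismatch explicit; a sketch with the standard formula (the example matrices are ours):

```python
import numpy as np

def kl_gauss(mu1, S1, mu0, S0):
    """KL( N(mu1,S1) || N(mu0,S0) ) for nonsingular covariances:
    0.5*[tr(S0^{-1} S1) - d + (mu0-mu1)' S0^{-1} (mu0-mu1) + log det S0/det S1]."""
    d = len(mu1)
    S0inv = np.linalg.inv(S0)
    diff = mu0 - mu1
    return 0.5 * (np.trace(S0inv @ S1) - d + diff @ S0inv @ diff
                  + np.log(np.linalg.det(S0) / np.linalg.det(S1)))

# with equal means the divergence measures only the covariance mismatch,
# and it vanishes when the two covariances coincide (the delta = 0 case)
S = np.diag([1.0, 2.0])
kl0 = kl_gauss(np.zeros(2), S, np.zeros(2), S)
kl1 = kl_gauss(np.zeros(2), 1.1 * S, np.zeros(2), S)
print(kl0, kl1 > 0)
```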

Theorem 3.2. Propagation property
Assume (A1)–(A5) and (PC). Then for any k ≤ K the following upper bounds hold: where θ̆_k, the last accepted estimate, is the adaptive estimate at the k-th step of the procedure.
The proof is given in Subsection 4.4.
Remark 3.2. Bounds (4.28) and (4.27) below give a condition on the relative error in the noise misspecification. As δ → 0+, for every k ≤ K it holds that where Z_k is defined by (3.9). This bound implies, up to the additive constant log(αE|χ²_p|^r)/2, the same asymptotic behavior for the logarithm of the risk of the adaptive estimate at each step of the procedure. Because by (SMB) the quantity Δ(k) is supposed to be bounded by a small constant, and K is of order log n, it suffices that δ is of order 1/log n. This means that in the case when Σ is an estimate of Σ_0, only logarithmic-in-n accuracy of the variance estimation is needed. This observation is of particular importance, since it is known from [26] that over classes of functions with bounded second derivative the rate n^{−1/2} of variance estimation is achievable only for dimension d ≤ 8.

Quality of estimation in the nonparametric case: the oracle result
Define the oracle index as the largest index k ≤ K such that (SMB) holds: Theorem 3.3. Let the smallest bandwidth h_1 be such that the first estimate is always accepted in the testing procedure, and let k* be the oracle index. Then under the conditions (A1), (A2), (A4), (W), (A5) the risk between the adaptive estimate and the oracle one is bounded by the following expression: where φ(δ) is as in Theorem 3.2.
Proof. By definition the adaptive estimate is θ̂ = θ̂_k̂. Because the events {k̂ ≤ k*} and {k̂ > k*} are disjoint, one can write Thus, to bound the first summand, it is enough to apply Theorem 3.2 with k = k*.
To bound the second expectation, i.e., to bound the fluctuations of the adaptive estimate θ̂ at the steps of the procedure for which the (SMB) condition is no longer fulfilled, notice that for k̂ > k* the quadratic form coincides with the test statistic T_{k*,k̂}. But the index k̂ was accepted; this means that T_{l,k̂} ≤ z_l for all l < k̂, and therefore in particular for l = k*. Thus

Componentwise oracle risk bounds
Theorem 3.3 provides the oracle risk bound for the adaptive estimator θ̂(x) = θ̂_k̂(x) of the parameter vector θ(x) ∈ R^p corresponding to the estimator f_θ̂(x) of type (1.5). It is interesting to look at the oracle quality of estimation of the components θ^(1), …, θ^(p) of the vector θ, having in mind that the choice of the polynomial basis leads to the direct estimation of the value of the regression function and its derivatives by the coordinates of θ̂. Denote by LP_k(p − 1) a local polynomial estimator of order p − 1 corresponding to the k-th degree of localization and, respectively, by LP_ad(p − 1) := LP_k̂(p − 1) its adaptive counterpart. If the basis is polynomial and the regression function f(·) is sufficiently smooth in a neighborhood of x, then θ̂(x) is the LP_ad(p − 1) estimator of the vector (f^(0)(x), …, f^(p−1)(x))^⊤ of the values of the function f and its derivatives at the reference point x ∈ R^d. Now we are going to obtain an oracle result, similar to that of the previous section, for the components of the vector θ̂(x), in particular for e_j^⊤θ̂(x), j = 1, …, p, where e_j = (0, …, 1, …, 0)^⊤ is the j-th canonical basis vector in R^p. As a corollary of this general result, in the case of the polynomial basis we get an oracle risk bound for the LP_ad(p − 1) estimators of the function f and its derivatives at the point x.
Let the LP_k(p − 1) estimator of f^(j−1)(x) be given by Then the adaptive local polynomial estimators are defined as follows: Similarly, the adaptive estimators of the function f and its derivatives corresponding to the k-th step of the procedure are given by Thus, if the basis is polynomial, the estimator f̂(x) is the LP_ad(p − 1) estimator of the value f(x), and f̂^(j−1)(x) with j = 2, …, p are, correspondingly, the LP_ad(p − 1) estimators of the values of its derivatives. However, it should be stressed that the results of Theorems 3.3 and 3.9 hold for any basis satisfying the conditions of the theorems. We shall need the following assumptions: (A6) There exist 0 < σ_min(k) ≤ σ_max(k) < ∞ such that for i : X_i ∈ U_{h_k}(x), with U_{h_k}(x) given by W_k, the variances of the errors from the parametric (known) model (1.6) are locally uniformly bounded: (A7) Let Assumption (A6) be satisfied. There exists a number Λ_0 > 0 such that for any k = 1, …, K the smallest eigenvalue λ_p(B_k) ≥ n h_k^d Λ_0 σ_max^{−2}(k) for n sufficiently large.
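With the polynomial basis normalized as ψ_j(u) = u^{j−1}/(j − 1)!, the coordinates of the local estimator directly estimate f(x) and its derivatives. A sketch with illustrative kernel weights, homoscedastic noise, and a test function of our choosing:

```python
import numpy as np
from math import factorial

def lp_fit(X, y, x0, h, p):
    """Local polynomial fit of order p-1 at x0 with basis u^j/j!, so that
    the j-th coefficient estimates the j-th derivative of f at x0."""
    u = X - x0
    w = np.clip(1 - (u / h) ** 2, 0, None)                  # localizing weights
    Psi = np.column_stack([u ** j / factorial(j) for j in range(p)]).T
    B = (Psi * w) @ Psi.T
    return np.linalg.solve(B, Psi @ (w * y))

rng = np.random.default_rng(3)
X = np.linspace(-1, 1, 400)
y = np.exp(X) + 0.02 * rng.standard_normal(X.size)
theta = lp_fit(X, y, x0=0.0, h=0.4, p=4)
print(np.round(theta[:2], 1))    # [1. 1.]: f(0) = 1 and f'(0) = 1 for exp
```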
Then, because B_k ≻ 0, for any k = 1, …, K and for any γ ∈ R^p we have where σ²_max(k) := max_{1≤l≤k} σ²_max(l). Thus we have the following lemma: To obtain the "componentwise" oracle risk bounds we need to recheck the "propagation property". First, notice that the "propagation conditions" (2.17) on the choice of the critical values z_1, …, z_{K−1} imply similar bounds for the components e_j^⊤θ̆_k(x). Recall that θ̆_k := θ̂_{min{k̂,k}}. Then, by (2.17), Lemma 3.4, and the pivotality property from Lemma 4.1, we have the following simple observation: Under the propagation conditions (PC), for any θ ∈ R^p and all k = 2, …, K, we have: ≤ αC(p, r).
As before we suppress the dependence on x. To get the propagation property we study, for k = 1, …, K, the joint distributions of e_j^⊤θ̂_1, …, e_j^⊤θ̂_k, that is, the distribution of e_j^⊤Θ_k, the j-th row of the matrix Θ_k. Recall that the matrices Σ_{k,0} and Σ_k have a block structure. Now, for instance, to study the estimator of the first coordinate of the vector θ = θ(x), or of f(x) in the case of the polynomial basis, we take the first elements of each block, and so on. Denote the k × k covariance matrices of the j-th elements of the vectors θ̂_1, …, θ̂_k by where J_k is a k × k matrix with all its elements equal to 1, and the k × nk block diagonal matrices D_{k,j} are defined by Moreover, the following representation holds: where Σ_k is defined by (3.5). Similarly, Thus, the important relation (3.8) is preserved for Σ_{k,j} and Σ_{k,0,j}, obtained by picking the (j, j)-th elements of each block of Σ_k and Σ_{k,0}, respectively. With the usual notation γ^(j) for the j-th component of γ ∈ R^k, denote by

Theorem 3.6. "Componentwise" propagation property. Under the conditions (A1)–(A7) and (PC), for any k ≤ K the following upper bound holds: with φ(δ) as in Theorem 3.2.

Proof. The proof essentially follows the line of the proof of Theorem 3.2. The distribution of vec Θ_k is Gaussian, hence any subvector is also Gaussian. Denote by IP^{k,j}_{θ,Σ} = N((e_j^⊤θ, …, e_j^⊤θ)^⊤, Σ_{k,j}) and by IP^{k,j}_{f,Σ_0} = N((e_j^⊤θ*_1, …, e_j^⊤θ*_k)^⊤, Σ_{k,0,j}), k = 1, …, K, the distributions of e_j^⊤Θ_k under the null and under the alternative. By the Cauchy–Schwarz inequality and Lemma 3.5, with the Radon–Nikodym derivative given by Z_{k,j} = dIP^{k,j}_{f,Σ_0}/dIP^{k,j}_{θ,Σ}.
By inequalities (3.24) and (3.25), the analog of Condition (A4) is preserved for Σ_{k,0,j} and Σ_{k,j}, that is, there exists δ ∈ [0, 1) such that (1 − δ)Σ_{k,j} ⪯ Σ_{k,0,j} ⪯ (1 + δ)Σ_{k,j}. At this point we introduce the "componentwise" small modeling bias conditions: (SMBj) Let for some j = 1, …, p, some k(j) ≤ K, and some θ^(j) = e_j^⊤θ there exist a constant
Definition 3.8. For each j = 1, …, p the oracle index k*(j) is defined as the largest index in the scale for which the (SMBj) condition holds, that is

Theorem 3.9. Assume (A1)–(A7) and (PC). Let the smallest bandwidth h_1 be such that the first estimate e_j^⊤θ̂_1(x) is always accepted in the adaptive procedure, and let k*(j) be the oracle index defined by (3.31), j = 1, …, p. Then the risk between the j-th coordinates of the adaptive estimate and the oracle one is bounded by the following expression: where φ(δ) is as in Theorem 3.6.

Proof. To simplify the notation we suppress the dependence on j in the index k̂. Similarly to the proof of Theorem 3.3, we consider the disjoint events {k̂ ≤ k*} and {k̂ > k*}. Therefore, By Lemma 3.4 and the definition of the test statistic T_{k*,k̂}, the second summand can be easily bounded: To bound the first summand we use the "componentwise" analog of Theorem 3.2, namely Theorem 3.6, and this completes the proof.

SMB and the bias-variance trade-off
In [27] it was shown that the small modeling bias condition (SMB1) (3.30) can be obtained from the "bias-variance trade-off" relations. Notice that our set-up includes the set-up from [27] as a particular case. To prove that a similar relation holds in the present case we need the following definition. Given a point x and the method of localization w, for any j = 1, …, p the "ideal adaptive bandwidth", see [22], [23], is defined as follows: where C_j(w) is a constant depending on the choice of the smoother w, and f^(0) stands for the function f itself. To bound the "modeling bias" Δ_j(k) we need the following assumption: (A8) There exists a constant s_j > 0 such that for all k ≤ K where Σ_{k,j,diag} is a diagonal matrix composed of the diagonal elements of Σ_{k,j}. Thus we have the following result:

Proof. Consider the quantity b_j(k)^⊤ Σ^{−1}_{k,j,diag} b_j(k), and suppose that e_j^⊤θ(x) = f^(j−1)(x). In view of relation (4.15) for the weights {w_{l,i}(x)}, the form of the matrix Σ_{k,j,diag} is particularly simple:

Remark 3.4. Using standard techniques it is easy to derive from the above result that for estimation of functions over Hölder classes, the methodology proposed in [18] and [27] and generalized in the present paper delivers the minimax rate of convergence up to a logarithmic factor.

Pivotality and local parametric risk bounds
Lemma 4.1. Pivotality property. Let (A3) hold, and let θ*_1 = ··· = θ*_κ = θ for some κ ≤ K. Then for any k ≤ κ the risk associated with the adaptive estimate at every step of the procedure does not depend on the parameter θ: where E_0 denotes the expectation w.r.t. the centered measure N(0, Σ) or N(0, Σ_0).
Proof. After the first k steps θ̆_k coincides with one of the θ̂_m, m ≤ k, and this event takes place if for some l ≤ m the statistic T_{l,m+1} > z_l. In view of the decomposition (2.8), under H_{m+1} the event {T_{l,m+1} > z_l for some l = 1, …, m} can be expressed in terms of ε ∼ N(0, I_n) alone. The probability of this event therefore does not depend on the shift θ, so without loss of generality θ can be taken equal to zero. The risk associated with θ̆_k admits the following decomposition: Under the conditions of the lemma, for all m < k the joint distribution of the involved statistics does not depend on θ by the same argument. To justify the statistical properties of the considered procedure we need the following simple observation. Let for any θ, θ′ ∈ Θ the corresponding log-likelihood ratio L(W_k, θ, θ′) be defined by (2.11). Then

Theorem 4.2. Quadratic shape of the fitted log-likelihood
Let for every k = 1, …, K the fitted log-likelihood (FLL) be defined as follows:
Proof. Notice that L(W_k, θ) defined by (2.3) is quadratic in θ. The assertion follows from the second-order Taylor expansion at the point θ̂_k, because it is the point of maximum and the second derivative is the constant matrix B_k.
Let the matrix S be defined as follows: Then for the distribution of L(W_k, θ̂_k, θ*_k) one observes the so-called "Wilks phenomenon", see [7], described by the following theorem:

Theorem 4.3. Let the regression model be given by (1.1) and the parameter θ*_k = θ*_k(x) maximizing the expected local log-likelihood be defined by (2.6). Then for any k = 1, …, K the following equality in distribution takes place: with p = rank(B_k) = dim Θ. Here λ_1(S), …, λ_p(S) are the non-zero eigenvalues of the matrix S, and the ε_i are independent standard normal random variables. Moreover, under (A4) the maximal eigenvalue satisfies λ_max(S) ≤ 1 + δ, and for any z > 0 where η is a random variable distributed according to the χ² law with p degrees of freedom.

Proof. By Theorem 4.2 and the decomposition (2.8) it holds that: where the symmetric matrix S is as defined above. Therefore, On the other hand, the matrix S can be rewritten in terms of the projector Π_k. Notice that Π_k is an orthogonal projector on the linear subspace of dimension p = rank(B_k) spanned by the rows of the matrix Ψ. Indeed, Π_k is symmetric and idempotent, i.e., Π_k² = Π_k. Moreover, rank(Π_k) = tr(Π_k) = p. Therefore Π_k has exactly p unit eigenvalues and n − p zero ones. Notice also that the n × n matrix S has rank p, and λ_1(S), …, λ_p(S) are its non-zero eigenvalues. Recall the definition of the matrix norm induced by the L_2 vector norm: (4.5) Thus, taking into account Assumption (A4), the induced L_2-norm of the matrix S can be estimated as follows: Therefore the largest eigenvalue of the matrix S is bounded: λ_max(S) ≤ 1 + δ. The last assertion of the theorem follows from the simple observation that P(λ_1(S)ε_1² + ··· + λ_p(S)ε_p² ≥ z) ≤ P(λ_max(S)(ε_1² + ··· + ε_p²) ≥ z).
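In the well-specified case the matrix S reduces to an orthogonal projector, which is what produces the χ²_p behavior: a projector onto the row space of Ψ has exactly p unit eigenvalues and n − p zero ones. A numeric check with Σ = I; the construction of Π below is the standard least-squares projector, used purely as an illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
Psi = rng.standard_normal((p, n))                 # full-row-rank design, a.s.
Pi = Psi.T @ np.linalg.inv(Psi @ Psi.T) @ Psi     # projector onto rows of Psi
eig = np.sort(np.linalg.eigvalsh(Pi))[::-1]
print(np.round(eig[:p], 10))                      # [1. 1. 1.]
print(int(np.trace(Pi).round()))                  # 3 = p = rank(Pi) = tr(Pi)
```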

Proof. By (4.3) and the independence of ε
Let η ∼ χ 2 p . Integrating by parts yields the second inequality:

Proof of the bounds for the critical values
Denote for any l < k the variance of the difference θ̂_k − θ̂_l by V_{lk}: Then there exists a unique matrix V_{lk}^{1/2} ≻ 0 such that (V_{lk}^{1/2})² = V_{lk}.

Lemma 4.5. Assume (A4), (A3) and (A5). If for some k ≤ K the hypothesis H_k is true, that is, if θ*_1 = ··· = θ*_k = θ, then for any l < k it holds that:

Proof. The hypothesis H_k and (2.8) imply where ξ is a standard normal vector in R^p. Thus by Theorem 4.2, for any l < k, By the Schur theorem there exists an orthogonal matrix M such that where ε is a standard normal vector, Λ = diag(λ_1(V_{lk}^{1/2}), …, λ_p(V_{lk}^{1/2})), and p = rank(B_l). Therefore, where λ_i(V_{lk}^{1/2}), i = 1, …, p, are the nonzero eigenvalues of V_{lk}^{1/2}. By a similar argument: For any square matrices A and B we have Application of this bound to the variance of the difference of the estimates yields where V_l = Var θ̂_l, l ≤ k. By (3.2) and by Assumption (B) we have: Thus by (4.10) the upper bound for the induced matrix norm reads as follows: (4.11) Similarly, (4.12) These bounds imply the following lemma.

Lemma 4.6. Under the conditions of the preceding lemma, for any µ_0 < t_0^{−1}, or µ_1 < t_1^{−1} respectively, the exponential moments are bounded:

Proof. The statement of the lemma is justified similarly to the proof of Corollary 4.4. The bounds (4.11) and (4.12) imply the bounds for the corresponding moment generating functions: where C(p, r) is as in (4.8).
Proof. Integrating by parts and Lemma 4.5 yield for the second assertion where η ∼ χ 2 p . The first assertion is proved similarly.
Proof of Theorem 3.1 (The theoretical choice of the critical values). The risk corresponding to the adaptive estimate can be represented as a sum of the risks of the false alarms at each step of the procedure: By the definition of the last accepted estimate θ̆_k, for any m = 1, …, k − 1 the event {θ̆_k = θ̂_m} happens if for some l = 1, …, m the statistic T_{l,m+1} > z_l. Thus It holds also that for any positive µ This simple fact and the Cauchy–Schwarz inequality imply for m = 1, …, k − 1 the following bound: This, together with the bound from Lemma 4.7, gives Since u^{r(k−l)} ≤ u^{r(K−l)} for any l < k ≤ K, the choice with C(p, r) = log 2 · 2^{2r}[Γ(2r + p/2)Γ(p/2)]^{1/2}/Γ(r + p/2) provides the required bound E_{0,Σ}|(θ̆_l − θ̂_l)^⊤B_l(θ̆_l − θ̂_l)|^r ≤ αC(p, r) for all l = 2, …, K.
The second assertion follows from the observation that Assumption (A4), due to the equality (4.13), also holds for the Kronecker product: (4.14) Therefore

Proof. The condition (4.15) implies W_l Σ^{−1} W_m = diag(w_{l,1}w_{m,1}/σ²_1, …, w_{l,n}w_{m,n}/σ²_n) = W_l Σ^{−1} for any l ≤ m. Thus the blocks of Σ_k simplify to and Σ_k has a simple structure: Then the determinant of Σ_k coincides with the determinant of the following irreducible block triangular matrix: whose diagonal blocks are nonsingular. By (A1) and (A2), B_l ≻ 0 for any l. By (A5) there exists u_0 > 1 such that B_l ⪰ u_0 B_{l−1}, therefore In the "nonparametric situation" the moment generating function (mgf) of the joint distribution of θ̂_1, …, θ̂_K is Thus, provided that Σ_{K,0} ≻ 0, it holds that vec Θ_K ∼ N(vec Θ*_K, Σ_{K,0}). Similarly, in the "parametric situation", if Σ_K ≻ 0, then the joint distribution of vec Θ_K is N(vec Θ_K, Σ_K) with the mgf:

Proof. Let γ ∈ R^{pK} be written in the partitioned form γ^⊤ = (γ_1^⊤, …, γ_K^⊤) with γ_l ∈ R^p, l = 1, …, K. Then the mgf of the centered random vector vec Θ_K − vec Θ*_K ∈ R^{pK}, due to the decomposition (2.8) θ̂_l = θ*_l + D_l Σ_0^{1/2} ε with D_l = B_l^{−1} Ψ W_l, can be represented as follows: The trivial observation that Σ_{l=1}^K D_l^⊤ γ_l is a vector in R^n and that Σ_0^{1/2} ε ∼ N(0, Σ_0) by (1.1) implies, by the definition of Σ_{K,0}, the first assertion of the lemma, because here D_K is defined by (3.23).

Proof of Theorem 3.2 (Propagation property).
Notice that for any nonnegative measurable function g = g(Θ_k) the Cauchy–Schwarz inequality implies with the Radon–Nikodym derivative Z_k = dIP^k_{f,Σ_0}/dIP^k_{θ,Σ}. One gets the first assertion by taking g = |(θ̂_k − θ)^⊤B_k(θ̂_k − θ)|^{r/2} and applying the "parametric risk bound" with δ = 0 from (4.7): The second assertion is treated similarly by application of the pivotality property from Lemma 4.1 and the propagation conditions (2.17).
To calculate E_{θ,Σ}[Z_k²], let us consider log Z_k, given by as a function of vec Θ*_k. Application of the Taylor expansion at the point vec Θ_k yields With ξ ∼ N(0, I_{pk}) the second moment of the Radon–Nikodym derivative reads as follows To estimate the obtained expression in terms of the level of noise misspecification δ, notice that condition (3.8) implies Therefore the quantity in the exponent in (4.26) is bounded by: Moreover, Finally, In the case of homogeneous errors the expression for log Z_k reads as log Z_k = pk log(σ/σ_0) + ½(⋯). By Assumption (A4), (4.28) where p is the dimension of the parameter set and k is the degree of the localization.