Sparse model selection under heterogeneous noise: Exact penalisation and data-driven thresholding

Abstract: We consider a Gaussian sequence space model X_λ = f_λ + ξ_λ, where the noise variables (ξ_λ)_λ are independent but have heterogeneous variances (σ_λ^2)_λ. Our goal is to estimate the unknown signal vector (f_λ) by a model selection approach, focusing on the situation where the non-zero entries f_λ are sparse. The heterogeneous case is much more involved than the homogeneous model where σ_λ^2 = σ^2 is constant: we can no longer profit from symmetry inside the stochastic process that one needs to control, and the problem and the penalty depend not only on the number of coefficients that one selects, but also on their position. This appears also in the minimax bounds, where the worst coefficients go to the larger variances. With a careful and explicit choice of the penalty, however, we are able to select the correct coefficients and obtain a sharp non-asymptotic control of the risk of our procedure. Some finite-sample results from simulations are provided.


Motivation and main results
We consider the following sequence space model

X_λ = f_λ + ξ_λ, λ ∈ Λ, (1.1)

where (f_λ) are the real-valued coefficients of a signal and the noise variables (ξ_λ) ~ N(0, Σ) have a diagonal covariance matrix Σ = diag(σ_λ^2). Here Λ is a finite, but large index set. This heterogeneous model appears in several frameworks where the variance fluctuates, for example in heterogeneous regression, coloured noise, fractional Brownian motion models or, especially, in statistical inverse problems. For the latter setting the literature is quite extensive, e.g. Johnstone and Silverman (1997); Abramovich and Silverman (1998); Cavalier et al. (2002); Cavalier (2004); Cavalier and Raimondo (2007); Cohen et al. (2004); Cavalier (2011); Donoho (1995); Hoffmann and Reiß (2008); Johnstone and Paul (2013); Rochet (2013), but it mostly focuses on specific questions like universal thresholding, asymptotic minimax rates or level-wise thresholding. Our goal here is to estimate the unknown parameter vector (f_λ) from the observations (X_λ) under general and unknown sparsity constraints. To this end a penalised empirical risk criterion, based on the so-called risk hull approach, is proposed for general families of possibly data-driven selection rules. This can be viewed as a (data-dependent) model selection procedure and results in a sparse oracle-type inequality.
Model selection is a core problem in statistics. One of the main references in the field dates back to the information criterion AIC by Akaike (1973), but there is a huge amount of more recent work on this subject, in particular precise analyses for high-dimensional and sparse data, see e.g. Birgé and Massart (2001); Golubev (2002); Abramovich et al. (2006); Massart (2007); Golubev (2011); Rochet (2013); Wu and Zhou (2013). Model selection is usually linked to the choice of a penalty, and its precise choice is the main difficulty in model selection both from a theoretical and a practical perspective. Moreover, there is a close relationship between model selection and the popular thresholding procedures, where coefficients below a certain noise-related level are suppressed, cf. Golubev (2002); Abramovich et al. (2006); Massart (2007). The idea is that the search for a "good penalty" in model selection is indeed very much related to the choice of a "good threshold" in wavelet procedures. There exists also a fascinating connection between false discovery rate (FDR) control and both thresholding and model selection, as studied in Abramovich et al. (2006); Benjamini and Hochberg (1995), which will become apparent in the second part of our paper. Our main structural assumption is that the parameter vector (f_λ) of interest is sparse, while we know neither the position nor the number of non-zero entries. Sparsity is one of the leading paradigms nowadays, and signals with a sparse representation in some basis (for example wavelets) or functions with sparse coefficients appear in many scientific fields, compare the discussions in Abramovich et al. (2006); Golubev (2002, 2011); Wu and Zhou (2013) among many others.
In this paper, we consider the sequence space model with heterogeneous errors. Our goal is to select among a family of models the best possible one by use of a data-driven selection rule. In particular, one has to deal with the special heterogeneous nature of the observations, which must be reflected by the choice of the penalty. The heterogeneous case is much more involved than the direct (homogeneous) model. Indeed, there is no longer symmetry inside the stochastic process that one needs to control, since each empirical coefficient has its own variance. The problem and the penalty depend not only on the number of coefficients that one selects, but also on their position. The penalty is in this sense non-local. We treat the case of general families of data-driven selection rules first and then specialise to the full subset selection procedures and the computationally much easier thresholding rules via an FDR-type control. Using our model selection approach, the procedures are almost exactly minimax (up to a factor 2). Moreover, the procedure is fully adaptive: the sparsity index γ_n is unknown, and we obtain an explicit penalty, valid in the mathematical proofs and directly applicable in simulations.
The heterogeneity also appears in the minimax lower bounds, where the coefficients in the least favourable model go to the larger variances. In the case of known sparsity γ_n, we also consider a non-adaptive threshold estimator and derive its minimax upper bound. This estimator exactly attains the lower bound for typical specifications of the noise levels (σ_λ^2) and is then minimax. The paper is organised as follows. In the following subsection we give examples of problems where our heterogeneous model appears. Section 2 specifies our method and provides a general oracle-type inequality for general families of selection rules. In Section 3 we consider the sparsity assumptions and obtain concrete results for the full subset selection and thresholding procedures. Section 4 presents the results on minimax lower and upper bounds. In Section 5 we present numerical results that document the finite-sample properties of the methods, and we discuss implementation issues. All proofs are deferred to Section 6.

Heterogeneous regression
Consider first a model of heterogeneous regression,

Y_i = f(x_i) + σ(x_i) ε_i, i = 1, . . . , n,

where the ε_i are i.i.d. standard Gaussian, the variances σ^2(x_i) fluctuate depending on the design points x_i, and f is some spiky unknown function. In this model Λ = {1, . . . , n}. By a spiky function we mean that f(x_i) is close to zero apart from a small subset of all design points x_i. These signals are frequently encountered in applications (though rarely modelled in theoretical statistics), e.g. when measuring absorption spectra in physical chemistry (rare, well-localised and strong signals) or jumps in log returns of asset prices (log-price increments which fluctuate at low levels except when larger shocks occur).

Coloured noise
Often in applications coloured noise models are adequate. Let us consider here the problem of estimating an unknown function observed with noise generated by a fractional Brownian motion,

Y(t) = ∫_0^t f(u) du + ε W_{-α}(t), t ∈ [0, 1], (1.2)

where f is an unknown 1-periodic function in L^2(0, 1) with ∫_0^1 f(t) dt = 0, ε is the noise level and W_{-α} is a fractional Brownian motion of index α (see, e.g., Sowell (1990)). The fractional Brownian motion appears in econometric applications to model long-memory phenomena, e.g. in Comte and Renault (1996). We are not interested in the fractional Brownian motion itself, but want to estimate the unknown function f based on the noisy data Y(t), as in Cavalier (2004); Johnstone (2011); Wang (1996). The model (1.2) is close to the standard Gaussian white noise model, which corresponds to the case α = 0; here, however, the behaviour of the noise is different. Let us point out the potential use of our approach here.
An important tool is fractional integration. In this framework, if the function f is supposed to be 1-periodic, then the natural approach is to consider the periodic version of fractional integration (given in (1.3)), so that (see p. 135 in Zygmund (1959)) the relation (1.4) holds. By integration and projection on the cosine (or sine) basis and using (1.4), one obtains the sequence space model (as in Cavalier (2004)).
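The mechanism behind this reduction can be sketched as follows; this is a standard property of the Weyl (periodic) fractional integral, stated here in our notation rather than that of (1.3)-(1.4):

```latex
% Periodic (Weyl) fractional integration acts diagonally on Fourier coefficients:
\[
  f(t) = \sum_{k \neq 0} c_k\, e^{2\pi i k t}
  \quad\Longrightarrow\quad
  (I_\alpha f)(t) = \sum_{k \neq 0} c_k\, (2\pi i k)^{-\alpha}\, e^{2\pi i k t}.
\]
```

Since each frequency is simply multiplied by a power of k, projecting the data onto the trigonometric basis yields independent coefficients whose noise levels grow (or decay) polynomially in the frequency, i.e. a sequence space model of the form (1.1) with heterogeneous variances.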

Inverse problems
Consider the following framework of a general inverse problem,

Y = Af + ε Ẇ,

where A is a known injective, compact, linear bounded operator, f an unknown d-dimensional function, Ẇ a Gaussian white noise and ε > 0 the noise level. We use here the framework of the Singular Value Decomposition (SVD), see e.g. Cavalier (2011). Denote by ϕ_λ the eigenfunctions of the operator A*A associated with the strictly positive eigenvalues b_λ^2 > 0. Remark that any function f may be decomposed in this orthonormal basis as f = Σ_{λ∈Λ} f_λ ϕ_λ.
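For completeness, the standard reduction to the sequence space model (1.1) can be sketched as follows (the notation u_λ for the normalised image basis is ours):

```latex
% SVD reduction of Y = Af + \varepsilon\dot W to the sequence model (1.1):
% with A^*A\varphi_\lambda = b_\lambda^2\varphi_\lambda, the functions
% u_\lambda := b_\lambda^{-1} A\varphi_\lambda form an orthonormal system, and
\[
  \langle Y, u_\lambda\rangle
    = b_\lambda f_\lambda + \varepsilon\,\xi_\lambda,
  \qquad \xi_\lambda \sim \mathcal N(0,1)\ \text{i.i.d.},
\]
\[
  X_\lambda := b_\lambda^{-1}\langle Y, u_\lambda\rangle
             = f_\lambda + \varepsilon\, b_\lambda^{-1}\,\xi_\lambda,
  \qquad\text{i.e.}\quad \sigma_\lambda^2 = \varepsilon^2 b_\lambda^{-2}.
\]
```

Decaying singular values b_λ thus translate directly into growing, heterogeneous noise levels σ_λ in model (1.1).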

Data-driven subset selection
We consider the sequence space model (1.1) for the coefficients of an unknown L^2-function f with respect to an orthonormal system (ψ_λ). For a subset selection rule h = (h_λ)_λ ∈ {0, 1}^Λ the estimator over an arbitrarily large, but finite index set Λ is then defined by

f̂_h = Σ_{λ∈Λ} h_λ X_λ ψ_λ.

We write |h| = #{λ : h_λ = 1} and n = #Λ for the cardinality of Λ. By ‖A‖ we denote the operator norm of a matrix A, i.e. the largest absolute eigenvalue. The random elements (X_λ)_λ take values in the sample space X = R^Λ. We now consider an arbitrary family H ⊆ H_0 := {h : X → {0, 1}^Λ} of Borel-measurable data-driven subset selection rules. Define an estimator by minimising over the family H the penalised empirical risk:

h* ∈ argmin_{h∈H} ( −Σ_{λ∈Λ} h_λ X_λ^2 + pen(h) ). (2.1)

Remark that h* is defined in an equivalent way by minimising Σ_{λ∈Λ} (1 − h_λ) X_λ^2 + pen(h). Then, define the data-driven estimator f̂ = f̂_{h*}. In order to find an explicit and adequate penalty, we follow Cavalier and Golubev (2006) and apply the concept of a risk hull ℓ(f, h). The function ℓ provides an upper bound for the stochastic error in estimating f uniformly over possibly random selection rules h. Although it may thus still be stochastic, it is much easier to work with since it no longer depends on (X_λ) directly. The following penalty function turns out to be natural:

pen(h) = Σ_{j=1}^{|h|} 2σ_(j)^2(h) ( log(ne/j) + log_+(n‖Σ‖)/j ), (2.3)

where σ_(j)^2(h) denotes the j-th largest value among {h_λ σ_λ^2} and log_+(z) = max(log z, 0). In general the second summand is of lower order and mainly provides non-asymptotic control. Note further that in the homogeneous case σ_λ = σ the close approximation Σ_{j=1}^{|h|} log(ne/j) ≈ |h| log(ne/|h|) shows that the typical 2σ^2 k log(n/k) form of the penalty for the sparsity index k is recovered, see Abramovich et al. (2006) for a survey of different results in this direction.
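The penalty pen(h) = Σ_{j=1}^{|h|} 2σ_(j)^2(h)(log(ne/j) + log_+(n‖Σ‖)/j) is straightforward to evaluate numerically. The following sketch is ours (function names and vectorisation are not from the text); for diagonal Σ the operator norm is simply the largest variance.

```python
import numpy as np

def penalty(h, sigma2, op_norm):
    """Evaluate the paper's penalty for a selection rule h in {0,1}^n.

    h       : 0/1 array, h[lam] = 1 if coefficient lam is selected
    sigma2  : array of noise variances sigma_lam^2
    op_norm : operator norm ||Sigma||; equals max(sigma2) for diagonal Sigma
    """
    n = len(sigma2)
    # variances of the selected coefficients, sorted largest first
    sel = np.sort(sigma2[h.astype(bool)])[::-1]
    j = np.arange(1, len(sel) + 1)
    log_plus = max(np.log(n * op_norm), 0.0)  # log_+(n * ||Sigma||)
    return float(np.sum(2.0 * sel * (np.log(n * np.e / j) + log_plus / j)))
```

In the homogeneous case sigma2 = σ²·(1, …, 1) this reduces to the familiar 2σ² Σ_j log(ne/j) ≈ 2σ²|h| log(ne/|h|) behaviour discussed above.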
The next lemma shows that the penalty indeed provides a risk hull. The proof is based on bounding the tails of the corresponding order statistics and on worst-case permutations of the entries. We are not going to dwell on measurability issues here, taking outer expectations if necessary, which is easily justified by the arguments.

Lemma. The function
Based on this lemma we are able to derive a general oracle inequality for data-driven subset selection, which forms the first fundamental result of the paper.

Sparse representations
Let us consider the intuitive version of sparsity by assuming a small proportion of non-zero coefficients, in the spirit of Abramovich et al. (2006), i.e. the family

F_0(γ_n) = { f ∈ R^Λ : #{λ : f_λ ≠ 0} ≤ nγ_n },

where γ_n denotes the maximal proportion of non-zero coefficients. Throughout, we assume that this proportion satisfies asymptotically γ_n → 0 and nγ_n → ∞.
We first derive the results for the interesting cases of full subset selection and adaptive thresholding before discussing their relevance and comparing them with the literature.

Full subset selection
The goal here is to study the accuracy of full model selection over the whole family of estimators, which offers 2^n possibilities to select a sub-model: each coefficient may be chosen to be inside or outside the model. Let us consider the case where H denotes all deterministic subset selections,

H = {0, 1}^Λ. (3.1)

We strive for an oracle inequality that involves only the noise levels of the truly active coordinates, i.e. involving h_f := (1(f_λ ≠ 0))_{λ∈Λ}. To this end, write Σ_{h} for the covariance matrix of the ξ_λ restricted to the indices λ with h_λ = 1.
3.1 Theorem. Let h* be the data-driven rule defined in (2.1) with H as in (3.1). Then, for n → ∞, uniformly over f ∈ F_0(γ_n), an oracle-type risk bound holds, which simplifies further if log_+(‖Σ‖) = O(log n) (i.e., any polynomial growth of ‖Σ‖ is admissible).

In Section 5 we shall propose a greedy algorithm to find the full subset selection rule approximately. As the results there indicate, however, the statistical properties of the full subset selection rule are not so impressive that a profound study of algorithmic ways to overcome the complexity bound O(2^n) seems necessary. In fact, adaptive thresholding offers, not only in practice but also theoretically, a simple and quite successful way to perform sparse model selection.

Threshold estimators
Consider now a family of adaptive threshold estimators, where the problem is to study the data-driven selection of the threshold. Let us consider the case where H consists of the threshold selection rules with arbitrary threshold values t > 0:

H = { h^(t) : h^(t)_λ = 1(|X_λ| > t σ_λ), t > 0 }.

Note that H consists of only n = #Λ essentially different subset selection rules and can be implemented efficiently using the order statistics of (|X_λ|/σ_λ)_λ.
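Since lowering the threshold t just adds coefficients in decreasing order of |X_λ|/σ_λ, the data-driven threshold can be found by a single scan over the n candidate values. The sketch below is ours: it uses the penalised empirical risk −Σ h_λ X_λ² + pen(h) with the penalty constant 2 as in (2.3), and recomputes the penalty naively at each step for clarity.

```python
import numpy as np

def adaptive_threshold(X, sigma, op_norm):
    """Select the threshold rule h_t(lam) = 1(|X_lam| > t*sigma_lam)
    minimising the penalized empirical risk over the n candidate
    thresholds t = |X_lam| / sigma_lam (a sketch, our implementation)."""
    n = len(X)
    order = np.argsort(-np.abs(X) / sigma)   # decreasing |X|/sigma
    log_plus = max(np.log(n * op_norm), 0.0)

    best_risk = 0.0                          # empty selection has risk 0
    best_h = np.zeros(n, dtype=int)
    h = np.zeros(n, dtype=int)
    for lam in order:                        # lower the threshold step by step
        h[lam] = 1
        # penalized empirical risk  -sum_lam h_lam X_lam^2 + pen(h)
        sel = np.sort(sigma[h.astype(bool)] ** 2)[::-1]
        j = np.arange(1, len(sel) + 1)
        pen = np.sum(2.0 * sel * (np.log(n * np.e / j) + log_plus / j))
        risk = -np.sum(h * X ** 2) + pen
        if risk < best_risk:
            best_risk, best_h = risk, h.copy()
    return best_h
```

A heap or incremental penalty update would reduce the naive O(n²) scan to O(n log n), but the simple version suffices to illustrate the rule.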
Assuming for ‖Σ‖ the growth bounds in (3.5), the corresponding oracle inequality holds for the threshold family.

Discussion
Heterogeneous case. One may compare the method and its accuracy with other results in related frameworks. For example, Rochet (2013) considers a very close framework of model selection in inverse problems by using the SVD approach, which results in a noise (ξ_λ) that is heterogeneous with diagonal covariance. Johnstone (2011); Johnstone and Paul (2013) study the related topic of inverse problems and the Wavelet-Vaguelette Decomposition (WVD), built on Birgé and Massart (2001). The framework in Johnstone (2011) is more general than ours; however, this leads to less precise results. In all their results Johnstone and Paul (2013); Rochet (2013) use universal constants which are not really controlled. This is even more important for the constants inside the method, for example in the penalty. Our method contains an explicit penalty, used in the mathematical results and also in the simulations without additional tuning. A possible extension of our method to the dependent WVD case does not seem straightforward.
Homogeneous case. Let us compare with other work for the homogeneous setting Σ = σ^2 Id. There exist a lot of results in this framework, see e.g. Abramovich et al. (2006); Johnstone (2011); Massart (2007); Wu and Zhou (2013). Again those results contain universal constants, not only in the mathematical results, but even inside the methods: for example, constants in front of the penalty, but also inside the FDR technique, with a hyper-parameter q_n which has to be tuned. The perhaps closest paper to our work is Golubev (2011) in the homogeneous case. Our penalty is analogous to "twice the optimal" penalty considered in Golubev (2011). This is due to difficulties in the heterogeneous case, where the stochastic process that one needs to control is much more involved. Indeed, there is no longer symmetry inside this stochastic process, since each empirical coefficient has its own variance, and the problem and the penalty depend not only on the number of coefficients that one selects, but also on their position. It is not obvious how the theory of ordered processes, the main tool in the homogeneous case, extends to the heterogeneous setting.
This leads to our main risk term 4‖Σ‖nγ_n log(γ_n^{-1}), while the risk 2σ^2 nγ_n log(γ_n^{-1}) is obtained in Golubev (2011). The potential loss of the factor 2 in the heterogeneous framework may be avoidable in theory, but in simulations the results seem comparably less sensitive to this factor than to other modifications, e.g. to how many data points, among the nγ_n non-zero coefficients, are close to the critical threshold level. The latter defines some kind of effective sparsity of the problem (often much smaller than nγ_n). This effect is not treated in the theoretical setup of most FDR-related studies, where implicitly a worst-case scenario of the coefficients' magnitude is understood.

Lower bound
First we present a lower bound in the minimax sense over F 0 (γ n ). The proof is achieved by exhibiting a least favourable Bayesian prior on the signal coefficients. This reveals the statistical complexity of the model selection problem.
4.1 Theorem. For any estimator f̂_n based on n observations we have the minimax lower bound stated below. The sequence (c_n) is introduced to obtain uniform control of the coefficients and to make deviations from expectations sufficiently small. In specific cases the expression remains asymptotically the same when the supremum is taken over all sequences (α_λ) with α_λ ∈ [0, 1] and Σ_λ α_λ ≤ nγ_n. A more concrete lower bound is given in the following corollary.
4.2 Corollary. Distributing mass uniformly over the r_n indices with the largest values σ_λ in Theorem 4.1 yields, as n → ∞, a lower bound in terms of the inverse order statistics σ_(i)^2, provided log(n/r_n) = o(log(γ_n^{-1})) (i.e., r_n must be somewhat larger than nγ_n).
For polynomial growth σ_(i)^2 ∼ (n − i)^β, β > 0, the lower bound simplifies as n → ∞. Remark that the lower bound is a kind of weighted entropy. In contrast to the upper bounds above, the minimax (and the Bayes) lower bound does not involve the quantity ‖Σ_{h_f}‖, individual to each unknown f. In the proof for this heterogeneous model we conceptually need to allow for a high complexity of the class F_0(γ_n), leading to the entropy factor log(γ_n^{-1}), and to put more prior probability on coefficients with larger variance, which explains the abstract weighted entropy expression.

Upper bound
Consider now the setting where the sparsity γ n is known and a correctly tuned threshold estimator is applied in order to identify the unknown positions of the significant non-zero coefficients f λ . This leads to the following non-adaptive minimax upper bound.
Let us mention that for faster than polynomial growth we might well have β_n = o(‖Σ‖) log(γ_n^{-1}). So, in general the upper bound matches the lower bound exactly with respect to the term 2nγ_n log(γ_n^{-1}), while the influence of the heterogeneous noise depends on the specific case. This procedure, however, is non-adaptive since the threshold relies on the knowledge of the sparsity γ_n.

A numerical example
In Figure 1 a typical realisation of the coefficients f_λ is shown in blue, with 50 non-zero coefficients chosen uniformly on [−6, 6] and increasing noise level σ_λ = 0.01λ for λ = 1, . . . , 200. The inner black diagonal lines indicate the minimax threshold (with the oracle value of γ_n) and the outer diagonal lines the universal threshold. For each λ the non-blue points depict the noisy observation X_λ, obtained by adding N(0, σ_λ^2) noise to the blue point f_λ. Observations included in the adaptive full subset selection estimator are coloured green, while those included in the adaptive threshold estimator are the union of the green and yellow points (in fact, for this sample adaptive thresholding selects all full subset selected points); the discarded observations are shown in magenta.
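The data-generating setup of Figure 1 can be reproduced along the following lines; the random seed and the universal-threshold form t = √(2 log n) are our choices, not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed is our choice
n = 200
sigma = 0.01 * np.arange(1, n + 1)       # increasing noise level sigma_lam = 0.01*lam

# sparse signal: 50 non-zero coefficients, drawn uniformly on [-6, 6]
f = np.zeros(n)
support = rng.choice(n, size=50, replace=False)
f[support] = rng.uniform(-6.0, 6.0, size=50)

# observations X_lam = f_lam + xi_lam with xi_lam ~ N(0, sigma_lam^2)
X = f + sigma * rng.standard_normal(n)

# universal threshold (outer diagonal lines in Figure 1), our form:
t_univ = np.sqrt(2.0 * np.log(n))
h_univ = (np.abs(X) > t_univ * sigma).astype(int)
f_hat = h_univ * X                       # hard-thresholded estimator
```

Because σ_λ grows linearly, the threshold lines t·σ_λ open up diagonally with λ, so non-zero coefficients at high-noise indices are the hardest to recover, exactly the effect discussed below.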
We have run 1000 Monte Carlo experiments for the parameters n = 200, σ_λ = 0.01λ in the sparse (γ_n = 0.05) and dense (γ_n = 0.25) case. In Figure 2 the first 100 relative errors are plotted for the different estimation procedures in the dense case, which offers more detailed sample information than mere boxplots. The errors are taken as a quotient with the sample-wise oracle threshold value applied to the renormalised X_λ/σ_λ. Therefore only the full subset selection can sometimes have relative errors less than one. Table 1 lists the relative Monte Carlo errors for the two cases. The last column reports the relative error of the oracle procedure with h_λ = 1(f_λ ≠ 0) that discards all observations X_λ with f_λ = 0 (not noticing the model selection complexity).
The simulation results are quite stable under variations of the setup. Altogether the thresholding works globally well. The (approximate) full subset selection procedure (see below for the greedy algorithm used) is slightly worse and exhibits a higher variability, but is still pretty good.

Figure 2: adaptive (blue), universal (magenta) and minimax (yellow) thresholding, full subset selection (green).

Table 1: Relative errors from 1000 Monte Carlo simulations as in Figure 1.

By construction, in the dense case the oracle minimax threshold works better than the universal threshold, while the universal threshold works better in very sparse situations. The reason why the minimax threshold does not work so well even with a theoretical oracle choice of γ_n is that the entire theoretical analysis is based upon the potentially most difficult signal-to-noise ratios, that is, coefficients f_λ of the size of the threshold or the noise level. Here, however, the effective sparsity is larger (i.e., the effective γ_n is smaller), because the uniformly generated non-zero coefficients can be relatively small, especially at indices with high noise level, see also Figure 1.

Let us briefly describe how the adaptive full subset selection procedure has been implemented. The formula (2.3) attributes to each selected coefficient X_λ the individual penalty p_λ(h) = 2σ_λ^2 (log(ne/r_λ(h)) + log_+(n‖Σ‖)/r_λ(h)) with the inverse rank r_λ(h) of h_λ σ_λ^2 among (h_λ σ_λ^2)_λ. Since each individual penalty is at most 2σ_λ^2 (log(ne) + log_+(n‖Σ‖)), all coefficients with X_λ^2/σ_λ^2 ≥ 4(log(ne) + log_+(n‖Σ‖)) are included into h*_1 in an initial step. Then, iteratively, h*_i is extended to h*_{i+1} by including all coefficients with X_λ^2/σ_λ^2 ≥ 4(log(ne/r_λ(h*_i)) + log_+(n‖Σ‖)/r_λ(h*_i)). The iteration stops when no further coefficients can be included. The estimator h*_I at this stage definitely contains all coefficients also taken by h*. In a second iteration we now add, in a more greedy way, coefficients that decrease the total penalised empirical risk. Including a new coefficient X_{λ0} adds to the penalised empirical risk the (positive or negative) value

−X_{λ0}^2 + 4σ_{λ0}^2 (log(ne/r_{λ0}(h*_I)) + log_+(n‖Σ‖)/r_{λ0}(h*_I)) − 4 Σ_{λ: σ_λ < σ_{λ0}, (h*_I)_λ = 1} σ_λ^2 (log(1 + 1/r_λ(h*_I)) + log_+(n‖Σ‖)/(r_λ(h*_I)(r_λ(h*_I) + 1))).
Here, r_{λ0}(h*_I) is to be understood as the rank at λ_0 when setting (h*_I)_{λ0} = 1. Consequently, the second iteration extends h*_I each time by one coefficient X_{λ0} for which the displayed formula gives a negative value, until no further reduction of the total penalised empirical risk is obtainable. This second greedy optimisation does not necessarily yield the optimal full subset selection solution, but in practice it most often yields a coefficient selection h* with a significantly smaller penalised empirical risk than the adaptive threshold procedure. The numerical complexity of the algorithm is of order O(n^2) due to the second iteration, in contrast to the exponential order O(2^n) when scanning all possible subsets. A more refined analysis of our procedure would be interesting, but might have minor statistical impact in view of the good results for the straightforward adaptive thresholding scheme.
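The two-stage procedure above can be sketched as follows. This is our implementation of the description, not the authors' code: the constant 4 follows the algorithmic description, ties in the variances are broken optimistically by a strict inequality in the rank, and the marginal risk change of stage 2 is computed by naive recomputation of the penalty rather than by the displayed incremental formula.

```python
import numpy as np

def greedy_subset(X, sigma, op_norm):
    """Two-stage greedy approximation to full subset selection (a sketch)."""
    n = len(X)
    lp = max(np.log(n * op_norm), 0.0)       # log_+(n * ||Sigma||)
    s2 = sigma ** 2

    def pen(h):
        # sum of rank-dependent individual penalties 4*sigma^2*(...)
        sel = np.sort(s2[h.astype(bool)])[::-1]
        j = np.arange(1, len(sel) + 1)
        return np.sum(4.0 * sel * (np.log(n * np.e / j) + lp / j))

    def rank(h, lam):
        # rank of sigma_lam^2 among the selected variances (1 = largest)
        return 1 + int(np.sum(s2[h.astype(bool)] > s2[lam]))

    # stage 1: start with coefficients exceeding the maximal penalty level,
    # then iterate inclusion with rank-dependent bounds until stable
    h = (X ** 2 / s2 >= 4.0 * (np.log(n * np.e) + lp)).astype(int)
    changed = True
    while changed:
        changed = False
        for lam in range(n):
            if h[lam]:
                continue
            r = rank(h, lam)
            if X[lam] ** 2 / s2[lam] >= 4.0 * (np.log(n * np.e / r) + lp / r):
                h[lam] = 1
                changed = True

    # stage 2: greedily add single coefficients that lower the total
    # penalized empirical risk  -sum_lam h_lam X_lam^2 + pen(h)
    risk = -np.sum(h * X ** 2) + pen(h)
    improved = True
    while improved:
        improved = False
        for lam in range(n):
            if h[lam]:
                continue
            h[lam] = 1
            new = -np.sum(h * X ** 2) + pen(h)
            if new < risk:
                risk, improved = new, True
            else:
                h[lam] = 0
    return h
```

The naive penalty recomputation makes each stage-2 pass O(n^2); maintaining the ranks incrementally recovers the O(n^2) total complexity stated in the text.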
This yields the asserted general bound, and inserting the bound for log_+(n‖Σ‖) directly gives the second bound.
The property P(f ∈ F_0(γ_n)) → 1 then implies that the Bayes-optimal risk, derived below, is an asymptotic minimax lower bound over F_0(γ_n).