Adaptive variable selection in nonparametric sparse additive models

We consider the problem of recovery of an unknown multivariate signal $f$ observed in a $d$-dimensional Gaussian white noise model of intensity $\varepsilon$. We assume that $f$ belongs to a class of smooth functions ${\cal F}^d\subset L_2([0,1]^d)$ and has an additive sparse structure determined by the parameter $s$, the number of non-zero univariate components contributing to $f$. We are interested in the case when $d=d_\varepsilon \to \infty$ as $\varepsilon \to 0$ and the parameter $s$ stays "small" relative to $d$. With these assumptions, the recovery problem at hand becomes that of determining which sparse additive components are non-zero. Attempting to reconstruct most non-zero components of $f$, but not all of them, we arrive at the problem of almost full variable selection in high-dimensional regression. For two different choices of ${\cal F}^d$, we establish conditions under which almost full variable selection is possible, and provide a procedure that achieves it. The procedure is optimal (in the asymptotically minimax sense) in selecting most non-zero components of $f$. Moreover, it is adaptive in the parameter $s$.


Introduction
In recent years, there has been much work on methods for variable selection in high dimensional settings; refer, for example, to [3,6,8,18] and references therein. Among a variety of methods proposed, the lasso has become an important tool for sparse high-dimensional regression problems. Motivated by the fact that finding the lasso solutions is computationally demanding, Genovese et al. [6] studied the relative statistical performance of the lasso and marginal regression, which is also known as simple thresholding, for sparse high-dimensional regression problems. They found that marginal regression, where each dependent variable is regressed separately on each covariate, provides a good alternative to the lasso, and concluded that their procedure merits further study. Handling the problem of reconstruction in high dimensional regression, Genovese et al. [6] distinguished between the cases of exact, almost full, and no recovery. Exact recovery refers to the situation where the set of all relevant components can be consistently recovered (asymptotically). Almost full recovery stands for the possibility of having the number of misclassified components negligibly small as compared to the number of all relevant components. The latter strategy requires milder restrictions on a statistical model and can be used in the situations where exact recovery is impossible. If neither exact nor almost full recovery can be achieved, we speak of 'no recovery' when the optimal risk is as large as the number of relevant components and any recovery procedure fails completely.
Ingster and Stepanova [13] extended the idea of Genovese et al. [6] to the case of nonparametric regression. Specifically, they addressed the problem of recovering sparse additive smooth signals observed in the continuous regression model and showed that, asymptotically, as dimension increases indefinitely, exact variable selection is possible and is provided by a suitable thresholding procedure. The procedure in [13] is optimal in the asymptotically minimax sense. It is also free from the sparsity parameter and thus is adaptive. At the same time, the more intricate problem of almost full recovery in an adaptive setup remained unsolved. We shall treat this problem in the present paper.
Our setting is that of a multivariate signal $f \in {\cal F}^d \subset L_2([0,1]^d) = L_2^d$ corrupted by a Gaussian white noise of a given intensity $\varepsilon$:
$$dX_\varepsilon(t) = f(t)\,dt + \varepsilon\,dW(t), \quad t \in [0,1]^d, \qquad (1)$$
where $W$ is a $d$-dimensional Gaussian white noise on $[0,1]^d$, $\varepsilon > 0$ is the noise intensity, and ${\cal F}^d$ is a subset of $L_2^d$ that consists of sufficiently smooth functions. In the present paper, two examples of ${\cal F}^d$ will be considered. In this model, the "observation" is the function $X_\varepsilon : L_2^d \to {\cal G}$ taking its values in the set ${\cal G}$ of normal random variables such that if $\xi = X_\varepsilon(\phi)$, $\eta = X_\varepsilon(\psi)$, where $\phi, \psi \in L_2^d$, then ${\rm E}(\xi) = (f, \phi)$, ${\rm E}(\eta) = (f, \psi)$, and ${\rm Cov}(\xi, \eta) = \varepsilon^2 (\phi, \psi)$. For any $f \in L_2^d$, the observation $X_\varepsilon$ determines the Gaussian measure ${\rm P}_{\varepsilon,f}$ on the Hilbert space $L_2^d$ with mean function $f$ and covariance operator $\varepsilon^2 I$, where $I$ is the identity operator (see [9,19] for references). The expectation that corresponds to the probability measure ${\rm P}_{\varepsilon,f}$ is denoted by ${\rm E}_{\varepsilon,f}$. In this paper, the case of growing dimension $d = d_\varepsilon \to \infty$ as $\varepsilon \to 0$ is studied. It is well known that the continuous model (1) serves as a good approximation to a more realistic equidistant sampling scheme with discrete Gaussian white noise. In such an approximation, $\varepsilon^{-2}$ roughly corresponds to the number $n$ of observations per unit cube $[0,1]^d$. An important problem in this context is to recover $f$ from noisy data. Attempting to suppress the curse of dimensionality and complement the findings in [13], we assume that $f$ has an additive sparse structure. Our goal is to study under what conditions and by means of what procedure almost full recovery of an additive sparse signal $f$ is possible. In other words, we wish to correctly identify most non-zero components of $f$. In doing so, we aim at providing a procedure that is optimal in the asymptotically minimax sense for the two function spaces ${\cal F}^d$ of interest, one consisting of functions of finite smoothness and the other consisting of functions of infinite smoothness.
In the almost full recovery regime, one can detect even smaller relevant components but, unfortunately, at the price of a loss in the rate. Therefore constructing the corresponding procedure is technically more demanding as compared to that in the exact recovery case. To develop a good almost full recovery procedure, we will use results from minimax hypothesis testing and minimax estimation theory.
To fix some notation and assumptions, let the signal $f$ in model (1) be of the form (see, for example, [5] and [13])
$$f(t) = \sum_{j=1}^d \eta_j f_j(t_j), \quad t = (t_1, \ldots, t_d) \in [0,1]^d,$$
where for a number $s \in \{1, \ldots, d\}$, called the sparsity parameter,
$$\eta = (\eta_1, \ldots, \eta_d) \in H_{d,s} = \Big\{ \eta : \eta_j \in \{0,1\},\ \sum_{j=1}^d \eta_j = s \Big\}.$$
The $\eta_j$'s are non-random quantities taking values 0 and 1; the case $\eta_j = 1$ ($\eta_j = 0$) corresponds to the situation when the component $f_j$ is active (non-active). When $s = o(d)$ we speak of a sparse additive signal $f$. In addition, each component $f_j$ is assumed to be an element of a certain smooth function space ${\cal F}_\sigma \subset L_2[0,1]$ depending on a known parameter $\sigma > 0$; two examples of ${\cal F}_\sigma$ under study are introduced in Section 2. Thus, the class of $s$-sparse multivariate signals of interest is
$${\cal F}^d_{s,\sigma} = \Big\{ f : f(t) = \sum_{j=1}^d \eta_j f_j(t_j),\ f_j \in {\cal F}_\sigma,\ \eta \in H_{d,s} \Big\},$$
where the components satisfy a side condition that guarantees uniqueness, and the signal recovery problem becomes that of determining which sparse additive components are non-zero.
In the context of variable selection, the problem of reconstruction of an additive function $f$ is now stated as follows. For each component $f_j$ of a signal $f \in {\cal F}^d_{s,\sigma}$, consider testing the hypothesis of no signal $H_{0j}: f_j = 0$ versus the alternative $H_{1j}: f_j \in {\cal F}_\sigma(r_\varepsilon)$, where
$${\cal F}_\sigma(r_\varepsilon) = \{ g \in {\cal F}_\sigma : \|g\|_\sigma \le 1,\ \|g\|_2 \ge r_\varepsilon \} \qquad (2)$$
for a positive family $r_\varepsilon \to 0$, and $\|\cdot\|_\sigma$ is a norm on ${\cal F}_\sigma$. In this problem, a precise demarcation between the signals that can be detected with error probabilities tending to 0 and the signals that cannot be detected is given in terms of a detection boundary, or separation rate, $r^*_\varepsilon \to 0$ as $\varepsilon \to 0$. For various function classes frequently used in minimax hypothesis testing, sharp asymptotics for $r^*_\varepsilon$ are available (see, for example, [10]). The hypotheses $H_{0j}$ and $H_{1j}$ separate asymptotically (that is, the minimax error probability tends to zero) if $r_\varepsilon / r^*_\varepsilon \to \infty$ as $\varepsilon \to 0$. The hypotheses $H_{0j}$ and $H_{1j}$ merge asymptotically (that is, the minimax error probability tends to one) if $r_\varepsilon / r^*_\varepsilon \to 0$ as $\varepsilon \to 0$. When $H_{0j}$ and $H_{1j}$ separate asymptotically, we say that $f_j$ is detectable. If the hypotheses $H_{0j}$ and $H_{1j}$ separate (merge) asymptotically when $\liminf r_\varepsilon / r^*_\varepsilon > 1$ ($\limsup r_\varepsilon / r^*_\varepsilon < 1$), the detection boundary $r^*_\varepsilon$ is said to be sharp. Knowing a sharp detection boundary $r^*_\varepsilon$ makes the problem of testing $H_{0j}: f_j = 0$ versus $H_{1j}: f_j \in {\cal F}_\sigma(r_\varepsilon)$ meaningful: it suffices to choose $r_\varepsilon$ so that $\liminf_{\varepsilon \to 0} r_\varepsilon / r^*_\varepsilon > 1$. Otherwise, the function $f_j$ will be too "small" to be noticeable.
Let us agree to say that any measurable function $\eta^* = \eta^*(X_\varepsilon)$ taking values in $\{0,1\}^d$ is a selector. Following [6] and [13], we judge the quality of a selector $\eta^*$ of a vector $\eta \in H_{d,s}$ by using the Hamming distance on $\{0,1\}^d$, which counts the number of positions at which $\eta^* = (\eta^*_1, \ldots, \eta^*_d)$ and $\eta = (\eta_1, \ldots, \eta_d)$ differ:
$$|\eta^* - \eta| = \sum_{j=1}^d |\eta^*_j - \eta_j|.$$
Following [6], we distinguish between exact and almost full recovery. Roughly, a selector $\eta^* = \eta^*(X_\varepsilon)$ is asymptotically exact if its maximum risk is $o(1)$. Likewise, a selector $\eta^* = \eta^*(X_\varepsilon)$ is asymptotically almost full if its maximum risk is $o(s)$, with $s$ being the number of non-zero components $f_j$ of a signal $f = \sum_{j=1}^d \eta_j f_j$. Ingster and Stepanova [13] obtained an adaptive procedure that gives asymptotically exact reconstruction of a $\sigma$-smooth signal $f \in {\cal F}^d_{s,\sigma}$ observed in a $d$-dimensional Gaussian white noise model. A similar result for the space of infinitely smooth functions is stated in this paper in Section 4.2 (see Theorems 1 and 2). Although the selector in Section 4.2 is based on somewhat different statistics than the one in [13], both selectors share one feature: their thresholds are free of the sparsity parameter $s$ and therefore automatically adapt to its values.
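The loss used to judge a selector can be illustrated with a short sketch. The function name `hamming_loss` and the toy vectors are illustrative assumptions, not part of the paper; the sketch only counts the positions at which a selector and the true indicator vector disagree and relates that count to $s$.

```python
import numpy as np

def hamming_loss(eta_hat, eta):
    """Number of coordinates at which the selector and the true vector differ."""
    eta_hat = np.asarray(eta_hat, dtype=int)
    eta = np.asarray(eta, dtype=int)
    if eta_hat.shape != eta.shape:
        raise ValueError("selector and truth must have the same shape")
    return int(np.sum(np.abs(eta_hat - eta)))

# Toy example (hypothetical values): d = 8 components, s = 3 of them active.
eta = np.array([1, 0, 0, 1, 0, 0, 1, 0])
eta_hat = np.array([1, 0, 1, 1, 0, 0, 0, 0])  # one false positive, one miss

loss = hamming_loss(eta_hat, eta)
s = int(eta.sum())
print(loss, loss / s)  # loss relative to the number of active components
```

Exact recovery asks that the expected loss itself vanish, whereas almost full recovery only asks that the expected loss be small relative to `s`.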
The goal of this paper is three-fold. First, we find a sharp detection boundary that allows us to separate detectable components of a signal f ∈ F d s,σ from non-detectable ones. Next, assuming that all active components f j are detectable and that s belongs to a set S d , which puts some mild restrictions on the range of s, we construct a selector η * = η * (X ε ) with the property Finally, we show that if at least one of the f j 's is undetectable, then that is, almost full recovery is impossible. The selector η * that satisfies (3) is said to provide asymptotically almost full recovery of a signal f ∈ F d s,σ in model (1); its maximum risk is small relative to the number of non-zero components. If, in addition, inequality (4) holds true, then the selection procedure based on η * is the best possible (in the asymptotically minimax sense). The notion of optimality that we use is borrowed from the minimax hypothesis testing theory.
In the present setup, adaptive (in s) variable selection in high dimensions presents several challenges. First, one has to construct a good non-adaptive selector. Second, having that selector available, one has to adapt it to unknown values of the parameter s. It turns out that, when s is known, both exact and almost full recovery can be achieved by a suitably designed thresholding procedure (see Section 3.1 for details). The problem of adapting this procedure to unknown values of s was tackled and solved in [13], but only in the case of exact recovery. Handling the same problem in the almost full recovery case leads us in this paper to the use of Lepski's method. This method was proposed for adaptive estimation in a Gaussian white noise model. The reason why adaptive reconstruction of most relevant components of f is more challenging than adaptive reconstruction of all components of f lies in the very nature of the thresholding procedure as defined in (20). In contrast to the exact selector given by (22), whose threshold is set regardless of the value of s, the threshold in (20) does depend on s.
The paper is organized as follows. In Section 2 we present some general results of the asymptotically minimax hypothesis testing theory and provide details on their use for the two function spaces of our interest. In Section 3 we translate the initial problem to the one in terms of the Fourier coefficients and, for both function spaces in hand, obtain almost full selectors for a known sparsity parameter s. In addition to that, we derive conditions under which almost full variable selection is possible. Adaptive selectors for the function spaces in hand are developed in Section 4. To complete the picture, we also introduce an adaptive selection procedure that gives exact reconstruction for the space of analytic functions. Our main results are stated in Section 4 and proved in Section 5.

The building blocks
As in [13], the recovery problem under study will be connected to that of hypothesis testing. Before stating and proving our main results, we shall discuss some important tools of minimax hypothesis testing that will be used in the subsequent sections. For a complete exposition of the subject, see [15] and the review papers [10,11,12].

Extreme problem for ellipsoids: general case
In asymptotically minimax hypothesis testing, when dealing with classes of smooth functions, the first common step is to transform the initial problem involving a class of functions to the corresponding problem in the space of Fourier coefficients. For this, let {φ k (x)} k∈Z be the orthonormal basis in L 2 [0, 1] given by is the kth Fourier coefficient of g, and g 2 2 = k∈Z θ 2 k . Let F σ be a function space depending on a parameter σ > 0 that is a subset of L 2 [0, 1]. Suppose that g ∈ F σ ⊂ L 2 [0, 1] is observed in a univariate Gaussian white noise of intensity ε, and we wish to test the null hypothesis H 0 : g = 0 versus a sequence of alternatives H 1ε : g ∈ F σ (r ε ), where the set F σ (r ε ) is given by (2). For the two function spaces of interest, the norm of an element g is expressed as (8) and (12) below). In the sequence space of Fourier coefficients, the set F σ (r ε ) corresponds to the ellipsoid in the space l 2 (Z) with semi-axis c k = c k (σ) and a small neighbourhood of the point θ = 0 removed: For constructing an asymptotically almost full selector, we shall need some facts from the minimax theory of hypothesis testing. Denote by θ * (r ε ) = (θ * k (r ε )) k∈Z the solution to the extreme problem and let u 2 ε (r ε ) = u 2 ε (Θ σ (r ε )) be the value of the problem, that is, The function u 2 ε (r ε ) plays a key role in the minimax theory of hypothesis testing. It controls the minimax total error probability and is used to set a cut-off point of the asymptotically minimax test procedure. The detection boundary r * ε in the problem of testing H 0 : θ = 0 versus H 1 : θ ∈ Θ σ (r ε ) is determined by the relation u ε (r * ε ) ≍ 1. The function u ε (r ε ) is a non-decreasing function of the argument r ε which possesses a kind of 'continuity' property. Namely, for any ǫ > 0 there exist ∆ > 0 and ε 0 > 0 such that for any δ ∈ (0, ∆) and ε ∈ (0, ε 0 ), These and some other facts about u 2 ε (r ε ) can be found in [10,Sec. 3.2] and [15,Sec. 5.2.3]). 
For standard function spaces with the norm $\|g\|_\sigma$ defined (under the periodic constraints) in terms of Fourier coefficients as $\|g\|_\sigma^2 = \sum_{k \in \mathbb{Z}} \theta_k^2 c_k^2$, the form of the extremal sequence $(\theta^*_k(r_\varepsilon))_{k \in \mathbb{Z}}$ in problem (6) as well as the sharp asymptotics for $u_\varepsilon(r_\varepsilon)$ are available. Below we cite some relevant results for the two function spaces ${\cal F}_\sigma$ of interest: the Sobolev space of periodic $\sigma$-smooth functions on $\mathbb{R}$ and the space of periodic functions on $\mathbb{R}$ that admit an analytic continuation to a strip around the real line.

Extreme problem for Sobolev ellipsoids
Let F σ with σ > 0 denote the Sobolev space of σ-smooth 1-periodic functions on R. Define the norm · σ on F σ by the formula where θ k is the kth Fourier coefficient of f with respect to {φ k (x)} k∈Z . If σ is an integer, then under the periodic constraints (when the function admits 1-periodic [σ]-smooth extension on the real line) the norm as in (8) corresponds to For a function f ∈ F σ consider testing the hypothesis H 0 : f = 0 versus the alternative Switching from Sobolev balls {f ∈ F σ : f σ ≤ 1} to Sobolev ellipsoids {θ ∈ l 2 (Z) : k∈Z c 2 k θ 2 k ≤ 1} leads to the problem of testing H 0 : θ = 0 versus H 1 : θ ∈ Θ σ (r ε ). The test procedure that does the best in distinguishing between the latter two hypotheses is obtained by solving the extreme problem (6) with the semi-axes c k defined as in (8); see Section 3 of [10] for details. The extremal sequence (θ * k (r ε ) k∈Z satisfies (see, for example, [10, § 3.2] and Theorem 2 in [16]): where The sharp asymptotics for u ε (r ε ) are of the form (see [15, § 4.3.2] and Theorems 2 and 4 in [16]) where (see, for example, p. 104 of [10]) and B(·, ·) is the Euler beta-function.

Extreme problem for the ellipsoids of analytic functions
The following example of F σ is also well known in nonparametric estimation and hypothesis testing. Let F σ with σ > 0 be the class of 1-periodic functions f on R admitting a continuation to the strip S σ = {z = x + iy : |y| ≤ σ} ⊂ C such that f (x + iy) is analytic on the interior of S σ , bounded on S σ and Let the norm · 1,σ on F σ be given by (see, for example, [7]) In terms of the Fourier coefficients, the squared norm f 2 1,σ takes the form In view of the relations we may also consider an equivalent norm · σ defined as We have chosen to deal with the latter norm as it is easier to study. The ball {f ∈ F σ : f σ ≤ 1} corresponds to the ellipsoid {θ ∈ l 2 (Z) : k∈Z c 2 k θ 2 k ≤ 1} with the semi-axes c k defined as in (12). Thus translating the problem of testing H 0 : f = 0 versus H 1 : f ∈ F σ (r ε ) to the one in terms of Fourier coefficients brings us to testing H 0 : θ = 0 versus H 1 : θ ∈ Θ σ (r ε ). The asymptotically minimax test procedure that distinguishes between these two hypotheses is obtained by solving the extreme problem (6) with the semi-axes c k defined as in (12). The elements of the extremal sequence (θ * k (r ε ) k∈Z in problem (6) with the semi-axis c k as above may be taken as constants (independent of k) satisfying as ε → 0 (see, for example, Section 3 in [10]) where and we have Formulas (13)- (15), as well as formulas (9)-(11), will be employed to construct almost full selectors for the two function spaces under study.

Variable selection in a sequence space model
By sufficiency, the problem of recovering f observed in the Gaussian white noise model can be transformed to an equivalent problem in a sequence space model. Acting as in [13], for the index l ∈ Z d whose jth component is equal to k and the other components are equal to zero, define the function Consider the sequence space model where X j,k = X ε (φ j,k ) are the empirical Fourier coefficients and the collection ( In this paper we have chosen to deal with the latter model, which is technically more convenient. Although the set of θ j s in (17) involves an orthogonal system in L d 2 , the results on minimax errors and risks do not depend on the choice of this orthogonal system because the random variables X j,k , which generate a sufficient σ-algebra for f ∈ F d s,σ , are independent normal N (η j θ j,k , ε 2 ). Thus the distribution of {X j,k } depends on the Fourier coefficients θ j,k of f with respect to the system {φ j,k } but not on the choice of {φ j,k }. Using a suitable finite collection of the random variables X j,k as defined in (16), we wish to construct an optimal selection procedure that recovers most non-zero components of (η 1 θ 1 , . . . , η d θ d ), but not all of them.
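The sequence space model can be simulated directly from the distributional statement above, namely that the $X_{j,k}$ are independent $N(\eta_j \theta_{j,k}, \varepsilon^2)$. The following is a minimal sketch under assumptions: the function name, the truncation at `K` coefficients per component, and the constant coefficient values are all hypothetical choices made for illustration only.

```python
import numpy as np

def simulate_sequence_model(eta, theta, eps, rng):
    """Draw X[j, k] ~ N(eta[j] * theta[j, k], eps**2), independently over (j, k)."""
    eta = np.asarray(eta, dtype=float)
    theta = np.asarray(theta, dtype=float)
    mean = eta[:, None] * theta          # inactive components have zero mean
    return mean + eps * rng.standard_normal(theta.shape)

rng = np.random.default_rng(0)
d, K, s, eps = 200, 16, 5, 0.05          # hypothetical sizes and noise level
eta = np.zeros(d)
eta[:s] = 1                              # the first s components are active
theta = np.full((d, K), 0.1)             # hypothetical Fourier coefficients
X = simulate_sequence_model(eta, theta, eps, rng)

print(X.shape)
print(X[:s].mean(), X[s:].mean())        # active rows centre near 0.1, the rest near 0
```

Only the coefficients of active components shift the mean, which is why selection reduces to deciding, row by row, whether the row is pure noise.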

Almost full variable selection in the non-adaptive case
We first consider a non-adaptive setup when the sparsity parameter s is known. When dealing with the problem of variable selection in model (16), we make use of the statistics, cf. asymptotically minimax test statistics in Section 3.1 of [10], where for any r ε > 0 the weight functions ω k (r ε ) are given by the formula and the number r * ε (s) > 0 is the solution of the equations u ε (r * ε (s)) 2 log(d/s) = 1.
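Since $u_\varepsilon(r_\varepsilon)$ is non-decreasing in $r_\varepsilon$ (Section 2), the radius $r^*_\varepsilon(s)$ can be located numerically by bisection. The sketch below rests on two assumptions that should be checked against the original formulas: the defining relation is read as $u_\varepsilon(r^*_\varepsilon(s)) = \sqrt{2\log(d/s)}$, and the function `u` is a hypothetical power-law stand-in rather than the asymptotics (11) or (15).

```python
import math

def solve_detection_radius(u, d, s, lo=1e-12, hi=1.0, tol=1e-12, max_iter=200):
    """Find r with u(r) = sqrt(2 log(d/s)) by bisection; u must be non-decreasing."""
    target = math.sqrt(2.0 * math.log(d / s))
    if u(lo) > target or u(hi) < target:
        raise ValueError("target is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if u(mid) < target:
            lo = mid                      # root lies to the right of mid
        else:
            hi = mid                      # root lies at or to the left of mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical stand-in for u_eps: a power law in r scaled by eps**-2.
eps, sigma, c = 0.01, 1.0, 1.0
exponent = 2.0 + 1.0 / (2.0 * sigma)

def u(r):
    return c * r ** exponent / eps ** 2

r_star = solve_detection_radius(u, d=10_000, s=10)
print(r_star, u(r_star))
```

With a closed-form `u` the root is of course available analytically; bisection is shown because it needs nothing beyond monotonicity.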
For both function spaces of interest, the quantities $K_\varepsilon$, $\theta^*_k(r_\varepsilon)$, and $u_\varepsilon(r_\varepsilon)$ in formula (18) are specified in Section 2. The sparsity parameter $s \in \{1, 2, \ldots, d\}$ is assumed to be small relative to $d$, that is, $s = o(d)$. Note that the weights $\omega_k(r_\varepsilon)$ are normalized to have $\sum_{1 \le |k| \le K_\varepsilon} \omega_k^2(r_\varepsilon) = 1/2$. Now we define a non-adaptive almost full selector to be where $\delta = \delta_\varepsilon > 0$ satisfies The arguments as in the proof of Theorem 1 show that for Sobolev ellipsoids, under the conditions, cf. (23), the selector $\check\eta$ reconstructs almost all relevant components of a vector $\eta \in H_{d,s}$, and hence asymptotically provides almost full recovery of a signal $f \in {\cal F}^d_{s,\sigma}$ in model (1). To illustrate the difference between exact and almost full reconstruction in adaptive settings, assume that ${\cal F}_\sigma$ is the Sobolev space. In this case, a selector (see Section 3.1 of [13] with $s$ in place of $d^{1-\beta}$) where the statistics $t^*_j$ are defined similarly to the $t_j$ in (18) with the relation instead of (19), turns out to be a non-adaptive exact selector, as long as Under the above conditions, the procedure based on $\eta^*$ correctly selects all non-zero components of a vector $\eta \in H_{d,s}$, and hence provides exact recovery of a signal $f \in {\cal F}^d_{s,\sigma}$ in model (1).
In contrast to formula (22), the threshold in (20) is set at a lower level and depends on the parameter s. The latter fact makes the idea of adaptation suggested in [13] for the exact reconstruction case invalid (see Section 3.3 for details). In the next section, we obtain the desired adaptive selector by using Lepski's method. Before doing that, we provide conditions on d as a function of ε under which the thresholding procedure (20), as well as its adaptive version introduced in Section 4.1, gives asymptotically almost full reconstruction of a function f ∈ F d s,σ .
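The dependence of the threshold on $s$ can be made concrete with a small sketch. It assumes that the cut-off has the form $\sqrt{2\log(d/s) + \delta \log d}$, which is how the events $A_j(s)$ in Section 5 are read here with the square root restored; this form, the function name, and the Gaussian stand-in for the statistics $t_j$ are all assumptions made for illustration.

```python
import numpy as np

def threshold_selector(t, d, s, delta):
    """Flag component j as active when t[j] exceeds sqrt(2 log(d/s) + delta log d)."""
    cutoff = np.sqrt(2.0 * np.log(d / s) + delta * np.log(d))
    return (np.asarray(t) > cutoff).astype(int), cutoff

rng = np.random.default_rng(2)
d, s, delta = 5000, 50, 0.05             # hypothetical dimension, sparsity, slack
eta = np.zeros(d, dtype=int)
eta[:s] = 1

# Hypothetical statistics: standard normal under the null, shifted mean when active.
# The paper's t_j are weighted chi-square type statistics, not Gaussian.
t = rng.standard_normal(d) + 6.0 * eta

eta_check, cutoff = threshold_selector(t, d, s, delta)
errors = int(np.sum(eta_check != eta))
print(cutoff, errors, errors / s)
```

Because the cut-off grows with `d / s` rather than with `d`, a few inactive components may slip through; almost full recovery tolerates this as long as the number of errors stays small relative to `s`.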

Conditions for almost full variable selection
Consider now the question of determining conditions on d as a function of ε under which almost full variable selection is possible. Violation of these conditions will lead to entirely different selection strategies.
In the sequence space of Fourier coefficients, consider testing the null hypothesis H 0j : θ j = 0 versus the alternative H 1j : θ j ∈ Θ σ (r ε ), where the set Θ σ (r ε ) is given by (5). It is easy to see that under the null hypothesis H 0j , we have (see, for example, Section 4.1 of [13]) E 0 (t j ) = 0, Var 0 (t j ) = 1, while under the alternative H 1j , where for all sufficiently small ε a small parameter r ε > 0 satisfies r ε /r * ε (s) > 1, Furthermore, under the above restrictions on r ε and r * ε (s) the following result holds (in case of Sobolev spaces, see Proposition 7.1 in [5] and Lemma 1 in [13]; in case of the space F σ of analytic functions, the proof is similar to that of Sobolev spaces).
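The null moments $E_0(t_j) = 0$ and $\mathrm{Var}_0(t_j) = 1$ follow from the normalization $\sum_k \omega_k^2 = 1/2$ once the statistic is taken to be a centred weighted chi-square sum. The sketch below assumes the form $t_j = \sum_k \omega_k \big((X_{j,k}/\varepsilon)^2 - 1\big)$, which is an assumption about formula (18), and uses arbitrary positive weights in place of the paper's $\omega_k(r_\varepsilon)$. Under pure noise each summand has mean 0 and variance $2\omega_k^2$, so the total variance is $2 \cdot 1/2 = 1$.

```python
import numpy as np

def chi_square_statistic(X, weights, eps):
    """t_j = sum_k w_k * ((X[j, k] / eps)**2 - 1), computed for every row j."""
    return ((X / eps) ** 2 - 1.0) @ weights

rng = np.random.default_rng(1)
K, eps, n_rep = 20, 0.1, 100_000         # hypothetical sizes

w = rng.uniform(0.5, 1.5, size=K)        # hypothetical positive weights
w *= np.sqrt(0.5 / np.sum(w ** 2))       # enforce sum_k w_k^2 = 1/2

X0 = eps * rng.standard_normal((n_rep, K))   # pure noise: all coefficients are zero
t0 = chi_square_statistic(X0, w, eps)

print(np.sum(w ** 2))                    # normalization of the weights
print(t0.mean(), t0.var())               # empirical null mean and variance
```

The empirical mean and variance should be close to 0 and 1 respectively, up to Monte Carlo error.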
Let the quantity T = T ε → −∞ and the weight functions ω k (r * ε (s)) as in (18) be such that as ε → 0 Then as ε → 0 and for all j = 1, . . . , d, uniformly in θ j ∈ Θ σ (r ε ), For both function spaces F σ of our interest, the exponential bounds (26) and (27) will be applied below to the quantity T = T ε → −∞ of order O(log 1/2 d). This observation and assumption (19) transform requirement (25) into Condition (28) gives a restriction on the growth of d = d ε ensuring that the selection procedure works as designed. Indeed, as shown in Section 4.1 in [13], for the Sobolev space of σ-smooth functions, one has Therefore condition (28) is fulfilled when In case of the space F σ of analytic functions, one has and, in view of (11) and (19), the quantity r * ε (s) satisfies Therefore log (r * ε (s)) −1 ∼ log(ε −1 ), and (see (30)) From this, the technical condition (28) holds true when, cf. formula (30),

Main results
In this section, we consider the more realistic problem in which the sparsity parameter s is unknown. We derive conditions under which almost full variable selection is possible, and construct a selector for which the Hamming distance is much smaller than the number of relevant components (see Theorems 3 and 5). Our selector is adaptive in the sparsity parameter s and cannot be improved in the asymptotically minimax sense (see Theorems 4 and 6). In addition, in Section 4.2 we provide an asymptotically exact selection procedure for the space of analytic functions that is adaptive in the sparsity parameter s.

Almost full variable selection in the adaptive case
In this subsection, the selectorη as in (20) will be used to obtain the corresponding adaptive procedure. To avoid losses due to adaptation, we will have to limit the range of the possible values of s. Namely, we assume that for some constants 0 < c < C < 1 and define the set and assume that yielding d ∆ ≤ const for all large enough d. For each m = 1, . . . , M , let the parameter r * ε (s m ) > 0 be determined by the equation, cf. (19), where, depending on a type of the ellipsoid Θ σ (r ε ) we are dealing with, the function u ε (r ε ) satisfies either (11) or (15). Similar to the case of known s, consider weighted chi-square type statistics, cf. (18), with weight functions possessing the property 1≤|k|≤Kε ω 2 k (r * ε (s m )) = 1/2. The values of θ * k (r * ε (s)) and K ε depend on the function space under consideration. For the Sobolev space in hand, θ * k (r * ε (s)) and K ε are as in (9) and (10); for the space of analytic functions, θ * k (r * ε (s)) and K ε are as in (13) where m is chosen by Lepski's method (see Section 2 of [17]) as follows: Here the quantities v i = v i,d are set to be Algorithmically, Lepski's procedure for choosingm works as follows. We start by settinĝ

Exact variable selection for analytic functions in the adaptive case
The problem of adaptive reconstruction of sparse additive functions in the Gaussian white noise model has so far been studied only in the case of σ-smooth functions; see [13]. Before handling the problem of almost full variable selection in adaptive settings, we complement the findings in [13] by presenting an adaptive exact selector for the space of analytic functions. The strategy is similar to the one suggested in [13] for σ-smooth functions, but the parameters of the statistics and the condition on the dimension d are different.
Consider a sequence space model that corresponds to the Gaussian white noise model with $f$ from the class of analytic functions ${\cal F}_\sigma$ as defined in Section 2.3. Let $1 < s_1 < s_2 < \ldots < s_M < d$ be the grid of points as in (33). For any $m = 1, \ldots, M$, let the parameter $r^*_{\varepsilon,m} > 0$ be determined by the equation Consider weighted chi-square type statistics with weight functions obeying the normalization condition $\sum_{k \in \mathbb{Z}} \omega_k^2(r^*_{\varepsilon,m}) = 1/2$. Next, for all $j = 1, \ldots, d$ and $m = 1, \ldots, M$, set $\eta_{j,m} = I\big(t_{j,m} > \sqrt{(2 + \delta)(\log d + \log M)}\big)$, and define an adaptive exact selector $\eta^{**} = (\eta^{**}_1, \ldots, \eta^{**}_d)$ of a vector $\eta \in H_{d,s}$ by the formula (see formula (18) in [13])
$$\eta^{**}_j = \max_{1 \le m \le M} \eta_{j,m}, \quad j = 1, \ldots, d. \qquad (36)$$
The idea behind the selector $\eta^{**}$ is as follows. The $j$th component of a signal is viewed as active if at least one of the statistics $t_{j,m}$, $m = 1, \ldots, M$, detects it. Therefore, thinking of $\eta_{j,m}$ and $\eta^{**}_j$ as test functions, we get that the probability of having $\theta_j$ incorrectly undetected does not exceed the respective probability for the $\eta_{j,m}$ test, where $s_m$ is close to the true (but unknown) value of $s$. Furthermore, the probability that $\eta^{**}_j$ incorrectly detects $\theta_j$ is less than the sum of the respective probabilities for the $\eta_{j,m}$ tests over all $m = 1, \ldots, M$, and is small by the choice of threshold.
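The "at least one statistic detects it" rule can be sketched in a few lines. The function name and the simulated statistics are illustrative assumptions, and the cut-off is taken to be $\sqrt{(2+\delta)(\log d + \log M)}$, with the square root assumed rather than confirmed by the text above; the statistics $t_{j,m}$ are supplied as a $d \times M$ array.

```python
import numpy as np

def adaptive_exact_selector(t, delta):
    """t has shape (d, M); select component j if any t[j, m] exceeds the cut-off."""
    t = np.asarray(t, dtype=float)
    d, M = t.shape
    cutoff = np.sqrt((2.0 + delta) * (np.log(d) + np.log(M)))
    eta_jm = t > cutoff                   # individual tests, one per grid point m
    return eta_jm.any(axis=1).astype(int), cutoff

rng = np.random.default_rng(3)
d, M, s, delta = 1000, 5, 10, 0.1         # hypothetical sizes
t = rng.standard_normal((d, M))           # hypothetical null behaviour of t[j, m]
t[:s, 2] += 8.0                           # active components are picked up at m = 2

eta_sel, cutoff = adaptive_exact_selector(t, delta)
print(cutoff, int(eta_sel.sum()))
```

The cut-off does not involve `s`: the extra `log M` term pays for running `M` tests per component, which is what lets the selector adapt to the unknown sparsity.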
Let the set Θ σ,d (r ε ) be as in (37) with the coefficients c k given by (12). The following two theorems, whose proofs are similar to those of Theorems 3 and 4 in [13], hold true.
where η * * is the selector of vector η as defined in (36).
The sharp detection boundary in Theorems 1 and 2, which makes it possible to decide whether or not we are in a position to proceed further with variable selection, is determined in terms of the function u ε (r ε ) with sharp asymptotics as in (11) and (15). The use of u ε (r ε ) instead of r ε makes it easy to build a bridge between variable selection in the Gaussian white noise setting and variable selection in the regression setting as studied in Sec. 4 of [6]. In addition, using u ε (r ε ) instead of r ε makes the statement of the detectability condition precise. By 'continuity' of u ε (r ε ) as cited in (7), the conditions of Theorems 1 and 2 that separate detectable components from undetectable ones can be written in the usual form lim inf ε→0 r ε /r * ε > 1 and lim sup ε→0 r ε /r * ε < 1, where for Sobolev ellipsoids the sharp detection boundary r * ε is found explicitly from (11), and for the ellipsoids of analytic functions it is found implicitly from (15). A similar remark applies to Theorems 3 to 6 stated in Sections 4.3 and 4.4.

Almost full variable selection for Sobolev balls
Consider the set Θ σ (r ε ) as in (5) with the coefficients c k given by (8), and define the set Let η(s m ) be the selector given by (35) based on the statistics t j (s m ) as in (18), where the quantities θ * k (r ε ), K ε , and u ε (r ε ) are specified by formulas (9), (10), and (11), respectively. The following theorem holds.
The next result shows that if the detectability condition is not met, almost full selection is impossible.
where the infimum is over all selectorsη of a vector η in model (16).

Almost full variable selection for analytic functions
The results similar to Theorems 3 and 4 hold true for the space of analytic functions. Namely, consider the sets Θ σ (r ε ) and Θ σ,d (r ε ) as in (5) and (37) with the coefficients c k given by (12). Again, let η(s m ) be the selector defined by (35) based on the statistics t j (s m ) as in (18), but the quantities θ * k (r ε ), K ε , and u ε (r ε ) are now as in (13), (14), and (15), respectively. The following results hold true.

Proofs of the Theorems
In this section, we prove Theorems 3 and 4. The proofs of Theorems 5 and 6 go along the same lines and therefore are omitted. Throughout the proof, the exponential bounds (26) and (27) on the tail probabilities of the statistics t j (s) will be frequently used.
Proof of Theorem 3. Let m 0 ∈ {2, . . . , M } be such that which implies that s m 0 /s < d ∆ . Then, using the definition of the selectorη(sm), we can write To complete the proof, we need to show that I 1 and I 2 are both negligibly small when ε is small. Consider the term I 1 and observe that for all η ∈ H d,s and θ ∈ Θ σ,d (r ε ), where by (34) and the choice of the sequences τ d and ∆ where by (26) the first summand in the above expression satisfies To treat the second term on the right side of (39), recall that 1 < s m 0 /s < d ∆ . Then, by the assumption on the parameter r ε = r ε (s) and the 'continuity' of the function u ε (r ε ) as stated in (7), using the fact that ∆ log d → 0 as d → ∞, one can find a constant δ 1 > 0 such that for all sufficiently small ε r ε ≥ r * ε (s m 0 )(1 + δ 1 ). From this, using Proposition 4.1 in [5] and recalling formula (24), where the last inequality follows from the fact that d c ≤ s m 0 < d C , which implies δ log d = o(log(d/s m 0 )). Thus as ε → 0 Now (27) in combination with (40) and (41) gives, uniformly in θ 1 ∈ Θ σ (r ε ), Putting everything together, we conclude that the first term on the right side of (38) satisfies Let us now show that By definition ofm, for all η ∈ H d,s and all θ ∈ Θ σ,d (r ε ),

Now, we introduce independent events
A j (s) = t j (s) > 2 log(d/s) + δ log d , j = 1, . . . , d, and denote by A j (s) the complement of A j (s). Observing that for all To bound this sum, we shall apply Bernstein's inequality saying that if X 1 , . . . , X d are independent random variables such that for all j = 1, . . . , d and for some H > 0 then (see, for example, pp. 164-165 of [2]) Observe that for independent random variables X 1 , . . . , X d with the property E(X j ) = 0 and |X j | ≤ M, j = 1, . . . , d, for some M > 0, the Bernstein condition (43) holds with H = M/3. Below we will use Bernstein's inequality in the case of t ≥ B 2 d /H. To do this, let us introduce random variables X j = X j (s k−1 , s i ), 1 ≤ j ≤ d, m 0 ≤ k ≤ M , k ≤ i ≤ M , by the formula and observe that |X j | ≤ 4, j = 1, . . . , d, and for all η ∈ H d,s and θ ∈ Θ σ,d (r ε ) E η,θ (X j ) = 0, j = 1, . . . , d.
Before applying Bernstein's inequality, we show that for all $\eta \in H_{d,s}$ and $\theta \in \Theta_{\sigma,d}(r_\varepsilon)$, and for
We have
Recalling (26) and the relation $\tau_d d^{-\delta/2} \to 0$ as $d \to \infty$, we have
Similarly, using the fact that $v_{k-1} < v_i$ when $k \le i \le M$, we obtain
Therefore for all $m_0 \le k \le M$ and $k \le i \le M$
Consider the second term on the right side of (46), $J_2(s_{k-1}, s_i)$. First, note that for all $m_0 \le k \le M$ and $k \le i \le M$, $s < s_i$ and $s < s_{k-1}$, $k \ne m_0$, and for $k = m_0$ one has $s_{k-1} = s_{m_0-1} \le s$, which implies $s/s_{m_0-1} < d^{\Delta}$. Therefore, by the assumption on $r_\varepsilon = r_\varepsilon(s)$ and the 'continuity' of the function $u_\varepsilon(r_\varepsilon)$ as stated in (7), using the fact that $\Delta \log d \to 0$ as $d \to \infty$, one can find constants $\delta_2 > 0$ and $\delta_3 > 0$ such that for all sufficiently small $\varepsilon$ when $m_0 \le k \le M$ and $k \le i \le M$. From this, for all sufficiently small $\varepsilon$, cf. (40), and hence as $\varepsilon \to 0$ $2\log(d/s_i) + \delta \log d - \inf$
It now follows from (27), (48), and (49) that, uniformly in $\theta_1 \in \Theta_\sigma(r_\varepsilon)$,
Also, as relation (49) continues to hold with $s_{k-1}$, $m_0 \le k \le M$, instead of $s_i$, similar arguments yield which implies
Combining (46), (47) and (50), we arrive at (45). We see then by (45) that
Therefore, the use of Bernstein's inequality as in (44) for the case of $t \ge B_d^2/H$ with $H = 4/3$ gives
This in combination with (38) and (42) completes the proof of Theorem 3. $\Box$
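The large-deviation regime $t \ge B_d^2/H$ of Bernstein's inequality used above can be illustrated numerically. The following is a minimal sketch, not part of the proof: it assumes the one-sided textbook form $P(S_d \ge t) \le \exp(-t/(4H))$ for $t \ge B_d^2/H$, whose constants may differ from those in (44), and it uses hypothetical toy variables $X_j = M(B_j - p)$ with $B_j$ Bernoulli($p$), which are centred and bounded by $M = 4$ but are not the variables $X_j(s_{k-1}, s_i)$ of the proof. All function names are illustrative.

```python
import math


def bernstein_large_t_bound(t, H):
    """One-sided bound exp(-t / (4H)); assumed form for the regime t >= B^2 / H."""
    return math.exp(-t / (4.0 * H))


def binomial_upper_tail(d, p, k):
    """Exact P(N >= k) for N ~ Binomial(d, p)."""
    return sum(math.comb(d, n) * p**n * (1.0 - p) ** (d - n) for n in range(k, d + 1))


def check_bernstein(d=200, p=0.01, M=4.0):
    # Toy variables (hypothetical, not the X_j of the proof):
    # X_j = M * (B_j - p) with B_j ~ Bernoulli(p), so E X_j = 0 and |X_j| <= M.
    H = M / 3.0                      # Bernstein constant for variables bounded by M
    B2 = d * M**2 * p * (1.0 - p)    # B_d^2 = sum of the variances
    t = B2 / H                       # boundary of the regime t >= B_d^2 / H
    # S = sum_j X_j = M * (N - d*p) with N ~ Binomial(d, p),
    # hence the event {S >= t} equals {N >= d*p + t/M}.
    k = math.ceil(d * p + t / M - 1e-12)
    exact_tail = binomial_upper_tail(d, p, k)
    bound = bernstein_large_t_bound(t, H)
    return exact_tail, bound


if __name__ == "__main__":
    tail, bound = check_bernstein()
    print(f"exact tail = {tail:.3e}, Bernstein bound = {bound:.3e}")
```

The exact binomial tail is used instead of simulation so that the comparison is deterministic. Rare-event indicators have small variance, which is why the sum of variances $B_d^2$ is small relative to $t$ and the exponential (rather than Gaussian) branch of the bound is the relevant one.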

Proof of Theorem 4
To prove the theorem, we first pick good prior distributions on $\eta = (\eta_j)$ and $\theta = (\theta_j)$. Having done this, we bound the normalized minimax risk from below by the normalized Bayes risk and show that the latter is strictly positive. The first part of the proof, up to relation (55), goes along the lines of that of Theorem 2 in [13], with $p = s/d$ instead of $p = d^{-\beta}$.

Concluding remarks
In the context of variable selection in high dimensions, in both regression and white noise settings, simple thresholding provides a plausible alternative to the lasso for a large range of problems. As a statistical tool, the thresholding strategy is simple in nature and is not as computationally demanding as the lasso, especially in very high dimensional problems. At the same time, it is capable of performing at least as well as the lasso, or even better (see our Theorems 1 to 6, Theorems 9 to 11 in [6], and Theorems 1 and 2 in [13] for details). In light of these facts, we support the viewpoint of Genovese et al. [6] that for sparse high-dimensional regression problems a simple thresholding procedure merits further investigation.
To conclude our study, we point out possible directions for extending the results obtained in this paper. For the two function spaces ${\cal F}_\sigma$ at hand, it might be of interest to produce asymptotically exact and almost full selectors in very high dimensional settings when the conditions $\log d = o(\varepsilon^{-2/(2\sigma+1)})$ and $\log d = o(\log \varepsilon^{-1})$ on the growth of $d$ as a function of $\varepsilon$ are violated.
The setup of inverse problems, where the observations are $X_\varepsilon = Kf + \varepsilon W$, with $K$ being a linear operator such that $K^\star K$ is compact, translates into a Gaussian sequence model with heterogeneous observations $X_{j,k} = \eta_j \theta_{j,k} + \varepsilon v_k \xi_{j,k}$, where $v_k^{-2}$ are the eigenvalues of $K^\star K$. This case, which extends our setup, can be treated by using the sharp testing results for inverse problems obtained in [14].
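The reduction to a heterogeneous sequence model can be sketched as follows. This is a standard singular value argument given under assumptions not spelled out in the text: the index $j$ is suppressed, $K$ is assumed to be injective with singular system $(b_k, \varphi_k, \psi_k)$, and the symbols $b_k$, $\varphi_k$, $\psi_k$, $Y_k$ are introduced here for illustration only. With
$$K\varphi_k = b_k \psi_k, \qquad K^\star \psi_k = b_k \varphi_k, \qquad K^\star K \varphi_k = b_k^2 \varphi_k, \qquad b_k > 0,$$
and $f = \sum_k \theta_k \varphi_k$, projecting the observation $X_\varepsilon = Kf + \varepsilon W$ onto the orthonormal system $(\psi_k)$ gives
$$Y_k = \langle X_\varepsilon, \psi_k \rangle = b_k \theta_k + \varepsilon \xi_k, \qquad \xi_k \ \text{i.i.d.}\ N(0,1).$$
Dividing by $b_k$ yields
$$X_k = Y_k / b_k = \theta_k + \varepsilon v_k \xi_k, \qquad v_k = b_k^{-1},$$
so that $v_k^{-2} = b_k^2$ are indeed the eigenvalues of $K^\star K$. Since $K^\star K$ is compact, $b_k \to 0$ and hence $v_k \to \infty$: the effective noise level grows with $k$, which is the source of heterogeneity in the resulting sequence model.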
Furthermore, handling the problem of variable selection in a sequence space model, general ellipsoids $\{\theta \in l_2(\mathbb{Z}) : \sum_{k \in \mathbb{Z}} c_k^2 \theta_k^2 \le 1\}$ in $l_2(\mathbb{Z})$, with semi-axes $c_k$ decreasing fast enough, could be studied. A more complicated model, in which a $d$-variate regression function $f$ admits a decomposition into a sum of $k$-variate components, with $k \ge 2$ and only a small number $s$ of these components being non-zero, also deserves some attention.
Dropping the assumption that the parameter $\sigma$ is known leads to the problem of adapting the proposed selection procedures to the possible values of $\sigma$.
To pursue more practical goals, one can try to translate the results obtained for the additive $s$-sparse Gaussian white noise model to the corresponding discrete regression model, for which the associated detection problem was solved in [1].