Estimation and variable selection with exponential weights

In the context of a linear model with a sparse coefficient vector, exponential weights methods have been shown to achieve oracle inequalities for denoising/prediction. We show that such methods also succeed at variable selection and estimation under a near-minimum condition on the design matrix, instead of the much stronger assumptions required by other methods such as the Lasso or the Dantzig Selector. The same analysis yields consistency results for Bayesian methods and BIC-type variable selection under similar conditions. MSC 2010 subject classifications: Primary 62J99.


Introduction
Consider the standard linear regression model

y = Xβ⋆ + z,   (1)

where y ∈ R^n is the response vector; X ∈ R^{n×p} is the regression (or design) matrix, assumed to have normalized columns; β⋆ ∈ R^p is the coefficient vector; and z ∈ R^n is white Gaussian noise, i.e., z ∼ N(0, σ²I_n). As the model (1) is in general not identifiable, we let β⋆ denote one of the coefficient vectors of minimal support size such that Xβ⋆ = E(y). Then J⋆ and s⋆ denote the support and support size of β⋆. We are most interested in the case where the coefficient vector is sparse, meaning s⋆ is much smaller than p. As usual, we want to perform inference based on the design matrix X and the response vector y. The four main inference problems are:
• Denoising: estimate the mean response vector Xβ⋆;
• Prediction: estimate Uβ⋆ for a new observation U ∈ R^p;
• Estimation: estimate the coefficient vector β⋆;
• Support recovery: estimate the support J⋆.
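To fix ideas, the model (1) and the column normalization can be simulated as follows; the sizes n, p, s⋆, the support {0, 1, 2} and the noise level are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s_star, sigma = 50, 100, 3, 1.0        # illustrative sizes

# Design with normalized columns (||X_j||_2^2 = n for every j).
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)

# s*-sparse coefficient vector: support J* = {0, 1, 2}, entries +/- 2.
beta_star = np.zeros(p)
beta_star[:s_star] = 2.0 * rng.choice([-1.0, 1.0], size=s_star)

# Response y = X beta* + z with white Gaussian noise z ~ N(0, sigma^2 I_n).
y = X @ beta_star + sigma * rng.standard_normal(n)

print(y.shape)
```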
These problems are not always differentiated and are often referred to jointly as variable/model selection in the statistics literature, and feature selection in the machine learning literature. Being central to statistics, they are addressed in a large number of papers. We review the literature with particular emphasis on papers that advanced the theory of model selection. An early reference is [38], who provides necessary conditions and sufficient conditions under which the AIC/Mallows' Cp criteria and the BIC criterion are consistent. For example, AIC/Mallows' Cp are consistent when there is a unique β such that E(y) = Xβ, and this β has a support of fixed size as n, p → ∞. Also, BIC is consistent when the dimension p is fixed and the model is identifiable (a condition that appears to be missing in that paper). BIC was recently shown in [14] to be consistent when the model is identifiable, p = O(n^a) with a < 1/2, and the true coefficient vector has a support of fixed size as n, p → ∞. They also propose an extended BIC for when a is larger. Closely related is the work of [1], who consider the maximum a posteriori (MAP) estimator for essentially the same sparsity prior. They derive the denoising performance of this estimator, together with sharp minimax prediction oracle inequalities for the sparse regression model (1).
Assuming the size of the support of β⋆ is known, [34] establish denoising and estimation performance bounds for best subset selection, and obtain information bounds for these problems. Relaxing to the ℓ1-norm penalty, the Lasso and the closely related Dantzig Selector were shown to be consistent when the design matrix satisfies a restricted isometry property (RIP) or has column vectors with low coherence; see [4,33,6,29,8,49,12,10] among others. [45] used a different set of conditions, called cone invertibility factors or sensitivity conditions, and established oracle inequalities for the estimation problem with the Lasso and Dantzig Selector. [22] also exploited this approach to build computationally tractable confidence bands for the Dantzig Selector. With a carefully chosen nonconcave penalty, [21] shows that consistent variable selection is possible when p = O(n^{1/3}). This condition on p was weakened in the follow-up paper [20], though with an additional restriction on the coherence (Condition (16) there). The strongest results in that line of work seem to appear in [47,48], which propose a minimax concave penalty that leads to consistent variable selection under much weaker assumptions. Classical forward stepwise selection, also known as orthogonal matching pursuit, is shown in [9] to enable variable selection under an assumption of low coherence on the design matrix. Screening was studied in [19] in the ultra high-dimensional setting, assuming the design is random. A combination of screening and penalized regression is explored in [24,25], with asymptotic optimality when the Gram matrix X⊤X/n is (mildly) sparse.
A distinct line of research considered the use of exponential weighting in high-dimensional denoising/prediction problems [7,13,43,18,23,26,28,16,17]. This methodology has the potential of striking a good compromise between statistical accuracy and computational complexity. While computational tractability has only been demonstrated in simulations, a number of sharp statistical results exist for the denoising/prediction problems. In particular, [2,35] propose exponential weights procedures that achieve sharp sparsity oracle inequalities with no assumptions on the design matrix X. For a recent survey of the exponential weights literature, see [36]. We emphasize that no result exists in the literature concerning the estimation and support recovery problems with an exponential weights approach; our results are the first of this nature.
Our contribution is the following. We establish performance bounds for the version of exponential weights studied in [2] for the three inference problems of denoising, estimation and support recovery. The methodology developed in the present paper is new and brings novel and interesting results to the sparse regression literature. The main feature of this methodology is that it requires only almost-minimal assumptions on the design matrix X and the target β⋆. In particular, for estimation and support recovery, the conditions are only slightly stronger than identifiability. Moreover, when the size of the support is known, the exponential weights method is consistent under the minimum identifiability condition as long as the nonzero coefficients are large enough, close in magnitude to what is required by any method, in particular matching the performance of best subset selection [34]. See also [41,11,46].
The rest of the paper is organized as follows. In Section 2, we describe in detail the methodology and state the main results concerning estimation and support recovery with our exponential weights procedure. We also apply our methodology to establish novel results in estimation and support recovery for the Bayesian map estimator studied in [1], thus completing the theoretical study of the map. In Section 3, we present further results and establish some connections with the Bayesian theory and the BIC estimator. We also compare the results we obtained for exponential weights with those established for other methods, in particular the Lasso and mcp. In Section 4, we discuss our results in the light of recent information bounds for model selection. The proofs of our main results are in Section 5.

Main results
We consider the version of exponential weights studied in [2], shown there to enjoy optimal oracle performance for the denoising problem. The procedure puts a sparsity prior on the coefficient vector and selects the estimates using the posterior distribution. We obtain a new denoising performance bound which is based on balancing the sparsity level and the size of the least squares residuals. The result does not assume any conditions on the design matrix. The task of support recovery, to be amenable, necessitates additional assumptions. We show that under near-identifiability conditions on the design matrix, the posterior concentrates on the correct subset of nonzero components with overwhelming probability, provided that these coefficients are sufficiently large (somewhat larger than the noise level). This immediately implies that the maximum a posteriori (map) and the randomized exponential weights estimator (rew) are consistent. We then derive estimation performance guarantees in Euclidean norm and ℓ∞-norm for the map, rew and the averaged exponential weights estimator (aew), which can also be interpreted as the posterior mean by analogy with the Bayesian theory.
Throughout, we assume the noise variance σ² is known. We also assume that p ≥ n, and remark that similar results hold when n ≥ p, with p replaced by n in the bounds. We use some standard notation. For any u = (u_1, . . . , u_d)⊤ ∈ R^d with d ≥ 1 and q ≥ 1, we define ‖u‖_q = (Σ_{i=1}^d |u_i|^q)^{1/q}, with ‖u‖_∞ = max_i |u_i|. Without loss of generality, we assume from now on that the predictors are normalized in the sense that

‖X_j‖₂² = n for all j ∈ [p] := {1, . . . , p}.   (2)

For a subset J ⊂ [p], let X_J = [X_j , j ∈ J] ∈ R^{n×|J|}, where X_j denotes the jth column vector of X. Also, let M_J be the linear span of {X_j , j ∈ J} and let P_J be the orthogonal projection onto M_J. Then P⊥_J := I_n − P_J is the orthogonal projection onto M⊥_J. We say that a vector is s-sparse if its support is of size s.

Exponential weights
We start with the definition of a sparsity prior on the subsets of [p], which favors subsets with small support. This leads to a pseudo-posterior, which is used in turn to define various exponential weights estimators.
• The prior π. Fix an upper bound s ≥ 1 on the support size, and a sparsity parameter λ > 0. The prior chooses the subset J ⊂ [p] with probability

π(J) ∝ exp(−λ|J|) when |J| ≤ s, and π(J) = 0 otherwise.   (3)

• The posterior Π. Since the noise is assumed i.i.d. Gaussian with variance σ², given a subset of variables J ⊂ [p], the coefficient vector that maximizes the likelihood is the least squares estimate β̂_J, with maximum likelihood proportional to exp(−‖P⊥_J y‖₂²/(2σ²)). In light of this, we define the following pseudo-posterior, which chooses J ⊂ [p] with probability

Π(J) ∝ π(J) exp(−‖P⊥_J y‖₂²/(2σ²)).

The prior π enforces sparsity and focuses on subsets of size not exceeding s. Without additional knowledge, we take s = p. The exponential factor in ‖P⊥_J y‖₂² in the posterior enforces fidelity to the observations. Note that Π is not a true posterior because no prior is assumed for β⋆; we elaborate on this point in Section 3.2. The variance term 2σ² corresponds to the temperature T in a standard Gibbs distribution. We will calibrate the procedure via the sparsity exponent λ in (3), though we could have done so via the temperature as well. Remember that we assume that σ² is known. When the variance is unknown, we can replace it with a consistent estimator σ̂².
Based on the pseudo-posterior Π, it is natural to consider the maximum a posteriori (map) support estimate, defined as

Ĵ_map = argmax_{J⊂[p]} Π(J).

This leads to considering the map coefficient estimate. For any J ⊂ [p], let β̂_J denote the least squares coefficient vector for the sub-model (X_J, y) with minimum Euclidean norm, so that β̂_J is unique even when the columns of X_J are linearly dependent. When the columns of X_J are linearly independent, the standard formula applies:

β̂_J = (X_J⊤ X_J)^{-1} X_J⊤ y on the coordinates in J, and 0 elsewhere.

The map coefficient estimate is then defined as β̂_map = β̂_{Ĵ_map}. We can also consider a randomized version of exponential weights (rew), defined as follows:

Ĵ ∼ Π, and β̂_rew = β̂_Ĵ.

In words, we first draw a subset of variables Ĵ according to the posterior Π, and then compute the standard least squares estimator in the corresponding sub-model (X_Ĵ, y).
We also want to study the averaged exponential weights estimator (aew),

β̂_aew = Σ_{J⊂[p]} Π(J) β̂_J,

i.e., the posterior average of the least squares estimators. In the rest of this section, we establish denoising, support recovery and estimation oracle inequalities for the map, rew and aew estimators.
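For very small p, the pseudo-posterior Π can be computed exhaustively, which makes the map, rew and aew estimators explicit. The sketch below assumes the prior π(J) ∝ exp(−λ|J|) on subsets of size at most s, and uses illustrative values of n, p, σ and λ; it is a toy computation, not the paper's algorithm (a large p would require an approximation scheme).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p, s_bar, sigma = 40, 8, 4, 0.5       # illustrative sizes
lam = 4.0 * np.log(p)                    # illustrative sparsity exponent

X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # normalized columns
beta_star = np.zeros(p)
beta_star[[0, 3]] = 3.0                  # true support J* = {0, 3}
y = X @ beta_star + sigma * rng.standard_normal(n)

def ls(J):
    """Minimum-norm least squares coefficient vector for sub-model (X_J, y)."""
    b = np.zeros(p)
    if J:
        b[list(J)] = np.linalg.lstsq(X[:, list(J)], y, rcond=None)[0]
    return b

# Log-weights: log Pi(J) = -lam |J| - ||P_J^perp y||^2 / (2 sigma^2) + const.
subsets = [J for k in range(s_bar + 1) for J in combinations(range(p), k)]
logw = np.array([-lam * len(J) - np.sum((y - X @ ls(J)) ** 2) / (2 * sigma ** 2)
                 for J in subsets])
w = np.exp(logw - logw.max())
w /= w.sum()                             # the posterior Pi over subsets

J_map = subsets[int(np.argmax(w))]       # MAP support estimate
beta_map = ls(J_map)
J_rew = subsets[rng.choice(len(subsets), p=w)]        # randomized draw from Pi
beta_aew = sum(wi * ls(J) for wi, J in zip(w, subsets))  # posterior mean

print(J_map)
```

With a strong signal-to-noise ratio, the posterior mass concentrates on the true support, so the three estimators essentially agree.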

A concentration result for the posterior
Our performance bounds for support recovery rely, as they should, on concentration properties of the posterior Π. We first prove that, without any condition on the design matrix X, the posterior Π concentrates on subsets of small size. Proposition 1. Consider a design matrix X with p ≥ n and normalized column vectors (2). For some ε > 0 and c ≥ 1, take Then, with probability at

Identifiability
Actual support recovery requires some additional conditions, the bare minimum being that the model is identifiable.
This condition characterizes the identifiability of the model as stated in the following simple result.
In this paper, we establish that exponential weights, and also ℓ 0 -penalized variable selection, allow for support recovery and estimation under the condition I((2 + ε)s ⋆ ) for any ε > 0 fixed, as long as the non-zero entries of the coefficient vector are sufficiently large. In fact, I(2s ⋆ ) suffices when s ⋆ is known.
While I(s) is qualitative, results on estimation and support recovery necessarily require a quantitative measure of correlation in the covariates. The following quantity appears in the performance bounds we derive for exponential weights and related methods: for any integer s ≥ 1, define

ν_s = min{σ_min(X_J/√n) : J ⊂ [p], 1 ≤ |J| ≤ s},

where σ_min denotes the smallest singular value. In words, ν_s is the smallest singular value among the sub-matrices of X/√n made of at most s columns. Note that, indeed, I(s) is equivalent to ν_s > 0.
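For small designs, ν_s can be evaluated by brute-force enumeration of column subsets; a minimal illustrative sketch (the cost is exponential in s, so this is not meant as a practical algorithm):

```python
import numpy as np
from itertools import combinations

def nu(X, s):
    """nu_s: smallest singular value of X_J / sqrt(n) over subsets 1 <= |J| <= s."""
    n, p = X.shape
    return min(np.linalg.svd(X[:, list(J)] / np.sqrt(n), compute_uv=False)[-1]
               for k in range(1, s + 1)
               for J in combinations(range(p), k))

n = 6
X = np.sqrt(n) * np.eye(n)          # orthogonal design with ||X_j||^2 = n
print(nu(X, 3))                     # numerically 1 for every s <= p

X_dup = np.hstack([X, X[:, :1]])    # duplicate a column: identifiability fails
print(nu(X_dup, 2))                 # numerically 0, i.e., I(2) fails
```

The second example illustrates the equivalence I(s) ⟺ ν_s > 0: a repeated column makes some 2-column submatrix singular.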

Support recovery
We now state the main result concerning the support recovery problem. It states that, under I((2 + ε)s⋆), the posterior distribution Π concentrates sharply on the support of β⋆ (which we assumed to be s⋆-sparse) as long as λ and the nonzero coefficients are sufficiently large.
Theorem 1. Consider a design matrix X, with p ≥ n and normalized column vectors (2), that satisfies Condition I((2 + ε)s⋆) for some fixed ε > 0. Assume that (9) holds and that the nonzero coefficients of β⋆ satisfy the lower bound (12). Then, with probability at least 1 − 2p^{-c}, the posterior Π concentrates on J⋆.

Under the conditions of Theorem 1, some straightforward calculations imply that Ĵ_rew = J⋆ with probability at least 1 − 6p^{-c}. In particular, as p → ∞, the rew consistently recovers the support of the coefficient vector. Note that the same is immediately true for Ĵ_map, with probability at least 1 − 2p^{-c}. Thus, we have the following corollary. The result applies in the high-dimensional setting p > n, as long as the conditions are met. Note that some classes of matrices, such as random Gaussian, Rademacher or Fourier matrices, are known to satisfy our conditions with probability close to 1; see for instance [37] for a review of existing results. Characterizing all design matrices X that satisfy I((2 + ε)s⋆) in the high-dimensional setting is an interesting open question beyond the scope of this paper.
We mention that, if s ⋆ is known and we restrict the prior over subsets J of size exactly s ⋆ , then the same conclusions are valid with ε = 0 and ν (2+ε)s⋆ replaced by ν 2s⋆ in (12), yielding consistent support recovery under the minimum identifiability condition I(2s ⋆ ). In Section 3, we show that the Lasso estimator requires much more restrictive conditions on the design matrix and β ⋆ to ensure it selects the correct variables with high probability.
Finally, we note that the concentration is even stronger. Under the same conditions, if for some fixed constant m ≥ 1, then We will use this refinement in the proof of Theorem 4.

Estimation
Armed with results for support recovery, we establish corresponding bounds for the estimation problem. Our first result is a simple consequence of Theorem 5 and Proposition 1.
Theorem 2. Consider a design matrix X with p ≥ n and normalized column vectors (2). Let β̂ denote either β̂_map or β̂_rew. Assume λ satisfies (9) with ε ≤ 1/2. Then, with probability at least 1 − 3p^{-c}, we have

We continue with bounds on the estimation error, this time in terms of the ℓ∞-norm. Based on Theorem 1 (and its proof), we deduce the following.
We emphasize that this estimator requires only the near-minimum condition I((2 + ε)s⋆), together with nonzero components of β⋆ somewhat larger than the noise level as in (12), to achieve the optimal (up to logarithmic factors) dependence on n and p in the ℓ∞-norm estimation bound. We develop this point further in Section 3 below, where we compare our procedure to the Lasso and mcp estimators.
We now study the performance of the aew β̂_aew and of the following variant.

Define the quantity ν_min = min{ν_J : J ⊂ [p], ν_J > 0}, and note that ν_min > 0.
Theorem 4. Let the conditions of Theorem 1 be satisfied and let c ≥ 1.

2. If in addition I(s) is satisfied,
3. If in addition I(s⋆ + s) is satisfied and λ ≥ (62 + 4c) log p,

We note that β̂_aew requires at least I(s). (Recall that we assume s is known such that s⋆ ≤ s.) In practice, when the sparsity is unknown, we make a conservative choice s ≫ 2s⋆, so that I(s) is substantially more restrictive than I(2s⋆). Typically, we assume that s⋆ = O(n/log p) and we take s of this order of magnitude. We will see below in Section 2.6 that, for a Gaussian design, the condition I(s⋆ + s) is satisfied with probability close to 1. On the other hand, the estimation result for β̂ holds true under the near-minimum condition I((2 + ε)s⋆). For both estimators, the estimation bounds depend on the quantities ν_min, ν_s, ‖Xβ⋆‖₂ and ‖β⋆‖∞, which can potentially yield a sub-optimal rate of estimation. Note, however, the presence of the factor p^{-c} in the bound. In particular, if the nonzero components of β⋆ are sufficiently large, then the quantities ν_min, ν_s, ‖Xβ⋆‖₂ and ‖β⋆‖∞ can be completely cancelled by taking c > 0 sufficiently large. If I(s⋆ + s) is satisfied, then we can derive a bound that no longer depends on ‖Xβ⋆‖₂ and ‖β⋆‖∞. We will also see below that this bound yields the optimal rate of ℓ∞-norm estimation (up to logarithmic factors) for the estimator β̂_aew when the design matrix is Gaussian. Optimality considerations are further discussed in Section 4, based on recent information bounds obtained elsewhere.

Example: Gaussian design
The quintessential example is that of a random Gaussian design, where the row vectors of X, denoted x_1, . . . , x_n, are independent Gaussian vectors in R^p with zero mean and p × p covariance matrix Σ. If we assume that Σ has 1's on the diagonal, the resulting (random) design is just slightly outside our setting, since the column vectors are not strictly normalized; our results apply nevertheless. It is therefore of interest to lower-bound ν_s for such a design.
We start by relating X and Σ. Consider J ⊂ [p], and let Σ_J denote the principal submatrix of Σ indexed by J. By [40, Cor. 1.50 and Rem. 1.51], there is a numeric constant C > 0 such that, when n ≥ C|J|/η², with probability at least 1 − 2 exp(−η²n/C), we have

where ‖·‖ denotes the matrix spectral norm. When this is the case, by Weyl's theorem [39, Cor. IV.4.9],

where λ_min(A) and λ_max(A) denote the smallest and largest eigenvalues of a symmetric matrix A. Define

for some a ≥ 2. Then, with probability at least 1 − 2p^{-a/2},

For example, in standard compressive sensing where Σ is the identity matrix, we have η_Σ(s) = λ_Σ(s) = 1 for all s, in which case, with high probability, ν_s ≥ 1/2 when n ≥ 2Cs log p. Consequently, the ℓ∞-norm estimation bounds in (14) and (16) are of order bσ√(log(p)/n) for some numerical constant b > 0. Again, the constants are loose in this discussion.
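The claim that ν_s stays bounded away from 0 for a Gaussian design with Σ = I_p can be checked numerically; the sketch below uses brute-force enumeration of column subsets with illustrative sizes, so it is feasible only for small p.

```python
import numpy as np
from itertools import combinations

def nu(X, s):
    """nu_s for design X, by enumerating all column subsets of size <= s."""
    n, p = X.shape
    return min(np.linalg.svd(X[:, list(J)] / np.sqrt(n), compute_uv=False)[-1]
               for k in range(1, s + 1)
               for J in combinations(range(p), k))

rng = np.random.default_rng(2)
n, p, s = 200, 12, 3                  # illustrative sizes; here Sigma = I_p
vals = [nu(rng.standard_normal((n, p)), s) for _ in range(20)]

# With n much larger than s log p, nu_s should be bounded away from 0
# across repetitions (the theory suggests roughly nu_s >= 1/2).
print(min(vals))
```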

Denoising and prediction for exponential weights
The existing literature on exponential weights focuses entirely on the denoising/prediction problem. In particular, sharp oracle inequalities are available; see the recent survey [36]. Our general approach, and what we established in the previous section, allows us to also quantify the performance of exponential weights at denoising. Indeed, as a direct consequence of Proposition 1, we establish new sparsity oracle inequalities in probability for the denoising problem. In particular, we show, without any assumption on X and β⋆, that the map, rew and aew come within a logarithmic factor of the oracle estimator β̂_{J⋆} in terms of denoising performance.

Theorem 5. Consider a design matrix X with p ≥ n and normalized column vectors (2). Assume λ = (62 + 12c) log p for some c > 0. Then, with probability at least

and

Note that here, and everywhere else in the paper, what is true of β̂_map is true of β̂_J for any J such that Π(J) ≥ Π(J⋆).
In [2,35], a sharp sparsity oracle inequality for the prediction problem is established in expectation for the aew, using the approach via Stein's Lemma from [27]. Note that [2] also established an oracle inequality in probability for a different version of the rew that requires knowledge of ‖β⋆‖₁. Here, we instead use the concentration property of the posterior Π and derive a sparsity oracle inequality for denoising, in probability, that is minimax optimal up to a logarithmic factor without knowing ‖β⋆‖₁. An open question is whether our approach can be used to establish a similar result for the prediction problem without knowledge of ‖β⋆‖₁.
Finally, note that there is no contradiction between Theorems 1 and 5, established in the high-dimensional setting p ≥ n, and Theorem 1 in [44], which states that it is impossible to achieve consistent variable selection and denoising simultaneously. More precisely, [44] proved that for any procedure β̂ such that P(J(β̂) = J⋆) → 1 as n → ∞, we also have lim_{n→∞} E[‖Xβ̂ − Xβ⋆‖²] = ∞. Indeed, the bound in (17)-(18) satisfies σ√(8s⋆λ) → ∞ as n → ∞ in the high-dimensional setting p ≥ n.

Bayesian model selection and BIC estimator
Many Bayesian techniques for model selection have been proposed in the literature; see [15] for a comprehensive review. That same paper suggests a procedure similar to ours, except that it is a bona fide Bayesian model and uses the following independence sparsity prior

where ω ∈ (0, 1) controls the sparsity level. Roughly, λ for our prior corresponds to log((1 − ω)/ω) for this prior. Our main results remain valid under this prior. [14] studied the performance of BIC in high-dimensional settings. Not only did they show that BIC is consistent when p < √n (under some mild conditions on the design matrix), they also suggested a modification of the penalty term that yields a method consistent for larger values of p when the number of variables in the true (i.e., sparsest) model, s⋆, is bounded independently of n and p.
[48] also proved a variable selection result for the BIC estimator, essentially under the condition I((2 + ε)s⋆). By a simple modification of our arguments, we can recover these results under the same condition I((2 + ε)s⋆). Indeed, the results we established for Ĵ_map (in particular, Corollary 1) apply to

Ĵ_bic = argmin_{J : |J| ≤ s} { (1/σ²) y⊤(I − P_J)y + λ|J| }.
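The ℓ0-penalized criterion Ĵ_bic above can be evaluated by enumeration when p is small; a toy sketch with illustrative sizes, where the constant 8 in the penalty level λ is an arbitrary choice, not the calibration from the paper:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p, s_bar, sigma = 60, 7, 3, 0.5        # illustrative sizes
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
beta_star = np.zeros(p)
beta_star[[1, 4]] = 3.0                   # true support J* = {1, 4}
y = X @ beta_star + sigma * rng.standard_normal(n)

lam = 8.0 * np.log(p)                     # illustrative penalty level

def crit(J):
    """BIC-type criterion: (1/sigma^2) y'(I - P_J)y + lam |J|."""
    J = list(J)
    if J:
        fit = X[:, J] @ np.linalg.lstsq(X[:, J], y, rcond=None)[0]
    else:
        fit = np.zeros(n)
    return np.sum((y - fit) ** 2) / sigma ** 2 + lam * len(J)

subsets = [J for k in range(s_bar + 1) for J in combinations(range(p), k)]
J_bic = min(subsets, key=crit)
print(J_bic)
```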
[1] considered a prior very similar to (3) and studied the map in the context of the denoising problem. They established an oracle inequality and a performance bound similar to Theorem 5, as well as minimax lower bounds for the denoising problem in model (1).

The Lasso
The Lasso estimator is the solution of the convex minimization problem

β̂_lasso = argmin_{β∈R^p} { ‖y − Xβ‖₂²/n + 2λ‖β‖₁ },

where λ = Aσ√(log(p)/n), A > 0, and ‖·‖₁ is the ℓ₁-norm. The Lasso has received considerable attention in the literature over the last few years [3,6,8,32,33,49]. It is not our goal to give an exhaustive presentation of all existing results here; we refer to Chapter 4 in [30] and the references cited therein for a comprehensive overview of the literature. Concerning the ℓ∞-norm estimation and support recovery problems, the most popular assumption is the Irrepresentable Condition [3,30,33,42,49], denoted from now on by IC(s⋆); see for instance Assumption 4.2 in [30]. The condition IC(s⋆) is strictly more restrictive than the identifiability condition I(2s⋆) and does not hold true in general when the columns of the design matrix X are not weakly correlated.
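For comparison purposes, the Lasso objective (in the standard form ‖y − Xβ‖₂²/n + 2λ‖β‖₁, with the normalization ‖X_j‖₂² = n assumed here) can be minimized by cyclic coordinate descent with soft thresholding; a minimal sketch with illustrative sizes, not the procedure analyzed in the papers cited above:

```python
import numpy as np

def soft(t, thr):
    """Soft-thresholding operator S(t, thr) = sign(t) max(|t| - thr, 0)."""
    return np.sign(t) * np.maximum(np.abs(t) - thr, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2 / n + 2 lam ||b||_1, assuming ||X_j||^2 = n."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                     # current residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]      # remove coordinate j from the fit
            b[j] = soft(X[:, j] @ r / n, lam)
            r -= X[:, j] * b[j]      # put updated coordinate back
    return b

rng = np.random.default_rng(4)
n, p, sigma = 100, 30, 0.5
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
beta_star = np.zeros(p)
beta_star[:3] = 2.0
y = X @ beta_star + sigma * rng.standard_normal(n)

A = 4.0                              # illustrative choice of the constant A
b = lasso_cd(X, y, A * sigma * np.sqrt(np.log(p) / n))
print(np.nonzero(b)[0])
```

Note the soft thresholding shrinks the large coefficients by roughly λ, which is the bias that motivates the nonconcave penalties discussed next.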
We say that an ℓ∞-norm estimation rate is optimal if it is of the form ασ√(log(p)/n), where α > 0 is an absolute constant, as in the case of the Gaussian sequence model where n = p, X = I_n is the n × n identity matrix and z ∼ N(0, σ²I_n). On the one hand, the best available ℓ∞-norm estimation bound for the Lasso with Gaussian design matrices is of order σs⋆√(log(p)/n), and it is not clear whether this bound can be improved. Based on this bound, we can only guarantee that the Lasso is consistent for the support recovery problem when ρ ≳ σs⋆√(log(p)/n); see Section 4.3 in [30] for more details. On the other hand, we established that our exponential weights procedure attains the optimal ℓ∞-norm estimation rate and achieves support recovery provided that ρ ≳ σ√(log(p)/n) and Condition I((2 + ε)s⋆) holds, a condition that allows for non-negligible correlations between the columns of X.

The mcp
The mcp estimator, initially proposed by [47], is the solution of the following nonconvex minimization problem:

where λ, γ > 0 and the mcp penalty function Υ is nonconvex, has derivative equal to 0 outside a compact neighborhood of 0, and admits a nonzero right derivative at 0; see equations (2.1)-(2.3) in [47] for more details. The performance of this estimator is established in Theorem 1 and Corollary 1 of [47] for a choice of parameters involving a quantity d* and the corresponding sparse singular value ν_{d*}. We establish our support recovery result for the randomized exponential weights estimator under the near-minimum conditions ν_{(2+ε)s⋆} > 0 and a lower bound on ρ, where ε > 0 is some small absolute constant. Our conditions are less restrictive, since for small s⋆ we will have (2 + ε)s⋆ ≤ d* and consequently ν_{(2+ε)s⋆} ≥ ν_{d*}.
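For concreteness, the mcp penalty of [47] is commonly written as Υ(t; λ, γ) = λ∫₀^{|t|} (1 − x/(γλ))₊ dx: it grows linearly near 0 (right derivative λ at 0) and is constant, equal to γλ²/2, for |t| ≥ γλ, so large coefficients are not shrunk. A minimal numerical sketch of this closed form (the parameter values are illustrative):

```python
import numpy as np

def mcp(t, lam, gamma):
    """MCP penalty: lam * integral_0^|t| max(1 - x/(gamma*lam), 0) dx."""
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a ** 2 / (2 * gamma),   # quadratic near 0
                    gamma * lam ** 2 / 2)             # flat beyond gamma*lam

lam, gamma = 1.0, 3.0
print(mcp(np.array([0.0, 0.5, 10.0]), lam, gamma))
```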
In the high-dimensional setting, we often have (2 + ε)s⋆ ≪ d*, together with very ill-posed design matrices X whose ν_s decreases extremely fast as a function of s. Consequently, we will have ν_{(2+ε)s⋆} ≫ ν_{d*}. In that situation, the randomized exponential weights estimator achieves support recovery under much weaker conditions than those required in [47] to establish the support recovery consistency of the mcp estimator. We also note that Corollary 1 in [47] is obtained in the asymptotic setting p ≫ n > s⋆ → ∞, while our results hold for any setting of p, n, s⋆ as long as ν_{(2+ε)s⋆} > 0 and the above condition on ρ is met. This includes in particular the setting of [47].
Finally, we remark that the optimal theoretical choice of γ for the mcp estimator in Corollary 1 of [47] depends on d* through ν_{d*}. For arbitrary design matrices X where no theoretical bound on d* and ν_{d*} is available, we may need to compute these quantities. This is a delicate combinatorial problem in high dimensions, since it requires considering a very large number of submatrices of X. The same parameter γ in Corollary 3, and in the discussion below it in [48], depends on s⋆, which is typically unknown. Our exponential weights estimators do not present the same limitation: the tuning of λ in (3) does not require computing any restricted eigenvalues, and the parameter s can be chosen conservatively (for example, s = ⌊n/2⌋ if no other information is available). The randomized exponential weights and map estimators will achieve support recovery under the near-minimum condition I((2 + ε)s⋆) even when s is chosen this conservatively.

Discussion
We established performance bounds for exponential weights when applied to the problems of denoising, estimation and support recovery, and deduced similar results for a slightly different Bayesian model selection procedure [15] and for ℓ0-penalized (BIC-type) variable selection. How sharp are these bounds? We did not optimize the numerical constants appearing in our results, simply because we believe our bounds are loose, and also because there are no known sharp information bounds for these problems, except in specific cases [25]. That said, some results are available in the literature [41,34,31,1], and our bounds come close to these. For example, from [34] we learn that, when I(2s⋆) holds, there is a universal constant C > 0 such that, for any estimator β̂ that knows s⋆, with probability at least 1/2,

where we recall that κ_s is defined in (20). From [41], we learn that, for another universal constant C′ > 0,

E. Arias-Castro and K. Lounici
Thus we see that our estimation bounds (14) and (16) come quite close to these information bounds. See also the detailed discussion in [1].
Of course, there is a trade-off with computational tractability, as computing the exponential weights estimates (or even approximating them) in polynomial time remains an open problem. That said, numerical experiments in [35] show that these methods are promising.

Proofs
For the sake of brevity, we write ‖·‖ = ‖·‖₂ throughout this section. The proofs are rather lengthy, but the driving idea is to show that Σ_{J∈J : J≠J⋆} Π(J) is negligible compared to Π(J⋆) under suitable conditions. To this end, we study the ratios Π(J)/Π(J⋆) for all J ≠ J⋆. An important part of the proofs consists in deriving sharp control on the magnitude of the noisy perturbations. For the sake of clarity, this part is done separately in Lemmas 2-4, whose proofs are given in the appendix. Armed with these intermediate results, we can tune the parameter λ accurately in order to obtain the desired properties for the various exponential weights estimators considered in this paper.

Proof of Theorem 5
with

For the inner product on the RHS, note that ξ_J ∈ span(X_{J∪J⋆}) and ξ_{J⋆} ∈ span(X_{J⋆}), so that

by the Cauchy-Schwarz inequality.

Proof of Proposition 1
Recall (21). We reformulate (22) in the following way:

The natural idea is then to divide the possible subsets J into the classes

J_{s,t} = {J ⊂ [p] : |J| = s, |J ∩ J⋆| = t, J ≠ J⋆}

and study the behaviour of the above difference on each of these classes. (A similar strategy was carried out in [5] in a model selection framework to derive denoising/prediction oracle inequalities for various models.) We first bound the inner product in (28).

Lemma 3.
For any c > 0, with probability at least 1 − p −c , for all J ∈ J s,t with t ≤ s ∧ s ⋆ .
We now bound the quadratic term in (28).

Lemma 4.
For any c > 0, with probability at least 1 − p −c , for all J ∈ J s,t with t ≤ s ∧ s ⋆ .
Combining (21) and (33) gives

We then use the standard bound on the binomial coefficient

Hence, we have so far that

where we used the fact that p^ω ≥ 2 (because p ≥ 2), and also

−(λ − ω log p)εs⋆ + s⋆ω log p ≤ −c log p,

because of (9). This shows that

using the fact that ω ≥ c. From this, and the fact that p^{-c} ≤ 1/2, we conclude the proof.

Proof of Theorem 1
Let ν = ν (2+ε)s⋆ for short. The proof of this result is identical to that of Proposition 1 up to (32). We now need a lower bound on γ J . For this, we use the following irrepresentability result.
, with smallest singular value δ, and let P₂ denote the orthogonal projection onto the column span of X₂. Then for any β₁,

Note that for any J ∈ J_{s,t} with s − t ≤ (1 + ε)s⋆, the smallest singular value of [X_{J⋆} X_{J\J⋆}] is bounded from below by √n ν; by Lemma 5, this implies that

Hence,

where we recall that ρ is defined in (12).
In view of (32) and (39), we have, with probability at least 1 − 2p^{-c}, for all J ∈ J_{s,t},

Next, we have

The first sum on the right-hand side was already bounded in the proof of Proposition 1. We concentrate on the second sum.
Combining (38) with (41)-(44), we conclude that

using the fact that α ≥ ω ≥ c. From this, we get

This concludes the proof of Theorem 1. We note that the proof of (13) is virtually identical.

Proof of Theorem 3
We prove the result for β̂_map; the proof for β̂_rew is the same up to some trivial modifications. For r > 0, we have

By Theorem 1, Ĵ_map = J⋆ with probability at least 1 − 2p^{-c}, so that the second term on the RHS is bounded by 2p^{-c}. Next, we know that β̂_{J⋆} ∼ N(β⋆, (σ²/n)Ψ⋆^{-1}) with Ψ⋆ := X_{J⋆}⊤X_{J⋆}/n, and in particular β̂_{J⋆,j} − β⋆_j ∼ N(0, σ²τ_j²/n), where τ_j² is the jth diagonal entry of Ψ⋆^{-1}. This matrix being positive semi-definite, its diagonal entries are all bounded from above by its largest eigenvalue, which is the inverse of the smallest eigenvalue of Ψ⋆, which in turn is larger than ν²_{s⋆}. Hence, Var(β̂_{J⋆,j}) ≤ σ²/(nν²_{s⋆}) for all j ∈ J⋆, so that a standard tail bound on the normal distribution and the union bound give

Taking r = σ√(2(c + 1) log(p)/(nν²_{s⋆})) bounds this by p^{-c}, and the desired result follows.

Proof of Theorem 4
Again, we only prove the result for β̂_map; the proof for β̂_rew is almost identical up to some trivial modifications.

Proof of Lemma 4
Fix J ∈ J_{s,t}. First, we notice that

z⊤(P_J − P_{J⋆})z = z⊤(P_J − P_{J∩J⋆})z − z⊤(P_{J⋆} − P_{J∩J⋆})z ≤ z⊤(P_J − P_{J∩J⋆})z,

since P_{J⋆} − P_{J∩J⋆} is an orthogonal projection, and therefore positive semidefinite. Moreover, Q_J := P_J − P_{J∩J⋆} is also an orthogonal projection, of rank s − t, so that ‖Q_J z‖² ∼ σ²χ²_{s−t}. Chernoff's bound applied to the chi-square distribution yields

log P(χ²_m > a) ≤ −(m/2)(a/m − 1 − log(a/m)) ≤ −a/4, for all a ≥ 2m.
The rest of the proof is exactly the same as that of Lemma 3.