Aggregation of Affine Estimators

We consider the problem of aggregating a general collection of affine estimators for fixed design regression. Relevant examples include some commonly used statistical estimators such as least squares, ridge and robust least squares estimators. Dalalyan and Salmon (2012) have established that, for this problem, exponentially weighted (EW) model selection aggregation leads to sharp oracle inequalities in expectation, but similar bounds in deviation were not previously known. While existing results indicate that the same aggregation scheme may not satisfy sharp oracle inequalities with high probability, we prove that a weaker notion of oracle inequality holds for EW with high probability. Moreover, using a generalization of the newly introduced $Q$-aggregation scheme, we also prove sharp oracle inequalities that hold with high probability. Finally, we apply our results to universal aggregation and show that our proposed estimator leads simultaneously to all the best known bounds for aggregation, including $\ell_q$-aggregation, $q \in (0,1)$, with high probability.


INTRODUCTION
In the Gaussian Mean Model (GMM), we observe a Gaussian random vector $Y \in \mathbb{R}^n$ such that $Y \sim \mathcal{N}(\mu, \sigma^2 I_n)$, where the mean $\mu \in \mathbb{R}^n$ is unknown and the variance parameter $\sigma^2$ is known. For the purpose of discussion, we assume that $\sigma^2 = 1$ throughout this introduction, but our main subsequent results explicitly depend on $\sigma^2$. This apparently simple model, introduced in a famous paper by Stein [Ste56], was the starting point of a vast literature on shrinkage [Gru98] that later evolved into the Gaussian sequence model. This literature is much too vast to explore here, but we refer the reader to the excellent manuscript by Johnstone [Joh11] for both motivation and a partial literature review.
Independently of the variety of methods and results dedicated to the GMM, Nemirovski [JN00,Nem00] introduced aggregation theory as a versatile tool for adaptation in nonparametric estimation [Lec07,RT07,Yan04], but also more recently in high dimensional regression [LB06,RT11,DS12]. In all these results, exponential weights have played a key role (see [RT12] for a recent survey). Specifically, we focus here on model selection aggregation where, given a family of estimators $\hat\mu_1, \ldots, \hat\mu_M$, the goal is to mimic the best of them. Originally, aggregation was accompanied by a sample splitting scheme in which the sample was split into two parts: the first one to construct various estimators and the second to aggregate them. For example, this approach was practically implemented in [RT07] for density estimation and in [Lec07] for classification. The advantage of sample splitting is that it allows one to freeze the first sample and therefore treat the estimators to be aggregated as deterministic functions that only satisfy mild boundedness assumptions. This is the framework of pure aggregation under which most of the developments have been made, starting from the seminal works on aggregation [JN00,Nem00,Tsy03]. Pure model selection aggregation in the GMM can be described as follows. Given $M \ge 2$ vectors $\mu_1, \ldots, \mu_M$, the goal is to construct an estimator $\hat\mu$, called an aggregate, using the observation $Y$ and such that $\|\hat\mu - \mu\|^2 - \min_{1 \le j \le M} \|\mu_j - \mu\|^2$ is as small as possible, where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^n$. Bounds on this quantity are called sharp oracle inequalities. While not directly connected to Stein's original result on admissibility [Ste56], it turns out that for aggregation too, the most natural choice $\hat\mu = \mu_{\hat\jmath}$, where $\hat\jmath = \operatorname{argmin}_{1 \le j \le M} \|\mu_j - Y\|^2$, is suboptimal.
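The quantities in this description can be made concrete on toy data. The following Python sketch computes the selector that picks the candidate closest to $Y$ and the excess risk $\|\hat\mu - \mu\|^2 - \min_j \|\mu_j - \mu\|^2$ of the resulting estimator. All function names and the simulation setup are hypothetical and serve only as an illustration; the sketch does not implement any of the aggregates studied in this paper.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_closest(candidates, y):
    """Index j minimizing ||mu_j - y||^2 (the 'natural' model selection rule)."""
    return min(range(len(candidates)), key=lambda j: sq_dist(candidates[j], y))


def excess_risk(estimate, candidates, mu):
    """||estimate - mu||^2 - min_j ||mu_j - mu||^2, the quantity to keep small."""
    return sq_dist(estimate, mu) - min(sq_dist(c, mu) for c in candidates)


def simulate_selection(candidates, mu, sigma=1.0, seed=0):
    """Draw Y ~ N(mu, sigma^2 I_n) and return the excess risk of the selector."""
    rng = random.Random(seed)
    y = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]
    j_hat = select_closest(candidates, y)
    return excess_risk(candidates[j_hat], candidates, mu)
```

Because the selector always returns one of the candidates, its excess risk is nonnegative; the point of aggregation is to control how large it can be.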
Nevertheless, this problem is by now well understood, and various choices for $\hat\mu$ relying on model averaging rather than model selection were proposed and proved to be optimal (see [RT12] and references therein). Two approaches have been employed successfully. The first family of methods is based on exponential weights [DT07,DT08]. Following original ideas of Catoni [Cat99] and Yang [Yan99], it can be proved that for any prior probability distribution $\pi = (\pi_1, \ldots, \pi_M)$ on $[M] = \{1, \ldots, M\}$, there exists an aggregate $\hat\mu^{\mathrm{EW}}$ based on exponential weights that satisfies the following sharp oracle inequality: where here and in what follows $C > 0$ is a numerical constant that may change from line to line.
In particular, if $\pi$ is chosen to be the uniform distribution, this estimator attains the optimal rate $C \log(M)$ [RT11], which is independent of the dimension $n$. Nevertheless, it was observed in [DRZ12] that the random quantity $\|\hat\mu^{\mathrm{EW}} - \mu\|^2$ may have fluctuations of order $\sqrt{n}$ around its expectation, so that the bound (1.1) may fail to accurately describe the risk of $\hat\mu^{\mathrm{EW}}$, especially for large dimension $n$. To overcome this limitation, a new method called $Q$-aggregation was recently proposed and studied in several settings [Rig12,DRZ12,LR13]. It enjoys the following property. For any prior $\pi$ on $[M]$, it yields an aggregate $\hat\mu^{Q}$ that satisfies not only a sharp oracle inequality in expectation of the form (1.1) but also one that holds with high probability: with probability $1 - \delta$.
In this paper, we extend this work to the aggregation not of fixed vectors $\mu_1, \ldots, \mu_M$ but of affine estimators $\hat\mu_1, \ldots, \hat\mu_M$ of the form $\hat\mu_j = A_j Y + b_j$ for some deterministic matrix-vector pair $(A_j, b_j)$. Note that these estimators are constructed using the same observations $Y$ as the ones employed for aggregation. In particular, no sample splitting scheme is needed.
A canonical example of affine estimators, where the $A_j$ are projection matrices, was first introduced in [LB06] and further studied in [RT11] in the light of high-dimensional linear regression. In a remarkable paper, Dalalyan and Salmon [DS12] recently extended these setups to a more general family of affine estimators, under mild conditions on the matrices $A_j$. Nevertheless, all these previous papers are limited to deriving sharp oracle inequalities in expectation of the same type as (1.1). Moreover, the lower bounds of [DRZ12] indicate that the estimators based on exponential weights that are employed in [LB06,RT11,DS12] are unlikely to satisfy sharp oracle inequalities with high probability. In this paper, akin to [Rig12,DRZ12,LR13], we demonstrate that $Q$-aggregation succeeds where exponential weights have failed by proving a sharp oracle inequality that holds with high probability in section 2.1. Yet, the situation regarding exponential weights is not desperate, as we show in section 2.2 that they still lead to a weaker notion of oracle inequality.
The rest of this paper is organized as follows. In the next section, we give a precise description of the problem of model selection aggregation of affine estimators and give a solution to this problem using $Q$-aggregation. Specifically, in section 2.1, we show that for any prior probability distribution $\pi = (\pi_1, \ldots, \pi_M)$ on $[M] = \{1, \ldots, M\}$, there exists an aggregate $\hat\mu^{Q}$ based on $Q$-aggregation that satisfies a sharp oracle inequality of the form that holds both in expectation and with high probability, where $\mathrm{Tr}(A_j)$ denotes the trace of $A_j$. We continue by proving in section 2.2 that for any $\varepsilon > 0$, there exists a choice of the temperature parameter for which the better known aggregate $\hat\mu^{\mathrm{EW}}$ based on exponential weights satisfies a weak oracle inequality that holds with high probability. Such an inequality complements the sharp oracle inequality of [DS12] that holds in expectation. We give applications of these oracle inequalities to sparsity pattern aggregation and universal aggregation in section 3. In particular, we show that $Q$-aggregation of projection estimators leads to the first sharp oracle inequalities that hold with high probability for these two problems. By "high probability", we mean a statement that holds with probability at least $1 - \delta$, $0 < \delta < 1/2$. Our results below exhibit explicit dependence on $\delta$.
Notation: For any integer $n$, the set of integers $\{1, \ldots, n\}$ is denoted by $[n]$. We denote by $\mathrm{Tr}(A)$ and $\mathrm{Rk}(A)$ respectively the trace and the rank of a square matrix $A$. We denote by $\|\cdot\|$ the Euclidean norm of $\mathbb{R}^n$ and by $|J|$ the cardinality of a finite set $J$. For any real numbers $a_1, \ldots, a_n$, $\mathrm{diag}(a_1, \ldots, a_n)$ denotes the $n \times n$ diagonal matrix with $a_1, \ldots, a_n$ on the diagonal. The indicator function is denoted by $\mathbb{1}(\cdot)$ and for any integer $n$ and $K \subset [n]$, $\mathbb{1}_K$ denotes the vector $v \in \{0, 1\}^n$ with $j$th coordinate given by $v_j = 1$ iff $j \in K$. For any matrix $B$, $B^\dagger$ denotes the Moore-Penrose pseudoinverse of $B$. The operator norm of a matrix is denoted by $\|\cdot\|_{\mathrm{op}}$. The cone of $n \times n$ positive semidefinite matrices is denoted by $S_n$. The flat simplex of $\mathbb{R}^M$ is denoted by $\Lambda_M$ and is defined by
$$\Lambda_M = \Big\{\theta \in \mathbb{R}^M : \theta_j \ge 0, \ \sum_{j=1}^M \theta_j = 1\Big\}.$$
The set $\Lambda_M$ can be identified with the set of probability measures on $[M]$ and for any $\theta, \pi \in \Lambda_M$, we define the Kullback-Leibler divergence between these two measures by
$$K(\theta, \pi) = \sum_{j=1}^M \theta_j \log\Big(\frac{\theta_j}{\pi_j}\Big),$$
with the usual convention that $0 \log(0) = 0$, $0 \log(0/0) = 0$ and $\theta \log(\theta/0) = +\infty$ for all $\theta > 0$. Finally, throughout the paper, we use the notation $\log(x)$ to denote the function $\log(x) = (\log x) \vee 1$.
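The conventions attached to the Kullback-Leibler divergence are easy to get wrong in practice, so a minimal Python sketch may help. It assumes the standard definition $\sum_j \theta_j \log(\theta_j/\pi_j)$ for two points of the flat simplex; the function names `in_simplex` and `kl` are hypothetical.

```python
import math


def in_simplex(theta, tol=1e-12):
    """Check that theta has nonnegative entries summing to one (up to tol)."""
    return all(t >= 0 for t in theta) and abs(sum(theta) - 1.0) <= tol


def kl(theta, pi):
    """Kullback-Leibler divergence with the conventions stated in the notation."""
    total = 0.0
    for t, p in zip(theta, pi):
        if t == 0:
            continue  # 0 log(0/p) = 0, including the case 0 log(0/0) = 0
        if p == 0:
            return math.inf  # t log(t/0) = +infinity whenever t > 0
        total += t * math.log(t / p)
    return total
```

In particular, for a vertex $e_j$ of the simplex, the divergence from $e_j$ to $\pi$ equals $\log(1/\pi_j)$, which is $\log M$ under the uniform prior.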

AGGREGATION OF AFFINE ESTIMATORS
Recall that the Gaussian Mean Model (GMM) can be written as follows. One observes $Y \in \mathbb{R}^n$ such that
$$Y = \mu + \xi, \qquad \xi \sim \mathcal{N}(0, \sigma^2 I_n). \tag{2.1}$$
Throughout this paper and in accordance with [DS12], we call an affine estimator of $\mu$ any estimator $\hat\mu$ of the form
$$\hat\mu = AY + b, \tag{2.2}$$
where $A \in S_n$ is an $n \times n$ matrix and $b \in \mathbb{R}^n$ is an $n$-dimensional vector. Both $A$ and $b$ are deterministic. Given a family of affine estimators $\hat\mu_1, \ldots, \hat\mu_M$, where $\hat\mu_j = A_j Y + b_j$, and a prior probability measure $\pi = (\pi_1, \ldots, \pi_M)$ on these estimators, our goal is to construct an aggregate $\hat\mu$ such that
$$\|\hat\mu - \mu\|^2 \le (1 + \varepsilon)\|\hat\mu_j - \mu\|^2 + T_j \tag{2.3}$$
with probability $1 - \delta$ for any $j \in [M]$, where $T_j > 0$ is as small as possible and $\varepsilon \ge 0$. As we will see, we can achieve $\varepsilon = 0$ using $Q$-aggregation but only prove a weak oracle inequality with $\varepsilon > 0$ in section 2.2 using exponential weights. Inequalities of the form (2.3) with $\varepsilon > 0$ can be of interest as long as there exists a candidate affine estimator $\hat\mu_j$ that is close to $\mu$ with high probability. Several examples where this is the case are described in [DS12].
Our results below hold under the following general condition on the family of matrices $\{A_j\}_{j \in [M]}$.
To illustrate the purpose of aggregating affine estimators and the relevance of Condition 1, observe that a large body of the literature on the GMM studies estimators of the form $AY$, where $A = \mathrm{diag}(a_1, \ldots, a_n)$ is a diagonal matrix with elements $a_j \in [0, 1]$ for all $j = 1, \ldots, n$. If $\mu$ is assumed to belong to some family of regularity classes such as Sobolev ellipsoids, Besov classes or tail classes, it has been proved that such estimators are minimax optimal (see [CT01,Tsy09,Joh11]).
Commonly used examples include ordered projection estimators, spline estimators and Pinsker estimators (see [DS12] for a detailed description). These estimators are known to be minimax optimal over Sobolev ellipsoids [Pin80,GN92,Tsy09]. Diagonal filters trivially satisfy Condition 1 with V = 1.
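To make the notion of a diagonal filter concrete, here is a minimal Python sketch of two such filters, together with the two quantities that matter for our bounds: the operator norm (which gives $V$) and the trace. The ordered projection weights follow the description above; the parametrization $a_j = (1 - j^\alpha/w)_+$ used for the Pinsker-type weights is an assumed, illustrative form, and all function names are hypothetical.

```python
def ordered_projection(n, k):
    """Weights a_j = 1 for the first k coordinates and 0 for the remaining ones."""
    return [1.0 if j < k else 0.0 for j in range(n)]


def pinsker_type(n, alpha, w):
    """Illustrative Pinsker-type weights a_j = (1 - j^alpha / w)_+, j = 1..n."""
    return [max(0.0, 1.0 - (j ** alpha) / w) for j in range(1, n + 1)]


def apply_filter(a, y):
    """Compute AY for the diagonal matrix A = diag(a)."""
    return [ai * yi for ai, yi in zip(a, y)]


def op_norm_diag(a):
    """Operator norm of diag(a), which is the largest |a_j|."""
    return max(abs(ai) for ai in a)


def trace_diag(a):
    """Trace of diag(a), which is the sum of the weights."""
    return sum(a)
```

Since every weight lies in $[0, 1]$, the operator norm never exceeds 1, which is the sense in which $V = 1$ for diagonal filters.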
We give details of a specific application to sparsity pattern aggregation and its consequences on universal aggregation in section 3.

Sharp oracle inequalities using Q-Aggregation
In this section, we state our main result: a sharp oracle inequality for an aggregate of affine estimators based on $Q$-aggregation. Specifically, we consider the problem of aggregating general affine estimators $\hat\mu_j = A_j Y + b_j$, $j \in [M]$, that satisfy Condition 1. Note that unlike [DS12], we do not require that the matrices $A_j$, $j \in [M]$, commute and we make no assumption on the vectors $b_j$, $j = 1, \ldots, M$. Moreover, our results can be extended to an infinite family $\{(A_\lambda, b_\lambda), \lambda \in \Lambda\}$ as in [DS12], but we prefer to present our result in the discrete case for the sake of clarity.
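For any $\theta \in \mathbb{R}^M$, let $\mu_\theta$ denote the linear combination of some given affine estimators $\hat\mu_1, \ldots, \hat\mu_M$ that is defined by
$$\mu_\theta = \sum_{j=1}^M \theta_j \hat\mu_j.$$
Our goal is to find a vector $\hat\theta \in \mathbb{R}^M$ such that the aggregate $\mu_{\hat\theta}$ mimics the affine estimator $\hat\mu_j$ that is the closest to the true mean $\mu$.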
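In this paper, we consider a generalization of the $Q$-aggregation scheme of static models that was developed in [Rig12,DRZ12]. To that end, fix a prior probability distribution $\pi \in \Lambda_M$ and for any $\theta \in \Lambda_M$, define where $\nu \in (0, 1)$ and $\lambda > 0$ are tuning parameters, and $C_j$ is set to be
$$C_j = 4\sigma^2\,\mathrm{Tr}(A_j). \tag{2.5}$$
Let now $\hat\theta$ be defined as
$$\hat\theta \in \operatorname{argmin}_{\theta \in \Lambda_M} Q(\theta). \tag{2.6}$$
The resulting estimator $\mu_{\hat\theta}$ is called the $Q$-aggregate estimator of $\mu$. Theorem 1 is our main result.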
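The following Python sketch illustrates how a $Q$-aggregate could be computed numerically. It rests on an assumption that should be stated loudly: the criterion is taken to be $Q(\theta) = \nu \sum_j \theta_j \|Y - \hat\mu_j\|^2 + (1 - \nu)\|Y - \mu_\theta\|^2 + \sum_j \theta_j C_j + \lambda K(\theta, \pi)$, a form chosen to be consistent with the remark in section 2.2 that $\nu = 1$ removes the quadratic term in $\theta$. The solver (entropic mirror descent with a conservative step size) is our own illustrative choice and not a method prescribed by this paper; the names `q_criterion` and `q_aggregate` are hypothetical, and the prior is assumed to be strictly positive.

```python
import math


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def combine(theta, fits):
    """mu_theta = sum_j theta_j * fits[j], computed coordinate by coordinate."""
    return [sum(t * f[i] for t, f in zip(theta, fits)) for i in range(len(fits[0]))]


def kl(theta, prior):
    """KL divergence for a strictly positive prior, with 0 log 0 = 0."""
    return sum(t * math.log(t / p) for t, p in zip(theta, prior) if t > 0)


def q_criterion(theta, y, fits, pen, prior, nu, lam):
    """ASSUMED form of Q: nu * linear + (1 - nu) * quadratic + penalties."""
    linear = sum(t * sq_dist(y, f) for t, f in zip(theta, fits))
    quadratic = sq_dist(y, combine(theta, fits))
    penalty = sum(t * c for t, c in zip(theta, pen))
    return nu * linear + (1 - nu) * quadratic + penalty + lam * kl(theta, prior)


def q_aggregate(y, fits, pen, prior, nu=0.5, lam=20.0, iters=500):
    """Approximate minimizer of q_criterion over the simplex (mirror descent)."""
    theta = list(prior)
    radius2 = max(sum(x * x for x in f) for f in fits)
    step = 1.0 / (lam + 2.0 * (1.0 - nu) * radius2)  # conservative step size
    risks = [sq_dist(y, f) for f in fits]
    for _ in range(iters):
        resid = [m - yi for m, yi in zip(combine(theta, fits), y)]
        grad = []
        for t, p, f, r, c in zip(theta, prior, fits, risks, pen):
            inner = sum(ri * fi for ri, fi in zip(resid, f))
            log_ratio = math.log(max(t, 1e-300) / p)  # floor guards against log(0)
            grad.append(nu * r + 2.0 * (1.0 - nu) * inner + c + lam * (log_ratio + 1.0))
        shift = min(grad)  # a constant shift cancels in the normalization below
        weights = [t * math.exp(-step * (g - shift)) for t, g in zip(theta, grad)]
        total = sum(weights)
        theta = [w / total for w in weights]
    return theta
```

Here `fits[j]` stands for the vector $A_j Y + b_j$ and `pen[j]` for $C_j$. The multiplicative update keeps the iterate in the simplex, so no projection step is needed.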
Theorem 1. Consider the GMM (2.1) and let $\hat\mu_j = A_j Y + b_j$, $j \in [M]$, be affine estimators of $\mu$ together with a prior distribution $\pi = (\pi_1, \ldots, \pi_M)$ on these estimators, and let $V = \max_{j \in [M]} \|A_j\|_{\mathrm{op}}$. Let $\hat\mu^{Q} = \mu_{\hat\theta}$ be the $Q$-aggregate estimator with $\hat\theta$ defined in (2.6) with tuning parameters $\nu \in (0, 1)$ and λ ≥ 8σ 2 min(ν,1−ν,( 5 Moreover, the same $Q$-aggregate estimator $\hat\mu^{Q}$ satisfies
A few remarks are in order. Note that the oracle inequality of Theorem 1 is sharp since the leading term $\|\hat\mu_j - \mu\|^2$ has multiplicative constant 1. A similar oracle inequality was obtained in [DS12], but our main theorem above presents significant differences. First, and this is the main contribution of this paper, the above oracle inequality holds with high probability whereas the ones in [DS12] only hold in expectation. Nevertheless, our model is simpler than the one studied in [DS12], which covers heteroskedastic regression. Moreover, the bound in [DS12, Theorem 2] is "scale-free" whereas ours depends critically on the size of the matrices $A_j$ via $C_j$ and $V$. We believe that this dependence cannot be avoided in high probability bounds, as such quantities essentially control the deviations of the estimators. As we will see, the bounds of Theorem 1 are sufficient to perform sparsity pattern aggregation and universal aggregation optimally.

Weak oracle inequality using exponential weights
The oracle inequalities (1.1) and (1.2) are sharp, in contrast to weak oracle inequalities where the right-hand side of (1.1) or (1.2) is replaced by for some $\varepsilon > 0$ (see [LM12] for a discussion on the difference between sharp and weak oracle inequalities). While they appear to be quite similar, some estimators satisfy weak oracle inequalities but not sharp ones. This is the case of the aggregate with exponential weights, which provably fails to satisfy a sharp oracle inequality with high probability in certain setups [DRZ12, Proposition 2.1].
To prove weak oracle inequalities that hold with high probability, we modify the aggregate studied in [DS12].
is a family of affine estimators equipped with a prior probability distribution $\pi \in \Lambda_M$ and that $C_j$ is defined in (2.5). Let $\hat\theta \in \Lambda_M$ be the vector of exponential weights defined by The parameter $\lambda > 0$ is often referred to as the temperature parameter. It is not hard to show (see, e.g., [Cat04, p. 160]) that $\hat\theta$ is the solution of the following optimization problem: Observe that the above criterion corresponds to $Q$ defined in (2.4) with $\nu = 1$, that is, without the quadratic term in $\theta$. We believe that this quadratic term is key in obtaining sharp oracle inequalities that hold with high probability. We already know from previous work [LB06, RT11, RT12, DS12] that this term is not necessary to obtain sharp oracle inequalities that hold in expectation. As illustrated below, it is also not required to get weak oracle inequalities, even with high probability. Denote by $\hat\mu^{\mathrm{EW}} = \sum_{j=1}^M \hat\theta_j \hat\mu_j$ the aggregate with exponential weights $\hat\theta_j$, $j \in [M]$, defined in (2.9).
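The link between exponential weights and the penalized criterion can be checked numerically. The Python sketch below assumes that the weights take the Gibbs form $\hat\theta_j \propto \pi_j \exp(-(\|Y - \hat\mu_j\|^2 + C_j)/\lambda)$ and that the criterion is $\sum_j \theta_j(\|Y - \hat\mu_j\|^2 + C_j) + \lambda K(\theta, \pi)$. Both forms are assumptions, chosen so that the former minimizes the latter over the simplex; the function names are hypothetical and the prior is assumed to be strictly positive.

```python
import math


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def exponential_weights(y, fits, pen, prior, lam):
    """theta_j proportional to prior_j * exp(-(||y - fits[j]||^2 + pen[j]) / lam)."""
    scores = [-(sq_dist(y, f) + c) / lam for f, c in zip(fits, pen)]
    top = max(scores)  # subtract the maximum before exponentiating, for stability
    weights = [p * math.exp(s - top) for p, s in zip(prior, scores)]
    total = sum(weights)
    return [w / total for w in weights]


def ew_criterion(theta, y, fits, pen, prior, lam):
    """sum_j theta_j (||y - fits[j]||^2 + pen[j]) + lam * KL(theta, prior)."""
    fit_term = sum(t * (sq_dist(y, f) + c) for t, f, c in zip(theta, fits, pen))
    kl_term = sum(t * math.log(t / p) for t, p in zip(theta, prior) if t > 0)
    return fit_term + lam * kl_term
```

Evaluating `ew_criterion` at the output of `exponential_weights` and at any other point of the simplex gives a quick sanity check of the variational characterization: the former should never be larger.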
Theorem 2. Let the conditions of Theorem 1 hold. Let $\hat\mu^{\mathrm{EW}} = \mu_{\hat\theta}$ be the aggregate with exponential weights $\hat\theta$ defined in (2.9) with tuning parameter $\lambda \ge 4\sigma^2 (16 \vee 5V)$. Then for any $\delta > 0$, with probability at least $1 - \delta$, we have
Note that unlike Theorem 1, the right-hand side of the above oracle inequality is multiplied by a factor $1 + \varepsilon > 1$: it is a weak oracle inequality, but it holds with high probability and thus complements the results of [DS12] on aggregation of affine estimators using exponential weights. Alquier and Lounici [AL11] prove the first oracle inequality with high probability using exponential weights. They use specific projection estimators $\hat\mu_j$ for sparsity pattern aggregation but make extra assumptions and use a prior probability measure tailored to these assumptions in order to obtain a sharp oracle inequality. While their final result [AL11, Theorem 3.1] is not directly comparable to ours, a weak oracle inequality similar to the one above can be deduced from their proof. Actually, our proof uses one of their arguments.

SPARSITY PATTERN AGGREGATION
In this section, we illustrate the power of the two oracle inequalities stated in the previous section. Indeed, carefully selecting the affine estimators $\hat\mu_1, \ldots, \hat\mu_M$, as well as the prior probability distribution $\pi$, leads to various optimal results. Some results for diagonal filters can be found in [DS12] and we focus here on sparsity pattern aggregation.
Recall the results we have proved in the previous section. With probability at least $1 - \delta$, for $\lambda$ large enough and, where $t = 0$ if $\hat\theta$ is computed according to (2.6) and $t = 4\sigma^2$ if $\hat\theta$ is computed according to (2.9).
In the sequel, we fix ν = 1/2 in Q-aggregation since this choice leads to the sharpest bounds.

Sparsity pattern aggregation
Let $X_1, \ldots, X_p \in \mathbb{R}^n$ be given vectors and assume that $\mu \in \mathbb{R}^n$ can be well approximated by a linear combination of $X_j$, $j \in J^*$, for some unknown sparsity pattern $J^* \subset [p]$. More precisely, we are interested in sparse linear regression, where the goal is to find a sparse $\beta \in \mathbb{R}^p$ such that $\|X\beta - \mu\|^2$ is small, where $X = [X_1, \ldots, X_p]$ is the $n \times p$ design matrix obtained by concatenating the $X_j$'s. Akin to [BRT09,RT11], we do not assume that there exists a sparse $\beta^*$ such that $X\beta^* = \mu$ but rather that there may be a systematic error. Oracle inequalities such as the ones described below in Theorems 3 and 4 capture the statistical precision of fitting possibly misspecified sparse linear models in the GMM.
To achieve our goal, we follow the same idea as in [RT11,RT12] and employ sparsity pattern aggregation. The idea can be summarized as follows. For each sparsity pattern of $\beta$, compute the least squares estimator and then aggregate these (projection) estimators. Specifically, for each sparsity pattern $J \subset [p]$, define $X_J$ to be the $n \times |J|$ matrix obtained by concatenating $X_j$, $j \in J$, and let $A_J = X_J (X_J^\top X_J)^\dagger X_J^\top$ denote the projection matrix onto the linear span $\mathrm{span}(X_J)$ of $X_j$, $j \in J$. If $J^*$ were known, a good candidate to estimate $\mu$ would be the least squares estimator $\hat\mu_{J^*} = A_{J^*} Y$. Since $J^*$ is unknown, we propose to aggregate the affine (actually linear) estimators $\hat\mu_J = A_J Y$, $J \subset [p]$. This approach is called sparsity pattern aggregation [RT11] and can be extended to more general notions of sparsity such as group sparsity or fused sparsity [RT12]. It yields a family of affine estimators $\hat\mu_J = A_J Y$ such that $C_J = 4\sigma^2\,\mathrm{Tr}(A_J) = 4\sigma^2\,\mathrm{Rk}(X_J)$ and $V = \max_J \|A_J\|_{\mathrm{op}} = 1$.
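As an illustration of this construction, the Python sketch below computes the projection estimator $A_J Y$ and the associated constant $C_J = 4\sigma^2\,\mathrm{Rk}(X_J)$ for a given pattern $J$. Instead of forming the pseudoinverse explicitly, it builds an orthonormal basis of $\mathrm{span}(X_J)$ by modified Gram-Schmidt, which yields the same projection; this is an implementation choice made for the sketch, the tolerance `tol` is an arbitrary assumed threshold, and all function names are hypothetical.

```python
import math
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(columns, tol=1e-10):
    """Modified Gram-Schmidt; numerically dependent columns are skipped."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            coef = dot(q, v)
            v = [vi - coef * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm > tol:
            basis.append([vi / norm for vi in v])
    return basis


def projection_estimator(columns, pattern, y, sigma=1.0):
    """Return (A_J y, C_J) with C_J = 4 sigma^2 Rk(X_J) for the pattern J."""
    basis = orthonormal_basis([columns[j] for j in pattern])
    fit = [0.0] * len(y)
    for q in basis:
        coef = dot(q, y)
        fit = [fi + coef * qi for fi, qi in zip(fit, q)]
    return fit, 4.0 * sigma ** 2 * len(basis)


def all_patterns(p, max_size):
    """Sparsity patterns J of [p] (0-indexed) with at most max_size elements."""
    return [J for k in range(max_size + 1) for J in combinations(range(p), k)]
```

The number of basis vectors equals the rank of $X_J$, which is also the trace of the projection matrix. Enumerating all patterns is only feasible for small $p$, which is why `all_patterns` takes a size cap.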
Sparsity pattern aggregation has been shown to attain the best available sharp oracle inequalities in expectation [RT11,RT12], and one of the main contributions of this paper is to extend these results to bounds that hold with high probability. Moreover, it leads to universal aggregation with high probability (see section 3.3).
The key to sparsity pattern aggregation is to employ a correct prior probability distribution. Rigollet and Tsybakov [RT12], following [LB06,Gir08], suggest using In particular, it exponentially downweights patterns $J$ according to their cardinality. For any $\beta \in \mathbb{R}^p \setminus \{0\}$, let $|\beta|_0$ denote the number of nonzero coefficients of $\beta$ and, by convention, let $|0|_0 = 1$.
Corollary 1. Taking $\lambda = 20\sigma^2$ and $\lambda = 64\sigma^2$ for $\hat\mu^{Q}$ and $\hat\mu^{\mathrm{EW}}$ respectively, with probability at least $1 - \delta$, we have:
The novelty of this result is twofold. First, we use $Q$-aggregation to obtain the first sharp sparsity oracle inequalities that hold with high probability under no additional condition on the problem. Second, we prove a weak sparsity oracle inequality for the aggregate based on exponential weights that holds with high probability. While it is only a weak oracle inequality, it extends the results of Rigollet and Tsybakov [RT11,RT12] that hold only in expectation and the results of [AL11] that hold with high probability but under additional conditions.

$\ell_q$-aggregation
Recently, Rigollet and Tsybakov [RT11] observed that any estimator that satisfies an oracle inequality such as (2.8) also adapts to sparsity when measured in terms of the $\ell_1$ norm. Specifically, their result [RT11, Lemma A.2] implies that if $\max_j \|\mu_j\| \le \sqrt{n}$, then for any constant $\nu > 0$, the bound (3.6) holds, where $\bar{c}$ is an absolute constant. The above bound hinges on a Maurey argument, which, as noticed by Wang et al. [WPGY11], can be extended from $\ell_1$ balls to $\ell_q$ balls for $q \in (0, 1]$. It has been argued that $\ell_q$-balls ($0 < q \le 1$) describe vectors that are "almost sparse" [FPRU10,Joh11]. For any $q \in (0, 1]$, $\theta \in \mathbb{R}^M$, let $|\theta|_q$ denote the $\ell_q$-"norm" of $\theta$ defined by
$$|\theta|_q = \Big(\sum_{j=1}^M |\theta_j|^q\Big)^{1/q}.$$
Moreover, for a given radius $R > 0$ and any $q \in [0, 1]$, define the $\ell_q$-ball of radius $R$ by $B_q(R) = \{\theta \in \mathbb{R}^M : |\theta|_q \le R\}$.
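For readers who want to experiment with these balls, here is a minimal Python sketch of the $\ell_q$-"norm" and of membership in $B_q(R)$, assuming the standard definition $|\theta|_q = (\sum_j |\theta_j|^q)^{1/q}$ for $q \in (0, 1]$ and the count of nonzero entries for $q = 0$. The function names are hypothetical.

```python
def l0(theta):
    """Number of nonzero coordinates of theta."""
    return sum(1 for t in theta if t != 0)


def lq(theta, q):
    """The l_q 'norm' (sum_j |theta_j|^q)^(1/q) for q in (0, 1]."""
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    return sum(abs(t) ** q for t in theta) ** (1.0 / q)


def in_ball(theta, q, radius):
    """Membership in B_q(radius), using the l_0 count when q == 0."""
    size = l0(theta) if q == 0 else lq(theta, q)
    return size <= radius
```

For a fixed vector, `lq` is nonincreasing in $q$ on $(0, 1]$, so balls with smaller $q$ and the same radius are smaller sets, which is one way to see why they contain "almost sparse" vectors.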
Not surprisingly, these almost sparse vectors can be well approximated by sparse vectors, as illustrated in the following lemma that generalizes (3.6).
Lemma 1. Fix $\nu > 0$, $M \ge 3$ and let and $\mu_j$, $j \in [M]$ such that $\max_j \|\mu_j\|^2 \le B^2$. Then with the convention $|\theta|_0^0 = |\theta|_0$.
We postpone the proof to Appendix B, where further results on the approximation of vectors with small $\ell_q$ norm by sparse vectors can be found. We are now in a position to state the main result of this subsection. Its proof follows directly from the above lemma by rounding $\sqrt{66}$ up to 9 and $5 \cdot 396/3$ to 16.
Theorem 4. Let $\hat\mu_J$, $J \subset [p]$, be defined as in subsection 3.1 with $\pi$ being the sparsity prior defined in (3.1). Moreover, assume that $\max_j \|X_j\|^2 \le B^2$ for some $B > 0$ and assume that $M \ge 3$. Then, the following statements hold with probability at least $1 - \delta$: (ii) The aggregate with exponential weights $\hat\mu^{\mathrm{EW}}$ with $\lambda = 64\sigma^2$ satisfies where, in both cases, $\varphi_{q,p}$ is defined in (3.7).
Both (3.8) and (3.9) can be compared to the prediction rates over $\ell_q$ balls that were derived in [RWY12], where the setup is the following. First, it is assumed that the true mean $\mu$ in (2.1) is of the form $\mu = X\beta^*$ for some $\beta^* \in B_q(R)$, $R > 0$, and that $B = \kappa\sqrt{n}$. In this case, it follows from Theorem 4 that with probability at least $1 - \delta$, we have for any $\hat\mu \in \{\hat\mu^{Q}, \hat\mu^{\mathrm{EW}}\}$ that for some numerical constants $C_1, C_2$. In their specific regime of parameters, our rates are of the same order as [RWY12, Theorem 4] and are therefore optimal in that range. However, we provide a better finite sample performance and explicit dependence on the confidence parameter $\delta$ as well as explicit constants that do not depend on $q$. In particular, our bounds are continuous functions of $q$ on the whole closed interval $[0, 1]$. More strikingly, unlike [RWY12], neither of the estimators $\hat\mu^{Q}$, $\hat\mu^{\mathrm{EW}}$ depends on $q$ or $R$, and yet they optimally adapt to these parameters. This remarkable phenomenon is even better illustrated in the context of universal aggregation.

Universal aggregation
In his original description of aggregation, Nemirovski [Nem00] introduced three types of aggregation to which three new types were added later [BTW07,Lou07,WPGY11]. All of these aggregation problems can be described in the following unified way. Given $M \ge 2$ deterministic vectors $\mu_1, \ldots, \mu_M \in \mathbb{R}^n$ and a set $\Theta \subset \mathbb{R}^M$, the goal is to construct an aggregate $\hat\mu$ such that with high probability and where the remainder term $\Delta_{n,M}(\Theta) > 0$ is as small as possible. To each of the six types of aggregation corresponds a unique $\Theta \subset \mathbb{R}^M$ and a smallest possible $\Delta_{n,M}(\Theta)$ for which (3.10) holds. Such a $\Delta_{n,M}(\Theta)$ is called the optimal rate of aggregation (over $\Theta$) [Tsy03]. The six types of aggregation all correspond to choices of $\Theta$ that are intersections of balls $B_q(R)$ for various choices of $q$ and $R$. They are summarized in Table 1. We add a new natural type of aggregation that we call $D$-$\ell_q$ aggregation, where, by analogy to $D$-linear and $D$-convex aggregation, we add to $\ell_q$ aggregation the constraint that $\theta$ must be $D$-sparse. In particular, $D$-convex aggregation introduced in [Lou07] can be identified with $D$-$\ell_1$ aggregation.
Table 1. The seven types of aggregation and the corresponding choice of $\Theta$ (columns: type of aggregation, $\Theta$, optimal rate). The range of parameters is $q \in (0, 1)$, $D \in [M]$, $R > 0$. All numerical constants have been removed for clarity.
While most papers on the subject use different estimators for different aggregation problems [Nem00, Tsy03, RT07, Rig12], Bunea et al. [BTW07] were the first to suggest that one single estimator could solve several aggregation problems all at once and used the BIC estimator to obtain partial results in the form of weak oracle inequalities. More recently, Rigollet and Tsybakov [RT11] showed that the exponential screening estimator solves the first five types of aggregation all at once, without the knowledge of $\Theta$. Using similar arguments, we now show that the $Q$-aggregate solves at once all seven problems of aggregation described in Table 1, not only in expectation, but also with high probability.
Note that the rates in Table 1 are optimal in the sense of [Tsy03] for the most interesting ranges of parameters. Indeed, they match the most general lower bounds of [RT11, RWY12, WPGY11] apart from minor discrepancies that can be erased by placing appropriate assumptions on the range of parameters considered. It is not hard to see from our proofs that the ambient dimension $M$ can be replaced in these bounds by the dimension of the linear span of $\mu_1, \ldots, \mu_M$ [RT11]. Since this is not the main focus of our paper, we choose not to make this dependence explicit in our bounds, but in view of the similarity of our proof techniques and those of [RT11], it is clear that it can be made explicit whenever appropriate by a simple modification of the prior $\pi$.

APPENDIX A: PROOFS OF THE MAIN THEOREMS
The following lemma is key to both of our theorems. It allows us to control the deviation of the empirical risk of any aggregate $\mu_\theta$ around its true risk.
Then, for any $j \in [M]$ we have the following inequality with probability at least $1 - \delta$, Moreover, Observe now that the decomposition (2.1) implies that $\hat\mu$ where $I_n$ denotes the identity matrix of $\mathbb{R}^n$. Next, we obtain from the Cauchy-Schwarz inequality that where, To bound $P_1$, observe that since $A_j$ and $B_k^\top B_k$ both have nonnegative eigenvalues, it holds In particular, the matrices $U_k, V_k$ are orthogonal so that the vectors satisfy $Z = U_k \xi \sim \mathcal{N}(0, \sigma^2 I_n)$ and $W = V_k \xi \sim \mathcal{N}(0, \sigma^2 I_n)$. Since $A_k \in S_n$, we know Applying now the Cauchy-Schwarz inequality and Lemma C.1 yields where, in the inequality, we used the following inequalities: where we recall that $C_k = 4\sigma^2\,\mathrm{Tr}(A_k)$ is defined in (2.5). We now bound $P_2$. To that end, observe that it follows from [Rig12, Lemma 6.1] that Note now that the eigenvalues of $B_k$ belong to $[-V, V]$ so that for $\lambda \ge 20\sigma^2 V$, we have The bounds on $\sqrt{P_1}$ and $\sqrt{P_2}$ together with (A.1) and (A.2) yield The two statements of the lemma follow easily from this bound on the moment generating function using the same arguments as in [Rig12, Theorem 3.1]. Specifically, the statement with high probability follows from a Chernoff bound and the statement in expectation follows from the inequality $t \le e^t - 1$.

A.2 Proof of Theorem 2
For any $\theta \in \Lambda_M$, define and observe that It follows from the definition (2.9) of $\hat\theta$ that for any $j \in [M]$, it holds where $e_j$ denotes the $j$th vector of the canonical basis of $\mathbb{R}^M$. Together with (A.7) applied with $\theta = \hat\theta$ and $\theta = e_j$ respectively, and the identity it yields that for any $j \in [M]$, we have with probability at least $1 - \delta\pi_j$. Together with (A.8) the identity Recall that our assumptions imply that $\lambda > 16\sigma^2$ so that The proof is concluded by a union bound.
On the other hand, we get from Theorem 1 that with probability at least $1 - \delta/2$, it holds It can be shown [RT12] that and we also have that $C_J = 4\sigma^2\,\mathrm{Rk}(X_J) \le 4\sigma^2 |\beta|_0$.
Putting everything together yields that with probability at least $1 - \delta$, it holds To conclude the proof of (3.2), it suffices to observe that $\|A_J \mu - \mu\|^2 \le \|X\beta - \mu\|^2$. The proof of (3.3) follows along the same lines.

A.4 Proof of Theorem 5
Replacing $\beta$ by $\theta$ and $X_j$ by $\mu_j$ in the proof of Theorem 3 leads to The above display combined with Lemma 1 yields that for any $q \in (0, 1)$, $R > 0$, where the function $\varphi_{q,M}$ is defined in (3.7). To complete the proof, it suffices that for any $\Theta \in$ In the rest of the proof, we treat each case separately. To that end, write $\psi(\theta) = 66\sigma^2 |\theta|_0 \log\big(\frac{2eM}{|\theta|_0 \delta}\big) \wedge \varphi_{q,M}(\theta; 9\sigma, B)$.

B.1 Decay of coefficients on $\ell_q$-balls
For any $q > 0$, $\theta \in \mathbb{R}^M$, recall that $|\theta|_q$ denotes the $\ell_q$-norm of $\theta$ and is defined by
$$|\theta|_q = \Big(\sum_{j=1}^M |\theta_j|^q\Big)^{1/q}.$$
It is known [Joh11] that if $q < 1$, such balls contain sparse signals, in the sense that their coefficients decay at a certain polynomial rate. This is quantified by the following lemma, which yields a much sharper result than the one obtained using weak $\ell_q$-balls, especially for $q$ close to 1.
Proof. Let $\{v_j\}_{j \ge 1}$ be an infinite sequence such that $v_j = |\theta_{(j)}|$ for $j \in [M]$ and $v_j = 0$ for $j \ge M + 1$. Next, for any $k \ge 0$, let $B_k$ denote the block of $m$ consecutive integers defined by $B_k = \{km + 1, \ldots, (k + 1)m\}$ and observe that where in the last inequality, we use the fact that $a^p + b^p \le (a + b)^p$ for any $a, b > 0$, $p \ge 1$.

B.2 Proof of Lemma 1
We begin with an approximation bound à la Maurey on $\ell_q$ balls.

APPENDIX C: TECHNICAL LEMMAS

C.1 Deviations of a $\chi^2$ distribution
Let us first recall Lemma 1 of [LM00] in a form that is adapted to our purpose. We omit its proof.