Asymptotic Properties of Lasso+mLS and Lasso+Ridge in Sparse High-dimensional Linear Regression

We study the asymptotic properties of Lasso+mLS and Lasso+Ridge under the sparse high-dimensional linear regression model: the Lasso selects predictors and then modified Least Squares (mLS) or Ridge estimates their coefficients. First, we propose a valid inference procedure for parameter estimation based on parametric residual bootstrap after Lasso+mLS and Lasso+Ridge. Second, we derive the asymptotic unbiasedness of Lasso+mLS and Lasso+Ridge. More specifically, we show that their biases decay at an exponential rate and that they can achieve the oracle convergence rate of $s/n$ (where $s$ is the number of nonzero regression coefficients and $n$ is the sample size) for mean squared error (MSE). Third, we show that Lasso+mLS and Lasso+Ridge are asymptotically normal. They have an oracle property in the sense that they can select the true predictors with probability converging to 1 and the estimates of nonzero parameters have the same asymptotic normal distribution that they would have if the zero parameters were known in advance. In fact, our analysis is not limited to adopting the Lasso in the selection stage, but is applicable to any other model selection criterion whose probability of selecting a wrong model decays exponentially fast.


Introduction
Consider the sparse linear regression model
$$Y = X\beta^* + \epsilon, \qquad (1)$$

where $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$ is a vector of independent and identically distributed (i.i.d.) random variables with mean 0 and variance $\sigma^2$, $Y = (y_1, \dots, y_n)^T \in \mathbb{R}^n$ is the response vector, and $X \in \mathbb{R}^{n \times p}$ is the design matrix, which is deterministic. $\beta^* \in \mathbb{R}^p$ is the vector of model coefficients with at most $s$ ($s < n$) non-zero components. We consider the high-dimensional setting, which allows $p$ and $s$ to grow with $n$ ($p$ can be comparable to or larger than $n$). Note that, here and in what follows, $Y$, $X$, and $\beta^*$ are all indexed by the sample size $n$, but we omit the index whenever this does not cause confusion.
In sparse linear regression models, an active line of research focuses on the recovery of the sparse vector $\beta^*$ by a popular $l_1$ regularization method called the Lasso [47]. The Lasso has been studied under at least three common criteria: (i) model selection, meaning the correct recovery of the support set $S = \{j \in \{1, 2, \dots, p\} : \beta^*_j \neq 0\}$ of the model coefficients $\beta^*$; (ii) $l_q$ estimation error $||\hat\beta - \beta^*||_q^q$, especially for $q = 2$ and $q = 1$, where $\hat\beta$ is the estimate of $\beta^*$; and (iii) prediction error $||X\hat\beta - X\beta^*||_2^2$. The Lasso estimator is defined by
$$\hat\beta(\lambda_n) = \mathop{\mathrm{argmin}}_{\beta} \; ||Y - X\beta||_2^2 + \lambda_n ||\beta||_1, \qquad (2)$$
where $\lambda_n \geq 0$ is the tuning parameter which controls the amount of regularization applied to the estimate. Setting $\lambda_n = 0$ reduces the Lasso problem to Ordinary Least Squares (OLS), which minimizes the unregularized empirical loss. Replacing the $\ell_1$ penalty in (2) by an $\ell_2$ penalty gives the Ridge estimator [28]:
$$\hat\beta_{Ridge}(\lambda_n) = \mathop{\mathrm{argmin}}_{\beta} \; ||Y - X\beta||_2^2 + \lambda_n ||\beta||_2^2. \qquad (3)$$
The Lasso estimator has two nice properties: (i) it generates sparse models by means of $\ell_1$ regularization, and (ii) it is computationally feasible (see [43,19,24]). The asymptotic behavior of Lasso-type estimators was studied by [32] for fixed $p$ and $\beta^*$ as $n \to \infty$. In particular, they showed that under some regularity conditions on the design, $\lambda_n = o(n)$ is sufficient for consistency in the sense that $\hat\beta(\lambda_n) \to_p \beta^*$, and that $\lambda_n$ should grow more slowly (i.e. $\lambda_n = O(\sqrt n)$) for asymptotic normality of the Lasso estimator. On the model selection consistency front, [36] proposed the neighborhood stability condition, which is equivalent to the Irrepresentable condition [25,48,54,51], to prove Lasso consistency for Gaussian graphical model selection. [54] showed that the Irrepresentable condition is almost necessary (for fixed $p$) and sufficient for the Lasso to select the true model, both in the classical fixed $p$ setting and in the high-dimensional setting.
[52] considered a weaker sparsity assumption, meaning that the regression coefficients outside an ideal model are small but not necessarily zero (the sum of their absolute values is of the order $O(s\lambda_n/n)$), and imposed the sparse Riesz condition to establish rate consistency in terms of the sparsity, the bias and the norm of the missing large coefficients.
[51] further established precise conditions on the scalings of (n, p, s) that are necessary and sufficient for sparsity pattern recovery using the Lasso. In addition, thresholded Lasso and Dantzig estimators were introduced in [31] and the authors proved their model selection consistency under less restrictive conditions on the decay rates of the nonzero regression coefficients. Other related work includes [17,55,10,1]. All the aforementioned papers imposed suitable mutual incoherence conditions on the design. On the l 2 estimation error front, the Lasso has been shown under a weaker restricted eigenvalue condition to achieve l 2 convergence rate of (s log p)/n [49,9,39,41,42], which is the minimax optimal rate [45].
However, even if p is fixed and the Irrepresentable Condition is satisfied, there does not exist a tuning parameter λ n which can lead to both variable selection consistency and asymptotic normality [21,55]. More importantly, for the case of p ≫ n, statistical inference for the Lasso estimator with theoretical guarantees is still an insufficiently explored area.
The bootstrap is very useful for inference. For fixed p, [40] developed a perturbation resampling-based method to approximate the distribution of a general class of penalized regression estimates. [13] proposed a modified residual bootstrapping Lasso method that is consistent in estimating the limiting distribution of the Lasso estimator.
[53] developed a low-dimensional projection (LDP) approach to constructing confidence intervals. Though LDP works when $(s \log p)/\sqrt n \to 0$, it is not based on resampling or the bootstrap. Does the bootstrap provide a valid approximation in the case of $p \gg n$? In this paper, we give an affirmative answer to this question based on residual bootstrap after two post-Lasso estimators: Lasso+mLS and Lasso+Ridge. Our method provides a consistent estimate of the limiting distribution of Lasso+mLS (or Lasso+Ridge) even if $p$ grows at an exponential rate in $n$.
A post-Lasso estimator is a special case of a two-stage estimator: (1) selection stage: one selects predictors using the Lasso; and (2) estimation stage: modified Least Squares (mLS) or Ridge is applied to estimate the coefficients of the selected predictors. Our estimators are referred to as Lasso+mLS and Lasso+Ridge. Lasso+mLS is very close to Lasso+OLS [3], which uses Ordinary Least Squares (OLS) in the second stage. Several authors have previously considered two-stage estimators to improve the performance of the Lasso, such as the Lars-OLS hybrid [19], adaptive Lasso [55], relaxed Lasso [37], and marginal bridge estimator [29], to name just a few. A minimal sketch of the two-stage idea is given below.
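The following Python sketch illustrates the two-stage construction with scikit-learn's Lasso in the selection stage and a plain OLS refit in the estimation stage (the Lasso+OLS variant mentioned above); `alpha` is on scikit-learn's penalty scale and the helper name `lasso_plus_ols` is ours, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_plus_ols(X, y, alpha):
    """Two-stage estimator: Lasso selects the support, OLS re-estimates it."""
    # Stage 1: selection by the Lasso.
    lasso = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
    support = np.flatnonzero(lasso.coef_)
    # Stage 2: ordinary least squares restricted to the selected columns.
    beta = np.zeros(X.shape[1])
    if support.size > 0:
        beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return beta, support
```

The mLS and Ridge variants of the second stage are made precise in Section 2, where a sketch of those estimators is also given.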
Our contributions are summarized as follows:
1. We propose a valid inference procedure for parameter estimation based on parametric residual bootstrap after two post-Lasso estimators: Lasso+mLS and Lasso+Ridge. More specifically, we show that the Mallows distance between the distributions of the bootstrap estimator and the Lasso+mLS (or Lasso+Ridge) estimator converges to 0 in probability.
2. Under the Irrepresentable condition and other regularity conditions, we derive the asymptotic unbiasedness of Lasso+mLS and Lasso+Ridge. We show that their biases decay at an exponential rate and that they achieve the oracle convergence rate of $s/n$ for the mean squared error $E||\hat\beta - \beta^*||_2^2$, where $\hat\beta$ is either Lasso+mLS or Lasso+Ridge.
3. We prove the asymptotic normality of Lasso+mLS and Lasso+Ridge. As we show in Theorem 3 and Corollary 2, these two post-Lasso estimators display an oracle property that the Lasso does not have: they select the true predictors with probability converging to 1, and the estimates of the nonzero parameters have the same asymptotic normal distribution that they would have if the zero parameters were known in advance.
4. Our analysis is not limited to adopting the Lasso in the selection stage, but is applicable to any other model selection criterion whose probability of selecting a wrong model decays exponentially fast, for example, stability selection [38], SCAD [21,34] and the Dantzig selector [12,9,26].
Our key assumptions for the validity of residual bootstrap after Lasso+mLS or Lasso+Ridge are the Irrepresentable condition and that $s$ goes to infinity more slowly than $\sqrt n$. The Irrepresentable condition can be weakened to a sparse eigenvalue condition if we adopt stability selection [38] to enhance the selection performance of the Lasso. Without considering model selection, [8] showed that residual bootstrap OLS fails if $p^2/n$ does not tend to 0. Therefore, the condition $s^2/n \to 0$ cannot be weakened. Our conditions on the scalings $(n, p, s)$ are not the sharpest, but they have been used previously in the literature [54] and make our convergence rates more explicit. In addition, we require a gap of size $n^{c_3/2}$ ($c_3 \in (0, 1]$ is a constant) between the decay rate of $\beta^*$ and $n^{-1/2}$, which prevents the estimation from being dominated by the noise terms. [22] proposed a similar constraint, $\min_{1 \le i \le s} |\beta^*_i| \ge M n^{-\kappa}$ with $0 \le \kappa < \frac12$, to show the model selection consistency of Sure Independence Screening. This assumption is weaker than $\min_{1 \le i \le s} |\beta^*_i| \ge M$, which was assumed by [29] in connection with the asymptotic properties of the Bridge estimator. However, as mentioned by [33], inference results based on post-model-selection methods can be misleading when this kind of "beta min" condition fails. Therefore, the proposed inference procedure should be used in practice only when there is believed to be a gap between the decay rate of the nonzero elements of $\beta^*$ and $n^{-1/2}$. Finally, we need some regularity conditions (conditions (a)-(c) in Section 2) which are standard in the sparse high-dimensional linear regression literature [54,29,30].
After we had obtained our bootstrap results in Theorem 4 and Corollary 3, our attention was brought to an independent result in [14] where a variant of the Irrepresentable condition was used to prove the second-order correctness of the residual bootstrap applied to a suitable studentized version of the adaptive Lasso estimator. However, the main results of [14] are valid only for linear combinations of the adaptive Lasso estimator while our results hold for the joint distribution of Lasso+mLS (or Lasso+Ridge). The distance between distributions used in [14] and the proof there are also different from ours. Specifically, [14] adopted the total variation distance and used the Edgeworth expansion in the proof while we study the Mallows distance and our proof is direct. In addition, [14] allows p to grow only at polynomial rates in n while we allow p to grow at an exponential rate in n.
In our paper, before stating the bootstrap result, we first derive the asymptotic unbiasedness and asymptotic normality of Lasso+mLS and Lasso+Ridge. It is well known that $(s \log p)/n$ is the minimax optimal rate for the Lasso under the restricted eigenvalue condition. Since we assume the stronger Irrepresentable condition and some conditions on the scaling $(n, p, s)$, we are able to attain a better rate of $s/n$ for the MSE, which indicates that one can avoid the feature selection penalty of $\log p$ by combining the Lasso with Least Squares or Ridge. We should mention that previous work [3] obtained the $l_2$ convergence rate $||\hat\beta_{Lasso+OLS} - \beta^*||_2^2 = O_p(s/n)$ for the Lasso+OLS estimator under weaker conditions. However, their results hold in probability, and it is not clear whether Lasso+OLS can achieve the oracle convergence rate of $O(s/n)$ in $L_2$-expectation, i.e., whether $E||\hat\beta - \beta^*||_2^2 = O(s/n)$ holds, which is what we need in order to prove the validity of residual bootstrap. On the asymptotic normality front, the authors of [4,5] also adopted OLS after model selection and derived asymptotic normality for inference on the effect of a treatment variable on a scalar outcome in the presence of very many controls. However, they studied a partial linear model, which is different from ours, and the $l_1$ regularization was imposed on the effects of the control variables but not on the effect of the treatment variable.
Notation. For any vector $a = (a_1, \dots, a_m)^T$, we write $||a||_2^2 = \sum_{i=1}^m a_i^2$, $||a||_1 = \sum_{i=1}^m |a_i|$, and $||a||_\infty = \max_{i=1,\dots,m} |a_i|$. For a vector $\beta \in \mathbb{R}^p$ and a set $S \subset \{1, \dots, p\}$, denote by $S^c$ the complement of $S$ and let $\beta_S = \{\beta_j : j \in S\}$. Given an $n$ by $p$ matrix $X$, write $x_i^T \in \mathbb{R}^p$, $i = 1, \dots, n$, and $X_j \in \mathbb{R}^n$, $j = 1, \dots, p$, for the $i$-th row and the $j$-th column of $X$ respectively, where $x_i^T$ is the transpose of $x_i$. For a given $m \times m$ matrix $A$, let $\Lambda_{\min}(A)$ and $\Lambda_{\max}(A)$ denote the smallest and largest eigenvalues of $A$ respectively. Write $\mathrm{tr}(A)$ for the trace of $A$, which is the sum of the diagonal entries of $A$.
The rest of the paper is organized as follows: in Section 2, we define modified Least Squares and Ridge after model selection and study their asymptotic properties. In Section 3, we apply these general properties to the special cases of Lasso+mLS and Lasso+Ridge and then derive their asymptotic unbiasedness, asymptotic normality and the approximation property of residual bootstrap. Similar asymptotic properties of modified Least Squares and Ridge after stability selection are obtained in Section 4. Simulation examples are given in Section 5. We conclude in Section 6. The proofs can be found in the Appendix.

Asymptotic Properties of Modified Least Squares and Ridge after Model Selection
In this section, we begin with a precise definition of the modified Least Squares or Ridge after model selection, and then study their asymptotic properties, including asymptotic unbiasedness, asymptotic normality and the validity of residual bootstrap.

Definitions and Assumptions
Modified Least Squares (or Ridge) after model selection refers to a special type of two-stage estimator. In the first stage, one uses a model selection method to select predictors. For example, let $\hat\beta$ be the Lasso estimator defined in (2); one gets a set of selected predictors $\hat S = \{j \in \{1, 2, \dots, p\} : \hat\beta_j \neq 0\}$. Again, $\hat\beta$ and $\hat S$ depend on $\lambda_n$, but we omit the dependence whenever this does not cause confusion.
In the second stage, a low-dimensional estimation method is applied to the predictors selected in $\hat S$. For example, one can adopt OLS and form the OLS after model selection (denoted by Select+OLS):
$$\hat\beta_{Select+OLS,\hat S} = \mathop{\mathrm{argmin}}_{\beta_{\hat S}} \; ||Y - X_{\hat S}\beta_{\hat S}||_2^2, \qquad \hat\beta_{Select+OLS,\hat S^c} = 0. \qquad (4)$$
The solution of (4) is $\hat\beta_{Select+OLS,\hat S} = (X_{\hat S}^T X_{\hat S})^{-1} X_{\hat S}^T Y$ whenever $X_{\hat S}^T X_{\hat S}$ is invertible; otherwise the solution of (4) is not unique, and one can use the generalized inverse. However, the generalized inverse is not stable when the smallest nonzero eigenvalue of $X_{\hat S}^T X_{\hat S}$ is approximately 0, which may result in poor performance. We therefore propose a modified Least Squares method in the second stage and form our Select+mLS estimator. Let $d = |\hat S|$ and write $\frac{1}{\sqrt n} X_{\hat S}$ in its singular value decomposition (SVD) form
$$\frac{1}{\sqrt n} X_{\hat S} = U D V^T, \qquad (5)$$
where $U$ is an $n \times n$ orthogonal matrix, $D$ is an $n \times d$ diagonal matrix with singular values $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ on the diagonal, and $V^T$ (the transpose of $V$) is a $d \times d$ orthogonal matrix. By simple algebraic operations, OLS based on $(X_{\hat S}, Y)$ has the following form:
$$\hat\beta_{OLS} = \frac{1}{\sqrt n} V D^{-1} U^T Y, \qquad (6)$$
where $D^{-1}$ is a $d \times n$ diagonal matrix with diagonal entries $\lambda_1^{-1}, \lambda_2^{-1}, \dots, \lambda_d^{-1}$. If one or more of the singular values are 0, one can use the generalized inverse, which simply takes $\lambda_k^{-1} = 0$ for all zero-valued $\lambda_k$. We instead propose a hard thresholding of the singular values, that is, shrinking those singular values smaller than $\tau_n$ ($\tau_n > 0$) to zero. We then define a modified Least Squares estimator $\tilde\beta_{mLS}(\tau_n)$ of the same form as (6), except that we take $\lambda_k^{-1} = 0$ for all $\lambda_k < \tau_n$. This estimator is similar to principal components regression [35]. Note that if $0 \le \tau_n^2 \le \Lambda_{\min}(\frac1n X_{\hat S}^T X_{\hat S})$, then $\tilde\beta_{mLS}(\tau_n)$ is the same as $\hat\beta_{OLS}$.

Our final modified Least Squares after model selection (Select+mLS) is defined by
$$\tilde\beta_{Select+mLS,\hat S}(\tau_n) = \tilde\beta_{mLS}(\tau_n), \qquad \tilde\beta_{Select+mLS,\hat S^c}(\tau_n) = 0. \qquad (7)$$
One can also use Ridge in the second stage and form the Ridge after model selection (Select+Ridge):
$$\tilde\beta_{Select+Ridge,\hat S}(\mu_n) = \mathop{\mathrm{argmin}}_{\beta_{\hat S}} \; ||Y - X_{\hat S}\beta_{\hat S}||_2^2 + \mu_n ||\beta_{\hat S}||_2^2, \qquad \tilde\beta_{Select+Ridge,\hat S^c}(\mu_n) = 0, \qquad (8)$$
where $\mu_n \ge 0$ is a smoothing parameter; (8) is equivalent to
$$\tilde\beta_{Select+Ridge,\hat S}(\mu_n) = (X_{\hat S}^T X_{\hat S} + \mu_n I)^{-1} X_{\hat S}^T Y. \qquad (9)$$
When the Lasso is used in the selection stage, we refer to our final estimators as Lasso+mLS and Lasso+Ridge respectively. Here $\tau_n$ and $\mu_n$ are tuning parameters; in our theorems and simulations, $\tau_n \propto \frac1n$ and $\mu_n \propto \frac1n$ give good estimation and prediction performance. For the sake of notational simplicity, we omit the dependence of the estimators on $\lambda_n$ and $\tau_n$ or $\mu_n$ whenever this does not cause confusion.
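As a concrete illustration of the second-stage estimators just defined, here is a short NumPy sketch of $\tilde\beta_{mLS}(\tau_n)$ (SVD with hard-thresholded singular values) and of the closed form (9); the helper names are ours, and the scaling by $1/\sqrt n$ follows (5)-(6).

```python
import numpy as np

def mls_on_support(X_S, y, tau):
    """Modified least squares on the selected columns: OLS via the SVD of
    X_S / sqrt(n), with 1/lambda_k set to 0 whenever lambda_k < tau."""
    n = X_S.shape[0]
    U, lam, Vt = np.linalg.svd(X_S / np.sqrt(n), full_matrices=False)
    inv_lam = np.zeros_like(lam)
    keep = lam >= tau
    inv_lam[keep] = 1.0 / lam[keep]
    return (Vt.T * inv_lam) @ (U.T @ y) / np.sqrt(n)

def ridge_on_support(X_S, y, mu):
    """Ridge on the selected columns, the closed form (9)."""
    d = X_S.shape[1]
    return np.linalg.solve(X_S.T @ X_S + mu * np.eye(d), X_S.T @ y)

def embed(coef_S, support, p):
    """Place the second-stage estimate back into a length-p coefficient vector."""
    beta = np.zeros(p)
    beta[support] = coef_S
    return beta
```

For example, Lasso+mLS with the recommended scaling would use the support returned by the Lasso fit together with `mls_on_support(X[:, support], y, tau=1.0/n)`.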
Assumption (b): Suppose that the predictors are standardized, i.e., $\frac1n X_j^T X_j = \frac1n \sum_{i=1}^n x_{ij}^2 = 1$ for $j = 1, \dots, p$.
The Gaussian assumption (a) is fairly standard in the literature. Assumption (c) ensures that the smallest eigenvalue of $C_{11}$ is bounded away from 0 so that its inverse behaves well. (a)-(c) are typical assumptions in the sparse linear regression literature, see for example [54,29,30]. Assumption (d) means that the number of relevant predictors $s$ is allowed to diverge, but much more slowly than $n$, and that the number of predictors $p$ can grow faster than $n$ (up to exponentially fast), which is standard in almost all of the high-dimensional inference literature. Though this assumption is stronger than the typical one, $\frac{s \log p}{n} \to 0$, it has been used previously [54].
In the following subsections, we show that both Select+mLS and Select+Ridge have good asymptotic properties if the probability of selecting a wrong model, $P(\hat S \neq S)$, decays fast, say $o(e^{-n^\kappa})$ (where $\kappa > 0$ is a constant).

Bias and MSE of Select+mLS and Select+Ridge
In a high-dimensional setting, bias is not the only consideration because of the bias-variance trade-off. Regularization has been a popular technique for model fitting; it results in a biased estimator but decreases the mean squared error (MSE) dramatically. However, if two estimators have the same MSE, we prefer the unbiased one. The following Theorems 1 and 2 provide general bounds for the bias and MSE of Select+mLS and Select+Ridge.

Theorem 1. Suppose that Gaussian assumption (a) is satisfied and $\tau_n^2 \le \Lambda_{\min}(C_{11})$; then the bias and the MSE of $\tilde\beta_{Select+mLS}(\tau_n)$ satisfy the bias bound (14) and the MSE bound (15). The first term on the right-hand side of (15) corresponds to the oracle convergence rate. The second term is related to model selection accuracy. In the case of the Lasso, $P(\hat S \neq S)$ can decay at an exponential rate. Hence the MSE is completely determined by the first term, $\frac{\sigma^2}{n}\mathrm{tr}(C_{11}^{-1})$, which cannot be improved.
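The two bounds described above come from splitting the risk over the event that the correct model is selected and its complement. A sketch of that decomposition, written here with a generic second term rather than the exact constants of (15), is
$$E||\tilde\beta - \beta^*||_2^2 = E\big[||\tilde\beta - \beta^*||_2^2 \, I_{\{\hat S = S\}}\big] + E\big[||\tilde\beta - \beta^*||_2^2 \, I_{\{\hat S \neq S\}}\big] \le \frac{\sigma^2}{n}\,\mathrm{tr}(C_{11}^{-1}) + \big(E||\tilde\beta - \beta^*||_2^4\big)^{1/2}\, P(\hat S \neq S)^{1/2},$$
where the first term uses the fact that, on $\{\hat S = S\}$ with $\tau_n^2 \le \Lambda_{\min}(C_{11})$, the estimator coincides with OLS on the true support, and the second term uses the Cauchy-Schwarz inequality, as in the proof in Appendix C.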
Remark 2.2. From Theorem 1, one can also easily obtain an upper bound for the prediction mean squared error $E\{\frac1n ||X\tilde\beta - X\beta^*||_2^2\}$.

Similarly, for $\tilde\beta_{Select+Ridge}(\mu_n)$, we have:

Theorem 2. Suppose that Gaussian assumption (a) is satisfied; then the bias and the MSE of $\tilde\beta_{Select+Ridge}(\mu_n)$ satisfy bounds of the same form as in Theorem 1 (with the proof given in Appendix C).

Theorems 1 and 2 indicate that as long as the probability of selecting a wrong model, $P(\hat S \neq S)$, decays fast, say at an exponential rate, Select+mLS and Select+Ridge are asymptotically unbiased and their MSEs decay at the oracle rate. It is known that under the Irrepresentable condition and other regularity conditions, the probability of the Lasso selecting a wrong model satisfies $P(\hat S \neq S) = o(e^{-n^{c_2}})$ [54]. Applying the above theorems to Lasso+mLS and Lasso+Ridge as special cases, we can then easily derive the convergence rates of their bias and MSE; see Section 3 for more details.

Asymptotic Normality of Select+mLS and Select+Ridge
In this section, we show the asymptotic normality of Select+mLS and Select+Ridge. Let $\hat\Psi$ and $\Psi$ be the distribution functions of $\sqrt n(\tilde\beta_S - \beta^*_S)$ and $N(0, \sigma^2 C_{11}^{-1})$ respectively, where $\tilde\beta$ can be either of the two post-model-selection estimators, Select+mLS and Select+Ridge, and let $\hat S$ be the selected predictor set.

Theorem 3. Suppose that assumptions (a)-(e) are satisfied and that the model selection procedure is consistent, i.e., $P(\hat S \neq S) = o(1)$; then Select+mLS and Select+Ridge are asymptotically normal, that is, $\hat\Psi$ converges weakly to $\Psi$.

This theorem states that model selection consistency in the first stage implies the asymptotic normality of the second-stage estimators Select+mLS and Select+Ridge. The proof of this theorem can be found in Appendix C.

Residual Bootstrap after Select+mLS and Select+Ridge
To make reliable scientific discoveries, we need valid inference procedures, including confidence regions and hypothesis tests for the parameters. Although we have derived the asymptotic normality of Select+mLS and Select+Ridge, it is difficult to use in practice because the noise variance $\sigma^2$ is unknown and hard to estimate in a high-dimensional setting. The bootstrap is a popular alternative in this case. A summary of bootstrap methods in linear and generalized linear penalized regression models for fixed $p$ can be found in [46]. We consider the $p \gg n$ case by proposing a new inference procedure: residual bootstrap after the two-stage estimators. Our method allows $p$ to grow at an exponential rate in $n$.
In the context of a regression model, residual bootstrap is a standard way to bootstrap when the design matrix $X$ is deterministic [18,23,32]. Let $\tilde\beta$ denote Select+mLS or Select+Ridge; the residual vector is
$$\hat\epsilon = (\hat\epsilon_1, \dots, \hat\epsilon_n)^T = Y - X\tilde\beta.$$
Consider the residuals centered at their mean, $\{\hat\epsilon_i - \hat\mu, \; i = 1, \dots, n\}$, where $\hat\mu = \frac1n \sum_{i=1}^n \hat\epsilon_i$. For the residual bootstrap, one obtains $\epsilon^* = (\epsilon^*_1, \dots, \epsilon^*_n)^T$ by resampling with replacement from the centered residuals $\{\hat\epsilon_i - \hat\mu, \; i = 1, \dots, n\}$ and forms $Y^*$ as
$$Y^* = X\tilde\beta + \epsilon^*.$$
One can then define the selected predictor set $\hat S^*$ and the Select+mLS or Select+Ridge estimator $\tilde\beta^*$ based on the bootstrap sample $(X, Y^*)$. For the bootstrap to be valid, one needs to verify that the conditional distribution of $T^*_n = \sqrt n(\tilde\beta^* - \tilde\beta)$ given $\epsilon$, which can be computed directly from the data, approximates the distribution of $T_n = \sqrt n(\tilde\beta - \beta^*)$. The difference between the two distributions can be measured by the Mallows metric.

Definition 1. The Mallows metric $d$, relative to the Euclidean norm $||\cdot||$, of two distributions $F$ and $\tilde F$ is
$$d(F, \tilde F) = \inf \big\{ (E||Z - W||_2^2)^{1/2} : Z \sim F, \; W \sim \tilde F \big\},$$
where the infimum is over all pairs of random vectors $Z$ and $W$ such that $Z$ has distribution $F$ and $W$ has distribution $\tilde F$. By Lemma 8.1 in [7], the infimum can be attained. To proceed, denote by $G_n$ and $G^*_n$ the distribution of $T_n$ and the conditional distribution of $T^*_n$ respectively, and let $P^*$ denote the conditional probability given the error variables $\{\epsilon_i, i = 1, \dots, n\}$.
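A minimal Python sketch of this residual bootstrap, assuming a generic two-stage fitting function `fit(X, y)` (for instance the Lasso+mLS helpers sketched earlier); `B` and the random seed are illustrative.

```python
import numpy as np

def residual_bootstrap(X, y, fit, B=500, seed=0):
    """Residual bootstrap after a two-stage estimator.

    `fit` maps (X, y) to a length-p coefficient vector; returns the
    B x p array of draws T*_n[b] = sqrt(n) * (beta*_b - beta_tilde)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta_tilde = fit(X, y)
    resid = y - X @ beta_tilde
    resid_centered = resid - resid.mean()            # center the residuals
    T_star = np.empty((B, p))
    for b in range(B):
        eps_star = rng.choice(resid_centered, size=n, replace=True)
        y_star = X @ beta_tilde + eps_star           # bootstrap response Y*
        beta_star = fit(X, y_star)                   # refit, including re-selection
        T_star[b] = np.sqrt(n) * (beta_star - beta_tilde)
    return T_star
```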
To show the validity of residual bootstrap after Select+mLS or Select+Ridge, we need more conditions: Assumption (dd): suppose that s 2 /n → 0, as n → ∞.
Assumption (f): suppose that both the probability of selecting a wrong model based on the original data $(X, Y)$ and that based on the resample $(X, Y^*)$ decay at an exponential rate, i.e., $P(\hat S \neq S) = o(e^{-n^{c_2}})$ and $P^*(\hat S^* \neq \hat S) = o_p(e^{-n^{c_2}})$.
Assumption (g): suppose that $\beta^*_S$ satisfies the boundedness condition stated in (22).

Assumption (dd) is stronger than assumption (d) because it requires that $s$ grows more slowly than $\sqrt n$. Without considering model selection, [8] showed that residual bootstrap OLS fails if $p^2/n$ does not tend to 0. Therefore, assumption (dd) cannot be weakened. As we show in the next section, assumption (f) is satisfied if the Irrepresentable condition and some regularity conditions hold. Assumption (g) is a technical assumption which makes the convergence rates in Theorems 1 and 2 more explicit.
Since $\beta^*$ has only $s \ll n$ nonzero components, this assumption is not very restrictive. Obviously, it is satisfied when the maximum of the $\beta^*_j$ is bounded by a constant. Assumption (h) is not very restrictive either, because the number of terms in the sum is $s \ll n^{1/2}$, and it clearly holds when all the predictors corresponding to the nonzero coefficients are bounded by a constant $M$, i.e., $|x_{ij}| \le M$, $i = 1, \dots, n$, $j = 1, \dots, s$. [29] also assumed this condition to show the asymptotic normality of the Bridge estimator.
Theorem 4 states that residual bootstrap after Select+mLS or Select+Ridge gives a valid approximation to the distribution of $T_n$ if the probabilities of selecting wrong models, $P(\hat S \neq S)$ and $P^*(\hat S^* \neq \hat S)$, decay at an exponential rate and the number of true predictors $s$ grows more slowly than $\sqrt n$. The proof of this theorem can be found in Appendix C.
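In one dimension and for equal sample sizes, the (second-order) Mallows distance between two empirical distributions reduces to the root mean squared difference of the sorted samples, since the quantile coupling attains the infimum in Definition 1. The following sketch can be used, for example, to compare the marginal distribution of a coordinate of $T^*_n$ across two independent bootstrap runs; the helper name is ours.

```python
import numpy as np

def mallows2_empirical_1d(x, y):
    """Mallows (Wasserstein-2) distance between the empirical distributions
    of two equal-size samples, via the sorted-sample (quantile) coupling."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have equal size for this shortcut")
    return float(np.sqrt(np.mean((x - y) ** 2)))
```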
In the next section, we will apply the above Theorems 1, 2, 3 and 4 to two special cases: Lasso+mLS and Lasso+Ridge.

Asymptotic Properties of Lasso+mLS and Lasso+Ridge
As shown in Section 2, Select+mLS and Select+Ridge have attractive asymptotic properties (see Theorems 1, 2, 3 and 4) if the probability of selecting a wrong model decays exponentially fast. Applying these theorems to Lasso+mLS and Lasso+Ridge, we can easily obtain their convergence rates for bias and MSE, their asymptotic normality, and the validity of residual bootstrap. To proceed, we first give a brief overview of the assumptions used to obtain model selection consistency of the Lasso, as investigated by [54]. Let $\mathrm{sign}(\cdot)$ map positive entries to 1, negative entries to $-1$ and zero to zero.

Definition 2 (Irrepresentable condition [25,48,36,54,51]). There exists a positive constant vector $\eta$ such that
$$\big| C_{21} C_{11}^{-1} \, \mathrm{sign}(\beta^*_S) \big| \le \mathbf 1 - \eta, \qquad (24)$$
where $\mathbf 1$ is a $(p-s)$ by 1 vector with entries 1 and the inequality holds element-wise.
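A small numerical check of condition (24) for a given design and sign pattern might look as follows (here $C_{11}$ and $C_{21}$ are the blocks of $C = \frac1n X^TX$ indexed by $S \times S$ and $S^c \times S$; the helper name is ours). A check of this kind is what Section 5 refers to when it reports that the condition holds in examples 1-6 and fails in examples 7-8.

```python
import numpy as np

def irrepresentable_margin(X, support, sign_beta_S):
    """Returns 1 - max_j |(C_21 C_11^{-1} sign(beta*_S))_j|; a positive value
    means the strong Irrepresentable condition (24) holds, with eta equal
    to that margin."""
    n, p = X.shape
    C = X.T @ X / n
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(p), S)
    C11 = C[np.ix_(S, S)]
    C21 = C[np.ix_(Sc, S)]
    v = C21 @ np.linalg.solve(C11, np.asarray(sign_beta_S, dtype=float))
    return 1.0 - float(np.max(np.abs(v)))
```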
Assumption (i): there exist constants $c_1 + c_2 < c_3 \le 1$ and $M > 0$ such that $\min_{1 \le i \le s} |\beta^*_i| \ge M n^{-\frac{1-c_3}{2}}$.

Assumption (j): suppose that the tuning parameter $\lambda_n$ in the definition of the Lasso grows at the polynomial rate in $n$ required for the model selection consistency result of [54] (Lemma 2 in Appendix A).

The Irrepresentable condition is a key assumption on the design; it can be weakened, e.g., by using stability selection instead of the Lasso. [38] showed that stability selection achieves the same convergence rate for the probability of selecting a wrong model as the Lasso does, but under a less restrictive sparse eigenvalue condition. We give a brief overview of stability selection combined with the randomized Lasso in sparse linear regression in Appendix B, and we discuss the asymptotic properties of two-stage estimators based on stability selection in the next section. Assumption (i) requires a gap of size $n^{c_3/2}$ between the decay rate of $\beta^*$ and $n^{-1/2}$, thus preventing the estimation from being dominated by the noise terms. It is weaker than $\min_{1 \le i \le s} |\beta^*_i| \ge M$, which was assumed by [29], who studied the asymptotic properties of the Bridge estimator. [22] also proposed a similar constraint, $\min_{1 \le i \le s} |\beta^*_i| \ge M n^{-\kappa}$ with $0 \le \kappa < \frac12$, to show the model selection consistency of Sure Independence Screening.

Now we are ready to state our main results. Let $\hat\beta$ and $\hat\beta^*$ denote the Lasso+mLS (or Lasso+Ridge) estimator based on the original data $(X, Y)$ and that based on the resample $(X, Y^*)$ respectively; we can then define $\hat\Psi$, $\Psi$, $T_n$, $T^*_n$, $G_n$ and $G^*_n$ as in Section 2. Firstly, combining Theorem 1 and the model selection property of the Lasso (see Lemma 2 in Appendix A), we obtain the following corollary.

Corollary 1. Suppose that assumptions (a)-(d), (g), (i), (j) and the Irrepresentable condition (24) are satisfied. If $\tau_n \propto \frac1n$ and $\mu_n \propto e^{-n^{c_2}/4}$, then the bias of the Lasso+mLS and Lasso+Ridge estimators $\hat\beta$ decays at an exponential rate and their MSE satisfies $E||\hat\beta - \beta^*||_2^2 = O(s/n)$.

Proof. We only consider Lasso+mLS, since the proof for Lasso+Ridge is similar. By Lemma 2 in Appendix A, we have $P(\hat S \neq S) = o(e^{-n^{c_2}})$. From assumption (c) (see (12)), we know that $\Lambda_{\min}(C_{11}) \ge \Lambda_{\min} > 0$, and since $C_{11}$ is an $s$ by $s$ matrix, $\mathrm{tr}(C_{11}^{-1}) \le s\,\Lambda_{\min}^{-1}$. Under condition (g) (equation (22)), the remaining terms in the bounds of Theorem 1 are of smaller order, and the corollary follows directly from Theorem 1.
Corollary 1 indicates that Lasso+mLS and Lasso+Ridge are asymptotically unbiased. In particular, their biases decay at an exponential rate and their MSEs achieve the oracle convergence rate of $\frac sn$, which is much faster than that of the Lasso. (Under the restricted eigenvalue condition, the Lasso achieves a convergence rate of $\frac{s\log p}{n}$ for the MSE; see for example [45,52].) Secondly, combining Theorem 3 and Lemma 2, we can easily derive the asymptotic normality of Lasso+mLS and Lasso+Ridge.
Remark 3.1. For fixed $p$ and fixed $\beta^*$, [32] showed that for $\lambda_n = o(\sqrt n)$, the Lasso estimator is also asymptotically normal with limit $N(0, \sigma^2 C^{-1})$ under the conditions $\frac1n X^TX \to C$ and $\frac1n \max_{1 \le i \le n} x_i^T x_i \to 0$. The asymptotic covariance matrix of the rescaled and centered Lasso estimator $\sqrt n(\hat\beta_S - \beta^*_S)$ is then $\sigma^2$ multiplied by the $S \times S$ block of $C^{-1}$, and $(C^{-1})_{SS} \succ C_{11}^{-1}$, where $A \succ B$ means $A - B$ is positive definite. Therefore, Lasso+mLS and Lasso+Ridge have a smaller asymptotic covariance matrix (which is $\sigma^2 C_{11}^{-1}$) compared with the Lasso, and hence reduce estimation uncertainty. Moreover, as pointed out by [21], one cannot find a $\lambda_n$ such that the Lasso estimator is simultaneously model selection consistent and asymptotically normal ($\sqrt n$-consistent). In this sense, Lasso+mLS and Lasso+Ridge improve the performance of the Lasso.

Lastly, we verify the validity of residual bootstrap after Lasso+mLS and Lasso+Ridge. We begin by showing an interesting result: the Lasso estimator based on the resample $(X, Y^*)$,
$$\mathop{\mathrm{argmin}}_{\beta} \; ||Y^* - X\beta||_2^2 + \lambda_n ||\beta||_1,$$
also has model selection consistency.
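The covariance ordering in Remark 3.1 follows from the standard block-inverse (Schur complement) identity. Writing $C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22}\end{pmatrix}$ in the blocks determined by $S$ and $S^c$ (this partitioned notation is our addition) and assuming $C \succ 0$ as in the fixed-$p$ setting of [32], a sketch of the argument is
$$(C^{-1})_{SS} = \big(C_{11} - C_{12} C_{22}^{-1} C_{21}\big)^{-1} \succeq C_{11}^{-1},$$
since $C_{12}C_{22}^{-1}C_{21} \succeq 0$ implies $C_{11} - C_{12}C_{22}^{-1}C_{21} \preceq C_{11}$ and matrix inversion reverses the ordering of positive definite matrices; equality holds if and only if $C_{12} = 0$.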
Remark 3.2. In practice, if the Irrepresentable condition does not hold, one needs to try other model selection methods, e.g., Bolasso [2] or stability selection. As stated before, as long as the probabilities of selecting wrong models, $P(\hat S \neq S)$ and $P^*(\hat S^* \neq \hat S)$, decay at the exponential rate $o(e^{-n^{c_2}})$, residual bootstrap after the two-stage estimator gives a valid approximation. Corollary 3 indicates that residual bootstrap after Lasso+mLS or Lasso+Ridge gives a valid approximation to the distribution of $T_n$ and can therefore be used to construct confidence intervals and tests for the parameters. The proof is straightforward and we omit it.

Asymptotic Properties of Modified Least Squares and Ridge after Stability Selection
If the Irrepresentable Condition is violated, the Lasso cannot correctly select the true model. In this case, one needs to apply other model selection criteria instead of the Lasso. Stability selection is one popular method among many others. [38] showed that it can achieve the same convergence rate of the probability of selecting wrong models but under a less restrictive sparse eigenvalue condition. More details are given in the Appendix B.
Letβ andβ * denote the modified Least Squares (or Ridge) after stability selection (SS+mLS or SS+Ridge) based on the original data (X, Y ) and that based on the resample (X, Y * ) respectively, then we can defineΨ, Ψ, T n , T * n , G n and G * n as Section 2 does. Applying the asymptotic properties of Select+mLS and Select+Ridge in Section 2, we can derive without any proofs the following corollaries parallel to Corollary 1, 2, 3.

Simulation
In this section we carry out simulation studies to evaluate the finite-sample performance of Lasso+mLS and Lasso+Ridge. We also ran simulations for SS+mLS and SS+Ridge (modified Least Squares or Ridge after stability selection). Although, in theory, stability selection achieves the same convergence rate for the probability of selecting a wrong model under weaker conditions than the Lasso, its finite-sample performance is similar to that of the Lasso unless the signal-to-noise ratio is very high. In our simulations, Lasso+mLS and Lasso+Ridge work well and perform similarly to SS+mLS and SS+Ridge regardless of whether the Irrepresentable condition holds. Therefore we only present the results for Lasso+mLS and Lasso+Ridge. In the following simulations, we also compared the performance of Lasso+mLS (with $\tau_n = 1/n$) with that of Lasso+OLS and found that their finite-sample results are almost identical. This is because the smallest singular value of the matrix $X_{\hat S}$ containing the predictors selected by the Lasso is bounded well away from 0 in all examples, which makes the hard-thresholding step unnecessary. We omit the results for Lasso+OLS for the sake of brevity.

Comparison of Bias$^2$, MSE and PMSE
This subsection compares the bias$^2$ ($||E\hat\beta - \beta^*||_2^2$), MSE and prediction mean squared error (PMSE) of Lasso+mLS and Lasso+Ridge with those of the Lasso. We use the R package "glmnet" to compute the Lasso solution. Throughout the simulation, we fix $\tau_n = 1/n$ and $\mu_n = 1/n$, so the only tuning parameter for all three methods is $\lambda_n$, which is chosen by 5-fold cross-validation.
Our simulated data are drawn from the linear model
$$y_i = x_i^T \beta^* + \epsilon_i, \qquad i = 1, \dots, n.$$
We set $p = 500$, $s = 10$ and $\sigma = 1$. The predictor vector $x$ is generated from a multivariate normal distribution $N(0, \Sigma)$; the value of $x$ is generated once and then kept fixed. We consider two different Toeplitz covariance matrices $\Sigma$, which control the correlation among the predictors. Table 1 summarizes the eight example settings. After $X$ was generated, we examined the Irrepresentable condition and found that it holds in examples 1-6 and is violated in examples 7-8. In order to evaluate prediction performance, we generate an independent test set of size 500 and compute the PMSE. Summary statistics are calculated based on 100 replications (keeping $X$ fixed) and shown in Table 2 and Figure 1.
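The data-generating process just described might be set up as follows; this is a hedged sketch in which the correlation parameter, the nonzero coefficient values and the seed are placeholders rather than the exact settings of Table 1, and $\lambda_n$ would be chosen by 5-fold cross-validation (e.g. with `sklearn.linear_model.LassoCV(cv=5)`).

```python
import numpy as np

def make_data(n=200, p=500, s=10, rho=0.5, sigma=1.0, seed=0):
    """Fixed Toeplitz-correlated design, s-sparse beta*, Gaussian noise."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))        # Toeplitz covariance
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # generated once, then fixed
    beta = np.zeros(p)
    beta[:s] = rng.uniform(0.5, 1.5, size=s)                  # placeholder signal values
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```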
We see that Lasso+mLS and Lasso+Ridge perform almost identically. They not only decrease the bias$^2$ of the Lasso dramatically, by more than 90%, but can also reduce the variance (in fact, their variances are 20%-55% smaller than that of the Lasso in all examples except example 7, where their variances are 45% larger); as a result they improve the MSE and PMSE by 40%-80% and 5%-25% respectively. These gains occur regardless of whether the Irrepresentable condition holds, which indicates that Lasso+mLS and Lasso+Ridge dominate the Lasso in terms of estimation.

Finite Sample Distribution
This section evaluates the finite sample distribution of the scaled and centered Lasso+mLS estimator T n = √ n(β − β * ). Lasso+Ridge behaves similarly, so we omit it.
We show in Figure 2 the histograms and normal Q-Q plots of the scaled and centered Lasso and Lasso+mLS estimators based on 1000 replications. We only present the results for the individual coefficients $T_{n,j}$ in example 1 with $j = 1, 6, 11$, corresponding to the largest, a medium-sized and a zero-valued coefficient respectively. The other coefficients in example 1 and the coefficients in examples 2-8 behave similarly (except in example 7, where the Irrepresentable condition does not hold). The finite-sample distribution of Lasso+mLS agrees closely with the asymptotic normal distribution, which verifies the claims in Theorem 3 and Corollary 2. Although the finite-sample distributions of the Lasso estimator for the largest and medium-sized coefficients also appear roughly normal, their centers are shifted away from 0.
We should mention that Lasso+mLS suffers from the same issue raised in [44], which studied the distribution of the adaptive Lasso estimator: the finite-sample distribution can be highly non-normal when there is no gap between the decay rate of the nonzero $\beta^*$ and the order $n^{-1/2}$.

Confidence Intervals and Coverage Probabilities
In this subsection, we study the finite sample performance of residual bootstrap Lasso+mLS and Lasso+Ridge using the examples from Table 1. Since Lasso+mLS and Lasso+Ridge behave similarly, we only show the results of Lasso+mLS for the sake of brevity.
We found that basic confidence intervals based on residual bootstrap Lasso provide more accurate coverage probabilities while having the same length as percentile confidence intervals: the basic 90% confidence intervals achieve coverage probabilities larger than 80%, while the percentile confidence intervals are so biased that their coverage probabilities can be lower than 20% for the nonzero-valued parameters; see Figure 3. The distribution of residual bootstrap Lasso is far from being centered at the true value (Figure 4), which makes the percentile confidence intervals fail. A similar phenomenon occurs for the paired bootstrap Lasso method. Therefore, we suggest using basic confidence intervals in practice with high-dimensional data. In what follows, all of our confidence intervals are basic.

Figure 5 shows the coverage probabilities of 90% confidence intervals based on residual bootstrap Lasso+mLS and residual bootstrap Lasso. In this figure, only 20 zero-valued parameters are presented for the sake of brevity. The coverage probabilities for the remaining $p - s - 20$ zero-valued parameters are similar to the 20 presented and are therefore omitted. In addition, we average the coverage probabilities and interval lengths over the nonzero-valued parameters and the zero-valued parameters respectively (see Tables 3 and 4). From Figure 5 and Tables 3 and 4, we can see that residual bootstrap Lasso+mLS gives accurate coverage probabilities (approximately 88% for nonzero $\beta^*_j$ and 1 for zero $\beta^*_j$) when the Irrepresentable condition holds (examples 1-6), which verifies Corollary 3. Note that, for zero-valued parameters, residual bootstrap Lasso+mLS produces confidence intervals with coverage probabilities close to 1 and very short lengths (approximately 0; see Figure 6 and Table 4), reflecting the oracle properties in Corollary 2. By contrast, residual bootstrap Lasso cannot provide accurate coverage probabilities unless $n$ is large enough (the coverage probability for nonzero $\beta^*_j$ is around 75% for $n = 200$ and around 85% for $n = 400$). Even when the Irrepresentable condition does not hold (examples 7-8), residual bootstrap Lasso+mLS still provides reasonable coverage probabilities (83.5% and 90.3% for nonzero $\beta^*_j$, and 99.1% and 99.7% for zero $\beta^*_j$), while residual bootstrap Lasso does not (71.7% and 82.1% for nonzero $\beta^*_j$, and 93.1% and 92.6% for zero $\beta^*_j$). Compared with residual bootstrap Lasso, the coverage probability of residual bootstrap Lasso+mLS is on average about 7% closer to the nominal level (90%) for nonzero $\beta^*_j$ and 5% closer to 1 for zero $\beta^*_j$. Even though residual bootstrap Lasso has shorter (17% on average) interval lengths for the nonzero $\beta^*_j$, it loses accuracy in coverage. Overall, residual bootstrap Lasso+mLS is better than residual bootstrap Lasso. Moreover, as $n$ increases, the performance of both methods improves.

In practice, many people prefer to perform paired bootstrap Lasso (resampling from the pairs $(x_i, y_i)$, $i = 1, \dots, n$, instead of from the residuals), even when it makes sense to treat the design matrix as fixed. Therefore, we also compare residual bootstrap Lasso+mLS with paired bootstrap Lasso. Figure 7 shows the coverage probabilities versus average interval lengths of residual bootstrap Lasso+mLS and paired bootstrap Lasso for the different examples.
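A hedged sketch of how the basic and percentile intervals compared above are formed from bootstrap draws (using the `residual_bootstrap` output sketched in Section 2; the helper name and the 90% level are illustrative).

```python
import numpy as np

def bootstrap_cis(beta_hat, T_star, n, level=0.90):
    """Basic and percentile confidence intervals from bootstrap draws
    T_star[b] = sqrt(n) * (beta*_b - beta_hat), one column per coefficient."""
    a = (1.0 - level) / 2.0
    lo_q, hi_q = np.quantile(T_star, [a, 1.0 - a], axis=0) / np.sqrt(n)
    # percentile interval: quantiles of the bootstrap estimates themselves
    percentile = np.stack([beta_hat + lo_q, beta_hat + hi_q], axis=-1)
    # basic interval: reflect the bootstrap quantiles around beta_hat
    basic = np.stack([beta_hat - hi_q, beta_hat - lo_q], axis=-1)
    return basic, percentile
```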
Again, we average the coverage probabilities and interval lengths over the nonzero-valued parameters and the zero-valued parameters respectively and report them in Tables 3 and 4. We can see that residual bootstrap Lasso+mLS provides more accurate coverage probabilities (0.5% closer to 1 for zero $\beta^*_j$ and on average 14% closer to the nominal level for nonzero $\beta^*_j$) with more than 90% shorter (for zero $\beta^*_j$) or at least comparable (for nonzero $\beta^*_j$) interval lengths compared with paired bootstrap Lasso. Based on our simulations, we conclude that residual bootstrap Lasso+mLS and residual bootstrap Lasso+Ridge are better choices for constructing confidence intervals.

Conclusion
We have derived for the first time the asymptotic properties of Lasso+mLS and Lasso+Ridge in sparse high-dimensional linear regression models where $p \gg n$. Under the Irrepresentable condition and other common conditions on the scaling $(n, s, p)$, we showed that both Lasso+mLS and Lasso+Ridge are asymptotically unbiased and that they achieve the oracle convergence rate of $\frac sn$ for the MSE, which improves on the performance of the Lasso. In addition, the Lasso+mLS and Lasso+Ridge estimators have an oracle property in the sense that they select the true predictors with probability converging to 1 and the estimates of the nonzero parameters have the same asymptotic normal distribution that they would have if the zero parameters were known in advance.
We then proposed residual bootstrap after Lasso+mLS and after Lasso+Ridge and showed that they give valid approximations to the distributions of Lasso+mLS and Lasso+Ridge, respectively, provided that the probability of the Lasso selecting a wrong model decays at an exponential rate and the number of true predictors $s$ goes to infinity more slowly than $\sqrt n$. In fact, our analysis is not limited to adopting the Lasso in the selection stage, but is applicable to any other model selection criterion whose probability of selecting a wrong model decays exponentially fast, for example stability selection, SCAD and the Dantzig selector. Lastly, we presented simulation results assessing the finite-sample performance of Lasso+mLS and Lasso+Ridge and observed that they not only decrease the bias$^2$ of the Lasso dramatically, by more than 90%, but also reduce the MSE and PMSE by 40%-80% and 5%-25% respectively. Further, we constructed 90% confidence intervals based on our residual bootstrap Lasso+mLS (or Lasso+Ridge) and examined their coverage accuracy. We found that our method yields coverage probabilities of approximately 88% for nonzero $\beta^*_j$ and 1 for zero $\beta^*_j$, which is much more accurate (approximately 7% closer to the nominal level) than the residual bootstrap Lasso method.
Remark A.1. In fact, looking carefully through the proof of Lemma 2 in [54], the Gaussian assumption (a) can be relaxed to a subgaussian assumption, that is, the assumption that there exist constants $C, c > 0$ such that $P(|\epsilon_i| > t) \le C e^{-c t^2}$ for all $t > 0$. This result tells us that, using the Lasso, we can allow $p$ to grow faster than $n$ (up to exponentially fast) while the probability of correct model selection still converges to 1 quickly.

Appendix B: Stability Selection
Here we will give a brief introduction of stability selection combined with randomized Lasso in sparse linear regression.
The randomized Lasso is a generalization of the Lasso which penalizes the absolute value $|\beta_k|$ of every component with a penalty randomly chosen in the range $[\lambda, \lambda/\alpha]$:
$$\hat\beta^{\lambda, W} = \mathop{\mathrm{argmin}}_{\beta} \; ||Y - X\beta||_2^2 + \lambda \sum_{k=1}^{p} \frac{|\beta_k|}{W_k},$$
where $\lambda \in \mathbb{R}^+$ is the regularization parameter. [38] proposed an appropriate distribution for the weights $W_k$: $W_k = \alpha$ with probability $p_w \in (0, 1)$ and $W_k = 1$ otherwise. For any given regularization parameter $\lambda \in \Lambda \subseteq \mathbb{R}^+$, denote the selected predictor set based on a subsample $I \subset \{1, \dots, n\}$ by $\hat S^{\lambda, W}(I) = \{k : \hat\beta^{\lambda, W}_k \neq 0\}$.

Definition 3 (Meinshausen and Buhlmann (2010)). Let $I$ be a random subsample of $\{1, \dots, n\}$ of size $\lfloor n/2 \rfloor$, drawn with replacement. For every set $K \subseteq \{1, \dots, p\}$, the probability of being in the selected set $\hat S^{\lambda, W}(I)$ is
$$\hat\Pi^{\lambda}_K = P^*\big(K \subseteq \hat S^{\lambda, W}(I)\big),$$
where the probability $P^*$ is with respect to both the random subsampling and the randomness of the weights $W_k$. With stability selection, we subsample the data many times and then choose the predictors with a high selection probability.

Definition 4 (Meinshausen and Buhlmann (2010)). For a cut-off $\pi_{thr}$ with $0 < \pi_{thr} < 1$ and a set of regularization parameters $\Lambda$, the set of stable variables is defined as
$$\hat S^{stable} = \Big\{k : \max_{\lambda \in \Lambda} \hat\Pi^{\lambda}_k \ge \pi_{thr}\Big\}.$$

For stability selection with the randomized Lasso, one can obtain selection consistency by assuming only sparse eigenvalues, a condition that is much weaker than the Irrepresentable condition. The sparse eigenvalue condition essentially requires that the minimal and maximal eigenvalues, over selections of order $s$ predictors, are bounded away from 0 and $\infty$ respectively.

Definition 5 (Meinshausen and Buhlmann (2010)). For any $K \subseteq \{1, \dots, p\}$, let $X_K$ be the restriction of $X$ to the columns in $K$. The minimal sparse eigenvalue $\phi_{\min}$ is defined for $k \le p$ as
$$\phi_{\min}(k) = \min_{K : |K| \le k} \Lambda_{\min}\Big(\frac1n X_K^T X_K\Big),$$
and analogously for the maximal sparse eigenvalue $\phi_{\max}$.
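A hedged sketch of stability selection with the randomized Lasso as defined above (scikit-learn's Lasso is used for each fit, so `lambdas` are on its penalty scale; rescaling column $k$ by $W_k$ is equivalent, for support recovery, to dividing its penalty by $W_k$; all parameter values are illustrative).

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, lambdas, alpha_w=0.5, p_w=0.5,
                        n_runs=100, pi_thr=0.6, seed=0):
    """Stability selection with randomized Lasso (sketch).
    Returns the indices k whose maximal selection frequency over lambdas
    is at least pi_thr (Definition 4)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros((len(lambdas), p))
    for _ in range(n_runs):
        idx = rng.choice(n, size=n // 2, replace=True)   # subsample of size n/2 (Definition 3)
        W = np.where(rng.random(p) < p_w, alpha_w, 1.0)  # random weights W_k in {alpha_w, 1}
        Xw = X[idx] * W                                  # column rescaling <=> penalty lambda / W_k
        for i, lam in enumerate(lambdas):
            coef = Lasso(alpha=lam, fit_intercept=False,
                         max_iter=5000).fit(Xw, y[idx]).coef_
            counts[i] += (coef != 0)
    freq = counts / n_runs                               # empirical selection probabilities
    return np.flatnonzero(freq.max(axis=0) >= pi_thr)    # the stable set
```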

Appendix C: Technical details
Proof of Theorem 1. Denote by $\tilde\beta$ the Select+mLS estimator. Conditioned on $\{\hat S = S\}$, for $\tau_n^2 \le \Lambda_{\min}(C_{11})$, we have Combining with the triangle inequality, where $I_A$ is the indicator function. The last equality holds since By the Cauchy-Schwarz inequality, By definition (7), we have where $\tilde D$ is a $d \times n$ diagonal matrix with diagonal entries $\lambda_1^{-1}, \lambda_2^{-1}, \dots, \lambda_d^{-1}$ (note that we take $\lambda_k^{-1} = 0$ for all $\lambda_k < \tau_n$). The last equality holds since the largest singular value of $\tilde D$ is no more than $\tau_n^{-1}$ and $U$ is an orthogonal matrix. Moreover, Combining the above results, we obtain (14). Next, we prove the second part of Theorem 1.
Proof of Theorem 2. Denote by $\tilde\beta$ the Select+Ridge estimator; we have Applying the SVD decomposition of $\frac{1}{\sqrt n} X_{\hat S}$ in (5), it is easy to obtain where $D_1$ is a diagonal matrix with diagonal entries $\frac{\sqrt n}{n\lambda_1^2 + \mu_n}, \dots, \frac{\sqrt n}{n\lambda_d^2 + \mu_n}$. Since $V$ is an orthogonal matrix, the last inequality is due to Combining with the Cauchy-Schwarz inequality, we have Moreover, and using ($\mu_n$ Therefore, which proves the first part of Theorem 2. For the second part, we have Algebraic manipulation yields then, On the other hand, Applying the same trick as in (41), we obtain the result.
Proof of Theorem 3. We prove the results for Select+mLS and Select+Ridge respectively.
In the following, we will prove the validity of residual bootstrap after Select+mLS. The proof for residual bootstrap after Select+Ridge is omitted since the techniques are almost the same.
Proof. Note that P * (|ǫ * i | ≥ t) = 1 n n i=1 I |ǫi−μ|≥t , hence (47) is equivalent to We know thatǫ e c * t 2 I |ǫi−μ|≥t } can be bounded by For the first term in (49), let By Markov inequality, Note that where "→ 0" comes from the assumptions s 2 /n → 0 and max The inequality holds since it is easy to show that therefore, with probability going to 1, we have For the second term in (49), by strong law of large numbers, we have ǫ → 0, a.s., then It is easy to show that Using the same trick as (51), we have 53) holds in probability.
For the third term in (49), we will show that if c * = 1 The above integral can be divided into two parts [0, e 9σ 2 c * ] and [e 9σ 2 c * , ∞]. And the first part is bounded using P (sup For the second part, we have Therefore, where the second inequality comes from the Gaussian tail bound, Combine (55) and (58), we obtain (54). Again, by strong law of large numbers, {e c * t 2 I |ǫi|≥t/3 } holds in probability.
Proof of Theorem 4. Using the same notation as [7], let $F$ be the true distribution of $\epsilon_i$; let $F_n$ be the empirical distribution of $\epsilon_1, \dots, \epsilon_n$; let $\hat F_n$ be the empirical distribution of the residuals $\hat\epsilon_1, \dots, \hat\epsilon_n$; and let $\tilde F_n$ be $\hat F_n$ centered at its mean $\hat\mu$. We first show that the Mallows metric of $G_n$ and $G^*_n$ can be bounded by $\sqrt s \cdot d(F, \tilde F_n)$.
By assumption (f), there exists a set $A_n$ such that $P(A_n) \to 1$ and, for every $\omega \in A_n$, $P^*(\hat S^* \neq \hat S) = o(e^{-n^{c_2}})$.
Fix ω ∈ A n {Ŝ = S}. By definition of Mallows metric, we have By Lemma 8.1 in [7], the infimum in Mallows metric can be obtained. Then we can choose pairs {ǫ i ∼ F, ǫ * i ∼F n , i = 1, ..., n} which are independent and E(ǫ i − ǫ * i ) 2 = d 2 (F,F n ). Now, Let A = {Ŝ = S} and A * = {Ŝ * = S}, straightforward computation and triangle inequality yield where E * is the conditional expectation over ǫ * given ω. From the proof of Theorem 1, we have By Lemma 5, we have Next, we will bound the first three parts in the last inequality respectively. By straightforward computation, we have The penultimate equality is because E(ǫ − ǫ * )(ǫ − ǫ * ) T = d 2 (F,F n )I. Then we only need to show that It is easy to see that In the proof of Theorem 1, we have shown that E||ǫ|| 4 2 = O(n 2 ) (see (42)) and connect with P (A c ) = P (Ŝ = S) = o(e −n c 2 ), we obtain (64). The proof of (65) is the same as (64), so we omit it.
Combining (63), (64), (65) and (62) yields the bound above. Therefore, we can obtain Lemma 6, where the last inequality in its derivation holds because $(\frac1n X_S^T X_S)^{-1}$ is an $s$ by $s$ matrix and $\Lambda_{\max}\big((\frac1n X_S^T X_S)^{-1}\big) \le \Lambda_{\min}^{-1}$. In order to prove Theorem 4, we only have to show the following.

Lemma 7. Suppose that assumptions (a)-(c), (e) and (dd) are satisfied and that $P(\hat S \neq S) \to 0$; then $s\, d^2(\hat F_n, F_n) = o_p(1)$.

Proof. By definition, we obtain the bound, and therefore $s\, d^2(\hat F_n, F_n) = o_p(1)$.
Lemma 8. Suppose that assumptions (a)-(c), (e) and (dd) are satisfied and that P (Ŝ = S) → 0, then Proof. Application of Lemma 8.8 in [7], shows that for random variables U and V with finite second moment, Therefore, if let F n be the empirical distribution of ǫ 1 , ..., ǫ n centered at its mean Connecting the above two equalities, Now use the fact sE( 1 n n i=1 ǫ i ) 2 = sσ 2 n → 0 and the previous Lemma 7, we obtain Lemma 8.
Since d is a metric, We still need to bound d 2 (F, F n ). Denote φ and Φ the density and distribution functions of standard normal distribution N (0, 1) respectively. To control d 2 (F, F n ), we use the following result obtained by [16]: let → ω denote weak convergence, Lemma 9 (del Barrio et al. (2000)). Let ǫ i , i = 1, .., n be a sequence of i.i.d. normal random variables with mean 0 and variance σ 2 . Then where {Z j , j = 3, .., ∞} are a sequence of independent N (0, 1) random variables and a n = 1 n n n+1 In fact we have (see [6]) which together with Lemma 6, Lemma 8 and inequality (66) complete the proof.

Introduction
Consider the sparse linear regression model where ǫ = (ǫ 1 , ..., ǫ n ) T is a vector of independent and identically distributed (i.i.d.) random variables with mean 0 and variance σ 2 . Y = (y 1 , ..., y n ) T ∈ R n is the response vector, and X ∈ R n×p is the design matrix which is deterministic. β * ∈ R p is the vector of model coefficients with at most s (s < n) non-zero components. We consider the high-dimensional setting which allows p and s to grow with n (p can be comparable to or larger than n). Note that, in here and what follows, Y , X, and β * are all indexed by the sample size n, but we omit the index whenever this does not cause confusion.
In sparse linear regression models, an active line of research focuses on the recovery of sparse vector β * by a popular l 1 regularization method called Lasso [40]. The Lasso has been studied under at least three common criteria: (i) model selection criteria, meaning the correct recovery of the support set S = {j ∈ {1, 2, ..., p} : β * j = 0} of the model coefficients β * ; (ii) l q estimation errors ||β − β * || q q , especially l 2 and l 1 , whereβ is the estimate of β * ; and (iii) prediction error ||Xβ − Xβ|| 2 2 . The Lasso estimator is defined bŷ where λ n ≥ 0 is the tuning parameter which controls the amount of regularization applied to the estimate. Setting λ n = 0 reverses the Lasso problem to Ordinary Least Squares (OLS) which minimizes the unregularized empirical loss. Replacing ℓ 1 penalty by ℓ 2 penalty in 1.2 gives the Ridge estimator [24]: 3) The Lasso estimator has two nice properties, namely, (i) it generates sparse models by means of ℓ 1 regularization and (ii) it is also computationally feasible (see [15,20,37]). The asymptotic behavior of Lasso-type estimators has been studied by [27] for fixed p and β * as n → ∞. In particular, they have shown that under some regularity conditions on the design, λ n = o(n) is sufficient for consistency in the sense thatβ(λ n ) → p β * , and λ n should grow more slowly (i.e. λ n = O( √ n)) for asymptotic normality of the Lasso estimator. On the model selection consistency front, [30] proposed the neighborhood stability condition which is equivalent to the Irrepresentable condition [21,41,44,47] to prove the Lasso consistency for Gaussian graphical model selection. [47] showed that the Irrepresentable condition is almost necessary (for fixed p) and sufficient for the Lasso to select the true model both in the classical fixed p setting and in the high-dimensional setting. [45] considered a weaker sparsity assumption, meaning that the regression coefficients outside an ideal model are small but not necessarily zero (the sum of their absolute values is of the order O(sλ n /n)), and imposed the sparse Riesz condition to prove the rate-consistent in terms of the sparsity, bias and the norm of missing large coefficients. In addition, [44] further established precise conditions on the scalings of (n, p, s) that are necessary and sufficient for sparsity pattern recovery using the Lasso. Other related work includes [13,21,41,48]. All the aforementioned papers imposed suitable mutual incoherence conditions on the design. On the l 2 estimation error front, the Lasso has been shown under a weaker restricted eigenvalue condition to achieve l 2 convergence rate of (s log p)/n [6,33,35,36,42], which is the minimax optimal rate [38]. Other work focuses on the convergence rates of ||Xβ(λ n ) − Xβ * || 2 2 and ||β(λ n ) − β * || 1 , see [7,23,43] for example. However, even if p is fixed and the Irrepresentable Condition is satisfied, there does not exist a tuning parameter λ n which can lead to both variable selection consistency and asymptotic normality [17,48]. More importantly, for the case of p ≫ n, statistical inference for the Lasso estimator with theoretical guarantees is still an insufficiently explored area.
The bootstrap is very useful for inference. For fixed p, [34] developed a perturbation resampling-based method to approximate the distribution of a general class of penalized regression estimates. [9] proposed a modified residual bootstrapping Lasso method that is consistent in estimating the limiting distribution of the Lasso estimator. [46] developed a low-dimensional projection (LDP) approach to constructing confidence intervals. Though LDP works for (s log p)/ √ n → 0, it has nothing to do with the idea of resampling and bootstrap. Does the bootstrap provide a valid approximation in the case of p ≫ n? In this paper, we will give an affirmative answer to this question based on residual bootstrap after two post-Lasso estimators: Lasso+mLS and Lasso+Ridge. Our method provides consistent estimate of the limiting distribution of Lasso+mLS (or Lasso+Ridge) even if p grows at an exponential rate in n.
Post-Lasso estimator is a special case of two stage estimators: (1) selection stage: one selects predictors using the Lasso; and (2) estimation stage: modified Least Square (mLS) or Ridge, is applied to estimate the coefficients of the selected predictors. Our estimator is referred to as Lasso+mLS or Lasso+Ridge. Lasso+mLS is very close to Lasso+OLS, which uses Ordinary Least Squares (OLS) in the second stage. Several authors have previously considered two stage estimators to improve the performance of the Lasso, such as the Lars-OLS hybrid [15], adaptive Lasso [48], relaxed Lasso [31], and marginal bridge estimator [25], to name just a few.
Our contributions are summarized as follows: 1. We propose a valid inference procedure for parameter estimation based on parametric residual bootstrap after two post-Lasso estimators: Lasso+mLS and Lasso+Ridge. More specifically, we show that the Mallows distance between the distributions of the bootstrap estimator and the Lasso+mLS (or Lasso+Ridge) estimator converges to 0 in probability.

Under the Irrepresentable condition and other regularity conditions, we
derive the asymptotic unbiasedness of Lasso+mLS and Lasso+Ridge. We show that their biases decay at an exponential rate and that they can achieve the oracle convergence rate of s/n for mean squared error E||β − β * || 2 2 whereβ is either Lasso+mLS or Lasso+Ridge. 3. We prove the asymptotic normality of Lasso+mLS and Lasso+Ridge. As we show in Theorem 3 and Corollary 2, these two post-Lasso estimators display an oracle property that the Lasso does not have: they can select the true predictors with probability converging to 1 and the estimates of nonzero parameters have the same asymptotic normal distribution that they would have if the zero parameters were known in advance. 4. Our analysis is not limited to adopting the Lasso in the selection stage, but is applicable to any other model selection criteria with exponentially decay rates of the probability of selecting wrong models, for example, stability selection [32], SCAD [17,28] and Dantzig selector [6,8,22].
Our key assumptions for the validity of residual bootstrap after Lasso+mLS or Lasso+Ridge are the Irrepresentable condition and that s goes to infinity slower than √ n. The Irrepresentable condition can be weakened by the sparse eigenvalue condition if we adopt stability selection [32] to enhance the selection performance of the Lasso. Without considering model selection, [5] showed that residual bootstrap OLS fails if p 2 /n does not tend to 0. Therefore, the condition s 2 /n → 0 cannot be weakened. Our conditions on the scalings (n, p, s) are not the sharpest but have been previously used in the literature [47] and make our convergence rate more explicit. In addition, we require a gap of size n c 3 2 (c 3 ∈ (0, 1] is a constant) between the decay rate of β * and n − 1 2 which prevents the estimation from being dominated by the noise terms. [18] proposed a similar constraint min 1≤i≤s |β * i | ≥ M n κ , 0 ≤ κ < 1 2 to show model selection consistency of the Sure Independent Screening. This assumption is weaker than min 1≤i≤s |β * i | ≥ M , which was assumed by [25] in connection with the asymptotic properties of the Bridge estimator. Finally, we need some regularity conditions (conditions (a)-(c) in Section 2) which are standard in sparse high-dimensional linear regression literature [25,26,47].
After we had obtained our bootstrap results in Theorem 4 and Corollary 3, our attention was brought to an independent result in [10] where a variant of the Irrepresentable condition was used to prove the second-order correctness of the residual bootstrap applied to a suitable studentized version of the adaptive Lasso estimator. However, the main results of [10] are valid only for linear combinations of the adaptive Lasso estimator while our results hold for the joint distribution of Lasso+mLS (or Lasso+Ridge). The distance between distributions used in [10] and the proof there are also different from ours. Specifically, [10] adopted the total variation distance and used the Edgeworth expansion in the proof while we study the Mallows distance and our proof is direct. In addition, [10] allows p to grow only at polynomial rates in n while we allow p to grow at an exponential rate in n.
In our paper, before stating the bootstrap result, we first derive the asymptotic unbiasedness and asymptotic normality of Lasso+mLS and Lasso+Ridge. It is well known that $(s\log p)/n$ is the minimax optimal rate for the Lasso under the restricted eigenvalue condition. Since we assume the stronger Irrepresentable condition and some conditions on the scaling $(n, p, s)$, we are able to attain the better rate of $s/n$ for MSE, which indicates that one can avoid the feature selection penalty of $\log p$ by combining the Lasso with Least Squares or Ridge. We should mention that previous work [2] obtained the $\ell_2$ convergence rate $\|\hat\beta_{\text{Lasso+OLS}} - \beta^*\|_2^2 = O_p(s/n)$ for the Lasso+OLS estimator under weaker conditions. However, their result holds only in probability, while our rate is in the sense of $L_2$-expectation ($E\|\hat\beta - \beta^*\|_2^2 = O(s/n)$).

Notation. For any vector $a = (a_1, \ldots, a_m)^T$, let $\|a\|_q$ denote its $\ell_q$ norm. For a vector $\beta \in \mathbb{R}^p$ and a set $S \subset \{1, \ldots, p\}$, denote by $S^c$ the complement of $S$ and let $\beta_S = \{\beta_j : j \in S\}$. Given an $n \times p$ matrix $X$, write $x_i^T \in \mathbb{R}^p$, $i = 1, \ldots, n$, and $X_j \in \mathbb{R}^n$, $j = 1, \ldots, p$, for the $i$-th row and the $j$-th column of $X$ respectively, where $x_i^T$ is the transpose of $x_i$. For a given $m \times m$ matrix $A$, let $\Lambda_{\min}(A)$ and $\Lambda_{\max}(A)$ denote the smallest and largest eigenvalues of $A$ respectively, and write $\mathrm{tr}(A)$ for the trace of $A$, the sum of its diagonal entries.
The rest of the paper is organized as follows: in Section 2, we define modified Least Squares and Ridge after model selection and study their asymptotic properties. In Section 3, we apply these general properties to the special cases of Lasso+mLS and Lasso+Ridge and then derive their asymptotic unbiasedness, asymptotic normality and the approximation property of residual bootstrap. Similar asymptotic properties of modified Least Squares and Ridge after stability selection are obtained in Section 4. Simulation examples are given in Section 5. We conclude in Section 6. The proofs can be found in the Appendix.

Asymptotic Properties of Modified Least Squares and Ridge after Model Selection
In this section, we begin with a precise definition of the modified Least Squares or Ridge after model selection, and then study their asymptotic properties, including asymptotic unbiasedness, asymptotic normality and the validity of residual bootstrap.

Definitions and Assumptions
Modified Least Squares (or Ridge) after model selection refers to a special type of two-stage estimator. In the first stage, one uses a model selection method to select predictors. For example, let $\hat\beta$ be the Lasso estimator defined in (1.2); one obtains the set of selected predictors $\hat S = \{j \in \{1, 2, \ldots, p\} : \hat\beta_j \ne 0\}$. Again, $\hat\beta$ and $\hat S$ depend on $\lambda_n$, but we omit the dependence whenever this does not cause confusion. In the second stage, a low-dimensional estimation method is applied to the predictors in $\hat S$. For example, one can adopt OLS and form the OLS after model selection (denoted by Select+OLS):
\[
\hat\beta_{\text{Select+OLS},\hat S} = \operatorname*{argmin}_{\beta_{\hat S}} \|Y - X_{\hat S}\beta_{\hat S}\|_2^2, \qquad \hat\beta_{\text{Select+OLS},\hat S^c} = 0. \quad (2.1)
\]
The solution of (2.1) is $\hat\beta_{\text{Select+OLS},\hat S} = (X_{\hat S}^T X_{\hat S})^{-1} X_{\hat S}^T Y$ if $X_{\hat S}^T X_{\hat S}$ is invertible. When $X_{\hat S}^T X_{\hat S}$ is not invertible, the solution of (2.1) is not unique. In this case one can use the generalized inverse. However, the generalized inverse is not stable when the smallest nonzero eigenvalue of $X_{\hat S}^T X_{\hat S}$ is approximately 0, which may result in poor performance. We therefore propose a modified Least Squares method in the second stage and form our Select+mLS estimator. Let $d = |\hat S|$ and write $\frac{1}{\sqrt n} X_{\hat S}$ in its singular value decomposition (SVD) form
\[
\frac{1}{\sqrt n} X_{\hat S} = U D V^T,
\]
where $U$ is an $n \times n$ orthogonal matrix, $D$ is an $n \times d$ diagonal matrix with singular values $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ on the diagonal, and $V^T$ (the transpose of $V$) is a $d \times d$ orthogonal matrix. By simple algebraic operations, OLS based on $(X_{\hat S}, Y)$ has the following form:
\[
\hat\beta_{\mathrm{OLS}} = \frac{1}{\sqrt n} \sum_{k=1}^{d} \lambda_k^{-1} V_k U_k^T Y, \quad (2.3)
\]
where $U_k$ and $V_k$ denote the $k$-th columns of $U$ and $V$. If one or more of the singular values are 0, one can use the generalized inverse, which simply takes $\lambda_k^{-1} = 0$ for all zero-valued $\lambda_k$. We instead propose a hard thresholding on the singular values, that is, shrinking those singular values smaller than $\tau_n$ ($\tau_n > 0$) to zero. We then define the modified Least Squares estimator $\hat\beta_{\mathrm{mLS}}(\tau_n)$ in the same form as (2.3) except that we take $\lambda_k^{-1} = 0$ for all $\lambda_k < \tau_n$. This estimator is similar to principal components regression [29]. Note that if $0 \le \tau_n^2 \le \Lambda_{\min}(\frac{1}{n} X_{\hat S}^T X_{\hat S})$, then $\hat\beta_{\mathrm{mLS}}(\tau_n)$ coincides with $\hat\beta_{\mathrm{OLS}}$. Our final modified Least Squares after model selection (Select+mLS) is defined by
\[
\hat\beta_{\text{Select+mLS},\hat S}(\tau_n) = \hat\beta_{\mathrm{mLS}}(\tau_n), \qquad \hat\beta_{\text{Select+mLS},\hat S^c}(\tau_n) = 0. \quad (2.4)
\]
One can also use Ridge in the second stage and form the Ridge after model selection (Select+Ridge):
\[
\tilde\beta_{\text{Select+Ridge},\hat S}(\mu_n) = \operatorname*{argmin}_{\beta_{\hat S}} \|Y - X_{\hat S}\beta_{\hat S}\|_2^2 + \mu_n \|\beta_{\hat S}\|_2^2, \quad (2.5)
\]
where $\mu_n \ge 0$ is a smoothing parameter. (2.5) is equivalent to
\[
\tilde\beta_{\text{Select+Ridge},\hat S}(\mu_n) = (X_{\hat S}^T X_{\hat S} + \mu_n I)^{-1} X_{\hat S}^T Y, \qquad \tilde\beta_{\text{Select+Ridge},\hat S^c}(\mu_n) = 0. \quad (2.6)
\]
When the Lasso is used in the selection stage, we refer to our final estimators as Lasso+mLS and Lasso+Ridge respectively. $\tau_n$ and $\mu_n$ are tuning parameters.
In our theorems and simulations, $\tau_n \propto \frac{1}{n}$ and $\mu_n \propto \frac{1}{n}$ give good estimation and prediction performance. For notational simplicity, we omit the dependence of the estimators on $\lambda_n$ and on $\tau_n$ or $\mu_n$ whenever this does not cause confusion.
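As a concrete illustration of the mLS step just defined, the following numpy sketch (variable names are ours) forms the SVD of $\frac{1}{\sqrt n} X_{\hat S}$ and zeroes out the reciprocals of all singular values below the threshold $\tau_n$ before assembling the least-squares solution:

```python
import numpy as np

def mls_refit(X_sel, y, tau):
    """Modified Least Squares: hard-threshold the singular values of
    X_sel / sqrt(n) at tau before inverting (Select+mLS second stage)."""
    n = X_sel.shape[0]
    # SVD of (1/sqrt(n)) X_sel = U D V^T
    U, d, Vt = np.linalg.svd(X_sel / np.sqrt(n), full_matrices=False)
    # Reciprocal singular values; those below tau are set to zero (hard threshold).
    d_inv = np.where(d >= tau, 1.0 / np.where(d == 0.0, 1.0, d), 0.0)
    # beta_mLS = (1/sqrt(n)) * V diag(d_inv) U^T y, i.e. (2.3) with thresholding
    return (Vt.T * d_inv) @ (U.T @ y) / np.sqrt(n)
```

With $\tau_n = 1/n$ and a well-conditioned selected design, the threshold is typically inactive and the output coincides with the OLS refit on $X_{\hat S}$.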
The Gaussian assumption (a) is fairly standard in the literature. Assumption (c) ensures that the smallest eigenvalue of $C_{11}$ is bounded away from 0 so that its inverse behaves well. Assumptions (a)-(c) are typical in the sparse linear regression literature; see for example [25,26,47]. Assumption (d) means that the number of relevant predictors $s$ is allowed to diverge, but much more slowly than $n$, and that the number of predictors $p$ can grow faster than $n$ (up to exponentially fast), which is standard in almost all high-dimensional inference literature. Though this assumption is stronger than the typical one, $\frac{s \log p}{n} \to 0$, it has been used previously [47].
In the following subsections, we show that both Select+mLS and Select+Ridge have good asymptotic properties if the probability of selecting a wrong model, $P(\hat S \ne S)$, decays fast, say as $o(e^{-n^{\kappa}})$ for some constant $\kappa > 0$.

Bias and MSE of Select+mLS and Select+Ridge
In a high-dimensional setting, bias is not the only consideration when evaluating an estimator because of the bias-variance trade-off. Regularization has been a popular technique for model fitting: it results in a biased estimator but can decrease the mean squared error (MSE) dramatically. However, if two estimators have the same MSE, we prefer the unbiased one. The following Theorems 1 and 2 provide general bounds for the bias and MSE of Select+mLS and Select+Ridge.
Theorem 1. Suppose that the Gaussian assumption (a) is satisfied and $\tau_n^2 \le \Lambda_{\min}(C_{11})$. Then the bias and the MSE of $\hat\beta_{\text{Select+mLS}}(\tau_n)$ satisfy the bounds (2.11) and (2.12), respectively.

The first term on the right-hand side of (2.12) corresponds to the oracle convergence rate. The second term is related to model selection accuracy. In the case of the Lasso, $P(\hat S \ne S)$ can decay at an exponential rate, and hence the MSE is completely determined by the first term $\frac{\sigma^2}{n}\mathrm{tr}(C_{11}^{-1})$, which cannot be improved.

Remark 2.2. From Theorem 1, one can easily obtain an upper bound for the prediction mean squared error $E\{\frac{1}{n}\|X\hat\beta - X\beta^*\|_2^2\}$.

Similarly, for $\tilde\beta_{\text{Select+Ridge}}(\mu_n)$, we have:

Theorem 2. Suppose that the Gaussian assumption (a) is satisfied. Then the bias and the MSE of $\tilde\beta_{\text{Select+Ridge}}(\mu_n)$ satisfy the bounds (2.13) and (2.14), respectively.
Theorems 1 and 2 indicate that as long as the probability of selecting a wrong model, $P(\hat S \ne S)$, decays fast, say at an exponential rate, Select+mLS and Select+Ridge are asymptotically unbiased and their MSEs decay at the oracle rate. It is known that, under the Irrepresentable condition and other regularity conditions, the probability of the Lasso selecting a wrong model satisfies $P(\hat S \ne S) = o(e^{-n^{c_2}})$ [47]. Applying the above theorems to Lasso+mLS and Lasso+Ridge as special cases, we can then easily derive the convergence rates of their bias and MSE; see Section 3 for more details.
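To see where the oracle term comes from, consider the oracle OLS estimator that uses the true support $S$ (a standard computation, included here only for intuition; we write $C_{11} = \frac{1}{n} X_S^T X_S$). Since $\hat\beta_{\mathrm{OLS},S} - \beta^*_S = (X_S^T X_S)^{-1} X_S^T \epsilon$ has covariance $\sigma^2 (X_S^T X_S)^{-1} = \frac{\sigma^2}{n} C_{11}^{-1}$,
\[
E\|\hat\beta_{\mathrm{OLS},S} - \beta^*_S\|_2^2
  = \frac{\sigma^2}{n}\,\mathrm{tr}\big(C_{11}^{-1}\big)
  \le \frac{\sigma^2\, s}{n\,\Lambda_{\min}(C_{11})} = O\Big(\frac{s}{n}\Big),
\]
which matches the first term in the MSE bound and explains the $s/n$ rate.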

Asymptotic Normality of Select+mLS and Select+Ridge
In this section, we show the asymptotic normality of Select+mLS and Select+Ridge. Let $\hat\Psi$ and $\Psi$ be the distribution functions of $\sqrt n(\hat\beta_S - \beta^*_S)$ and $N(0, \sigma^2 C_{11}^{-1})$ respectively, where $\hat\beta$ can be either of the two post-model-selection estimators, Select+mLS or Select+Ridge, and $\hat S$ is the selected predictor set.

Theorem 3. Suppose that assumptions (a)-(e) are satisfied and that the model selection procedure is consistent, i.e., $P(\hat S \ne S) = o(1)$. Then Select+mLS and Select+Ridge are asymptotically normal, that is, $\hat\Psi$ converges weakly to $\Psi$.

This theorem states that model selection consistency in the first stage implies the asymptotic normality of the second-stage estimators Select+mLS and Select+Ridge. The proof of this theorem can be found in Appendix C.

Residual Bootstrap after Select+mLS and Select+Ridge
To make reliable scientific discoveries, we need valid inference procedures, including confidence regions and tests for the parameters. Although we have derived the asymptotic normality of Select+mLS and Select+Ridge, it is difficult to use in practice because the noise variance $\sigma^2$ is unknown and hard to estimate in a high-dimensional setting. The bootstrap is a popular alternative in this case. A summary of bootstrap methods in linear and generalized linear penalized regression models for fixed p can be found in [39]. We consider the $p \gg n$ case by proposing a new inference procedure: the residual bootstrap after two-stage estimators. Our method allows p to grow at an exponential rate in n.
In the context of a regression model, the residual bootstrap is a standard method when the design matrix $X$ is deterministic [14,19,27]. Let $\hat\beta$ denote Select+mLS or Select+Ridge; the residual vector is given by
\[
\hat\epsilon = Y - X\hat\beta. \quad (2.16)
\]
Consider the residuals centered at their mean, $\{\hat\epsilon_i - \hat\mu, i = 1, \ldots, n\}$, where $\hat\mu = \frac{1}{n}\sum_{i=1}^n \hat\epsilon_i$. For the residual bootstrap, one obtains $\epsilon^* = (\epsilon^*_1, \ldots, \epsilon^*_n)^T$ by resampling with replacement from the centered residuals $\{\hat\epsilon_i - \hat\mu, i = 1, \ldots, n\}$, and forms $Y^*$ as
\[
Y^* = X\hat\beta + \epsilon^*.
\]
Then one can define the selected predictor set $\hat S^*$ and the Select+mLS or Select+Ridge estimator $\hat\beta^*$ based on the bootstrap sample $(X, Y^*)$. For the bootstrap to be valid, one needs to verify that the conditional distribution of $T^*_n = \sqrt n(\hat\beta^* - \hat\beta)$ given $\epsilon$, which can be computed directly from the data, approximates the distribution of $T_n = \sqrt n(\hat\beta - \beta^*)$. The difference between the two distributions can be characterized by the Mallows metric.

Definition 1. The Mallows metric $d$, relative to the Euclidean norm $\|\cdot\|$, of two distributions $F$ and $\tilde F$ is the infimum of $(E\|Z - W\|_2^2)^{\frac{1}{2}}$ over all pairs of random vectors $Z$ and $W$, where $Z$ has distribution $F$ and $W$ has distribution $\tilde F$. That is,
\[
d(F, \tilde F) = \inf\Big\{ \big(E\|Z - W\|_2^2\big)^{\frac{1}{2}} : Z \sim F,\ W \sim \tilde F \Big\}.
\]
By Lemma 8.1 in [4], the infimum can be attained. To proceed, denote by $G_n$ and $G^*_n$ the distribution of $T_n$ and the conditional distribution of $T^*_n$ respectively. Let $P^*$ denote the conditional probability given the error variables $\{\epsilon_i, i = 1, \ldots, n\}$.
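The resampling scheme just described can be sketched in a few lines (Python; `fit_fn` stands for any two-stage fit, such as the hypothetical `lasso_plus_refit` sketched in the Introduction wrapped to return only the coefficient vector, and the number of replicates B is our choice):

```python
import numpy as np

def residual_bootstrap(X, y, fit_fn, B=500, seed=0):
    """Residual bootstrap after a two-stage estimator.
    fit_fn(X, y) must return the estimated coefficient vector, e.g. Lasso+mLS."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    beta_hat = fit_fn(X, y)
    resid = y - X @ beta_hat                  # hat-epsilon, cf. (2.16)
    resid_centered = resid - resid.mean()     # center the residuals at their mean
    boot = np.empty((B, X.shape[1]))
    for b in range(B):
        eps_star = rng.choice(resid_centered, size=n, replace=True)
        y_star = X @ beta_hat + eps_star      # bootstrap response Y*
        boot[b] = fit_fn(X, y_star)           # redo selection + estimation
    # Rows of sqrt(n) * (boot - beta_hat) approximate the law of T_n.
    return beta_hat, boot
```

Each bootstrap replication repeats both the selection and the estimation stage on $(X, Y^*)$, which is the scheme analyzed in Theorem 4 below.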
To show the validity of the residual bootstrap after Select+mLS or Select+Ridge, we need further conditions. Assumption (dd): suppose that $s^2/n \to 0$ as $n \to \infty$. Assumption (f): suppose that both the probability of selecting a wrong model based on the original data $(X, Y)$ and that based on the resample $(X, Y^*)$ decay at an exponential rate, i.e., $P(\hat S \ne S) = o(e^{-n^{c_2}})$ and $P^*(\hat S^* \ne \hat S) = o_p(e^{-n^{c_2}})$.
Assumption (dd) is stronger than assumption (d) because it requires that $s$ grow more slowly than $\sqrt n$. Without considering model selection, [5] showed that the residual bootstrap after OLS fails if $p^2/n$ does not tend to 0; therefore, assumption (dd) cannot be weakened. As shown in the next section, assumption (f) is satisfied if the Irrepresentable condition and some regularity conditions hold. Assumption (g) is a technical assumption which makes the convergence rates in Theorems 1 and 2 more explicit. Since $\beta^*$ has only $s \ll n$ nonzero components, this assumption is not very restrictive; in particular, it is satisfied when the maximum of the $|\beta^*_j|$ is bounded above by a constant. Assumption (h) is not very restrictive either, because the number of terms in the sum it involves is $s \ll n$.

Theorem 4 states that the residual bootstrap after Select+mLS or Select+Ridge gives a valid approximation to the distribution of $T_n$ if the probabilities of selecting a wrong model, $P(\hat S \ne S)$ and $P^*(\hat S^* \ne \hat S)$, decay at an exponential rate and the number of true predictors $s$ grows more slowly than $\sqrt n$. The proof of this theorem can be found in Appendix C.
In the next section, we will apply the above Theorems 1, 2, 3 and 4 to two special cases: Lasso+mLS and Lasso+Ridge.

Asymptotic Properties of Lasso+mLS and Lasso+Ridge
As shown in Section 2, Select+mLS and Select+Ridge have attractive asymptotic properties (see Theorems 1, 2, 3 and 4) provided the probability of selecting a wrong model decays exponentially fast. Applying these theorems to Lasso+mLS and Lasso+Ridge, we can easily obtain their convergence rates of bias and MSE, their asymptotic normality and the validity of the residual bootstrap. To proceed, we first give a brief overview of the assumptions used by [47] to establish model selection consistency of the Lasso. Let $\mathrm{sign}(\cdot)$ map positive entries to 1, negative entries to $-1$ and zero to zero.

Definition 2 (Irrepresentable condition [21,30,41,44,47]). There exists a positive constant vector $\eta$ such that
\[
\big| C_{21} C_{11}^{-1}\, \mathrm{sign}(\beta^*_S) \big| \le \mathbf{1} - \eta, \quad (3.1)
\]
where $C_{11} = \frac{1}{n} X_S^T X_S$, $C_{21} = \frac{1}{n} X_{S^c}^T X_S$, $\mathbf{1}$ is a $(p-s) \times 1$ vector with entries 1, and the inequality holds element-wise.
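Because the Irrepresentable condition involves only the design and the signs of $\beta^*$ on its support, it can be checked numerically for a given $X$; a small helper of our own (not from the paper) is sketched below:

```python
import numpy as np

def irrepresentable_margin(X, support, sign_beta_S):
    """Return 1 - max_j |C_21 C_11^{-1} sign(beta*_S)|_j; the (strong)
    Irrepresentable condition (3.1) holds iff this margin is positive."""
    n, p = X.shape
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(p), S)
    C11 = X[:, S].T @ X[:, S] / n
    C21 = X[:, Sc].T @ X[:, S] / n
    v = C21 @ np.linalg.solve(C11, np.asarray(sign_beta_S, dtype=float))
    return 1.0 - np.max(np.abs(v))
```

This is how one might verify, as done in the simulation section, whether a generated design satisfies the condition for a given support and sign pattern.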
Assumption (i): there exist constants $c_1 + c_2 < c_3 \le 1$ and $M > 0$ such that $\min_{1 \le i \le s} |\beta^*_i| \ge M n^{-\frac{1-c_3}{2}}$.

Assumption (j): suppose that the tuning parameter $\lambda_n$ in the definition of the Lasso satisfies $\lambda_n \propto n^{\frac{1+c_4}{2}}$ with $c_2 < c_4 < c_3 - c_1$.
The Irrepresentable condition is a key assumption on the design that can be weakened, e.g., by using stability selection instead of the Lasso. [32] showed that stability selection can achieve the same convergence rate for the probability of selecting a wrong model as the Lasso does, but under a less restrictive sparse eigenvalue condition. We give a brief overview of stability selection combined with the randomized Lasso in sparse linear regression in Appendix B, and discuss the asymptotic properties of two-stage estimators based on stability selection in the next section. Assumption (i) requires a gap of size $n^{\frac{c_3}{2}}$ between the decay rate of $\beta^*$ and $n^{-\frac{1}{2}}$, thus preventing the estimation from being dominated by the noise terms. It is weaker than $\min_{1\le i\le s}|\beta^*_i| \ge M$, which was assumed by [25] in studying the asymptotic properties of the Bridge estimator. [18] also proposed a similar constraint, $\min_{1\le i\le s}|\beta^*_i| \ge M n^{-\kappa}$ with $0 \le \kappa < \frac{1}{2}$, to show model selection consistency of Sure Independence Screening.

Now we are ready to state our main results. Let $\hat\beta$ and $\hat\beta^*$ denote the Lasso+mLS (or Lasso+Ridge) estimator based on the original data $(X, Y)$ and on the resample $(X, Y^*)$ respectively; then we can define $\hat\Psi$, $\Psi$, $T_n$, $T^*_n$, $G_n$ and $G^*_n$ as in Section 2. First, combining Theorem 1 and the model selection property of the Lasso (see Lemma 2 in Appendix A), we can obtain the following corollary.

Proof. We only consider Lasso+mLS since the proof for Lasso+Ridge is similar. By Lemma 2 in Appendix A, we have
\[
P(\hat S \ne S) \le P\big(\mathrm{sign}(\hat\beta(\lambda_n)) \ne \mathrm{sign}(\beta^*)\big) = o(e^{-n^{c_2}}).
\]
From assumption (c) (see (2.9)), we know that $\Lambda_{\min}(C_{11}) \ge \Lambda_{\min} > 0$, and since $C_{11}$ is an $s \times s$ matrix, we have $\mathrm{tr}(C_{11}^{-1}) \le s/\Lambda_{\min}$. Under condition (g) (equation (2.19)), the corollary is then obtained directly from Theorem 1.
Corollary 1 indicates that Lasso+mLS and Lasso+Ridge are asymptotically unbiased. In particular, their biases decay at an exponential rate and their MSEs achieve the oracle convergence rate of $s/n$, which is much faster than that of the Lasso. (Under the restricted eigenvalue condition, the Lasso achieves the convergence rate $\frac{s \log p}{n}$ for MSE; see for example [38,45].)

Secondly, combining Theorem 3 and Lemma 2, we can easily derive the asymptotic normality of Lasso+mLS and Lasso+Ridge (Corollary 2).

Remark 3.1. For fixed $p$ and fixed $\beta^*$, [27] showed that for $\lambda_n = o(\sqrt n)$, the Lasso estimator is also asymptotically normal, $N(0, \sigma^2 C^{-1})$, under the conditions $\frac{1}{n} X^T X \to C$ and $\frac{1}{n}\max_{1\le i\le n} x_i^T x_i \to 0$. The asymptotic covariance matrix of the rescaled and centered Lasso estimator $\sqrt n(\hat\beta_S - \beta^*_S)$ is then $\sigma^2$ multiplied by the corresponding block of $C^{-1}$, which dominates $C_{11}^{-1}$ in the ordering $A \succeq B$, meaning that $A - B$ is positive semidefinite. Therefore, Lasso+mLS and Lasso+Ridge have a smaller asymptotic covariance matrix (namely $\sigma^2 C_{11}^{-1}$) compared with the Lasso and hence reduce estimation uncertainty. Moreover, as pointed out by [17], one cannot find a $\lambda_n$ such that the Lasso estimator is simultaneously model selection consistent and asymptotically normal ($\sqrt n$-consistent). In this sense, Lasso+mLS and Lasso+Ridge improve the performance of the Lasso.

Lastly, we verify the validity of the residual bootstrap after Lasso+mLS and Lasso+Ridge. We begin by showing an interesting result: the Lasso estimator based on the resample $(X, Y^*)$,
\[
\hat\beta^{*} = \operatorname*{argmin}_{\beta} \|Y^* - X\beta\|_2^2 + \lambda_n \|\beta\|_1,
\]
also has model selection consistency.

Corollary 3. Suppose that conditions (a)-(e), (g)-(j), (dd) and the Irrepresentable condition (3.1) are satisfied. Then the residual bootstrap after Lasso+mLS or Lasso+Ridge is consistent in the sense that the Mallows distance $d(G^*_n, G_n)$ converges to 0 in probability.

Remark 3.2. In practice, if the Irrepresentable condition does not hold, one needs to try other model selection methods, e.g., Bolasso [1] or stability selection. As stated before, as long as the probabilities of selecting a wrong model, $P(\hat S \ne S)$ and $P^*(\hat S^* \ne \hat S)$, decay at the exponential rate $o(e^{-n^{c_2}})$, the residual bootstrap after the two-stage estimator gives a valid approximation. Corollary 3 indicates that the residual bootstrap after Lasso+mLS or Lasso+Ridge gives a valid approximation to the distribution of $T_n$ and hence can be used to construct confidence intervals and tests for the parameters. The proof is straightforward and we omit it.

Asymptotic Properties of Modified Least Squares and Ridge after Stability Selection
If the Irrepresentable condition is violated, the Lasso cannot correctly select the true model. In this case, one needs to apply other model selection criteria instead of the Lasso. Stability selection is one popular method among many others. [32] showed that it can achieve the same convergence rate for the probability of selecting a wrong model, but under a less restrictive sparse eigenvalue condition. More details are given in Appendix B. Let $\hat\beta$ and $\hat\beta^*$ denote the modified Least Squares (or Ridge) after stability selection (SS+mLS or SS+Ridge) based on the original data $(X, Y)$ and on the resample $(X, Y^*)$ respectively; then we can define $\hat\Psi$, $\Psi$, $T_n$, $T^*_n$, $G_n$ and $G^*_n$ as in Section 2. Applying the asymptotic properties of Select+mLS and Select+Ridge from Section 2, we can state, without additional proofs, corollaries parallel to Corollaries 1, 2 and 3.

Simulation
In this section we carry out simulation studies to evaluate the finite sample performance of Lasso+mLS and Lasso+Ridge. We have also run simulations for SS+mLS and SS+Ridge (modified Least Squares or Ridge after stability selection). Though in theory stability selection achieves the same convergence rate for the probability of selecting a wrong model under weaker conditions than the Lasso, its finite sample performance is similar to that of the Lasso unless the signal-to-noise ratio is very high. In our simulations, Lasso+mLS and Lasso+Ridge work well and perform similarly to SS+mLS and SS+Ridge regardless of whether the Irrepresentable condition holds. Therefore we only present the results for Lasso+mLS and Lasso+Ridge.

Comparison of Bias$^2$, MSE and PMSE
This subsection compares the bias$^2$ ($\|E\hat\beta - \beta^*\|_2^2$), MSE and prediction mean squared error (PMSE) of Lasso+mLS and Lasso+Ridge with those of the Lasso. We use the R package "glmnet" to compute the Lasso solution. Throughout the simulation, we fix $\tau_n = 1/n$ and $\mu_n = 1/n$, so the only tuning parameter for all three methods is $\lambda_n$, which is chosen by 5-fold cross-validation.
Our simulated data are drawn from the linear model $y = x^T \beta^* + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$. We set $p = 500$, $s = 10$ and $\sigma = 1$. The predictor vector $x$ is generated from a multivariate normal distribution $N(0, \Sigma)$; the design is generated once and then kept fixed. We consider two different Toeplitz covariance matrices $\Sigma$, which control the correlation among the predictors: (1) $\Sigma = I$ and (2) a Toeplitz matrix with nonzero correlation between predictors. Table 1 summarizes the eight example settings.
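A sketch of how such a fixed design could be generated (the AR(1)-type Toeplitz form $\Sigma_{jk} = \rho^{|j-k|}$ and the value of $\rho$ are our placeholders, since the exact correlated setting is summarized in Table 1 rather than reproduced in the text above):

```python
import numpy as np
from scipy.linalg import toeplitz

def make_design(n, p=500, rho=0.0, seed=0):
    """Fixed Gaussian design with Toeplitz covariance Sigma_jk = rho**|j-k|.
    rho = 0 gives Sigma = I (setting (1)); a positive rho is a placeholder
    for the correlated setting (2)."""
    rng = np.random.default_rng(seed)
    Sigma = toeplitz(rho ** np.arange(p))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X  # generated once and then kept fixed across the 100 replications
```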
After $X$ was generated, we examined the Irrepresentable condition and found that it holds in examples 1-6 and is violated in examples 7-8. To evaluate prediction performance, we generate an independent test set of size 500 and compute the PMSE. Summary statistics are calculated from 100 replications (keeping $X$ fixed) and are shown in Table 2 and Figure 1. We see that Lasso+mLS and Lasso+Ridge perform almost identically. They not only dramatically decrease the bias$^2$ of the Lasso by more than 90% but also reduce the variance (in fact, their variances are 20%-55% smaller than that of the Lasso in all examples except example 7, where their variances are 45% larger); as a result, they improve the MSE and PMSE by 40%-80% and 5%-25% respectively. These gains occur regardless of whether the Irrepresentable condition holds, which indicates that Lasso+mLS and Lasso+Ridge dominate the Lasso in terms of estimation.

Confidence Intervals and Coverage Probabilities
In this subsection, we study the finite sample performance of residual bootstrap Lasso+mLS and Lasso+Ridge using the examples from Table 1. Since Lasso+mLS and Lasso+Ridge behave similarly, we only show the results of Lasso+mLS for the sake of brevity.
We found that basic confidence intervals based on residual bootstrap Lasso provide more accurate coverage probabilities than percentile confidence intervals while having the same length (the basic 90% confidence intervals achieve coverage probabilities larger than 80%, while the percentile confidence intervals are so biased that their coverage probabilities can be lower than 20% for the nonzero-valued parameters; see Figure 2). The distribution of residual bootstrap Lasso is far from being centered at the true value (Figure 3), which makes the percentile confidence intervals fail. A similar phenomenon occurs for the paired bootstrap Lasso method. Therefore, we suggest using basic confidence intervals in practice with high-dimensional data. In what follows, our confidence intervals are all basic.

Figure 4 shows the coverage probabilities of 90% confidence intervals based on residual bootstrap Lasso+mLS and residual bootstrap Lasso. In this figure, only 20 zero-valued parameters are presented for the sake of brevity; the coverage probabilities for the remaining $p - s - 20$ zero-valued parameters are similar to the 20 presented and are therefore omitted. In addition, we average the coverage probabilities and interval lengths over the nonzero-valued parameters and the zero-valued parameters separately (see Tables 3 and 4). From Figure 4 and Tables 3 and 4, we can see that residual bootstrap Lasso+mLS gives accurate coverage probabilities (approximately 88% for nonzero $\beta^*_j$ and 1 for zero $\beta^*_j$) when the Irrepresentable condition holds (examples 1-6), which verifies Corollary 3. Note that, for zero-valued parameters, residual bootstrap Lasso+mLS produces confidence intervals with coverage probabilities close to 1, whereas residual bootstrap Lasso cannot provide accurate coverage probabilities unless $n$ is large enough (its coverage probability for nonzero $\beta^*_j$ is around 75% for $n = 200$ and around 85% for $n = 400$). Even when the Irrepresentable condition does not hold (examples 7-8), residual bootstrap Lasso+mLS still provides reasonable coverage probabilities (83.5% and 90.3% for nonzero $\beta^*_j$, and 99.1% and 99.7% for zero $\beta^*_j$) while residual bootstrap Lasso does not (71.7% and 82.1% for nonzero $\beta^*_j$, and 93.1% and 92.6% for zero $\beta^*_j$). Compared with residual bootstrap Lasso, the coverage probability of residual bootstrap Lasso+mLS is on average about 7% closer to the nominal level (90%) for nonzero $\beta^*_j$ and 5% closer to 1 for zero $\beta^*_j$. Even though residual bootstrap Lasso has shorter (17% on average) interval lengths for the nonzero $\beta^*_j$, it loses accuracy in coverage. Overall, residual bootstrap Lasso+mLS is better than residual bootstrap Lasso. Moreover, as $n$ increases, the performance of both methods improves.

In practice, many people prefer to perform paired bootstrap Lasso (resampling the pairs $(x_i, y_i)$, $i = 1, \ldots, n$, instead of the residuals) even when it makes sense to treat the design matrix as fixed. Therefore, we also compare residual bootstrap Lasso+mLS with paired bootstrap Lasso. Figure 6 shows the coverage probabilities versus average interval lengths based on residual bootstrap Lasso+mLS and paired bootstrap Lasso for the different examples. Again, we average the coverage probabilities and interval lengths over the nonzero-valued parameters and the zero-valued parameters separately and report them in Tables 3 and 4.
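The basic and percentile intervals compared above differ only in whether the bootstrap quantiles are reflected around the point estimate; a short sketch (assuming `boot` is the $B \times p$ matrix of bootstrap coefficient estimates and `beta_hat` the original two-stage estimate, as in the earlier bootstrap sketch):

```python
import numpy as np

def bootstrap_cis(beta_hat, boot, level=0.90):
    """Return (basic, percentile) confidence intervals for each coefficient."""
    alpha = 1.0 - level
    lo = np.quantile(boot, alpha / 2.0, axis=0)
    hi = np.quantile(boot, 1.0 - alpha / 2.0, axis=0)
    percentile = np.stack([lo, hi], axis=1)
    # Basic intervals reflect the bootstrap quantiles around the point estimate:
    # [2*beta_hat - hi, 2*beta_hat - lo].
    basic = np.stack([2.0 * beta_hat - hi, 2.0 * beta_hat - lo], axis=1)
    return basic, percentile
```

Reflecting the quantiles corrects for the bias of the bootstrap distribution, which is why the basic intervals behave better here when the bootstrap Lasso distribution is not centered at the true value.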
We see that residual bootstrap Lasso+mLS provides more accurate coverage probabilities (0.5% closer to 1 for zero $\beta^*_j$ and on average 14% closer to the nominal level for nonzero $\beta^*_j$) with more than 90% shorter (for zero $\beta^*_j$) or at least comparable (for nonzero $\beta^*_j$) interval lengths compared with paired bootstrap Lasso. Based on our simulations, we conclude that residual bootstrap Lasso+mLS and residual bootstrap Lasso+Ridge are better choices for constructing confidence intervals.

(Figure 6 caption: Coverage probabilities versus average interval lengths for 90% confidence intervals based on two methods: paired bootstrap Lasso (PBL, black circles) and residual bootstrap Lasso+mLS (RBLmLS, red triangles).)

Conclusion
We have derived, for the first time, the asymptotic properties of Lasso+mLS and Lasso+Ridge in sparse high-dimensional linear regression models where $p \gg n$. Under the Irrepresentable condition and other common conditions on the scaling $(n, s, p)$, we showed that both Lasso+mLS and Lasso+Ridge are asymptotically unbiased and achieve the oracle convergence rate of $s/n$ for MSE, which improves on the Lasso. In addition, the Lasso+mLS and Lasso+Ridge estimators have an oracle property in the sense that they select the true predictors with probability converging to 1, and the estimates of the nonzero parameters have the same asymptotic normal distribution that they would have if the zero parameters were known in advance.
We then proposed the residual bootstrap after Lasso+mLS and Lasso+Ridge and showed that it gives valid approximations to the distributions of Lasso+mLS and Lasso+Ridge, respectively, provided that the probability of the Lasso selecting a wrong model decays at an exponential rate and the number of true predictors $s$ goes to infinity more slowly than $\sqrt n$. In fact, our analysis is not limited to adopting the Lasso in the selection stage, but is applicable to any other model selection criterion whose probability of selecting a wrong model decays at an exponential rate, for example stability selection, SCAD and the Dantzig selector. Lastly, we presented simulation results assessing the finite sample performance of Lasso+mLS and Lasso+Ridge and observed that they not only dramatically decrease the bias$^2$ of the Lasso by more than 90% but also reduce the MSE and PMSE by 40%-80% and 5%-25% respectively. Further, we constructed 90% confidence intervals based on our residual bootstrap Lasso+mLS (or Lasso+Ridge) and examined the coverage accuracy. We found that our method resulted in coverage probabilities of approximately 88% for nonzero $\beta^*_j$ and 1 for zero $\beta^*_j$, which is much more accurate (approximately 7% closer to the desired level) than the residual bootstrap Lasso method.

Appendix B: Overview of Stability Selection with Randomized Lasso

Definition 4 (Meinshausen and Buhlmann (2010)). For a cut-off $\pi_{\mathrm{thr}}$ with $0 < \pi_{\mathrm{thr}} < 1$ and a set of regularization parameters $\Lambda$, the set of stable variables is defined as $\hat S^{\mathrm{stable}} = \{j : \max_{\lambda \in \Lambda} \hat\Pi^{\lambda}_j \ge \pi_{\mathrm{thr}}\}$, where $\hat\Pi^{\lambda}_j$ denotes the selection probability of variable $j$ under subsampling.

For stability selection with the randomized Lasso, one can obtain selection consistency by assuming only sparse eigenvalues, a condition that is much weaker than Irrepresentability. The sparse eigenvalue condition essentially requires that the minimal and maximal eigenvalues, for selections of order $s$ predictors, are bounded away from 0 and $\infty$ respectively.

Definition 5 (Meinshausen and Buhlmann (2010)). For any $K \subseteq \{1, \ldots, p\}$, let $X_K$ be the restriction of $X$ to the columns in $K$. The minimal sparse eigenvalue $\phi_{\min}$ is defined for $k \le p$ as
\[
\phi_{\min}(k) = \inf_{a \in \mathbb{R}^{\lceil k \rceil},\ K \subseteq \{1, \ldots, p\} : |K| \le \lceil k \rceil} \frac{\|X_K a\|_2^2}{\|a\|_2^2},
\]
and analogously for the maximal sparse eigenvalue $\phi_{\max}$.
Applying the same trick as (C.3), we obtain the result.
Proof of Theorem 3. We prove the results for Select+mLS and Select+Ridge respectively.
In the following, we prove the validity of the residual bootstrap after Select+mLS. The proof for the residual bootstrap after Select+Ridge is omitted since the techniques are almost the same.
The last equality above holds since we suppose that $\tau_n \propto \frac{1}{n}$ and that $\frac{1}{n}\|X\beta^*\|_2^2 = O(n)$.
Finally, we can prove Theorem 4 now.
Proof of Theorem 4. Using the same notation as [4], let $F$ be the true distribution of $\epsilon_i$; let $F_n$ be the empirical distribution of $\epsilon_1, \ldots, \epsilon_n$; let $\hat F_n$ be the empirical distribution of the residuals $\hat\epsilon_1, \ldots, \hat\epsilon_n$; and let $\tilde F_n$ be $\hat F_n$ centered at its mean $\hat\mu$. We first show that the Mallows metric of $G_n$ and $G^*_n$ can be bounded by $\sqrt s \cdot d(F, \tilde F_n)$.

Proof.
By assumption (f), there exists a set $A_n$ such that $P(A_n) \to 1$ and, for every $\omega \in A_n$, $P^*(\hat S^* \ne \hat S) = o(e^{-n^{c_2}})$.