Debiasing the Debiased Lasso with Bootstrap

In this paper, we prove that under proper conditions, bootstrap can further debias the debiased Lasso estimator for statistical inference of low-dimensional parameters in high-dimensional linear regression. We prove that the required sample size for inference with bootstrapped debiased Lasso, which involves the number of small coefficients, can be of smaller order than the existing ones for the debiased Lasso. Therefore, our results reveal the benefits of having strong signals. Our theory is supported by results of simulation experiments, which compare coverage probabilities and lengths of confidence intervals with and without bootstrap, with and without debiasing.


Introduction
High-dimensional linear regression is a highly active area of research in statistics and machine learning. When the dimension p of the model is larger than the sample size n, regularized least squares estimators are typically used when the signal is believed to be sparse. Properties of regularized least squares estimators in prediction, coefficient estimation and variable selection have been extensively studied. However, regularized methods do not directly provide valid inference procedures, such as confidence intervals and hypothesis tests.
Among the regularized regression procedures, the Lasso (Tibshirani, 1996) is one of the most popular methods as it is computationally manageable and theoretically well-understood. However, the limiting distribution of the Lasso estimator (Knight and Fu, 2000) depends on unknown parameters in low-dimensional settings and is not available in high-dimensional settings. Chatterjee and Lahiri (2010) showed the inconsistency of residual bootstrap for the Lasso if at least one true coefficient is zero in fixed-dimensional settings. Thus, there is substantial difficulty in drawing valid inference based on the Lasso estimates directly.

Debiased Lasso
In the $p \gg n$ scenario, Zhang and Zhang (2014) proposed to construct confidence intervals for regression coefficients and their low-dimensional linear combinations by "debiasing" regularized estimators, such as the Lasso. Such estimators are known as the "debiased Lasso" or the "desparsified Lasso".
Along this line of research, many recent papers study computational algorithms and theories for the debiased Lasso and its extensions beyond linear models. Van de Geer et al. (2014) proved asymptotic efficiency of the debiased Lasso estimator in linear models and for convex loss functions. Javanmard and Montanari (2014a) carefully studied a quadratic program in Zhang and Zhang (2014) to generate a direction for debiasing the Lasso and demonstrated its benefits. Jankova and Van De Geer (2015) and Ren et al. (2015) proved asymptotic efficiency of the debiased Lasso in estimating individual entries of a precision matrix. Mitra and Zhang (2016) proposed to debias a scaled group Lasso for chi-squared-based statistical inference for large variable groups. Fang et al. (2016) considered statistical inference with the debiased Lasso in the high-dimensional Cox model. Chernozhukov et al. (2017) studied a debiased method in a semiparametric model with machine learning approaches.
The sample size requirement for asymptotic normality in the aforementioned papers is typically $n \gg (s\log p)^2$, where $s$ is the number of nonzero regression coefficients. However, it is known that point estimation consistency of the Lasso estimators holds with $n \gg s\log p$. Therefore, it becomes an intriguing question whether it is possible to conduct statistical inference for individual coefficients in the regime $s\log p \ll n \lesssim (s\log p)^2$. Very little work has been done in this direction. Cai and Guo (2017) proved that adaptivity in $s$ is infeasible for statistical inference with random design when $n \ll (s\log p)^2$ in a minimax sense. However, for standard Gaussian design, Javanmard and Montanari (2014b) proved that the debiased estimator is asymptotically Gaussian in an average sense if $s = O(n/\log p)$ with $s/p$, $n/p$ constant, but they did not provide theoretical results when the covariance of the design is unknown. Javanmard and Montanari (2015), denoted as [JM15], proved that asymptotic normality for the debiased Lasso holds when $s \ll n/(\log p)^2$, $s_j \ll n/\log p$ and $\min\{s, s_j\} \ll \sqrt{n}/\log p$ under Gaussian design and other technical conditions, where $s_j$ is the number of nonzero elements in the $j$-th column of the precision matrix of the design. In this paper, we show that the sample size conditions for the debiased Lasso can be improved by bootstrap if a significant proportion of signals are strong, for both deterministic and random designs.

Bootstrap
Bootstrap has been widely studied in high-dimensional models for conducting inference. For the debiased Lasso procedure, Zhang and Cheng (2017) proposed a Gaussian bootstrap method to conduct simultaneous inference with non-Gaussian errors. Dezeure et al. (2016) proposed residual, paired and wild multiplier bootstrap for the debiased Lasso estimators, which demonstrate the benefits of bootstrap for heteroscedastic errors as well as simultaneous inference. However, the aforementioned papers do not provide improvement on the sample size conditions.
For a fixed number of covariates p, Chatterjee and Lahiri (2011) proposed to apply bootstrap to a modified Lasso estimator as well as to the Adaptive Lasso estimator (Zou, 2006). In a closely related paper, Chatterjee and Lahiri (2013) showed the consistency of bootstrap for the Adaptive Lasso when p increases with n under some conditions which guarantee sign consistency. They also proved the second-order correctness for a studentized pivot with a bias-correction term. It is worth mentioning that a beta-min condition is required in their theorems, as sign consistency is used to prove bootstrap consistency.
In this paper, we prove that the bias of the debiased Lasso estimator can be further removed by bootstrap without assuming the beta-min condition. We provide a refined analysis to distinguish the effects of small and large coefficients and show that bootstrap can remove the bias caused by strong coefficients. Under deterministic designs, the sample size requirement is reduced to $n \gg \max\{s\log p, (\tilde s\log p)^2\}$, where $\tilde s$ is the number of coefficients whose size is no larger than $C\sqrt{\log p/n}$ for some constant $C > 0$ (Theorem 3.4 in Section 3). One can see that the condition on the overall sparsity $s$, $s \ll n/\log p$, provides the rate of point estimation. If a majority of signals are strong, say $\tilde s \ll s$, our sample size condition is weaker than the usual $n \gg (s\log p)^2$. Comparable results are also proved for Gaussian designs, which involve the sparsity of the $j$-th column of the precision matrix (Theorem 4.3 in Section 4).

Other related literature
In the realm of high-dimensional inference, many other topics have been studied. For bootstrap theories, Mammen (1993) considered estimating the distribution of linear contrasts and of F-test statistics when p increases with n. Chernozhukov et al. (2013) and Deng and Zhang (2017) developed theories for multiplier bootstrap to approximate the maximum of a sum of high-dimensional random vectors. Belloni et al. (2014) proposed to construct confidence regions for the instrumental median regression estimator and other Z-estimators based on Neyman's orthogonalization, which is first-order equivalent to the bias correction. Inference based on the selected model has been considered in many recent papers (Berk et al., 2013; Lockhart et al., 2014; Lee et al., 2016; Tibshirani et al., 2016). Barber and Candes (2016) considered false discovery rate control via a knockoff filter in the high-dimensional setting.

Notations
For vectors $u$ and $v$, let $\|u\|_q$ denote the $\ell_q$ norm of $u$, $\|u\|_0$ the number of nonzero entries of $u$, and $\langle v, u\rangle = u^T v$ the inner product. For a set $T$, let $|T|$ denote the cardinality of $T$ and $u_T$ the subvector of $u$ with components in $T$. We use $e_j$ to refer to the $j$-th standard basis element, for example, $e_1 = (1, 0, \dots, 0)$. For a matrix $A \in \mathbb{R}^{k_1 \times k_2}$, let $\|A\|_q$ denote the $\ell_q$ operator norm of $A$. Specifically, let $\|A\|_\infty = \max_{j \le k_1} \|A_{j,\cdot}\|_1$. Let $\Lambda_{\max}(A)$ and $\Lambda_{\min}(A)$ be the largest and smallest singular values of $A$, and $A_{T_1,T_2}$ the submatrix of $A$ consisting of rows in $T_1$ and columns in $T_2$. For a vector $b \in \mathbb{R}^p$, let $\mathrm{sgn}(b)$ be an element of the sub-differential of the $\ell_1$ norm of $b$. Specifically, $(\mathrm{sgn}(b))_j = \mathrm{sign}(b_j)$ if $b_j \ne 0$ and $(\mathrm{sgn}(b))_j \in [-1, 1]$ if $b_j = 0$. Throughout, $Z$ denotes a standard normal random variable.
We use v 0 , v 1 , v 2 , c 1 , c 2 , . . . to denote generic constants which may vary from one appearance to the other.

Outline of the paper
The remainder of this paper is organized as follows. In Section 2, we describe the procedure under consideration and lay out the main ideas of the proof. In Sections 3 and 4, we prove the bootstrap consistency for the debiased Lasso and the asymptotic normality of a bias-corrected debiased Lasso estimator under fixed designs and Gaussian designs, respectively. We illustrate our theoretical results with simulation experiments in Section 5 and conclude with a discussion in Section 6. Proofs of the main theorems and lemmas are provided in Section 7.

Main contents
In this section, we describe the procedure of bootstrapping the debiased Lasso under consideration and the main ideas of this paper.

Bootstrapping the debiased Lasso
Consider a linear regression model
$$y_i = x_i^T \beta + \epsilon_i,$$
where $\beta \in \mathbb{R}^p$ is the true unknown parameter and $\epsilon_1, \dots, \epsilon_n$ are i.i.d. random variables with mean 0 and variance $\sigma^2$. We assume the true $\beta$ is sparse in the sense that the number of nonzero entries of $\beta$ is relatively small compared with $\min\{n, p\}$. For simplicity, we also assume that the columns $x_j$ are normalized so that $\|x_j\|_2^2 = n$ for $j = 1, \dots, p$. The Lasso estimator (Tibshirani, 1996) is defined as
$$\hat\beta = \arg\min_{b \in \mathbb{R}^p} \Big\{ \frac{1}{2n}\|y - Xb\|_2^2 + \lambda \|b\|_1 \Big\}, \tag{1}$$
where $\lambda > 0$ is a tuning parameter. Suppose that we are interested in making inference for a single coordinate $\beta_j$, $j = 1, \dots, p$. The debiased Lasso (Zhang and Zhang, 2014) corrects the Lasso estimator by a term calculated from residuals. Specifically, it takes the form
$$\hat\beta^{(DB)}_j = \hat\beta_j + \frac{z_j^T (y - X\hat\beta)}{z_j^T x_j}, \tag{2}$$
where $z_j$ is an estimate of the least favorable direction (Zhang, 2011). The vector $z_j$ can be computed either as the residual of another $\ell_1$-penalized regression of $x_j$ on $X_{-j}$ (Zhang and Zhang, 2014; Van de Geer et al., 2014) or by a quadratic optimization (Zhang and Zhang, 2014; Javanmard and Montanari, 2014a). We adopt the first procedure in this paper. Formally,
$$z_j = x_j - X_{-j}\hat\gamma_{-j}, \tag{3}$$
$$\hat\gamma_{-j} = \arg\min_{\gamma \in \mathbb{R}^{p-1}} \Big\{ \frac{1}{2n}\|x_j - X_{-j}\gamma\|_2^2 + \lambda_j \|\gamma\|_1 \Big\}. \tag{4}$$
While it is also possible to debias other regularized estimators of $\beta$, such as the Dantzig selector (Candes and Tao, 2007), SCAD (Fan and Li, 2001) and MCP (Zhang, 2010), we restrict our attention to bootstrapping the debiased Lasso. We consider Gaussian bootstrap although the noise $\epsilon_i$ is not necessarily assumed to be normally distributed. We generate the bootstrapped response vector as
$$y_i^* = x_i^T \hat\beta + \hat\epsilon_i^*, \quad i = 1, \dots, n, \tag{5}$$
where the $x_i$ are unchanged and the $\hat\epsilon_i^*$ are i.i.d. standard Gaussian random variables multiplied by an estimated standard deviation $\hat\sigma$. Namely,
$$\hat\epsilon_i^* = \hat\sigma \xi_i, \quad i = 1, \dots, n, \tag{6}$$
where the $\xi_i$ are i.i.d. standard normal. For the choice of the variance estimator $\hat\sigma^2$, defined in (7), we follow Sun and Zhang (2012), Reid et al. (2016), Zhang and Cheng (2017) and Dezeure et al. (2016). This is the same proposal for bootstrapping the residuals as in Zhang and Cheng (2017). However, we do not directly use $\hat\epsilon_i^*$ in (6) to simulate the distribution of the debiased estimator. Instead, we recompute the debiased Lasso based on $(X, y^*)$ as follows:
$$\hat\beta^{(*,DB)}_j = \hat\beta^*_j + \frac{z_j^T (y^* - X\hat\beta^*)}{z_j^T x_j}, \tag{8}$$
where $z_j$ is the same as the sample version in (3) and $\hat\beta^*$ is the bootstrap version of the Lasso estimator computed via (1) with $(X, y^*)$ instead of the original sample. We construct the confidence interval for $\beta_j$ as
$$\Big( \hat\beta^{(DB)}_j - q_{1-\alpha/2}\big(\hat\beta^{(*,DB)}_j - \hat\beta_j\big),\; \hat\beta^{(DB)}_j - q_{\alpha/2}\big(\hat\beta^{(*,DB)}_j - \hat\beta_j\big) \Big), \tag{9}$$
where $q_c(u)$ is the $c$-quantile of the distribution of $u$.
We prove that under proper conditions, the approximation error of the debiased Lasso estimator $\hat\beta^{(DB)}_j$ in (2) is dominated by a constant term. We propose to estimate this dominating constant bias by the median of the bootstrapped approximation errors and construct a double debiased Lasso (DDB) estimator which is asymptotically normal under proper conditions.
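The procedure of this subsection can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the coordinate-descent solver `lasso_cd`, the helper names, the variance estimator `||y - X b||^2 / n`, the index-based empirical quantiles, and the median-corrected form of the DDB estimator are all choices made here for illustration.

```python
import random
import statistics

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def column(X, k):
    return [row[k] for row in X]

def soft(x, t):
    """Soft-thresholding operator used by coordinate descent."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def lasso_cd(X, y, lam, sweeps=100, tol=1e-8):
    """Minimize (1/2n)||y - Xb||_2^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    cols = [column(X, k) for k in range(p)]
    scale = [dot(c, c) / n for c in cols]
    b, r = [0.0] * p, list(y)  # r tracks the residual y - Xb
    for _ in range(sweeps):
        biggest = 0.0
        for k in range(p):
            if scale[k] == 0.0:
                continue
            new = soft(dot(cols[k], r) / n + scale[k] * b[k], lam) / scale[k]
            step = new - b[k]
            if step != 0.0:
                r = [ri - step * c for ri, c in zip(r, cols[k])]
                b[k] = new
                biggest = max(biggest, abs(step))
        if biggest < tol:
            break
    return b

def residual(X, y, b):
    return [yi - dot(row, b) for row, yi in zip(X, y)]

def score_vector(X, j, lam_j):
    """z_j: residual of the l1-penalized regression of x_j on X_{-j}."""
    rest = [row[:j] + row[j + 1:] for row in X]
    xj = column(X, j)
    return residual(rest, xj, lasso_cd(rest, xj, lam_j))

def debiased(X, y, b, z, j):
    """Debiased Lasso for coordinate j: b_j + z'(y - Xb) / z'x_j."""
    return b[j] + dot(z, residual(X, y, b)) / dot(z, column(X, j))

def bootstrap_debiased(X, y, j, lam, lam_j, B=200, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(X)
    b = lasso_cd(X, y, lam)
    z = score_vector(X, j, lam_j)          # z_j is reused in every resample
    db = debiased(X, y, b, z, j)
    res = residual(X, y, b)
    sigma = (dot(res, res) / n) ** 0.5     # ASSUMED variance estimator
    fitted = [dot(row, b) for row in X]
    errs = []
    for _ in range(B):
        y_star = [f + sigma * rng.gauss(0.0, 1.0) for f in fitted]
        b_star = lasso_cd(X, y_star, lam)  # Lasso recomputed on (X, y*)
        errs.append(debiased(X, y_star, b_star, z, j) - b[j])
    errs.sort()
    lo_q = errs[int((alpha / 2) * (B - 1))]              # ASSUMED quantile rule
    hi_q = errs[int(round((1 - alpha / 2) * (B - 1)))]
    ci = (db - hi_q, db - lo_q)
    ddb = db - statistics.median(errs)     # ASSUMED form of the DDB estimator
    return {"db": db, "ddb": ddb, "ci": ci}
```

The pure-Python solver is only practical for small n and p; it is used here so that the sketch has no external dependencies, and any Lasso solver for the same objective could be substituted.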

Main ideas
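Our analysis is based on an error decomposition for the debiased Lasso that differs from the one originally introduced. In Zhang and Zhang (2014), the error of the debiased Lasso is decomposed into two terms, a noise term and a remainder term:
$$\hat\beta^{(DB)}_j - \beta_j = \underbrace{\frac{z_j^T \epsilon}{z_j^T x_j}}_{Orig.noise} + \underbrace{\frac{(z_j^T x_j e_j^T - z_j^T X)(\hat\beta - \beta)}{z_j^T x_j}}_{Orig.remainder}. \tag{11}$$
This is the starting point of many existing analyses of the debiased Lasso (Van de Geer et al., 2014; Javanmard and Montanari, 2014a; Dezeure et al., 2016). Typically, the Orig.remainder term is bounded by $O_P(s\lambda\lambda_j)$ through an $\ell_\infty$-$\ell_1$ splitting with $\lambda_j$ in (4). Our analysis is motivated by the following observations. Let $S$ and $\hat S$ be the supports of $\beta$ and $\hat\beta$ respectively. It follows from the KKT condition of the Lasso (1) that
$$\hat\beta_{\hat S} = (X_{\hat S}^T X_{\hat S})^{-1} X_{\hat S}^T y - \lambda \Big(\frac{X_{\hat S}^T X_{\hat S}}{n}\Big)^{-1} \mathrm{sgn}(\hat\beta_{\hat S}), \qquad \hat\beta_{\hat S^c} = 0,$$
assuming that $X_{\hat S}^T X_{\hat S}/n$ is invertible. Our idea is to approximate $\hat\beta$ by an oracle estimator $\hat\beta^o$ in the analysis, where
$$\hat\beta^o_S = (X_S^T X_S)^{-1} X_S^T y - \lambda \Big(\frac{X_S^T X_S}{n}\Big)^{-1} \mathrm{sgn}(\beta_S), \qquad \hat\beta^o_{S^c} = 0, \tag{12}$$
when $X_S^T X_S/n$ is invertible. This estimator $\hat\beta^o$ is an oracle as it requires the knowledge of the true support of $\beta$. However, it is different from the oracle least squares estimator, as the last term in $\hat\beta^o_S$ is added to mimic the Lasso estimator. In fact, $\hat\beta^o = \hat\beta$ when the Lasso estimator is sign consistent.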
Inference based on the oracle estimator $\hat\beta^o$ in (12) is relatively easy, because its approximation error does not involve random support selection. In fact, its approximation error is linear in $\epsilon$ with an unknown intercept. Our idea is that when the difference between the oracle estimator $\hat\beta^o$ and the Lasso estimator $\hat\beta$ is small, the approximation error of the debiased Lasso in (11) is dominated by a bias term associated with this intercept. Therefore, bootstrap can be used to remove this main bias term. Specifically, we decompose the error of the debiased Lasso in (2) as
$$\hat\beta^{(DB)}_j - \beta_j = Noise + Bias + Remainder. \tag{13}$$
Here the Noise term is the sum of the Orig.noise in (11) and a noise term associated with the oracle estimator $\hat\beta^o$ in (12). The Bias term is from the intercept of the oracle estimator $\hat\beta^o$ in (12), which is a constant of order $O_P(s\lambda\lambda_j)$ (Remark 3.1 in Section 3). The Remainder term arises from the difference between the oracle estimator $\hat\beta^o$ and the Lasso estimator $\hat\beta$ in (1). We prove the consistency of bootstrap when the Remainder is of order $o(n^{-1/2})$, even if the Bias term is of larger order than $n^{-1/2}$. The error decomposition in (13) demonstrates benefits over the decomposition in (11) when the Remainder term in (13) is of smaller order than the Orig.remainder term in (11).
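As a sketch of how the three terms can arise, one may write $\hat\beta - \beta = (\hat\beta^o - \beta) + (\hat\beta - \hat\beta^o)$ in the remainder of (11) and expand $\hat\beta^o_S - \beta_S$ using (12) with $y = X_S\beta_S + \epsilon$. The expressions below are reconstructed from these definitions, assuming $\Sigma^n_{S,S} = X_S^T X_S/n$ is invertible. The shorthand $w_j$ is introduced here only for illustration, and the paper's own display may arrange the terms differently.

```latex
% Shorthand (not from the paper): w_j^T = (z_j^T x_j e_j^T - z_j^T X) / (z_j^T x_j).
\begin{align*}
\hat\beta^{o}_S - \beta_S
  &= (\Sigma^n_{S,S})^{-1} \frac{X_S^T \epsilon}{n}
     - \lambda (\Sigma^n_{S,S})^{-1} \mathrm{sgn}(\beta_S), \\
\hat\beta^{(DB)}_j - \beta_j
  &= \frac{z_j^T \epsilon}{z_j^T x_j}
     + w_j^T (\hat\beta^{o} - \beta)
     + w_j^T (\hat\beta - \hat\beta^{o}) \\
  &= \underbrace{\frac{z_j^T \epsilon}{z_j^T x_j}
     + (w_j)_S^T (\Sigma^n_{S,S})^{-1} \frac{X_S^T \epsilon}{n}}_{Noise}
     \; \underbrace{- \lambda (w_j)_S^T (\Sigma^n_{S,S})^{-1} \mathrm{sgn}(\beta_S)}_{Bias}
     \; + \underbrace{w_j^T (\hat\beta - \hat\beta^{o})}_{Remainder}.
\end{align*}
```

The second line uses that both $\hat\beta^o$ and $\beta$ vanish off $S$, so only the $S$-block of $w_j$ enters the Noise and Bias terms. The Bias term involves no randomness from $\epsilon$, which is what makes it a candidate for estimation by a bootstrap median.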
One way to bound the Remainder term in ( 13) is by considering the event that the selected support by Lasso is inside the true support and the an ∞ -bound exists for its estimation error: Recall that when the Lasso estimator β is sign consistent, βo = β and Remainder in ( 13) is zero.Let S be a set of "small" coefficients, such that S = {j : 0 In Ω 0 , we can get sgn(β j ) = sgn( βj ), for j ∈ S\ S.And hence the sign inconsistency only occurs on S. Formally, We show that the Remainder term in ( 13) is associated with the order of | S|.This leads to the improvement in sample size requirement when | S| is of smaller order than |S|.

Main results: deterministic design
In this section, we carry out a detailed analysis for deterministic designs. We first provide sufficient conditions for our theorems. For ease of notation, let $\Sigma^n = X^T X/n$.
Condition 3.1.X is deterministic with As K 1 is assumed to be a constant in Condition 3.3, the eigenvalue condition in Condition 3.1 is redundant in the sense that Λ min (Σ n S,S ) ≥ 1/K 1 .Note that the eigenvalue condition and Condition 3.3 are only required on a block of the Gram matrix consisting of rows and columns in the true support.The quantity in Condition 3.2 is called incoherence parameter (Wainwright, 2009).This condition is equivalent to the uniformity of the strong irrepresentable condition (Zhao and Yu, 2006) over all sign vectors.Another related condition, the neighborhood stability condition (Meinshausen and Bühlmann, 2006), has been studied for model selection in Gaussian graphical models.Condition 3.3 is required for establishing an ∞ -bound of estimation error of the Lasso estimator.Condition 3.4 involves only first four moments of allowing some heavy-tailed distributions.Condition 3.5 contains some regularity conditions on z j , which are verifiable after the calculation of z j .

Preliminary lemmas
We first prove that event Ω 0 in ( 14) holds true with large probability for deterministic designs.
Lemma 3.1 is proved in Section 7.1.Lemma 3.1 asserts that the Lasso estimator does not have false positive selection with large probability under Conditions 3.1 -3.4.It is known that Condition 3.3 and beta-min condition together imply the selection consistency of the Lasso estimator.However, we do not impose the beta-min condition but distinguish the effects of small and large signals.Note that g 1 (λ) λ for λ log p/n.Next we show that analogous results of Lemma 3.1 hold for the bootstrap version of the Lasso estimator β * .Lemma 3.2.Assume that Conditions 3.1 -3.4 are satisfied.If (n, p, s, λ) satisfies (16), n s log p and 4σ 1 − κ then with probability going to 1, Lemma 3.2 is proved in Section 7.2.We mention that the condition n s log p is required for the consistency of σ2 (see Lemma A.4 for details).Note that it is Ŝ instead of S that is the true support under the bootstrap resampling proposal and Ŝ ⊆ S with large probability by Lemma 3.1.However, ( 19) is sufficient for the error of the bootstrapped debiased Lasso to approximate the error decomposition in (13), which can be seen from the next Lemma.
In the following lemma, we consider the error decomposition in ( 13) and bound the Remainder term for the debiased Lasso and its bootstrap analogue.Let N oise * be the bootstrap version of N oise, which is Let s be the number of small coefficients, such that for g 1 (λ) and g 1 (λ) defined in ( 17) and ( 19) respectively.
Lemma 3.3 (Bounding the remainder terms).Suppose that Conditions 3.1 -3.4 hold true, λ log p/n satisfies ( 16) and ( 18) and n s log p.For β(DB) j and β( * ,DB) j defined in ( 2) and ( 8) respectively, we have where N oise and Bias are defined in (13), N oise * is defined in (20) and s is defined in (21).
Lemma 3.3 is proved in Section 7.3. The factor $z_j^T x_j/n$ is calculable and can typically be treated as a positive constant. In fact, this factor is proportional to the standard deviation of Noise and Noise*, so that it cancels in the analysis of the asymptotic normality. Therefore, we have proved that the Remainder term in (13) is of order $O_P(\tilde s\lambda\lambda_j)$.
Remark 3.1.Under Conditions 3.1 -3.4 and λ λ j log p/n, we can get a natural upper bound on Bias in (13): Note that the order of Bias is not guaranteed to be o(n −1/2 ) under the sample size conditions of Lemma 3.3.There will be no guarantee of improvement on the sample size requirement if we do not remove the Bias term.

Consistency of bootstrap approximation
Inference for $\beta_j$ is based on the pivotal statistics $R_j$ and $R_j^*$ in (24), where $\hat\beta^{(DB)}_j$ and $\hat\beta^{(*,DB)}_j$ are defined in (2) and (8) respectively. We show the consistency of the bootstrap approximation of $R_j^*$ to $R_j$ as well as the asymptotic normality of a pivot based on the double debiased Lasso estimator $\hat\beta^{(DDB)}_j$ in (10), namely $R^{(DDB)}_j$ in (25). We specify the sample size conditions as follows:
$$A_1 = \Big\{ (n, p, s, \tilde s, \lambda, \lambda_j) : (n, p, s, \lambda) \text{ satisfies (16) and (18)},\ \lambda \asymp \lambda_j \asymp \sqrt{\log p/n} \text{ and } n \gg \max\{s\log p, (\tilde s\log p)^2\} \text{ for } \tilde s \text{ in (21)} \Big\}.$$
As discussed in Section 1, the condition on the overall sparsity recovers the rate of point estimation. If $\tilde s \ll s$, our sample size condition is weaker than the typical one, $n \gg (s\log p)^2$.
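To make the comparison of the two sample size conditions concrete, the following sketch evaluates the two orders side by side. It is only illustrative: the conditions are asymptotic and ignore constants, and the function names and the value used for the number of small coefficients are hypothetical.

```python
import math

def classical_order(s, p):
    """Order (s log p)^2 that n must dominate in the usual debiased-Lasso theory."""
    return (s * math.log(p)) ** 2

def bootstrap_order(s, s_small, p):
    """Order max{s log p, (s_small log p)^2} appearing in the condition A_1."""
    return max(s * math.log(p), (s_small * math.log(p)) ** 2)

# Hypothetical example: 20 nonzero coefficients, 2 of them small, p = 500.
ratio = bootstrap_order(20, 2, 500) / classical_order(20, 500)
```

When all signals are small (`s_small == s`) the two orders coincide, so the gain comes entirely from the strong signals.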
For the double debiased estimator (10), the Bias in (48) is estimated by the median of the distribution of $\hat\beta^{(*,DB)}_j - \hat\beta_j$. In practice, this median can be approximated by the sample median of bootstrap realizations.
Remark 3.2. Suppose we are interested in making inference for a linear combination of regression coefficients $\langle a_0, \beta\rangle$ for $a_0 \in \mathbb{R}^p$. It is not hard to see that Gaussian bootstrap remains consistent under the conditions of Theorem 3.4 if $\|a_0\|_1/\|a_0\|_2$ is bounded.

Main results: Gaussian designs
This section includes main results in the case of Gaussian design.The proof follows similar steps as for deterministic designs.We first describe conditions we impose in our theorems.
Condition 4.1.X has independent Gaussian rows with mean 0 and covariance Σ.
This condition is on a set T , which is actually the support of the estimation error of a perturbed Lasso estimator, while Condition 4.3 is assumed on the true support S.

Preliminary lemmas
Lemma 4.1.Assume that Conditions 4.1 -4.5 are satisfied and (n, p, s, λ) satisfies that Then for it holds that with probability greater than Lemma 4.1 is proved in Section 7.5.Note that , which is the same as deterministic design case.Theorem 3 in Wainwright (2009) considers the same scenario, but their results require s → ∞ and their upper bound on β − β ∞ only holds for sign consistency case.
In the next Lemma, we prove a bootstrap analogue of Lemma 4.1.
then with probability going to 1, Lemma 4.2 is proved in Section 7.6. As in the deterministic design case, the condition $n \gg s\log p$ is required for the consistency of $\hat\sigma^2$.

Consistency of bootstrap approximation
Under Conditions 4.1 -4.5, we prove the consistency of Gaussian bootstrap under Gaussian designs.For g 2 (λ) defined in (28), define We specify the required sample size condition as following: A 2 = (n, p, s, s, s j , λ, λ j ) : (n, p, s, λ) satisfies ( 27) and ( 30), λ λ j log p/n and n max{ss log p, (s log p) 2 , s j log p} for s in (32) . ( Theorem 4.3.Suppose that Conditions 4.1 -4.5 are satisfied and (n, p, s, s, s j , λ, λ j ) ∈ A 2 .Then it holds that sup α∈(0,1) For R (DDB) j defined in (25), sup α∈(0,1) It can be seen from the proof (Section 7.7) that condition n ss log p in ( 33) is used to achieve desired rates for |Remainder| and its bootstrap analogue, such that ).The condition n s j log p is required to prove that z j 2 2 /n is asymptotically bounded away from zero.
In terms of the sparsity requirements, $A_2$ in (33) implies that it is sufficient to require $s = O(\sqrt{n})$ and $\tilde s = o(\sqrt{n}/\log p)$. Compared to the typical condition, $s = o(\sqrt{n}/\log p)$, our condition allows at least an extra order of $\log p$. Moreover, if $\tilde s$ is constant, our requirement on $s$ is $s \ll n/\log p$, which recovers the rate of point estimation. Compared with the sparsity condition assumed in [JM15] for the unknown Gaussian design case, our analysis still benefits when $\tilde s$ is sufficiently small:
• If the sparsity of the $j$-th column of the precision matrix is much larger than the sparsity of $\beta$, i.e. $\tilde s \le s \ll s_j$, [JM15] required $n \gg \max\{(s\log p)^2, s_j\log p\}$, which is no better than the rate in $A_2$ (33) as discussed above. If $\tilde s \ll s$, $A_2$ is weaker than the sparsity conditions assumed in [JM15].
• If the $j$-th column of the precision matrix is much sparser, i.e. $s \gg s_j$, [JM15] required that $n \gg \max\{s(\log p)^2, (s_j\log p)^2\}$. If $\tilde s \ll \log p$, then $s\tilde s\log p \ll s(\log p)^2$ and hence the sample size condition in $A_2$ is weaker. If $\tilde s \gg \log p$, [JM15] required a weaker condition on $s$ but a stronger condition on $s_j$.

Simulations
In this section, we report the performance of the debiased Lasso with Gaussian bootstrap and other comparable methods in simulation experiments.
Consider the deterministic design case with $n = 100$, $p = 500$, $X_i \sim N(0, I_p)$ and $\epsilon_i \sim N(0, 1)$. We consider a relatively large sparsity level, $s = 20$, and two levels of true regression coefficients as follows.
(i) All the signals are strong:
(ii) A large proportion of signals are strong:
We compare the performance of bootstrapping the debiased Lasso (BS-DB), the debiased Lasso without bootstrap (DB) and the Adaptive Lasso with residual bootstrap (BS-ADP). For BS-DB, we generate $100(1-\alpha)\%$ confidence intervals (CIs) according to (9) with 500 bootstrap resamples. We take $\lambda = \lambda_j$ at the universal level for the Lasso procedures. For DB, we estimate the noise level by (7) and take $\lambda = \lambda_j$ at the universal level for the Lasso procedures. $100(1-\alpha)\%$ confidence intervals are generated according to
For BS-ADP, we consider the pivot defined in (4.2) of Chatterjee and Lahiri (2013), which can achieve second-order correctness under some conditions. Such estimators also have a bias-correction term, which can be explicitly calculated assuming sign consistency. The choices of $\lambda_{1,n}$ and $\lambda_{2,n}$ are according to Section 6 of Chatterjee and Lahiri (2013). Each confidence interval is generated with 500 bootstrap resamples. We construct two-sided 95% confidence intervals using each of the aforementioned methods. Each setting is replicated with 1000 independent realizations. In the following table, we report the average coverage probability on $S$ and $S^c$ ($cov_S$ and $cov_{S^c}$, respectively) as well as the average length of CIs on $S$ and $S^c$ ($\ell_S$ and $\ell_{S^c}$, respectively) for the identity covariance matrix and the equicorrelated covariance matrix with $\Sigma_{j,j} = 1$ and $\Sigma_{j,k} = 0.2$ ($j \ne k$). One can see that BS-DB always gives larger coverage probabilities than DB across different settings. We mention that the noise level is overestimated (see (7)). For example, in settings (i) and (ii) with the identity covariance matrix, the average of $\hat\sigma$ is 2.240 and 2.244, respectively. The CIs given by BS-DB are longer than those computed with DB on $S$, but on $S^c$ the CIs given by BS-DB are shorter than the ones given by DB. On the other hand, BS-ADP exhibits the overconfidence phenomenon: the average lengths of CIs are small, which results in low coverage probabilities on $S$. In the presence of equicorrelation, which is a harder case, BS-DB is significantly better than DB and BS-ADP in terms of coverage probability.
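The two design covariances and the reported summaries can be sketched as follows. This is a hedged illustration rather than the simulation code behind the reported numbers: the function names are hypothetical, the shared-factor construction is one standard way to obtain equicorrelated Gaussian rows, and rescaling the columns to squared norm n follows the normalization assumed in Section 2 and may not match what was done in the experiments.

```python
import math
import random

def equicorrelated_design(n, p, rho, seed=0):
    """Rows ~ N(0, Sigma) with Sigma_jj = 1 and Sigma_jk = rho for j != k.

    rho = 0 gives the identity covariance. Columns are rescaled so that
    each has squared Euclidean norm n (an assumption of this sketch).
    """
    rng = random.Random(seed)
    a, c = math.sqrt(rho), math.sqrt(1.0 - rho)
    X = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # common factor inducing correlation rho
        X.append([a * shared + c * rng.gauss(0.0, 1.0) for _ in range(p)])
    for k in range(p):
        norm = math.sqrt(sum(row[k] ** 2 for row in X))
        for row in X:
            row[k] *= math.sqrt(n) / norm
    return X

def coverage_and_length(intervals, truth):
    """Empirical coverage and average length of intervals for one coefficient."""
    hits = sum(lo <= truth <= hi for lo, hi in intervals)
    total = sum(hi - lo for lo, hi in intervals)
    return hits / len(intervals), total / len(intervals)
```

Averaging the output of `coverage_and_length` over the coordinates in S and in its complement would give summaries of the kind reported for each method.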

[Figure 1 panels: boxplots of the centers of CIs for DB, DDB and Las.]
Figure 1: Boxplots of the double debiased Lasso (DDB) (10), the debiased Lasso (DB) (2) and the Lasso (Las) (1) with the identity covariance matrix in setting (ii). The first row consists of estimates for weak signals, the second row of estimates for strong signals, and the third row of estimates for zeros: $\beta_{21} = \beta_{22} = \beta_{23} = 0$. Each boxplot is based on 1000 independent replications.
Figure 1 demonstrates the bias-correction effects of debiasing and bootstrap across different levels of signal strengths.Concerning the overall performance, DDB is better than DB in terms of bias-correction, which is in line with our theoretical results.For j ∈ S, DDB and DB are less biased than the Lasso estimators.On S c , the Lasso estimates the regression coefficients as zero with a large probability.Thus, the Boxplot degenerates to a point at zero with a few outliers.Comparing row-wise, one can see that bootstrap has more significant correction effects on strong signals (second row) than on weak signals (first row).When true coefficients are zeros, DDB is also less biased than DB.

Discussions
We consider the bias-correction effect of bootstrap for statistical inference with the debiased Lasso under proper conditions. Our analysis of the approximation error of the debiased Lasso admits sample size conditions in terms of the number of weak signals. Our results not only contribute to the inference problem in the regime $s\log p \ll n \lesssim (s\log p)^2$, but also demonstrate the benefits of having strong signals for the debiased Lasso procedure. We establish the consistency of Gaussian bootstrap and show that confidence intervals can be constructed based on bootstrap samples.
Besides Gaussian bootstrap, we also considered residual bootstrap, which is robust in the presence of heteroscedastic errors. However, the proof involves a more technical analysis and may impair the sample size conditions. To focus on the main idea, this is omitted from the paper. We also considered the proof techniques in [JM15], which construct a perturbed version of the Lasso estimator assuming $\beta_j$ is known and utilize its independence of $x_j$. However, these techniques cannot be directly applied to the bootstrapped debiased Lasso, since the "true" parameters $\hat\beta$ and $\hat\sigma$ under the bootstrap resampling plan are not independent of $x_j$ for $j \in S$.

Proofs of main lemmas and theorems
To simplify our notations in this section, let ûS = βS − β S , û * 7.1 Proof of Lemma 3.1 Proof.Firstly, we use Lemma 1 -Lemma 3 in Wainwright (2009) to prove that ( 17 Define T j as By Lemma 1 of Wainwright (2009), if Σ n S,S is invertible and |T j | < 1 for ∀j ∈ S c , then the β is the unique solution to the Lasso with Ŝ ⊆ S. Note that We use standard symmetrization techniques to prove that Q 1 ≤ (1 − κ)/2 with large probability (see Lemma A.1 for detailed results).By Condition 3.2 and (76) in Lemma A.1, there exists some c 1 , c 2 > 0 such that for λ in ( 16).Together with Condition 3.1, we have Ŝ ⊆ S with probability greater than 4 exp (−c 1 log p) + c 2 /n.By the KKT condition of the Lasso (1), Ŝ ⊆ S implies that Then we have By (77) in Lemma A.1 and Condition 3.3, there exists some c 3 , c 4 > 0 such that 7.2 Proof of Lemma 3.2 Proof.Formally, the bootstrapped Lasso estimator β * is defined via Define a restricted Lasso problem with observations (X S , y * ): Define T * j as By Condition 3.1 and Lemma A.2, Note that .
By construction (5), Q * 1,j is a Gaussian variable with mean zero and variance no larger than σ2 /(nλ 2 ), ∀j ∈ S c , conditioning on σ2 .Thus, where the last step is due to the consistency of σ2 in (7) (see Lemma A.4). Condition 3.2 implies and some c 1 > 0, we have By Lemma 3.1 and (40), P( Ŝ ⊆ S) → 1 and hence By the KKT condition of the bootstrapped Lasso (37), in the event that { Ŝ * ⊆ S}, Therefore, Again using the Gaussian property of ˆ * , there exists some c 2 > 0 such that ≤ 2 exp(−c 2 log p) + o(1).
Together with (42), the proof is completed.
For the bootstrap version, define an oracle Lasso estimator computed with the bootstrap samples.Formally, β( * ,o) If Ŝ ⊆ S and Ŝ * ⊆ S, we can plug in β( * ,o) S and obtain that β( * ,DB) where N oise * is in (20) and Bias is in (13).

Proof of Theorem 3.4
We simplify the notations for the terms in ( 13) and ( 48).Let b j = Bias in (13), Rem j = Remainder in (13), η j = N oise in (13), Proof.Define a version of pivots in (24) which is standardized and bias-removed: We first find the limiting distribution for R o j and R ( * ,o) j . Let ζ j be the normalized version of η j in (50): where where the second step is due to Condition 3.5 and the last step is by our sample size condition.Note that ζ j is a random variable with mean zero and variance s 2 n , where Note that H 1 in (54) can be bounded by where the second last step is by Conditions 3.1 and 3.5 and the last step is by (43).Similarly, H 2 in (54) can be bounded by Thus, for n s log p, we have Now we check the Lyapunov condition, which is Using Condition 3.4, we can obtain that For ease of notation, let c i = (z T j x j e T j − z T j X) S (Σ n S,S ) −1 X T i,S /n, i = 1, . . ., n.Then we have As a consequence, In view of (58), it holds that as long as s 3 λ 4 n.For s n/ log p and λ j log p n , it is easy to check that s 3 λ 4 = O(n/ log p) n.
We have proved that Together with ( 53) and (57), we have For the bootstrap version, consider R ( * ,o) j defined in (51).By Lemma A.4 and (23) in Lemma 3.3, where is a Gaussian variable with mean zero and variance 1 + o P (1).This implies that Let F * (c) be the cumulative distribution function of ζ * j , i.e.F * (c) = P(ζ * j ≤ c).For ∀v 1 , v 2 > 0 and ∀α ∈ (0, 1), where the first step is due to the monotonicity of F * , the second step is by the definition of quantile function, and the last step is due to (61).By first taking v 2 → 0, we have proved that for ∀α ∈ (0, 1) and ∀v 1 > 0, A matching lower bound can be proved by a completely analogous argument.Thus, sup α∈(0,1) To complete our proof, note that by Lemma A.4, Together with ( 59) and ( 62), it holds that sup α∈(0,1) Next, we prove the asymptotic normality of R Due to (48), we can easily obtain that where the second step is due to (62).By definition of β(DDB) j (10) and R j defined in (24), where the last step is due to (59) for Z ∼ N (0, 1).For R 7.5 Proof of Lemma 4.1 For βS and T j defined in ( 34) and ( 35) respectively, we can rewrite T j as Conditioning on X S and , t o j is a Gaussian random variable with mean 0 and variance at most Σ j,j .Thus, Var(E 1,j |X S , ) ≤ Σ j,j (X S (X T S X S ) −1 sgn( βS ) + Define an event Thus, by (67) and Condition 4.4, in B 1 , Thus, by Lemma A.3 and Condition 4.5, for some constant c 1 , c 2 , c 3 > 0. Let x = (1 − κ)/2 and solve there exists some c 1 > 0, such that for some constant c 1 , c 3 , c 4 > 0.
(ii) The second task is to bound ûS ∞ .In the event that { Ŝ ⊆ S}, .
For E * 1,j , we first show that X S (X T S X S ) −1 sgn( β * S ) is independent of t o j in (66) ∀j ∈ S c , in the event of B 2 .Note that by Lemma 1 in Wainwright (2009), B 2 implies that β in ( 34) is the unique solution to the Lasso (1).As a result, β is a function of (X S , ) and Ŝ ⊆ S. Ŝ ⊆ S further implies that β * S in ( 38) is a function of (X S , β, ˆ * ).Therefore, the following arguments hold true: Thus, and (71) hold true for n s log p and λ ≥ 4σ 1−κ log p n .We conclude that for n s log p and λ > 4σ In B 3 , (41) holds true and we have By the Gaussian property of ξ, in B 3 , B 3 is a large probability event due to part (i) of the proof, Lemma A.3 and Lemma A.4. Putting these pieces together, we have for some c 2 > 0 and λ satisfying (30).

Proof of Theorem 4.3
Proof.Under Gaussian designs, we still consider error decompositions as in ( 13) and (48).We use simplified notations described in (50).
In the event that (29) holds, we can obtain that where the last step is by ( 43  The asymptotic normality of R (DDB) j can be similarly proved as for the fixed design case and is omitted here.≤ n max j∈S c ,i≤n x i,j − x i,S (X T S X S ) −1 X T S x j 2 ≤ n max i,j Thus, for any constant C 2 , And hence by ( 78) and ( 79), for n ≥ 8σ 2 /(λ 2 η 2 (1 − κ) 2 ), Take C 2 = Taking x = √ s ∨ √ log p in (82), we have with probability 1 − exp(− log p/2), Putting these arguments together, we have b S 1 and βS c = 0. (34)
n, are i.i.d from Gaussian distribution with mean 0 and variance σ 2 .
where max j≤p (Σ −1 ) j,j ≤ 1/C * by Condition 4.4.Thus, for R o j defined in (51) and ζ j in (52) we haveR o j = O P nK 1 (1 + C n )sλλ j σ z j 2 + ζ j + o P (1)Next we show that for ζ j in (74), ), Lemma 4.1, Lemma A.3 and the definition of s in (32).By Lemma 5.3 of Van de Geer et al. (2014), if n s j log p, z j Moreover, by Chebyshev's inequality, ∀j ∈ S c max j∈S c P