Recovery of weak signal in high dimensional linear regression by data perturbation



Introduction
To achieve high accuracy in feature selection performed by a regularization procedure, as proved in [10] and elsewhere, the informative (or useful) features, whose regression coefficients are nonzero, must be well separated from the uninformative (or useless) features, whose coefficients are zero. However, strong columnwise correlations in the design matrix and small nonzero regression coefficients make informative and uninformative features inseparable and result in poor selection accuracy. Theoretically, it has been shown that the Least Absolute Shrinkage and Selection Operator (LASSO) [8], a convex regularization procedure, must meet the Irrepresentable condition [13] and the Beta-min condition [4] to achieve high selection accuracy. The Irrepresentable condition requires that the true model, comprising all informative features, be weakly correlated with each uninformative feature, while the Beta-min condition requires that every nonzero regression coefficient exceed some threshold. Similar "weak correlation and strong signal" conditions on the design matrix and the nonzero regression coefficients are needed for nonconvex regularization procedures such as the Minimax Concave Penalty (MCP) [9] and Multi-stage Convex Relaxation (MCR) [10]. In summary, "weak correlation and strong signal" is a prerequisite for recovery of the true model by both convex and nonconvex regularization procedures. However, these conditions are unlikely to hold and hard to check in practice [4]. In this article, we develop a computationally feasible procedure that can recover the true model in the presence of strong correlation and/or weak signals.
Many efforts have been directed toward overcoming the Irrepresentable condition and the Beta-min condition. They fall into two categories: decorrelation and resampling. The elastic net [14] weakens columnwise correlation by adding a squared L2-norm penalty (i.e., the ridge penalty). Other studies show that resampling often leads to higher selection accuracy [1]. Two representative approaches are Stability Selection (SS) [7] and the Bootstrapped Enhanced LASSO (BoLASSO) [1].
In this work we develop a procedure, Perturbed LASSO (PLA), to improve selection accuracy by combining the power of decorrelation and resampling. PLA is implemented in two steps. First, we generate H pseudo-samples from the original data by repeatedly adding random perturbations to the design matrix, and create a model subspace by applying LASSO with a set of D predefined regularization parameters to each pseudo-sample; consequently, H pseudo-samples produce H model subspaces, each consisting of D models, and the union of the H subspaces includes no more than DH unique models. After the union model space is created, the second step is to perform model selection based on the original data by an information criterion. As shown in theory and in numerical studies, the random perturbation not only weakens correlation in the design matrix but also strengthens the signal. As a consequence, PLA overcomes both the Irrepresentable condition and the Beta-min condition.
As for the computing cost of PLA, we provide a quantitative relationship between the number of perturbations H and the (lower bound on the) probability of selecting the true model, which we believe is a key contribution to the field. In a nutshell, increased computation always improves selection accuracy, while small nonzero regression coefficients always necessitate heavy computation. We run a series of simulations to demonstrate the gains of PLA in selection accuracy against its computing cost and verify that PLA obtains substantially higher selection accuracy than its competitors. Moreover, most of these competitors address only the Irrepresentable condition, whereas our method addresses both the Irrepresentable condition and the Beta-min condition.
Over the past two decades the trade-off between goodness-of-fit and parsimony has been a central topic in the model selection literature. With the advent of the high-dimensional data era, however, the trade-off between computing cost and selection accuracy poses great challenges and deserves attention. This work explores this second type of trade-off from both theoretical and numerical perspectives.
The paper is organized as follows. In Section 2, we set up the problem. PLA is proposed and its properties are studied in Section 3. In Sections 4 and 5, simulation results and a real data example are presented, respectively. Concluding remarks are in Section 6. The proofs of the main results are in the Appendix.

Model setting
As is typical in high-dimensional regression analysis, throughout this article we only consider linear models of the form Y = β_1 x_1 + · · · + β_p x_p + ε, (2.1) which include a response variable Y and p deterministic features (x_1, · · · , x_p).
Here we use the standard notation β = (β_1, · · · , β_p)′ for the coefficient vector. All nonempty subsets of (x_1, ..., x_p) constitute the model space M, and each element of M defines a model M with cardinality |M|.
Assume the data are generated by Y = β^0_1 x_1 + · · · + β^0_p x_p + ε, (2.2) where ε is distributed as N(0, σ²). We define the true model M_0 as the set of informative (or useful) features, those whose regression coefficients β^0_j ≠ 0; the features with zero coefficients are defined as uninformative (or useless). Let q = ‖β^0‖_0. Moreover, assume that the first q coefficients β^0_1 = (β^0_1, · · · , β^0_q)′ are nonzero and the other (p − q) coefficients β^0_2 = (β^0_{q+1}, · · · , β^0_p)′ are zero. Thus supp(β^0) = {1, · · · , q}. Define β^0_min = min(|β^0_1|, · · · , |β^0_q|). The primary goal of model selection is to recover these nonzero coefficients, or equivalently the informative features (x_1, · · · , x_q). In this work we investigate feature selection in the context of high-dimensional data (p > n) and sparse true models (q ≪ n/log p). The original sample (X, Y) consists of an n × p feature matrix X and an n-dimensional vector Y representing n independent observations on the response variable Y. Let μ = E(Y) and ε = Y − μ. Throughout this article we consider only deterministic designs. Let X_1 be the submatrix spanned by the first q columns of X and X_2 the submatrix spanned by the remaining p − q columns. Let X_{[,j]} denote the j-th column of X and suppose ‖X_{[,j]}‖²_2 = n (j = 1, ..., p). The p × p matrix X′X can be expressed in the block form X′X = [X′_1X_1, X′_1X_2; X′_2X_1, X′_2X_2]. Suppose Σ^n_11 = n^{−1} X′_1X_1 → Σ_11 elementwise as n → ∞ and that the two q × q matrices Σ^n_11 and Σ_11 are both positive definite, with eigenvalues 0 < Λ^n_1 ≤ · · · ≤ Λ^n_q < q and 0 < Λ_1 ≤ · · · ≤ Λ_q < q, respectively. Denote n^{−1} X′_1X_2 by Σ^n_12, which converges to Σ_12 elementwise as n → ∞.

Proposed method
This paper focuses on the feature selection problem (i.e., recovery of the nonzero coefficients) via a regularization criterion of the form β̂_λ = argmin_β { ‖Y − Xβ‖²_2 + p_λ(β) }, (3.1) where ‖Y − Xβ‖_2 is the Euclidean distance between Y and Xβ and p_λ(β) is a penalty controlling model dimension. The family of regularization criteria (3.1) includes the L_1 regularization method (i.e., LASSO), where p_λ(β) = λ‖β‖_1 (λ > 0); the L_0 regularization method (i.e., information criteria), where p_λ(β) = λ‖β‖_0 (λ > 0); and other methods such as MCP and MCR, both of which are nonconvex regularizations. In this article we are mainly interested in feature selection by LASSO, in which the L_1 penalty λ‖β‖_1 imposes sparsity by shrinking some coefficients to zero, and the regularization parameter λ controls the sparsity level in that a large λ leads to large bias and great sparsity. The features corresponding to the nonzero entries of β̂_{LA,λ} constitute the model selected by LASSO with regularization parameter λ. As a convex regularization criterion, LASSO is more computationally efficient than nonconvex criteria such as MCP, which can achieve nearly unbiased feature selection at the price of solving a nonconvex optimization problem.
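To make the shrinkage mechanism concrete: under an orthonormal design, the LASSO solution is the least-squares estimate passed through the soft-thresholding operator, so a larger λ zeroes out more coefficients. The following toy sketch (an illustration of this textbook fact, not the paper's procedure) demonstrates the effect:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: the LASSO solution under an orthonormal design."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# hypothetical least-squares coefficients (illustrative values only)
ols = np.array([2.5, -1.2, 0.4, -0.1])
for lam in (0.0, 0.5, 1.5):
    beta = soft_threshold(ols, lam)
    print(lam, beta, "nonzeros:", int(np.count_nonzero(beta)))
# increasing lam shrinks every coefficient toward zero and kills the small ones first
```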

The Beta-min conditions of convex and nonconvex regularizations
Both convex and nonconvex regularization criteria need to meet their own "weak correlation and strong signal" conditions to achieve exact recovery of the true model [10]. For feature selection by LASSO, these conditions are the Irrepresentable condition and the Beta-min condition. In the language of [13], the Irrepresentable condition is

|Σ^n_21 (Σ^n_11)^{−1} sign(β^0_1)| < 1 − δ, (3.3)

where δ is a scalar between 0 and 1, the left-hand side is a (p − q)-dimensional vector, and < denotes that each element of the vector is less than the scalar. The Irrepresentable condition (3.3) rules out strong correlation in the design matrix. The Beta-min condition for LASSO is formulated in [4] as

min_{1≤j≤q} |β^0_j| ≥ Cσ √(q log p / (n φ²_0)), (3.4)

where C is a generic constant and φ²_0 is the so-called compatibility constant determined by the design matrix X. Consequently, small nonzero regression coefficients below the threshold Cσ√(q log p/(n φ²_0)) cannot be detected by LASSO (in a consistent way). In [10], the LASSO-version Beta-min condition is presented as

min_{1≤j≤q} |β^0_j| ≥ C_1 σ √(q log p / n), (3.5)

where C_1 corresponds to C/φ_0 in (3.4). As for MCP, the sparse Riesz condition (SRC), the counterpart of the Irrepresentable condition (3.3) for LASSO, is assumed and rules out strong columnwise correlation in the design matrix [9]. Under SRC, the Beta-min condition for MCP requires that

min_{1≤j≤q} |β^0_j| ≥ C_2 σ √(log p / n), (3.6)

where C_2 is a constant depending on X.
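The left-hand side of the Irrepresentable condition (3.3) is easy to evaluate numerically for a given Gram matrix. The sketch below works at the population level with a hypothetical equicorrelated design and a sign vector of all ones (both our own assumptions); it also previews the effect of perturbation discussed later, since inflating the diagonal by τ² shrinks the Irrepresentable quantity:

```python
import numpy as np

def irrepresentable_value(Sigma, q, tau2=0.0):
    """max_j |Sigma_21 (Sigma_11 + tau^2 I)^{-1} sign(beta_1)| for a population
    Gram matrix, assuming the first q coefficients are positive (sign vector of ones)."""
    S11 = Sigma[:q, :q] + tau2 * np.eye(q)
    S21 = Sigma[q:, :q]
    v = S21 @ np.linalg.solve(S11, np.ones(q))
    return float(np.max(np.abs(v)))

# hypothetical equicorrelated design: unit variances, all correlations = 0.9
p, q, rho = 8, 5, 0.9
Sigma = np.full((p, p), rho)
np.fill_diagonal(Sigma, 1.0)
before = irrepresentable_value(Sigma, q)            # ~0.978: condition nearly violated
after = irrepresentable_value(Sigma, q, tau2=4.0)   # diagonal inflation shrinks it
print(before, after)
```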
It is worth noting that the Beta-min condition of MCP (3.6) is independent of q, and consequently, weaker than that of LASSO (3.4). As discussed in [9] and [10], it is the bias of LASSO that causes the threshold value of LASSO to be " √ q times larger" than that of MCP in magnitude. We will confirm by simulations in Section 4 that MCP does outperform LASSO in terms of detecting small nonzero regression coefficients. Next, we develop a procedure based on LASSO to overcome both the Beta-min condition and the Irrepresentable condition, and this procedure outperforms MCP and other competing methods in terms of selection accuracy across various settings.

The algorithm: Perturbed LASSO (PLA)
For any regularization procedure, strong correlation in the design matrix and weak signals (i.e., small nonzero regression coefficients) are two formidable hindrances to selection accuracy. An intuitive but effective approach to weakening correlations is to create a perturbed design matrix Z by adding to X an n × p random matrix composed of np i.i.d. entries distributed as N(0, τ²). The columnwise correlation of the perturbed design matrix then falls below 1/(1 + τ²), and consequently the Irrepresentable condition is overcome when τ is large enough. Moreover, as shown in Section 3.3, the random perturbation improves the chance of recovering weak signals by adding a random amount to the regression coefficients while the threshold is kept unchanged. Given a set of predefined regularization parameters Λ = {λ_1, · · · , λ_D}, the above scheme is formalized in the following two-step procedure, which we call Perturbed LASSO (PLA).
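A quick numerical check of the decorrelation claim: after adding independent N(0, τ²) noise to two unit-variance columns with correlation ρ, their sample correlation drops to about ρ/(1 + τ²), below the 1/(1 + τ²) bound. The sample sizes and seed below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n, tau = 20000, 2.0

# two unit-variance columns with population correlation 0.9
z = rng.standard_normal(n)
x1 = z
x2 = 0.9 * z + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(n)
r_before = np.corrcoef(x1, x2)[0, 1]

# perturb each column with independent N(0, tau^2) noise
z1 = x1 + tau * rng.standard_normal(n)
z2 = x2 + tau * rng.standard_normal(n)
r_after = np.corrcoef(z1, z2)[0, 1]
print(r_before, r_after)  # roughly 0.9 and 0.9 / (1 + tau^2) = 0.18
```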

Inclusion: Obtain estimates β̂_{LA,λ} by applying LASSO with each λ ∈ Λ [2] to the original sample (X, Y). Generate H independent n × p random matrices Ξ_h (h = 1, · · · , H), each composed of np entries identically and independently distributed as N(0, 1), and form the perturbed samples (Z_h, W_h) with Z_h = X + τΞ_h, as in (3.8) and (3.9). In summary, PLA proceeds in two steps: 1) create a model space M_{PLA,Λ,H} by repeated perturbations combined with a set of preselected penalty coefficients; 2) perform model selection within this space by an information criterion. In this article we adopt RIC_c, whose advantage over other information criteria has been demonstrated in [11]. The model subspace M_{PLA,Λ_h} (h = 1, · · · , H) generated by each perturbation includes D models, and the union space M_{PLA,Λ,H} includes no more than DH unique models. Therefore, in the following simulations three indices are adopted to assess performance: the inclusion accuracy (P_1), measuring whether the true model is included in M_{PLA,Λ,H}; the selection accuracy (P_2), measuring whether the true model is ultimately selected; and the size (N), the number of unique models in M_{PLA,Λ,H}. Increasing D and H improves inclusion accuracy at the cost of computing load.
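The two steps of PLA can be sketched as follows. This is a simplified illustration under our own assumptions: LASSO is solved by plain coordinate descent, only the design matrix is perturbed (the paper also adjusts the response via Eq. (3.9), not reproduced here), and BIC stands in for RIC_c, whose exact form is not given in this section:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for min_b (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ beta                        # current residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]          # remove feature j's contribution
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - n * lam, 0.0) / col_sq[j]
            r -= X[:, j] * beta[j]
    return beta

def pla_model_space(X, y, lambdas, H, tau, rng):
    """Step 1 of PLA: union of supports selected across H perturbed samples."""
    supports = set()
    for _ in range(H):
        Z = X + tau * rng.standard_normal(X.shape)   # Z_h = X + tau * Xi_h
        for lam in lambdas:
            b = lasso_cd(Z, y, lam)
            supports.add(tuple(np.flatnonzero(np.abs(b) > 1e-8)))
    return supports

def select_by_bic(X, y, supports):
    """Step 2: select a model on the ORIGINAL data; BIC stands in for RIC_c."""
    n = len(y)
    best, best_crit = (), np.inf
    for S in supports:
        k = len(S)
        if k == 0 or k >= n:
            continue
        Xs = X[:, list(S)]
        bhat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(((y - Xs @ bhat) ** 2).sum())
        crit = n * np.log(rss / n) + np.log(n) * k
        if crit < best_crit:
            best, best_crit = S, crit
    return set(best)

# toy run with a strong signal so that recovery is easy
rng = np.random.default_rng(1)
n, p, q = 100, 20, 3
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:q] = 3.0
y = X @ beta0 + rng.standard_normal(n)
space = pla_model_space(X, y, lambdas=[0.05, 0.1, 0.2, 0.4], H=3, tau=0.3, rng=rng)
chosen = select_by_bic(X, y, space)
```

The design choice mirrors the paper's logic: the expensive part (Step 1) builds a small union space of candidate supports, so the final selection (Step 2) only has to compare at most DH models on the original data.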

Y. Zhang
Obviously, a large τ can remove correlation (almost) entirely, but an overly large τ will blur the relationship between Y and (x_1, · · · , x_p). Hence the perturbation size τ must balance these two goals. Next we conduct a theoretical analysis of PLA and provide guidance on the choice of the perturbation size τ and the perturbation number H.

Theory
For a perturbed sample (Z_h, W_h) generated by (3.8) and (3.9), let θ̂^{LA,λ}_h = argmin_θ ( ‖W_h − Z_h θ‖²_2 + λ‖θ‖_1 ). Split the perturbed feature matrix Z_h into two submatrices Z_{h,1} and Z_{h,2}, spanned by the first q and the remaining (p − q) columns, respectively. Accordingly, split Ξ_h into Ξ_{h,1} and Ξ_{h,2}. Using the KKT conditions [3], we derive four sufficient conditions for θ̂^{LA,λ}_h and β^0 to have the same sign, which are summarized in the following proposition. Let * denote the elementwise product of two vectors. Proposition 3.1. If Conditions (3.10)–(3.13) hold for a scalar η ∈ (−∞, ∞) and a scalar δ ∈ (0, 1), then θ̂^{LA,λ}_h and β^0 have the same sign.
In the above proposition, Conditions (3.10) and (3.11), each consisting of (p − q) inequalities, serve to exclude all uninformative features, whereas Conditions (3.12) and (3.13), each consisting of q inequalities, serve to include all informative features. The proof of Proposition 3.1 is deferred to the Appendix. To distinguish them from (3.3) and (3.4), these four conditions are named the p-Irrepresentable, p-Exclusion, p-Beta-min and p-Inclusion conditions, in order, where "p-" stands for perturbed. It is worth pointing out that Conditions (3.12) and (3.13) are derived from θ̂^{LA,λ}_{h,1}: the former is a sufficient and necessary condition for sign(θ̂^{LA,λ}_{h,1}) = sign(β^0_1), whereas the latter is sufficient but unnecessary. Next we briefly discuss how PLA overcomes the Irrepresentable condition and the Beta-min condition. The columnwise correlation of Z_h decreases to zero as τ goes to ∞, so that all columns become uncorrelated and the Irrepresentable condition is met (in an asymptotic sense). In more detail, each element of the left-hand side of (3.10) is bounded by qτ^{−2} in probability (the proof is provided in the Appendix: Lemmas 1 and 2). Thus, when the perturbation size τ > √q, the Irrepresentable condition is satisfied (asymptotically). The perturbed Beta-min condition (3.12) differs from the Beta-min condition (3.4) by a q-dimensional random vector whose j-th component b_{h,j} has standard deviation σ_{b_j} inversely proportional to τ (more details are provided in the proof of Theorem 1). Consequently, when the Beta-min condition (3.4) is violated (i.e., |β^0_j| ≤ η for some j), the random term b_{h,j} can push |β^0_j| above η, and a small τ is preferred. In theory, no matter how small a nonzero regression coefficient is, it can always be recovered by performing perturbations many times (i.e., with large H).
Let κ²_0 be the Restricted Eigenvalue constant as defined in Eq. (16) of [2]. Concerning the inclusion consistency of PLA we have the following conclusion, Theorem 1, whose probability bound is given in (3.14). Theorem 1 establishes inclusion consistency, i.e., that the probability that the true model is included in the subspace M_{PLA,Λ,H} goes to 1. The proof of Theorem 1 is deferred to the Appendix.
There are several consequences of Theorem 1.
1. As indicated by Theorem 1 and discussed above, a large τ is desired for overcoming the Irrepresentable condition, while a small τ is preferred for overcoming the Beta-min condition. We adopt a perturbation size of order q^{3/4} (cf. the condition τ ≥ √8 q^{3/4} assumed in the Appendix), which overcomes the Irrepresentable condition entirely under the sparsity assumption q ≪ n/log p.
2. The lower bound (3.14) on the inclusion probability of PLA decreases in (σ, p, q) but increases in (n, κ²_0, |β^0_1| and H). Hence, if there are small |β^0_j| (j = 1, · · · , q), strong correlation in X (small κ²_0), large p and/or large q, then a large H is needed to achieve high selection accuracy.

3. In the context of "strong correlation and strong signal", where the Beta-min condition (3.4) is met but the Irrepresentable condition (3.3) is not, PLA can attain 100% inclusion accuracy with a single perturbation, while the inclusion accuracy of LASSO is 0. This is verified by simulations in Section 4.
The computing cost and selection performance of PLA are examined by simulations in Section 4, where we assume that the true model is linear, sparse and included in the candidate model space. However, it is impractical to suppose the existence of a sparse linear true model in real data, and this case is investigated in Section 5.

Simulations
For the data-generating process, each row of the n × p design matrix X = (X_{i,j}) is independently generated from N(0, Σ_{p×p}), where 0 denotes the p-dimensional vector of zeros and Σ_{p×p} = (ρ^{|j_1−j_2|}) denotes a p × p covariance matrix, with j_1 and j_2 (j_1, j_2 = 1, · · · , p) denoting the row and column indices. Two values of ρ (0.6 and 0.9) are examined. The responses are generated from the true model (2.2) with σ = 1. Among β^0, q randomly selected coefficients (β^0_{j_1}, · · · , β^0_{j_q}) (q = 9 or 12) are assigned nonzero values by a rule involving k and the shrinkage factor α (α = 0.6 or 0.3), and all other β_j's are assigned 0. Thus the minimal nonzero regression coefficient is β^0_min = 2α. Four values of k (1, 3, 5, 7) are examined. As ρ and k increase and α decreases, the Beta-min and Irrepresentable conditions are more likely to be violated and the performance of every procedure deteriorates.
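Design matrices of this kind can be generated from a Cholesky factor of the AR(1)-type covariance Σ_{p×p} = (ρ^{|j_1−j_2|}); a minimal sketch (parameter values are illustrative):

```python
import numpy as np

def ar1_design(n, p, rho, rng):
    """Rows i.i.d. N(0, Sigma) with Sigma[j1, j2] = rho**|j1 - j2|."""
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)
    return rng.standard_normal((n, p)) @ L.T

rng = np.random.default_rng(0)
X = ar1_design(n=2000, p=5, rho=0.6, rng=rng)
emp = np.corrcoef(X, rowvar=False)
print(np.round(emp, 2))  # adjacent columns correlate near 0.6, decaying with distance
```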
In all simulations, σ 2 is assumed unknown and the penalty coefficients, Λ = {λ 1 , · · · , λ 100 } that are the default candidates in the R package ncvreg are utilized. For each setting, 100 replications (realizations of sample data) are performed. In PLA and BoLASSO (BLA), H perturbations or bootstrappings (H = 10 or 1000) are performed.
The following six procedures are compared: LAS-RIC_c, MCP-RIC_c, BLA-RIC_c, PLA-RIC_c, CV-LAS and CV-MCP. The first four share a two-step structure: (1) Inclusion: creation of a model subspace; (2) Selection: model selection within the subspace by an information criterion. Their performance is assessed by the inclusion accuracy (P_1), the proportion of replications in which the method's subspace includes the true model M_0, and the selection accuracy (P_2), the proportion of replications in which the method selects the true model M_0. Undoubtedly, P_2 ≤ P_1 for all procedures. Though our ultimate goal is model selection, we emphasize that the inclusion accuracy, rather than the selection accuracy, is the more adequate measure of whether a procedure overcomes the Beta-min condition, because a misused information criterion may misidentify the true model. Therefore, in the following tables the inclusion accuracy is highlighted in bold font and square brackets. Additionally, we report the size of the model subspace (N), i.e., the number of unique models; the system time consumed in creating the model subspace (T_1); and the system time for model selection within the subspace (T_2). The sum of T_1 and T_2 measures the total computational time of each procedure; the unit of T_1 and T_2 is seconds.
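For concreteness, the three indices P_1, P_2 and N can be computed from replication records as in the following sketch (the record format is our own assumption):

```python
def performance_indices(replications, true_model):
    """P1: inclusion accuracy, P2: selection accuracy, N: mean model-space size.
    Each replication is a pair (model_space: set of frozensets, selected: frozenset)."""
    t = frozenset(true_model)
    R = len(replications)
    p1 = sum(t in space for space, _ in replications) / R
    p2 = sum(sel == t for _, sel in replications) / R
    n_avg = sum(len(space) for space, _ in replications) / R
    return p1, p2, n_avg

# hypothetical records from 3 replications with true model {1, 2}
reps = [
    ({frozenset({1, 2}), frozenset({1, 2, 3})}, frozenset({1, 2})),
    ({frozenset({1, 2})}, frozenset({1, 2})),
    ({frozenset({1, 3})}, frozenset({1, 3})),
]
p1, p2, n_avg = performance_indices(reps, {1, 2})
print(p1, p2, n_avg)  # 2/3, 2/3, 4/3; note P2 <= P1 always holds
```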
As a reference, we also examine the selection performance of LASSO and MCP equipped with cross-validation (CV-LAS and CV-MCP). Though 5-fold or 10-fold CV is often adopted to tune the penalty level, neither can guarantee that the optimal λ is tuned [12]. Therefore, when LASSO or MCP misidentifies the true model, it may be due to mistuning of the penalty. Even worse, the bias of LASSO often leads to the choice of an overly small λ, which causes a large number of false positives and consequently extremely low selection accuracy. Therefore, we strongly recommend the strategy of "screen by LASSO (i.e., creating a model space with a set of regularization parameters Λ) and select by information criterion" [11] in place of the more routine procedure "tune by CV and select by LASSO"; the latter is also more computationally burdensome than the former, as shown in the following simulations. For CV-LAS and CV-MCP, only the selection accuracy (P) and total computational time (T) are reported.
The results are as follows. As shown in Tables 1 and 2, LAS-RIC_c beats CV-LAS in both computational efficiency and selection accuracy across various settings. The same pattern is observed between MCP-RIC_c and CV-MCP, though the gap is not as large as for LASSO.
It is worth noting that in the case of moderate correlation (ρ = 0.6), the inclusion accuracy (P_1) of MCP-RIC_c does not change with q, but the inclusion accuracy of LAS-RIC_c drops greatly as q goes up from 9 to 12. This is caused by the fact that the threshold in the Beta-min condition of LASSO (3.4) increases with q, but the threshold of MCP (3.6) does not. However, in the strong correlation case (ρ = 0.9), the performance of MCP-RIC_c worsens as the true model size q increases from 9 to 12, because a large q raises the chance of violating the Irrepresentable condition or the sparse Riesz condition, especially when strong correlations exist in X.
As shown in Tables 1 and 2, in the nicest case, where the columnwise correlation is moderate (ρ = 0.6) and the nonzero coefficients are "sufficiently large" (α = 0.6, so that β^0_min = 1.2), the inclusion accuracy of LAS-RIC_c is above 88% while all the other two-step procedures (i.e., MCP-RIC_c, BLA-RIC_c and PLA-RIC_c) achieve above 99% inclusion accuracy. However, the performance of all six procedures deteriorates gradually as ρ and q increase and the nonzero coefficients decrease in size, which makes the Irrepresentable condition and the Beta-min condition less likely to hold. Overall, PLA with H = 1000 outperforms all competitors in all cases. In the worst case (q = 12, ρ = 0.9, q_s = 7 and β^0_min = 0.6), the inclusion accuracies (P_1) of LAS-RIC_c and MCP-RIC_c are 0 and 11%, respectively. Our procedure PLA elevates the inclusion accuracy to 34% with H = 10 and to 61% with H = 1000. Another case worthy of attention is that PLA achieves almost 100% inclusion and selection accuracy with only H = 10 perturbations when both signal and correlation are strong (ρ = 0.9 and α = 0.6), whereas the inclusion accuracy of LASSO is below 5%. This demonstrates the power of PLA in overcoming the Irrepresentable condition, and it confirms a conclusion implied by Theorem 1: the computing load is mainly driven by small nonzero β^0_j when the true model is sparse. In conclusion, LASSO requires a stronger Beta-min condition than MCP and consequently suffers lower selection accuracy, but PLA lowers the threshold to 0 (as H grows large enough) and accomplishes higher selection accuracy than MCP.
Between the two resampling-based methods, BoLASSO and PLA, a notable difference is that the performance of PLA always improves with increasing H, which supports the conclusion of Theorem 1, whereas that of BoLASSO does not. In particular, when there are some "small" coefficients (α = 0.3, so that β^0_min = 0.6), the inclusion accuracy of BoLASSO worsens as H goes up, because a large H increases the chance of missing relevant features in some resamples. Overall, PLA achieves higher inclusion (P_1) and selection accuracy (P_2) than BoLASSO while costing less computing time. Examining T_1 and T_2, it is clear that the first step of BoLASSO and PLA accounts for most of the computing load. For example, in the worst case (q = 12, ρ = 0.9, q_s = 7 and β^0_min = 0.6), the first step of PLA used 1461 seconds to create a space of 6648 unique models, while the second step, model selection by RIC_c, used only 16 seconds.

Additional simulation results under various (n, p, ρ, β^0, Λ), further supporting the robustness and flexibility of PLA, are available upon request. Patterns similar to those presented in this section are observed, further demonstrating the success of the proposed method. The often sizable advantage of PLA over its competitors, especially when there are small regression coefficients and strong correlation among features, makes PLA a powerful tool for high-dimensional variable selection.
We also examined the performance when the perturbation scheme is paired with MCP, referred to as "Perturbed MCP"; the simulation results are available upon request. As demonstrated by the simulations, Perturbed MCP improves the inclusion and selection accuracy of MCP, but not as substantially as Perturbed LASSO improves LASSO. Furthermore, Perturbed LASSO outperforms Perturbed MCP across all settings.
Due to constraints on computing resources, it is hard to recover overly small regression coefficients, which may not contribute to prediction in the "large p, small n" context anyway. However, the gains in selection and prediction are always directly proportional to the investment in computation, as shown in the following real data example.

Real data application
In this section we analyze the riboflavin dataset, which concerns vitamin B2 production and is publicly available through the R package hdi (www.r-project.org) [5]. The data comprise 71 observations on a single real-valued response variable, the logarithm of the B2 production rate, and p = 4088 features measuring the logarithm of the expression levels of 4088 genes.
First, we compare the predictive performance of the four two-step procedures (PLA-RIC_c, BLA-RIC_c, LAS-RIC_c, MCP-RIC_c) studied in Section 4. In PLA-RIC_c and BLA-RIC_c, H = 1000 perturbations or bootstrap resamplings are performed. The comparison is done in three steps. First, the 71 observations are divided into an evaluation set of size N_e (N_e = 2, · · · , 5) and an estimation set of size n = 71 − N_e. Second, each of the four competitors develops a predictive model from the estimation set, which is used to make predictions on the evaluation set. Finally, we repeat the above two steps 100 times, selecting the evaluation sets at random each time, and obtain 100 average prediction errors for each method. The average mean squared error (MSE) for each of the four procedures is displayed in Table 3. From Table 3, PLA-RIC_c yields the best overall predictive accuracy.
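The three-step comparison can be sketched as below, with ordinary least squares on a fixed candidate support standing in for each procedure's fitted model, and synthetic data standing in for the riboflavin sample (both our own simplifications):

```python
import numpy as np

def split_mse(X, y, support, n_eval, n_rep, rng):
    """Average test MSE of OLS on `support` over n_rep random
    estimation/evaluation splits (evaluation set of size n_eval)."""
    n = len(y)
    errs = []
    for _ in range(n_rep):
        perm = rng.permutation(n)
        ev, tr = perm[:n_eval], perm[n_eval:]
        Xs = X[:, support]
        bhat, *_ = np.linalg.lstsq(Xs[tr], y[tr], rcond=None)
        errs.append(float(np.mean((y[ev] - Xs[ev] @ bhat) ** 2)))
    return float(np.mean(errs))

# toy data standing in for the riboflavin sample (n = 71 observations)
rng = np.random.default_rng(3)
n, p = 71, 50
X = rng.standard_normal((n, p))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(n)
mse_true = split_mse(X, y, [0, 1], n_eval=5, n_rep=100, rng=rng)
mse_wrong = split_mse(X, y, [10, 11], n_eval=5, n_rep=100, rng=rng)
print(mse_true, mse_wrong)  # the informative support predicts far better
```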
Next, we apply PLA-RIC_c to the whole sample and select six genes: ARGF at, XHLB at, YDDK at, YEBC at, YOAB at and YXLD at. A similar analysis was done by [5] using the R package hdi, and three genes were selected: LYSC at, YOAB at and YXLD at. The linear regression analysis based on the two models is done in R, and the output is presented in Table 4. As shown in Table 4, PLA-RIC_c recovered regression coefficients whose absolute values are below 0.4, while the other approach recovered only coefficients whose absolute values are above 0.4.

Discussion
In high-dimensional feature selection problems, strong correlation and weak signals are the two main hindrances to exact recovery of the informative features. The method developed in this paper offers a solution by adding perturbations to the design matrix. As confirmed by the simulations, PLA achieves a substantial advantage over other methods in selection accuracy. At the same time, our method performs well in prediction when the true model has unknown form or may be excluded from the candidate model space, as demonstrated in the real data example. Another great difficulty in high-dimensional feature selection is the immense computing load caused by a huge model space. The literature has focused on the trade-off between parsimony and goodness-of-fit, while the trade-off between selection accuracy and computing efficiency has remained largely unexplored. From a theoretical perspective, we investigate this trade-off by establishing a quantitative relationship between selection consistency and computation, and we provide some guidance on how to balance selection with computation. Further investigation is needed to generalize the proposed method to nonlinear regression models.

Appendix A: Proof
First, we prove some inequalities used in the proofs of Proposition 3.1 and Theorem 1.
A symmetric positive definite q × q matrix Σ_11 admits the spectral decomposition Σ_11 = Σ_{k=1}^q Λ_k A_k, where A_1, · · · , A_q are q × q symmetric and idempotent matrices. Similarly, the matrix Σ^n_11 = X′_1X_1/n can be decomposed as Σ^n_11 = Σ_{k=1}^q Λ^n_k A^n_k, where A^n_1, · · · , A^n_q are q × q symmetric and idempotent matrices. Let γ_1 = (Σ_11 + τ² I_{q,q})^{−1} sign(β_1) and γ_2 = (Σ_11 + τ² I_{q,q})^{−1} 0^{(j)}, where 0^{(j)} is the q-dimensional vector of zeros except for the j-th entry being 1. The bounds stated below then hold, and they continue to hold with Σ_11 replaced by Σ^n_11. Proof. First of all, we have

Proof of Theorem 1
First we derive probability lower bounds for the perturbed Irrepresentable, Exclusion, Beta-min and Inclusion conditions. Suppose τ ≥ √8 q^{3/4}.

Step 1: Perturbed Irrepresentable Condition
Let A_h denote the event that the perturbed Irrepresentable condition holds at the h-th perturbation, and let X_{[,j]} be the j-th column of X (j = q + 1, · · · , p). Then Pr(A^c_{2h}) = o(Pr(A^c_{1h})), and the stated lower bound for Pr(A_h) follows.

Step 2: Perturbed Exclusion Condition
Let B_h denote the event that the perturbed Exclusion condition holds at the h-th perturbation. By arguments similar to those of Step 1, Pr(B^c_{2h}) = o(Pr(B^c_{1h})).
We have established probability lower bounds for Pr(A_h), Pr(B_h), Pr(C_h) and Pr(D_h); it remains to derive an upper bound for Pr(∩^H_{h=1}(A^c_h ∪ B^c_h ∪ C^c_h ∪ D^c_h)). Combining the lower bounds above with elementary probability inequalities yields (3.14). This completes the proof.