Variable Selection via Adaptive False Negative Control in Linear Regression

Variable selection methods have been developed in linear regression to provide sparse solutions. Recent studies have focused on further interpreting these sparse solutions in terms of false positive control. In this paper, we consider false negative control for variable selection, with the goal of efficiently selecting a high proportion of relevant predictors. Different from existing studies in power analysis and sure screening, we propose to directly estimate the false negative proportion (FNP) of a decision rule and select the smallest subset of predictors whose estimated FNP is less than a user-specified control level. The proposed method is adaptive to the user-specified control level on FNP, selecting fewer candidates when a higher level is implemented. On the other hand, when the data have stronger effect sizes or a larger sample size, the proposed method controls FNP more efficiently with fewer false positives. New analytic techniques are developed to cope with the major challenge of FNP control when relevant predictors cannot be consistently separated from irrelevant ones. Our numerical results are in line with the theoretical findings.


Introduction
We consider a sparse linear model

y = Xβ + ε,   (1.1)

where y = (y_1, ..., y_n)^T is the vector of n observations of the response, X = [x_1, ..., x_p] ∈ R^{n×p} is the design matrix, β = (β_1, ..., β_p)^T is the vector of unknown coefficients, and ε ∼ N_n(0, σ²I) is the vector of random errors. We assume σ² = O(1). Let I_1 = {1 ≤ j ≤ p : β_j ≠ 0} be the set of indices of nonzero coefficients, with cardinality s = |I_1|, and let I_0 = {1 ≤ j ≤ p : β_j = 0}, with cardinality p_0 = |I_0|. Variable selection methods often provide sparse solutions for the estimation of β. The non-zero elements of an estimate correspond to variables selected as candidates for relevant predictors. A large body of literature with many fruitful ideas has contributed to the development of sparse solutions that accommodate the underlying features of the data. We refer to [5] and the references therein for a nice introduction.
Given a selection result, a false positive (FP) occurs when an irrelevant predictor is selected, and a false negative (FN) occurs when a relevant predictor is not selected. It is natural to interpret a selection result in terms of false positive or false negative control, and exciting progress has emerged for false positive control, e.g. [3], [4], [9], [17], [24], [32], [38]. However, the study for efficient false negative control remains relatively underdeveloped.
False negative control is important in many real applications and sometimes a more serious concern than false positive control. For example, in pre-surgical brain mapping with functional MRI, the primary goal is to reduce false negatives where genuine functional areas are not identified. This is because neurosurgical patients are more likely to experience significant harm from mistakenly deeming a region to be functionally uninvolved and subsequently resecting critical tissue than from incorrectly assigning function to an uninvolved region [25,26,31].
Another example where false negative control is of main concern is in the exploratory stage of high-dimensional data analysis, where pre-screening is often conducted to reduce data dimension while keeping a high proportion of true signal variables for follow-up studies.
The problem of false negative control is conceptually related but methodologically very different from Sure Screening in, e.g., [14,15]. Sure Screening aims to reduce the data dimension by removing only irrelevant predictors. For instance, the Sure Independence Screening procedure in [14] ranks variables by estimated marginal regression coefficients and selects the top d variables where d is fixed at n−1 or n/ log n. It has been proved that under certain conditions, the screening procedure has eliminated only irrelevant predictors with high probability. The false negative control problem considered here focuses on selecting a high proportion of relevant predictors without including many unnecessary irrelevant predictors. It may be regarded as a more refined screening procedure with a data-adaptive selection rule instead of a fixed d.
We use the false negative proportion (FNP) as a measure for false negative control. For a given selection rule, FNP is defined as the ratio of the number of false negatives to the total number of relevant predictors. FNP takes values in [0, 1] and is equivalent to 1 − Sensitivity in the binary classification framework. Our work starts with consistently estimating FNP for a given selection rule. To achieve this, we develop novel analyses of the tail behavior of the empirical processes associated with FNP. Based on the estimation of FNP, we develop a new variable selection procedure to control FNP at a user-specified level. If users can tolerate more false negatives, they may implement a higher control level on FNP in the procedure and select fewer candidates for relevant predictors. On the other hand, if the effect of the relevant predictors gets stronger or the sample size increases, the procedure controls FNP more efficiently with fewer false positives.
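To fix ideas, the FNP of a given selection outcome can be computed directly in simulation, where the true support is known. A minimal Python sketch (the function name is ours):

```python
import numpy as np

def fnp(selected, true_support):
    """False negative proportion: fraction of relevant predictors missed."""
    selected = set(selected)
    missed = [j for j in true_support if j not in selected]
    return len(missed) / len(true_support)

# toy example: 5 relevant predictors; the rule misses one of them
true_support = [0, 1, 2, 3, 4]
selected = [0, 1, 2, 3, 7, 9]        # one false negative (4), two false positives
print(fnp(selected, true_support))   # 0.2
```

Note that 1 − FNP is the sensitivity of the selection rule.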
An important component of the proposed FNP control method is an estimator for the number of relevant predictors. We provide a consistent estimator for dependent test statistics, for which we adopt the recently developed debiased Lasso estimates [20,34,37].
Although FNP, by definition, is related to the power in (single) hypothesis testing, our proposed study of FNP control is very different from the existing power analysis in hypothesis testing. In the latter, a decision rule is built upon Type I error control and followed by power calculation with assumptions on the effect size. For such methods to control FNP in addition to controlling the family-wise Type I error when multiple hypotheses are considered, the effect sizes of the relevant variables need to be large enough to ensure essentially perfect separation of relevant and irrelevant variables. The proposed method, on the other hand, directly bounds the estimated FNP at a user-specified level, which allows more effective control of FNP. Our condition on the effect size for FNP control is shown to be weaker than the existing beta-min conditions that are required for perfect separation of relevant and irrelevant variables.
The rest of the paper is organized as follows. Section 2 presents FNP estimation in two steps: (1) approximating FNP at a given cut-off, and (2) estimating the number of relevant predictors. Section 3 presents the proposed selection method to control FNP at a user-specified level and a computational algorithm to implement the method. Section 4 presents the finite-sample performance of the proposed method in simulation. Conclusions and further discussion are provided in Section 5. Proofs of the main theoretical results are presented in Section 6. Extra technical details are provided in the Appendix.

False Negative Proportion Estimation
Recall that for a selection rule, FNP is the ratio of the number of false negatives to the total number of relevant predictors. In this section, we rank the predictors based on the debiased Lasso estimates and approximate FNP at a given cut-off point on the list of ranked predictors.

Test Statistics Based on Debiased Lasso Estimates
Recall model (1.1). The well-known Lasso estimator is

β̂ = argmin_{β ∈ R^p} { (2n)^{-1} ‖y − Xβ‖₂² + λ‖β‖₁ },   (2.1)

where λ is a tuning parameter [33]. Recently, the debiased Lasso estimator has been developed to mitigate the bias of the Lasso estimator [34,37]. The debiased Lasso estimator is defined as

b̂ = β̂ + n^{-1} Θ̂ X^T (y − Xβ̂),   (2.2)

where Θ̂ ∈ R^{p×p} is an estimate of the precision matrix of the predictors and can be obtained via nodewise regression on X as in [28]. Let Σ̂ = n^{-1} X^T X. It has been shown that

√n (b̂ − β) = Θ̂ X^T ε / √n + δ,   δ = √n (Θ̂Σ̂ − I)(β − β̂),   (2.3)

where, given X, Θ̂ X^T ε / √n ∼ N_p(0, σ² Ω̂) with Ω̂ = Θ̂Σ̂Θ̂^T. Under certain conditions, ‖δ‖_∞ = o_p(1), which implies the asymptotic normality of b̂ [6,20,21,34,37]. We present the set of conditions from [21] as A1)-A3) in Appendix A.1.
In this paper, we obtain test statistics for β using the standardized debiased Lasso estimator

z_j = √n b̂_j / (σ Ω̂_jj^{1/2}),   1 ≤ j ≤ p,   (2.5)

where Ω̂_jj denotes the (j, j) entry of Ω̂. Therefore, for each 1 ≤ j ≤ p,

z_j = μ_j + w_j + δ_j / (σ Ω̂_jj^{1/2}),   μ_j = √n β_j / (σ Ω̂_jj^{1/2}),   (2.6)

where, given X, each w_j is standard Normal and the correlation structure of (w_1, ..., w_p)^T is induced by Ω̂.
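To make the construction concrete, the sketch below computes such standardized statistics in Python. It uses a plain ISTA solver in place of a production Lasso implementation and takes an externally supplied precision-matrix estimate Theta_hat (in practice obtained by nodewise regression; in simulation the oracle precision can be passed in). Function names are ours; this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Plain ISTA solver for (2n)^{-1}||y - Xb||^2 + lam*||b||_1 (illustrative)."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return beta

def debiased_stats(X, y, Theta_hat, lam, sigma):
    """Standardized debiased Lasso statistics z_j (sketch of the construction)."""
    n, p = X.shape
    beta_l = lasso_ista(X, y, lam)
    b = beta_l + Theta_hat @ X.T @ (y - X @ beta_l) / n   # debiasing step
    Sigma_hat = X.T @ X / n
    Omega = Theta_hat @ Sigma_hat @ Theta_hat.T
    return np.sqrt(n) * b / (sigma * np.sqrt(np.diag(Omega)))
```

In this sketch the noise level sigma is taken as known; in practice it would be estimated, e.g. by the scaled Lasso.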

Approximating False Negative Proportion
We aim to determine a cut-off value for the realized test statistics to control the false negative proportion (FNP) at a user-specified level. For this purpose, we first study the consistent estimation of FNP. For any t > 0, define

FN(t) = Σ_{j∈I_1} 1{|z_j| ≤ t},   FP(t) = Σ_{j∈I_0} 1{|z_j| > t},   R(t) = Σ_{j=1}^p 1{|z_j| > t}.

Note that FN(t) is unobservable as I_1 is unknown, and that the dependence among the z_j's also affects FN(t). It is easy to see that

FNP(t) = FN(t)/s   (2.7)

and

FN(t) = s − R(t) + FP(t).   (2.8)

Since R(t) is directly observable from the data, the unknown quantities in (2.8) are FP(t) and s. We propose to substitute FP(t) in (2.8) by 2(p − s)Φ(−t), where Φ(·) is the cumulative distribution function (CDF) of a standard Normal random variable, because z_j is asymptotically standard Normal for j ∈ I_0. Further, we can plug in an estimator ŝ for s, which results in the estimator

F̂NP(t) = 1 − ( R(t) − 2(p − ŝ)Φ(−t) ) / ŝ.

From the definitions of FNP(t) and F̂NP(t), it can be shown that |F̂NP(t) − FNP(t)| = o_P(1) provided that

s^{-1} |FP(t) − 2 p_0 Φ(−t)| = o_P(1)  and  |ŝ/s − 1| = o_P(1).   (2.10)

Because FP(t) is the summation of p_0 terms and s can be much smaller than p_0, approximating FP(t)/s requires more delicate analysis than approximating FP(t)/p_0, which has been studied in the literature on False Discovery Proportion (FDP) control (e.g. [13]). Also, the dependence among the test statistics {z_j}_{j=1}^p adds another layer of difficulty.
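The resulting estimator is straightforward to compute once ŝ is available. A minimal Python sketch (the function names and the clipping of the estimate to [0, 1] are our additions for numerical safety):

```python
import numpy as np
from math import erfc, sqrt

def Phi_upper(t):
    """P(N(0,1) > t) via the complementary error function."""
    return 0.5 * erfc(t / sqrt(2.0))

def fnp_hat(z, t, s_hat):
    """Estimated FNP at threshold t.

    R(t) is observable from the data; the unobservable FP(t) is replaced
    by its approximate null expectation 2*(p - s_hat)*Phi(-t)."""
    z = np.asarray(z)
    p = z.size
    R_t = np.sum(np.abs(z) > t)
    est = 1.0 - (R_t - 2 * (p - s_hat) * Phi_upper(t)) / s_hat
    return min(max(est, 0.0), 1.0)   # clip to [0, 1]
```

For instance, with five strong signals among one hundred statistics, the estimated FNP at a moderate threshold is near zero, while at a very large threshold (where nothing is selected) it is one.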
In this paper, we consider s = p 1−η for some η ∈ (0, 1), so that the number of relevant predictors is of a smaller order than the total number of variables. On the other hand, we consider t values calibrated as t = t ξ = √ 2ξ log p for some ξ > 0, so that the scale of t is comparable to that of the extreme value of p independent standard Gaussian variables. Such calibration has been utilized to study the detection of Gaussian mixtures [2,7,12], and to analyze variable selection consistency in linear regression [23]. In this paper, we adopt the calibration to study the estimation of FNP in linear regression.
Further, define the precision matrix of the predictors as Θ and let

s_max = max_{1≤j≤p} |{1 ≤ k ≤ p : Θ_jk ≠ 0}|.

Namely, the parameter s_max represents the row-sparsity of the precision matrix, which contributes to the strength of the dependence among the test statistics. Define γ* = max{γ*_1, γ*_2} as in (2.11). The next theoretical result demonstrates the range of t values in which the first equation in (2.10) holds.

Theorem 2.1. Consider model (1.1) and the test statistics {z_j}_{j=1}^p in (2.5). Assume conditions A1) through A3) in Appendix A.1 for the asymptotic normality of {z_j}_{j=1}^p. Let s = p^{1−η} for some η ∈ (0, 1) and t = t_ξ = √(2ξ log p) for ξ > 0. Assume ξ > min{η, γ*} for γ* in (2.11). Then

s^{-1} |FP(t_ξ) − 2 p_0 Φ(−t_ξ)| = o_P(1).   (2.12)

Because FP(t) is the summation of p_0 indicator functions and p_0 ≫ s (= p^{1−η}), FP(t)/s blows up at constant t. Theorem 2.1 says that the approximation of FP(t)/s is achievable for t at the scale of t_ξ. This is substantially different from the existing study of FDP control, where the approximations of FP(t)/p_0 and R(t)/p_0 are studied at constant t.
The condition ξ > min{η, γ*} can be decomposed as follows. When η ≤ γ*, we have ξ > η, and the claim in (2.12) follows by showing that s^{-1}FP(t_ξ) = o_p(1) and s^{-1}p_0Φ(−t_ξ) = o(1). On the other hand, when η > γ* and γ* < ξ ≤ η, more delicate analysis is needed to study the variability of FP(t_ξ). The condition ξ > γ*_1 essentially controls the variability of s^{-1}FP_w(t_ξ), where w is the Gaussian component of z as in (2.6) and FP_w(t_ξ) = Σ_{j∈I_0} 1{|w_j| > t_ξ}. The condition ξ > γ*_2 controls the cumulative errors caused by the component δ of z. The existing study in [23] has established an optimal phase diagram in (ξ, η) for high-dimensional variable selection. Their work, however, focuses on scenarios with ξ > η. We extend the analysis to the more challenging case with γ* < ξ ≤ η, for which we study the variability of s^{-1}FP_w(t_ξ) under dependence of the test statistics. Recall the covariance matrix σ²Ω̂ in (2.3). Since Ω̂ = Θ̂Σ̂Θ̂^T and Σ̂ is not a sparse matrix, σ²Ω̂ is neither sparse nor endowed with any well-known structure. The study in [23] imposes conditions on the covariance matrix of the predictors that essentially prohibit excessive signal cancellations when performing marginal regression. Our dependence condition, on the other hand, demonstrates the effect of the sparsity of the precision matrix (s_max) through γ*. Overall, ξ > min{η, γ*} is easier to satisfy with larger n, smaller p, or smaller s_max.
To achieve the second equation in (2.10), we modify the estimator introduced in [29] and study its consistency for estimating s in our setting. We refer to the modified estimator as the MR estimator. Recall the standardized debiased Lasso test statistics {z_j}_{j=1}^p. The MR estimator for the proportion of relevant predictors (π = s/p) is constructed as in (2.13), where c_p is a bounding sequence pre-specified as follows.
In other words, c_p can be regarded as a probabilistic upper bound for V_p, and the implementation of c_p in (2.13) prevents over-estimation of π.
Compared to the original MR estimator in [29], the key modification in (2.13) and (2.14) is the use of F_p(t) and G_p(t), two empirical processes, each with dependent random summands. Naturally, this requires different techniques to find {c_p}_{p≥1}. The setting in [29] considers independent p-values that are uniformly distributed under the null hypothesis. Since the limiting distribution of the uniform empirical process with independent summands is known and has an analytic expression, a bounding sequence can be found directly from that distribution in the construction of the original MR estimator. However, in our setting the {z_j}_{j=1}^p are dependent, and the exact distributions of the {b̂_j}_{j=1}^p are unspecified. In fact, the {b̂_j}_{j=1}^p asymptotically have covariance matrix σ²Ω̂ = σ²Θ̂Σ̂Θ̂^T. In theory, |Ω̂_ij − Θ_ij| = o_p(1) for any (i, j) under conditions A1) through A3) in Appendix A.1. However, Ω̂ itself is neither diagonal nor sparse, and the approximation errors of all the elements of Ω̂ add up to influence V_p. Note that V_p is the higher criticism statistic of [12] based on the Gaussian component w of z. Unfortunately, existing techniques for the higher criticism statistic under short-range and long-range dependence [18] cannot be applied here because our test statistics, with covariance matrix σ²Ω̂, cannot be partitioned as in [18].
In this paper, we employ a discretization technique adopted from [1] to derive bounds on the variance of a discretized version of {H(t) : t > 0} and define a discretized version of V_p as in (2.15), over a grid determined by two positive constants τ_0 and τ_1 with 0 < τ_0 < τ_1. A discretized version of the MR estimator, π̂*, is then defined as in (2.16). Further, define μ_min as a measure of the minimal effect size of the relevant variables. The following theorem demonstrates the consistency of π̂*. Its proof is presented in Section 6.2.
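For intuition, a Meinshausen-Rice-type estimate of π over a discretized t-grid can be sketched as follows. The bounding functional below follows the general form in [29] with a normal-approximation deviation term; the paper's exact functionals in (2.13)-(2.16) may differ, so this is an illustrative stand-in rather than the authors' estimator:

```python
import numpy as np
from math import erfc, sqrt, log

def G(t):
    """Null tail probability P(|N(0,1)| > t) = 2*Phi(-t)."""
    return erfc(t / sqrt(2.0))

def pi_hat_mr(z, c_p, tau0=0.5, tau1=2.0, grid=200):
    """Meinshausen-Rice-type estimate of pi = s/p over a discretized t-grid.

    F_p(t) is the empirical exceedance frequency of the |z_j|; the bounding
    sequence c_p guards against over-estimation.  The deviation term used
    here is a simple normal approximation and is our assumption."""
    z = np.abs(np.asarray(z))
    p = z.size
    ts = np.linspace(tau0 * sqrt(2 * log(p)), tau1 * sqrt(2 * log(p)), grid)
    best = 0.0
    for t in ts:
        Fp = np.mean(z > t)          # empirical exceedance frequency
        Gt = G(t)                    # null exceedance probability
        denom = 1.0 - Gt
        if denom <= 0:
            continue
        cand = (Fp - Gt - c_p * sqrt(Gt * (1.0 - Gt) / p)) / denom
        best = max(best, cand)
    return min(best, 1.0)
```

On data with a small fraction of strong signals, the estimate is positive and close to the true proportion, while the bounding term keeps it from overshooting under the null.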
In summary, Theorem 2.1 and Theorem 2.2 establish the two equations in (2.10) for the estimation of FNP(t) by F̂NP(t). Note that in practice we need to simulate V_p and c_p to obtain the estimated s and F̂NP(t). Please refer to Section 3.2 for details of the numerical implementation.

FNP Control at a User-Specified Level
In this section, we introduce a new method for FNP control at a user-specified level in high-dimensional regression. We say that a variable selection method asymptotically controls FNP at a pre-specified level ε ∈ (0, 1) if the FNP of its selection outcome satisfies P(FNP ≤ ε) → 1. Such methods are useful in applications where data dimensions need to be largely reduced for subsequent analyses while false negatives are controlled at a tolerable level.

The FNC-Reg Procedure
Based on the approximation results for FNP, we propose the False Negative Control for Regression (FNC-Reg) procedure, which determines the cut-off threshold on the list of ranked {|z_j|}_{j=1}^p as

t*(ε) = max{ |z_(j)| : F̂NP(|z_(j)|) ≤ ε }   (3.1)

for a user-specified ε ∈ (0, 1). FNC-Reg selects the predictors with |z_j| > t*(ε). It can be seen that FNC-Reg is a procedure built upon direct estimation of FNP and a user-specified control level of FNP. Given that F̂NP(t) is non-decreasing in t, FNC-Reg selects the smallest subset of {z_j}_{j=1}^p such that the estimated FNP is less than ε. Moreover, this procedure depends on the user's preference for the control level of FNP. Since t*(ε) is non-decreasing in ε, users who can tolerate missing a higher proportion (larger ε) of relevant variables may select fewer variables using the procedure. The selected subset of variables can then be much smaller than the full set of variables, which trivially incurs no false negatives. The next theorem shows that under certain conditions, the FNC-Reg procedure asymptotically controls the true FNP at the level ε.

Theorem 3.1. Assume the conditions of Theorem 2.2, with μ_min ≥ √(2(γ* + c) log p) for γ* in (2.11) and some constant c > 0. Then t*(ε) determined by (3.1) with ŝ = π̂*p and c_p = c*_p in (2.17) satisfies P(FNP(t*(ε)) ≤ ε) → 1.

We compare the condition on μ_min in Theorem 3.1 with the beta-min condition for variable selection consistency. Our condition on μ_min achieves the order O(√((log p)/n)) for β_min, which is the optimal order for variable selection consistency [19,35]. On the other hand, our condition on μ_min specifies the constant term √(2γ*) with γ* in (2.11), while existing beta-min conditions for different methods have various constant terms that are often not fully specified. Therefore, we compare with the optimal constant term for variable selection consistency in the ideal setting, where the predictors (X_i1, ..., X_ip) are generated as i.i.d. samples from N(0, I_{p×p}). The existing study in, for example, [23] has shown that the optimal constant is √2 + √(2(1 − η)); a smaller η (larger s) makes it harder to perfectly separate all the signals from the noise. It then follows that √(2γ*) < √2 + √(2(1 − η)) for any η ∈ (0, 1). The above analysis shows that in the ideal setting, our condition on μ_min is weaker than the optimal beta-min condition for variable selection consistency, and that FNP control can be achieved by FNC-Reg even when relevant and irrelevant variables cannot be perfectly separated.
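The threshold rule itself is a one-pass scan over the ordered statistics. A sketch, where fnp_hat_fn stands for any implementation of the estimated FNP (function names are ours):

```python
import numpy as np

def fnc_reg_threshold(z, fnp_hat_fn, eps):
    """t*(eps): the largest ranked |z_(j)| whose estimated FNP is <= eps.

    fnp_hat_fn(t) should return the estimated FNP at threshold t,
    e.g. the estimator of Section 2."""
    ts = np.sort(np.abs(np.asarray(z)))[::-1]      # |z_(1)| > |z_(2)| > ...
    admissible = [t for t in ts if fnp_hat_fn(t) <= eps]
    return max(admissible) if admissible else 0.0  # 0 selects everything

# usage: select the predictors whose |z_j| reach the threshold
# selected = np.flatnonzero(np.abs(z) >= t_star)
```

Because the estimated FNP is non-decreasing in t, the returned threshold is the largest admissible one, i.e. the smallest admissible selected set.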

Numerical Implementation of FNC-Reg
We provide a computational algorithm to implement the proposed FNC-Reg procedure. First, the estimation of s relies on the bounding sequence c_p, which is pre-fixed as the (1 − α_p)-th quantile of V_p. In numerical implementation, we suggest simulating V_p and c_p as follows. We simulate data under the global null hypothesis that no relevant predictors exist and calculate the corresponding standardized debiased Lasso statistics z̃_j. Note that z̃_j is asymptotically distributed as w_j under the global null hypothesis. We order the z̃_j's by their absolute values such that |z̃_(1)| > |z̃_(2)| > ... > |z̃_(p)| and calculate the corresponding statistic Ṽ. We repeat the above 1000 times and determine c̃_p as the (1 − 1/√(log p))-th quantile of the empirical distribution of Ṽ. The FNC-Reg algorithm can then be summarized as follows.
1. Compute the standardized debiased Lasso test statistics {z_j}_{j=1}^p and order them as |z_(1)| > |z_(2)| > ... > |z_(p)|.
2. Simulate the bounding sequence c̃_p under the global null as described above, and compute ŝ = π̂*p.
3. Compute F̂NP(|z_(j)|) for each 1 ≤ j ≤ p.
4. Obtain t*(ε) = max{|z_(j)| : F̂NP(|z_(j)|) ≤ ε} for a user-specified ε > 0.
5. Select the predictors with |z_j| ≥ t*(ε).
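The null simulation of the bounding sequence can be sketched as follows. For brevity the null statistics are drawn i.i.d. N(0, 1) rather than refitted from a debiased Lasso under y = ε, so this is an idealized stand-in for the scheme described above (function names are ours):

```python
import numpy as np

def simulate_c_p(p, stat_fn, n_rep=1000, rng=None):
    """Approximate the bounding sequence as the (1 - 1/sqrt(log p))-th
    quantile of the null distribution of a V-type statistic.

    stat_fn maps a vector of null test statistics to the scalar statistic;
    the i.i.d. N(0,1) draws below stand in for the refitted z-tilde's."""
    rng = np.random.default_rng(rng)
    vals = np.empty(n_rep)
    for r in range(n_rep):
        z_null = rng.standard_normal(p)
        vals[r] = stat_fn(z_null)
    q = 1.0 - 1.0 / np.sqrt(np.log(p))
    return np.quantile(vals, q)
```

For instance, with stat_fn taken as the maximum absolute statistic, the returned bound is near the expected extreme of p standard Gaussians.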

Numerical Analysis
Examples in this section have the response y simulated from the regression model (1.1) with ε ∼ N_n(0, I). Each row of X is simulated from N_p(0, Σ). We use the Erdős-Rényi random graph in [8] to generate the precision matrix Θ = Σ^{-1} with row sparsity s_max ∼ Binomial(p, θ), such that the nonzero elements of Θ are randomly located in each of its rows with magnitudes randomly generated from the uniform distribution Uniform[0.4, 0.8]. The nonzero coefficients β_1, ..., β_s are set to a common value. The debiased Lasso estimates are obtained by applying the R package hdi [11].
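A sketch of such a precision-matrix generator is below. The paper does not state how positive definiteness is enforced or how signs are assigned, so the random signs and the diagonal shift here are our assumptions:

```python
import numpy as np

def er_precision(p, theta, rng=None):
    """Erdos-Renyi-type sparse precision matrix (illustrative variant of [8]).

    Off-diagonal entries are nonzero independently with probability theta,
    with magnitudes Uniform[0.4, 0.8] and random signs; a diagonal shift
    (our addition) guarantees positive definiteness."""
    rng = np.random.default_rng(rng)
    mask = rng.random((p, p)) < theta
    mag = rng.uniform(0.4, 0.8, size=(p, p)) * rng.choice([-1.0, 1.0], size=(p, p))
    Theta = np.triu(mask * mag, k=1)
    Theta = Theta + Theta.T                      # symmetric, zero diagonal
    # shift the diagonal so the smallest eigenvalue is bounded away from 0
    eigmin = np.linalg.eigvalsh(Theta).min()
    np.fill_diagonal(Theta, 1.0 + max(0.0, -eigmin))
    return Theta
```

The covariance of the rows of X is then Σ = Θ^{-1}.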

Estimating s
We compare the estimate ŝ with the true s in two settings. The first setting has p = 200, n = 100, s = 10, θ = 0.02, and β_1 ranging from 0.2 to 0.5. The second setting increases the sample size n to 150. As claimed in Theorem 2.2, the accuracy of ŝ increases with the magnitude of the non-zero coefficients and with the sample size. Figure 1 presents box-plots of the ratio ŝ/s from 100 replications. When β_1 or the sample size is small, ŝ tends to under-estimate the true s. As β_1 increases from 0.2 to 0.5 or n increases from 100 to 150, ŝ/s concentrates more around 1.

FNP control
We apply the FNC-Reg algorithm presented in Section 3.2 to simulated data with p = 200, n = 150, s = 10, θ = 0.02, and β_1 ranging from 0.2 to 0.5. Table 1 fixes ε at 0.1 and reports the mean value of FNP(t*) as β_1 increases from 0.2 to 0.5. We also calculate the associated false discovery proportion (FDP(t*) = FP(t*)/R(t*)) to reveal the price paid in false positives for FNP control. Further, we calculate the F-measure, which summarizes FNP and FDP by the harmonic mean of (1 − FNP) and (1 − FDP) [30]. The F-measure takes values between 0 and 1, and a higher value corresponds to better overall performance.
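The F-measure used here reduces to a short computation; a minimal sketch (the function name is ours):

```python
def f_measure(fnp, fdp):
    """Harmonic mean of (1 - FNP) and (1 - FDP), as in [30]."""
    sens, prec = 1.0 - fnp, 1.0 - fdp
    if sens + prec == 0:
        return 0.0
    return 2.0 * sens * prec / (sens + prec)

print(f_measure(0.0, 0.0))   # 1.0: no false negatives and no false discoveries
print(f_measure(0.5, 0.5))   # 0.5
```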
Because we are not aware of any existing methods that directly control FNP in high-dimensional regression, we present the corresponding results of two other methods that perform variable selection based on different criteria. These results help to better understand the results of FNC-Reg. The first method is Lasso, whose solution is obtained using the R package hdi, with λ determined by cross validation. The second method is Knockoff, which has been developed to control the false discovery rate (FDR) at a user-specified level in high-dimensional regression [3,9]. We use the "knockoff.filter" function with its default settings from the R package knockoff, which creates model-X second-order Gaussian knockoffs as introduced in [9]. The nominal level is set at 0.1.

Table 1: The mean values and standard deviations (in brackets) of FNP, FDP, and the F-measure from 100 replications for FNC-Reg, Lasso, and Knockoff.

It can be seen from Table 1 that as β_1 increases, the FNP of FNC-Reg decreases, which agrees with the theoretical insight provided by the condition on μ_min in Theorem 3.1. In the challenging scenarios where β_1 is very small, the FNP of FNC-Reg mostly exceeds the nominal level of 0.1; this is due to the under-estimation of s by ŝ and FNC-Reg's consequent tendency to select fewer variables to match the under-estimated number of signals. Furthermore, both the FNP and the FDP of FNC-Reg get smaller for larger β_1, suggesting that FNC-Reg automatically adapts to and benefits from increasing signal intensity for both false negative and false positive control. Table 1 also shows that Lasso has lower FNP but much higher FDP than FNC-Reg, which agrees with Lasso's known tendency to over-select when p > n. On the other hand, Knockoff has its FDP reasonably controlled at the nominal level of 0.1 but much higher FNP than those of FNC-Reg and Lasso. In terms of the F-measure that summarizes FNP and FDP, FNC-Reg outperforms the other two methods across the different β_1 values.
We further illustrate the adaptivity of FNC-Reg to the user-specified control level of FNP. For various values of ε, we calculate the relative frequency of the event {FNP(t*(ε)) ≤ ε}. Table 2 summarizes the results for different settings with ε = 0.1, 0.2, 0.3 and β_1 = 0.3, 0.5, 0.7. It can be seen that the relative frequency of FNP ≤ ε for FNC-Reg increases with β_1, which is consistent with the theoretical insight in Theorem 3.1. On the other hand, for a fixed β_1, the FDP of FNC-Reg decreases with ε, which agrees with our expectation for FNC-Reg, as a more liberal control of FNP incurs a smaller price in false positives. Note that the results of Lasso and Knockoff do not change with varying ε.

Conclusion and Discussion
We propose a new variable selection method, FNC-Reg, to efficiently control false negatives in linear regression. Different from existing methods and theory for power analysis and Sure Screening, our procedure directly estimates the FNP of a decision rule and selects the smallest subset of variables whose estimated FNP is less than a user-specified control level. FNP control is especially challenging when relevant variables cannot be consistently separated from irrelevant ones due to limited sample size and effect size. We develop new techniques to analyze FNP control in this challenging setting and to cope with the difficulties caused by the dependence of the test statistics. FNC-Reg possesses two types of adaptivity. First, it adapts to the user's preference level for the control of FNP. When a user can tolerate a less stringent control on FNP, they can input a larger ε in the FNC-Reg procedure and select fewer variables with fewer false positives. Second, the proposed method is adaptive to the unknown effect size. Note that the implementation of the procedure does not require information on the effect size. Nevertheless, the result of the procedure automatically improves in both FNP and FDP as the effect size increases.
Our theoretical study presents a weaker condition on μ_min for FNP control by FNC-Reg than the beta-min condition for variable selection consistency. It is also of interest to understand the result of FNC-Reg when the condition on μ_min is not satisfied. Assume that among the s signal variables only s_1 of them satisfy μ_j ≥ √(2(γ* + c) log p) for some constant c > 0. Then arguments similar to those in the proof of Theorem 2.2 can be applied to show that P((1 − δ)s_1 < ŝ < (1 + δ)s) → 1 for any δ > 0. Note that ŝ no longer consistently estimates s, nor is it a consistent estimator of s_1. Such an ŝ tends to under-estimate s, which can cause the proposed method to select fewer variables to match the under-estimated number of signals. Because FNC-Reg ranks the test statistics by their significance and selects variables from the top, one can still make a statement about FNP control for the signals with effect sizes larger than the observed cut-off. Such an interpretation of the results remains valid whether the condition on μ_min holds or not.
Last but not least, we adopt the debiased Lasso estimator as the test statistic in the paper to demonstrate the new analytic framework of FNP control. We expect that the proposed framework can incorporate other test statistics in linear regression and promote further developments in false negative control based variable selection.

Proofs
This section contains the proofs of Theorem 2.1, Theorem 2.2, and Theorem 3.1. Auxiliary lemmas are provided in the appendices. We will frequently use Mill's ratio, i.e., Φ(−t) ≤ φ(t)/t for t > 0, without mentioning it at each instance. All arguments are conditional on X, and the symbol C denotes a generic finite constant whose value can differ from occurrence to occurrence.

Proof of Theorem 2.1
The proof is composed of two parts. The first part assumes ξ > η and the second part assumes ξ ≤ η.
Consider the first part with ξ > η. It suffices to show that s^{-1}FP(t_ξ) = o_P(1) and s^{-1}p_0Φ(−t_ξ) = o(1) when ξ > η. The latter follows from Mill's ratio, since p_0Φ(−t_ξ) ≤ Cp^{1−ξ} while s = p^{1−η} with ξ > η. For the former, Markov's inequality gives, for a fixed constant a > 0, P(FP(t_ξ) > as) ≤ (as)^{-1} Σ_{j∈I_0} P(|z_j| > t_ξ).
The following lemma helps quantify the order of P(|z_j| > t_ξ) for j ∈ I_0; its proof is provided in Section A.3.

Proof of Theorem 2.2
First, we have the following lemma showing the order of the bounding sequence c*_p for V*_p. Its proof is provided in Section A.5.

Lemma 6.3. Assume conditions A1) through A3) in Appendix A.1. Consider V*_p as in (2.15). Then c*_p of the order (s_max/n)^{1/4} log p satisfies P(V*_p > c*_p) → 0 as p → ∞.

Now, recall F_p(t) = p^{-1} Σ_{j=1}^p 1{|z_j| > t} and define Φ̄_p(t) = p^{-1} Σ_{j=1}^p 1{|μ_j + w_j| > t}. Consider the decomposition in (6.3), where T is defined in (A.10). The first summand within the parentheses on the right-hand side (RHS) of (6.3) can be safely ignored when bounding π̂*/π, as asserted by the following Lemma 6.4.
We first show that π̂** is an asymptotic lower bound of π. Recall the definition of V*_p in (2.15). Bounding Φ̄_p accordingly yields a chain of inequalities in which the second inequality follows because c*_p is non-decreasing in p and c*_p(1 − π)^{-1} > c*_{p_0}. However, Lemma 6.3 asserts P(V*_{p_0} > c*_{p_0}) → 0, and the lower bound follows. Next, we show that π̂** is an asymptotic upper bound of (1 − δ)π for any δ > 0. Let FP_w(t) = Σ_{j∈I_0} 1{|w_j| > t} and rewrite the quantity of interest accordingly for any t ∈ T. Now set t in inequality (6.6) to be

t_τ = √(2τ log p) with τ = γ* + c/2,   (6.7)

where γ* is defined in (2.11). We will show that each term on the RHS of (6.6) is o_P(1).

Proof of Theorem 3.1
Recall the definition of t*(ε) and simplify the notation by writing t* = t*(ε). We have the following Lemma 6.6, whose proof is provided in Section A.8.
Now we aim to show P (t * ≥ t τ ) → 1. The proof of the following Lemma 6.7 is presented in Section A.9.
Lemma A.1. Assume A1) and A2). Then there exist positive constants c and C depending only on C_min, C_max, and κ such that the stated probability bound holds for max{s, s_max} < cn/log p. Note that the above result relaxes the ultra-sparse condition s = o(√n/log p) in [34] to s = o(n/(log p)²), as shown in A3).
Lemma A.3. Let K̄ be the correlation matrix of w. Assume A1) and A2). Then
A.3. Proof of Lemma 6.1 By assumption A3), d_p = o(1), s ≪ n/log p, and n ≫ log p. Then Lemma A.1 applies with c* = C_min/16. By Lemma A.2 and the definitions above, the target probability is bounded as follows.
The rightmost term and the leftmost term are bounded in turn, and summing up the above gives the claim in Lemma 6.1.

A.4. Proof of Lemma 6.2
For i ≠ j, let ρ_ij be the correlation between w_i and w_j, and let C_ij,ξ = Cov(1{|w_i| ≤ t_ξ}, 1{|w_j| ≤ t_ξ}). By Mill's ratio, the diagonal contribution is controlled, and it is left to bound Σ_{i≠j} C_ij,ξ in (A.6). Define c_1,ξ = −t_ξ and c_2,ξ = t_ξ. Fix a pair (i, j) with i ≠ j and |ρ_ij| ≠ 1. Now we use the results in Section A.2. Since C_ij,ξ is finite and the series in Mehler's expansion in (A.4), as a trivariate function of (x, y, ρ), is uniformly convergent on each compact subset of R × R × (−1, 1), as justified by [36], we can interchange the order of summation and integration. Inequality (A.5) then implies, for some finite constant C_0 > 0, the bound in (A.8). Combining (A.6) with (A.7) and (A.8) gives the claim, where the last inequality follows from Lemma A.3, i.e., ‖K̄‖_1 = O_P(p² λ_1 √s_max).

Summing up the above gives (A.11).
A.7. Proof of Lemma 6.5 We will show A_1(t_τ) = o_P(1). Fix a constant a > 0; for each j ∈ I_1, the corresponding summand is bounded as in (A.13). We only need to uniformly bound the RHS of (A.13).