Significance testing in non-sparse high-dimensional linear models

In high-dimensional linear models, the sparsity assumption is typically made: most of the parameters are assumed to be equal to zero. Under the sparsity assumption, estimation and, more recently, inference have been well studied. In practice, however, the sparsity assumption is not checkable and, more importantly, is often violated: a large number of covariates might be expected to be associated with the response, indicating that possibly all, rather than just a few, parameters are non-zero. A natural example is genome-wide gene expression profiling, where all genes are believed to affect a common disease marker. We show that existing inferential methods are sensitive to the sparsity assumption and may, in turn, result in a severe lack of Type I error control. In this article, we propose a new inferential method, named CorrT, which is robust to model misspecification such as heteroscedasticity and lack of sparsity. CorrT is shown to have Type I error approaching the nominal level for \textit{any} model and Type II error approaching zero for sparse and many dense models. In fact, CorrT is also shown to be optimal in a variety of frameworks: sparse, non-sparse and hybrid models, where sparse and dense signals are mixed. Numerical experiments show a favorable performance of the CorrT test compared to state-of-the-art methods.


Introduction
Hypothesis testing for high-dimensional models is widely used. Statisticians usually utilize hypothesis tests to study the importance of one or many variables in parametric models. Given pairs of observations (y_i, w_i), with y_i ∈ R, w_i ∈ R^p and i = 1, . . . , n, the importance of the j-th variable of a parametric model E[y_i | w_i] = w_i^⊤β is studied by developing hypothesis tests for the related null hypothesis H_0 : β_j = 0 against a suitable alternative, for example H_1 : β_j ≠ 0. In high-dimensional settings where p ≫ n, several theoretical results are available on the design of such tests, including asymptotic optimality [70,32,63,3,45,16]. A common thread among these works is an underlying assumption on the sparsity of the parametric model at hand. A recent method introduced by [73] extends this body of work by removing some of the restrictive assumptions pertaining to the sparsity of the parametric model.
In particular, [73] introduced the notion of restructured regression, where a specific moment of interest leads to robust and stable inference in high-dimensional settings. The authors introduce feature stabilization and synthesization pertaining to the null hypothesis of interest as a way of transforming a parametric null into a suitable and testable moment condition that is equivalent to the null hypothesis. We split the model parameter into the parameter under testing, β*, and the nuisance parameter, γ*. Applied to the problem of testing for significance of variables in high-dimensional linear models, a restructured regression as proposed in [73] takes the form of

Y = Xγ* + Zβ* + ε, (1.1)

where W := (w_1, . . . , w_n)^⊤ = [X, Z] ∈ R^{n×(p+1)}, with Z ∈ R^n and X ∈ R^{n×p} being the design matrix, Y = (y_1, . . . , y_n)^⊤ ∈ R^n and ε ∈ R^n is the error term. When interested in testing

H_0 : β* = β_0 vs. H_1 : β* ≠ β_0, (1.2)

[73] consider an auxiliary model

Z = Xθ* + u, (1.3)

where θ* ∈ R^p is a high-dimensional and unknown parameter and u ∈ R^n is the model error. [73] showcase that one can transform the original parametric null (1.2) into the moment condition

E[(v_i − x_i^⊤γ*)(z_i − x_i^⊤θ*)] = 0. (1.4)

In the above, V := (v_1, . . . , v_n)^⊤ = Y − Zβ_0 is defined as a pseudo-response. This paper extends the early work of [73] into a flexible method for developing asymptotically exact and optimal hypothesis tests regarding the importance of variables in high-dimensional linear models as identified through a family of moment conditions (1.4). We aim to build inference methods that inherit the desirable empirical properties of the test of [73], such as the preservation of the Type I error rate and no loss in efficiency, but can be used in the wide range of linear model settings characterized by (1.1). Two of the most significant improvements over [73] are related to (conditional) heteroscedasticity and double robustness.
For example, this paper encompasses the case where the error term ε_i is heteroscedastic, in that the conditional variance of ε_i can depend on the model features; see Condition 1. More generally, we showcase that moment conditions of the form (1.4) can be used to provide valid and asymptotically optimal tests, as well as confidence intervals, regardless of whether the initial model (1.1) or the auxiliary model (1.3) is dense and/or heteroscedastic. An example of a possibly non-sparse auxiliary model is one whose covariate dependence is represented through a dense graphical model. Although both this paper and [73] deal with inference in high-dimensional linear models with non-sparse structures, it is worth pointing out that under the more general settings considered here (e.g., heteroscedasticity and lack of knowledge of which one of γ* and θ* is dense), valid inference is substantially more difficult. Section 2 gives a detailed treatment of this perspective. Moment conditions like (1.4) are understood as robust moment conditions, i.e., moment conditions written as the product of two residuals, one arising from the initial model at hand (1.1) and the other arising from an auxiliary model on the covariates (1.3). Because individual residuals can have a potentially high bias, such a product stabilizes inference whenever either one of the residuals is estimated accurately. This idea of correcting for bias using an auxiliary model is closely related to the work on score decorrelation, Neyman score orthogonalization, and double machine learning; see e.g., [44,45,14]. Important examples in high-dimensional problems include [39] and [14] for average treatment effect estimation, [4] for quantile estimation, and [61,72,1] for models related to time-to-event and survival analysis.
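As a quick sanity check of this product-of-residuals structure, the following simulation (a sketch; the model choices and sizes here are illustrative, not taken from the paper) generates data from (1.1) and (1.3) under the null and verifies that the sample analogue of the moment condition is near zero even though γ* is dense:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20000, 50
theta = np.zeros(p); theta[:3] = 0.5          # sparse auxiliary model (1.3)
gamma = rng.standard_normal(p) / np.sqrt(p)   # dense main-model nuisance
beta0 = 0.0                                   # null value of beta*

X = rng.standard_normal((n, p))
u = rng.standard_normal(n)
Z = X @ theta + u                             # auxiliary model (1.3)
eps = rng.standard_normal(n)
Y = X @ gamma + beta0 * Z + eps               # model (1.1) under H0

V = Y - beta0 * Z                             # pseudo-response under H0
resid_main = V - X @ gamma                    # residual from (1.1)
resid_aux = Z - X @ theta                     # residual from (1.3)
moment = float(np.mean(resid_main * resid_aux))  # sample product-of-residuals moment
```

With true (oracle) nuisance parameters plugged in, the sample moment is of order n^{-1/2} under the null, regardless of how dense γ* is.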
These methods are quite general and can deliver valid inference with the liberty of choosing from a large class of estimators; these estimators need to have a fast enough rate of convergence for both the original model and the auxiliary model. However, even consistency is quite difficult to achieve for dense models. Therefore, robust moment conditions alone are not enough to achieve valid inference in non-sparse high-dimensional models.
We seek a unified, general framework for computationally efficient and problem-specific tests. These tests are optimized for the primary objective of testing for significance of a key parameter of interest, especially when consistent estimation of the nuisance parameters is not achievable. This pushes the limit of the known optimality of doubly robust moments even in the low-dimensional setting: a consistent (but not necessarily optimal) estimator was once required, whereas now an inconsistent estimator can be allowed. This paper addresses the challenges resulting from inconsistency by designing a set of adaptive estimators and adopting a self-normalized structure. The challenge in generalizing moment-based methods is that their success hinges on whether the adaptive estimator adequately reflects the dimensionality and heterogeneity of the model of interest. While it is natural to require estimators that achieve an optimal rate of convergence in appropriate norms (e.g., the ℓ1-norm or the prediction norm) for sparse models, achieving even consistency is not a trivial problem for dense models. Hence, we design estimators that enjoy additional adaptive properties. If the target parameter is sparse, our estimator enjoys properties similar to those of adaptive estimators designed only for sparse problems; if the target parameter is dense, our estimator has the stability needed for valid inference. When we use such adaptive estimators for both θ* and γ*, asymptotic normality holds even if only one of them is sparse. However, the correct "asymptotic variance" depends on which one is sparse. Here, self-normalization plays a critical role by automatically providing the correct corresponding normalization. As a result, the proposed inference method is valid without knowledge of the source of sparsity.
In light of this, it is natural to see why the aforementioned existing strategies that can be coupled with generic estimators do not guarantee inference validity for dense models.
Moment conditions of the form (1.4) typically arise in scientific applications where rigorous statistical inference is required. The bulk of this paper is devoted to a theoretical analysis of moment-based tests, and to establishing asymptotic normality of the resulting estimates even in the presence of severe model misspecification (e.g., non-Gaussianity, heteroscedasticity, lack of sparsity). We also develop the theory for the efficiency and power properties of the introduced tests. Our analysis is motivated by classical results for orthogonal moments, in particular [52], paired with machinery from [73] to address the adaptivity of the tests to model misspecification. The resulting framework presents a flexible method for robust statistical estimation as well as inference with formal asymptotic guarantees while allowing model misspecification.

Related Work
The idea of building tests on the basis of moment conditions (and orthogonal moments) has a long history, including Neyman's C(α) tests and Rao's score tests, which were introduced to address unobserved heterogeneity in parametric statistical models. In medicine, popular applications of these techniques include testing for rare variants [36,42] and estimation of treatment effects and causal parameters [28]; in economics, they include estimation and inference in missing data models as well as studying the effects of program or policy interventions [29]. At its essence, the method necessitates that the moment condition used to identify the parameter of interest be insensitive to small changes in the estimated nuisance function.
A challenge facing this approach is that, if the covariate space has more than two or three dimensions, performance can suffer when some of the parameters are estimated at a rate slower than n^{−1/4}. Unfortunately, in misspecified and high-dimensional models such a rate cannot be guaranteed. For well-specified models, i.e., Gaussian models, [73] showcased that this rate requirement can be bypassed. Our paper proposes a generalization of their work in that the developed test is both robust and efficient even when the model is not correctly specified.
The original condition for Neyman orthogonality was proposed by [44], building on insights from the semi-parametric estimation framework; its optimality in linear models was established in [43]. It was then utilized to develop a class of methods called doubly robust procedures, designed to mitigate selection bias, nonrandom treatment assignment in observational studies and noncompliance in randomized experiments [53,64]. Doubly robust methods can be viewed as a refinement of a weighted estimating-equations approach to regression with incomplete data proposed by [54,55] and [56].
The perspective we take on the moment condition (1.4) is a form of weighted estimating equations, where the weights are defined to be adaptive and to stabilize against model misspecification. However, we most closely build on the proposal of [73], designed for Gaussian regression models. This adaptively weighted estimating-equations perspective also underlies several statistical analyses of missing data and causal inference problems in the context of improved local efficiency (see for example [57]). Our adaptive weighting scheme draws heavily from a long tradition in the literature of adaptive estimation, specifically where the dimensionality of the model is high [21,13,24,18,17]. Our goal differs from the above in that we are not focused on adaptive estimation itself. Instead, we are interested in constructing a test that is as robust as possible to model misspecification; we then rely on self-normalization to achieve statistical stability.
In this sense, our approach is related to the studies of nonparametric function estimation where a "gap" between estimation and inference has been well established, i.e., a gap between the existence of adaptive risk bounds and the nonexistence of adaptive confidence statements. We showcase here a reciprocal "gap": a gap between the existence of confidence intervals and the nonexistence of risk or estimation bounds. We argue that optimal inference procedures exist even in models for which estimation consistency for the model parameters cannot be guaranteed.
Our asymptotic theory relates to an extensive recent literature on high-dimensional statistical inference, most of which focuses on Gaussian models [31,32,34,12,10] or correctly specified models [70,63,3,45,16]. Our present paper complements this body of work by showing how methods developed to study models with missing data can also be used to develop tests for misspecified models through robust and yet adaptive tests.

Motivating example: a challenge of the lack of sparsity
To better understand whether discoveries based on current state-of-the-art methods in high-dimensional inference can be spurious in models that lack sparsity, we consider the problem of testing (1.2) in the model (1.1) with β_0 = 0. To emphasize the issue of lack of sparsity, we consider a homoscedastic Gaussian model. Assume the simple setup of orthogonal designs and known noise level: the entries of Z, X and ε are known to be independent standard normal random variables. Moreover, assume that log p = o(n). Let β* = 0 and γ* = a p^{−1/2} 1_p, where a ∈ [−10, 10] is a fixed constant and 1_p = (1, . . . , 1)^⊤ ∈ R^p. It is easy to see that γ* is not sparse if a ≠ 0. Let π* = (β*, γ*^⊤)^⊤ ∈ R^{p+1} and W = (Z, X) ∈ R^{n×(p+1)}. With λ = 16 √(n^{−1} log p), we define the Lasso estimator

π̂ = arg min_{π ∈ R^{p+1}} (2n)^{−1} ‖Y − Wπ‖₂² + λ‖π‖₁.

The de-biased estimator is then defined as π̃ = π̂ + Θ̂W^⊤(Y − Wπ̂)/n, where Θ̂ is the estimated precision matrix of the design¹. Let β̃ be the first entry of π̃. A Wald test for the hypothesis β* = 0 with nominal size α ∈ (0, 1) can then be defined as a decision to reject the hypothesis whenever |β̃| > Φ^{−1}(1 − α/2)/√n, where Φ(·) is the cumulative distribution function of the standard normal distribution. The Type I error of this test is characterized in the following result.
Theorem 1. In the above setup, the Type I error of the high-dimensional Wald test converges, as n → ∞, to a limit F(α, a).

It is immediately obvious that when the model is sparse, i.e., a = 0, we have F(α, a) = α and thus the Type I error of the test is asymptotically equal to the nominal level. However, when the model is not sparse, i.e., a ≠ 0, we have F(α, a) > α (see Figure 1), meaning that the Type I error asymptotically exceeds the nominal level. In fact, in the non-sparse case, the Type I error of the Wald test can be close to one. In Figure 1, we see that the state-of-the-art Wald test with nominal size 1% can reject, in large samples, a true null hypothesis with probability as high as 80%.

¹ Since the true precision matrix is I_{p+1}, the (p+1)×(p+1) identity matrix, we set Θ̂ = I_{p+1} for simplicity; our result still holds if we use the nodewise Lasso estimator in Equation (8) of [63]. Moreover, observe that the method of [33] is equivalent to that of [63] in this particular setting; by Theorem 2.2 therein, the de-biased estimator admits a decomposition whose leading term ξ, conditional on W, has a normal distribution with mean zero and variance Z^⊤Z/n.
In the proof of Theorem 1, we show that π̂ = 0 with high probability; this turns out to be a minimax optimal estimator in ℓ2-balls; see [19]. However, even though the zero vector resulting from (entry-wise) shrinkage methods might be a good approximation of π* in the ℓ∞-norm, this approximation is poor for inference problems. This illustrates the drawback of the Wald principle in non-sparse models in general: even optimal estimators might not be good enough for this principle to perform well.

Figure 1: Plot of the asymptotic Type I error of the high-dimensional Wald test in a linear model with independent design. The horizontal axis denotes a, where γ* = a p^{−1/2} 1_p, and the vertical axis denotes the rejection probability F(α, a) under the null hypothesis.

Observe that, in practice, rejecting the hypothesis that a certain regression coefficient is zero is typically interpreted as evidence supporting a new scientific discovery, e.g., a treatment effect. However, due to the non-robustness to the lack of sparsity demonstrated above, researchers could, with high probability, obtain "discoveries" that do not exist. Therefore, new inferential tools are called for that do not reject true hypotheses too often, regardless of whether the sparsity condition holds.
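The construction above can be reproduced numerically. The sketch below uses a small ISTA-based Lasso solver of our own (the solver, sample sizes and seed are illustrative, not from the paper) and shows that, in this dense design with the stated tuning λ, the Lasso estimate collapses to exactly zero, so the de-biased first coordinate reduces to Z^⊤Y/n:

```python
import numpy as np

def lasso_ista(W, Y, lam, n_iter=300):
    """Proximal-gradient (ISTA) solver for (2n)^{-1}||Y - W pi||_2^2 + lam*||pi||_1."""
    n, d = W.shape
    L = np.linalg.norm(W, 2) ** 2 / n          # Lipschitz constant of the smooth part
    pi = np.zeros(d)
    for _ in range(n_iter):
        grad = W.T @ (W @ pi - Y) / n
        z = pi - grad / L
        pi = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return pi

rng = np.random.default_rng(0)
n, p, a = 200, 400, 5.0
X = rng.standard_normal((n, p))
Z = rng.standard_normal((n, 1))
W = np.hstack([Z, X])
gamma = a * np.ones(p) / np.sqrt(p)            # dense nuisance, beta* = 0
Y = X @ gamma + rng.standard_normal(n)
lam = 16.0 * np.sqrt(np.log(p) / n)
pi_hat = lasso_ista(W, Y, lam)                 # collapses to the zero vector here
pi_tilde = pi_hat + W.T @ (Y - W @ pi_hat) / n # de-biasing step with Theta = I
beta_tilde = pi_tilde[0]
reject = abs(beta_tilde) > 1.959964 / np.sqrt(n)  # Wald test at the 5% level
```

Repeating this over many draws would trace out the inflated rejection probability F(α, a) shown in Figure 1.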

Notation and organization
Throughout the article, we use bold upper-case letters for matrices and lower-case letters for vectors. Moreover, ^⊤ denotes the matrix transpose and I_p denotes the p×p identity matrix. For a vector v ∈ R^k, v_j denotes its j-th entry and, for q ≥ 1, its ℓq-norm is ‖v‖_q = (Σ_{j=1}^k |v_j|^q)^{1/q}; in addition, ‖v‖_∞ = max_{1≤j≤k} |v_j| and ‖v‖₀ = Σ_{j=1}^k 1I{v_j ≠ 0}, where 1I denotes the indicator function. For any matrix A, σ_min(A) and σ_max(A) denote the minimal and maximal singular values of A. For two sequences a_n, b_n > 0, we use a_n ≍ b_n to denote that there exist positive constants C₁, C₂ > 0 such that, for all n, a_n ≤ C₁b_n and b_n ≤ C₂a_n. We also introduce two definitions that will be used frequently. The sub-Gaussian norm of a random variable X is defined as ‖X‖_{ψ2} = sup_{q≥1} q^{−1/2}(E|X|^q)^{1/q}, whereas the sub-Gaussian norm of a random vector Y ∈ R^k is ‖Y‖_{ψ2} = sup_{‖v‖₂=1} ‖v^⊤Y‖_{ψ2}. A random variable or a vector is said to be sub-Gaussian if its sub-Gaussian norm is finite. Moreover, a random variable X is said to have an exponential-type tail with a given parameter; this is a generalization of the sub-Gaussian property, and we refer to [67] for the precise definition.
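For concreteness, the vector norms just defined can be collected in a small helper (a sketch; the function name is ours):

```python
import numpy as np

def lq_norm(v, q):
    """ell_q 'norm' of a vector: the usual norm for q >= 1, the number of
    nonzero entries for q = 0, and the maximum absolute entry for q = inf."""
    v = np.asarray(v, dtype=float)
    if q == 0:
        return int(np.count_nonzero(v))
    if q == np.inf:
        return float(np.max(np.abs(v)))
    return float(np.sum(np.abs(v) ** q) ** (1.0 / q))

print(lq_norm([3.0, 0.0, -4.0], 0))       # 2 nonzero entries
print(lq_norm([3.0, 0.0, -4.0], 1))       # 7.0
print(lq_norm([3.0, 0.0, -4.0], 2))       # 5.0
print(lq_norm([3.0, 0.0, -4.0], np.inf))  # 4.0
```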
The rest of the paper is organized as follows. Section 2 discusses the main idea behind the proposed CorrT procedure. It introduces the moment construction technique and the construction of a self-normalizing test statistic related to that moment of interest. Subsection 2.1 designs tuning-adaptive estimators that are particularly useful for estimation in potentially non-sparse models. Theoretical results are presented in Section 3, including robustness to lack of sparsity, the sparsity-adaptive property and the relation to the oracle. Section 4 shows numerical examples and contrasts the outcomes with two state-of-the-art methods. The detailed proofs of the main results and of a number of auxiliary results are collected in the Appendix.

Correlation Test: CorrT
In standard doubly robust methods, as proposed by [44,54], the test for a particular parameter of interest is constructed by solving the estimating equations with respect to that parameter; the solution provides a direct and consistent way of constructing confidence intervals or hypothesis tests from (1.4). However, in the high-dimensional setting, we abandon this approach and construct a test statistic directly mimicking the estimating equations. There is a significant advantage to doing so when the model parameter or the feature correlation can be extremely dense.

Self-adaptive and sparsity-adaptive estimation
We begin by providing estimates for the unknown parameters γ * and θ * , which will be used to construct the CorrT test.
One approach to estimating these parameters is to apply one of the many methods designed for sparse problems, e.g., the Lasso estimator, the Dantzig selector, the self-tuning methods of [5], [59] and [24], and many more. Although these methods often work well in sparse models, they are extremely sensitive to model misspecification and the level of sparsity. Here, we provide adaptive and stable estimators that can be used in conjunction with (1.4). Our new algorithms introduce problem-specific constraints for sparsity-inducing regularization. This construction delivers adaptivity in the following sense: when the estimation target fails to be sparse, the size of the estimated residuals is stable and does not diverge too quickly with the sample size, due to the constraints; when the estimation target is sparse, the estimator automatically achieves consistency through the ℓ1-regularization.
Our method is based on the solution path of the following linear program. For a > 0, let

γ̂(a) := arg min_{γ ∈ R^p} ‖γ‖₁ subject to the constraints in (2.1),

where the constants η₀ = 1.1 n^{−1/2} Φ^{−1}(1 − p^{−1}n^{−1}) and ρ_n = 0.01/√(log n). The first inequality in (2.1), much like the Dantzig selector, provides a necessary gradient condition guaranteeing that the gradient of the loss is close to zero. Observe that the constant a is different from (Eε_i²)^{1/2}, as the model is heteroscedastic in nature. The optimal choice of the constant a is driven by (2.2). The constant η₀ is scale-free and is derived from moderate deviation results for self-normalized sums. The second and third inequalities introduce stability in estimation whenever the model parameter is grossly non-sparse.
Estimation of the "variance" is extremely important for any testing problem. Since we consider a heteroscedastic model, the overall variance of the error terms might not be informative. For this reason, we invoke moderate deviation results for self-normalized sums and would like a to target the quantity a_{0,γ} defined in (2.2). This quantity mimics the celebrated heteroscedasticity-robust standard error of [69]. Hence, we treat the unknown quantity a_{0,γ} as a parameter, which is chosen to be the largest element of a set S_γ; in other words, â_γ = arg max{a : a ∈ S_γ}. The requirements in the set S_γ can be viewed as a data-dependent way of detecting sparsity and selecting the tuning parameter a.
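A sample analogue of such a self-normalized, White-type scale can be computed as below. This is a sketch based on our reading of the text: the function name and the exact normalization, max_j (n^{-1} Σ_i x_{ij}² e_i²)^{1/2}, are our assumptions, not necessarily the paper's exact definition of a_{0,γ}:

```python
import numpy as np

def white_type_scale(X, resid):
    """Largest column-wise heteroscedasticity-robust scale of the score
    X^T e / n, i.e., max_j (n^{-1} sum_i x_{ij}^2 e_i^2)^{1/2}."""
    col_means = np.mean((X ** 2) * (resid ** 2)[:, None], axis=0)
    return float(np.sqrt(np.max(col_means)))

rng = np.random.default_rng(7)
n, p = 5000, 20
X = rng.standard_normal((n, p))
resid = rng.standard_normal(n)   # homoscedastic case: target value is about 1
a0_hat = white_type_scale(X, resid)
```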
Our final estimator γ̂, given in (2.3), is then defined by combining the solution path with the data-driven choice â_γ. When γ* is sparse, we show that S_γ ≠ ∅ with probability approaching one and that the ℓ1-norm in the objective function, the constraint on n^{−1}‖X^⊤(V − Xγ)‖_∞ and the definition of S_γ induce the self-adapting property by adjusting â_γ to be close to max_{1≤j≤p} E[x_{i,j}² ε_i²]. Thus, for sparse γ*, γ̂(â_γ) is consistent and behaves like other sparsity-based estimators, such as the Lasso or the Dantzig selector with an ideal choice of tuning parameters.
When γ* is not sparse, the constraints in the definition of γ̂(·) guarantee that ‖X^⊤(V − Xγ̂)‖_∞ / ‖V − Xγ̂‖₂ does not grow too fast; see Lemma 3 in the Appendix. Hence, the resulting estimator γ̂ is not only stable but is also able to automatically "detect" sparseness and exploit it to achieve estimation accuracy.
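A minimal numerical sketch of the Dantzig-type gradient constraint in this program can be solved with an off-the-shelf LP solver. Note that the sketch keeps only the first (gradient) inequality and omits the additional stability constraints involving ρ_n; the function name and the data-generating choices are ours:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import norm

def l1_min_gradient_constraint(X, V, a, eta0):
    """min ||g||_1  s.t.  ||X^T (V - X g)||_inf / n <= a * eta0,
    written as an LP in (g_plus, g_minus) with g = g_plus - g_minus."""
    n, p = X.shape
    A, b = X.T @ X / n, X.T @ V / n
    A_ub = np.vstack([np.hstack([A, -A]), np.hstack([-A, A])])
    b_ub = np.concatenate([a * eta0 + b, a * eta0 - b])
    res = linprog(np.ones(2 * p), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:], res.status

rng = np.random.default_rng(2)
n, p = 100, 30
X = rng.standard_normal((n, p))
gamma_true = np.zeros(p); gamma_true[0] = 2.0
V = X @ gamma_true + rng.standard_normal(n)
eta0 = 1.1 * norm.ppf(1 - 1.0 / (p * n)) / np.sqrt(n)  # the constant from the text
g_hat, status = l1_min_gradient_constraint(X, V, a=1.0, eta0=eta0)
```

On this sparse example the solution recovers the single large coefficient up to the shrinkage induced by the constraint level a·η₀.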
The estimation of θ* is done in a similar manner. We define a ↦ θ̂(a) through the analogous linear program (2.4), where η₀ and ρ_n are as defined before. Let â_θ = arg max{a : a ∈ S_θ}, where S_θ is defined analogously to S_γ. The final estimator θ̂ is then defined as in (2.5).

CorrT Test
We now introduce the test statistic. The main difference between the proposed test and other high-dimensional tests is that its asymptotic distribution is not affected by misspecification (e.g., in terms of heteroscedasticity and sparsity) of the models (1.1) and (1.3). Motivated by the success of the test proposed by [73], our approach mimics the plug-in nature of their test while automatically tailoring the normalization of the test to the appropriate level. Let γ̂ and θ̂ be defined as in (2.3) and (2.5), respectively. We propose the following correlation test (CorrT) statistic:

T_n(β_0) = (n^{−1/2} ε̂^⊤û) / (n^{−1} Σ_{i=1}^n ε̂_i² û_i²)^{1/2}, (2.6)

where ε̂ = V − Xγ̂ and û = Z − Xθ̂. The notation T_n(β_0) indicates that the test statistic is constructed with the knowledge of the null hypothesis H_0 : β* = β_0. Whenever possible, we suppress its dependence on β_0 and use T_n instead.
The self-normalizing nature of the test statistic also has the advantage that it allows us to derive an analytical critical value for the test: we show that T_n(β_0), under the null hypothesis, converges in distribution to N(0, 1). Hence, a test with nominal size α ∈ (0, 1) rejects the hypothesis (1.2) if and only if |T_n(β_0)| > Φ^{−1}(1 − α/2).
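As an illustration, a self-normalized statistic of the kind described here can be computed directly from the two residual vectors. This is a sketch: the explicit normalization below is our reading of the text, and the residuals are stand-ins rather than outputs of the full CorrT estimation procedure:

```python
import numpy as np

def corrt_stat(eps_hat, u_hat):
    """Self-normalized correlation statistic: the product-of-residuals sum
    scaled by its heteroscedasticity-robust standard deviation."""
    n = len(eps_hat)
    num = eps_hat @ u_hat / np.sqrt(n)
    den = np.sqrt(np.mean(eps_hat ** 2 * u_hat ** 2))
    return float(num / den)

rng = np.random.default_rng(3)
n = 5000
eps_hat = rng.standard_normal(n)   # stand-in for V - X @ gamma_hat
u_hat = rng.standard_normal(n)     # stand-in for Z - X @ theta_hat
T = corrt_stat(eps_hat, u_hat)
reject = abs(T) > 1.959964         # nominal 5% two-sided test
```

With independent residuals, repeated draws of T would be approximately standard normal, matching the analytical critical value above.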
We briefly discuss the mechanism of our test and outline the reason for its robustness to the lack of sparsity in γ* or θ*; in fact, the method is blind to the choice of the sparse parameter and provides valid inference when either of the two models (1.1) and (1.3) is sparse. The product structure in (2.6) helps to decouple the bias introduced by γ̂ or θ̂, allowing us to establish uncorrelatedness between the estimated residuals ε̂ and û under the null hypothesis H_0. We can show, without assuming sparsity of γ*, that n^{−1/2} ε̂^⊤û = n^{−1/2} ε̂^⊤u + O_P(√(log p) ‖θ̂ − θ*‖₁). Under the null hypothesis, the first term on the right-hand side has zero expectation and the second term vanishes fast enough due to sparse θ*. On the other hand, without assuming sparsity of θ*, we can show that n^{−1/2} ε̂^⊤û = n^{−1/2} ε^⊤û + O_P(√(log p) ‖γ̂ − γ*‖₁). Under the null hypothesis, the first term on the right-hand side has zero expectation and the second term vanishes fast enough due to sparse γ*.
The construction of γ̂ and θ̂ as in (2.3) and (2.5) is critical to establishing the limiting distribution of the leading term in the decomposition above. We note that commonly used estimators in sparse regression of Y against X and Z do not deliver such de-correlation and, for non-sparse γ* or θ*, do not possess tractable properties.

Computational Aspects: parametric simplex method
The proposed method allows for efficient implementation. We point out that both optimization problems (2.1) and (2.4) can be written in the form of a parametric linear program, where the "additional unknown parameter" a is present only as an upper bound in the linear constraints. It is well-known that, if a problem can be cast as a parametric linear program, then the parametric simplex method can be used to obtain the optimal solution. The computational burden of obtaining the solution path a ↦ γ̂(a) is the same as the burden of computing the solution for one value of a; see [66] and [47]. We now explicitly formalize the optimization problem (2.1) as a parametric linear program; the formulation of (2.4) is analogous. Let a > 0 and write γ = γ⁺ − γ⁻ with γ⁺, γ⁻ ≥ 0, so that the optimization variable is b = (γ⁺, γ⁻) ∈ R^{2p}; then γ̂(a) is recovered from the solution b̂(a) of a parametric right-hand-side linear program in which a enters only through the right-hand side of the constraints. It is known that the solution of a parametric simplex algorithm is a piecewise linear and convex function of a.
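The right-hand-side parametrization can be illustrated by re-solving the ℓ1-minimization for a decreasing sequence of bounds τ = a·η₀: shrinking the feasible set cannot decrease the optimal ℓ1-norm, so the path is monotone in τ. This sketch calls a generic LP solver once per value rather than running a true parametric simplex, and keeps only the gradient constraint:

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(X, V, tau):
    """min ||g||_1 s.t. ||X^T (V - X g)||_inf / n <= tau, as an LP in
    (g_plus, g_minus); tau enters only through the right-hand side."""
    n, p = X.shape
    A, b = X.T @ X / n, X.T @ V / n
    res = linprog(np.ones(2 * p),
                  A_ub=np.vstack([np.hstack([A, -A]), np.hstack([-A, A])]),
                  b_ub=np.concatenate([tau + b, tau - b]),
                  bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(4)
n, p = 60, 15
X = rng.standard_normal((n, p))
V = 1.5 * X[:, 0] + rng.standard_normal(n)
taus = (0.8, 0.4, 0.2, 0.1)
path = [float(np.abs(l1_min(X, V, tau)).sum()) for tau in taus]
# ||g(tau)||_1 is nondecreasing as tau shrinks
```

A parametric simplex implementation would instead pivot along the (piecewise linear) solution path in a single pass over the breakpoints in a.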

Theoretical Properties
In this section we present theoretical properties of the proposed method and consider an asymptotic regime in which p and the sparsity level of γ * or θ * are allowed to grow to infinity faster than n.
Suppose that we have n independent and identically distributed (i.i.d.) observations, indexed by i = 1, . . . , n. For each observation, we have access to the response variable y_i along with a set of auxiliary covariates. Here, x_i, z_i and y_i denote the i-th row of X, the i-th entry of Z and the i-th entry of Y, respectively. The j-th entry of x_i is denoted by x_{i,j}. Similar notation is used for u_i, v_i and ε_i, which are the i-th entries of u, V and ε, respectively.
In order to establish the theoretical claims, we impose the following regularity condition.
A few comments are immediate. For generality, we set up Condition 1 in an abstract way. We notice that, in the context of our main problems of interest, requiring Condition 1 is not particularly stringent. Unlike [73] or [31], among references that allow dense model parameters, we require only sub-Gaussian designs.
In addition, we consider a non-sparse model that allows for conditional heteroscedasticity. Non-sparse models with errors ε_i independent of the features were studied by [73]. However, it has been observed (e.g., [40,3]) that heteroscedasticity can have severe consequences for high-dimensional models; in particular, the Lasso and the Dantzig selector become very unstable in this context. Heteroscedasticity in this paper refers to conditional heteroscedasticity, meaning that the conditional variance of ε_i depends on the features. Even for low-dimensional models, heteroscedasticity has been known to cause several complications (for example, the Gauss-Markov theorem no longer applies), and there is a large literature on constructing valid inference in the presence of heteroscedasticity; see e.g., [48,26,69].
Conditional heteroscedasticity is also allowed in the auxiliary model, therefore allowing for feature heteroscedasticity; for example, variance of one feature may directly depend on many other features in the model. This includes a wide spectrum of dependence structures among the features.
Our next assumption controls the size of the two models.

Condition 2.
For some constant κ₆ > 0, ‖γ*‖₂ ≤ κ₆ and s_θ = o(n^{1/2}/(log(p ∨ n))^{5/2}), where s_θ = ‖θ*‖₀. The rate for s_θ is slightly stronger than the conditions in [3] and [45], who impose o(√n/log p), and in [63], who impose o(n/log p). However, we do not impose that the original model is sparse, as long as a row of the precision matrix, θ*, is sparse enough. This can be interpreted as the cost of allowing for non-sparse γ* as well as general heteroscedastic model errors. Moreover, the validity of CorrT is also guaranteed for dense θ* and sparse γ*; see Theorem 3. Therefore, our test can automatically detect which one of the two (γ* or θ*) is sparse and utilize it for effective inference; hence, the practitioner does not need to know which of the two is sparse when applying our method. Additionally, in studying the estimation problem for signals with many nonzero entries, it is common to consider parameters in bounded convex balls, e.g., [20], [51] and [31]. Notice that a bounded ℓ2-norm by itself does not impose direct constraints on the sparsity; see e.g., [31].

Size properties: validity regardless of sparsity
Given these assumptions, we are now ready to provide an asymptotic characterization of the proposed CorrT test.
Theorem 2 formally establishes that the new CorrT test is asymptotically exact in testing β* = β_0 while allowing p ≫ n. In particular, CorrT is robust to dense γ* or θ*, in the sense that even under dense γ* or θ*, the proposed procedure does not generate false positive results. Compared to existing methods, we have moved away from the unverifiable model sparsity assumption and hence can handle more realistic problems.
Remark 1. Our result is theoretically intriguing, as it circumvents the limitations of the "inference based on estimation" principle. This principle relies on accurate estimation, which is challenging, if not impossible, for non-sparse and high-dimensional models. To see the difficulty, consider the full model parameter π* := (β*, γ*^⊤)^⊤ ∈ R^{p+1}. The minimax rate of estimation in terms of ℓ2-loss for parameters in a (p+1)-dimensional ℓq-ball with q ∈ [0, 1] and radius r_n is r_n (n^{−1} log p)^{1−q/2} (see [51]). Theorem 2 says that CorrT is valid even when π* cannot be consistently estimated. For example, suppose that log p ≍ n^c for a constant c > 0 and π*_j = 1/√p for j = 1, . . . , p + 1. Then, as n → ∞, ‖π*‖_q^q (n^{−1} log p)^{1−q/2} → ∞ for any q ∈ [0, 1], suggesting potential failure in estimation.
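To spell out the divergence claim (under the standard convention that the ℓq-ball radius r_n plays the role of ‖π*‖_q^q), a short computation gives:

```latex
\|\pi^*\|_q^q = (p+1)\,p^{-q/2} \;\ge\; p^{\,1-q/2}, \qquad q \in [0,1].
% With \log p \asymp n^c, we have p^{1-q/2} = e^{(1-q/2)\log p} \ge e^{C(1-q/2)n^c},
% so the minimax lower bound satisfies
\|\pi^*\|_q^q\,(n^{-1}\log p)^{1-q/2}
  \;\ge\; e^{C(1-q/2)\,n^{c}}\; n^{-(1-q/2)} \;\longrightarrow\; \infty,
```

since 1 − q/2 ≥ 1/2 > 0 for every q ∈ [0, 1], and the exponential factor e^{C(1−q/2)n^c} dominates any polynomial in n.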
Instead of relying solely on estimation of π*, we impose the null hypothesis in our inference procedure and fully exploit its implications. Owing to the careful construction of the test, the inaccuracy in estimating π* does not impair the validity (Type I error control) of our approach.
By an argument almost analogous to the proof of Theorem 2, we can show the following result.
By Theorem 3, the asymptotic validity of CorrT still holds if we replace s_θ with s_γ := ‖γ*‖₀ in the statement of Condition 2. This means that, as long as one of γ* and θ* is sparse, CorrT delivers valid inference. In this sense, CorrT automatically detects and adapts to sparse structures from different sources in the data, either in the model parameter or in the covariance matrix of the features; prior knowledge of the exact source is not needed.
We note that confidence sets for β* can be constructed by inverting the CorrT test. Let 1 − α be the nominal coverage level and define the confidence set as the collection of parameter values that the α-level CorrT test fails to reject: C(α) = {β_0 ∈ R : |T_n(β_0)| ≤ Φ^{-1}(1 − α/2)}. (3.3) Theorem 2 guarantees the validity of this confidence set.
Moreover, the same conclusion holds if we replace Conditions 1 and 2 with Conditions 3 and 4.
Corollary 4 implies that the confidence set (3.3) provides exact asymptotic coverage even when the nuisance parameter γ* or θ* is non-sparse, in the sense that s_γ/p → 1 or s_θ/p → 1 with p ≫ n. To the best of our knowledge, this result is unique in the existing literature. As we step further into the non-sparse regime (taking more and more entries to be non-zero), existing results can yield confidence sets with coverage probability far below the nominal level. In the example discussed in Section 1.2 with s_γ = p, Figure 1 indicates that the 99% confidence interval based on the debiasing method can have coverage probability around 20%.
We would also like to point out that Theorem 2 holds uniformly over a large parameter space and over a range of null hypotheses. To formally state the result, let ξ* denote the nuisance parameter, so that the data-generating distribution is determined by (β*, ξ*); the probability distribution under (β*, ξ*) is denoted by P_(β*,ξ*). The space for the nuisance parameter we consider is Ξ(s_θ), defined through restrictions involving constants κ_1, . . . , κ_6 > 0. We have the following result.
Corollary 5 says that Theorem 2 holds uniformly in a large class of distributions. An analogous result can be stated for Theorem 3.

Power properties
In this subsection, we investigate the power properties of CorrT. We first observe that it is still unclear what statistical efficiency means for inference in non-sparse high-dimensional models. Various minimax results on both estimation [51] and inference [30] suggest that no procedure is guaranteed to accurately identify non-sparse parameters. In light of these results, it appears that, for dense high-dimensional models, it is quite difficult, if at all possible, to obtain tests that are powerful uniformly over a large class of distributions, e.g., a bounded ℓ_2-ball for γ*. However, our test does have desirable power properties for certain classes of models of practical interest. First, we show that our test is efficient for sparse models. Second, we show that our test attains the optimal detection rate of order n^{-1/2} for important dense models.

Adaptivity: efficiency under sparsity
In Section 1.2, we demonstrated that the lack of robustness can cause serious problems for existing inference methods designed for sparse problems. An oracle procedure with knowledge of s_γ = ‖γ*‖_0 would proceed as follows: (1) use existing sparsity-based methods for sparse models in order to achieve efficiency in inference and (2) resort to conservative approaches to retain validity for dense problems. However, the sparsity level s_γ is rarely known in practice. Hence, an important question is whether it is possible to design a procedure that automatically adapts to the unknown sparsity level s_γ and matches the above oracle procedure in the following sense. We say that a procedure for testing the hypothesis (1.2) is sparsity-adaptive if it (i) does not require knowledge of s_γ, (ii) provides valid inference under any s_γ and (iii) achieves efficiency under sparse γ*.
In this subsection, we show that the CorrT test possesses this sparsity-adaptive property. It is clear from Section 2 that CorrT does not require knowledge of s_γ. In Section 3.1, we showed that CorrT provides valid inference without any assumption on s_γ. We now show the third property, efficiency under sparse γ*. To formally discuss our results, we consider testing H_0: β* = β_0 against the local alternative H_{1,h} in (3.4), under which β* deviates from β_0 by h/√n. Such an alternative is called a Pitman alternative and is typically used to assess the asymptotic efficiency of tests; see [65] and [37] for classical treatments of local power.
Theorem 6 establishes the power properties of the proposed CorrT test.

Remark 2.
In the case of homoscedastic errors with σ_ε^2 = Eε_1^2 and σ_u^2 = Eu_1^2, we can compare the local power of CorrT with that of existing methods. In particular, we consider the local power of [63]; a similar analysis applies to [3] and [45]. Let b_Lasso denote the debiased Lasso estimator defined in [63]. Under homoscedasticity, the test of [63], referred to as the [63] test, rejects when the studentized debiased estimator exceeds the normal critical value. Hence, it is not hard to see that the power of the [63] test against H_{1,h} in (3.4) is asymptotically equal to Ψ(k, κ, α) defined in Theorem 6. Since b_Lasso has been shown to be a semi-parametrically efficient estimator of β*, CorrT is asymptotically equivalent to tests based on efficient estimators.
Moreover, results from [33] suggest that our test is also minimax optimal whenever the model is homoscedastic. By Theorem 2.3 therein (adapted to the Gaussian setting), a minimax optimal α-level test of β* = β_0 against H_{1,h} in (3.4) has power bounded above by the expression in (3.5), which involves Φ(a_n h σ_u σ_ε^{-1} ·) and the tail function F_k. In (3.5), ℓ_n = √n log n, a_n = (n − s_γ + ℓ_n)/n and F_k(x) = P(W_k ≥ x), where W_k denotes a random variable from χ²(k), the chi-squared distribution with k degrees of freedom. By Bernstein's inequality applied to a sum of i.i.d. χ²(1) random variables, one can easily show that F_{n−s_γ+1}(n − s_γ + ℓ_n) = o(1). Hence, as n → ∞, the limit of the bound in (3.5) equals Ψ(k, κ_0, α) defined in Theorem 6. However, Theorem 6 also holds under the heteroscedastic model (1.1). Observe that the only condition regarding heteroscedasticity concerns the second moment between the design and the errors, summarized by κ. Hence, the local power depends only on κ, even though heteroscedasticity allows a very rich class of dependence between ε_i and u_i. Theorems 2 and 6 together establish the sparsity-adaptive property discussed at the beginning of Section 3.2.1: CorrT automatically detects sparse structure in γ* and utilizes it to optimize power, while maintaining validity even in the absence of sparsity.
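The chi-squared tail claim F_{n−s_γ+1}(n − s_γ + ℓ_n) = o(1) is easy to check numerically. The sketch below is our own (the choices of n and s_γ are illustrative) and uses SciPy's survival function for the chi-squared distribution:

```python
from math import log, sqrt
from scipy.stats import chi2

def tail(n, s_gamma):
    """F_k(x) = P(W_k >= x) with k = n - s_gamma + 1 and x = n - s_gamma + l_n."""
    l_n = sqrt(n) * log(n)          # l_n = sqrt(n) log n, as in the text
    k = n - s_gamma + 1             # degrees of freedom
    x = n - s_gamma + l_n           # threshold: roughly l_n above the mean k
    return chi2.sf(x, df=k)

for n in (200, 1000, 5000):
    print(n, tail(n, s_gamma=n // 10))
# the tail probability shrinks toward zero as n grows
assert tail(5000, 500) < tail(1000, 100) < tail(200, 20) < 0.05
```

Since the mean of χ²(k) is k and its standard deviation is √(2k), the threshold sits roughly ℓ_n/√(2k) standard deviations above the mean, which diverges, so the tail vanishes.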

Extremely dense models
One important class of non-sparse models is what we shall refer to as "extremely dense models". In these models, the entries of the parameter are all individually small but collectively strong as the dimension of the model explodes. In our testing problem (1.2), this translates into an extremely dense nuisance parameter γ* with ‖γ*‖_0 = p ≫ n. Consistent estimation of such extremely dense nuisance parameters might not be possible. Nevertheless, we are able to characterize the power of the proposed test in the presence of such a dense nuisance parameter and obtain a strong result indicating that the proposed test controls Type II error very well. To the best of our knowledge, the next result is unique.
We can draw a few conclusions from Theorem 7. For n, p → ∞ with √(log p/n) = o(1), the Type II error of the proposed CorrT test, against alternatives with deviations of order larger than n^{-1/2}, converges to zero.
For the case of independent columns of the design matrix and the case of Toeplitz designs, the condition ‖Σ_X γ*‖_∞ = O(√((log p)/n)) is satisfied if ‖γ*‖_∞ = O(√((log p)/n)), as long as γ* lies in a bounded ℓ_2-ball. More generally, ‖γ*‖_∞ = O(√((log p)/n)) is a sufficient condition under any covariance matrix Σ_X satisfying max_{1≤j≤p} ‖Σ_{X,j}‖_1 = O(1) (a condition sometimes imposed for consistent estimation of covariance matrices).
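The sufficiency of the sup-norm condition under bounded column ℓ_1-norms of Σ_X is just Hölder's inequality, ‖Σ_X γ*‖_∞ ≤ max_j ‖Σ_{X,j}‖_1 · ‖γ*‖_∞. A small numeric illustration with a Toeplitz Σ_X (our own sketch; the sizes and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 200, 500, -0.5

# Toeplitz covariance: Sigma_{jk} = rho^{|j-k|}; its columns have bounded l1 norm
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
col_l1 = np.abs(Sigma).sum(axis=0).max()

# a fully dense gamma* with entries of size O(sqrt(log p / n))
gamma = rng.uniform(0, np.sqrt(np.log(p) / n), size=p)

# Hoelder: ||Sigma gamma||_inf <= max_j ||Sigma_j||_1 * ||gamma||_inf
lhs = np.abs(Sigma @ gamma).max()
bound = col_l1 * np.abs(gamma).max()
print(lhs, bound)
assert lhs <= bound + 1e-12
```

For ρ = −1/2 each column ℓ_1-norm is at most 1 + 2(1/2 + 1/4 + ...) = 3, so the bound stays O(‖γ*‖_∞) no matter how large p grows.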
Theorem 7 offers new insight into inference for dense models. When we test the full parameter (β*, γ*) against an alternative, the minimax test might have no power against a "fixed alternative" (deviation bounded away from zero) if the parameter is not sparse; see [30]. Theorem 7 says that testing single entries of a potentially dense model is an entirely different problem: we can indeed have power against alternatives of order n^{-1/2}, just as in the case of fixed p; hence this rate is optimal. This also implies that confidence intervals computed by inverting the test T_n are optimal regardless of the sparsity level of γ*.

Hybrid high-dimensional model
To showcase the wide applicability of our method and to bridge sparse and dense models, we consider what is perhaps a more practical scenario for the nuisance parameter γ*. Under the so-called "sparse + dense" model, γ* is the sum of a sparse vector and an extremely dense vector. Estimation of high-dimensional models with this hybrid sparsity is discussed by [15], who derive bounds for prediction errors. However, their results concern only estimation; statistical inference for these models, even in moderately high dimensions, has remained an open problem. The following result offers the first solution to this issue.
Theorem 8 says that when a high-dimensional model has parameters with this hybrid structure, tests of single entries can detect deviations of order n^{-1/2}, the optimal rate even for low-dimensional models.
Condition (4) of Theorem 8 is a natural restriction on the interaction between the sparse part and the dense part of the model. When Σ_X = I_p, condition (4) is automatically satisfied if π* and µ* have disjoint supports. Since (π* + µ*)'Σ_X µ* + E(ε^2

Numerical Examples
In this section we evaluate the finite-sample performance of the proposed method via Monte Carlo simulations. We also illustrate CorrT using a dataset on breast cancer trials.

Simulation Examples
We simulate the linear model Y = Wπ* + ε as follows. The design matrix W ∈ R^{n×p} consists of i.i.d. rows from one of the following distributions: LTD (light-tailed design): N(0, Σ_(ρ)) with the (i, j) entry of Σ_(ρ) being ρ^{|i−j|}.
HTD (heavy-tailed design): each row has the same distribution as Σ_(ρ)^{1/2} U, where U ∈ R^p contains i.i.d. random variables with Student's t-distribution with 3 degrees of freedom (the third moment does not exist). We consider both uncorrelated designs (ρ = 0) and correlated designs (ρ = −1/2). The error term ε ∈ R^n is drawn as a vector of i.i.d. random variables from either N(0, 1) (light-tailed error, LTE) or Student's t-distribution with 3 degrees of freedom (heavy-tailed error, HTE). Let s = ‖π*‖_0 denote the model sparsity; we vary the simulation settings from extremely sparse (s = 3) to extremely dense (s = p). We set the model parameters as follows: π*_j = 2/√n for 2 ≤ j ≤ 4, π*_j = 0 for j > max{s, 4}, and the other entries of π* are i.i.d. random variables uniformly distributed on (0, 4/√n).
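For concreteness, the data-generating process just described can be sketched as follows. This is a minimal reimplementation of the simulation design, not the authors' code; the function name and defaults are ours:

```python
import numpy as np

def simulate(n, p, s, rho=-0.5, design="LTD", error="LTE", seed=0):
    """Draw (W, Y, pi*) from the simulation design of the paper (a sketch)."""
    rng = np.random.default_rng(seed)
    # Toeplitz covariance Sigma_(rho) with (i, j) entry rho^{|i-j|}
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(p))
    if design == "LTD":                       # light-tailed: N(0, Sigma)
        W = rng.standard_normal((n, p)) @ L.T
    else:                                     # HTD: Sigma^{1/2} U with U_j ~ t(3)
        W = rng.standard_t(df=3, size=(n, p)) @ L.T
    # pi*_j = 2/sqrt(n) for j in {2, 3, 4} (1-based indexing, as in the text)
    pi = np.zeros(p)
    pi[1:4] = 2 / np.sqrt(n)
    # remaining nonzeros (j = 5, ..., s) are i.i.d. Uniform(0, 4/sqrt(n))
    extra = max(s, 4)
    pi[4:extra] = rng.uniform(0, 4 / np.sqrt(n), size=max(extra - 4, 0))
    eps = rng.standard_normal(n) if error == "LTE" else rng.standard_t(df=3, size=n)
    return W, W @ pi + eps, pi

W, Y, pi = simulate(n=100, p=300, s=3)
assert W.shape == (100, 300) and Y.shape == (100,)
assert int((pi != 0).sum()) == 3    # s = 3: only pi*_2, pi*_3, pi*_4 are nonzero
```

Changing `s` to `p` reproduces the extremely dense setting in which the sparsity-based competitors break down.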
We test the hypothesis H_0: π*_3 = 2/√n; the rejection probability under this null represents the Type I error.
We compare our method with the debiasing method of [63] and the score algorithm of [45]. For both approaches, we choose the scaled Lasso to estimate π* and treat the true precision matrix and the variance of the model as known. In a certain sense, these procedures are quasi-oracle, as they use part of the information on the (typically unknown) true probability distribution of the data. We do this primarily to reduce the arbitrariness of choosing tuning parameters (and hence the quality of inference) in implementing these methods. Moreover, these methods should not be expected to behave better in their original setting, with unknown Σ_(ρ) and noise level, than in our "ideal setting". All rejection probabilities are computed using 100 random samples. All the tests we consider have a nominal size of 5%.
We collect the results in Table 1, where we clearly observe the instability of the competing debiasing and score methods. Their Type I error diverges steadily away from 5% as the sparsity s grows; in extreme cases, the Type I error rate can reach 50%, close to a random guess. In contrast, CorrT remains stable even when the model sparsity level equals p. This remains true as we vary the correlation among the features and the heaviness of the tails in the distributions of the designs and errors.
We also investigate the power properties of testing H_0: π*_3 = 2/√n. The data are generated by the same model as in Table 1, except that the true value of π_3 is now π*_3 = 2/√n + h/√n. Table 2 presents full power curves for various values of h, which measures the magnitude of the deviation from the null hypothesis. Hence the far left of each curve gives the Type I error (h = 0), whereas the other points on the curves correspond to Type II errors (h ≠ 0). In all the power curves, the design matrices have a Toeplitz covariance matrix Σ_(ρ) with ρ = −1/2. The two plots in the first row of Table 2 correspond to extremely sparse models with s = 3: the LTD + LTE structure on the left and LTD + HTE on the right. In both figures, we observe that CorrT performs on par with the existing methods. The second row corresponds to the case s = n; here we clearly observe that CorrT outperforms both the debiasing and score tests by providing accurate Type I error control while also reaching full power quickly. Lastly, we present power curves for s = p; for these extremely dense models, the debiasing and score methods can have Type I error close to 50%, whereas CorrT still provides valid inference and achieves full power against alternatives that are 1/√n away from the null.

An application to transNOAH breast cancer trial data
We now use real data to illustrate the proposed method. The sample was selected from the transNOAH breast cancer trial (GEO series GSE50948), available for download at "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc. Genome-wide gene expression profiling was performed using micro RNA from biopsies of 114 pretreated patients with HER2+ breast cancer. The dataset contains gene expression values of about 20000 genes located on different chromosomes. Clinically, breast cancer is classified into hormone receptor (HR)-positive, HER2+ and triple-negative breast cancer. The "HER2-positive" subtype of breast cancer over-expresses the human epidermal growth factor receptor 2 (HER2). Research suggests that the BRCA1 protein regulates the activity of other genes, including tumor suppressors and regulators of the cell division cycle. However, the association between BRCA1 and other genes is not completely understood. Moreover, it is believed that BRCA1 may regulate pathways that repair the DNA damage introduced by certain drugs. Thus, understanding associations between BRCA1 and other genes provides a potentially important tool for tailoring chemotherapy in cancer treatment. HER2+ breast cancer is biologically heterogeneous, and developing successful treatments is highly important. We apply the method developed in Section 2 to this dataset with the goal of testing particular associations between BRCA1 and a few other genes, conditional on all remaining genes present in the study. Simply applying the Lasso results in an estimate that is identically zero.
Results are reported in Table 3, where we give the test statistics of the proposed CorrT test as well as of the debiasing test and the score test. We observe a number of genes for which the three tests report largely different values: CorrT reports a non-significant finding, whereas the debiasing and score tests report significant findings. These genes include B3GALNT1, C3orf62, TNFAIP1 and the LTB gene, which have previously been linked to lung cancer [8,68,71,50], as well as CCPG1, LRRIQ3 and LOC100507537, linked to prostate, colorectal and bladder cancer, respectively [60,2,62]. Such findings indicate that this dataset likely does not follow a sparse model and that previous methods might have reported false positives. Observe that even mutations related to diseases other than cancer, such as ataxia and the ELOVL4 gene, are reported as significant by the debiasing and score tests but are found insignificant by CorrT.
Chromatin regulators have been known to act as a guide for cancer treatment choice [35]. Here we identify the retinoblastoma binding protein RBBP4, a chromatin remodeling factor, as being highly associated with BRCA1 in all HER2+ patients. This indicates that early detection and chemotherapy treatment should target RBBP4.
Results in Table 3 demonstrate that a partial or complete loss of NARS2 is a significant event that can be associated with HER2+ breast cancer. The significance of this gene was confirmed in [27], where a significant correlation was found between the gene CCND1 (a known trigger of ER-positive breast cancer) and NARS2. Our analysis enriches these findings and shows a statistically significant association between NARS2 and BRCA1 in HER2+ cancer cells.
Lastly, Table 3 showcases the significance of a known breast tumor suppressor gene, IGF2R [46]. The mannose 6-phosphate/IGF2 receptor (M6P/IGF2R) is mutated in breast cancer, resulting in loss of IGF2 degradation [22] and leading to its use as a sensitivity marker for radiation, chemotherapy, and endocrine therapy. Germ-line mutations in BRCA1 predispose individuals to breast and ovarian cancers. A novel endogenous association of BRCA1 with Nmi (N-Myc-interacting protein) in breast cancer cells was detected in [38]. Nmi was found to interact specifically with BRCA1, both in vitro and in vivo, by binding to two major domains in BRCA1. Table 3 showcases the significance of the association between Nmi and the BRCA1 gene.
In the above analysis, our method confirms existing knowledge and also delivers new discoveries in medicine. This provides evidence that our methods can be useful for scientific research involving high-dimensional datasets.

Conclusion
In summary, our procedure achieves the optimal detection rate while adapting to any level of sparsity, or lack thereof. CorrT is therefore extremely useful for applied scientific studies where the sparsity level is unknown a priori. Since there are currently no procedures for testing the sparsity assumption, CorrT is exceptionally practical. It possesses a regime-encompassing property in that it behaves optimally in all three regimes, sparse, dense and hybrid, and does so automatically without user intervention while allowing heteroscedastic errors, thereby providing a comprehensive tool for inference in high-dimensional linear models.

Appendix A Assumptions for Theorem 3
Condition 3. There are constants κ_1, ..., κ_5 > 0 such that the following hold.

B.1 Proof of Theorem 1

Proof of Theorem 1. The proof proceeds in two steps. We first show that P(θ̂ = 0) → 1 and then prove the desired result.
Notice that x^2_{i,j} is a chi-squared random variable with one degree of freedom. Since x^2_{i,j} − 1 is a centered chi-squared random variable, it has bounded sub-exponential norm, so Proposition 5.16 of [67] and the union bound imply that for some constants c_1, c_2 > 0 and for any z > 0, a Bernstein-type bound on P(max_{1≤j≤p} |n^{-1} Σ_{i=1}^n x^2_{i,j} − 1| > z) holds. It follows that there exists a constant c_3 > 0 such that Let w > 0 satisfy 2 log w = 1.45^2 · 2 log p / (1 + c_3 n^{-1} log p). Then, on the event A_n, where (i) holds by the definition of A_n, (ii) holds by the definition of w, (iii) holds by Lemma 1 and (iv) holds by the definition of w. By the law of iterated expectations and the union bound, we have that where (i) holds by p → ∞, n^{-1} log p → 0 and (B.2). Notice that
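The concentration of the column-wise second moments used in this step can be illustrated numerically. The sketch below is ours (the constant 6 is a generous choice, not the paper's c_3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 5000
X = rng.standard_normal((n, p))

# column-wise averages of chi-squared(1) variables x_{ij}^2
col_means = (X ** 2).mean(axis=0)
dev = np.abs(col_means - 1).max()

# sub-exponential (Bernstein) concentration: max deviation of order sqrt(log p / n)
scale = np.sqrt(np.log(p) / n)
print(dev / scale)
assert dev < 6 * scale
```

The observed maximum deviation lands at a small constant multiple of √(log p / n), as the union-bound argument predicts.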

B.2 Proof of Theorem 2
The proof of Theorem 2 is a consequence of a series of lemmas presented below.
We proceed in three steps with each step corresponding to one of the above claims. In the rest of the proof, we denote by v i the i-th entry of V.

Claim (a) follows by the fact that
Step 2: Establish the claim (b).
By the law of large numbers and the fact that Eε^2_1 is bounded away from zero, there exists a constant M > 0 such that P(‖V‖_2/√n ≥ M) → 1. On the other hand, ε_i has bounded sub-Gaussian norm, and thus by the union bound, ‖ε‖_∞ = O_P(√log n). Since √n/log^2 n ≫ √log n, claim (b) follows.
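The bound ‖ε‖_∞ = O_P(√log n) for sub-Gaussian errors can be eyeballed numerically (our own sketch, with standard normal errors and an illustrative n):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10**5
eps = rng.standard_normal(n)
max_abs = np.abs(eps).max()

# for i.i.d. N(0, 1), max_i |eps_i| concentrates near sqrt(2 log(2n))
bound = np.sqrt(2 * np.log(2 * n))
print(max_abs, bound)
assert 0.5 * bound < max_abs < 1.5 * bound
```

The maximum of 10^5 standard normals sits close to √(2 log(2n)) ≈ 4.9, far below the √n/log² n threshold needed for claim (b).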
with probability tending to one, we have that

Proof of Lemma 3. We define a^2_{*,γ} = max_{1≤j≤p} n^{-1} Σ_{i=1}^n x^2_{i,j} ε^2_i and a_{0,γ} = ‖X‖_∞ ‖V‖_2/√n. We first show that P(2a_{0,γ} ≥ a_{*,γ}) → 1. By the law of large numbers, where (i) follows by (B.8) and (ii) follows by the fact that σ_ε is bounded away from zero. In other words, P(2a_{0,γ} ≥ a_{*,γ}) → 1. Notice that the feasible set of the optimization problem (2.1) is increasing in a. By Lemma 2, γ* lies in the feasible set of the optimization problem (2.1) for a = a_{*,γ} with probability approaching one. Hence, P(2a_{0,γ} ≥ a_{*,γ}) → 1 implies that P(A) → 1, where A denotes the event that γ* lies in the feasible set of the optimization problem (2.1) for a = 2a_{0,γ}. Observe that on the event A, γ̂ is well defined. By the last constraint in (2.1), we have that on the event A, Hence, on the event A, Define the event M = {Ŝ_γ = ∅}. On the event M A, γ̂ = γ̂(2a_{0,γ}) and thus where (i) follows by (B.9). On the event M^c A, Ŝ_γ ≠ ∅ and γ̂ = γ̂(â_γ) for some â_γ ∈ Ŝ_γ; therefore, where (i) follows by the definition of Ŝ_γ in (2.2). Notice that on the event M^c A, where (i) follows by the first constraint in (2.1) and the fact that γ̂ = γ̂(â_γ) and (ii) follows by (B.11).
As regards part (2), notice that on the event A, where (i) follows by the second constraint in (2.1) and (ii) follows by (B.9). Part (2) follows.
Lemma 4. Let Condition 1 hold. Then, with probability tending to one, θ* lies in the feasible set of the optimization problem (2.4) for a = a_{*,θ}, where a^2_{*,θ} = max_{1≤j≤p} n^{-1} Σ_{i=1}^n x^2_{i,j} u^2_i.
Proof of Lemma 4. The argument is identical to the proof of Lemma 2, except that V, γ*, a_{*,γ} and {ε_i}_{i=1}^n are replaced by Z, θ*, a_{*,θ} and {u_i}_{i=1}^n, respectively. Since Z = Xθ* + u holds regardless of whether or not H_0 holds, the same reasoning as in the proof of Lemma 2 applies.
By the well-behaved eigenvalues of Σ_X in Condition 1, we have that m defined in the statement of Theorem 6 of [58] satisfies m ≤ min{Cs, p} for some constant C > 0. Since s/p → 0 and s log p = o(n), we have that, for large n, By the well-behaved eigenvalues of Σ_X, K(s, 1, A) defined in [58] is bounded below by a constant C_0 > 0. Since X = ΨA, it follows by Theorem 6 of [58] (in particular, Equation (14) therein) that, for large n, P(min_{A ⊂ {1,...,p}, |A| ≤ s} The proof is complete.
where M_{i,j}, M_i and H_i denote the (i, j) entry of M, the i-th row of M and the i-th entry of H, respectively, and ‖M‖_∞ = max_{1≤j≤p, 1≤i≤n} |M_{i,j}|.
Proof of Lemma 6. The argument closely follows the proof of Theorem 7.1 of [6]. Define J = support(b) and ∆ = b̂ − b. By the triangle inequality, we have Therefore, where (i) follows by Hölder's inequality, (ii) follows by (B.14) and (B.15) and (iii) follows by Hölder's inequality and |J| ≤ s. Since ‖∆_{J^c}‖_1 ≤ ‖∆_J‖_1 and |J| ≤ s, we have that n^{-1} ∆'M'M∆ ≥ K^2 ‖∆_J‖^2_2 (due to the assumption in (B.13)). This and the above display imply that The first claim follows by Hölder's inequality. To show the second claim, we define ‖·‖_{M,j} on R^n by ‖a‖^2_{M,j} = n^{-1} Σ_{i=1}^n M^2_{i,j} a^2_i for a = (a_1, ..., a_n)'. For any a, b ∈ R^n, we define a_{M,j} = (M_{1,j} a_1, ..., M_{n,j} a_n)' ∈ R^n and b_{M,j} = (M_{1,j} b_1, ..., M_{n,j} b_n)' ∈ R^n and notice that where (i) follows by the triangle inequality for the Euclidean norm in R^n. Hence, ‖·‖_{M,j} is a semi-norm for any 1 ≤ j ≤ p. The semi-norm property of ‖·‖_{M,j} for all 1 ≤ j ≤ p implies that where (i) follows by (B.16) and (B.17). The second claim follows by observing The proof is complete.
Proof of Lemma 7. We define a^2_{*,θ} = max_{1≤j≤p} n^{-1} Σ_{i=1}^n x^2_{i,j} u^2_i and the events A_1 = {θ* lies in the feasible set of the optimization problem (2.4) for a = a_{*,θ}}, where K > 0 is the constant defined in Lemma 5. By Lemmas 4 and 5, P(A_1 A_2) → 1. We proceed in three steps. We first show that P(A_3) → 1 and then that P(A_4) → 1. Finally, we show the desired results.
Step 1: Establish that P(A_3) → 1. Notice that, on the event A_1 A_2, ‖n^{-1} X'(Z − Xθ̂(a_{*,θ}))‖_∞ ≤ η_0 a_{*,θ}, ‖n^{-1} X'(Z − Xθ*)‖_∞ ≤ η_0 a_{*,θ} and ‖θ̂(a_{*,θ})‖_1 ≤ ‖θ*‖_1. Then we can apply Lemma 6 with (H, M, b, b̂, σ, s, η) = (Z, X, θ*, θ̂(a_{*,θ}), a_{*,θ}, s_θ, η_0) and obtain that on the event A_1 A_2, Notice that a^2_{*,θ} = max Thus, the above display implies that where (i) holds by η_0 ≤ 1.1√(2n^{-1} log(pn)) (due to Lemma 1), ‖X‖_∞ = O_P(√log(pn)) (due to the sub-Gaussian property and the union bound) and the rate condition for s_θ. Therefore,

Step 2: Establish that P(A_4) → 1. On the event A_3, Ŝ_θ ≠ ∅ and thus θ̂ = θ̂(â_θ) for some â_θ ≥ a_{*,θ} (since â_θ is the maximal element of Ŝ_θ). Notice that the feasible set of the optimization problem (2.4) is nondecreasing in a and thus the mapping a ↦ ‖θ̂(a)‖_1 is non-increasing. Therefore, on the event A_1 A_2 A_3, ‖θ̂(â_θ)‖_1 ≤ ‖θ̂(a_{*,θ})‖_1. By ‖θ̂(a_{*,θ})‖_1 ≤ ‖θ*‖_1 (on the event A_1), it follows that on the event A_1 A_2 A_3, Hence, we can apply Lemma 6 with (H, M, b, b̂, σ, s, η) = (Z, X, θ*, θ̂(â_θ), â_θ, s_θ, η_0) and obtain that on the event A_1 A_2 A_3, In other words,

Step 3: Establish the desired results. Now notice that on the event A_1 A_2 A_3 A_4, where (i) is proved in Step 2, (ii) holds by the definition of Ŝ_θ and (iii) and (iv) hold by the definitions of A_3 and A_1, respectively. Hence, we can apply Lemma 6 with (H, M, b, b̂, σ, s, η) = (Z, X, θ*, θ̂(â_θ), 3a_{*,θ}, s_θ, η_0) and obtain that on the event A_1 A_2 A_3 A_4, where (i) follows by Lemma 1. We notice that where (i) follows by the law of large numbers and (ii) follows by ‖X‖_∞ = O_P(√log(pn)) (due to the sub-Gaussian property and the union bound).
The first result follows by (B.20), the above display and P(A_1 A_2 A_3 A_4) → 1. For the second result, we notice that on the event A_1 A_2 A_3 A_4, where (i) follows by the first constraint in (2.4) and the fact that θ̂ = θ̂(â_θ) on the event A_3, (ii) follows by the definition of A_4 and (iii) follows by a_{*,θ} = O_P(√log(pn)) (proved above) and η_0 = O(√(n^{-1} log(pn))) (due to Lemma 1). This proves the second result.
Notice that on the event A, where (i) follows by Lemma 7 and ‖X‖_∞ = O_P(√log(pn)) (due to the bounded sub-Gaussian norm of the entries of X and the union bound). Let σ^2_{u,i} = E(u^2_i | X, ε). Observe that (B.22) implies where (i) follows by the definition of A. Let F_{n,0} be the σ-algebra generated by (ε, X). By assumption, {u_i}_{i=1}^n is independent across i conditional on F_{n,0}. Notice that (V, X) depends only on (ε, X) (due to (B.21)) and that γ̂ and σ̂_ε are computed using only (V, X). Therefore, {ε̂_i}_{i=1}^n, σ̂_ε and 1{A} are F_{n,0}-measurable. Let K > 0 be a constant that upper bounds the sub-exponential norm of u^2_i − σ^2_{u,i} conditional on F_{n,0}; such a constant K exists because u_i has bounded sub-Gaussian norm conditional on F_{n,0}. Since E(u^2_i − σ^2_{u,i} | F_{n,0}) = 0 and {u^2_i − σ^2_{u,i}}_{i=1}^n is independent across i conditional on F_{n,0}, we apply Proposition 5.16 of [67] to the conditional probability measure P(· | F_{n,0}) and obtain that there exists a universal constant c > 0 such that on the event A, for all t > 0, where (i) follows by and (ii) follows by (B.25). By (B.22) and ρ_n log^2 n → ∞, the above display implies that The above display implies that It follows by (B.23) that Since P(c_1 ≤ n^{-1} Σ_{i=1}^n σ^2_{u,i} ≤ c_2) → 1 for some constants c_1, c_2 > 0, we have that We observe that where (i) follows by the definition of A and (ii) follows by (B.27) and Lemma 7.
where (i) follows by the definition of A and (ii) follows by the uniformly bounded sub-exponential norm of u^2_i and Corollary 2.6 of [9]. This proves claim (b). Notice that claim (a) follows from claim (b). By the definition of D, claim (c) follows by (B.26). Claim (d) follows by the assumption that P(c_1 ≤ n^{-1} Σ_{i=1}^n σ^2_{u,i} ≤ c_2) → 1 for some constants c_1, c_2 > 0. The proof is complete.

B.3 Proof of Corollary 5
Proof of Corollary 5. Similar to the comment made regarding Corollary 2.1 of [63], the proof of Corollary 5 is exactly the same as that of Theorem 2 once we notice the following.
Take an arbitrary sequence (β_{0n}, ξ_n) ∈ R × Ξ(s_θ). Notice that the properties of the test statistic T_n(β_{0n}) depend exclusively on X, Z and V. Moreover, observe that under H_0, V = Xγ* + ε.
Moreover, the arguments in the proofs of Lemmas 2-7 and Theorem 2 use bounds that hold in finite samples with universal constants, as well as the law of large numbers and a martingale central limit theorem for triangular arrays. Hence, Lemmas 2-7 and the arguments in the proof of Theorem 2 still go through when ξ_n is a sequence.

B.4 Proof of Theorems 6 and 7
Proof of Theorem 6. It suffices to notice that Theorem 6 is a special case of Theorem 8 in which the decomposition now reads γ * = π * + µ * with π * = γ * and µ * = 0. The result then follows by Theorem 8.
Proof of Theorem 7. It suffices to notice that Theorem 7 is a special case of Theorem 8 in which the decomposition now reads γ * = π * + µ * with π * = 0 and µ * = γ * . The result then follows by Theorem 8.

B.5 Proof of Theorem 8
The following result is due to [7] and pointed out by [41].
Suppose that min_{1≤j≤p} Var(w_{i,j}) ≥ 2K_1 and, for all 1 ≤ j ≤ p, the sub-exponential norm of w_{1,j} is upper bounded by K_2, where K_1, K_2 > 0 are constants.
Proof. Define t = Φ^{-1}(1 − p^{-1}n^{-1}) and ξ_{i,j} = w_{i,j} − m_j, as well as the following events, where K_3 > 0 is a constant to be chosen. The proof proceeds in two steps. We first prove that M_1, M_2, M_3 and M_4 occur with probability approaching one and then show the desired result.
Step 1: show that M_1, M_2, M_3 and M_4 occur with probability approaching one. Fix any δ ≥ 2. Notice that Eξ^2_{i,j} is bounded away from zero and E|ξ_{i,j}|^{2+δ} is bounded away from infinity. As in the proof of Lemma 2 (Step 1), Theorem 7.4 of [49] implies that there exist constants A, C_1 > 0 such that for any 1 ≤ j ≤ p, By the union bound, we have that where (i) follows by the fact that Φ(−t) = 1 − Φ(t) = p^{-1}n^{-1} and t = Φ^{-1}(1 − p^{-1}n^{-1}) ≤ √(2 log(pn)) (due to Lemma 1). By the bounded sub-exponential norm of ξ_{i,j}, there exist constants C_2, C_3 > 0 such that for any 1 ≤ j ≤ p and any z > 0, P(ξ^2_{1,j} > z) ≤ C_2 exp(−C_3 √z). Hence, for some constants C_3, C_4 > 0, where (i) follows by Lemma 8. By the union bound and log p = o(√n), Similarly, the above display holds if ξ_{i,j} is replaced by w_{i,j}. Hence, Moreover, notice that for some universal constant c > 0, where (i) follows by Proposition 5.16 of [67]. Hence, for Recall the elementary inequality that for any a, b ≥ 0, Moreover, observe that where (i) follows by Lemma 1 and (ii) holds by log p = o(√n).
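The quantile bound t = Φ^{-1}(1 − p^{-1}n^{-1}) ≤ √(2 log(pn)) used throughout this step is easy to verify numerically (our own check, with illustrative values of n and p):

```python
from math import log, sqrt
from scipy.stats import norm

# Lemma 1's bound: Phi^{-1}(1 - 1/(pn)) <= sqrt(2 log(pn))
for n, p in [(100, 1000), (500, 10**4), (1000, 10**6)]:
    t = norm.ppf(1 - 1 / (p * n))
    cap = sqrt(2 * log(p * n))
    print(n, p, round(t, 3), round(cap, 3))
    assert 0 < t <= cap
```

The bound follows from the Gaussian tail inequality 1 − Φ(x) ≤ e^{−x²/2}/2 for x ≥ 0, so the quantile always sits below √(2 log(pn)).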
Hence, on the event M_1 M_2 M_3, where (i) follows by the elementary inequality in (B.29), (ii) follows by the definitions of M_2 and M_3, and (iii) follows by the fact that 2Dtn^{-1} K_3 √log p < K_1 for large n (due to (B.30)). Now we have that on the event M_1 M_2 M_3, where (i) follows by (B.31), (ii) follows by the definition of M_2 and (iii) follows by (B.30).
Since D ≤ 0.04 √K_1, the above display implies that

Step 2: prove the desired result.
We proceed in four steps corresponding to above three claims as well as the second result.
We have shown claims (a)-(c), completing the proof of the first result.
Proof of Lemma 11. Define V = Y − Zβ_0, ε_{(h),i} = x_i'µ* + n^{-1/2} h u_i + ε_i and a^2_{h,γ} = n^{-1} max_{1≤j≤p} Σ_{i=1}^n x^2_{i,j} ε^2_{(h),i}, as well as the following events: A_1 = {γ_{h,n} lies in the feasible set of the optimization problem (2.1) for a = a_{h,γ}}.
where K > 0 is the constant defined in Lemma 5. By Lemmas 10 and 5, P(A_1 A_2) → 1.
The rest of the proof proceeds in two steps, which establish the behavior of the numerator and the denominator, respectively.
The proof is complete.