Thresholding least-squares inference in high-dimensional regression models

Abstract: We propose a thresholding least-squares method of inference for high-dimensional regression models when the number of parameters, p, tends to infinity with the sample size, n. Extending the asymptotic behavior of the F-test in high dimensions, we establish the oracle property of the thresholding least-squares estimator when p = o(n). We propose two automatic selection procedures for the thresholding parameter using Scheffé and Bonferroni methods. We show that, under additional regularity conditions, the results continue to hold even if p = exp(o(n)). Lastly, we show that, if properly centered, the residual-bootstrap estimator of the distribution of the thresholding least-squares estimator is consistent, while a naive bootstrap estimator is inconsistent. In an intensive simulation study, we assess the finite sample properties of the proposed methods for various sample sizes and model parameters. The analysis of a real-world data set illustrates an application of the methods in practice.


Introduction
There is a wide interest in developing statistical and computational methods and theory for high-dimensional regression models when the number of parameters, p, tends to infinity with the sample size, n. In this paper, we propose a thresholding least-squares estimator (TLSE) for high-dimensional linear regression models as a computationally efficient alternative to penalized least-squares. Thresholding inferential methods have been widely used in wavelet nonparametric regression (Donoho and Johnstone, 1994), wavelet nonparametric density estimation (Donoho et al., 1996), and estimation of sparse covariance matrices (Bickel and Levina, 2008; El Karoui, 2008). However, to the best of our knowledge, there is no systematic analysis of thresholding least-squares for high-dimensional linear regression models. The main purpose of this paper is to fill this gap in the literature.
One of the most popular penalized least-squares estimators for linear regression models is the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), which combines the favorable properties of model selection and ridge regression. Other penalized least-squares estimators are the Bridge estimators (Frank and Friedman, 1993), which include the LASSO as a special case. The asymptotic behavior of Bridge estimators was analyzed by Knight and Fu (2000). The smoothly clipped absolute deviation (SCAD) estimator was proposed by Fan and Li (2001), who also established that the SCAD estimator satisfies the oracle property. An estimator has the oracle property if it is variable selection consistent and the limiting distribution of its subvector corresponding to the non-zero coefficients is the same as if their set were known prior to estimation. Motivated by the fact that the LASSO does not have the oracle property, Zou (2006) proposed the adaptive LASSO (ALASSO) and proved its oracle property. All these methods and theoretical results have been developed under the assumption that p is fixed.
The literature on high-dimensional regression inference dates back to Huber (1973), who showed the asymptotic normality of M-estimators when p = o(n^{1/2}), results that were further extended by Portnoy (1984, 1985) to the case when p log(n) = o(n^{2/3}). Asymptotic theory for M-estimators was also developed by Mammen (1989) for the case when h n^{1/3} (log(n))^{2/3} → 0, where h is the maximum diagonal element of the hat matrix. The consistency of L_2-boosting, which is similar to the forward stagewise least-squares variable selection method, was proven by Bühlmann (2006) when p = o(exp(n)). Asymptotic error rates and power for some multi-stage regression methods were developed by Wasserman and Roeder (2009) for p = o(exp(n)). More recently, van de Geer, Bühlmann and Zhou (2011) compared the ALASSO with the thresholded LASSO in potentially misspecified regression models when p ≥ n in terms of prediction error, mean absolute error, mean squared error, and the number of false positive selections. The oracle property of the ALASSO and of the Bridge estimators was established by Huang, Ma and Zhang (2008) and Huang, Horowitz and Ma (2008), respectively, when p = o(n). Using a (marginal) componentwise estimator as an initial screening of relevant variables, they showed that, under additional regularity conditions, the results continue to hold even if p = exp(o(n)). The sure-independence screening methodology for ultra-high dimensional feature space was introduced by Fan and Lv (2008). More recently, Wang and Leng (2016) proposed an alternative screening procedure with improved statistical properties and similar computational complexity.
Bootstrap methods (Efron, 1979; Freedman, 1981) are popular computationally intensive alternatives to asymptotic inference which often improve the accuracy of inference on small samples (Hall, 1992). The consistency of the residual-bootstrap distribution of the least-squares estimator (LSE) was proved by Bickel and Freedman (1983) when p = o(n). The consistency of residual-bootstrap distributions of M-estimators in general, and of the LSE in particular, was proved by Mammen (1989, 1993). A parametric bootstrap in conjunction with thresholding inference for a high-dimensional mean with unknown covariance matrix was used by van der Laan and Bryan (2001). More recently, Chatterjee and Lahiri (2011) showed that the residual-bootstrap distribution of the LASSO is inconsistent when the model is sparse, i.e., when some regression coefficients are equal to zero, and that centering the bootstrap distribution at a consistent variable selection estimator provides consistent bootstrap inference. The consistency of the residual-bootstrap distribution of the ALASSO and the oracle property of the residual empirical process for high-dimensional regression models were proved by Chatterjee and Lahiri (2013) and Chatterjee, Gupta and Lahiri (2015).
In this paper, we propose a two-step thresholding least-squares method of inference for high-dimensional regression models. We first show the oracle property of the TLSE when p = o(n), based on an extension of the asymptotic distribution of the F-test for high-dimensional regression models. Similarly to Huang, Ma and Zhang (2008), Huang, Horowitz and Ma (2008), and Fan and Lv (2008), we then show that, using a componentwise least-squares estimator as an initial dimension reduction estimator, the resulting TLSE has the oracle property even when p = exp(o(n)). Our theoretical results require that the number of non-zero coefficients, q, be of order q = o(n); this constitutes an advantage of the TLSE compared to multi-stage regression methods (Wasserman and Roeder, 2009), which require q = O(1), and to the ALASSO and the Bridge estimators (Huang, Ma and Zhang, 2008; Huang, Horowitz and Ma, 2008), which essentially require q = o(n^{1/2}). We propose two automatic selection procedures for the thresholding parameter which ensure the oracle property of the TLSE using Scheffé and Bonferroni methods adapted for high-dimensional models. We further show that, when properly centered, the residual-bootstrap distribution of the TLSE is consistent. When the regression model is sparse, a naive bootstrap distribution of the TLSE, viewed as a random element in the space of probability distributions on a finite-dimensional space, converges in distribution to a random normal distribution, and is therefore inconsistent.
We conclude this section with an outline. In Section 2, we present the large sample properties of the TLSE for both cases: (i) p < n and (ii) p ≥ n, and present the automatic thresholding parameter selection methods. In Section 3, we study the asymptotic behavior of the bootstrap distribution of TLSE. In Section 4, we present the results of an empirical study of the finite sample properties of the proposed methods and in Section 5, we analyze a real-world data set to illustrate an application of the methods in practice. The proofs of theoretical results can be found in an Appendix.

Thresholding least-squares
Consider the linear regression model

Y_i = X_i^T β + ε_i,  i = 1, . . . , n,  (2.1)

where Y_i ∈ R is the response and X_i = (X_{i1}, . . . , X_{ip})^T ∈ R^p is the (non-random) explanatory variable corresponding to the ith subject, β ∈ R^p is the (unknown) regression parameter vector, and ε_i ∈ R is the (unobserved) error, with ε_1, . . . , ε_n ∼ i.i.d. P, where P is a distribution on R with E(ε) = 0 and var(ε) = σ^2 for ε ∼ P. For identifiability reasons, we assume throughout the paper that q, the number of the non-zero components of β, is smaller than the sample size, i.e., q < n. We are interested in statistical inference for β in the case when its dimension, p, increases with the sample size, n, and β is sparse, i.e., when some of its components are zero. For notational convenience, we suppress the dependence on n of p, q, X_i, and Y_i. Without loss of generality, by centering the response and standardizing the covariates, we assume that the intercept term has been removed from the set of predictors. Thus, Ȳ = 0, X̄^{(j)} = 0, and S^{(j)} = 1, where Ȳ = n^{-1} Σ_{i=1}^n Y_i, X̄^{(j)} = n^{-1} Σ_{i=1}^n X_{ij}, and S^{(j)} = n^{-1} Σ_{i=1}^n X_{ij}^2, j = 1, . . . , p. Let I = {1, . . . , p} and, for b ∈ R^p, let K_b = {j ∈ I : b_j = 0} and J_b = I \ K_b denote the index sets of the zero and non-zero components of b, respectively; let q_J = card(J) for J ⊂ I, where card(J) denotes the number of elements of J, so that q = q_{J_β}. Let Θ_1 = {b ∈ R^p : K_b ≠ ∅} and Θ_2 = R^p \ Θ_1, where ∅ is the empty set. For example, if p = 4 and β = (1.2, 0, -0.5, 0)^T, then K_β = {2, 4}, J_β = {1, 3}, and q = 2. The regression model (2.1) is called sparse if β ∈ Θ_1, i.e., when some of the regression coefficients are 0.
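For concreteness, the following minimal R sketch generates data from a sparse instance of model (2.1) and applies the centering and standardization convention assumed above (Ȳ = 0, X̄^{(j)} = 0, S^{(j)} = 1); the dimensions, coefficient values, and error distribution are illustrative only and are reused in the sketches below.

```r
set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)              # illustrative design
beta <- c(rep(3, 5), rep(0, p - 5))          # sparse coefficient vector (q = 5)
Y <- drop(X %*% beta) + rnorm(n)             # errors ~ N(0, 1), so sigma^2 = 1

## center the response and standardize the covariates so that
## Ybar = 0, Xbar^(j) = 0, and S^(j) = n^{-1} sum_i X_ij^2 = 1
Y <- Y - mean(Y)
X <- scale(X, center = TRUE, scale = FALSE)
X <- X / rep(sqrt(colMeans(X^2)), each = n)

## index sets in the notation above
I_set  <- seq_len(p)
K_beta <- which(beta == 0)                   # zero components
J_beta <- I_set[-K_beta]                     # support, q = card(J_beta)
q <- length(J_beta)
```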

Case p < n
In this section, we consider the case when p < n. Let β̂ be the least-squares estimator (LSE) of β, i.e.,

β̂ = argmin_{b ∈ R^p} Σ_{i=1}^n (Y_i − X_i^T b)^2 = (X^T X)^{-1} X^T Y,

where X = (X_1, . . . , X_n)^T ∈ R^{n×p} is the design matrix and Y = (Y_1, . . . , Y_n)^T ∈ R^n.

Let σ̂^2 be the (unbiased) LSE of σ^2 given by

σ̂^2 = (n − p)^{-1} Σ_{i=1}^n (Y_i − X_i^T β̂)^2.

Let Ω = n^{-1} X^T X, let ρ_1 and ρ_2 denote the minimum and the maximum eigenvalues of Ω, respectively, and let ||a|| denote the Euclidean norm of a ∈ R^p. We assume the following regularity conditions:

A.1 ε_1, . . . , ε_n are i.i.d. with E(ε) = 0, var(ε) = σ^2, and E(ε^4) < ∞.

A.2 p/(nρ_1) → 0 and max_{1≤i≤n} ||X_i||^2 = O(p).

Note that since tr(Ω) = p, where tr(Ω) denotes the trace of Ω, then ρ_1 ≤ 1. By condition A.2, it thus follows that p = o(n). Lemma 2.1 shows the consistency of β̂ and σ̂^2, that the rate of convergence of a^T(β̂ − β) depends on ρ_1 (which is allowed to tend to 0, but at a slower rate than p/n), and that the rate of convergence of σ̂^2 to σ^2 is n^{-1/2}, exactly the same as in the case of fixed p.
Lemma 2.1. Suppose that conditions A.1-A.2 hold and let a ∈ R^p with ||a|| = 1. Then a^T(β̂ − β) = O_P((nρ_1)^{-1/2}) and σ̂^2 = σ^2 + O_P(n^{-1/2}).
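Continuing the sketch above, the quantities appearing in Lemma 2.1 can be computed directly; the formulas for β̂, σ̂^2, and Ω follow the reconstruction given above and should be read as an illustration rather than the paper's implementation.

```r
## least-squares estimator and the (unbiased) estimator of sigma^2
XtX      <- crossprod(X)                         # X^T X
beta_hat <- drop(solve(XtX, crossprod(X, Y)))    # (X^T X)^{-1} X^T Y
resid    <- drop(Y - X %*% beta_hat)
sigma2_hat <- sum(resid^2) / (n - p)

## Omega = n^{-1} X^T X and its extreme eigenvalues rho_1 and rho_2
Omega <- XtX / n
eig   <- eigen(Omega, symmetric = TRUE, only.values = TRUE)$values
rho2  <- max(eig); rho1 <- min(eig)              # A.2 requires p/(n * rho1) -> 0
```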
Let →_d denote convergence in distribution. Lemma 2.2 shows that a^T β̂ is asymptotically normal for all a ∈ R^p with ||a|| = 1.
Lemma 2.2. Suppose that conditions A.1-A.2 hold and let a ∈ R^p with ||a|| = 1. Then

n^{1/2} σ̂^{-1} (a^T Ω^{-1} a)^{-1/2} a^T(β̂ − β) →_d N(0, 1).
Let Ω_{K·J} = Ω_{KK} − Ω_{KJ} Ω_{JJ}^{-1} Ω_{JK} be the Schur complement of the block matrix Ω_{JJ}, where K = I \ J. An immediate consequence of Lemma 2.2 is that, for every fixed K ⊂ K_β (i.e., K does not depend on n), we have

n^{1/2} σ̂^{-1} Ω_{K·J}^{1/2} β̂_K →_d N(0, I_{q_K}),

where I_{q_K} is the identity matrix in R^{q_K×q_K}. Note that the F-test statistic for testing the null hypothesis H_0: β_K = 0 is

F_K = n β̂_K^T Ω_{K·J} β̂_K / (q_K σ̂^2).

By Lemma 2.2, it follows that, under H_0: β_K = 0,

q_K F_K →_d χ^2_{q_K},

where χ^2_{q_K} is the chi-squared distribution with q_K degrees of freedom. Lemma 2.3 shows the limiting null distribution of the scaled and centered F-test statistic when the cardinality of K, q_K, increases with n.

Lemma 2.3. Suppose that conditions A.1-A.2 hold, nρ_1 → ∞, and q_K → ∞ as n → ∞. Then, under H_0: β_K = 0,

(2 q_K)^{-1/2} (q_K F_K − q_K) →_d N(0, 1).
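As an illustration of the F-test discussed above, one can compare the full least-squares fit with the fit that drops the covariates indexed by a candidate set K; the anova() call below computes the classical F-statistic for nested models, and q_K F_K is then approximately χ^2_{q_K} under H_0 for fixed K. The choice of K is illustrative.

```r
## F-test of H0: beta_K = 0 for an illustrative candidate set K
K <- 6:15                                    # true zero coefficients in this example
fit_full    <- lm(Y ~ X - 1)                 # no intercept: the data are centered
fit_reduced <- lm(Y ~ X[, -K] - 1)
out <- anova(fit_reduced, fit_full)          # classical F-test for nested models
F_K <- out$F[2]
q_K <- length(K)
q_K * F_K                                    # approximately chi-squared with q_K df under H0
```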
Let K̂ be a thresholding estimator of the index set of the zero components of β, K_β, given by

K̂ = { j ∈ I : |β̂_j| ≤ γ σ̂_jj },

where β̂ = (β̂_j : j ∈ I)^T ∈ R^p, γ is the thresholding parameter, σ̂_jj = n^{-1/2} σ̂ ω_{jj}^{1/2}, and Ω^{-1} = (ω_{ij}) ∈ R^{p×p}. We assume that γ satisfies conditions (2.4a) and (2.4b). To get an intuition about conditions (2.4a) and (2.4b), suppose for the moment that lim inf_n ρ_1 > 0, lim inf_n ρ_2^{-1} > 0, lim inf_n min_{j∈J_β} |β_j| > 0, and q_{K_β} ∼ n^τ, where 0 < τ < 1; here and elsewhere, we use the standard notation that a_n ∼ b_n if and only if a_n = O(b_n) and b_n = O(a_n). In this case, we can choose γ ∼ n^{τ_0/2}, where τ < τ_0 < 1. Our default choice for γ is γ = (p log(p))^{1/2}, and as long as p = o(n/log(n)), conditions (2.4a) and (2.4b) hold. In Section 2.2, we take up this problem in more detail and present two automatic thresholding parameter selection procedures.
The TLSE of β, denoted β̃, is defined as follows:

β̃ = argmin { Σ_{i=1}^n (Y_i − X_i^T b)^2 : b ∈ R^p, b_j = 0 for all j ∈ K̂ },

that is, β̃ sets to zero the components indexed by K̂ and refits least-squares on the selected support Ĵ = I \ K̂. Let

σ̃^2 = (n − q_Ĵ)^{-1} Σ_{i=1}^n (Y_i − X_i^T β̃)^2

be the TLSE of σ^2. Similarly to Lemma 2.1, it readily follows that σ̃^2 = σ^2 + O_P(n^{-1/2}). Theorem 2.1 shows that β̃ has the oracle property, and thus, the TLSE has similar asymptotic properties as the LASSO-type estimators that have the oracle property.
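A minimal sketch of the thresholding step and the TLSE, assuming the refitted form described above and the default choice γ = (p log(p))^{1/2}; this continues the running example and is not the author's implementation (which is available in the TLSE package).

```r
## thresholding step with the default gamma = (p * log(p))^{1/2}
gamma     <- sqrt(p * log(p))
Omega_inv <- solve(Omega)
se_hat    <- sqrt(sigma2_hat * diag(Omega_inv) / n)   # sigma_hat_jj
K_hat     <- which(abs(beta_hat) <= gamma * se_hat)   # estimated zero set
J_hat     <- setdiff(seq_len(p), K_hat)

## TLSE: zero out the thresholded coordinates and refit on the selected support
beta_tilde <- numeric(p)
if (length(J_hat) > 0) {
  refit <- lm(Y ~ X[, J_hat, drop = FALSE] - 1)
  beta_tilde[J_hat] <- coef(refit)
  sigma2_tilde <- sum(residuals(refit)^2) / (n - length(J_hat))
}
```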
Note that the regularity conditions of Theorem 2.1 are less restrictive than those imposed for Bridge estimators by Huang, Horowitz and Ma (2008). To see this, assume that lim inf_n ρ_1 > 0, lim inf_n ρ_2^{-1} > 0, and lim inf_n min_{j∈J_β} |β_j| > 0. Then, by condition (A3)(b) of Huang, Horowitz and Ma (2008), we obtain λ(p/n)^{γ/2} p^{-1} → ∞, where λ is the regularization parameter and γ ∈ (0, 1) is the power of the penalty component of the Bridge estimator. Hence λ/p → ∞. By their condition (A2)(b), λq/n → 0. However, this is a restrictive condition since, for example, if p ∼ n/log(n), then q = o(log(n)); in this case, our regularity conditions require only that p = o(n/log(n)). See the paragraph above for more details. Moreover, if the covariates are uniformly bounded, then their condition (A5)(b) holds only if q = o(n^{1/2}), while our corresponding regularity condition max_{1≤i≤n} ||X_i||^2 = O(p) holds without additional restrictions. By Lemma 2.2 and Theorem 2.1, when β ∈ Θ_2, the asymptotic distributions of a^T(β̂ − β) and a^T(β̃ − β) are the same for all a ∈ R^p with ||a|| = 1. Corollary 2.1 is a direct consequence of Theorem 2.1 and shows that β̃ is more efficient than β̂ in the sense that the asymptotic variance of a^T(β̃ − β) is smaller than or equal to the asymptotic variance of a^T(β̂ − β).
Corollary 2.1. Suppose that the conditions of Theorem 2.1 hold and let a ∈ R^p with ||a|| = 1. Then β̃ is asymptotically at least as efficient as β̂.

Thresholding parameter selection
Theoretically, any sequence γ satisfying (2.4a) and (2.4b) will ensure the oracle property of β̃. In this section, we describe two automatic selection procedures for γ with good numerical and statistical properties. These procedures are based on extensions of the Scheffé and Bonferroni methods adapted for high-dimensional regression models.
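The exact Scheffé- and Bonferroni-type rules are not reproduced in this excerpt. The sketch below shows one natural form such rules could take and should be read as an assumption rather than the paper's procedure: a Bonferroni-type threshold uses a 1 − α/(2p) standard normal quantile, while a Scheffé-type threshold uses the square root of a χ^2_p quantile, both applied to the studentized coefficients β̂_j/σ̂_jj.

```r
## hypothetical automatic threshold choices (assumed forms, not the paper's exact rules)
alpha         <- 0.05
gamma_bonf    <- qnorm(1 - alpha / (2 * p))         # Bonferroni-type critical value
gamma_scheffe <- sqrt(qchisq(1 - alpha, df = p))    # Scheffe-type critical value

## either value can replace gamma in the thresholding rule of Section 2.1
K_hat_bonf <- which(abs(beta_hat) <= gamma_bonf * se_hat)
```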

Case p ≥ n
In this section, we consider the case when p ≥ n and q < n, where recall that q = q_{J_β} is the number of non-zero components of β ∈ R^p. We will show that in this case we can make oracle inference about β in two steps, using an initial variable screening method similar to those of Huang, Horowitz and Ma (2008), Huang, Ma and Zhang (2008), and Fan and Lv (2008). First, we select an index set Ĵ_0 ⊆ I, with card(Ĵ_0) < n, with the property that it contains the indices j ∈ I for which the absolute values of the t-statistics of the (marginal) componentwise least-squares estimator (CLSE) of β are larger than an appropriately chosen threshold value. In the second step, we perform the thresholding least-squares inference presented in Sections 2.1-2.2 using only the covariates from the index set Ĵ_0. We will show that, under additional regularity conditions, the resulting TLSE has the oracle property even when p increases almost exponentially with respect to n.
Since the columns of the design matrix X ∈ R^{n×p} are standardized and the response vector Y ∈ R^n is centered at 0, the CLSE of β is given by β̌_j = n^{-1} Σ_{i=1}^n X_{ij} Y_i, j ∈ I. Let Γ̌ = {γ̌_j : j ∈ I} be the set of the absolute values of the t-statistics corresponding to β̌, and let γ̌_{(j)} denote the jth order statistic of Γ̌, j ∈ I. Then Ĵ_0 is defined as the index set corresponding to the largest m absolute values of the t-statistics corresponding to the CLSE of β, where q ≤ m and m = o(n) is a pre-specified value corresponding to the hypothesized maximum number of non-zero regression coefficients. Thus, Ĵ_0 = {j ∈ I : γ̌_j > γ̌_{(p−m)}}. The TLSE is defined in the same way as in Section 2.1 for the response vector Y and design matrix X_{Ĵ_0}. With a slight abuse of notation, let ρ_1 and ρ_2 denote the minimum and the maximum eigenvalues of Ω_{Ĵ_0 Ĵ_0} ∈ R^{m×m}. We assume the following regularity conditions.

B.1 E(ε) = 0, E(ε^2) = σ^2, where ε ∼ P, and P has sub-Gaussian tails, that is, there exist constants c_0, C_0 > 0 such that P(|ε| > t) ≤ C_0 exp(−c_0 t^2) for all t > 0.

B.3 (i) The maximum correlation between the covariates in J_β and K_β is of order o(q^{-1}).

Theorem 2.2 shows that, under additional regularity conditions, the TLSE has the oracle property even if p grows almost exponentially with n.
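A sketch of the screening step, continuing the running example; it assumes the marginal t-statistic for covariate j is the usual one from the simple (no-intercept) regression of Y on X^{(j)}, which may differ in detail from the paper's definition.

```r
## marginal (componentwise) screening step for the case p >= n
m <- floor(n / log(n))                       # hypothesized maximum model size
t_marg <- sapply(seq_len(ncol(X)), function(j) {
  fit <- lm(Y ~ X[, j] - 1)                  # columns standardized, Y centered
  abs(summary(fit)$coefficients[1, "t value"])
})
J0_hat <- order(t_marg, decreasing = TRUE)[seq_len(m)]   # top-m marginal t-statistics
## the thresholding least-squares steps of Sections 2.1-2.2 are then applied to X[, J0_hat]
```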
The regularity conditions of Theorem 2.2 are similar to those imposed for the Bridge estimators by Huang, Horowitz and Ma (2008). However, we can highlight an important difference. Specifically, the partial orthogonality condition (B2)(a) of Huang, Horowitz and Ma (2008) requires that the correlation coefficients between the covariates corresponding to the zero and the non-zero coefficients be of order O(n^{-1/2}), while our corresponding condition B.3(i) requires them to be only of order o(q^{-1}). Under their condition (B3)(a), q = o(n^{1/2}), and thus our condition is less restrictive. Note that van de Geer, Bühlmann and Zhou (2011) developed regularity conditions for an analytical comparison between the ALASSO and the LASSO with thresholding in terms of prediction error, mean absolute error, mean squared error, and the number of false positive selections. Since van de Geer, Bühlmann and Zhou (2011) did not prove the oracle property of the ALASSO and the LASSO with thresholding, we cannot compare our regularity conditions with theirs.
We can relax the condition of sub-Gaussian tails of the errors in condition B.1 at the expense of a slower growth of p. Specifically, assuming only finite fourth-order moments of ε, as in condition A.1, an analysis of the proof of Theorem 3.1 based on Markov's inequality shows that parts (i) and (ii) of the theorem hold provided that pq = o(n^2). Note further that we can also provide more primitive conditions for B.3(ii), for instance requiring that lim inf_n min_{j∈J_β} |β_j| ≥ c_2 for some c_2 > 0, together with a mild additional bound on the design; it then follows that, for j ∈ J_β, B.3(ii) holds.
In practice, we can set m = ⌊n/log(n)⌋, where ⌊x⌋ denotes the integer part of x. However, we can also select m via a k-fold cross-validation procedure. Specifically, for i = 1, . . . , k, let T_i and V_i denote the index sets corresponding to the ith training and validation data sets, respectively, where V_1, . . . , V_k partition {1, . . . , n} and T_i = {1, . . . , n} \ V_i. Let L ⊂ {1, . . . , n} be a search grid for m. For l ∈ L, let v̂_i(l) denote the cross-validation estimate of the mean squared prediction error of the submodel corresponding to m = l using the ith validation set, i.e.,

v̂_i(l) = card(V_i)^{-1} Σ_{j∈V_i} (Y_j − Ŷ^{T_i}_{j,l})^2,

where Ŷ^{T_i}_{j,l} is the predicted value of Y_j using the training data set T_i and the design matrix X_{T_i, J_l} = (X_{t,j} : t ∈ T_i, j ∈ J_l), with J_l = {j ∈ I : γ̌_j > γ̌_{(p−l)}}. Let v̂(l) = k^{-1} Σ_{i=1}^k v̂_i(l), and set m = m̂, where m̂ = argmin_{l∈L} v̂(l). In the simulation study and data analysis, we search l on a grid L of 20 equally spaced values on the log-scale (so that the grid is finer for smaller values of l), and we use k = 20.
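A sketch of the k-fold cross-validation rule for m described above, reusing the marginal t-statistics from the screening sketch; the fold assignment and the 20-point log-spaced grid (here bounded by min(p, ⌊n/2⌋), an illustrative choice) are placeholders for the paper's exact setup.

```r
## k-fold cross-validation for the screening size m
k <- 20
folds  <- sample(rep(seq_len(k), length.out = n))
l_max  <- min(ncol(X), floor(n / 2))                    # illustrative upper bound
L_grid <- unique(round(exp(seq(log(2), log(l_max), length.out = 20))))

cv_err <- sapply(L_grid, function(l) {
  J_l <- order(t_marg, decreasing = TRUE)[seq_len(l)]   # top-l marginal t-statistics
  mean(sapply(seq_len(k), function(i) {
    V  <- which(folds == i); Tr <- which(folds != i)
    fit  <- lm.fit(x = X[Tr, J_l, drop = FALSE], y = Y[Tr])
    pred <- drop(X[V, J_l, drop = FALSE] %*% fit$coefficients)
    mean((Y[V] - pred)^2)                               # validation MSPE, v_i(l)
  }))
})
m_hat <- L_grid[which.min(cv_err)]                      # argmin of v(l)
```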

Bootstrap inference
Let β̃* be the bootstrap version of β̃ in the context of thresholding least-squares inference. Let ε̂_i = Y_i − X_i^T β̂ be the ith (raw) residual; since X̄^{(j)} = 0 and Ȳ = 0, then Σ_{j=1}^n ε̂_j = 0, and thus the residuals are inherently "centered" at zero. Let Ê_{1:n} = {ε̂_1, . . . , ε̂_n} be the sample of residuals and let P̂ = n^{-1} Σ_{i=1}^n δ_{ε̂_i} denote its empirical distribution, where δ_x denotes the unit mass at x ∈ R. The residual-bootstrap method (Freedman, 1981) is performed by first sampling, with replacement, the residuals, which are then added to the fitted values to obtain a bootstrap sample. Specifically, given Ê_{1:n}, let Ê*_{1:n} = {ε̂*_1, . . . , ε̂*_n} be a conditionally i.i.d. sample from P̂, i.e., Ê*_{1:n} is a with-replacement random sample of size n from Ê_{1:n}. Then, for each i = 1, . . . , n, let Y*_i = X_i^T β̃ + ε̂*_i. Similarly to its sample version, the bootstrap version of β̃ has a closed form solution, β̃*, obtained by applying the thresholding least-squares procedure to the bootstrap sample (Y*_1, X_1), . . . , (Y*_n, X_n). Let σ̃*^2 and ŝ* be the bootstrap versions of σ̃^2 and ŝ, respectively. The bootstrap estimator of L(ŝ^{-1/2} a^T(β̃ − β)) is the conditional distribution given Y of ŝ*^{-1/2} a^T(β̃* − β̃), which we denote as L(ŝ*^{-1/2} a^T(β̃* − β̃) | Y). Another option is to use β̂ as a centering value, and in this case L(ŝ*^{-1/2} a^T(β̃* − β̂) | Y) is the "naive" bootstrap distribution estimator of L(ŝ^{-1/2} a^T(β̃ − β)). Since it is determined by Y, the bootstrap distribution L(ŝ*^{-1/2} a^T(β̃* − β̃) | Y) is a random element in P, the space of distributions on R. We equip P with the Prohorov metric, which metrizes weak convergence, and with the corresponding Borel sigma-field generated by the topology of weak convergence (see, e.g., Dudley, 2002, p. 393-399). Theorem 3.1 shows that, under the regularity conditions of Theorem 2.1, the bootstrap distribution L(ŝ*^{-1/2} a^T(β̃* − β̃) | Y) is consistent, and, if β ∈ Θ_1, the "naive" version L(ŝ*^{-1/2} a^T(β̃* − β̂) | Y) converges in distribution to a random distribution, and thus is inconsistent. This result is obtained for the case when p < n; careful analysis of the proof of Theorem 3.1 shows that a similar result continues to hold in the case when p ≥ n under the regularity conditions of Theorem 2.2. A standard modification of the residual bootstrap resamples the modified residuals r_i − r̄, where r_i = ε̂_i/(1 − h_ii)^{1/2}, r̄ = n^{-1} Σ_{i=1}^n r_i, and h_ii = X_i^T(X^T X)^{-1} X_i is the ith diagonal element of the "hat matrix" (Davison and Hinkley, 1997, p. 259). By the regularity conditions of Theorem 3.1, max_{1≤i≤n} h_ii = O(p/(nρ_1)) = o(1), and thus n^{-1} Σ_{i=1}^n r_i^2 → σ^2 in probability. Careful analysis of the proof of Theorem 3.1 shows that this version of the residual bootstrap is also consistent.
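A sketch of the residual bootstrap for the TLSE, following the reconstruction above in which bootstrap responses are generated from the TLSE fit and the bootstrap TLSE is centered at β̃; the studentization by ŝ* is omitted for brevity, and B, the contrast a, and the resampling details are illustrative.

```r
## residual bootstrap for the TLSE (continuing the running example)
B <- 500
a <- c(1, numeric(p - 1))                     # contrast of interest, ||a|| = 1
eps_hat <- drop(Y - X %*% beta_hat)           # raw residuals (they sum to zero)

boot_stat <- replicate(B, {
  eps_star <- sample(eps_hat, n, replace = TRUE)
  Y_star   <- drop(X %*% beta_tilde) + eps_star         # generate from the TLSE fit
  b_hat  <- drop(solve(XtX, crossprod(X, Y_star)))      # bootstrap LSE
  s2_hat <- sum((Y_star - X %*% b_hat)^2) / (n - p)
  se     <- sqrt(s2_hat * diag(Omega_inv) / n)
  J_star <- which(abs(b_hat) > gamma * se)              # bootstrap selected support
  b_tilde <- numeric(p)
  if (length(J_star) > 0)
    b_tilde[J_star] <- coef(lm(Y_star ~ X[, J_star, drop = FALSE] - 1))
  sum(a * (b_tilde - beta_tilde))             # centered at beta_tilde ("proper" centering)
})
quantile(boot_stat, c(0.025, 0.975))          # bootstrap quantiles of a^T(beta_tilde* - beta_tilde)
```

A percentile-type confidence interval for a^T β can then be formed from these quantiles and the sample value a^T β̃.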

Simulation models
In this section, we present the results of a simulation study of the finite sample behavior of the thresholding least-squares inference in sparse low- and high-dimensional regression models. We assess our results in terms of computational and numerical efficiency of the ordinary least-squares, LASSO, and thresholding least-squares inferential methods. The computations are done in the R language and we use the glmnet package (Friedman, Hastie and Tibshirani, 2010; El Ghaoui, Viallon and Rabbani, 2012) to compute the LASSO estimator (with the regularization parameter selected via the 10-fold cross-validation procedure implemented in the package). We have implemented the thresholding least-squares methods in an R package, called TLSE, which is available from the author upon request. Simulations were performed on the HiPerGator cluster hosted by the High Performance Computing Center at the University of Florida.
Our simulation models are similar to those considered by Huang, Horowitz and Ma (2008) for the Bridge estimators, and we assess the finite sample performance of the methods in terms of (i) variable selection, (ii) prediction accuracy, and (iii) estimation efficiency. The variable selection performance is measured by the relative frequency of correct identification of the set of zero and non-zero regression parameters. Specifically, for j ∈ K_β, the relative frequency of correct identification is

p̂_j = S^{-1} Σ_{s=1}^S 1(β̃^s_j = 0),

where β̃^s = (β̃^s_1, . . . , β̃^s_p)^T ∈ R^p is the TLSE of β on the sth simulated sample, s = 1, . . . , S, and S is the total number of simulated samples. For j ∈ J_β, the relative frequency of correct identification is

p̂_j = S^{-1} Σ_{s=1}^S 1(β̃^s_j ≠ 0).

The prediction performance is measured by the (empirical) root mean squared prediction error, which is calculated as

RMSPE = { (Sn)^{-1} Σ_{s=1}^S Σ_{i=1}^n (Y^s_i − Ŷ^s_i)^2 }^{1/2},

where Y^s_i is the ith response of the sth simulated sample, and Ŷ^s_i is the predicted response for the ith observation in the sth simulated sample using the parameter estimates obtained on an independent simulated sample of size n from the regression model. The estimation efficiency is measured by the empirical root mean squared error (RMSE) of the parameter estimates, which is calculated as

RMSE_j = { S^{-1} Σ_{s=1}^S (β̃^s_j − β_j)^2 }^{1/2},  j ∈ I,

and the root average mean squared error (RAMSE), which is calculated as

RAMSE = { (Sp)^{-1} Σ_{s=1}^S Σ_{j=1}^p (β̃^s_j − β_j)^2 }^{1/2}.

The samples are generated from the regression model (2.1), the design matrix X ∈ R^{n×p} is generated once and then kept fixed across simulations, and the errors are generated from the normal distribution N(0, σ^2). We compared the runtime of the TLSE and the LASSO on a data set generated from model 6, with n = 20000, on an Intel i5 ultrabook, 3.6 GHz, 64 bit, 8 GB RAM, running under Ubuntu 14.04. The runtime was 12.7 sec for the LASSO (using the cv.glmnet function implemented in the glmnet package), 6.2 sec for the LSE (using the lm function), and 6.6 sec for the TLSE. We anticipate a significant runtime improvement of the TLSE compared to the LASSO for ultra-high dimensional data sets. We generated the samples from the following six regression models.
Model 2. The same as Model 1 with r = 0.9.

Model 4. We set
Model 5. The same as Model 4 with r = 0.9.
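The performance measures defined earlier in this section can be computed as in the sketch below for a generic sparse design; the specific designs of models 1-6 are not reproduced in this excerpt, so the data-generating step simply reuses the running example.

```r
## performance measures from Section 4 for a generic sparse design
S <- 100                                   # illustrative number of replicates
est  <- matrix(0, S, p)                    # TLSE on each simulated sample
mspe <- numeric(S)
for (s in seq_len(S)) {
  Ys <- drop(X %*% beta) + rnorm(n)        # sth simulated sample
  bh <- drop(solve(XtX, crossprod(X, Ys)))
  s2 <- sum((Ys - X %*% bh)^2) / (n - p)
  se <- sqrt(s2 * diag(Omega_inv) / n)
  Js <- which(abs(bh) > gamma * se)
  bt <- numeric(p)
  if (length(Js) > 0) bt[Js] <- coef(lm(Ys ~ X[, Js, drop = FALSE] - 1))
  est[s, ] <- bt
  Ynew <- drop(X %*% beta) + rnorm(n)      # independent sample for prediction
  mspe[s] <- mean((Ynew - drop(X %*% bt))^2)
}
p_zero    <- colMeans(est[, K_beta, drop = FALSE] == 0)  # correct identification, j in K_beta
p_nonzero <- colMeans(est[, J_beta, drop = FALSE] != 0)  # correct identification, j in J_beta
RMSPE <- sqrt(mean(mspe))
RMSE  <- sqrt(colMeans(sweep(est, 2, beta)^2))           # per-coefficient RMSE
RAMSE <- sqrt(mean(sweep(est, 2, beta)^2))
```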

Simulation results
In this section, we present the results of the simulation study. The number of simulations is S = 5000 and the sample sizes are n = 100, 200, 400, 800. Figures 1-2 show the empirical frequency of correct identification of the zero and non-zero regression parameters for the LASSO and the TLSE, respectively. Note that the empirical frequency of correct identification for models 3 and 6 is about 1 for all regression parameters and sample sizes for both the LASSO and the TLSE. The empirical frequency of correct identification for model 1 is about 1 for the TLSE for all parameters and sample sizes, and smaller than 1 for the LASSO in the case of the zero regression parameters (with values of about 0.8 for n = 400). For model 5, the empirical frequency of correct identification is about 1 for the zero regression parameters and all sample sizes for both the LASSO and the TLSE; although the empirical frequency is below 1 for smaller coefficients and smaller sample sizes, the TLSE outperforms the LASSO in all cases. For models 2 and 4, we have mixed results. Specifically, for model 2, while the LASSO outperforms the TLSE for smaller non-zero regression coefficients for n = 100, the TLSE outperforms the LASSO for all other sample sizes for both the zero and non-zero parameters. For model 4, while the LASSO outperforms the TLSE for small non-zero parameters for n = 100, the TLSE has a higher frequency of correct identification for both the zero and non-zero regression parameters for all other sample sizes. Figures 3-5 show the root mean squared errors (RMSE) of the LSE, the LASSO, and the TLSE, respectively. For models 4-6, the LSE is calculated using the first m = n/5 variables in the model. Note that for models 1 and 2, the RMSE of the LSE are larger for the zero regression parameters and smaller for the non-zero regression parameters than those of the LASSO and the TLSE; the RMSE of the TLSE are generally smaller than those of the LASSO for all parameters and sample sizes, with one exception, for n = 100, where the RMSE of the TLSE are larger than those of the LASSO for smaller regression parameters. For model 3, the RMSE of all estimators are similar. For models 4-6, the RMSE of the LASSO and the TLSE are significantly smaller than those of the LSE for all parameters and sample sizes. In these cases, the RMSE of the LSE for some regression parameters are as high as 12 for model 5, due to the high dimension of the model and the high correlation among the variables. Note that, for these models, the TLSE generally outperforms the LASSO for all sample sizes and regression parameters. Tables 1 and 2 show, respectively, the empirical root mean squared prediction errors (RMSPE) and the empirical root average mean squared errors (RAMSE) of the LSE, the LASSO, and the TLSE. Note that for models 1-2, the RMSPE of the LSE and the TLSE are similar and significantly smaller than those of the LASSO for all sample sizes. For model 3, the RMSPE of the LASSO is smaller than those of the LSE and the TLSE. Generally, the RAMSE of all estimators are similar, with smaller values for the LSE and the TLSE. The situation changes dramatically for the high-dimensional models 4-6. Specifically, the RMSPE and RAMSE of the LASSO and the TLSE are significantly smaller than those of the LSE, and generally the TLSE outperforms the LASSO (with one exception, for model 5 with n = 200).
The results of the simulation study show a slightly better performance of the LSE and the TLSE over the LASSO in sparse low-dimensional regression models, and a significantly better performance of the LASSO and the TLSE for high-dimensional regression models, with slightly better results for the TLSE on small and moderate samples.

Data analysis
In this section, we use ordinary least-squares and thresholding least-squares methods of inference to analyze a high-dimensional data set. The data set consists of the data collected on intervention chemicals (chemicals given by a ketogenic diet) and seizure load response (measured as the relative percent change in seizure days from the baseline seizure days over a two-week period) for a group of 55 children suffering from epilepsy. After removing incomplete cases as well as explanatory variables with no variation, the pre-processed data set has 6830 observations and 116 explanatory variables. Since the correlation matrix of the explanatory variables is nearly singular, with more than 25 eigenvalues smaller than 0.001, we first perform a hierarchical clustering algorithm to identify groups of highly correlated variables. The dissimilarity (distance) between variables is calculated as 1 minus the absolute value of the correlation coefficient of the variables, and we use a group-average agglomerative clustering algorithm (see, e.g., Hastie, Tibshirani and Friedman, 2008, Section 14.3.12). The dendrogram is cut at the height h = 0.30; this choice implies that the average dissimilarity (distance) between the clusters is larger than 0.30, and thus the average absolute correlation between the clusters is smaller than 0.70. For each group of variables determined by the hierarchical clustering algorithm, we perform a principal component analysis and use the scores of the first principal component as covariates in the regression analysis. Our final design matrix has p = 74 columns and n = 6830 rows, and its minimum eigenvalue is ρ_1 = 0.004 (which is greater than log(n)/n = 0.001). Figure 6 shows the LSE and the TLSE of the regression parameters with the corresponding 95% individual normal confidence intervals. The cardinality of the set of non-zero regression parameter estimates of the TLSE is 14 (and thus, 74 − 14 = 60 components of the TLSE are set equal to zero). Note that the zero components of the TLSE correspond to non-significant components of the LSE, and the widths of the corresponding confidence intervals are significantly smaller. This is in agreement with the theoretical and simulation results, which show that the TLSE is more efficient than the LSE for sparse regression models.
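A sketch of the preprocessing pipeline described above (dissimilarity 1 − |correlation|, group-average agglomerative clustering, dendrogram cut at h = 0.30, first principal component score per cluster); the matrix Z holding the 116 raw explanatory variables is assumed, and object names are illustrative.

```r
## cluster highly correlated explanatory variables and replace each cluster by the
## scores of its first principal component; Z is the (assumed) n x 116 matrix of
## raw explanatory variables after removing incomplete cases and constant columns
D   <- as.dist(1 - abs(cor(Z)))              # dissimilarity = 1 - |correlation|
hc  <- hclust(D, method = "average")         # group-average agglomerative clustering
grp <- cutree(hc, h = 0.30)                  # cut the dendrogram at height 0.30

X_design <- sapply(sort(unique(grp)), function(g) {
  cols <- which(grp == g)
  if (length(cols) == 1) return(drop(scale(Z[, cols])))
  prcomp(Z[, cols], center = TRUE, scale. = TRUE)$x[, 1]   # first PC scores
})
## X_design (n x 74 in the analysis above) is then used as the design matrix for the TLSE
```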
Appendix

Proof of Lemma 2.2. Since σ̂^2 = σ^2 + o_P(1), by Slutsky's theorem it is enough to show that n^{1/2} σ^{-1} (a^T Ω^{-1} a)^{-1/2} a^T(β̂ − β) →_d N(0, 1). To this end, note that a^T(β̂ − β) = a^T(X^T X)^{-1} X^T ε = Σ_{i=1}^n w_i ε_i, where w_i = a^T(X^T X)^{-1} X_i, i = 1, . . . , n, is a triangular array of non-random weights with Σ_{i=1}^n w_i^2 = n^{-1} a^T Ω^{-1} a. Since the ε_i are i.i.d. with mean 0 and variance σ^2, it is enough to show that the Lindeberg condition holds (see, e.g., van der Vaart, 1998, p. 20), which follows from conditions A.1-A.2. This concludes the proof of the lemma.
Proof of Lemma 2.3. Without loss of generality, we state (and prove) this result for the case when K = I. By rescaling, we further assume without loss of generality that σ^2 = 1. Since H_0: K_β = I holds, we have Y_i = ε_i, i = 1, . . . , n. Note that {(Y_{j,n}, F_{j,n}) : 1 ≤ j ≤ n} is a martingale array, where Y_{j,n} denotes the corresponding partial sum, F_{j,n} = σ{ε_i : 1 ≤ i ≤ j} is the natural filtration, and σ{ε_i : 1 ≤ i ≤ j} denotes the σ-field generated by {ε_1, . . . , ε_j}. To verify this, since Y_{j,n} is F_{j,n}-measurable by definition, we have to show that E(Y_{j,n} | F_{j−1,n}) = Y_{j−1,n} almost surely (a.s.), j = 1, . . . , n, where Y_{0,n} = 0 a.s. and F_{0,n} is the trivial σ-field. Since E(ε_j | F_{j−1,n}) = 0 a.s. and E(ε_j^2 | F_{j−1,n}) = 1 a.s., then E(Y_{j,n} | F_{j−1,n}) = Y_{j−1,n} a.s. Hence {(Y_{j,n}, F_{j,n}) : 1 ≤ j ≤ n} is a martingale array. Consider the martingale difference array {(Z_{j,n}, F_{j,n}) : 1 ≤ j ≤ n}, where Z_{j,n} = Y_{j,n} − Y_{j−1,n}. Let ν^2_{j,n} = E(Z^2_{j,n}) and ν^2_n = Σ_{j=1}^n ν^2_{j,n} (A.10). By the central limit theorem for martingale difference arrays (see, e.g., Chow and Teicher, 1997, and Athreya and Lahiri, 2006, p. 510), (A.5) holds provided that the conditional variance condition and the conditional Lindeberg condition hold. Using Jensen's inequality, |x + y|^3 ≤ 4(|x|^3 + |y|^3) for x, y ∈ R, we obtain a bound with two terms, displayed in (A.13). We first prove that the first term on the right side of (A.13) tends to 0. To this end, we use the identity Σ_{j_1=1}^n h^2_{jj_1} = h_{jj} (since H is an idempotent matrix) and Hölder's inequality E(|Y|^3) ≤ E(|Y|^4)^{3/4}. We now show that the second term on the right side of (A.13) tends to 0. To this end, note that in the corresponding sum only the terms for which pairs of indices are equal are non-zero. Consider the terms for which j_1 = j_2 and k_1 = k_2 (all other terms are treated similarly); these terms can be shown to be negligible. This completes the proof.