Hypothesis testing sure independence screening for nonparametric regression

In this paper we develop a sure independence screening method based on hypothesis testing (HT-SIS) in a general nonparametric regression model. The ranking utility is based on a powerful test statistic for the hypothesis of predictive significance of each available covariate. The sure screening property of HT-SIS is established, demonstrating that all active predictors will be retained with high probability as the sample size increases. The threshold parameter is chosen in a theoretically justified manner based on the desired false positive selection rate. Simulation results suggest that the proposed method performs competitively against procedures found in the literature of screening for several models, and outperforms them in some scenarios. A real dataset of microarray gene expressions is analyzed.


Introduction
In recent years, fast advances in technology and data collection have facilitated the acquisition of high-dimensional data in several areas of research. The challenge arises when the number of predictors is larger than the sample size, which can be found for example in studies with genomic microarrays, high frequency functional MRI or imaging decoding. Several regularization methods can be used to perform variable selection in such situations, including the LASSO [17], the SCAD [7], the LARS [6], the elastic net [24] and the Dantzig selector [4]. Although these methods yield good results for high-dimensional data, when the number of predictors is ultra-high they may not perform well due to computational problems or statistical accuracy. In order to deal with these challenges, it becomes necessary to develop methods that reduce the dimensionality of the predictor space from an ultra-high scale to a relatively high scale.
Fan and Lv [8] were pioneers in studying theoretical aspects for the idea of screening out unimportant predictors in a regression model. They introduced the concept of sure independence screening (SIS), that is, with probability tending to 1, a well chosen subset of the predictors will contain the true set of predictors that contribute to the underlying model. The theoretical properties of this procedure were obtained under the strong assumption of a linear model. However, if this assumption is not accurate, predictors with high predictive significance whose effects are nonlinear might not be detected.
In order to identify nonlinear effects in a regression model, Fan, Feng and Song (2011) [11] considered nonparametric independence screening (NIS) with an additive model, ranking the utility of the covariates with $E\{m_j^2(X_j)\}$, where $m_j(X_j) = E(Y \mid X_j)$ is the projection of $Y$ onto $X_j$. For multi-index models, Zhu, Li, Li and Zhu (2011) [23] used $E[x E\{I(Y < y) \mid x\}]$ as the population utility measure for a covariate, estimating it with a sample-average statistic. Several other authors have recently developed methods for variations of linear and nonlinear models, see for instance [13], [9], [10], [12], [18], [21], [14] and [22]. However, little is found in the literature regarding screening for fully nonparametric regression models. Li, Zhong and Zhu (2012) [15] innovatively considered a model-free sparse regression whose active predictors are those on which $F(Y \mid X)$ functionally depends. In order to allow for an arbitrary regression relationship, they used the distance correlation between each covariate and the response variable as the ranking utility for screening (DC-SIS). In this paper we propose a novel screening method that, unlike the procedures in the literature, is based on a test statistic for the hypothesis that each available predictor has predictive significance. The signal strength of active predictors is based on the variance of the marginal nonparametric regression function. We use a powerful nonparametric test proposed by Zambom and Akritas (2014) [20] to compute the marginal utility of each predictor. New asymptotic theory is developed in order to establish the rates of convergence of the test statistic, with a new Berry-Esseen type bound for its distribution and exponential convergence rates for the variance estimator. The proposed method is developed under a very general heteroscedastic nonparametric regression model, which does not require strong assumptions such as linearity or additivity of the mean regression function.
Moreover, due to the fact that the predictors are ranked using a test statistic, a meaningful choice of the threshold parameter can be made, a fundamental advantage over the ad-hoc approaches in other procedures in the literature.
The remainder of the paper is organized as follows. In Section 2 we present the nonparametric regression model and preliminary asymptotic properties of the test statistic. The screening method HT-SIS and its sure independence properties are examined in Section 3. Section 4 describes a procedure to select the threshold parameter in order to maintain a desired false positive rate. Section 5 presents a comparison of the performance of HT-SIS, the parametric SIS and the model-free DC-SIS, and finally a microarray dataset is analyzed in Section 6.

The model and preliminary results
Let $Y$ denote the response variable, $X = (X_1, \ldots, X_d)$ the vector of available predictors and, with some abuse of notation, let $X_{ki}$ be the $i$-th observation of the $k$-th covariate. Assume that the data come from the heteroscedastic nonparametric regression model
$$Y = m(X) + \sigma(X)\,\epsilon, \tag{1}$$
where $\epsilon$ is the independent error with $E(\epsilon) = 0$ and constant variance (w.l.o.g. assume $\mathrm{Var}(\epsilon) = 1$), uncorrelated with $X$. When the dimension $d$ of the vector of covariates is high, it is often assumed that the regression model is sparse, in the sense that there is a unique subset of indices $I_0$ such that the regression function $m(\cdot)$ is influenced only by those predictors whose indices are in $I_0$.
Hence, we define $I_0 \subseteq \{1, \ldots, d\}$ such that the true underlying model is
$$Y = m(X_{I_0}) + \sigma(X)\,\epsilon,$$
where $X_{I_0} = (X_k : k \in I_0)$. Note that the variance function $\sigma(\cdot)$ is not restricted to the set of predictors in $I_0$, for we are only interested in selecting predictors that have predictive significance, that is, those that contribute to the underlying mean regression function.
There are several procedures in the literature for testing whether a covariate has no predictive value. The most common idea is to test for a constant conditional expectation of the response given the covariate. The majority of the literature proposes tests which assume homoscedasticity and hence become liberal under heteroscedasticity. Thus, a covariate with no predictive value stands a good chance of being selected as a predictor if the variance function, or even other aspects of the conditional distribution of the response, are not constant with respect to the covariate. Based on a sample of $n$ iid observations from model (1), we propose ranking the utility of the covariates using, marginally, the test statistic introduced by Zambom and Akritas (2014) [20]. We now briefly recall the test statistic and its main properties. For the marginal regression model $Y = m_k(X_k) + \sigma_k(X_k)\epsilon_k$, consider the null hypothesis
$$H_0 : m_k(x) = C_k \quad \text{for all } x, \tag{2}$$
for a constant $C_k$. Let $(Y_i, X_{ki})$, $i = 1, \ldots, n$, represent data from a high-dimensional one-way ANOVA design with $Y_i$ being the observation at "level" $X_{ki}$. Because of the ANOVA requirement of more than one observation per cell, each cell is augmented with neighboring observations in the following way. Consider that $X_{ki}$ is arranged in order of magnitude. Define the augmented cell $X_{ki}$ to consist of $Y_i$ and the $Y_j$'s corresponding to the $(p-1)/2$ $X_{kj}$'s on either side of $X_{ki}$, for a fixed odd constant $p$. Then, the set of indices $j$ composing the augmented cell $X_{ki}$ can be written as
$$W_{ki} = \left\{ j : \left| \hat F_{X_k}(X_{kj}) - \hat F_{X_k}(X_{ki}) \right| \le \frac{p-1}{2n} \right\}, \tag{3}$$
where $\hat F_{X_k}$ is the empirical distribution function of $X_k$, so that $W_{ki}$ defines the augmented cell corresponding to $X_{ki}$. The test statistic for the hypothesis in (2) is based on the high-dimensional one-way ANOVA type test statistic
$$T_k = MST_k - MSE_k = Y_{W_k}^T A\, Y_{W_k}, \tag{4}$$
where $MST_k$ and $MSE_k$ are the mean squares for treatment and for error in the high-dimensional one-way ANOVA, $Y_{W_k}$ is the vector of responses stacked by augmented cell, and the matrix $A$ is
$$A = \frac{np-1}{n(n-1)p(p-1)} \bigoplus_{i=1}^{n} J_p - \frac{1}{n(n-1)p} J_{np} - \frac{1}{n(p-1)} I_{np},$$
where $I_r$ is an identity matrix of dimension $r$, $J_r$ is an $r \times r$ matrix of 1's and $\oplus$ is the Kronecker sum or direct sum.
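The construction of the statistic from augmented cells can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the implementation in the NonpModelCheck package: the function name `anova_statistic` is hypothetical, and only interior points are used as cell centres so that every cell contains exactly p responses (how boundary cells are treated is an assumption made here for simplicity).

```python
def anova_statistic(x, y, p=11):
    """Return MST - MSE for cells of p neighbouring responses, ordered by x.

    Assumption (not from the paper): only interior points serve as cell
    centres, so each augmented cell holds exactly p responses.
    """
    if p < 3 or p % 2 == 0:
        raise ValueError("p must be an odd integer >= 3")
    half = (p - 1) // 2
    # Sort the responses according to the covariate values.
    ys = [yi for _, yi in sorted(zip(x, y), key=lambda pair: pair[0])]
    n = len(ys)
    if n < p + 1:
        raise ValueError("need at least p + 1 observations")
    # Augmented cell of centre i: the response at i plus (p - 1) / 2 on each side.
    cells = [ys[i - half:i + half + 1] for i in range(half, n - half)]
    a = len(cells)
    cell_means = [sum(cell) / p for cell in cells]
    grand_mean = sum(cell_means) / a
    # Balanced one-way ANOVA mean squares for treatment and for error.
    mst = p * sum((m - grand_mean) ** 2 for m in cell_means) / (a - 1)
    mse = sum((v - m) ** 2
              for cell, m in zip(cells, cell_means)
              for v in cell) / (a * (p - 1))
    return mst - mse
```

A constant response gives a statistic of zero, while a response that varies smoothly with the covariate gives a positive value, which is the behaviour the ranking relies on.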

Remark 1.
Simulations suggest that the choice of the window size p has little influence on the performance of the test, as long as it is not too small or too large. Choosing p < 5 tends to make the test procedure liberal, while a large value of p has the opposite effect. In simulations we used p = 11. A way to gain confidence in the choice of p in any practical situation is to run the test after randomly permuting the observed response variables among the covariate values, in order to induce the validity of the null hypothesis.
To obtain insight on the properties of $T_k$ for ranking the utility of $X_k$ in the nonparametric regression, we recall the following theorem.
Theorem 1. (Zambom and Akritas, 2014 [20]) Assume that $\sigma_k^2(x_k)$ is Lipschitz continuous, $\sup_x \sigma_k^2(x_k) < \infty$, the marginal density $f_{X_k}$ of $X_k$ is uniformly continuous and bounded away from 0 and $E(\epsilon_k^4) < \infty$. Then under $H_0$ in (2), the asymptotic distribution of the test statistic in (4) is given by
$$n^{1/2}\, T_k \xrightarrow{d} N(0, v_k),$$
where $v_k$ denotes the asymptotic variance.
In order to estimate $v_k$, assume that the response values $Y_i$, $i = 1, \ldots, n$, are sorted according to $X_k$; in other words, assume that $Y_i$ is the observation corresponding to $X_{k(i)}$, where $X_{k(i)}$ is the $i$-th order statistic of the sample $X_{k1}, \ldots, X_{kn}$. Then $v_k$ can be consistently estimated by the estimator $\hat v_k$ referred to as (7) (see Lemmas 2 and 3). Note that both $MST_k$ and $MSE_k$ are averages and converge to constants. Under the null hypothesis (2), both converge to the same constant. Under local alternatives, Zambom and Akritas [20] showed that the asymptotic distribution of the test statistic is Normal with mean given by $p\,\mathrm{Var}(m_k(X_k))$. The hypothesis is hence rejected for large values of the test statistic, so that $T_k$ is expected to be a useful statistic to rank the utility of each predictor.

The screening procedure and main results
The Hypothesis Testing Sure Independence Screening (HT-SIS) procedure consists of selecting a superset of indices $\hat I$ that contains the index set $I_0$ with probability increasing to one as the sample size increases. The challenge addressed in screening is to deal with the situation where the number of predictors $d$ greatly exceeds the sample size $n$. Define the superset $\hat I$ as
$$\hat I = \left\{ k : T_k / \sqrt{\hat v_k} \ge c\,p\,n^{-\alpha},\ 1 \le k \le d \right\}, \tag{8}$$
where $c$ and $\alpha$ are threshold parameters defined in condition C8 below and $p$ is the window size defined in (3). Note that in Section 4 we set $c\,p\,n^{-\alpha} = \lambda_n$ and provide a method for choosing $\lambda_n$. In order to establish the sure screening properties of HT-SIS, consider the following conditions. For any $1 \le i, j \le n$ and some $s > 0$:
C6: $m_k(\cdot)$ and $\sigma_k(\cdot)$ are Lipschitz continuous for $k = 1, \ldots, d$.
C7: $f_{X_k}(\cdot)$, $k = 1, \ldots, d$, are bounded away from 0,
where $f_{X_k}$ is the density of $X_k$, with support in $\mathcal{X}_k$. Conditions C1-C7 are necessary for the derivation of Theorem 2 and supporting Lemmas 1-3. Conditions C1-C5 are similar to condition C1 in Li, Zhong and Zhu (2012) [15], which requires finite expected values of exponential functions of $\sigma_k(X_k)\epsilon_k$ and $m_k(X_k)$. These conditions follow if $\sigma_k^2(\cdot)$ and $m_k(\cdot)$ are bounded uniformly. Conditions C6 and C7 are usual conditions in nonparametric regression (see for example Fan, Feng and Song, 2011 [11]), where C7 for example follows for distributions with compact support.
In all theoretical results that follow, the constants in the $O(\cdot)$ notation may depend, as indicated, on the expected value of functions of $\sigma_k(X_k)$ and $m_k(X_k)$, and hence also on $f_{X_k}$. We denote these constants by $C_\sigma$, $C_{m\sigma}$. Their exact expressions are suppressed for ease of notation. Note that these constants, although sometimes with the same subscript, may take different values at each appearance. In the following lemma, we establish the rate at which the test statistic $T_k$ converges in probability to its expected value.
Note that for an active predictor $X_k$, $k \in I_0$, we expect the value of $T_k$ not to be too small, or at least larger than most of those of inactive predictors. For the sure independence screening property of HT-SIS, we require the following condition for some constant $c$ and $0 \le \alpha < 1/2$. Condition C8 is similar to condition 3 of Fan and Lv (2008) [8], where it is assumed that the true correlation between the predictor and the response is above a certain threshold. In the present case, we assume that the signal strength, measured by the variance of $m_k(\cdot)$, is not too small; intuitively, this variance is 0 when $m_k(\cdot)$ is constant, that is, when $X_k$ has no marginal effect on the mean of $Y$.
In Lemmas 2 and 3 we explore the rate of convergence of $\hat v_k$, used to standardize the proposed ranking utility $T_k$ (see Theorem 1). Note that Lemma 3 establishes the consistency of $\hat v_k$ as $n$ goes to infinity. Using these lemmas and in connection with Lemma 1, we can show the sure screening property of HT-SIS, which is stated in Theorem 2.
Lemma 2. Let $\hat v_k$ in (7) be the estimator of $v_k$. Under conditions C6 and C7 we have that
Lemma 3. Let $\hat v_k$ in (7) be the estimator of $v_k$. Under conditions C1 and C7, there exist constants $c_1 > 0$ and $c_2 > 0$ such that

Theorem 2.
Under conditions C1-C8, for $0 < \gamma + \alpha < 1/2$, there exist constants $c_1 > 0$ and $c_2 > 0$ such that for any $\varepsilon > 0$
Because the true model is assumed to be sparse, where only a small number $d_0$ of the predictors have predictive significance, Theorem 2 demonstrates that all significant predictors will be retained with high probability. Note that the theorem holds even when the number of covariates in the model is allowed to increase with the sample size $n$ at an exponential rate.

Remark 2.
All screening methods face challenges such as failing to identify important predictors that are marginally independent of, but may be jointly correlated with, the response, or selecting spurious variables, that is, unimportant predictors that are correlated with important predictors. An iterative version of HT-SIS, similar to the iterative versions of SIS, DC-SIS or NIS, can easily be implemented in order to alleviate such issues. The asymptotic properties of the iterative versions of these methods are an interesting topic for further analysis.
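The selection step of this section can be sketched in a few lines. This is a minimal Python illustration, not the authors' implementation: the function name `ht_sis_select` and its arguments are hypothetical, the statistics T_k and the variance estimates are assumed to be precomputed, and a one-sided rule is assumed because the test rejects for large values of the statistic.

```python
import math


def ht_sis_select(stats, variances, threshold):
    """Return the indices k whose standardized statistic reaches the threshold.

    stats[k] plays the role of T_k, variances[k] of the variance estimate,
    and threshold of lambda_n. All inputs are assumed precomputed.
    """
    selected = []
    for k, (t, v) in enumerate(zip(stats, variances)):
        if v <= 0:
            raise ValueError("variance estimates must be positive")
        # One-sided rule (assumption): keep predictor k when T_k / sqrt(v_k) is large.
        if t / math.sqrt(v) >= threshold:
            selected.append(k)
    return selected
```

For instance, `ht_sis_select([0.5, 0.01, 2.0], [1.0, 1.0, 4.0], 0.4)` keeps the first and third predictors, whose standardized values 0.5 and 1.0 exceed 0.4.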

The choice of the threshold parameter
Since Fan and Lv [8] introduced the notion of sure screening, several research studies have explored the theoretical and asymptotic properties of screening methods. However, in the majority of papers, the choice of the threshold parameter is not carefully addressed. Instead of setting a threshold for the ranking utility, most methods fix the maximum number of predictors to be kept after the screening procedure, for instance $n/\log(n)$ or even $n - 1$. These choices are ad-hoc and provide no meaningful interpretation, but do address the practical objective of ending up with fewer predictors than the sample size.
A characteristic only found in variable selection procedures based on test statistics is the possibility to control the False Positive Rate or the False Discovery Rate [1]. This idea was used by Zhao and Li (2012) [21] for the case of screening in linear Cox models based on a test statistic for the coefficients β j 's. For the linear regression model where the number of covariates is allowed to grow with n, Bunea, Wegkamp and Auguste (2006) [3] proposed a variable selection method based on FDR and showed that it is consistent in selecting the set of significant predictors. The p-values of these test statistics can be used to guarantee that the expected false positive rate will be below a chosen level. In this section we establish theoretical support for the choice of the threshold parameter when applying HT-SIS based on FDR. The asymptotic normality of the test statistic T k provides a direct choice of the threshold parameter in connection with the cumulative distribution function.
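Recall that $I_0$ is estimated by $\hat I$ in (8). Write $\hat I$ as
$$\hat I = \left\{ k : T_k / \sqrt{\hat v_k} \ge \lambda_n \right\},$$
so that $\lambda_n$ is the threshold parameter to be chosen. If the true model $I_0$ has size $|I_0| = d_0$, the expected false positive rate is
$$E\left( \frac{|\hat I \cap I_0^c|}{|I_0^c|} \right) = \frac{1}{d - d_0} \sum_{k \in I_0^c} P\left( T_k / \sqrt{\hat v_k} \ge \lambda_n \right).$$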

A. Z. Zambom and M. G. Akritas
By Theorem 1 and the consistency of the estimator $\hat v_k$, $n^{1/2} T_k / \sqrt{\hat v_k}$ has an asymptotic standard Normal distribution, and the expected false positive rate is controlled at $1 - \Phi(n^{1/2}\lambda_n)$, where $\Phi$ is the cumulative distribution function of a standard Normal.
In order to have the false positive rate (# false positives)/(# negatives) decrease when the sample size increases, fix the number of false positives $r$ we are willing to tolerate in the screening procedure. Then the false positive rate $r/(d - d_0)$ decreases with the sample size since $d$ is allowed to increase with $n$ (see the rate in Theorem 3). Now by conservatively setting the expected false positive rate as $1 - \Phi(n^{1/2}\lambda_n) = r/d$, we obtain the threshold $\lambda_n = n^{-1/2}\Phi^{-1}(1 - r/d)$. A similar idea was also used in Zhao and Li (2012). Theorem 3 establishes the bounds for the expected false positive rate of the proposed screening method using the Berry-Esseen-type bound for $T_k$ derived in Lemma 5.
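The threshold choice discussed above, with lambda_n equal to n^(-1/2) times the standard Normal quantile at 1 - r/d, can be computed with the Python standard library. This is a minimal sketch: the function name `ht_sis_threshold` is hypothetical, and the tolerated number of false positives r used in the usage note is an arbitrary illustrative value, not one taken from the paper.

```python
import math
from statistics import NormalDist


def ht_sis_threshold(n, d, r):
    """Threshold n^(-1/2) * Phi^{-1}(1 - r/d) for a tolerated number r of false positives."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < r < d:
        raise ValueError("r must satisfy 0 < r < d")
    # Standard Normal quantile at level 1 - r/d, rescaled by the square root of n.
    return NormalDist().inv_cdf(1 - r / d) / math.sqrt(n)
```

With the dimensions of the microarray data analyzed later (n = 30, d = 6319) and a hypothetical r = 10, `ht_sis_threshold(30, 6319, 10)` returns the cut-off to compare with the standardized statistics; a smaller r gives a larger threshold.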

Lemma 4.
Under assumption C1, for k ∈ I c 0 we have

Theorem 3. Under conditions C1-C8, for the choice of threshold parameter
while the sure independence property (Theorem 2) holds.
Theorem 3 establishes that the false positive rate is maintained close to the nominal level chosen r/d, while retaining all active predictors with high probability. The rate at which the number of predictors d is allowed to increase with the sample size is comparable to those of Fan and Lv (2008) [8], where log(d) = O(n ξ ), for some ξ > 0 (Condition 1).
Note that the False Discovery Rate is defined as the expected value of $|\hat I \cap I_0^c| / |\hat I|$. Moreover, $|\hat I \cap I_0^c| / |\hat I|$ can be written as the product of the false positive rate $|\hat I \cap I_0^c| / |I_0^c|$ and the ratio $|I_0^c| / |\hat I|$. Hence, the False Discovery Rate can be controlled at $r/|\hat I|$ conditionally on $|\hat I|$, as long as the false positive rate is controlled at $r/d$.
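The relation between the false positive rate and the false discovery proportion can be checked numerically on a selected set. The sketch below is illustrative only: the function names and the toy index sets in the usage note are hypothetical and do not come from the paper.

```python
def false_positive_rate(selected, active, d):
    """(# selected inactive predictors) / (# inactive predictors)."""
    selected, active = set(selected), set(active)
    negatives = d - len(active)
    if negatives <= 0:
        raise ValueError("there must be at least one inactive predictor")
    return len(selected - active) / negatives


def false_discovery_proportion(selected, active):
    """(# selected inactive predictors) / (# selected predictors)."""
    selected, active = set(selected), set(active)
    if not selected:
        return 0.0  # convention: no selections means no false discoveries
    return len(selected - active) / len(selected)
```

With d = 10, active set {0, 1, 2} and selected set {0, 1, 5, 7}, the false positive rate is 2/7, the false discovery proportion is 2/4, and the latter equals the former multiplied by (d - d_0)/|selected| = 7/4.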

Simulation study
In this section we analyze the performance of HT-SIS with simulation studies for 7 different models. For comparison purposes, the well known Sure Independence Screening (SIS) [8], the Distance Correlation Sure Independence Screening (DC-SIS) [15] and the Nonparametric Independence Screening (NIS) [11] are also evaluated. All results were obtained in R (www.r-project.org), using packages NonpModelCheck, SIS and energy for HT-SIS, SIS and DC-SIS respectively.
We follow the simulation scenarios of Li, Zhong and Zhu (2012) [15], where we generate $X = (X_1, \ldots, X_d)$ from a Normal distribution with zero mean and covariance matrix $\Sigma = (\sigma_{ij})_{d \times d}$ with $\sigma_{ij} = 0.8^{|i-j|}$, and error term $\epsilon \sim N(0, 1)$. Because Normally distributed covariates are used in most of the variable selection literature, they are used in this simulation section despite the fact that they do not meet condition C7. In consequence, $m_1(X_1) = E(Y \mid X_1)$ does not meet conditions C1-C5 for all models considered except Model 5. This is because the expected value of exponential functions of terms with order higher than 2 does not exist for Normal random variables ($E(e^{sX^3})$ diverges for $X$ Normally distributed). Hence, the results of this simulation section demonstrate the robustness of the proposed method against departures from conditions C1-C5 and C7. We consider $n = 200$ and $d = 1000$ or $3000$ and repeat the experiment 1000 times. The following criteria are used to evaluate the performance of the screening methods:
1. $S$: the minimum model size to include all active predictors. We report the 5%, 25%, 50%, 75% and 95% quantiles of $S$ out of 1000 replications.
2. $P_s$: the proportion of the 1000 replications in which an individual active predictor is selected for a given model size $|\hat I|$.
3. $P_a$: the proportion of the 1000 replications in which all active predictors are selected for a given model size $|\hat I|$.
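The covariate design and the criterion S described above can be sketched as follows. This is an illustrative Python sketch rather than the R code used for the reported results: the function names are hypothetical, and the covariates are generated through the autoregressive recursion, which is one standard way to obtain unit variances and correlations 0.8^|i-j|.

```python
import math
import random


def ar1_normal_row(d, rho, rng):
    """One draw of (X_1, ..., X_d) with Var(X_i) = 1 and Cov(X_i, X_j) = rho^|i-j|."""
    row = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(1, d):
        # AR(1) step: keeps unit variance and multiplies the correlation by rho.
        row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
    return row


def min_model_size(utilities, active):
    """Criterion S: smallest top-ranked set that contains every active predictor."""
    order = sorted(range(len(utilities)), key=lambda k: utilities[k], reverse=True)
    rank = {k: position + 1 for position, k in enumerate(order)}
    return max(rank[k] for k in active)
```

For example, with utilities `[0.9, 0.1, 0.8, 0.3]` the ranking is predictor 0, then 2, then 3, then 1, so S equals 2 when the active set is {0, 2} and 3 when it is {0, 3}.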
As expected, SIS has low performance in capturing the significance of predictors with nonlinear effects and hence its minimum model size S is in general higher on average than those of DC-SIS and NIS. This suggests that the high 95-th percentile of S using HT-SIS for the first three models is due to the fact that one of the important predictors may have been assigned a very low rank in 5% of the generated datasets. It is important to notice that for models 5 and 6, SIS, NIS and DC-SIS fail to identify any of the important predictors among their top ranked predictors, probably due to the high frequency of the sine and cosine functions. On the other hand, it can be seen from Tables 2 and 3 that HT-SIS captures their significance at least 99% of the time, keeping an extremely low model size S at all percentiles. Finally, the proportion of times that the two-peak effect of $X_{12}$ in model 7 is selected by HT-SIS is on average 83.5%, considerably higher than the 31.7% on average achieved by DC-SIS and 24% by NIS.

Real data application
In this section we apply the proposed screening method to the cardiomyopathy dataset. This dataset has been studied in Segal, Dahlquist, and Conklin (2003) [16], Hall and Miller (2009) [13] and Li, Zhong and Zhu (2012) [15] and is composed of n = 30 observations of d = 6319 gene expressions in mice. The objective is to identify which genes contribute the most to the overexpression of Ro1, a G protein-coupled receptor. For comparison and visualization purposes, we only display the top 8 ranked predictors using HT-SIS and DC-SIS. Note that if one wishes to keep the size of the superset $|\hat I|$ smaller than n = 30, any choice of the number of false positives (less than 30) would correspond to keeping the false positive rate less than 0.5%. Figures 1 and 2 show the scatterplots of Ro1 and expression levels of the 8 most influential genes (left to right and top to bottom) ranked according to HT-SIS and DC-SIS respectively. In order to help visualize the relationships between Ro1 and the predictors, we added to each graph a cubic spline fit curve and the lowess (locally weighted polynomial regression) fit curve. Note that, according to HT-SIS, the most influential gene is Msa.2400.0, which is ranked seventh with DC-SIS. On the other hand, DC-SIS ranks first gene Msa.2134.0, which is ranked second according to HT-SIS. To compare the  This criterion suggests that the most influential gene is in fact Msa.2400.0. Since a fully nonparametric model suffers from the curse of dimensionality, it is infeasible to fit a nonparametric (or even an additive) model using all the top eight ranked genes with only n = 30 observations in this dataset. Instead, for an elementary insight into the results of the screening methods, we look at the fits resulting from a nonparametric additive model with the top 3 ranked genes for each method, using package mgcv from the R software (the addition of a fourth predictor is infeasible due to the lack of degrees of freedom).
HT-SIS obtained an adjusted R 2 = 0.944 and deviance explained 0.975 while DC-SIS achieves 0.98 and 0.992 for the same measures respectively. Although DC-SIS achieves somewhat better results, it is clear that both methods perform comparably in ranking the most influential genes, with very high deviance explained. Note that the addition of more genes to the additive model would surely increase the R 2 and the explained deviance. Hence, the supersets obtained by HT-SIS and DC-SIS, although slightly different, consist of genes with high predictive significance with respect to Ro1.

Discussion
In this paper we propose a screening method based on a test statistic for the hypothesis that a covariate is influential in the prediction of the response variable. The sure independence screening property is demonstrated using a nonparametric heteroscedastic regression model. Simulations suggest that the proposed method performs well even with highly correlated predictors. However, improved versions of screening methods have been widely studied in the literature. The original idea proposed by Fan and Lv (2008) [8] is to first choose a smaller set of predictors with high predictive significance, and then iteratively choose a subsequent small set of predictors that is significantly related to the residuals obtained from the modeling of the previous set with the response. Following this idea, an iterative HT-SIS can be easily adapted to screening nonparametric models, improving the inclusion of predictors that have little or no marginal predictive significance, but jointly with other predictors yield a significant model. Theoretical aspects of such an iterative method need a more detailed appraisal.
A meaningful choice of the threshold parameter is derived and theoretically justified through the control of the false positive rate of the selection. It is interesting to note that the proposed procedure for choosing the threshold parameter is based only on the number of predictors d and the allowed false discovery rate. This fundamentally differs from the ad-hoc choices used in the literature, which are based solely on the sample size n. As observed in the microarray analysis in Section 6, for real situations with ultra-high predictor space and very small sample size, the proposed method for choosing the threshold parameter may suggest a screened superset with size larger than n. Depending on the objective of the screening, a lower false positive rate might be selected in order to keep the size of the screened superset below n. Overall, choosing the number of predictors to retain when performing variable screening is a difficult challenge that still needs further investigation.

Appendix
Throughout the appendix and the proofs herein, the notations C, c, c 1 and c 2 are generic constants, which may take different values at each appearance. Moreover, we use C σ , C m and C mσ , which may take different values at each appearance, to denote a constant that depends on the functions σ k (·), k = 1, . . . , d. C k may be different for instance when depending on different moments of σ k (·).

A.1. Auxiliary lemmas
Lemma 6. Let $X_1, \ldots, X_n$ be i.i.d. random variables with distribution $F_X$ satisfying condition C7 and let $X_{(1)}, \ldots, X_{(n)}$ be the corresponding order statistics. Then
Proof. Note that $F_X(X_{(1)}), \ldots, F_X(X_{(n)})$ are order statistics of a Uniform distribution on (0,1), and hence $F_X(X_{(k)}) \sim \mathrm{Beta}(k, n + 1 - k)$. Thus, for any $\varepsilon > 0$ where $M$ will be specified later. By the Markov inequality and the fact that $\xi_i$ are i.i.d., for any $\varepsilon > 0$ and $t > 0$ where the last inequality follows from Lemma 5.6.1A in Serfling (1980). Choosing $t = 4\varepsilon n/M^2$ we have and by the symmetry of Note that for any $c_2 > 0$, In view of assumptions C1-C5 and C8, if we choose $M = cn^{\gamma}$ for $0 < \gamma < 1/2 - k$, then $E(T^*_{3b}) \le \varepsilon/2$ when $n$ is sufficiently large. Consequently where $C_\sigma$ is a constant that depends on the moments of $\sigma_k(\cdot)$, and hence For $T^*_2$ we write Note that $E(T^*_2) = 0$. Since $T^*_2$ is a (symmetric) U-statistic of second order, using the fact that $P(T^*_{2a} \ge \varepsilon) \le e^{-t\varepsilon} E\big(e^{t \sum_{i=1}^{n} \sum_{j \ne i} \xi_i \xi_j I(0 \le \xi_i \xi_j \le M)/(n(n-1))}\big)$, with steps similar to those for $T^*_3$, for constants $c_1 > 0$ and $c_2 > 0$ and because all windows $W_i$ are of finite size ($p$), we have Consider now $T^*_{12}$. Write where $U_{ni}$, $i = 1, \ldots, n/(6p)$, are independent and also $V_{ni}$, $i = 1, \ldots, n/(6p)$, are independent. Thus, by the Markov and Cauchy-Schwarz inequalities and the choice of $t = 4\varepsilon n/M^2$ and a constant $c_3 = 1/(12p^3(p-1)^2)$, = e − tε c 3 exp n 6p Using steps similar to those for $T^*_{3b}$, under assumptions C1-C5 and C8, with the choice of $M = cn^{\gamma}$, for a constant $c_2 > 0$ and hence Using similar steps, it is easy to show that the second and last terms on the right hand side of (10) have the same convergence rates, that is

Proof of Lemma 2
Using Lemma 7, we have where C σ k is the Lipschitz constant for σ k (·), and c f k = inf x∈X k f k (x). Taking the expected value with respect to X k completes the proof of Lemma 2, since the expected value of the O p (·) term is O Cσ k c f k n by steps similar to those in Lemma 6.

Proof of Lemma 3
First note that for any $\varepsilon > 0$ where $L$ is the lower bound for $v_k$, that is $v_k \ge L$.
Let $v_k = v_{k1} + v_{k2}$, where $v_{k1}$ and $v_{k2}$ are the decomposition of $v_k$ corresponding to the decomposition $\hat v_k = \hat v_{k1} + \hat v_{k2}$. Using steps similar to those for term $T_{12}$ in the proof of Lemma 1, one can show that, for constants $c_1 > 0$ and $c_2 > 0$, there exists a constant $C_\sigma$ such that This completes the proof of Lemma 3.

Proof of Theorem 2
By Lemma 1 we have P (|T k − E(T k )| ≥ cn −α ) ≤ O(exp(−c 1 n 1−2(γ+α) ) + nC σ exp(−c 2 n γ )), and by Lemma 3, we have √ v k has the same form. Using condition C8 and Lemma 3.0.9 in Zambom and Akritas (2014) we have, for any ε > 0, taking ε = ε /d 0 , where d 0 is the cardinality of I 0 , and the last inequality follows from Lemma 1 and the definition of O p (n −1/2 ) for a constant c.

Proof of Lemma 4
We omit the proof of this Lemma, as it follows using arguments similar to those in Wang, Akritas and Van Keilegom (2008).

Proof of Lemma 5
Assume without loss of generality that the constant $C_k$ in (2) is equal to 0. Note that . Since $D_i$ are dependent (on only a few other $D_i$), we will make use of the block Markov technique to show normality of the test statistic. Write $E_{ni} = D_{(i-1)(n^{\beta}+3p)+1} + \cdots + D_{(i-1)(n^{\beta}+3p)+n^{\beta}}$ and $F_{ni} = D_{(i-1)(n^{\beta}+3p)+n^{\beta}+1} + \cdots + D_{i(n^{\beta}+3p)}$, where $0 < \beta < 1$ is a constant. The choice of $\beta$ determines the rate of convergence of the test statistic to the normal distribution and the rate at which the small blocks composed by $F_{ni}$ go to 0. Now we have where $r_n \sim n/(n^{\beta} + 3p)$. Note that where the last inequality follows from Markov's inequality and assumptions C1-C5. Hence It is easy to establish the Lyapunov condition for $\sum_{i=1}^{r_n} E_{ni}/\sqrt{n}$ (see Zambom and Akritas 2014). Note that $E(E_{ni}) = 0$. Write Using the Berry-Esseen theorem (Berry, 1941), the first term in (13) is bounded by a term of order O(r which is only different from 0 if $Y_{k1} Y_{k2} Y_{1} Y_{2}$ consists of two pairs of equal observations. Hence, the order of $\mathrm{Var}(E_{ni})$ is $O(n^{\beta} C_\sigma)$. Using similar steps, and the fact that (Cauchy-Schwarz) it can be shown that $E(|E_{ni}|^3)$ is of order $O(n^{3\beta/2} C_\sigma)$. Hence, the Berry-Esseen bound for the first term in (13) is $O(C_\sigma n^{-(1/2)(1-\beta)+3\beta/2-(3/2)\beta}) = O(C_\sigma n^{\beta/2-1/2})$. For any $\varepsilon > 0$, the second term in (13) is equal to For a choice of $0 < \beta$ large enough, say $\beta = 9/10$ and $\varepsilon = n^{-(\beta-3/5)} = n^{-3/10}$, we have convergence of $P(|(1/n)\sum_{i=1}^{r_n} F_{ni}| \ge \varepsilon)$ of order $O(C_\sigma n^{4(3/10)+5(1-\beta)-2}) = O(C_\sigma n^{-3/10})$.

Proof of Theorem 3
For the proof of this Theorem, we follow Zhao and Li (2012). We have where the last inequality follows from the fact that $2cn^{-\alpha} - |T_k/\sqrt{\hat v_k}| \le |T_k/\sqrt{\hat v_k} - \mathrm{Var}(m_k(x))/\sqrt{v_k}|$, which follows from assumption C8. For any $\lambda_n \le cn^{-\alpha}$, Theorem 2 holds. For the choice of $\lambda_n = n^{-1/2}\Phi^{-1}(1 - r/d)$, this entails Using the fact that $1 - \Phi(x) \le x^{-1}\exp(-x^2/2)$, this inequality is satisfied if $d \le r\exp\{c^2 n^{1-2\alpha}/2\}$. Without loss of generality, consider the constant $C_k$ in (2) to be equal to 0. Note that for $k \in I_0^c$, since $n^{1/2}T_k/v_k = n^{1/2} Y_{W_k}^T A Y_{W_k}/v_k$, we can use Lemma 4 and Lemma 5 to find sup for a constant $C_\sigma$. Then (9) implies that The theorem follows if we choose $\lambda_n = n^{-1/2}\Phi^{-1}(1 - r/d)$.