Dimension reduction-based significance testing in nonparametric regression

A dimension reduction-based adaptive-to-model test is proposed for the significance of a subset of covariates in a nonparametric regression model. Unlike existing local smoothing significance tests, the new test behaves as if the number of covariates were only that under the null hypothesis, and it can detect local alternatives distinct from the null at a rate that depends only on the number of covariates under the null hypothesis. Thus, the curse of dimensionality is largely alleviated when nonparametric estimation is inevitably required. When there are many insignificant covariates, the new test improves markedly over existing local smoothing tests in both significance level maintenance and power. Simulation studies and a real data analysis are conducted to examine the finite sample performance of the proposed test.


Introduction
Consider the nonparametric regression model: where Y is a scalar response with covariates Z = (X′, W′)′, X = (X1, · · · , Xp1)′ ∈ R^{p1}, W = (W1, · · · , Wp2)′ ∈ R^{p2} and p1 + p2 = d, the regression function m(·) : R^d → R is unknown in form, and the error term ε has zero conditional expectation given Z: E(ε|Z) = 0. As is well known, the success of any further statistical analysis hinges on the correctness of the working model. Note that some of the covariates are often redundant in regression models. The subset of covariates W is said to be insignificant for the response Y given X if (1.2) holds. The equality (1.2) means that W provides no additional information for predicting Y. W should then be removed from the regression model (1.1); otherwise, such redundant variables make the statistical analysis more complicated and less accurate and efficient. This step, particularly at the first stage of regression analysis, is necessary. In the literature, the relevant testing problem, called significance testing, has attracted much attention. There exist several proposals based on the prevalent locally and globally smoothing methodologies. For the former, Lavergne and Vuong (2000) [10] extended the idea introduced by Fan and Li (1996) [6] and proposed a test based on a second conditional moment to check the significance of a subset of covariates. Further, Li (1999) [15] developed a nonparametric significance test based on the idea of Fan and Li (1996) [6] for nonparametric and semiparametric time-series models. Racine et al. (2006) [18] suggested a test for the significance of categorical variables in fully nonparametric regression models. These locally smoothing-based test statistics converge to their respective limiting null distributions at the typical rate O_p(n^{-1/2}h^{-d/4}), where d is the number of all covariates and h is the bandwidth in kernel estimation.
Further, these tests can only detect local alternative hypotheses distinct from the null at the rate O_p(n^{-1/2}h^{-d/4}). When d = p1 + p2 is large, this convergence rate is very slow because h converges to zero at a certain rate. These locally smoothing methodologies therefore severely suffer from the curse of dimensionality. The problem is caused by using nonparametric estimation for the models under both the null and alternative hypotheses, which assumes the significance of all covariates; this clearly does not fully use the model structure under the null hypothesis. For globally smoothing tests, Racine (1997) [17] proposed a test based on nonparametric estimation of a partial derivative. Delgado and González Manteiga (2001) [4] introduced a consistent test based on a stochastic process. Both tests have the fast rate of order √n. Racine's (1997) [17] test involves the nonparametric estimation of the partial derivative ∂E(Y|X, W)/∂W, and thus estimation inefficiency can seriously deteriorate the performance of the test. Delgado and González Manteiga's (2001) [4] test involves the multivariate nonparametric estimation of E(Y|X) and uses all the covariates in the process. Thus, data sparseness in high-dimensional space still has a negative impact on the power of the resulting test. We can see from the example in Section 4 that the power drops very quickly as d increases. Finally, when the dimension d of the covariates is large, the computational burden is also an issue because Monte Carlo approximations to the sampling null distributions are computationally intensive.
We now present some simulation results for Example 1 in Section 4 to illustrate these disadvantages. Here the sample size is n = 200, the dimension of X is set to p1 = 2, 4, and the dimension p2 of W ranges from 2 to 10. The empirical powers are computed over 2,000 replications for each experiment at the significance level α = 0.05. For presentation purposes, the power is set to 1 when p2 = 0. We use this example to examine how the powers of existing tests are negatively affected by increasing the dimension p2. Figure 1 depicts the empirical power curves of Fan and Li's (1996) [6] test and Delgado and González Manteiga's (2001) [4] test, which are regarded as representatives of local smoothing tests and global smoothing tests, respectively. More details about the bandwidth selection and other settings of this simulation can be found in Section 4. Obviously, the empirical powers of these two tests decrease rapidly as p1 and p2 increase. This indicates that both tests severely suffer from the curse of dimensionality. Significance level maintenance is discussed below. Hence, overcoming the aforementioned problems caused by dimensionality is of great importance.
Recently, Lavergne et al. (2015) [11] devised a new kernel-based test that is based on a suitable equivalent Fourier transformation. Their construction combines locally smoothing and globally smoothing tests: they locally smooth the regression function with the covariates under the null hypothesis, and globally smooth by averaging over the regression function with the other covariates.

Fig 1. Empirical power curves of Fan and Li's (1996) test and Delgado and González Manteiga's (2001) test against the dimensions of X and W, where X ∼ N(0, Σ1) and W ∼ N(0, Σ1), with sample size n = 200 and a = 2 at the significance level α = 0.05 for Example 1.

The nice feature is that the test inherits the merit of locally smoothing tests, being sensitive to high-frequency alternative models, and is still omnibus. Theoretically, the test statistic converges to its weak limit at the faster rate O(n^{-1/2}h^{-p1/4}) rather than the rate O(n^{-1/2}h^{-d/4}) that classical locally smoothing tests achieve. But even under the null hypothesis, it still uses all covariates. Thus, to maintain the significance level, the test still encounters the data sparseness issue even under the null hypothesis.
Constructing a test that involves all covariates only under the alternative hypothesis is thus of interest. Guo et al. (2016) [8] advocated an adaptive-to-model strategy to construct test statistics. They first proposed a dimension-reduction locally smoothing test for checking generalized linear regression models. Under the null hypothesis, the test solely uses the information provided by the hypothetical model so that it can well maintain the significance level, and under the alternative hypothesis it automatically adapts to the alternative model so that the test is omnibus. Zhu et al. (2017a) [30] followed a similar idea to develop a dimension reduction globally smoothing test for more general regression models. It is of interest to apply this strategy to test statistic construction in significance testing. However, such an application is far from trivial. The key step, identifying the projected covariates in the hypothetical and alternative models, plays a crucial role in this method. In Guo et al. (2016) [8], the projected covariates in the hypothetical models are contained in the alternative models. In the present setup, we cannot impose this assumption, and thus the identification procedure differs from the model adaptation step of Guo et al. (2016) [8]. Therefore, we need a new method to tackle this issue. The details are given in the model and test statistic construction in the next section.
As is known, significance testing focuses on selecting the significant covariates X in the nonparametric regression setting. Let g(x) = E(Y|X = x). Then the significance testing becomes the following hypothesis: To accommodate a more general model structure, consider the following reformulation of the above model. Since g(·) is an unknown function, the nonparametric regression model Y = g(X) + U can be reformulated as: where B1 is an orthonormal p1 × p1 matrix. This means that the above regression model can be viewed as a special multi-index regression model with p1 indices corresponding to the covariates X. Thus, in the present paper, we consider a general null hypothesis: where B1 is a p1 × q1 orthonormal matrix for a given q1 with 1 ≤ q1 ≤ p1. The hypothetical regression model covers many popular models in the literature, including the single-index model, the multi-index model and the partially linear single-index model. When the above regression model is a single-index model or a partially linear single-index model, the corresponding number q1 of indices is one or two, respectively. In particular, when q1 = p1, the hypothetical model becomes the classical model (1.2). Now consider a general alternative model. Let B be a d × q2 orthonormal matrix with B′B = I_{q2} and q2 an unknown number with d ≥ q2 > q1. The alternative hypothesis is Note that model (1.2) can also be written in this format. That is, under H0, we have the matrix (B1, 0_{p2×q1}). We develop a test that solely depends on the linear combinations of the significant covariates X under the null hypothesis H0, and automatically uses the linear combinations of all the covariates under the alternative hypothesis H1 so that the test is omnibus. We call it a Dimension Reduction Adaptive-to-Model test (DREAM).
Compared with existing locally smoothing tests, we will show that DREAM converges to its weak limit at the rate O(n^{-1/2}h^{-q1/4}) rather than O(n^{-1/2}h^{-d/4}), and can detect local alternative models distinct from the null model at this same rate. Further, it is worth mentioning that almost all existing locally smoothing tests require Monte Carlo approximation to determine critical values even when the limiting null distribution is tractable; otherwise, the significance level is difficult to maintain. In contrast, the dimension reduction structure of our test alleviates this difficulty: the critical values of the new test can be determined by its limiting null distribution even when the sample size is moderate. As with conventional local smoothing tests, the proposed test involves a somewhat subjective choice of bandwidth, which we make by the rule of thumb in this paper. One would naturally hope for an optimal bandwidth; however, choosing optimal bandwidths for local smoothing tests is still an open problem in theory (Stute and Zhu, 2005) [20]. To see the impact of the bandwidth selection, we implement several simulation studies and find that this issue is not critical for the proposed test when the rule of thumb is applied. More details are discussed in the next sections.
This paper is organized as follows. In Section 2, the test statistic construction is described. Because the dimension reduction technique plays a very important role, we first briefly review a promising method: discretization-expectation estimation. To give the test the model-adaptation property, a ridge-type eigenvalue ratio estimate (RERE) of the structural dimension q of B is recommended, which consistently estimates q1 and q2 under the null and alternative hypotheses, respectively. The asymptotic properties of the test statistic are presented in Section 3, where we also show that the test statistic tends to infinity at a certain rate under the global alternative hypothesis. In Section 4, we examine the finite sample performance of our test and apply it to a real data analysis for illustration. All the technical conditions and the proofs of the theoretical results are deferred to the Appendix.

Basic test statistic construction
It is worth noticing that in the above formulations, B and B1 are usually not identifiable. Actually, under H1, B is identifiable only up to a nonsingular transformation. Hence, what we can identify is BC for some matrix C and B1C1 for some matrix C1. This weaker identification is enough because, under the null hypothesis, In the following subsection, we briefly introduce a method to identify BC and B1C1. Without confusion of notation, we write B and B1 for BC and B1C1, respectively, throughout the rest of the paper.
Recall U = Y − g(X). Under the null hypothesis H0, we have where W(B′Z) is some positive weight function to be discussed below. Under the alternative hypothesis H1, since The above argument implies that the empirical version of the left-hand side of (2.1) can be used as a basis for a test statistic, and the null hypothesis H0 is rejected for large values of the test statistic. This motivates a naive construction like any in the literature. However, we also note that under the null hypothesis, for a selected weight function W̃, This means that under the null hypothesis, the matrix B of column dimension q2 reduces to a matrix of column dimension q1. To have a uniform version, we define B̃ as follows. The key is constructing a test statistic that fully uses this piece of information and automatically adapts to the model structure under the alternative hypothesis so that the test remains omnibus. We present our idea in the following construction. When a sample {(z1, y1), · · · , (zn, yn)} is available, the residual u_i is estimated as û_i = y_i − ĝ(B̂1′x_i), where ĝ(B̂1′x_i) is a kernel estimate of g(B1′x_i), with Q̄(·) being a q1-dimensional product kernel built from the univariate kernel Q(·), h1 a bandwidth and B̂1 an estimate of B1. Then we obtain the following kernel estimate Ê of E(U|B′Z) (or E(U|B̃′Z)):

In this formula, B̂ is an estimate of B (or B̃) with an estimated structural dimension q̂, in a sense specified in the next subsection, where K_h^{q̂} = K(·/h)/h^{q̂} with K(·) a q̂-dimensional kernel function and h a bandwidth. If we choose the weight W(·) (or W̃(·)) to be the density function p_B(·) (or p_B̃(·)) of B′Z (or B̃′Z), a non-standardized test statistic V_n can be constructed as in (2.4). Remark 2.1. The test statistic developed by Fan and Li (1996) [6] is of this type but uses all d covariates; we compare our test with Fan and Li's (1996) test statistic, and the results are reported in Section 3.
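To fix ideas, the residual construction û_i = y_i − ĝ(B̂1′x_i) and a density-weighted statistic of the kind in (2.4) can be sketched in Python as follows. The function names, the leave-one-out convention, and the normalization by n(n − 1) are our own illustrative choices rather than the paper's exact definitions; the quartic kernel is the one used in Section 4.

```python
import numpy as np

def quartic(u):
    """Quartic (biweight) kernel: 15/16 (1 - u^2)^2 on |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1 - u**2) ** 2, 0.0)

def nw_estimate(proj, y, h):
    """Leave-one-out Nadaraya-Watson estimate of E(Y | proj) with a
    product kernel; proj is the (n, q) array of projected covariates."""
    n, q = proj.shape
    diffs = (proj[:, None, :] - proj[None, :, :]) / h
    w = np.prod(quartic(diffs), axis=2) / h**q   # pairwise kernel weights
    np.fill_diagonal(w, 0.0)                     # leave-one-out
    denom = w.sum(axis=1)
    return (w @ y) / np.maximum(denom, 1e-12)

def v_n(z, x, y, B1_hat, B_hat, h1, h):
    """Illustrative unstandardized statistic: a density-weighted average of
    u_i u_j K_h(B_hat'z_i - B_hat'z_j), u_i = y_i - g_hat(B1_hat'x_i)."""
    n = len(y)
    u = y - nw_estimate(x @ B1_hat, y, h1)       # residuals under H0
    proj = z @ B_hat
    diffs = (proj[:, None, :] - proj[None, :, :]) / h
    k = np.prod(quartic(diffs), axis=2) / h ** proj.shape[1]
    np.fill_diagonal(k, 0.0)
    return (u[:, None] * u[None, :] * k).sum() / (n * (n - 1))
```

Under H0 the projection B̂′z_i reduces to the q1-dimensional B̂1′x_i, which is exactly where the dimension reduction enters the kernel weights.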

A brief review on discretization-expectation estimation
As commented above, we need to identify BC and B1C1. To this end, a method is discussed in this subsection that identifies the spaces spanned by B and B1 automatically under the null and alternative hypotheses; in other words, the method identifies basis vectors of the respective subspaces. This is an estimation problem for central mean subspaces in sufficient dimension reduction (e.g. Cook, 1998) [1]. The respective central mean subspaces are denoted S_{E(Y|X)} and S_{E(Y|Z)}, relating to the conditional independence between Y and X given B1′X and between Y and Z given B′Z. Also, q1 = dim(S_{E(Y|X)}) and q2 = dim(S_{E(Y|Z)}) are called the structural dimensions of S_{E(Y|X)} and S_{E(Y|Z)}, respectively. Here we assume that q1 is given, but q2 is unknown.
Following Zhu et al. (2010a) [24], to identify and estimate B (and B̃), the DEE estimation procedure can be summarized in the following steps. Without confusion of notation, we consider identifying B first and later clarify that the same identification applies to B̃.
where M_n(y_i) is an estimate of the matrix M(y_i) by some sufficient dimension reduction method such as SIR. Then the estimate B̂ consists of the eigenvectors associated with the largest q2 eigenvalues of M_n; B̂ is root-n consistent for B when q2 is given.
Note that under the null hypothesis, S_{E(Y|Z)} reduces to S_{E(Y|X)} because the conditional independence between Y and Z is equivalent to that between Y and X. In Step 2 above, sufficient dimension reduction theory tells us that the central subspace reduces to S_{Υ(t)|X}, and so does the matrix in Step 3. The matrix M = E{M(Ỹ)} then has q1 nonzero eigenvalues, and the estimate q̂ is automatically consistent for q1. The above estimation procedure is called discretization-expectation estimation (DEE) in Zhu et al. (2010a) [24], to which we refer for more details. The following proposition states the consistency of the estimated matrix B̂ under H0.
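The DEE steps above (dichotomize the response at each observed y_i, form the SIR candidate matrix for that binary response, average over i, then eigendecompose) can be sketched as follows. This uses two-slice SIR as the base dimension reduction method and is only an illustration of the procedure under the stated steps, not the authors' implementation.

```python
import numpy as np

def dee_sir(z, y):
    """Discretization-expectation estimation (DEE) with two-slice SIR.

    For each observed y_i, the response is dichotomized as I(Y <= y_i);
    the SIR candidate matrix for that binary response is
    Var(E(Z_std | I(Y <= y_i))), and M_n averages these over i.
    Returns (eigvals, eigvecs) of M_n, eigenvalues in decreasing order."""
    n, d = z.shape
    # SIR works on the standardized scale: Z_std = (Z - mean) Sigma^{-1/2}
    z_c = z - z.mean(axis=0)
    l_inv = np.linalg.inv(np.linalg.cholesky(np.cov(z, rowvar=False)))
    z_std = z_c @ l_inv.T
    m = np.zeros((d, d))
    for t in y:
        ind = y <= t
        p1 = ind.mean()
        if p1 in (0.0, 1.0):       # degenerate slicing, skip
            continue
        m1 = z_std[ind].mean(axis=0)    # slice mean for I(Y <= t) = 1
        m0 = z_std[~ind].mean(axis=0)   # slice mean for I(Y <= t) = 0
        # Var(E(Z_std | slice)) with two slices
        m += p1 * np.outer(m1, m1) + (1 - p1) * np.outer(m0, m0)
    m /= n
    vals, vecs = np.linalg.eigh(m)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]
```

The estimate B̂ then consists of the leading q eigenvectors (mapped back from the standardized scale in a full implementation; for identity covariance the scales coincide).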

Proposition 2.1. Under H0 and Conditions A1 and A2 in the Appendix, the DEE-based estimate B̂ is consistent for B̃.
Proposition 2.1 indicates that under H 0 , in probability the test statistic only uses those variables that are significant. The curse of dimensionality can then be largely alleviated when nonparametric estimation is inevitably required.
However, as q2 is unknown, to accommodate the alternative hypothesis we should estimate it and B consistently. Let q be q1 under the null and q2 under the alternative. We thus want an estimate of q that adapts to the underlying model and is consistent for q1 and q2 under the respective hypotheses. The following subsection provides such an estimate and its consistency.

Structural dimension estimation
We define a criterion to estimate q in an automatic manner. Although the BIC-type criterion in Zhu et al. (2010a) [24], motivated by Zhu et al. (2006) [27], is a candidate, choosing an appropriate penalty is a difficult issue.
In this paper, we recommend a ridge-type eigenvalue ratio estimate (RERE). In our practical experience, it is not very sensitive to the choice of ridge. Let λ̂_d ≤ λ̂_{d−1} ≤ · · · ≤ λ̂_1 be the eigenvalues of the estimating matrix M_n. In theory, we could use the original λ̂_i to define the ratio-based criterion below. However, we found that the couple of largest eigenvalues tend to be much larger than the other nonzero eigenvalues; some ratios of estimated nonzero eigenvalues can then fall below the minimizer, and the structural dimension q is often underestimated. Thus, we 'standardize' the eigenvalues to λ*_i and define the criterion and estimator in (2.7). This method is motivated by Xia [31] and is easily implemented.
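A ridge-type eigenvalue ratio criterion of the general kind described here can be sketched as follows. Since the paper's formula (2.7) is not reproduced above, both the standardization λ*_i = λ̂_i/(1 + λ̂_i), which tames very large leading eigenvalues, and the exact form of the ratio are illustrative assumptions.

```python
import numpy as np

def rere(eigvals, c_n):
    """Ridge-type eigenvalue ratio estimate of the structural dimension.

    eigvals : eigenvalues of M_n sorted decreasingly.
    Illustrative criterion:
        q_hat = argmin_{1 <= j <= d-1} (lam*_{j+1} + c_n) / (lam*_j + c_n),
    with lam*_j = eigvals_j / (1 + eigvals_j).  For j < q both terms are
    bounded away from zero; at j = q the numerator collapses to about c_n,
    producing the minimum; for j > q the ratio is near 1."""
    lam = np.asarray(eigvals, dtype=float)
    lam_star = lam / (1.0 + lam)              # 'standardized' eigenvalues
    ratios = (lam_star[1:] + c_n) / (lam_star[:-1] + c_n)
    return int(np.argmin(ratios)) + 1         # estimated dimension (1-based)
```

The ridge c_n keeps the ratio well-defined when trailing eigenvalues are essentially zero, which is exactly the regime where a plain eigenvalue ratio breaks down.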
The following proposition states the estimation consistency.

Proposition 2.2.
In addition to Conditions A1, A2 and A3 in the Appendix, assume c × log n/n ≤ c_n → 0 for some fixed c > 0. Then the estimate q̂ given by (2.7) has the following consistency.
From Proposition 2.2, the choice of c n can be in a relatively wide range to guarantee the estimation consistency under the null and global alternative hypotheses.
Altogether, the final estimate B̂ has the model adaptation property in the sense that under H0 it is consistent for B̃ = (C1B1, 0_{q1×p2}) for a q1 × q1 orthogonal matrix C1, and under H1 for B.
As q̂ equals q1 under the null and is larger than q1 under the alternatives, it may seem that we could simply use it to perform a test. However, based on q̂ alone, neither the type I error nor the power can be evaluated. Thus, it can only serve as an estimate rather than a test statistic.

Limiting null distribution
First, we introduce some notation. Let where p(·) denotes the density function of B1′X.
Further, s² can be consistently estimated by ŝ².
Therefore, according to Theorem 3.1, we obtain the standardized test statistic as: Further, applying Slutsky's theorem yields that under H0 the test statistic T_n is asymptotically normal:
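Given V_n and a consistent variance estimate ŝ², the standardization and the normal-based decision can be sketched as follows. The displayed formula is lost above, so the exact form T_n = nh^{q1/2} V_n/ŝ is our reading of the stated convergence rate and is an assumption; the one-sided rejection rule follows the text, since H0 is rejected for large values.

```python
import math

def dream_test(v_n, s2_hat, n, h, q1, alpha=0.05):
    """Standardize V_n and carry out the one-sided normal test.

    Illustrative standardization: T_n = n * h**(q1/2) * V_n / sqrt(s2_hat);
    under H0, T_n is asymptotically N(0, 1)."""
    t_n = n * h ** (q1 / 2.0) * v_n / math.sqrt(s2_hat)
    # one-sided p-value from the standard normal upper tail
    p_value = 0.5 * math.erfc(t_n / math.sqrt(2.0))
    return t_n, p_value, p_value < alpha
```

Because the critical value comes directly from N(0, 1), no bootstrap resampling is needed here, which is the practical payoff of the tractable limiting null distribution.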

Power study
We now study the power performance of the test statistic T_n. Consider the following sequence of local alternative hypotheses: A fixed C_n corresponds to the global alternative model, and when C_n goes to zero, the sequence gives local alternative hypotheses.
To obtain the main results on the power performance under H1n of (3.3), we first present the asymptotic behavior of the estimate q̂ when C_n → 0. Note that q̂ need not converge consistently to q2; this is caused by the convergence of the local alternative models to the hypothetical model as n → ∞. The following proposition states the result. Under H1n of (3.3) and the regularity conditions of Proposition 2.2, except that c × n^{-1}h^{-q1/2} log n ≤ c_n → 0 for some fixed c > 0, as n → ∞, q̂ determined by (2.7) has the following properties.
Comparing this proposition with Proposition 2.2, we see that the main difference is: when C_n converges to zero more slowly than O(n^{-1/2}h^{-q1/4}), the criterion we use cannot determine to which value the estimate converges. From the proof, it appears that at this rate of convergence of C_n to zero, the estimation consistency of q̂ is difficult to guarantee; this is an ongoing research topic beyond the scope of this paper. As we do not know whether q̂ converges, nor, even if it did, whether it converges to a constant, the power study described here is affected. The following theorem states how sensitive the test is to the alternative models in the other cases.

Theorem 3.2.
Under the regularity conditions in the Appendix, we have the following results.
with the density function p_{B1}(·) of B1′X, and s² given by (3.1). Further, s² can be consistently estimated by ŝ².
Remark 3.1. The results in this theorem confirm our claim in the first section. The convergence rate of the test statistic is nh^{q1/2}, and the test can detect local alternative models converging to the hypothetical model at the rate C_n = n^{-1/2}h^{-q1/4} under certain constraints on the alternative model structure. Fan and Li's (1996) [6] test, like other existing locally smoothing tests, has the corresponding rates with q1 replaced by d, which is much slower. However, as mentioned right before the theorem, when C_n converges to zero more slowly than O(n^{-1/2}h^{-q1/4}), the behavior of the estimate q̂ is unclear, and thus so is the asymptotic property of the test statistic, although one would reasonably guess that the power would be higher or asymptotically equal to 1. In the numerical studies, we consider several values of C_n in a range to see whether the power increases with C_n. When the estimator q̂ is less than the true structural dimension q, we also check whether the test statistic V_n based on q̂ can detect the corresponding alternatives with positive probability.

Simulations
In this subsection, we conduct simulations to investigate the finite sample performance of the proposed test. Delgado and González Manteiga's (2001) [4] test is defined as: As its limiting null distribution is intractable, the critical values are determined by the wild bootstrap, with bootstrap observations generated from the fitted model using multipliers from a sequence of i.i.d. random variables following a two-point distribution. The bootstrap critical values are computed from 1,000 bootstrap replications. The Gaussian-based kernel of order 4, Q(u) = (u⁴ − 7u² + 6)φ(u)/2, is used to estimate the nonparametric function g(·), where φ(·) denotes the standard normal density; see Fan and Hu (1992) [5]. For both DREAM and Fan and Li's (1996) [6] test, we use the quartic kernel K(u) = 15/16 (1 − u²)² for |u| ≤ 1 and 0 otherwise, in constructing the test statistics such as that in (2.4). To determine the structural dimension q, c_n = 0.1 log n/nh is used. The covariates are generated, respectively, from the multivariate normal distributions N(0, Σ1), N(0, Σ2), or N(0, Σ3), independent of the standard normal errors, in which Σ1 = (σ We design four examples in this section. The first example shows how the three competitors perform when the dimensions p1 and p2 are small and the model is low-frequency. In Example 2, the dimensions grow larger under a high-frequency model, and we check the impact of dimensionality. Example 3 examines whether DREAM is still omnibus even though the test statistic fully uses the low-dimensional information under the null hypothesis. The model in Example 4 has a higher dimension of B′Z, and we can see whether, like existing locally smoothing tests, DREAM also fails to work. The details are in the examples.
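The wild bootstrap used for the competitor test can be sketched generically as follows. The exact two-point multiplier distribution is not reproduced above, so Mammen's two-point law (a common choice with mean 0 and variance 1) is used here as an assumption, and the statistic is passed in as a callable.

```python
import numpy as np

def mammen_weights(n, rng):
    """Two-point wild-bootstrap multipliers (Mammen's distribution, an
    illustrative choice; the paper's exact two-point law may differ).
    Takes (1 - sqrt(5))/2 w.p. (5 + sqrt(5))/10, else (1 + sqrt(5))/2,
    so that E V = 0 and E V^2 = 1."""
    a = (1 - np.sqrt(5)) / 2
    b = (1 + np.sqrt(5)) / 2
    p_a = (5 + np.sqrt(5)) / 10
    return rng.choice([a, b], size=n, p=[p_a, 1 - p_a])

def wild_bootstrap_critical_value(residuals, fitted, statistic, n_boot, alpha, rng):
    """Generic wild bootstrap: regenerate y* = fitted + V * residuals,
    recompute the statistic, and return its upper-alpha sample quantile."""
    stats = []
    for _ in range(n_boot):
        v = mammen_weights(len(residuals), rng)
        stats.append(statistic(fitted + v * residuals))
    return np.quantile(stats, 1 - alpha)
```

The computational cost noted in the introduction is visible here: every bootstrap replication recomputes the full test statistic, which is expensive when that statistic involves multivariate smoothing.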

Example 1.
Consider the linear regression model: In this example, p1 = p2 = 2 and p1 = p2 = 4 are considered, where the hypothetical and alternative models correspond to a = 0 and a ≠ 0, respectively. To check the sensitivity to bandwidth selection, we choose different bandwidths h = c × n^{-1/(4+q)} for c ∈ {1, 1.25, 1.5, 1.75, 2}. Figure 2 reports the empirical sizes and powers for these bandwidths when n = 200. The empirical power is relatively robust against the different bandwidths, and the empirical size is not very sensitive to the bandwidth either. Thus, the bandwidth h = 1.75 × n^{-1/(4+q)} is recommended throughout the simulations. The results of the three tests under different combinations of sample sizes, dimensions of the covariates X and W, and covariance matrices Σ are reported in Tables 1 and 2.
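The rule-of-thumb bandwidth family examined here is easy to state in code; the constant c and the exponent follow the text.

```python
def bandwidth(n, q, c=1.75):
    """Rule-of-thumb bandwidth h = c * n**(-1/(4+q)) from the simulations;
    c is varied over {1, 1.25, 1.5, 1.75, 2} in the sensitivity check."""
    return c * n ** (-1.0 / (4 + q))
```

Note that h shrinks more slowly as q grows, which is the usual dimension-dependent undersmoothing trade-off in kernel testing.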
From Table 1, we can clearly observe that when the dimensions p1 and p2 are low, all the tests have similar empirical powers and control the empirical sizes well; the power performances of the competitors are very good. However, from Table 2, we can see that as the dimensions p1 and p2 increase, T_n^{DEE} is significantly and uniformly more powerful than T_n^{FL} and T_n^{DM}, while still maintaining the significance level well. Comparing Table 1 with Table 2, we can see that the dimensions of X and W have less influence on T_n^{DEE} than on T_n^{FL} and T_n^{DM}. When p1 = p2 = 4, Fan and Li's (1996) [6] test completely fails to detect the alternative hypothesis, with power close to the significance level even when n = 200. Further, DREAM is robust against the correlation structure of (X, Z), whereas this structure significantly influences the power of Delgado and González Manteiga's (2001) [4] test, particularly when p1 = p2 = 4. Example 2. In this example, consider a nonlinear high-frequency regression model with β1 = (1, · · · , 1, 0, · · · , 0)′/√(p1/2), where the first p1/2 components equal 1, and β2 = (0, · · · , 0, 1, · · · , 1)′/√(p2/2), where the last p2/2 components equal 1.
We also consider two cases of dimensions: p1 = p2 = 4 and p1 = p2 = 6. Again, a = 0 corresponds to the hypothetical model. The results are presented in Tables 3 and 4. Comparing Table 2 with Tables 3-4, we find that when the dimensions of X and W grow to p1 = p2 = 6, the empirical powers of T_n^{FL} and T_n^{DM} are close to 0, meaning that both competitors completely fail to detect the alternative hypothesis. We can also find that the empirical power of DREAM is similar to that for p1 = p2 = 4 in Table 2. This again suggests that the dimensions of X and W have much less impact on T_n^{DEE} than on T_n^{FL} and T_n^{DM}. The next example confirms that DREAM is still omnibus. Example 3. The data are generated from the following model: where p1 = 4, p2 = 4, β1 = (1, 1, 0, 0)′/√2 and β2 = (0, 0, 1, 1)′/√2. Thus, q1 = 2 and q2 = 3. In this model, the conditional expectation E(Y|B1^T X) is the same under the null and alternative hypotheses. If we simply used this function to define a test, the alternative hypothesis could not be detected at all from a theoretical point of view. However, DREAM automatically adapts to the alternative model through the matrix B^T Z, and thus still works under the alternative hypothesis.
Comparing Tables 2, 3 and 5, we observe that the power performance of DREAM in Example 3 is similar to that in Examples 1 and 2. This means that the test enjoys the advantage of dimension reduction and is still omnibus.
We have the following observations from the results of Table 6. First, T_n^{DEE} works better than its competitors. The comparison between T_n^{DEE} and T_n^{FL} further substantiates the theoretical advantage that our test has a faster convergence rate than Fan and Li's (1996) [6] test. Second, in the limited simulations, the heavy tail (Case 3) does not have a significant impact on the performance of any of the three tests. Third, when the dimension p2 becomes larger, the performance of T_n^{DEE} is affected only slightly, whereas T_n^{FL} and T_n^{DM} are significantly negatively affected. Therefore, the dimension is a key factor in the power deterioration of T_n^{FL} and T_n^{DM}, while it is not for the proposed test T_n^{DEE}.
The following example considers higher dimensional B Z in a model.
Comparing them with the results in Examples 1-4, we can see that T_n^{DEE} maintains the significance level well and its power reasonably becomes lower. Based on our very limited numerical studies, which are not reported, we find that it is difficult to estimate the structural dimension q = 5, but our test can still detect the alternative models with positive probability.
In summary, the above simulations support the aforementioned theoretical properties: the proposed test is significantly superior to existing tests, among which Fan and Li's (1996) [6] test and Delgado and González Manteiga's (2001) [4] test are regarded as representatives.

Baseball hitters' salary data
We now analyze the well-known baseball hitters' salary data set, which was originally published for the 1988 ASA Statistical Graphics and Computing Data Exposition and is available at http://euclid.psych.yorku.ca/ftp/sas/sssg/data/baseball.sas. The data set consists of information on salary and 16 performance measures for 263 major league baseball hitters. As always, the question of main interest is whether salary reflects performance. As displayed by Friendly (2002) [7], the 16 measures naturally belong to three performance categories: the season hitting statistics, which include the numbers of times at bat (X1), hits (X2), home runs (X3), runs (X4), runs batted in (X5), and walks (X6) in 1986; the career hitting statistics, which include the numbers of years in the major leagues (X7), times at bat (X8), hits (X9), home runs (X10), runs (X11), runs batted in (X12) and walks (X13) during the players' entire careers up to 1986; and the fielding variables, which include the numbers of putouts (X14), assists (X15) and errors (X16) in 1986. Further, the covariates from different groups are weakly correlated. The logarithm of annual salary in 1987 is used as the response variable (Y), and new covariates are constructed from the career totals by dividing the totals by years in the major leagues: let X*_j = X_j/X_7 for j = 8, · · · , 13. As remarked by Hoaglin and Velleman (1995) [9], analyses working with ln(salary) and with the annual-rate covariates fared better than those working with the raw forms of these covariates. Below, we use the X*_j's instead. All the covariates are standardized to have mean zero and unit length. We write V1 = (X1, · · · , X6), V2 = (X7, X*_8, · · · , X*_13) and V3 = (X14, X15, X16). In this application, we consider two cases: Under the two cases, the values of the test statistics are respectively 1.1171 and 6.5831, and the corresponding p-values are 0.1320 and 0.0000.
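The covariate preprocessing described here (annual-rate covariates X*_j = X_j/X_7 and standardization to mean zero and unit length) can be sketched as follows; the 0-based column layout (X_7 in column 6, the career totals X_8 to X_13 in columns 7 to 12) is an assumption about how the 16 measures are stored.

```python
import numpy as np

def prepare_hitters(X, years_col=6):
    """Construct annual-rate covariates X*_j = X_j / X_7 for the career
    totals and standardize every covariate column to mean zero and unit
    length, as described for the salary data.  Column indices are
    illustrative (0-based: X_7 is column `years_col`)."""
    X = np.asarray(X, dtype=float).copy()
    years = X[:, years_col]
    for j in range(7, 13):               # X_8 .. X_13 -> X*_8 .. X*_13
        X[:, j] = X[:, j] / years
    X = X - X.mean(axis=0)               # mean zero columns
    norms = np.linalg.norm(X, axis=0)
    return X / np.where(norms > 0, norms, 1.0)   # unit-length columns
```

The division uses the raw years-in-league column before standardization, so the rates keep their interpretation as per-year averages.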
From the results under Cases (I) and (II), we can conclude that the career hitting statistics in the group V_2 have a positive impact on the annual salary. The results are consistent with those of Xia et al. (2002) [22], who found that the variables X_7, X_9 and X_13 in the group V_2 prominently affect the annual salary. Moreover, in the estimated directions suggested by Xia et al. (2002) [22], the coefficients of the fielding covariates in the group V_3 are close to 0. Therefore, for the annual salary, the group V_2 contains significant covariates while the group V_3 does not.

Conclusions
In this paper, we develop a dimension reduction-based adaptive-to-model test to determine significant covariates under the nonparametric regression framework. The approach employs a dimension reduction technique so that the constructed test maintains the significance level well and is more powerful than existing tests in the literature. The methodology is also applicable to checking other semiparametric regression models, for example partially linear models, single-index models and partially linear single-index models; this research is ongoing. Further, as the test involves nonparametric estimation under the null hypothesis, no test of this type can work well when the dimension of X (or B_1 X) is high, and this deserves further study. Another issue concerns the inconsistency of the estimate of the structural dimension q_2 under the local alternative hypotheses, which destroys the omnibus property. As this problem is caused not by the model-adaptation idea but by the estimation accuracy, defining a good estimate of q under the local alternative hypotheses is a very interesting topic for further study.

A.1. Regularity conditions
To prove the asymptotic properties in Sections 2 and 3, we impose the following regularity conditions:
A1 M_n(t) has the following expansion: where E_n(·) denotes the average over all sample points, E{ψ(Z, Y, t)} = 0 and E{ψ²(Z, Y, t)} < ∞.
A2 sup_t ||R_n(t)||_F = o_p(n^{-1/2}), where ||·||_F denotes the Frobenius norm of a matrix.
A3 The estimate M̂_n(t) has the following expansion:
A5 The density function p_{B1}(·) of B_1 X exists with support C and has a continuous and bounded first-order derivative on C. The density p_{B1}(·) satisfies
A6 The function g(·) is partially differentiable up to some positive integer order η, and the ηth-order partial derivatives of g(·) are bounded.
A7 Q(·) is a symmetric and twice continuously differentiable kernel function satisfying where δ_ij is the Kronecker delta and η is given in Condition A6.
A8 K(·) is a bounded, symmetric and first-order continuously differentiable kernel function satisfying ∫ K(u)du = 1.
Conditions of this type are commonly used in the literature; see, e.g., Fan and Li (1996) [6]. Conditions A5 and A8 guarantee the asymptotic normality of the DREAM statistic and make the test well behaved. Condition A9 on the choice of the bandwidth h is reasonable because the estimate q̂ behaves differently under the null and alternative hypotheses.
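For reference, an η-th order kernel in the sense used by Conditions A6-A7 typically satisfies moment conditions of the following standard form (cf. Fan and Li (1996) [6]); this is an illustrative statement of such conditions, not a reconstruction of the display omitted from Condition A7:

```latex
% Standard univariate eta-th order kernel moment conditions; the
% multivariate Q(.) of Condition A7 can be taken as a product of such
% kernels (illustrative only -- not the paper's exact display).
\int Q(u)\,du = 1, \qquad
\int u^{j}\,Q(u)\,du = 0 \quad (1 \le j \le \eta - 1), \qquad
\int u^{\eta}\,Q(u)\,du \neq 0 .
```

Kernels of order η > 2 necessarily take negative values, which is why Q(·) is only required to be symmetric and smooth rather than to be a density.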

A.2. Proof of the theorems
Proof of Proposition 2.1. Note that under the null hypothesis, where the notation ⊥⊥ stands for independence. This is equivalent to Since Zhu et al. (2010a) [24] argued that {γ_1, ..., γ_{q1}} ∈ S_{E(Y|Z)}, we have {γ_1, ..., γ_{q1}} ∈ Span(B). This implies that each γ_j, j = 1, ..., q_1, can be written as a linear combination of the columns of B. Thus, for j = 1, ..., q_1, γ_j has the form γ_j = (γ̃_j, 0_{1×p2}) with γ̃_j a p_1 × 1 vector. This implies that any element of S_{E(Y|Z)} can also be written as γ_j = (γ̃_j, 0_{1×p2}). Further, the structural dimension of S_{E(Y|Z)} is smaller than or equal to q_1. Further, under Conditions A1 and A2, Theorem 2 in Zhu et al. (2010a) [24] shows that M_n − M = O_p(n^{-1/2}). From the arguments in Zhu and Fang (1996) [26] and Zhu and Ng (1995) [28], under some regularity conditions, λ̂_i − λ_i = O_p(n^{-1/2}), where λ̂_d ≤ λ̂_{d−1} ≤ ··· ≤ λ̂_1 are the eigenvalues of the matrix M_n and the λ_i are the eigenvalues of the matrix M. The estimate B̂ consisting of the eigenvectors associated with the largest q_1 eigenvalues of M_n is consistent for (C_1 B_1, 0_{q1×p2}) for some q_1 × q_1 orthogonal matrix C_1.
Proof of Theorem 3.1. Define the events A_n = {T_n ≥ c} for any constant c and B_n = {q̂ = q_1}. According to Proposition 2.2, under the null hypothesis q̂ = q_1 with probability going to one, namely lim_{n→∞} P(B_n) = 1, where P(A) denotes the probability that the event A occurs. The fact that, whenever P(A_n) converges, lim_{n→∞} P(A_n) = lim_{n→∞} P(A_n ∩ B_n), implies that under the null hypothesis, in an asymptotic sense we only need to consider the case q̂ = q_1 when deriving the asymptotic properties of the test statistic T_n; we do so throughout the proof of Theorem 3.1. For notational convenience, denote where p̂_{1i} is a kernel estimate of the density function p_{B1}(·) of B_1 x_i given by Throughout this appendix, E_i(·) = E(·|z_i). To have a uniform version in this proof, under the null hypothesis, define B = B̃ = (B_1, 0_{q1×p2}).
Note that û_i ≡ We then decompose the term V_n as: First, consider the term Q_{1n}. Taking a Taylor expansion of Q_{1n} with respect to B, we have Define a function as We have f(1) = Q_{1n} and f(0) = Q_{11n}. By an application of the mean value theorem, f(1) − f(0) = f'(t̄) = Q_{12n} with t̄ ∈ (0, 1). Thus, we can conclude that Because ||B̂ − B|| = O_p(1/√n) and the first derivative of K_B(·) with respect to B is a bounded continuous function of B, replacing B* by B does not affect the convergence rate of Q_{12n}.
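The mean-value-theorem step can be made explicit. As an illustration (with labels of our own; the paper's displays for Q_{1n}, Q_{11n} and Q_{12n} are not reproduced verbatim), the interpolating function may be taken as

```latex
% Interpolation between the true matrix B and the estimate \hat B:
% f(1) recovers Q_{1n} (kernel weights at \hat B), f(0) recovers Q_{11n}
% (kernel weights at B), and the derivative term plays the role of Q_{12n}.
f(t) = \frac{1}{n(n-1)}\sum_{i\neq j}
   K_h\!\Bigl\{\bigl(B + t(\hat B - B)\bigr)^{\top}(z_i - z_j)\Bigr\}
   \,\hat u_i\,\hat u_j , \qquad t\in[0,1].
```

Then f(1) − f(0) = f'(t̄) for some t̄ ∈ (0, 1), which is Q_{12n} evaluated at the intermediate matrix B* = B + t̄(B̂ − B); since ||B̂ − B|| = O_p(n^{-1/2}), the intermediate point contributes only higher-order terms.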
In the present paper, as the dimension of B Z is fixed, the term Q_{11n} is a U-statistic. Since q̂ = q_1 under H_0, following an argument similar to that for Lemma 3.3 in Zheng (1996) [23], it is easy to obtain: with σ²(B z) = E(u²|B Z = B z) and B = (B_1, 0_{q1×p2}); we omit the details. We now turn to the term Q_{12n} in (A.4). Since E(u_i|z_i) = 0, we have E(Q_{12n}) = 0. We then calculate the second-order moment of Q_{12n} as follows: Changing variables as v_1 = (z_1 − z_2)/h, a further computation yields Taking a Taylor expansion of p(B (z_1 − hu)) around z_1 and using Conditions A4, A7, A8 and A9, we have An application of Chebyshev's inequality leads to |Q_{12n}| = o_p(n^{-1}h^{-q1/2}). Using the above results for the terms Q_{11n} and Q_{12n}, it follows that nh^{q1/2} Q_{1n} ⇒ N(0, s²). (A.10) By an argument similar to the justification for the term Q_{1n}, we also conclude that As proved for the term Q_{12n}, replacing B* and B*_1 by B and B_1, respectively, does not influence the convergence rate of the term Q_{22n}.
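The quadratic form analyzed in this step is a Zheng (1996)-type conditional-moment statistic. The following is a purely illustrative numerical sketch (the function name, Gaussian kernel choice and toy data are our own; this is not the paper's DREAM implementation), with q_1 = 1:

```python
import numpy as np

def zheng_statistic(proj, resid, h):
    """Zheng (1996)-type kernel statistic
        V_n = (n(n-1))^{-1} * sum_{i != j} K_h(proj_i - proj_j) * u_i * u_j,
    where proj holds the projected covariates B'z_i (n x q) and
    K_h(t) = h^{-q} K(t/h) with a Gaussian kernel K."""
    n, q = proj.shape
    diff = (proj[:, None, :] - proj[None, :, :]) / h       # pairwise (n, n, q)
    K = np.exp(-0.5 * (diff ** 2).sum(axis=2)) / ((2 * np.pi) ** (q / 2) * h ** q)
    np.fill_diagonal(K, 0.0)                               # exclude i = j terms
    Vn = (K * np.outer(resid, resid)).sum() / (n * (n - 1))
    # Variance estimate in the style of Zheng (1996); the standardized
    # statistic T_n = n h^{q/2} V_n / s_hat is asymptotically N(0, 1) under H0.
    s2 = 2.0 * h ** q * (K ** 2 * np.outer(resid ** 2, resid ** 2)).sum() / (n * (n - 1))
    return Vn, n * h ** (q / 2) * Vn / np.sqrt(s2)

# Toy data under the null: residuals independent of the projected covariate.
rng = np.random.default_rng(1)
n = 200
proj = rng.normal(size=(n, 1))   # plays the role of B_1'x_i with q1 = 1
resid = rng.normal(size=n)       # plays the role of u_i
Vn, Tn = zheng_statistic(proj, resid, h=n ** (-1 / 5))
```

Because the kernel is applied to the q_1-dimensional projection rather than to all d covariates, the statistic converges at the rate n h^{q1/2}, which is the source of the dimension reduction advantage discussed in the Introduction.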
Similar to the proof of Proposition A.1 in Fan and Li (1996) [6], to finish the proof of this theorem it suffices to show that E(Q²_{21n}) = o_p(n^{-2}h^{-q1}). The calculation of E(Q²_{21n}) is rather tedious, so we first prove that E(Q_{21n}) = o_p(n^{-1}h^{-q1/2}).
Consider two cases with different combinations of indices i, j, l, k.
Case II: A_2 = {i, j, l, k take no more than three different values}. Denote the corresponding term as Q_{212n}. It is easy to derive that E(Q_{212n}) = o_p(n^{-1}h^{-q1/2}).
Hence, altogether, we have E(Q_{21n}) = E(Q_{211n}) + E(Q_{212n}) = o_p(n^{-1}h^{-q1/2}). We now turn to computing E(Q²_{21n}), which can be decomposed as: Firstly, when the indices i, j, l, k are all different from i', j', l', k', the two parts in the two different braces are independent of each other; define the related sum as LA1. Applying the same argument as before, LA1 = o_p(n^{-2}h^{-q1}). Secondly, consider the case where exactly one index from i, j, l, k equals one of the subscripts i', j', l', k'. By symmetry, we only need to consider Case (i): i = i'; Case (ii): i = l'; and Case (iii): l = l'. The three cases correspond to the related sums LA2, LA3 and LA4, respectively.
Under Case (i), we have For Case (ii), we have

Similarly, it is easy to prove that for Case (iii), LA4 = o_p(n^{-2}h^{-q1}).
Finally, when the indices i, j, l, k, i', j', l', k' take no more than six different values, with the corresponding sum defined as LA5, it is easy to see that LA5 = o_p(n^{-2}h^{-q1}).
Similar arguments can be applied to handle the first and second moments of Q_{22n} defined in (A.7). The treatment of the first moment is very similar to that for Q_{21n}. Note that Q_{22n} is a weighted sum of Q_{221n}, Q_{222n} and Q_{223n} in (A.8)-(A.10); thus, we only need to handle Q_{221n}, Q_{222n} and Q_{223n} separately, and they can be treated similarly to Q_{21n}. When Conditions A4-A9 hold, it is easy to derive the following convergence rates: Together with (A.11)-(A.13), an application of Chebyshev's inequality yields Q_{22n} = o_p(n^{-1}h^{-q1/2}). Therefore, altogether, from the definition of Q_{2n} in (A.5), we have Q_{2n} = o_p(n^{-1}h^{-q1/2}).
In summary, we conclude that: The variance s² can be estimated by ŝ². Since the proof is straightforward, we only give a very brief outline. Under the null hypothesis, the estimates B̂ and ĝ are consistent for B̃ and g, and some elementary calculations yield the asymptotic representation: Applying an argument similar to that used for proving Lemma 2 in Guo et al. (2016) [8], one can derive with B = (B_1, 0_{q1×p2}). As ŝ² is a U-statistic, it is easy to prove that ŝ² → s² in probability; more details can be found in Zheng (1996) [23].
Further, we note that Therefore, the matrix M(t) can also be reformulated as where Y_n = g(B_1 X) + C_n G(B Z) + ε ≡ Y + C_n G(B Z). Thus, for all t, we have Under Condition A2, we can conclude that n^{-1} Σ_{i=1}^n z_i I(y_{ni} ≤ t) − E{Z I(Y ≤ t)} = O_p(max{C_n, n^{-1/2}}) = O_p(C_n). By arguments parallel to those in Zhu and Fang (1996) [26] and Zhu and Ng (1995) [28], we get that λ̂_i − λ_i = O_p(C_n), where λ̂_d ≤ λ̂_{d−1} ≤ ··· ≤ λ̂_1 are the eigenvalues of the matrix M_n. Note that M corresponds to the null hypothesis and therefore satisfies Span{M} = S_{E(Y|Z)} and rank(M) = q_1. This implies that λ_d = ··· = λ_{q1+1} = 0 and 0 < λ_{q1} ≤ ··· ≤ λ_1, where the λ_i are the eigenvalues of the matrix M. It is clear that λ_l = 0 for any l > q_1, so we get (λ*_l)² = O_p(C²_n), where λ* is defined in (2.6). Recall the definition of λ̂ in (A.1) at the beginning of the proof of Proposition 2.2. For any 1 ≤ l ≤ q_1, we have (λ*_l)² = (λ_l)² + O_p(C_n). We first prove part (I), considering the case C_n = O(n^{-1/2}h^{-q1/4}). Since c × n^{-1}h^{-q1/2} log n ≤ c_n → 0 for some fixed c > 0, where c_n is the ridge value in the criterion RERE, we have C²_n = o_p(c_n). When l > q_1, as c × n^{-1}h^{-q1/2} log n ≤ c_n → 0, we have, in probability, When 1 ≤ l < q_1, we derive that:
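The ridge-type ratio argument above can be illustrated concretely. The sketch below uses one common form of such a RERE-style criterion together with toy eigenvalues of our own choosing; the paper's exact criterion (2.7) and ridge value c_n are not reproduced here.

```python
import numpy as np

def ridge_ratio_dimension(eigvals, c_n):
    """One common ridge-type eigenvalue-ratio (RERE-style) criterion:
        q_hat = argmin_{1 <= l <= d-1} (lam_{l+1}^2 + c_n) / (lam_l^2 + c_n).
    With the ridge c_n tending to 0 more slowly than the squared estimation
    error, the ratio is minimized at the true structural dimension, where
    lam_{q+1} is near 0 while lam_q stays bounded away from 0."""
    lam2 = np.sort(np.abs(np.asarray(eigvals)))[::-1] ** 2   # decreasing order
    ratios = (lam2[1:] + c_n) / (lam2[:-1] + c_n)
    return int(np.argmin(ratios)) + 1                        # 1-based dimension

# Toy spectrum: two signal eigenvalues, the rest at O(n^{-1/2}) noise level.
n = 400
lam = [1.8, 0.9, 0.02, 0.015, 0.01]
q_hat = ridge_ratio_dimension(lam, c_n=np.log(n) / np.sqrt(n))
```

The ridge c_n dominates the squared noise eigenvalues but is dominated by the squared signal eigenvalues, which is exactly the separation exploited in the proof: under the null the criterion picks q_1, while under a fixed alternative it picks q_2.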
Then we have, in probability, Therefore, altogether, we can conclude that q̂ = q_1 with probability approaching 1.
Then we have, in probability, This implies that q̂ determined by (2.7) is greater than or equal to q_1 with probability approaching 1, namely P(q̂ ≥ q_1) → 1.

Proof of Theorem 3.2. We first prove Part (I).
Since the details are similar to those in the proof of Theorem 3.1, we only sketch the argument. First, according to Proposition 2.2, under the global alternative hypothesis P(q̂ = q_2) → 1; using the same argument as in Theorem 3.1, we get that for any fixed constant c, lim_{n→∞} P(T_n ≥ c) = lim_{n→∞} P(T_n ≥ c, q̂ = q_2). Therefore, under the global alternative hypothesis, in an asymptotic sense we only need to consider the conditional probability given the event q̂ = q_2. Using ||B̂_1 − B_1|| = O_p(1/√n) and Conditions A6 and A8 in the Appendix, ĝ(B̂_1 x) is a uniformly consistent estimate of g(B_1 x) = E(Y|B_1 X = B_1 x); see Powell et al. (1989) [16] or Robinson (1988) [19]. We then have where u_i = y_i − g(B_1 x_i) with g(B_1 x_i) = E(y_i|x_i). Let (z_i) denote m(B z_i) − g(B_1 x_i). Therefore, by U-statistic theory, we get that Similarly, we can also prove that ŝ² converges in probability to a positive value which may differ from the s² defined in (3.1). Therefore, T_n/(nh^{q1/2}) converges in probability to a positive constant.
Consider Part (II). Under the local alternative hypotheses with C_n = n^{-1/2}h^{-q1/4}, according to Lemma 3.1 we have q̂ = q_1 with probability tending to one. Applying the same argument as in Theorem 3.1, we get that for any constant c, lim_{n→∞} P(T_n ≥ c) = lim_{n→∞} P(T_n ≥ c, q̂ = q_1). This implies that under the local alternative hypotheses with C_n = n^{-1/2}h^{-q1/4}, Further, apply the change of variables u = (t_i − t_j)/h to get: Again using elementary properties of U-statistics, we derive: Thus, we can deduce that As the variance s² can be consistently estimated, the expected result follows. In summary, invoking Slutsky's theorem, it is concluded that The proof of Theorem 3.2 is complete.