A dimension reduction based approach for estimation and variable selection in partially linear single-index models with high-dimensional covariates

Abstract: In this paper, we formulate the partially linear single-index models as bi-index dimension reduction models for the purpose of identifying significant covariates in both the linear part and the single-index part through only one combined index, in a dimension reduction approach. This is different from all existing dimension reduction methods in the literature, which in general identify two basis directions in the subspace spanned by the parameter vectors of interest, rather than the two parameter vectors themselves. This approach makes the identification, and the subsequent estimation and variable selection, easier than existing methods for multi-index models. When the number of parameters diverges with the sample size, we then adopt the coordinate-independent sparse estimation procedure to select significant covariates and estimate the corresponding parameters. The resulting sparse dimension reduction estimators are shown to be consistent and asymptotically normal with the oracle property. Simulations are conducted to evaluate the performance of the proposed method, and a real data set is analysed for illustration.


Introduction
The single-index model is an important generalization of the multiple linear regression model with an unknown link function. It has been widely studied and used to explore complicated relationships between a response and covariates of interest (Horowitz, 2009), and it may reflect interactions among covariates. To combine the interpretability of the multiple linear model with the flexibility of the single-index model, their hybrid, the partially linear single-index model (PLSIM), has been studied and applied to various complex data arising from biological and economic studies (Xia and Härdle, 2006; Yu and Ruppert, 2002). To the best of our knowledge, the first remarkable work on PLSIM was by Carroll et al. (1997), who proposed a backfitting algorithm to estimate the parameters of interest in a more general setting, namely the generalized PLSIM. Yu and Ruppert (2002) argued that the estimators proposed by Carroll et al. (1997) may be unstable and suggested a penalized spline estimation procedure. Xia and Härdle (2006) applied minimum average variance estimation (MAVE, Xia et al., 2002) to PLSIM and developed an effective algorithm. More recently, Wang et al. (2010) studied estimation in PLSIM under additional assumptions on the model structure, and Liang et al. (2010) proposed a profile least squares (PrLS) estimation procedure. However, when these methods are applied to the case with a diverging number of covariates, one may encounter several challenges. For example, MAVE may suffer from the sparseness problem noted by Cui et al. (2011), and the PrLS procedure is not easy to implement in high-dimensional settings because it requires minimizing a high-dimensional nonlinear objective function. In this paper, we propose a method for estimation and variable selection in PLSIM when the dimensions of the covariates diverge, integrating the dimension reduction principle with a sparse penalization based variable selection approach.
There has been much work on penalty based variable selection methods for semiparametric models with a diverging number of covariates. For example, Xie and Huang (2009) and Ni et al. (2009) studied variable selection for partially linear models (PLM), a special case of PLSIM, and established selection consistency and asymptotic normality of their estimators, using polynomial splines and smoothing splines, respectively, to approximate the nonparametric function. Ravikumar et al. (2009) investigated high-dimensional nonparametric sparse additive models, developed a new class of algorithms for estimation and discussed the asymptotic properties of their estimators. Meier et al. (2009), Huang et al. (2010) and Li et al. (2012) also studied variable selection for high-dimensional nonparametric sparse additive models. Wang and Zhu (2011) derived almost necessary and sufficient conditions for the estimation consistency of parameter estimators for single-index models in the "large p, small n" paradigm. See Fan and Li (2006) for a review of variable selection for high-dimensional data. Only Liang et al. (2010) carried out variable selection in the context of PLSIM, using the smoothly clipped absolute deviation penalty (SCAD, Fan and Li, 2001) to simultaneously select significant covariates and estimate the corresponding parameters of interest. However, their method is limited to the case of fixed covariate dimension.
As an effective way to deal with the "curse of dimensionality", dimension reduction techniques identify the subspace spanned by a few linear combinations of the covariates that capture the full regression information between the response and the covariates. This subspace is called the central dimension reduction subspace (CS, Cook, 1998). The focus is therefore on the linear combinations rather than on the original covariate vector. If the linear combinations exhaust all the information in the mean regression function, the corresponding subspace is called the central mean subspace (Cook, 1998). For instance, in the multiple linear model only one linear combination of the covariates affects the response. A rich literature includes Li (1991) for sliced inverse regression (SIR), Cook and Weisberg (1991) for sliced average variance estimation (SAVE), Li (1992) for principal Hessian directions, Li and Wang (2007) for directional regression (DR), Wang and Xia (2008) for sliced regression, Zhu, Wang, Zhu and Ferré (2010) for discretization-expectation estimation, and Zhu, Zhu and Feng (2010) for cumulative slicing estimation (CUME). There has also been interest in dimension reduction with a diverging number of covariates. As a first attempt in this direction, Zhu et al. (2006) revisited SIR, whereas Zhu and Zhu (2009b) suggested a weighted partial least squares method to cope with highly correlated covariates in semiparametric regressions. It is known that these dimension reduction methods are usually unable to identify the significant covariates that are often of primary interest, because they identify only the central subspace or the central mean subspace in general cases with more than one index. More recently, efforts have been made to incorporate dimension reduction into variable selection procedures. Important results of these efforts include the least squares approach with the SCAD penalty for general multi-index models by Wu and Li (2011), the least squares formulation by Yin and Cook (2002), and coordinate-independent sparse estimation (CISE) by Chen et al. (2010), in which the authors introduced a coordinate-independent penalty into a least squares objective function formulated by Li (2007). CISE is shown to produce sparse solutions with the oracle property.
In this paper, we first formulate the PLSIM in a dimension reduction framework so that we can identify the direction of the nonzero coefficients. We then invoke the sufficient dimension reduction principle and incorporate a coordinate-independent penalty (Chen et al., 2010) to achieve sparse dimension reduction. In theory, we show that our method is capable of correctly identifying the significant covariates with probability approaching one. The selection further allows us to derive asymptotically normally distributed estimators of the nonzero coefficients.
There is an interesting feature of the method that is of independent importance in dimension reduction. Note that in this model there are two parameter vectors, one in the linear part and one in the single-index part. When we formulate the model as a bi-index model in a dimension reduction framework, as will be seen below, all existing dimension reduction approaches identify a CS (Cook, 1998) spanned by these two vectors. In other words, any basis vector of this space is a linear combination of them, and in general the two parameter vectors themselves cannot be identified. Interestingly, however, we find that the partially linear single-index model admits a particular dimension reduction structure under which we can identify the two parameter vectors of interest using only one basis vector of the CS, rather than identifying the entire space. This is very different from all existing dimension reduction methods in the literature, because for bi-index models one usually has to determine two basis vectors to identify the CS. This identification plays a key role in our variable selection procedure.
We conduct Monte Carlo simulation experiments to examine the performance of the proposed procedures with moderate sample sizes, comparing implementations based on two popular dimension reduction procedures, SIR and CUME. Our simulation results support our theoretical findings. The paper is organized as follows. In Section 2, we present the model and the basic framework. In Section 3, we describe the rationale of the proposed method and present the asymptotic results, including the selection and estimation consistency and the asymptotic distributions of the estimators. Simulation studies are reported in Section 4. In Section 5, we illustrate the proposed method with a real data set. All technical proofs of the asymptotic results are deferred to the Appendix.

Models and dimension reduction framework
Let Y be the response variable and (X^τ, Z^τ)^τ ∈ R^p × R^q the vector of covariates, whose relationship with Y under the PLSIM can be described as

Y = β_0^τ X + g(θ_0^τ Z) + ε,   (2.1)

where (β_0, θ_0) is an unknown parameter vector in R^p × R^q equipped with the Euclidean norm ‖·‖_2, ε is the error with mean zero and finite variance, and g(·) is an unknown univariate link function. For the sake of identifiability, we assume, without loss of generality, that θ_0 is a unit vector whose first component is positive; that is, the parameter space of θ_0 is Θ = {θ = (θ_1, θ_2, …, θ_q)^τ ∈ R^q : ‖θ‖_2 = 1, θ_1 > 0}. PLSIMs contain two important special cases. When q = 1, model (2.1) reduces to a partially linear model (PLM), for which there is much work in the literature, for example Chen (1988), Engle et al. (1986), Heckman (1986) and Speckman (1988); Härdle et al. (2000) gave a comprehensive review of PLM. When β_0 = 0, model (2.1) reduces to the single-index model, for which Ichimura (1993) proposed a semiparametric least squares estimator, Härdle et al. (1993) investigated the asymptotic normality of a kernel smoother based estimator, Naik and Tsai (2001) investigated model selection, and Wang and Yang (2009) proposed a regression spline based estimation method.
Write T = (X^τ, Z^τ)^τ ∈ R^{p_n+q_n}. The dimensions of β_0 and θ_0, say p_n and q_n respectively, may diverge with the sample size n. Note that model (2.1) can be broadly formulated as a sufficient dimension reduction (SDR) model (Zhu and Zhu, 2009a),

Y ⊥⊥ T | S^τ T,

where ⊥⊥ indicates independence and S is a (p_n + q_n) × 2 parameter matrix built from β_0 and θ_0; that is, conditional on S^τ T, Y and T are independent. β_0 and θ_0 can be estimated with the help of the SDR principle, whose major purpose is to seek the minimal subspace, the CS, spanned by the columns of S. An SDR method therefore does not provide estimators of β_0 and θ_0; in general it provides only two basis vectors of the subspace, which cannot distinguish the covariates of the nonparametric and parametric components. Nevertheless, the two directions β_0/‖β_0‖_2 and θ_0 are identifiable in our setting, because the central subspace is two-dimensional and generated by S. More specifically, any vector in the central subspace has the form (c_1 β_0^τ, c_2 θ_0^τ)^τ, so the subvector consisting of the first p_n components is related only to β_0, while the subvector consisting of the remaining q_n components is related only to θ_0. Consequently, when we use an SDR method to identify the central subspace, a single vector with nonzero components in both parts identifies β_0/‖β_0‖_2 and θ_0 simultaneously. Moreover, such a subspace exists uniquely and contains all the regression information of Y | T under mild conditions (Cook, 1996a,b). Hence we proceed to identify β_0/‖β_0‖_2 and θ_0 as follows.
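For readers who prefer to see the block structure explicitly, the following display restates the argument above in matrix form; it introduces no new assumptions, and ϕ = (ϕ_1, ϕ_2)^τ is the coefficient vector that appears in the next paragraph.

\[
S=\begin{pmatrix}\beta_0 & 0\\ 0 & \theta_0\end{pmatrix}\in\mathbb{R}^{(p_n+q_n)\times 2},
\qquad
v_0=S\varphi=\begin{pmatrix}\varphi_1\beta_0\\ \varphi_2\theta_0\end{pmatrix},
\qquad
\frac{\beta_0}{\|\beta_0\|_2}=\pm\frac{v_{0,I}}{\|v_{0,I}\|_2},
\quad
\theta_0=\pm\frac{v_{0,II}}{\|v_{0,II}\|_2},
\]

where v_{0,I} and v_{0,II} denote the first p_n and the last q_n components of v_0, and the signs are fixed by the convention that the first component of θ_0 is positive.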
As shown by Li (2007), most of the commonly used SDR methods can be formulated as the eigen-decomposition problem

M b = λ Σ b,

where Σ is the covariance matrix of T, λ is an eigenvalue with associated eigenvector b, and M is a nonnegative definite, method-specific symmetric kernel matrix; see Li (2007) for the choices of M for various SDR methods. Let λ_max and u_0 be the largest eigenvalue and the associated eigenvector of Σ^{-1/2} M Σ^{-1/2}. If λ_max is nonzero, then v_0 := Σ^{-1/2} u_0 ∈ span(S) under method-specific conditions on the marginal distribution of T, such as the linearity condition (Li, 1991). This implies that there exists a vector ϕ = (ϕ_1, ϕ_2)^τ, with ϕ_1 and ϕ_2 nonzero, such that v_0 = Sϕ; that is, the first p_n elements of v_0 are proportional to β_0, and the last q_n elements to θ_0. Once v_0 is obtained, estimates of the directions β_0/‖β_0‖_2 and θ_0 follow. In Appendix A.3, we discuss an identifiability assumption under which ϕ_1 and ϕ_2 are both nonzero. Hence, the first eigenvector obtained by a dimension reduction method identifies the two directions β_0/‖β_0‖_2 and θ_0. Furthermore, selecting the significant components of X is equivalent to identifying the nonzero elements of β_0/‖β_0‖_2. Thus, we obtain a variable selection procedure, and we estimate β_0 by estimating the scalar ‖β_0‖_2 and the direction β_0/‖β_0‖_2.
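As a concrete numerical illustration of this identification step (a minimal sketch, not part of the paper's algorithm), the code below computes the leading eigenvector of Σ^{-1/2} M Σ^{-1/2} for a generic kernel matrix M, maps it back to v_0 = Σ^{-1/2} u_0, and splits it into the parts aligned with β_0 and θ_0. The function names and the sign handling are ours; the scale of the β-part is deliberately left to the later estimation of ‖β_0‖_2.

```python
import numpy as np

def inv_sqrt(Sigma):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(Sigma)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def leading_direction(M, Sigma, p):
    """Leading eigenvector of Sigma^{-1/2} M Sigma^{-1/2}, mapped back to
    v0 = Sigma^{-1/2} u0 and split into the parts aligned with beta_0 and theta_0."""
    S_inv_half = inv_sqrt(Sigma)
    G = S_inv_half @ M @ S_inv_half          # method-specific kernel on the u-scale
    _, eigvecs = np.linalg.eigh(G)           # eigenvalues returned in ascending order
    u0 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
    v0 = S_inv_half @ u0
    if v0[p] < 0:                            # fix the overall sign so the first Z-coefficient
        v0 = -v0                             # is positive, matching the convention theta_1 > 0
    v_I, v_II = v0[:p], v0[p:]
    beta_dir = v_I / np.linalg.norm(v_I)     # proportional to beta_0 (sign and scale via kappa later)
    theta_dir = v_II / np.linalg.norm(v_II)  # estimates theta_0
    return beta_dir, theta_dir
```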
Note that all the SDR methods mentioned above involve all of the original covariates T. As a consequence, if p_n → ∞ and q_n → ∞, the estimated linear combination v_0^τ T may be difficult to interpret and the significant covariates may be hard to identify, because all insignificant covariates are also included in the estimated linear combination. To overcome this difficulty, we adopt the idea of CISE and penalize v_0 to obtain a sparse estimator of v_0, as follows.
Let {(X_i, Z_i, Y_i); 1 ≤ i ≤ n} be a sequence of independent and identically distributed (i.i.d.) samples from model (2.1). Denote by T̄ and Σ_n the sample mean and the sample covariance matrix based on (T_1, …, T_n), where T_i is defined in the same way as T. Let ũ_n be the minimizer of the penalized objective

Q_n(u; G_n, Σ_n) = −u^τ G_n u + ρ_n Σ_{r=1}^{p_n+q_n} α_r |v_r|,  subject to u^τ u = 1,   (2.3)

where G_n = Σ_n^{-1/2} M_n Σ_n^{-1/2}, M_n is the sample version of a method-specific kernel matrix, ρ_n is a scaling factor for the penalty, v_r is the rth element of Σ_n^{-1/2} u, and {α_r ≥ 0, r = 1, …, p_n + q_n} are the penalty parameters. Any dimension reduction based kernel matrix can be applied, such as that of SIR or SAVE; in this paper, we choose M_n to be the sample version of the CUME kernel matrix (Zhu, Zhu and Feng, 2010). The estimator of v_0 is then defined as ṽ_n = Σ_n^{-1/2} ũ_n. We choose this M_n because it is easy to implement and avoids selecting additional tuning parameters, such as the number of slices in SIR, SAVE and DR. A theoretical justification of CUME has been provided by Zhu, Zhu and Feng (2010), even when the dimensions p_n and q_n diverge with the sample size.
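Because the display giving the CUME kernel is garbled in our copy, the sketch below implements what we take to be its sample version, M_n = n^{-1} Σ_i m_n(Y_i) m_n(Y_i)^τ with m_n(y) = n^{-1} Σ_j (T_j − T̄) I(Y_j ≤ y), following our reading of Zhu, Zhu and Feng (2010); this is an assumption about the formula, not a transcription of (2.3).

```python
import numpy as np

def cume_kernel(T, Y):
    """Sample CUME kernel matrix: average over i of the outer products of
    m_n(Y_i) = mean_j (T_j - Tbar) * 1{Y_j <= Y_i}."""
    n = len(Y)
    Tc = T - T.mean(axis=0)                         # centered covariates, n x (p+q)
    ind = (Y[None, :] <= Y[:, None]).astype(float)  # ind[i, j] = 1{Y_j <= Y_i}
    Mvec = ind @ Tc / n                             # row i is m_n(Y_i)
    return Mvec.T @ Mvec / n                        # (p+q) x (p+q) kernel matrix
```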

Estimation and main results
3.1. Estimation Procedure for β_0 and θ_0

We formulate the estimation procedure in the following steps.

In Step 1, one may consider other SDR methods, such as SIR, SAVE or DR; see Zhu, Zhu and Feng (2010) for a discussion of the advantages and disadvantages of these SDR methods. In Step 3, one can estimate the parameter κ with commonly used partially linear model techniques, such as kernel methods (Liang et al., 1999, 2004; Speckman, 1988) or spline methods (Chen, 1988; Cuzick, 1992; Wahba, 1984). It is remarkable that the proposed procedure needs neither iteration nor an initial value. In contrast, the spline method (Yu and Ruppert, 2002) and MAVE (Xia and Härdle, 2006) require delicately chosen initial values or iteration. Thus the proposed method is particularly computationally efficient compared with its competitors, and the gain is substantial when p_n and q_n diverge. The proposed procedure still has appealing asymptotic properties (see Sections 3.2-3.4 for details). Moreover, our numerical studies show the good performance of the method.
It is noteworthy that, if we study only estimation for model (2.1), we can still use the dimension reduction principle to obtain the estimator ṽ_n = (ṽ_{n,I}^τ, ṽ_{n,II}^τ)^τ and then apply Steps 2 and 3 to obtain the estimators of β_0 and θ_0, which are consistent and asymptotically normal under mild conditions. This estimation method is of independent interest in the dimension reduction area, and it provides an alternative to MAVE (Xia and Härdle, 2006) and profile likelihood based methods (Liang et al., 2010), which require iteration for implementation.
With slight abuse of notation, we redefine ṽ_n in the Algorithm as ṽ_n = (ṽ_{n(0)}^τ, ṽ_{n(1)}^τ)^τ, where ṽ_{n(0)} = (ṽ_{n(0),I}^τ, ṽ_{n(0),II}^τ)^τ, and ṽ_{n(0),I} and ṽ_{n(0),II} are the nonzero components of ṽ_{n,I} and ṽ_{n,II}, respectively. Let X_I and Z_I be the subvectors of X and Z corresponding to ṽ_{n(0),I} and ṽ_{n(0),II}, and let p_0 and q_0 be the lengths of ṽ_{n(0),I} and ṽ_{n(0),II}, respectively; p_0 and q_0 are therefore data-dependent quantities rather than fixed constants. Analogously, define T_I = (X_I^τ, Z_I^τ)^τ and T_{iI} = (X_{iI}^τ, Z_{iI}^τ)^τ for i = 1, …, n. In what follows, we write A^{⊗2} = AA^τ for any matrix or vector A; λ_min(A) and λ_max(A) stand for the smallest and largest eigenvalues of a square matrix A; and for any integer s, 0_s and I_s denote the zero and identity matrices of size s.

Asymptotic property of ṽ_n

We first present the asymptotic results for the eigenvector ṽ_n.
Theorem 3.1. Let d_n = max{p_n, q_n}. Assume that Conditions (A1) and (A2) in the Appendix are satisfied, that √n max_{r ≤ p_0+q_0} α_r → 0, and that d_n^3/n → 0. Then the estimator ṽ_n satisfies ‖ṽ_n − v_0‖_2^2 = O_P(d_n^2/n).

Remark 1. This theorem indicates that, by properly choosing the penalty parameters {α_r}_{r=1}^{p_n+q_n}, the estimator remains consistent when p_n and q_n diverge at a rate of o(n^{1/3}), which is the same rate as in variable selection for parametric (Fan and Peng, 2004) and semiparametric regressions (Zhu and Zhu, 2009a). Furthermore, if p_n and q_n are fixed, ṽ_n is a root-n consistent estimator; this conclusion coincides with Theorem 1 of Chen et al. (2010).
To investigate the oracle property of the estimator ṽ_n, we define the following quantities. Replacing T_i with T_{iI}, we define Σ_{nI}, T̄_I, G_{nI}, and then Q_{nI}(u; G_{nI}, Σ_{nI}), ρ_{nI}, M_{nI} and m_{nI}(Y_i), in the same way as the corresponding quantities for (2.3). Write v̂_n^I = Σ_{nI}^{-1/2} û_n^I, where û_n^I is the minimizer û_n^I = arg min_{u ∈ R^{(p_0+q_0)×1}} Q_{nI}(u; G_{nI}, Σ_{nI}) subject to u^τ u = 1.
Theorem 3.2(i) indicates that the estimator ṽ_n consistently selects the relevant covariates; that is, with probability approaching 1, the estimators of all zero elements of v_0 equal zero. Theorem 3.2(ii) differs from Theorem 2(ii) in Chen et al. (2010), in which the oracle property of the CISE procedure was established under the assumption that the number of relevant covariates is a fixed (though unknown) constant, whereas here the numbers of selected components, p_0 and q_0, are themselves estimated from the data. Accordingly, ṽ_{n(0)} and v̂_n^I are two estimators built on the variable selection procedure. As a result, we can further use ṽ_{n(0)} to estimate β_0 and θ_0, as required in Steps 2 and 3 of Section 3.1.
We further consider the asymptotic distribution of ṽ_{n(0)}, which is generally ignored in the dimension reduction literature. Because Σ_{(0)} is positive definite, it has an orthogonal decomposition Σ_{(0)} = P_{(0)} Λ_{(0)} P_{(0)}^τ, where Λ_{(0)} = diag(λ_1, …, λ_{p_0+q_0}) > 0 and the columns of P_{(0)} are the eigenvectors corresponding to Λ_{(0)}. Let B be a square matrix of size p_0 + q_0 whose (s, t)th element is determined by the eigenvalues λ_s and λ_t, and let ⊙ denote the Hadamard product operator. We now present the asymptotic distribution of ṽ_{n(0)}.
Theorem 3.3. Under the conditions of Theorem 3.2, the estimator ṽ_{n(0)} is asymptotically normally distributed with covariance matrix Ω_0, in which I(D) denotes the indicator function of the set D.

Asymptotic distributions of the estimators of κ and β_10
We first state the estimation procedure for κ and its asymptotic distribution.
The estimation of κ involves the conditional expectations E(Y | Λ) and E(X_I | Λ), where Λ denotes the estimated single index obtained from the Z-part; their local linear estimators are denoted by Ê(Y | Λ) and Ê(X_I | Λ), with kernel function K(·) satisfying the conditions in the Appendix and h being a bandwidth.
The estimator κ̂ can then be obtained through local linear smoothing in Step 3.

Theorem 3.5. In addition to the conditions of Theorem 3.2, assume that Conditions (A3)-(A5) are satisfied. Then the estimator κ̂ is asymptotically normal with variance given in (3.4).

Finally, we define an estimator of β_10 as β̂_10 = κ̂ × ṽ_{n(0),I} = κ̂ × (π_2 ṽ_{n(0)}), and present its asymptotic distribution in the following theorem.
Theorem 3.6. Under the conditions of Theorem 3.5, the estimator β̂_10 is asymptotically normal.
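The displays defining κ̂ and its variance did not survive extraction, so the sketch below shows one standard Speckman/Robinson-type way to carry out Step 3 that is consistent with the surrounding description (local linear estimates given the estimated index, followed by least squares). The function names and the exact form of κ̂ are our assumptions, not the authors' (3.3)-(3.4).

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate of E(y | x = x0) with the Epanechnikov kernel."""
    w = 0.75 * np.maximum(1.0 - ((x - x0) / h) ** 2, 0.0)
    X = np.column_stack([np.ones_like(x), x - x0])
    coef = np.linalg.pinv(X.T @ (w[:, None] * X)) @ (X.T @ (w * y))
    return coef[0]

def estimate_kappa_beta(Y, X_sel, Z_sel, v_I, theta_hat, h):
    """Speckman-type estimate of kappa in beta_10 = kappa * v_{n(0),I}."""
    Lam = Z_sel @ theta_hat                      # estimated single index
    W = X_sel @ v_I                              # scalar index for the selected linear part
    EY = np.array([local_linear(l, Lam, Y, h) for l in Lam])
    EW = np.array([local_linear(l, Lam, W, h) for l in Lam])
    rY, rW = Y - EY, W - EW                      # partial out the single-index component
    kappa = float(rW @ rY / (rW @ rW))
    return kappa, kappa * v_I                    # (kappa_hat, beta_10_hat)
```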

Simulation studies
In this section, we report simulation results to evaluate the performance of the proposed procedure. Two dimension reduction methods, SIR and CUME, were adopted for comparison, with the number of slices for SIR set to 5. The experiments were repeated 500 times, each consisting of n = 150 samples generated from two models, (4.1) and (4.2). The first model has a monotone link function for the single-index part, whereas the link function of the second is of high frequency. The dimensions (p_n, q_n) of X and Z were set to (10, 10), (30, 20), (20, 30), (50, 30) and (30, 50). The following three cases were considered:
• Case 1. (X^τ, Z^τ)^τ follows the normal distribution N(0_{(p_n+q_n)×1}, I_{p_n+q_n}), and ε follows N(0, 0.1^2);
• Case 2. (X^τ, Z^τ)^τ follows the normal distribution N(0_{(p_n+q_n)×1}, Σ) with Σ = (σ_ij), σ_ij = 0.5^{|i−j|}, and ε is the same as in Case 1;
• Case 3. (X^τ, Z^τ)^τ is the same as in Case 2, while ε is generated from N(0, 0.1^2 × (|X_1| + |Z_1|)) and is therefore correlated with (X^τ, Z^τ)^τ; here X_1 and Z_1 are the first elements of X and Z, respectively.
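The displays for models (4.1) and (4.2) are garbled in our copy, so the generator below uses a placeholder link function purely to illustrate the covariate and error settings of Cases 1-3; the link passed in (here np.sin) is not the authors' (4.1) or (4.2), and the function name is ours.

```python
import numpy as np

def simulate(n, p, q, case, link, rng):
    """Draw one sample of size n under Cases 1-3; `link` is a placeholder g()."""
    beta0 = np.r_[3.0, 1.5, 0.5, np.zeros(p - 3)]
    theta0 = np.r_[1 / np.sqrt(2), 1 / np.sqrt(2), np.zeros(q - 2)]
    d = p + q
    if case == 1:
        Sigma = np.eye(d)
    else:                                          # Cases 2-3: correlation 0.5^|i-j|
        idx = np.arange(d)
        Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    T = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    X, Z = T[:, :p], T[:, p:]
    sd = 0.1 if case in (1, 2) else 0.1 * np.sqrt(np.abs(X[:, 0]) + np.abs(Z[:, 0]))
    eps = rng.normal(0.0, 1.0, n) * sd             # Case 3: heteroscedastic error
    Y = X @ beta0 + link(Z @ theta0) + eps
    return X, Z, Y

rng = np.random.default_rng(0)
X, Z, Y = simulate(150, 10, 10, case=1, link=np.sin, rng=rng)  # np.sin is only a placeholder
```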
To estimate the parameter κ in Step 3, we used the local linear smoother described in Section 3.3 to obtain the nonparametric estimators Ê(Y|Λ) and Ê(X|Λ), with the Epanechnikov kernel K(t) = (3/4)(1 − t^2)I(|t| ≤ 1). The bandwidth h was selected by cross-validation (Fan and Gijbels, 1996, page 149). Following Chen et al. (2010), the penalty parameters were built from an unpenalized initial estimator: let α_r = α_0 |[Σ_n^{-1/2} û_n]_r|^{−ϖ}, where û_n = arg min_u (−u^τ G_n u) subject to u^τ u = 1 and G_n was defined in (2.3); that is, û_n is the first eigenvector of G_n corresponding to its largest nonzero eigenvalue, and [Σ_n^{-1/2} û_n]_r is the rth element of Σ_n^{-1/2} û_n. The positive tuning parameters α_0 and ϖ were selected by minimizing the following BIC-type criterion (Chen et al., 2010):

BIC(α_0, ϖ) = −ṽ_{n(α_0,ϖ)}^τ M_n ṽ_{n(α_0,ϖ)} + N_{(α_0,ϖ)} log n / n,

where ṽ_{n(α_0,ϖ)} denotes the estimator of v_0 obtained from (2.3) for a given pair (α_0, ϖ), N_{(α_0,ϖ)} is the number of nonzero elements of ṽ_{n(α_0,ϖ)}, log n/n is the BIC-type factor, and M_n is the sample version of either the CUME kernel matrix defined in (2.3) or the SIR kernel matrix Cov[E{T − E(T) | Y}], depending on which of the two methods is applied. This minimization can easily be carried out by a two-dimensional grid search. To simplify the minimization, Chen et al. (2010) fixed ϖ = 0.5 in their simulations, but our numerical experience suggests that the data-driven strategy performs better with only a slight increase in computational burden.
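A schematic of the two-dimensional grid search just described is given below; `fit_sparse_direction` is a hypothetical stand-in for solving the penalized problem (2.3) with weights built from (α_0, ϖ), and the criterion mirrors the BIC-type form stated above rather than a verified transcription of the authors' formula.

```python
import numpy as np

def select_tuning(M_n, Sigma_n, Y, alpha0_grid, varpi_grid, fit_sparse_direction):
    """Grid search for (alpha_0, varpi) minimizing the BIC-type criterion;
    `fit_sparse_direction` is a placeholder for solving (2.3)."""
    best, best_score = None, np.inf
    n = len(Y)
    for a0 in alpha0_grid:
        for w in varpi_grid:
            v = fit_sparse_direction(M_n, Sigma_n, a0, w)      # sparse estimate of v_0
            n_nonzero = np.sum(v != 0)
            score = -v @ M_n @ v + n_nonzero * np.log(n) / n   # fit term + complexity penalty
            if score < best_score:
                best, best_score = (a0, w, v), score
    return best
```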
To measure the selection and estimation accuracy, we define ω_{u,β_0}, ω_{c,β_0} and ω_{o,β_0} as the proportions of underfitted, correctly fitted and overfitted models. For overfitted models, the columns labeled "1", "2" and "≥ 3" give the proportions of models including one, two, and more than two insignificant covariates. Denote by Medse_{β_0} the median of the squared error ‖β̂_0 − β_0‖_2^2, and by C_{β_0} and IN_{β_0} the average number of zero coefficients correctly set to zero and the average number of nonzero coefficients incorrectly set to zero, respectively. The quantities ω_{u,θ_0}, ω_{c,θ_0}, ω_{o,θ_0}, Medse_{θ_0}, C_{θ_0} and IN_{θ_0} are defined in the same way. Tables 1-4 report these quantities under various configurations when the true parameters are β_0 = (3, 1.5, 0.5, 0, …, 0)^τ and θ_0 = (1/√2, 1/√2, 0, …, 0)^τ. Overall, the SIR- and CUME-based procedures successfully distinguish significant and insignificant covariates in all three cases: the values of C_{β_0} and IN_{β_0} are close to the true values (p_n − 3) and 0, and the values of C_{θ_0} and IN_{θ_0} are close to the true values (q_n − 2) and 0. For the linear components of X, the proportion of correctly fitted models (column ω_{c,β_0}) is above 70% in all three cases, even when the number of covariates increases to 50. The average proportions of correctly fitted models for the SIR- and CUME-based methods are 99.23% and 99.52%, respectively. The proportions of underfitted models (column ω_{u,β_0}) and overfitted models (columns under ω_{o,β_0}) are about 20% and 10%, respectively. In the overfitted case, the proportion of models including one insignificant covariate dominates those including two or more, the latter being nearly 0% in most situations. This indicates that our method most often selects a model very close to the true one. Compared with the SIR-based procedure, the CUME-based procedure performs better in terms of model complexity, with slightly higher proportions of correctly selected models in most situations, although it is also more prone to underfitting. A similar but better pattern can be observed for the single-index components: the proportions of correctly fitted models are all about 80%, and the values of ω_{u,θ_0}, ω_{c,θ_0} and ω_{o,θ_0} are respectively smaller, larger and smaller than the corresponding values of ω_{u,β_0}, ω_{c,β_0} and ω_{o,β_0}.

Table 1. Simulation results for model (4.1) when β_0 = (3, 1.5, 0.5, 0, …, 0)^τ: the performance of β̂ (the true C_{β_0} value equals p_n − 3).
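The selection summaries just defined can be computed as in the sketch below; this reflects our reading of the definitions (underfitted = at least one true signal missed; overfitted = all signals kept plus some noise covariates), not the authors' code.

```python
import numpy as np

def selection_summary(estimates, truth):
    """Summarize repeated fits: proportions of under-/correctly/over-fitted models,
    average counts C and IN, and the median squared error (Medse)."""
    true_zero = (truth == 0)
    n_under = n_correct = n_over = 0
    C, IN, sqerr = [], [], []
    for est in estimates:                           # one estimated coefficient vector per replication
        est_zero = (est == 0)
        C.append(np.sum(est_zero & true_zero))      # zeros correctly set to zero
        IN.append(np.sum(est_zero & ~true_zero))    # nonzeros incorrectly set to zero
        sqerr.append(np.sum((est - truth) ** 2))
        if IN[-1] > 0:
            n_under += 1                            # misses at least one true signal
        elif C[-1] == np.sum(true_zero):
            n_correct += 1                          # exactly the true model
        else:
            n_over += 1                             # keeps some noise covariates
    m = len(estimates)
    return {"omega_u": n_under / m, "omega_c": n_correct / m, "omega_o": n_over / m,
            "C": np.mean(C), "IN": np.mean(IN), "Medse": np.median(sqerr)}
```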
It is worth mentioning that a smaller nonzero coefficient in β_0 increases the chance of choosing an underfitted model. This is common in variable selection procedures, because smaller parameters are harder to detect and are more easily penalized to zero. To confirm this observation, we increased the third element of β_0 to 1.5 while keeping θ_0 the same as before. We ran additional simulations and report the results for the linear components when (p_n, q_n) = (30, 50), (50, 30) in Tables 5 and 6. They indicate that, for the linear components β_0, the proportions of underfitted and overfitted models decrease, while the proportion of correctly fitted models increases and the estimation accuracy for β_0 also improves.

Real Data Analysis
Now we illustrate the proposed method by analyzing a real data set from an economic growth study. The data include 59 potential covariates describing economic, political, social, and geographical characteristics of the countries from 1960. Sala-i-Martin (1997) analyzed the data using a linear regression model and found that 22 of the 59 variables appear to be "significant". As a consequence, he had to fit 30,856 regressions per variable, or a total of nearly 2 million regressions, which poses a computational challenge. Another concern is whether linear regression is appropriate, since other investigators have found some nonlinear structure between the covariates and the response (economic growth, gamma). For illustration, we used model (2.1) and the proposed procedure to examine the relationship between the response variable and 17 continuous covariates, which are listed in Table 7. We first fitted the response Y against each covariate with local linear smoothing to obtain a 95% pointwise confidence band, and also fitted a linear regression. If the fitted straight line was contained in the confidence band, we classified that covariate as a linear component, and as a single-index component otherwise. As a result, we treated "h60", "abslatit", "urb60" and "lforce60" as single-index components. We then applied our procedure to estimate and select the nonzero elements of (β_0, θ_0). The final estimates of (β_0, θ_0) and the standard errors based on 1000 bootstrap resamples are reported in Table 7. They show that both the SIR-based and CUME-based procedures select the variables X_1, X_2, X_4 and X_6, and the SIR-based procedure selects two additional variables, X_11 and X_12. Both procedures identify two single-index variables, Z_2 and Z_4. We estimated the nonparametric function g(·) using the estimated values (β̂_{0,SIR}, θ̂_{0,SIR}) and (β̂_{0,CUME}, θ̂_{0,CUME}); the estimated curves of g(·), shown in Figure 1, display a similar pattern but differ in magnitude.

Table 2. Simulation results for model (4.1) when β_0 = (3, 1.5, 0.5, 0, …, 0)^τ: the performance of θ̂ (the true C_{θ_0} value equals q_n − 2).

Table 3. Simulation results for model (4.2) when β_0 = (3, 1.5, 0.5, 0, …, 0)^τ: the performance of β̂ (the true C_{β_0} value equals p_n − 3).
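The screening step just described can be sketched as follows, assuming an Epanechnikov kernel and a crude plug-in variance for the pointwise band; the band construction and bandwidth choice are our assumptions and may differ from the authors' implementation.

```python
import numpy as np

def loclin(x0, x, y, h):
    """Local linear fit at x0 with Epanechnikov weights; returns (estimate, hat weights)."""
    k = 0.75 * np.maximum(1 - ((x - x0) / h) ** 2, 0.0)
    X = np.column_stack([np.ones_like(x), x - x0])
    A = np.linalg.pinv(X.T @ (k[:, None] * X)) @ (X.T * k)
    return A[0] @ y, A[0]

def looks_linear(x, y, h, grid_size=50, z=1.96):
    """Classify a covariate as 'linear' if the OLS line stays inside an
    approximate 95% pointwise band around the local linear fit."""
    grid = np.linspace(np.min(x), np.max(x), grid_size)
    fits, weights = zip(*(loclin(g, x, y, h) for g in grid))
    fits = np.array(fits)
    resid = y - np.array([loclin(xi, x, y, h)[0] for xi in x])
    sigma2 = np.mean(resid ** 2)                    # crude residual variance estimate
    se = np.sqrt(sigma2 * np.array([np.sum(w ** 2) for w in weights]))
    b1, b0 = np.polyfit(x, y, 1)                    # simple linear fit
    line = b0 + b1 * grid
    return np.all((line >= fits - z * se) & (line <= fits + z * se))
```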
As a referee suggested, we also applied the additive model with the adaptive COSSO method proposed by Lin and Zhang (2006) to select significant components. The selected covariates are listed in Table 7, from which we can see that the 5-fold CV adaptive COSSO tends to select more covariates than our two dimension-reduction based methods. All the unimportant covariates identified by the adaptive COSSO are also identified as unimportant by the CUME-based dimension reduction method. Moreover, the leave-one-out prediction error of the adaptive COSSO procedure is 5.3597 × 10^{-4}, while the leave-one-out prediction errors of the dimension-reduction based methods are 2.0303 × 10^{-4} with SIR and 1.7052 × 10^{-4} with CUME. Consequently, the CUME-based dimension reduction procedure has the smallest prediction error.

Table 4. Simulation results for model (4.2) when β_0 = (3, 1.5, 0.5, 0, …, 0)^τ: the performance of θ̂ (the true C_{θ_0} value equals q_n − 2).

Discussion
We have proposed a dimension reduction based procedure for estimation and variable selection in PLSIM when the dimensions of the covariates diverge with the sample size. The procedure naturally inherits the advantages of sufficient dimension reduction and PLM, and avoids the computational complexity and limitations of existing estimation and variable selection methods for PLSIM. However, the corresponding theory is established under the assumption d_n^3/n → 0; the difficulty mainly comes from estimating the covariance matrix. Like most dimension reduction methods, our method is limited to continuous covariates, and further investigation of discrete covariates would be of great value.

Table 5. Simulation results for model (4.1) when β_0 = (3, 1.5, 1.5, 0, …, 0)^τ: the performance of β̂.
We thank a referee for raising the question of which covariates should enter the linear part and which the single-index part. There is little literature on distinguishing linear from nonlinear components, even for additive regression models (Zhang et al., 2011), and whether their procedure works for PLSIM requires additional effort and is beyond the scope of this paper. Currently, we use the following guideline: the effects of all continuous covariates are put in the single-index part and those of the discrete covariates in the linear part. If the estimation results suggest that some continuous covariate effects can be relocated to the linear part, a new model is fitted with those effects moved to the linear part, and the procedure is iterated several times if needed.

Appendix
In this Appendix, we state the assumptions and give the proofs of the main results.

A.1. Assumptions
The following are the regularity conditions for our asymptotic results.
(A1) sup_{1≤i≤p_n} E X_i^4 < C_0 and sup_{1≤j≤q_n} E Z_j^4 < C_1 for some constants C_0 > 0 and C_1 > 0.
(A2) Σ = Cov(T) is positive definite, and all of its eigenvalues are bounded between two positive constants c and C for all p_n and q_n.
(A3) The function E(X_0 | θ_{10}^τ Z_0 = θ_{10}^τ z_0) and the density function f_{θ_{10}^τ Z_0}(θ_{10}^τ z_0) of the random variable θ_{10}^τ Z_0 are both three times continuously differentiable with respect to z_0. The third derivatives are uniformly Lipschitz continuous on T_{θ_{10}} = {θ^τ z_0 : θ ∈ Θ, z_0 ∈ Z_0 ⊂ R^{p_0}}, where Z_0 is a compact support set. Furthermore, the density function f_{θ_{10}^τ Z_0}(θ_{10}^τ z) is bounded away from 0 on T_{θ_{10}}.
(A4) The kernel function K(·) is a bounded, continuous and symmetric probability density function satisfying ∫_{−∞}^{∞} |u|^j K(u) du < ∞ for j = 1, 2, and ∫_{−∞}^{∞} u K(u) du = 0. Moreover, K(·) satisfies a Lipschitz condition on R^1.
(A5) The bandwidth h satisfies h → 0, log^2 n/(nh^2) → 0, and nh^3 → ∞.
Condition (A1) is a technical moment condition on X and Z in the context of diverging numbers of parameters; see Zhu and Zhu (2009a) and Zhu et al. (2006) for more detailed discussions. Condition (A2) is imposed on the covariance matrix of (X^τ, Z^τ)^τ to avoid ill-conditioning of the estimator Σ_n^{-1/2}. Because Σ_n^{-1/2} is needed in the CISE estimation procedure, the full-rank condition on Σ guarantees that, even when the dimensions of β_0 and θ_0 diverge, the estimator Σ_n is invertible and Σ_n^{-1/2} is of full rank once n is large enough; see Remark 2 of Zhu et al. (2006) for more details. Condition (A3) ensures that the density function f_{θ_{10}^τ Z_0}(·) is positive, which implies that the denominators involved in the nonparametric estimators are bounded away from 0; the three-times continuous differentiability of E(X_0 | θ_{10}^τ Z_0 = θ_{10}^τ z_0) and f_{θ_{10}^τ Z_0}(θ_{10}^τ z_0) is a standard smoothness condition in nonparametric estimation. Condition (A4) is a standard assumption in the nonparametric regression literature; the Gaussian and quadratic kernels satisfy it. Condition (A5) indicates that the "optimal" bandwidth can be used routinely.

A.2. Notations and Definitions
As Chen et al. (2010) mentioned, the CISE procedure hinges operationally on Grassmann manifold optimization. In order to prove Theorems 3.1 and 3.2, we introduce some notation and definitions for ease of exposition.
Define the Stiefel manifold St(p, d) = {η ∈ R^{p×d} : η^τ η = I_d}. Let ⌊η⌋ be the subspace spanned by the columns of η; then ⌊η⌋ ∈ Gr(p, d), where Gr(p, d) stands for the Grassmann manifold, and let L(·) denote the projection onto the Stiefel manifold. Next, we define a neighborhood of ⌊η⌋. For any matrix ω ∈ R^{p×d} and δ ∈ R, a perturbed point around η in the Stiefel manifold can be expressed as L(η + δω), and the corresponding perturbed point around ⌊η⌋ in the Grassmann manifold as ⌊L(η + δω)⌋. According to Lemma 8 of Manton (2002), ω can be uniquely decomposed as ω = ηJ + η_c K + ηD, where J ∈ R^{d×d} is a skew-symmetric matrix, K ∈ R^{(p−d)×d} is an arbitrary matrix, D ∈ R^{d×d} is a symmetric matrix, and η_c denotes a basis matrix of the orthogonal complement of ⌊η⌋. As Chen et al. (2010) showed, for sufficiently small δ, ⌊L(η + δω)⌋ = ⌊L(η + δη_c K)⌋, which indicates that movement from ⌊η⌋ within a small neighborhood depends only on η_c K. In other words, it suffices to consider perturbed points of the form L(η + δϑ) in the following proofs, where ϑ = ηJ + η_c K and tr(K^τ K) = C for some given constant C, with tr(·) the trace operator.

A.3. Identifiability
In this section, we provide an identifiability condition guaranteeing that ϕ_1 and ϕ_2 are both nonzero. To this end, let Y^0 be an independent copy of Y, and let E_{X,Z,Y} and E_{Y^0} stand for the expectation over (X, Z, Y) and over Y^0, respectively. Let Σ_{X,Z} = Cov(X, Z), and let Σ_X and Σ_Z be the covariance matrices of X and Z, respectively. Then we have the following proposition.
Remark 3. Condition (A.1) actually depicts the correlation relationship between β_0^τ X I(Y ≤ Y^0) and θ_0^τ Z I(Y ≤ Y^0). The proposition means that if we want the eigenvector corresponding to the largest eigenvalue λ_max to recover the two directions β_0 and θ_0, then β_0^τ X I(Y ≤ Y^0) and θ_0^τ Z I(Y ≤ Y^0) cannot be linearly related to each other. This requirement is reasonable.
Proof of Proposition A.1. Note that the CUME-based kernel matrix (Zhu, Zhu and Feng, 2010) can be expressed accordingly, and we have equations (A.2) and (A.3). Note that v_0^τ Σ v_0 = 1, so at least one of ϕ_1, ϕ_2 is nonzero. Without loss of generality, assume ϕ_1 ≠ 0 but ϕ_2 = 0. A direct simplification of (A.2) and (A.3) yields (A.4). Multiplying (A.4) from the left by β_0^τ and θ_0^τ, respectively, we obtain λ_max = β_0^τ M_X β_0 / (β_0^τ Σ_X β_0) and a further identity that contradicts condition (A.1). Hence ϕ_2 cannot be zero. In the same way, we can show that ϕ_1 cannot be zero either. This completes the proof.
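The expression for the CUME kernel referred to at the start of this proof is garbled in our copy; based on Zhu, Zhu and Feng (2010) and the expectations E_{X,Z,Y} and E_{Y^0} defined above, a plausible form (with unit weight) is

\[
M \;=\; E_{Y^0}\!\left[\, m(Y^0)\, m(Y^0)^{\tau} \right],
\qquad
m(y^0) \;=\; E_{X,Z,Y}\!\left[\,\{T - E(T)\}\, I(Y \le y^0)\,\right],
\]

in which case the matrices M_X and M_Z appearing in the proof would be the diagonal blocks of M corresponding to X and Z, respectively.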
A.4. Proof of Theorem 3.1

Step A.1. From the definition of ṽ_n in Section 3.1, we can derive that For any s × s symmetric matrix A and any s × 1 vector x, Note that Condition (A2) indicates λ_min(Σ) > 0. Then, by the Cauchy-Schwarz inequality and the equality ξ̂_n^τ ξ̂_n = 1, we have In the following, we bound λ_max((Σ_n^{1/2} − Σ^{1/2})^{⊗2}). For any symmetric matrix A and any positive semi-definite matrix B, we have the following inequality: Taking A = Σ_n^{1/2} − Σ^{1/2} and B = (Σ_n^{1/2} + Σ^{1/2})^{⊗2}, we then have Note that λ_max((Σ_n − Σ)^{⊗2}) ≤ ‖Σ_n − Σ‖_F^2. We know that all the elements of Σ_n − Σ are of order O_P(n^{−1/2}). It follows that ‖Σ_n − Σ‖_F^2 = O_P(d_n^2/n). Observe that Then we have

Step A.2. In this step, we show that ‖ξ̂_n − ξ*_0‖_2^2 = O_P(d_n^2/n), and then we finally obtain the result ‖ṽ_n − v_0‖_2^2 = O_P(d_n^2/n) together with the result of (A.6). It suffices to show that, for any given small ǫ > 0, there is a large constant C such that, for large enough n, Then we conclude that there exists a local minimizer ξ̂_n of Q_n(ϑ; G_n, Σ_n), with probability approaching one, such that ‖ξ̂_n − ξ*_0‖_2 = O_P(d_n/√n). Since ϑ ∈ T_{ξ*_0}(p_n + q_n, 1), we have ϑ = ξ*_{0,c} K with K ∈ R^{(p_n+q_n−1)×1}. Thus, as d_n^2/n → 0, applying Lemma 1 of Chen et al. (2010), we have We first deal with Υ_{1n}. Denote by 1_n the (p_n + q_n)-vector of ones.
A.5. Proof of Theorem 3.2

Step B.1. To prove the first conclusion, we first show that there exists an r > p_0 + q_0 such that ṽ_{n,r} = 0. Suppose that all ṽ_{n,r} are nonzero for r > p_0 + q_0. Then, according to the proof of Theorem 2 in Chen et al. (2010), ṽ_n satisfies equation (A.8), where ι_n = (α_1 sign(ṽ_{n,1}), α_2 sign(ṽ_{n,2}), …, α_{p_n+q_n} sign(ṽ_{n,p_n+q_n}))^τ. It follows from the expression of H̃_n that H̃_n is an idempotent matrix with rank p_n + q_n − 1, and Σ_n ṽ_n is the eigenvector of H̃_n corresponding to its eigenvalue 0. Let (l_1, …, l_{p_n+q_n−1}) be the eigenvectors of H̃_n corresponding to its eigenvalue 1. Thus, (Σ_n ṽ_n, l_1, …, l_{p_n+q_n−1}) is an independent basis of the space R^{p_n+q_n}. Using this basis, there exist two sequences of constants {a_r} and {a'_r} such that M_n ṽ_n = a_0 Σ_n ṽ_n + Σ_{r=1}^{p_n+q_n−1} a_r l_r and ι_n = a'_0 Σ_n ṽ_n + Σ_{r=1}^{p_n+q_n−1} a'_r l_r. Plugging these two expressions into (A.8), we obtain (A.9). From (A.9), we obtain that, for r > p_0 + q_0, where e_r is a (p_n + q_n)-vector with 1 in the rth position and 0 elsewhere. We have supposed that ṽ_{n,r} is nonzero for all r > p_0 + q_0, so [sign(ṽ_{n,r})]^2 = 1. Note that ṽ_n^τ Σ_n ṽ_n = 1. By the result of Theorem 3.1 and the definition of v_0 = (ϕ_1 β_0^τ, ϕ_2 θ_0^τ)^τ, we know that ‖ṽ_n − v_0‖_2^2 = O_P(d_n^2/n) and Σ_{r=p_0+q_0+1}^{p_n+q_n} |ṽ_{n,r}|^2 = O_P(d_n^2/n). It follows from (A.11) and (A.12), together with the condition on √n min_{r>p_0+q_0} α_r, that we arrive at a contradiction. This means that, as n → ∞, there is at least one r_0 > p_0 + q_0 such that ṽ_{n,r_0} = 0 with probability tending to 1. We now further confirm that ṽ_{n,r} = 0 for all r > p_0 + q_0 with probability tending to 1. This can be proved in a way similar to the proof of Theorem 2 in Chen et al. (2010); the details are omitted.
Similar to the proof of Theorem 3.1, we first prove that, for large n and arbitrarily small ǫ > 0, there exists a sufficiently large constant C such that P{ inf_{ϑ_I ∈ T_{û_n^I}(p_0+q_0, 1)} ⋯ }. Applying Lemma 1(ii) of Chen et al. (2010), we have
Next, we show that R_{n1} is of order o_P(1/√n). First, we deal with R_{n1}^{(1)} on the set {A_n = A_0}.
Thus, (A.18) entails that R_{n1}^{(1)B} = o_P(1/√n). Now, we consider the first term R_{n1}^{(1)A} on the set {A_n = A_0}, which can be expressed as follows.
Step C.2. We now decompose the denominator of κ̂ on the set {A_n = A_0} as follows.