Variable selection for partially linear models via learning gradients

Partially linear models (PLMs) are important generalizations of linear models and are very useful for analyzing high-dimensional data. Compared to linear models, PLMs possess the flexibility of nonparametric regression models because they have both linear and nonlinear components. Variable selection for PLMs plays an important role in practical applications and has been extensively studied for the linear component. For the nonlinear component, however, variable selection has been well developed only for PLMs with extra structural assumptions, such as additive PLMs and generalized additive PLMs. There is currently an unmet need for variable selection methods applicable to general PLMs without structural assumptions on the nonlinear component. In this paper, we propose a new variable selection method based on learning gradients for general PLMs without any assumption on the structure of the nonlinear component. The proposed method uses the reproducing kernel Hilbert space machinery to learn the gradients and the group-lasso penalty to select variables. In addition, a block coordinate descent algorithm is suggested and some theoretical properties are established, including selection consistency and estimation consistency. The performance of the proposed method is further evaluated via simulation studies and illustrated using real data.

∗Supported by NIH grants P30 CA016087 and P30 AG008051. The authors thank the editor, the associate editor and the referees for insightful comments and suggestions.


Introduction
The partially linear model (PLM) is an important generalization of the linear model [8]. Over the past decades, it has become a useful tool for parsimoniously modeling high-dimensional data while capturing the nonlinear trends of some continuous covariates [17,24,35,38]. It has been applied to data in many fields, such as econometrics [51], biomedicine [20,23,55], and environmetrics [34].
The PLM has the flexibility of a nonparametric regression model, while it contains a linear component whose estimators have desirable asymptotic properties with simple interpretability. These features make it a very useful model for analyzing high-dimensional data where variable selection plays an important role. Variable selection for the linear component of the PLM has been well studied [3,11,33,46]. However, existing variable selection methods for the non-linear component usually rely on some extra structural assumptions, e.g., additivity is a common assumption imposed on the structure of the non-linear component. Variable selection procedures have been developed for both additive partially linear models [27] and generalized additive partially linear models [45]. In addition, many other variable selection procedures proposed for additive models or generalized additive models can be also extended to conduct variable selection in additive partially linear models [18,19,26,37,47].
In this paper, we develop a novel variable selection procedure, based on the idea of gradient learning, for partially linear models without imposing any assumption on the structure of the nonparametric component. The method of learning gradients can be traced back to Mukherjee and Zhou [30]. Other developments include Mukherjee and Wu [29] and De Brabanter et al. [7], which mainly focus on estimating the gradient functions for regression or classification. Recently, the gradient learning technique has been employed for variable selection in nonparametric regression, for example by Ying et al. [53], Ye and Xie [52] and Yang et al. [49]. Although gradient learning procedures are model-free, the computational cost of variable selection in the nonlinear component is very high, so model-free gradient learning procedures have only been used for low-dimensional data. In this paper, by using partially linear models as extensions of linear models for high-dimensional data, with the added flexibility of nonparametric regression models for selected covariates, we make model-free gradient learning applicable to high-dimensional data.
In the literature, many nonparametric variable selection procedures [2,6,22,28,36] have been proposed by imposing structural assumptions. Lafferty and Wasserman [22] proposed a greedy method called "rodeo" for simultaneously performing local bandwidth selection and variable selection in nonparametric regression: it starts with a local linear estimator with large bandwidths, incrementally decreases the bandwidth of variables for which the gradient of the estimator with respect to the bandwidth is large, and then conducts a sequence of hypothesis tests. Bertin and Lecue [2] proposed an l1-penalized procedure for the nonparametric Gaussian regression model, but they only considered variable selection at a fixed point. Miller and Hall [28] proposed the "LABAVS" algorithm, along with several variable selection criteria including the local lasso, hard thresholding, and backward stepwise, but its computational cost is very high. Rosasco et al. [36] proposed a general nonparametric variable selection procedure via modeling the regression function and penalizing its gradients. The difference between [36] and [49] is that the former models the gradients indirectly while the latter models the gradients directly. The current paper improves on the method proposed in [49] in computational efficiency and convergence, conducting variable selection in partially linear models without assuming structural constraints on the nonlinear component. The method proposed in [36] could similarly be extended to conduct variable selection in partially linear models. Finally, Comminges et al. [6] studied the asymptotics of variable selection in nonparametric regression and revealed two different regimes; the setting considered in this paper belongs to the first regime.
The rest of the paper is organized as follows. In Section 2, we propose a variable selection procedure based on gradient learning for partially linear models, along with an algorithm for implementing the procedure and a method for selecting the tuning parameters. In Section 3, we study some asymptotic properties of the proposed procedure. In Section 4, we evaluate the performance of the proposed procedure via simulation studies and real data applications. We conclude the paper with some summary and discussion in Section 5.

Method
Assume data (y_i, x_i), i = 1, . . . , n, are independently generated from the partially linear model

y = z^T β* + f*(w) + ε,    (2.1)

where y is the response variable, x = (z^T, w^T)^T consists of d = p + q predictors with z = (z^(1), . . . , z^(p))^T and w = (w^(1), . . . , w^(q))^T, E(ε) = 0, and V(ε) = σ². By convention, vectors are denoted by bold letters and their elements by non-bold letters. In this model, we assume that the effects of the predictors in z are linear and the effects of the predictors in w are nonlinear, with β* and f* as their true effects, respectively. There is no assumption about the structure of f*, which is only assumed to be twice differentiable. As discussed in [12], statistical accuracy, model interpretability, and computational complexity are three important pillars of any statistical procedure. When the number of predictors d is large, variable selection plays a crucial role in strengthening these three pillars. To this end, we assume that the true partially linear model (2.1) is sparse in the sense that some elements of z and some elements of w have no effect on the response variable. Specifically, in the linear component, a predictor z^(j) is noninformative if the corresponding effect β*_j = 0, j = 1, · · · , p, where p can be very large. In the nonlinear component, a predictor w^(l) is noninformative if the corresponding partial derivative g*_l = ∂f*/∂w^(l) vanishes identically. The empirical loss E_{z,w}(β, g) involves kernel weights w_ij, where each weight depends on the distance between x_i and x_j and some pre-specified parameter τ_n. In order to conduct variable selection in both the linear and nonlinear components, we propose a penalized procedure, minimizing the following objective function over the vector of parameters β = (β_1, · · · , β_p)^T and functions g = (g_1, · · · , g_q)^T:

E_{z,w}(β, g) + λ_1 Σ_{l=1}^p π_{z,l} |β_l| + λ_2 Σ_{l=1}^q π_{w,l} ‖g_l‖_K,    (2.2)

where ‖·‖_K is the norm of some reproducing kernel Hilbert space (RKHS) with kernel K(·, ·).
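To fix ideas, here is a minimal sketch of data generated from a sparse partially linear model of the form (2.1); the dimensions, coefficients, and nonlinear function below are illustrative choices, not those used elsewhere in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 10, 4                      # sample size, dim(z), dim(w)

z = rng.uniform(-0.5, 0.5, size=(n, p))   # linear-component predictors
w = rng.uniform(-0.5, 0.5, size=(n, q))   # nonlinear-component predictors

# Sparse truth: only the first three elements of z and the first
# element of w are informative (illustrative values).
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -1.5, 1.0]

def f_star(w):
    return 2.0 * np.sin(np.pi * w[:, 0])  # depends on w^(1) only

eps = rng.normal(0.0, 0.5, size=n)        # E(eps) = 0, V(eps) = sigma^2
y = z @ beta_star + f_star(w) + eps
```

Under this sparsity, a correct procedure should recover the three informative linear predictors and the single informative nonlinear predictor.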
In addition, π_{z,l} and π_{w,l} are weights to be discussed in Subsection 2.1, and λ_1 and λ_2 are tuning parameters, discussed in Subsection 2.2, that control the compromise between goodness-of-fit and parsimony of the selected model. The RKHS is a very popular tool for modeling nonparametric functions. By the representer theorem [43], the minimizer of (2.2) admits a finite-dimensional representation, which leads to the finite-dimensional objective function (2.3). Here are some remarks on the objective functions (2.2) and (2.3). First, the proposed partially linear model is a special case of the nonparametric model considered in Yang et al. [49], if we write the RKHS considered in Yang et al. [49] as the direct sum of an RKHS of linear functions with respect to z and an arbitrary one with respect to w. However, this special case is still worth investigating, because (1) the number of parameters is reduced from (p + q)n in a general RKHS to p + qn in this special RKHS, so, compared with Yang et al. [49], our proposed method can be implemented much faster and can be applied to high-dimensional data; (2) the resulting penalty term can be written as the sum of the well-known adaptive lasso penalty [56] and the adaptive group-lasso penalty [44]; and (3) the tuning can be carried out separately for the linear and nonlinear components.
Second, the above penalized objective function can be used to conduct variable selection because of the two sparsity-inducing penalties it incorporates. The first is a lasso-type penalty: the lasso penalty [41] without the weights π_{z,l}, and the adaptive lasso penalty [56] with them. The second is a group-lasso-type penalty: the group lasso penalty [54] without the weights π_{w,l}, and the adaptive group-lasso penalty [44] with them. In particular, the group-lasso-type penalty on the nonparametric gradient functions has the so-called "all-in-all-out" property. That is, when λ_2 is large, some individual terms of the group-lasso penalty become zero, that is, ‖g_l‖_K = 0 for some l's, implying that g_l(w) ≡ 0 for all w.
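The "all-in-all-out" behaviour can be seen in the block soft-thresholding operator, which solves the group-lasso proximal subproblem; this is a generic illustration of the mechanism, not the paper's exact update:

```python
import numpy as np

def block_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2: shrink the whole group v toward
    zero, and return exactly zero when ||v||_2 <= t ("all-out")."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

# A weak group is eliminated entirely, a strong group is only shrunk.
weak = block_soft_threshold(np.array([0.1, -0.2]), t=0.5)    # -> [0, 0]
strong = block_soft_threshold(np.array([3.0, 4.0]), t=0.5)   # shrunk by 10%
```

Because the whole group is thresholded through its norm, either every coordinate of g_l is kept or all are set to zero, which is precisely the group-level selection described above.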
Third, the kernel weights w_ij are introduced because f*(w_j) can be locally approximated by f*(w_i) + g*(w_i)^T (w_j − w_i). This idea of approximation comes from kernel-weighted regression [9]. For simplicity, the parameter τ_n in the kernel weights is not treated as a tuning parameter and can be set to the median of the pairwise distances among all the sample points [30].
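The weights and the median heuristic for τ_n can be computed as follows; the Gaussian form of w_ij is an assumption here, since the display defining w_ij is not reproduced above:

```python
import numpy as np

def kernel_weights(x):
    """Compute w_ij = exp(-||x_i - x_j||^2 / (2 * tau_n^2)), with tau_n
    set to the median of the pairwise distances (the heuristic of [30]),
    so that tau_n need not be treated as a tuning parameter."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))           # n x n distance matrix
    tau = np.median(dist[np.triu_indices(len(x), k=1)])
    return np.exp(-dist ** 2 / (2.0 * tau ** 2))

x = np.random.default_rng(1).normal(size=(50, 3))
W = kernel_weights(x)
```

Each row of W downweights sample points far from x_i, so the local linear approximation of f* around w_i is trusted only for nearby points.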

Implementation
First we consider the weights π_{z,l} and π_{w,l}. As in [56], we select them as π_{z,l} = 1/|β̃_l|^{γ_1} and π_{w,l} = 1/‖g̃_l‖^{γ_2}, where β̃_l and g̃_l are initial estimates of β_l and g_l obtained from a ridge-penalized version of the objective function, whose tuning parameters can be determined by cross-validation or generalized cross-validation [14]. The computation of this minimization problem with ridge-type penalties is fast because the solutions have explicit formulae. We are now ready to describe an algorithm to implement (2.3). Although a proximal algorithm [1,32] could be used, here we use the coordinate descent algorithm [13], which was also used in [49]. To this end, we alternately update α and β. When α is given, we update β by minimizing the sum of the loss function of β given α and the adaptive lasso penalty; this minimization can be solved using the R package "glmnet". Next we consider updating α given β.
Given β, we update α by the analogous minimization with the adaptive group-lasso penalty, which can be solved using the R package "gglasso".
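A schematic version of the alternating scheme, with plain proximal-gradient stand-ins for the glmnet and gglasso solvers and a generic smooth loss supplied through its gradients (the function names and step rule are illustrative, not the paper's exact algorithm):

```python
import numpy as np

def soft(v, t):
    """Elementwise soft-thresholding (lasso proximal operator)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def alternate_descent(grad_beta, grad_alpha, beta0, alpha0,
                      lam1, lam2, groups, step=0.5, n_iter=200):
    """Alternate a lasso step for beta (linear component) with a
    group-lasso step for alpha (nonlinear component)."""
    beta, alpha = beta0.copy(), alpha0.copy()
    for _ in range(n_iter):
        # beta-update given alpha: proximal gradient with l1 penalty
        beta = soft(beta - step * grad_beta(beta, alpha), step * lam1)
        # alpha-update given beta: proximal gradient with group penalty
        a = alpha - step * grad_alpha(beta, alpha)
        for g in groups:
            norm = np.linalg.norm(a[g])
            a[g] = 0.0 if norm <= step * lam2 else (1 - step * lam2 / norm) * a[g]
        alpha = a
    return beta, alpha
```

With a separable quadratic loss this converges to the usual soft-thresholded solutions; in the actual procedure the loss couples β and α through the kernel-weighted fit, which is why the two blocks are updated alternately.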

Tuning
Let θ = (α^T, β^T)^T and λ = (λ_1, λ_2)^T. Given estimates under tuning parameters λ, the set of variables selected for the linear component is {l : β̂_l ≠ 0, l = 1, . . . , p}, and similarly for the nonlinear component. We propose to select the tuning parameters λ_1 and λ_2 using the selection stability procedure proposed by [39]. Here we briefly describe this tuning procedure; the reader is referred to [39] for more details. Given λ = (λ_1, λ_2), let S_λ(D) denote the subset of variables selected for the partially linear model under consideration based on a training dataset D. Randomly partitioning the original dataset D into two halves D_1 and D_2 yields S_λ(D_1) and S_λ(D_2). We then use Cohen's kappa [5] to measure the agreement of these two subsets, denoted κ(S_λ(D_1), S_λ(D_2)). Repeating this over B random partitions gives B kappa measures, whose average, denoted stab(λ), measures the selection stability at λ. Finally, we take the maximizer, λ̂ = argmax_λ stab(λ), as the estimate of the tuning parameters.
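The kappa-based agreement measure can be computed from two selected subsets as follows (a sketch; `select_fn` is a placeholder for running the selection procedure on half of the data):

```python
import numpy as np

def cohens_kappa(s1, s2, d):
    """Cohen's kappa between two selected subsets s1, s2 (sets of
    indices) of d candidate variables, treating selection versus
    non-selection of each variable as two raters' binary votes."""
    a = len(s1 & s2)               # selected by both
    b = len(s1 - s2)               # selected only in the first split
    c = len(s2 - s1)               # selected only in the second split
    e = d - a - b - c              # selected by neither
    p_obs = (a + e) / d            # observed agreement
    p_exp = ((a + b) * (a + c) + (c + e) * (b + e)) / d ** 2
    return (p_obs - p_exp) / (1.0 - p_exp)

def stability(select_fn, data, lam, B=50, seed=0):
    """Average kappa over B random half-splits: stab(lambda)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    kappas = []
    for _ in range(B):
        idx = rng.permutation(n)
        s1 = select_fn(data[idx[: n // 2]], lam)
        s2 = select_fn(data[idx[n // 2:]], lam)
        kappas.append(cohens_kappa(s1, s2, d))
    return float(np.mean(kappas))
```

Note that kappa is undefined when both splits select all (or none) of the variables, since then p_exp = 1; such degenerate λ values would typically be excluded from the grid.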

Asymptotic theory
In this section, we derive the estimation consistency and variable selection consistency of the proposed method under the following assumptions.

Assumption A1. The support Z of z and the support W of w are non-degenerate compact subsets of R^p and R^q, respectively. Also, sup_w ‖H*(w)‖_2 ≤ c_1 for some constant c_1, where H*(w) = ∇²f*(w) and ‖·‖_2 is the l_2 norm, i.e., the largest eigenvalue of the matrix in absolute value.
Assumption A2. For some constant c_2, the probability density p(x) of x exists and satisfies the Lipschitz condition |p(x) − p(x′)| ≤ c_2 ‖x − x′‖.

Assumption A3. There exist constants c_3 and c_4 such that c_3 ≤ lim_{n→∞} min_{1≤l≤p_0} π_{z,l} ≤ lim_{n→∞} max_{1≤l≤p_0} π_{z,l} ≤ c_4 and c_3 ≤ lim_{n→∞} min_{1≤l≤q_0} π_{w,l} ≤ lim_{n→∞} max_{1≤l≤q_0} π_{w,l} ≤ c_4, while λ_1 π_{z,l} → ∞ for l > p_0 and n^{−3/2} λ_2 π_{w,l} → ∞ for l > q_0.

Regarding these three assumptions, Assumption A1 is often used in the literature on partially linear models [16] and simplifies the technical proofs significantly. Assumption A2 specifies the smoothness of the density function of x through the regular Lipschitz condition. Under Assumptions A1 and A2, the density p(x) is continuous and bounded on X, so there exists a constant c_5 bounding p(x); moreover, for the boundary set X_t = {x ∈ X : dist(x, ∂X) ≤ t}, there exists a constant c_6 such that Prob(X_t) ≤ c_6 t for any t > 0. Assumption A3 controls the convergence rate of the adaptive lasso weights, which guarantees the estimation and selection consistency.
Theorem 3.1 (estimation consistency). Under Assumptions A1–A3, the estimators β̂ and ĝ converge to the true effects β* and g* in probability, as n → ∞.

Compared with the weak estimation consistency in Yang et al. [49], the strong estimation convergence rate here is established under kernel-norm regularization. We only state estimation consistency in Theorem 3.1, which is common practice in the literature [10,49,56]; the sparsity level is addressed in Theorem 3.2.
Theorem 3.2 assures that, with probability tending to one, the selected variables are exactly the truly informative variables.

Simulation studies
We examine the performance of the proposed variable selection method for partially linear models (referred to as PL), comparing it against other popular variable selection methods in the literature: the variable selection method for additive models proposed by [47], Cosso by [26], the model-free gradient learning method by [49], and the regular gradient learning method [30], referred to as Add, Cosso, MF-GL and GL, respectively. Note that most variable selection procedures based on partially linear models, such as Cheng et al. [4], Liang and Li [25], Liu et al. [27], Ni et al. [33] and Wang et al. [45], do not select variables in the nonlinear component; we therefore do not include these methods, because they cannot match our proposed method on nonlinear-component selection.
In all the numerical studies, the Gaussian kernel is used and the parameter τ_n is set to the median of the pairwise distances among all the sample points [30]. For all four methods considered, the tuning parameters are selected using the same criterion, the variable selection stability criterion proposed by [39], as discussed in Subsection 2.2. For simplicity, we set λ_1 = λ_2 in all the simulation examples. We consider the following two simulation examples: the data generating model is additive partially linear in Example 1 and non-additive partially linear in Example 2.
Example 1. In this example, we first generate predictors z_i = (z_i^(1), . . . , z_i^(p))^T and w_i = (w_i^(1), . . . , w_i^(q))^T whose components are independently generated from U(−0.5, 0.5), j = 1, · · · , p and l = 1, · · · , q, for each i = 1, · · · , n. Then we set f*(w_i) = 2 sin(πw_i^(1)) + · · · .

Example 2. In this example, we first generate predictors z_i = (z_i^(1), . . . , z_i^(p))^T and w_i = (w_i^(1), . . . , w_i^(q))^T whose components are independently generated from N(0, 1), j = 1, · · · , p and l = 1, · · · , q, for each i = 1, · · · , n. Then we set f*(w_i) = (3w_i^(1) · · · .

Each scenario is replicated 50 times, and the results are reported in Table 1 for Example 1 and Table 2 for Example 2. Specifically, column "Size" reports the average number of selected variables; "TLP" and "TNP" report the numbers of selected truly informative variables in the linear and nonlinear components, respectively; and "FLP" and "FNP" report the numbers of selected truly non-informative variables in the linear and nonlinear components, respectively. Columns "C", "U", and "O" report how many of the 50 replications result in correct-fitting, under-fitting, and over-fitting, respectively. From Table 1, we see that our newly proposed method (PL) is comparable with the others when the data generating model is an additive partially linear model. From Table 2, we see that our method outperforms the others when the data generating model is a non-additive partially linear model. First, the average size of the subsets selected by PL is closer to 7 than for the other methods. Second, most values of TLP and TNP from our method are close to 5 and 2, while both FLP and FNP are close to zero, which means that our method is good at selecting the true subsets in both the linear and nonlinear components. Third, from column "C", we see that our method most often selects exactly the true subset. Fourth, from columns "U" and "O", we see that Cosso and Add often under-fit when n = 150, often over-fit when n = 300, and often over- or under-fit when n = 225.
Fifth, we find that GL always tends to over-fit or under-fit, possibly because the GL selection procedure depends on a pre-specified truncation value. From both tables, we see that although the performance of MF-GL is promising, its computational cost is relatively high, especially as the sample size and dimension increase. In addition, we see that as the variance σ² becomes larger, it is more challenging to select the true subset.

Real data applications
We further examine the performance of the proposed variable selection method using two real data applications, the digit recognition data [40] and Japanese industrial chemical firms data [48], both of which are publicly available.

Digit recognition data
In the digit recognition data, each digit is described by an 8×8 gray-scale image with each entry ranging from 0 to 16. Due to their similarity, it is challenging to distinguish digits 3 and 5. Therefore, in this application, we only consider digits 3 and 5, and the resulting dataset consists of 365 observations and 64 attributes. We code the outcome variable as y = 3 for digit 3 and y = 5 for digit 5. For the purpose of demonstration, we use the partially linear model as the predictive model, although other classification models may be more appropriate. From some descriptive analysis, we find that these two digits mainly differ in dimensions 19, 21, 22, 27 and 54. Therefore, in the partially linear model, we put these 5 dimensions in the nonlinear component and the others in the linear component.
In this analysis, all the variables are standardized, all observations with missing values are discarded, and the four aforementioned variable selection procedures (PL, Cosso, Add and MF-GL) are applied. The performance of the variable selection procedures is compared based on the average prediction errors using only the selected variables. The average prediction errors are estimated as follows. First, the dataset is randomly split into two parts, m = 35 observations for testing and the remaining 330 observations for training. Second, a partially linear model is fitted on the training dataset, one submodel is selected, and the average prediction error estimate is obtained on the test set. The results are summarized in Table 3.
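The split-and-refit estimate of the prediction error can be sketched generically as follows; `fit_fn` and `predict_fn` are placeholders standing in for fitting the partially linear model on the selected variables and predicting from it:

```python
import numpy as np

def average_prediction_error(fit_fn, predict_fn, X, y, m, B=100, seed=0):
    """Average squared prediction error over B random splits that hold
    out m observations for testing and train on the rest."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(B):
        idx = rng.permutation(n)
        test, train = idx[:m], idx[m:]
        model = fit_fn(X[train], y[train])
        resid = predict_fn(model, X[test]) - y[test]
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))
```

Averaging over many random splits reduces the variance of the error estimate relative to a single hold-out split.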
From Table 3, we see that the proposed variable selection procedure, PL, selects fewer variables and has a smaller prediction error than the other variable selection procedures. Figure 1 shows two randomly selected images of digits 3 and 5 in the right and middle panels, respectively, and the two finally selected attributes are displayed in the left panel. We can see that the two digits are clearly different at the two selected attributes. Although MF-GL provides competitive performance, its computational cost is about 10 times that of PL.

Japanese industrial chemical firms data
The Japanese industrial chemical firms dataset consists of 186 Japanese industrial chemical firms listed on the Tokyo stock exchange, and the goal is to check whether concentrated shareholding is associated with lower expenditure on activities with scope for managerial private benefits. For the purpose of demonstration, we consider the partially linear model. From some descriptive analysis, we find that the variables LEVERAGE, VARS, OPER2 and BDA are highly correlated with MH5, and that the marginal relationships of LEVERAGE, OPER2 and BDA with the response variable are strongly linear. For the other, less correlated variables, the marginal scatter plots show that the associations of MH5 with ASSETS, AGE, TOP5 and TOP10 appear linear. Therefore, we put LEVERAGE, OPER2, BDA, ASSETS, AGE, TOP5 and TOP10 in the linear component and the others in the nonlinear component.
We apply the same strategy to compare the four variable selection procedures. The only difference is that we now consider 100 random splits of the dataset into a test dataset of m = 24 observations and a training dataset of the remaining observations. The results are summarized in Table 4, from which we can see that PL has the smallest prediction error. Although Cosso, Add and MF-GL select fewer variables, their prediction errors are larger. In particular, the PL method selects the variable OWNIND, which is not selected by the other methods. Figure 2 displays the scatterplot of MH5 against OWNIND. It seems that the mean of MH5 does not change much with OWNIND, while its variance appears to shrink as OWNIND increases.

Table 4: The number of selected variables and the prediction errors of the various selection methods on the Japanese industrial chemical firms dataset.

Summary
There has been an unmet need for a variable selection procedure for partially linear models that does not impose structural constraints on the nonparametric component. Both computational efficiency and model flexibility are important for analyzing high-dimensional data. Recently, the idea of gradient learning has become popular for variable selection because it is model-free [49]. However, model-free gradient learning is computationally expensive for high-dimensional data in a purely nonparametric setting; gradient learning is therefore particularly suitable for partially linear models. In this paper, we propose a variable selection procedure based on gradient learning for partially linear models. The proposed procedure incorporates two sparsity-inducing penalties, one for variable selection in the linear component and the other for variable selection in the nonlinear component. Since the computational cost of variable selection in the linear component is low, the proposed procedure is computationally feasible and applicable to the analysis of high-dimensional data. At the same time, the variable selection in the nonlinear component is model-free, without the need to impose any assumption on the structure of the nonlinear component.
The proposed procedure is formulated in terms of a reproducing kernel Hilbert space, which provides a general framework for developing efficient algorithms and deriving desirable theoretical properties. Specifically, a block-wise coordinate descent algorithm is developed, and both estimation consistency and selection consistency are established. Two issues are crucial in applying the proposed method. One is selecting the tuning parameters; we propose to use stability selection, though other tuning methods such as cross-validation can also be used. The other is the specification of the partially linear model, that is, deciding which variables are in the linear component and which are in the nonlinear component. In this manuscript, we focus on variable selection and assume that the partially linear model has been pre-specified; the reader is referred to [16] for a discussion of partially linear model specification. Under similar assumptions, existing general variable selection procedures, such as Rosasco et al. [36], could also be extended to partially linear models in a similar fashion.

Appendix: Technical proofs
Denote the first term in objective function (2.2) as E_{z,w}(β, g), and denote its expectation, taken over the random predictors, as E(β, g). Write J(β, g) = λ_1 Σ_{l=1}^p π_{z,l} |β_l| + λ_2 Σ_{l=1}^q π_{w,l} ‖g_l‖_K. Then the following inequality holds.
The first inequality holds because β̂ and ĝ are the minimizers of E_{z,w}(β, g) + J(β, g); the last equality is due to the fact that β*_l = 0 for any l > p_0 and g*_l = 0 for any l > q_0.
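Written out with the definitions above, the basic inequality reads (reconstructed from the surrounding description):

```latex
\mathcal{E}_{z,w}(\hat{\beta}, \hat{g}) + J(\hat{\beta}, \hat{g})
  \le \mathcal{E}_{z,w}(\beta^*, g^*) + J(\beta^*, g^*)
  = \mathcal{E}_{z,w}(\beta^*, g^*)
    + \lambda_1 \sum_{l=1}^{p_0} \pi_{z,l} \,|\beta^*_l|
    + \lambda_2 \sum_{l=1}^{q_0} \pi_{w,l} \,\|g^*_l\|_K ,
```

where the penalty on the truth involves only the first p_0 and q_0 terms precisely because β*_l = 0 for l > p_0 and g*_l = 0 for l > q_0.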