BS-SIM: An effective variable selection method for high-dimensional single index model

Abstract: The single index model is an intuitive extension of the linear regression model. It has become increasingly popular due to its flexibility in modeling. Similar to the linear regression model, the set of predictors for the single index model can contain a large number of irrelevant variables. Therefore, it is important to select the relevant variables when fitting the single index model. However, the problem of variable selection for the high-dimensional single index model is not well settled in the literature. In this work, we combine the idea of applying cubic B-splines for estimating the single index model with the idea of using the family of the smooth integration of counting and absolute deviation (SICA) penalty functions for variable selection. We propose a new method to simultaneously perform parameter estimation and model selection for the single index model. This method is referred to as the B-spline and SICA method for the single index model, or in short, BS-SIM. We develop a coordinate descent algorithm to efficiently implement BS-SIM. We also show that under certain conditions, the proposed method can consistently estimate the true index and select the true model. Simulations with various settings and a real data analysis are conducted to demonstrate the estimation accuracy, selection consistency and computational efficiency of BS-SIM.


Introduction
Consider a univariate response Y and a p-dimensional predictor X. The single index model takes the following form: Y = f(X^T θ_0) + ε, where T indicates the transpose of a matrix, θ_0 is a vector of length p referred to as the index, f is an unknown smooth function, and ε denotes the random error term. The single index model generalizes the linear model by incorporating a non-parametric link function f, and it has applications in a wide range of fields. A number of methods have been proposed in the literature to estimate the true index θ_0. Härdle and Stoker [5] introduced the Average Derivative Estimation (ADE) method. It relies on an intrinsic property of the single index model that θ_0 is proportional to the gradient ∂f/∂X. Several modified ADE methods have been proposed since, including the density-weighted ADE method [16] and the outer product of gradients method [26]. Another category consists of methods that simultaneously estimate θ_0 and f. The Minimum Average Variance Estimation (MAVE) method proposed by Xia et al. [27] enjoys the most popularity among these methods. One major drawback of the ADE-based methods and the MAVE method is that they all use high dimensional kernels in estimation, and thus suffer from the curse of dimensionality. Consequently, they do not perform well in estimation even when the dimension p is moderate. To overcome this, Xia et al. [27] also proposed the refined MAVE (rMAVE) method, which replaces the high dimensional kernel with a lower dimensional projection kernel. However, the computational complexity of MAVE and rMAVE still grows rapidly with the sample size n, and they can become unstable when p increases. Recently, Wang and Yang [23] proposed the Single-Index Prediction (SIP) estimator, which uses cubic B-splines to estimate θ_0 and f simultaneously.
The application of the cubic B-splines circumvents the drawbacks suffered by high dimensional kernels, and as expected, simulation studies showed that SIP is considerably faster than MAVE, especially in the high dimensional case.
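For concreteness, data from a single index model can be simulated as below. This is a minimal sketch; the index, the sine link, and the Gaussian design are illustrative choices of ours, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sim(n, p, theta0, f, sigma=1.0):
    """Draw (X, y) from the single index model y = f(x^T theta0) + eps,
    with Gaussian predictors and errors (an illustrative assumption)."""
    X = rng.standard_normal((n, p))
    eps = sigma * rng.standard_normal(n)
    y = f(X @ theta0) + eps
    return X, y

# A true index with 3 relevant and 7 irrelevant predictors, normalized
# so that ||theta0||_2 = 1 (the second identifiability constraint below).
theta0 = np.zeros(10)
theta0[:3] = [2.0, -1.0, 0.5]
theta0 /= np.linalg.norm(theta0)
X, y = simulate_sim(200, 10, theta0, np.sin)
```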
In practice, a large number of variables among the predictors may not be related to the response. As in linear regression, it is important to select the relevant variables when fitting the single index model. Various traditional variable selection methods have been extended to the single index model; for example, AIC [11] and cross-validation [6]. However, these methods suffer from the same drawbacks as in the linear regression: they are computationally intensive, and it is difficult to establish large-sample properties for the resulting estimates. Tibshirani [20] introduced the least absolute shrinkage and selection operator (LASSO) as a regularization method for simultaneous parameter estimation and variable selection in linear models. LASSO has gained huge popularity since it was proposed, due to its succinctness and computational efficiency. Zhao and Yu [29] studied the sufficient and almost necessary condition, namely the Irrepresentable Condition, under which LASSO can consistently select the true model. Several attempts have been made to incorporate LASSO or its variants into the single index model; see [24,14,28,25]. All of these methods combine some penalty function with MAVE, and thus they inherit the drawbacks of MAVE: they are computationally inefficient as the sample size grows and become unstable when the dimensionality is high.
There are various extensions or variants of LASSO proposed in the literature; see [3,31,1] among others. Lv and Fan [9] considered a unified framework for regularized least squares methodology with a family of concave penalty functions. This family of penalty functions forms a smooth homotopy between the L_0 and L_1 penalties and thus is referred to as the smooth integration of counting and absolute deviation (SICA) penalty functions. It includes LASSO as a limiting case. Lv and Fan [9] also developed the properties of the resulting estimator under the linear model and the SICA penalty function. More specifically, they obtained the conditions on the design matrix under which the estimator can recover the true model. These conditions on the design matrix are less restrictive than the Irrepresentable Condition, which may make the SICA penalty more appealing in cases where the Irrepresentable Condition does not hold and LASSO is not consistent in variable selection.
In this paper, we propose a new method to simultaneously perform parameter estimation and model selection for the single index model. This method combines the idea of using B-splines for estimating the single index model with the idea of using the SICA penalty for variable selection. We refer to it as the B-spline and SICA method for the single index model, or in short, BS-SIM. We develop a coordinate descent algorithm to efficiently implement our method for both low and high dimensionality. We prove that under mild regularity conditions, our method is consistent in estimation and can achieve the optimal estimation rate. We further show that with more conditions on the structure of f and the design matrix X, our method also has the ability to correctly identify the true model. As mentioned in the previous paragraph, the LASSO penalty is a limiting case of the family of the SICA penalty functions. Therefore, the algorithm and the properties of BS-SIM are also applicable to LASSO. When LASSO is applied, the method is referred to as the B-spline and LASSO method for the single index model, or BL-SIM. The simulation studies and a real data example demonstrate that BS-SIM provides excellent computational efficiency, estimation accuracy and selection consistency for data of low to high dimensionality.
The rest of the paper is organized as follows. Section 2 describes BS-SIM and BL-SIM, and outlines an efficient algorithm to implement them. Section 3 presents the theoretical properties of the proposed methods. Section 4 and Section 5 illustrate the performance of the proposed methods in various simulation studies, as well as a real data example. The technical proofs are given in the Supplementary Material [32].

Spline estimation and regularization
Suppose a random sample of n observations is generated from the single index model y_i = f(x_i^T θ_0) + ε_i, i = 1, 2, . . . , n, where θ_0 = (θ_{0,1}, θ_{0,2}, . . . , θ_{0,p})^T is the true index, and the ε_i's are i.i.d. random variables with mean 0 and a common variance σ². Let Y = (y_1, · · · , y_n)^T denote the n × 1 response vector, and X = (x_1, x_2, · · · , x_n)^T be the n × p matrix with x_i representing its i-th row. The true index θ_0 is only identifiable up to a scale constant without further constraint. In the literature, there are two popularly used identifiability constraints. Identifiability Constraint 1: θ_{0,1} = 1. Identifiability Constraint 2: ‖θ_0‖_2 = 1 and θ_{0,1} > 0.
In this work, we consider any general and feasible constraint on the scale of θ_0. We work with the nontrivial case that there is at least one non-zero component in θ_0. Thus, for any constraint, it is important to first identify one component θ_{0,k} that is non-zero. This component θ_{0,k} can be assumed known or identified by methods such as marginal correlation. Without loss of generality, we assume k = 1. Although a large number of general identifiability constraints can be used, in Section 4, we show with simulation studies that different constraints can have different impacts on various aspects of the method's performance.
Suppose one specifies the following identifiability constraint: C(θ) = 1, where θ = (θ_1, θ_2, . . . , θ_p)^T and C is an explicit function on the scale of θ. Then θ_1 can be expressed as a function of the remaining components, that is, θ_1 = C_1(θ_2, θ_3, . . . , θ_p). Let φ = (θ_2, θ_3, . . . , θ_p)^T be the (p−1)-dimensional sub-vector of θ obtained by excluding the first component, and let t_θ = Xθ denote the vector of projected data. Let φ_0 denote the last (p − 1) components of θ_0, and let Φ be the space for φ. With an appropriate identifiability constraint imposed, φ and θ have a one-to-one association. The goal of inference under the single index model is then to estimate φ_0 (and thus θ_0) and the true link function f. For a given θ, let t_i^θ = x_i^T θ be the projection of the data onto the direction of θ, i = 1, 2, . . . , n. Let t_θ(min) = min_i t_i^θ and t_θ(max) = max_i t_i^θ. The interval [t_θ(min), t_θ(max)] is partitioned into (N + 1) subintervals, and T_N denotes the sequence of the N interior knots that separate them. Let B_4 = (B_{4,1}, B_{4,2}, . . . , B_{4,N+4})^T be the cubic B-spline basis functions on [t_θ(min), t_θ(max)] with knots T_N. The explicit form of B_4 can be derived recursively [2]. Here we slightly abuse the notation in the sense that θ and T_N are omitted in the representation of the basis functions. The evaluations of the basis functions on the projected data points are collected in the matrix B_θ, whose i-th row is B_4(t_i^θ)^T, where B_4(t) denotes the evaluation of the cubic B-spline basis functions at t.
The cubic B-spline estimator of f is defined as f̂_θ(·) = α̂^T B_4(·), where α̂ = (α̂_1, . . . , α̂_{N+4})^T can be obtained by solving the following least-squares problem: α̂ = argmin_α Σ_{i=1}^n {y_i − α^T B_4(t_i^θ)}². Note that f̂_θ(·) depends on θ. Wang and Yang [23] further proposed the following least-squares method to estimate θ_0: θ̂_un = argmin_{θ∈Θ} Σ_{i=1}^n {y_i − f̂_θ(t_i^θ)}², where θ̂_un denotes the unpenalized estimator of θ_0, and Θ = {θ : ‖θ‖_2² = 1, θ_1 > 0}. As discussed in Section 1, the dimension p can be high in practice, and the set of predictors can include a large number of irrelevant variables. Therefore, it is of interest to produce a sparse estimator of θ_0, and thus achieve automatic variable selection. This motivates us to utilize the spline estimator f̂_θ(·) for f described above, coupled with the regularized least squares method for estimating θ_0, to achieve efficient and simultaneous parameter estimation and variable selection.
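The spline-fitting step can be sketched as follows, assuming equally spaced interior knots for simplicity. `bspline_basis` implements the standard Cox-de Boor recursion for B_4, and `fit_link` solves the least-squares problem for α̂ at a fixed θ; both function names are ours, not the paper's.

```python
import numpy as np

def bspline_basis(x, knots, k=3):
    """Evaluate all degree-k B-spline basis functions at x via the
    Cox-de Boor recursion; `knots` repeats each boundary k+1 times."""
    t = np.asarray(knots, float)
    x = np.atleast_1d(np.asarray(x, float))
    B = np.zeros((len(x), len(t) - 1))
    for j in range(len(t) - 1):                       # degree 0: indicators
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    at_end = x == t[-1]                               # include right endpoint
    if at_end.any():
        j_last = np.max(np.nonzero(t[:-1] < t[1:])[0])
        B[at_end, j_last] = 1.0
    for d in range(1, k + 1):                         # raise the degree
        Bn = np.zeros((len(x), len(t) - d - 1))
        for j in range(len(t) - d - 1):
            left = (x - t[j]) / (t[j + d] - t[j]) if t[j + d] > t[j] else 0.0
            right = ((t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1])
                     if t[j + d + 1] > t[j + 1] else 0.0)
            Bn[:, j] = left * B[:, j] + right * B[:, j + 1]
        B = Bn
    return B                                          # n x (N + 4) for k = 3

def fit_link(X, y, theta, N=5):
    """Cubic B-spline least-squares fit of f at a fixed index theta:
    alpha_hat = argmin_alpha sum_i (y_i - alpha^T B4(x_i^T theta))^2."""
    t = X @ theta
    a, b = t.min(), t.max()
    interior = np.linspace(a, b, N + 2)[1:-1]         # N interior knots
    knots = np.concatenate(([a] * 4, interior, [b] * 4))
    B = bspline_basis(t, knots)
    alpha, *_ = np.linalg.lstsq(B, y, rcond=None)
    return alpha, lambda s: bspline_basis(s, knots) @ alpha

# Quick check with a noiseless linear link, which lies exactly in the
# cubic spline space, so the fit should reproduce y.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
theta = np.array([2.0, -1.0, 0.5])
theta /= np.linalg.norm(theta)
alpha_hat, f_hat = bspline_fit = fit_link(X, X @ theta, theta, N=5)
```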
Since θ_{0,1} is assumed to be non-zero, we penalize φ instead of θ. We further use the family of the SICA penalty functions. That leads us to the following objective function:

R(φ; λ) = (1/n) Σ_{i=1}^n {y_i − f̂_θ(x_i^T θ)}² + λ Σ_{j=1}^{p−1} ρ_a(|φ_j|),

where f̂_θ is the cubic B-spline estimator of f for a given θ, λ is a tuning parameter, and ρ_a(u) denotes the SICA penalty function with the following form:

ρ_a(u) = (a + 1)|u|/(a + |u|) for a ∈ (0, ∞), and ρ_0(u) = I(u ≠ 0),

where I is the indicator function. For simplicity, we do not include a in the notation of R, and write R(φ; λ) as R(φ) when there is no confusion. For a fixed λ, we define the following estimator of φ_0: φ̂ = argmin_{φ∈Φ} R(φ; λ). The corresponding estimator for θ_0 is denoted as θ̂, and is referred to as the BS-SIM estimator. As discussed in [9], the SICA family of penalty functions provides a smooth homotopy between the L_0 and L_1 penalties, and we have lim_{a→∞} ρ_a(u) = |u|. That means the LASSO penalty is the limiting case of the SICA penalty. In some applications, the LASSO penalty can also be of interest, and the estimator based on LASSO is defined separately below. We denote the objective function when a = ∞ as R_L(φ; λ). That is,

R_L(φ; λ) = (1/n) Σ_{i=1}^n {y_i − f̂_θ(x_i^T θ)}² + λ‖φ‖_1,

where ‖·‖_1 denotes the L_1 norm. We write it as R_L(φ) when there is no confusion. For a fixed λ, we define the following estimator of φ_0: φ̂_L = argmin_{φ∈Φ} R_L(φ; λ), and the corresponding estimator for θ_0 is denoted as θ̂_L. We refer to θ̂_L as the BL-SIM estimator. It can be expected that the BS-SIM estimator converges to the BL-SIM estimator as a approaches ∞.
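The homotopy between the L_1 and L_0 penalties can be illustrated numerically. This is a minimal sketch using the SICA form ρ_a(u) = (a+1)|u|/(a+|u|) from [9]; the function name `rho_sica` is ours.

```python
import numpy as np

def rho_sica(u, a):
    """SICA penalty rho_a(u) = (a + 1)|u| / (a + |u|) for a in (0, inf);
    a = inf gives the LASSO penalty |u|, while a -> 0+ approaches the
    L0 penalty I(u != 0). Note rho_a'(0+) = 1 + 1/a."""
    u = np.abs(np.asarray(u, float))
    if np.isinf(a):
        return u
    return (a + 1.0) * u / (a + u)

# At u = 0.5, a = inf gives |u| = 0.5, and shrinking a pushes the
# penalty value toward the L0 limit of 1.
vals = [rho_sica(0.5, a) for a in (np.inf, 1.0, 0.1, 1e-4)]
```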

Coordinate descent algorithm
For ease of presentation, we develop a coordinate descent algorithm to find φ̂ (or φ̂_L) for any given λ on a dense grid.
In addition, we use a local linear approximation to the SICA penalty function, as suggested by [9]: for a given φ^(0) = (φ_1^(0), . . . , φ_{p−1}^(0))^T,

ρ_a(|φ_j|) ≈ ρ_a(|φ_j^(0)|) + ρ'_a(|φ_j^(0)|)(|φ_j| − |φ_j^(0)|).

These two approximations entail that for a given φ^(0), Problem (2) can be approximated by the weighted LASSO problem (6) with weights w_j = ρ'_a(|φ_j^(0)|), for j = 1, 2, · · · , p − 1. To solve Problem (6), we cyclically update each component of φ while holding the other components fixed. That means, for j = 1, 2, · · · , p − 1, we solve the univariate problem (7), where h_kl denotes the component in the kth row and the lth column of H^(2)(φ^(0)) and β_j denotes the jth element of the corresponding linear term. Notice that Problem (7) is essentially a univariate LASSO problem, and the solution can be written down explicitly via soft-thresholding:

φ̂_j = sign(a_j) max(|a_j| − λw_j, 0)/h_jj, where a_j = β_j − Σ_{k≠j} h_jk φ_k.

We repeatedly iterate through j and update the estimate of φ_0 until some convergence criterion is met. When implementing Algorithm 1, there are two issues that require further attention. First, during the sth cycle over j, a line search method is applied [12]. We start with φ̂^(s) and obtain a tentative update φ̂^(s+1). Before setting φ̂^(s+1) as the current estimate of φ_0, we need to check that the objective function R is indeed decreasing. If it is not, the step δ = φ̂^(s+1) − φ̂^(s) is repeatedly multiplied by 0.8 until a movement along the direction δ that decreases R is obtained. Here, 0.8 is chosen for convenience and may not be optimal; a more sophisticated choice can be further explored (see the previously mentioned reference on line search). The other implementation issue is that the optimization over φ should be carried out in the space Φ, whereas the algorithm described above does not impose any constraint on the space over which the optimization is executed.
For some identifiability constraints, such as Identifiability Constraint 1 mentioned earlier, Φ is actually R^{p−1}; for others, such as Identifiability Constraint 2 in the previous section, Φ is a constrained subspace of R^{p−1}. In the former case, no adjustment is needed; in the latter case, an additional step is required to ensure that the updated φ̂ is in the constrained space Φ. For instance, under Identifiability Constraint 2, it needs to be checked that the updated φ̂ satisfies ‖φ̂‖_2 < 1. If it does not, the step δ needs to be shortened so that φ̂ falls within Φ. Algorithm 1 outlines the search for φ̂ at a given λ in more detail. Problem (3) can be solved in a similar fashion; the only difference is that for Problem (3), there is no need to use the local linear approximation to the penalty function. Therefore, the algorithm for finding φ̂_L is not displayed separately.
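The inner update can be sketched as follows. Here `H` and `beta` stand in for the quadratic-approximation quantities H^(2)(φ^(0)) and its linear term (our simplified notation; the line-search and constraint steps are omitted), and the univariate solution is the usual soft-thresholding rule.

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def cycle(phi, H, beta, w, lam, n_sweeps=50, tol=1e-8):
    """Cyclic coordinate descent for the weighted-LASSO surrogate
    (1/2) phi^T H phi - beta^T phi + lam * sum_j w_j |phi_j|;
    each univariate problem is solved in closed form."""
    phi = phi.copy()
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(len(phi)):
            a_j = beta[j] - H[j] @ phi + H[j, j] * phi[j]  # partial residual
            new = soft_threshold(a_j, lam * w[j]) / H[j, j]
            max_change = max(max_change, abs(new - phi[j]))
            phi[j] = new
        if max_change < tol:
            break
    return phi

# With H = I and unit weights the solution is soft_threshold(beta, lam).
H = np.eye(3)
beta = np.array([2.0, 0.5, -1.5])
phi_hat = cycle(np.zeros(3), H, beta, np.ones(3), lam=1.0)
```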

Tuning parameter selection
For regularization-based approaches, it is crucial to choose the tuning parameters, namely λ and a in our case. We start with the selection of λ, for which we consider two types of methods. The first is m-fold cross-validation, denoted as CV hereafter. In CV, the sample is randomly partitioned into m subsamples of equal size. Among these m folds, m − 1 of them are treated as the training set, and the remaining one is treated as the validation set. At each candidate value for λ, the proposed approach is applied to the training set, and a fit is obtained. Subsequently, the validation set is used to assess the predictive accuracy of the obtained model; the residual sum of squares can be used as the assessment. This process is repeated m times until each fold of the sample is used as the validation set exactly once. For a given λ, the m assessments are then averaged, and the value of λ that yields the smallest average is regarded as optimal.
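The CV loop can be sketched generically. `fit` and `predict` are hypothetical callables standing in for the BS-SIM fitting and prediction routines; ridge regression is used below purely to exercise the loop.

```python
import numpy as np

def cv_select_lambda(X, y, lambdas, fit, predict, m=5, seed=0):
    """m-fold CV for lambda: average validation RSS over folds and
    return the lambda with the smallest average."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % m              # random equal-sized folds
    avg_rss = []
    for lam in lambdas:
        rss = [np.sum((y[folds == k]
                       - predict(fit(X[folds != k], y[folds != k], lam),
                                 X[folds == k])) ** 2)
               for k in range(m)]
        avg_rss.append(np.mean(rss))
    return lambdas[int(np.argmin(avg_rss))]

# Stand-in fit/predict pair (ridge regression) to exercise the routine.
ridge_fit = lambda Xtr, ytr, lam: np.linalg.solve(
    Xtr.T @ Xtr + lam * np.eye(Xtr.shape[1]), Xtr.T @ ytr)
ridge_pred = lambda b, Xte: Xte @ b
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)
best_lam = cv_select_lambda(X, y, [0.01, 0.1, 1.0, 10.0],
                            ridge_fit, ridge_pred)
```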
The second type is the Bayesian Information Criterion (BIC) and its variants [19]. For variable selection under the linear model Y = X^T β + ε, we examine the following four BIC-based criteria, (9)-(12).
where RSS_λ denotes the residual sum of squares at a given λ, σ² denotes the error variance, and d is the size of the identified model at a given λ. Furthermore, for criteria (11) and (12), k_n represents the additional penalty imposed on the size of the model. In practice, σ² is rarely known. On the other hand, according to [18], under certain conditions, the BIC defined in (9) has the same asymptotic behavior as the one defined in (10). Thus, it is more convenient to rely on logBIC in (10) to select the tuning parameter λ. It has previously been proved that, when the number of predictors p is fixed as the number of observations n grows, one can identify the true model under the linear model with probability tending to 1 by using the logBIC criterion [22]. Nevertheless, when p diverges, the logBIC criterion (10) tends to yield a model that contains many irrelevant predictors. Several adjustments have been proposed in the literature to circumvent this issue [22,4]. The common approach these adjustments take is to place a heavier penalty on the model complexity d. This idea naturally leads us to consider the GIC criterion in (11) and the logGIC criterion in (12). Clearly, GIC and logGIC include BIC and logBIC as special cases, respectively. Thus, GIC and logGIC can be regarded as unified criteria for the selection of λ for any p, and they can be extended to models other than the linear regression model. It is also worth noting that GIC involves σ². When σ² is unknown, there are various ways to obtain an estimate σ̂² and replace σ² with σ̂² in GIC. We elaborate on this in the next paragraph.
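These criteria can be sketched in code under assumed, standard forms: GIC = RSS/σ² + k_n·d and logGIC = n·log(RSS/n) + k_n·d, with k_n = log(n) recovering BIC and logBIC. These forms may differ from the paper's exact expressions (9)-(12); they are stated here only for illustration.

```python
import numpy as np

def gic(rss, n, d, kn, sigma2=None):
    """Assumed GIC-type criteria: with sigma2 given, RSS/sigma^2 + kn*d
    (GIC); otherwise the variance-free n*log(RSS/n) + kn*d (logGIC).
    Setting kn = log(n) recovers BIC and logBIC respectively."""
    if sigma2 is not None:
        return rss / sigma2 + kn * d
    return n * np.log(rss / n) + kn * d
```

In practice one evaluates the criterion along the λ grid and picks the λ minimizing it, just as with CV but without refitting on folds.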
In order to choose a proper type of method for determining λ under our framework, we carry out extensive simulation studies under both the linear model and the single index model. We try different settings of p and the size of the true model. In the simulation studies, we use σ̂² = RSS_0/(n − p) when n > p, and σ̂² = RSS_{λ_cv}/(n − d_cv) otherwise, where λ_cv denotes the value of λ selected by CV, and d_cv denotes the size of the model selected by CV. For all settings, CV generally leads to an overfitted model. When the true model is sparse, logGIC with an appropriate k_n performs best in terms of identifying the true model for any p, with GIC a close second. As the number of relevant variables grows, the performance of GIC surpasses that of logGIC, and GIC becomes the most preferable. In the moderately sparse scenario, logGIC starts to break down as p increases. When the size of the true model is large, logGIC fails in the sense that it leads to either a very large model or a very small model. Meanwhile, GIC can still produce significant improvement over CV when p is not large. When p also becomes large, the problem itself becomes so difficult that none of the methods performs satisfactorily.
Based upon these observations, we propose the following rule of thumb for the selection of λ under our framework. When sparsity of the true model is assumed, we use logGIC; when the size of the true model is relatively large, we use GIC. An example illustrating the breakdown of logGIC and the advantage of GIC under violation of the sparsity assumption is given in Section 4 of the Supplementary Material [32].
As for the selection of a, it can generally be accomplished by m-fold cross-validation. Since the focus of this work is to study the properties of θ̂ and θ̂_L, we do not intensively examine the selection of a.

Theoretical properties
Before stating the theoretical properties of BS-SIM and BL-SIM, we first need to impose the following regularity conditions (A1)-(A3).

(A1) The link function f has a continuous and bounded second-order derivative.
(A2) The number of interior knots N satisfies N ∼ n^{1/5}.

Estimation consistency
To begin with, we show that, under mild conditions, θ̂ is consistent in terms of estimation and can achieve the optimal √n rate for a well-selected λ. Moreover, as a special case, θ̂_L shares the same estimation property. We also present the theoretical property of f̂ in terms of estimating f.

Theorem 1. Suppose the regularity conditions (A1)-(A3) hold. Then (a) θ̂ is a consistent estimator of θ_0; and (b) for a well-selected λ, ‖θ̂ − θ_0‖_2 = O_p(n^{−1/2}).
As a special case, the BL-SIM estimatorθ L possesses the above properties.
Theorem 1 is expected and standard. Part (b) of Theorem 1 also facilitates the derivations of the selection consistency given in the next subsection. The next theorem characterizes the convergence rate of f̂ as an estimator of the regression function f.

Intuition and notations for selection consistency
By the definition of g_ij, it is apparent that F is a weighted design matrix. That is, F is computed by multiplying row i of X by the corresponding derivative of f at t_i^{θ_0}, h(t_i^{θ_0}), for i = 1, 2, · · · , n. When f is flat at t_i^{θ_0}, the data point does not contain much information on θ_0, and the weight placed on row i is small; on the other hand, when f is steep at t_i^{θ_0}, the data point is informative, and the corresponding row is scaled by a larger weight. In the special case of the linear models, F reduces to X.
However, θ_0 is not free of the identifiability constraint, and only the last p − 1 elements of θ_0 are of interest. Consequently, we consider the matrix F_0 defined below. Here, F_0 depends on the design X, the true link function f, and the true index θ_0. To some extent, F_0 can be treated as the design matrix in the single index model, and it plays a crucial role in the subsequent analysis. For a given identifiability constraint, we can express θ_1 as a function of the remaining (p − 1) components of θ, that is, θ_1 = C_1(θ_2, . . . , θ_p). Let J be the corresponding p × (p − 1) Jacobian matrix of θ with respect to (θ_2, . . . , θ_p), evaluated at θ_0. It then follows that F_0 = F J. For simplicity, we omit the dependence of J on the identifiability constraint in the notation. The forms of F_0 under the two popular identifiability constraints are illustrated in Section 1 of the Supplementary Material [32]. Notice that F_0 is essentially a scaled and adjusted version of the design matrix X.
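The weighting idea behind F can be sketched as follows; this is a minimal illustration of ours (the Jacobian adjustment J is omitted), and the linear-link check mirrors the remark that F reduces to X in the linear case.

```python
import numpy as np

def weighted_design(X, theta0, f_prime):
    """F with rows f'(x_i^T theta0) * x_i: observations where the link
    is steep get a large weight, flat regions contribute little."""
    t = X @ theta0
    return f_prime(t)[:, None] * X

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))
theta0 = np.array([1.0, 0.0, 0.0, 0.0])
F_lin = weighted_design(X, theta0, np.ones_like)  # linear link: f' = 1
F_sin = weighted_design(X, theta0, np.cos)        # f(t) = sin(t), f' = cos
```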

Selection consistency
As detailed earlier, we use the cubic spline function to estimate the true link function f. For any θ, let Γ(θ) be the cubic spline space defined according to Section 2.1, and let P_θ denote the projection matrix onto Γ(θ). Then, for any given θ, we can define F̃_θ analogously to F (with entries indexed by i = 1, 2, . . . , n and j = 2, 3, . . . , p), and set C̃_θ = (1/n) F̃_θ^T F̃_θ. For succinctness, we write F̃_{θ_0} and C̃_{θ_0} as F̃_0 and C̃_0. Different from F_0, F̃_0 not only depends on X, f and θ_0; it also relies on the spline approximation of the link function. We decompose C̃_0 into four blocks in the same way we decompose C_0. With the notation introduced above, we can impose the following crucial conditions on C̃_0 to establish the selection consistency of BS-SIM.
Note that C̃_0 is related to the spline estimator of f, and thus it depends on the number and the location of the knots. That means the conditions given above are not free of the sample size n. On the other hand, F̃_0 is a scaled and adjusted version of the design matrix X. Hence, the Irrepresentable Conditions for BS-SIM are similar to the conditions in [9] in the sense that the above conditions replace the design matrix X in [9] with F̃_0. With the Irrepresentable Conditions for BS-SIM, we are ready to state our next theorem. Theorem 3 characterizes the behaviour of BS-SIM in recovering the true model. It suggests that, if the Irrepresentable Conditions for BS-SIM hold, then the probability that BS-SIM identifies the true model converges to 1 exponentially. It can be easily shown that ρ'_a(0+) = 1 + a^{−1}. As noted by [9], the conditions for SICA to identify the true model in the linear regression become less restrictive as a decreases, at the sacrifice of computational convenience. This statement also holds in the context of the single index model. That means, with smaller a, the Irrepresentable Conditions for BS-SIM are less restrictive, but it is harder to find φ̂. As pointed out earlier, LASSO is a limiting case of the SICA penalty. Therefore, it is expected that the BL-SIM estimator φ̂_L possesses similar properties to those given in Theorem 3. To present the properties of φ̂_L, we start with the following assumption on C̃_0.

Condition 2 (Irrepresentable Condition for BL-SIM). There exists a positive constant vector η such that the following inequality holds component-wise:

|C̃_{0,21} (C̃_{0,11})^{−1} sign(φ_{0,(1)})| ≤ 1_{p−q} − η,

where C̃_{0,11} and C̃_{0,21} denote the blocks of C̃_0 corresponding to the relevant and irrelevant predictors, φ_{0,(1)} collects the nonzero components of φ_0, and 1_{p−q} denotes a vector of 1's of length p − q.
Again, the Irrepresentable Condition for BL-SIM resembles the Irrepresentable Condition in [29], and the major difference is that the Irrepresentable Condition for BL-SIM replaces X withF 0 . Theorem 4 demonstrates that with the Irrepresentable Condition for BL-SIM imposed, the probability that BL-SIM selects the true model approaches 1 exponentially. Consistent with the monotonicity of the restrictiveness of the conditions, the Irrepresentable Condition for BL-SIM is more restrictive than the Irrepresentable Conditions for BS-SIM with finite a. This observation is also in line with that in the linear regression scenario, and it implies that BS-SIM may be able to recover the true model when BL-SIM fails.
Recall that the conditions presented previously rely on the sample size n. In what follows, we show that if C̃_0 satisfies a certain regularity condition, the selection consistency of the proposed methods can be achieved under conditions that are independent of n. From [23], the entries of F̃_0 converge to those of F_0 uniformly over j = 2, 3, . . . , p at a rate governed by h, where h = 1/(N + 1) is the bandwidth for the cubic B-spline functions. This means that (F̃_0)_i → (F_0)_i as n → ∞ for any i, where (·)_i denotes the ith row of a matrix. Based on this result, the following regularity condition can be imposed: C̃_0 → C as n → ∞, for some matrix C free of n. We decompose C into four blocks in the same way we decompose C_0. Next, we show that if the Irrepresentable Conditions on C are imposed, the proposed methods can consistently select the true variables.

Condition 3 (Limiting Irrepresentable Conditions for BS-SIM). C satisfies that
for some constants L and L_3 ∈ (0, ∞). Corollary 5 suggests that under the corresponding Limiting Irrepresentable Conditions, BS-SIM and BL-SIM can consistently recover the true model. On the other hand, as with the statements in the last subsection, the Limiting Irrepresentable Conditions for BS-SIM become less restrictive as a decreases. As a result, the Limiting Irrepresentable Condition for BL-SIM is more restrictive than those for BS-SIM with finite a. The proofs of the theorems and the corollaries can be found in Section 2 of the Supplementary Material [32]. Remark 1. In the technical proofs provided in Section 2 of the Supplementary Material [32], we use the identifiability constraint that ‖θ_0‖_2 = 1 and θ_{0,1} > 0. Nevertheless, the properties presented above should hold for any reasonable constraint, and the proofs for other constraints can be derived without much difficulty.

Simulation results
In this section, we present the results from five simulation studies. We demonstrate that the proposed regularization approach is indeed beneficial in several aspects. We also examine the impact of the tuning parameter a on the performance of the resulting estimator, and point out a reasonable choice of a in practice. Subsequently, we compare the performance of the proposed methods to other existing methods for moderate to large p. For the comparison when p is small, we refer to Section 3 of the Supplementary Material [32]. The last simulation example concerns the impact that the Irrepresentable Condition has on our proposed method's ability to recover the true model. For succinctness, we use V1 and V2 to denote Identifiability Constraint 1 and Identifiability Constraint 2 in this section, respectively. For the link function, we consider three models, referred to as Model 1, Model 2, and Model 3, respectively. Furthermore, let Σ be a p-by-p matrix with diagonal elements equal to 1 and the off-diagonal element in the kth row and lth column equal to ρ_kl. Each x_i is sampled from N(0, Σ), and the errors ε_i's are independently sampled from N(0, 1). We examine three forms of Σ, and denote the corresponding correlation structures as COR1, COR2, and COR3, respectively.
For the first four examples, four metrics are used to assess the performance of an estimator: Angle, False Positive Rate (FPR), True Positive Rate (TPR) and Computing Time (Time). Angle is defined as Angle = arccos(θ_0^T θ̂), where θ_0 is the true index and θ̂ is an estimate, both standardized to have unit norm. FPR is defined as the ratio of the number of falsely identified predictors to the total number of identified predictors. TPR is the ratio of the number of correctly identified predictors to the total number of truly relevant predictors. Finally, Time is the average time (in seconds) needed to obtain the estimate for one data set. In Examples 2-4, we search for the best estimate on a dense grid of λ, and thus Time represents the total amount of time consumed to search the whole grid and yield the final estimate. In Example 1, by contrast, Time refers to the amount of time used to find the estimate for a particular λ. In the tables presented in this section, the best performance on each metric is highlighted.

Example 1. This example compares the performance of the proposed estimator to that of the unpenalized estimator. We consider a moderate dimension p = 70 with q = 8 and θ_0 = (2.0, −1.0, 0.5, 1.0, −1.5, 1.0, −0.3, 1.2, 0, · · · , 0)^T. 100 samples of size n = 100 are generated from Model 1 with COR1. The coordinate descent algorithm described in Section 2.2 is used to implement BS-SIM with a = 0.1. The tuning parameter λ is chosen by three criteria, denoted as logBIC, logGIC1, and logGIC2, respectively. They correspond to three choices of k_n for logGIC defined in Section 2.3, namely k_n^0 = log(n), k_n^1 = log(log n) log(p), and k_n^2 = log(p)√(log n). Our method with λ = 0 is also applied to obtain the unpenalized estimate of θ_0. In this example, only V2 is used. Table 1 shows the comparison results on the four aforementioned metrics.
In terms of estimation accuracy and computing efficiency, both the BL-SIM estimators and the BS-SIM estimators are considerably better than the unpenalized estimator. This is a strong sign that the proposed regularization approach substantially helps to produce a more accurate estimator efficiently. Comparing the two proposed estimators, the BS-SIM estimators slightly outperform the BL-SIM estimators in estimation, while in terms of variable selection consistency, the BS-SIM estimators are dramatically better. More specifically, the BL-SIM estimators have a more than 3-fold higher average FPR, indicating that applying LASSO is more likely to lead to an overfitted model. In terms of computational efficiency, BS-SIM is slightly faster than BL-SIM. As for the comparison among the three BS-SIM estimators, the estimator using logBIC has a noticeably higher average FPR than the estimators with λ chosen by logGIC1 and logGIC2. Since the number of predictors is not that small (p = 70) in this example, this observation on FPR is consistent with the fact that logBIC yields an overfitted model when the dimension p increases. The performance of the two penalized estimators with λ chosen by logGIC1 and logGIC2 is similar in terms of the four metrics.

Example 2. This example examines the impact of the tuning parameter a on the performance of the BS-SIM estimator. The comparison results are shown in Table 2. It can be observed that as a increases, both Angle and FPR first decrease and then increase. Furthermore, when a continues to increase, the performance of the BS-SIM estimator approaches that of the BL-SIM estimator. In theory, the variable selection performance of the BS-SIM estimator should improve as a decreases. Nevertheless, the pattern shown in Table 2 implies that there exists a certain computational difficulty in finding a consistent estimate when a is extremely small. On the other hand, the BS-SIM estimator with a = 0.1 outperforms the rest in terms of selection consistency.
When it comes to estimation accuracy, the performance of the BS-SIM estimator with a = 0.1 is also satisfactory. Therefore, we recommend using a = 0.1 in practice. For the remaining examples, we fix a at 0.1, unless otherwise specified.

Example 3. This example illustrates the performance of the proposed estimator for moderate p. We focus on the comparison between our method and other existing methods. In this example, we implement the proposed BS-SIM method with a = 0.1 and the proposed BL-SIM method, as well as the SIM-LASSO method proposed by [28], the SMAVE method proposed by [24], and the MAVE method coupled with the Bridge penalty proposed by [25]. The last method is denoted as SIM-Bridge hereafter. For SIM-LASSO, the tuning parameter is chosen by 10-fold cross-validation; for SMAVE and SIM-Bridge, the tuning parameter is selected based on BIC, as suggested in the original papers. Moreover, all three of these methods only use V2. In this example, we let p be moderate and vary it from 50 to 70. 100 data sets of size 100 are simulated from the following settings.

Table 3. Comparison between the proposed methods and the other existing methods in the moderate dimensional scenario: Setting 1.

Tables 3-5 present the results. For both Setting 1 and Setting 2, the BS-SIM estimators outperform the rest in terms of both estimation accuracy and selection consistency. They are followed by the SIM-Bridge estimator in terms of selection performance. The other three methods do not produce satisfactory performance on variable selection, as they tend to result in overfitted models. In terms of computational efficiency, the proposed BS-SIM method is also among the best. For Setting 3, the quadratic link function is used. Since the X_i's are generated from a multivariate normal distribution, they concentrate around 0. However, the MAVE-based methods rely on local linear expansion, so they do not perform well around the origin and break down for this quadratic link function. Hence, only the results from the proposed methods are presented for this setting. It can be observed that the proposed BS-SIM method exhibits acceptable performance in each aspect, and considerably outperforms the proposed BL-SIM method. Lastly, it is also worth pointing out that satisfactory performance is maintained for the proposed methods under other combinations of model setting and correlation structure. For p = 400, the proposed method cannot produce acceptable results when there exists correlation among the predictors.
Nevertheless, with more data points, the proposed BS-SIM method can still handle this high dimensional scenario with correlation among the predictors. Here, however, we focus exclusively on COR1 and n = 100 for p = 400. The proposed BL-SIM method suffers greatly from overselection and is too time-consuming in the large p scenario, and SIM-LASSO and SMAVE break down in this example. Therefore, only the results from the proposed BS-SIM method and SIM-Bridge are presented. Since V1 imposes no restriction on the magnitude of φ, estimation with V1 becomes noticeably more unstable, and slower for some models, as p increases. Therefore, it is recommended to use V2 when p is large. Based on our simulation studies, V1 and V2 lead to comparable results under Model 1, whereas for Model 2, V2 is much preferable. As for the choice of k_n, it is recommended to use k_n^(3) = log(p) · log(n). Table 6 shows the results on the four metrics. In terms of estimation accuracy and selection consistency for Model 1 and p = 200, the proposed BS-SIM method yields reasonably accurate estimates, while SIM-Bridge does not perform well under any of the three correlation structures. For Model 2 and p = 200, comparable results on selection consistency are obtained. However, the proposed BS-SIM method produces more accurate estimates than SIM-Bridge, especially under COR3. When p = 400, SIM-Bridge fails, while the proposed BS-SIM method can still yield satisfactory results. In terms of computational cost, for the proposed BS-SIM method, it takes about 20 minutes on average to complete one run for p = 200, and less than two hours for p = 400. Considering that this amount of time encompasses the search for the optimal λ on a dense grid, this computational efficiency is still acceptable. Moreover, the proposed BS-SIM method is noticeably more efficient than SIM-Bridge in this example.

The simulation scheme is as follows.
A covariance matrix Σ is first generated from Wishart(p, p), as done in [29]. Then we generate a sample of 100 observations of X from N(0, Σ) and standardize them. 100 standardized designs are generated in this way. Next, for each generated design, we run the following simulation 100 times. In each simulation, n copies of ε_i are sampled from N(0, 0.3²), and the y_i's are calculated according to Model 1. Subsequently, the proposed method with each of the choices of a specified above is applied, and the percentage of times that the applied method identifies the true model along the solution path is recorded.
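The data-generating steps above can be sketched as follows in Python (the paper's implementation is in R). The link function, the index `theta0`, and the small dimension used here are illustrative placeholders, not the paper's actual Model 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 100  # small p for illustration only; the paper uses much larger p

# Covariance from a Wishart(p, p) distribution: Sigma = W @ W.T,
# where the columns of W are i.i.d. N(0, I_p).
W = rng.standard_normal((p, p))
Sigma = W @ W.T

# One design of n observations drawn from N(0, Sigma), then standardized column-wise.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Responses under a hypothetical sparse index and link (a stand-in for Model 1).
theta0 = np.zeros(p)
theta0[:3] = [2.0, -1.0, 0.5]
theta0 /= np.linalg.norm(theta0)
eps = rng.normal(0.0, 0.3, size=n)  # epsilon_i ~ N(0, 0.3^2)
y = np.sin(X @ theta0) + eps        # placeholder link; the paper's Model 1 differs
```

Repeating the last block 100 times per design, and the whole construction for 100 designs, reproduces the overall scheme described above.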
Since it is difficult to quantify the Irrepresentable Conditions for BS-SIM, we compute ‖η̂‖∞ instead. We first look at how the magnitude of ‖η̂‖∞ affects the performance of the proposed BL-SIM method in selecting the true model. In the two top graphs in Figure 1, the percentage of times that the true model is identified by the proposed BL-SIM method is plotted against the corresponding ‖η̂‖∞, for the two Identifiability Constraints separately. It can be observed that the percentage increases as ‖η̂‖∞ increases, for both Identifiability Constraints. The increase is sharpest around 0, as expected. In the two bottom graphs in Figure 1, the percentage of times of achieving selection consistency for the proposed BS-SIM method with a = 2 is plotted against ‖η̂‖∞, for the two Identifiability Constraints separately. The percentage for BS-SIM is clearly larger than that for BL-SIM at any ‖η̂‖∞ for both constraints. This is consistent with our expectation that BS-SIM with finite a should perform better in terms of variable selection than BL-SIM.
Next, we examine in more detail how a affects the proposed method in terms of selection consistency. The average percentages of times that the true model is selected with various choices of a are shown in Table 7. In theory, the Irrepresentable Conditions become more restrictive as a increases. Thus, it is expected that selecting the true model becomes less likely as a increases. However, as indicated in Table 7, when a gets larger, the percentage of runs in which the true model is identified first increases slightly, then decreases; and as a continues to increase, the percentage for the BS-SIM estimator approaches that for the BL-SIM estimator. This particular pattern in the performance of BS-SIM implies that for extremely small a, it is computationally slightly more difficult to find a consistent estimator, even though the Irrepresentable Conditions are relaxed. These observations on the impact of a are in line with those stated in [9].
The results in Table 7 also cast light on the role that the Identifiability Constraint plays. In most cases shown in Table 7, using V2 leads to a higher chance of recovering the true model. The difference in these chances becomes larger as a increases. This observation is consistent with the observation on the relative magnitude of ‖η̂‖∞: among the 100 designs generated above, 92% of them have a larger ‖η̂‖∞ for V2. This is probably due to the fact that the Irrepresentable Conditions for V2 contain more information than those for V1.

Real data application
We then apply the proposed BS-SIM method to the Skin Cutaneous Melanoma data downloaded from the TCGA data portal (https://tcga-data.nci.nih.gov/tcga/findArchives.htm). In the clinical data, there are in total 433 patients. Their demographic information, tumor status, vital status and survival time are recorded. In a separate set of files, the expression levels of 181 proteins are measured for 207 patients using the M.D. Anderson Reverse Phase Protein Array Core platform (http://www.mdanderson.org/education-and-research/resources-for-professionals/scientific-resources/core-facilities-and-services/functional-proteomics-rppa-core/index.html). The goal here is to study how the expression levels of the measured proteins influence the survival time of the patients. Accordingly, we retain for further analysis only those patients who failed to survive and had their protein expression levels measured. After pre-processing, we have 94 patients and the expression levels of 181 proteins. The expression levels are subsequently standardized and used as the predictors. The logarithm of the survival time is treated as the response. We apply the proposed BS-SIM method with a = 0.1 and 2 interior knots. Since we speculate that there is a relatively large number of relevant proteins, the GIC criterion introduced at the end of Section 2.3 with k_n = log(n) log log(p) is used to choose the tuning parameter λ. The behaviour of the logBIC criterion also to some extent confirms that the number of relevant proteins is relatively large, as it fails to effectively yield a reasonable model.
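The preprocessing described above amounts to two simple transformations. A minimal Python sketch (our own helper, not part of the paper's R code) is:

```python
import numpy as np

def preprocess(expr, survival_time):
    """Standardize expression levels column-wise; log-transform survival time.

    expr: (n, p) array of protein expression levels.
    survival_time: length-n array of survival times.
    Returns the standardized design matrix X and the log-survival response y.
    """
    X = (expr - expr.mean(axis=0)) / expr.std(axis=0)
    y = np.log(survival_time)
    return X, y
```

For the melanoma data, `expr` would be the 94 x 181 matrix of expression levels and `survival_time` the 94 recorded survival times.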
Out of these detected proteins, the irregular expression of p21, p27, p53, PTEN, TAZ, Notch1, Caveolin, 53BP1, TSC1, Rb and Tuberin has been shown to be related to the survival or occurrence of Skin Cutaneous Melanoma [10,15,17,8,7]. This partially demonstrates the effectiveness of the proposed method in selecting the relevant variables.

Conclusion
In this article, we propose a regularization based approach for variable selection in the single index model, named BS-SIM. It achieves simultaneous and efficient parameter estimation and variable selection for low- to high-dimensional settings. A coordinate descent algorithm is outlined to implement the BS-SIM method. The algorithm is implemented in R, and the associated code is freely available online at http://www.stat.purdue.edu/~cheng70/code.html. Extensive simulation studies are carried out to validate the proposed method and the developed algorithm. Furthermore, we show the conditions under which the BS-SIM method can consistently estimate the true index and select the true variables. These conditions generalize the conditions developed under the linear model, and are novel for the single index model.
In Section 2.3, we briefly discuss the problem of choosing the tuning parameter under different settings, and the breakdown of BIC-type methods when the sparsity condition is violated. A systematic study of tuning parameter selection for the linear model and the single index model with a finite sample would be an interesting future research topic. Furthermore, in this work, the location and the number of knots for the cubic B-spline functions are determined by a rule of thumb. There are more sophisticated methods for choosing the knots in the literature; see, for instance, [13] and [30]. Developing a novel knot placement method for BS-SIM is another interesting future research topic.

Supplementary Material
Supplementary Material to "BS-SIM: An Effective Variable Selection Method for High-dimensional Single Index Model" (doi: 10.1214/17-EJS1329SUPP; .pdf). The supplementary material contains the technical proofs and additional simulation results.