Ensembling Variable Selectors by Stability Selection for the Cox Model

As a pivotal tool to build interpretive models, variable selection plays an increasingly important role in high-dimensional data analysis. In recent years, variable selection ensembles (VSEs) have gained much interest due to their many advantages. Stability selection (Meinshausen and Bühlmann, 2010), a VSE technique based on subsampling in combination with a base algorithm like lasso, is an effective method to control false discovery rate (FDR) and to improve selection accuracy in linear regression models. By adopting lasso as a base learner, we attempt to extend stability selection to handle variable selection problems in a Cox model. According to our experience, it is crucial to set the regularization region Λ in lasso and the parameter λmin properly so that stability selection can work well. To the best of our knowledge, however, there is no literature addressing this problem in an explicit way. Therefore, we first provide a detailed procedure to specify Λ and λmin. Then, some simulated and real-world data with various censoring rates are used to examine how well stability selection performs. It is also compared with several other variable selection approaches. Experimental results demonstrate that it achieves better or competitive performance in comparison with several other popular techniques.


Introduction
Variable selection is a classical problem in statistics and has enjoyed increased attention in recent years due to a massive growth of high-dimensional data across many scientific disciplines. In modern statistical applications, the number of variables or covariates often exceeds the number of observations . In such settings, the true model is often assumed to be sparse, in the sense that only a small proportion of the variables actually relates to the response. Thus, variable selection is fundamentally important in statistical analysis of high-dimensional data. With a proper selection method and under suitable conditions, we are able to build a good model to interpret the relationship between covariates and our interested outcome more easily, to avoid overfitting in prediction and estimation, and to identify important variables for applications or further study.
For variable selection, many researchers focus on multiple linear regression models. To emphasize that variable selection methods are useful for other statistical models as well, we use a different statistical model, that is, a Cox's proportional hazards model (abbreviated as Cox model) [1], as the platform in this context. The Cox model was first proposed for exploring the relationship between the survival of a patient and some explanatory variables. As a matter of fact, the Cox model [2,3] nowadays is one of the most commonly used forms in semiparametric models and it can not only solve the issues of censored data, but also analyze the influence of various factors on survival time simultaneously. A brief mathematical description of the Cox model is given as follows.
Suppose that there are observations {( , x , )} =1 of survival data. For an individual , denotes its survival time and x = ( 1 , 2 , . . . , ) stands for the observed data for the covariates. At the same time, ∈ {0, 1} is a censoring indicator variable, where = 0 means that is rightcensored. Let ℎ( ) be the hazard rate at time ; the generic form of a Cox's proportional hazards model can be expressed as 2 Computational Intelligence and Neuroscience where = ( 1 , 2 , . . . , ) is a -dimensional unknown coefficient vector and ℎ 0 ( ) is the baseline hazard function, that is, the hazard function at time when all covariates take value zero. In general, can be estimated by maximizing partial likelihood function. For convenience, we assume ℎ 0 ( ) = 1 below.
Like linear regression models, traditional methods such as subset selection [4,5], forward selection, backward elimination, and a combination of both are among the most common methods for selecting variables in a Cox model. However, these methods will have difficulty in computation when faced with high-dimensional data. Therefore, some other methods have been proposed to overcome this problem. After lasso (least absolute shrinkage and selection operator) [6] was first proposed for linear regression models, Tibshirani [7] extended it to the Cox model. Later on, many scholars [2,3,[8][9][10][11][12] developed some penalized shrinkage techniques like SCAD [13] and adaptive lasso [14] specially for Cox models.
Although the above-mentioned variable selection methods have been shown to be successful in theoretical properties and numerous experiments, their performance strongly depends on the proper setup of the tuning parameter. On the other hand, these approaches may be unstable (especially in the situation of high-dimensional data). Breiman [15] proved that uncertainty can lead to more prediction loss. What is more important, small changes in data can result in that the same method selects different models. This makes the subsequent interpretation difficult and unreliable. In order to obtain more stable, accurate, and reliable variable selection results, ensemble learning [16,17] is one kind of extremely potential technologies.
As a hot research topic in machine learning, ensemble learning is used more and more widely in many fields of natural science and social science in last two decades. The powerful advantages of ensemble learning lie in improving the generalization capacity and enhancing robustness in the process of learning. Its main idea is to obtain a number of different base learning machines by running some simple learning algorithm and then combine these base machines into an ensemble learning machine in some way. Generally, the base learning machines should have strong generalization capability on one side, and they should also complement each other on the other hand.
The ensemble approach for statistical modeling was first proposed for solving prediction problems, aiming to maximize prediction accuracy. Inspired by this idea, Zhu and Chipman [18] applied bagging ensemble approach to handle variable selection problems, aiming at maximizing selection accuracy. Meanwhile, they pointed out that there is much difference between "prediction ensembles" (PEs) and "variable selection ensembles" (VSEs). More recently, ensemble learning methods have attracted more attention on coping with variable selection problems since they can greatly improve the selection accuracy and lessen the risk to falsely select unimportant variables and simultaneously overcome the instability of traditional methods in the high-dimensional data analysis. Because of these benefits, there are more and more researches applying ensemble learning to variable selection and putting forward some novel approaches. As far as we know, existing VSE techniques mainly include PGA (parallel genetic algorithm) [18], stability selection [19], BSS (bagged stepwise search) [20], random lasso [21], ST2E (stochastic stepwise ensemble) [22], TCSL (tilted correlation screening learning) [23], RMSA (random splitting model averaging) [24], SCCE (stochastic correlation coefficient ensemble) [25], and PST2E (pruned stochastic stepwise ensemble) [26]. It is noteworthy that these algorithms are mainly designed for handling variable selection problems in linear regression models. Only Zhu and Fan [20] investigated the performance of BSS and PGA in the Cox model.
Through analyzing these VSE techniques, it can be found that their success primarily lies in producing multiple importance measures for each predictor. By simply averaging these measures across multiple trials, the noise variables can be more reliably distinguished from the informative ones. In this process, the strength to select important variables and the diversity between the importance measures need to be preserved simultaneously [20,22]. Stability selection applies subsampling (or bootstrap) to a selection method like lasso to improve its performance. In fact, it is an extremely general ensemble learning technique for identifying important variables. Due to the characteristics of lasso, it is very efficient in high-dimensional situations. Another good property of stability selection is that it provides an effective way to control false discovery rate (FDR) in finite sample cases provided that its tuning parameters are set properly. Due to its versatility and flexibility, stability selection has been successfully applied in many domains such as gene expression analysis [24,[27][28][29]. Nevertheless, we have not found any literature about applying stability selection to a Cox model. Therefore, in this paper we would like to extend it to the situation of Cox models. At the same time, we also discuss how to set appropriate values for the involved parameters so that stability selection achieves its best performance.
The remainder of the paper is described as follows. In Section 2, the details for applying stability selection to the Cox model are described. We also provide an explicit way to set its involved parameters. In Section 3, some numerical experiments were conducted to study the impact of min on the behavior of stability selection and to compare its performance with other variable selection approaches for the Cox model. In Section 4, some real examples are analyzed to further study the effectiveness of stability selection. Finally, some conclusions are offered in Section 5.

Stability Selection Algorithm for the Cox Model
In this paper, we consider stability selection with lasso as its base learner. Lasso [6] is one of the most effective techniques to deal with high-dimensional linear regression problems with > . With respect to its application in Cox models, the core idea is to maximize the partial likelihood minus the observation failing at time . The lasso algorithm needs to maximize under the constraint ∑ =1 | | ≤ . In (2), is the set of indices, , with ≥ (i.e., the observations are at risk at time ). Equivalently, the estimate of can be obtained aŝ where is the regularization parameter which controls the trade-off between the model fitting and the coefficient shrinkage degree. At present, there are several efficient algorithms [7,30] (such as cyclical coordinate descent) to get̂in (3). We refer readers to the related literature for more details about the optimization strategy.
In applications, we need to first set a sensible region, say, Λ = [ lower , upper ], for the regularization parameter in lasso. Notice that lasso will choose all variables (i.e., full model) for ≤ lower while choosing none of the variables (i.e., null model) for ≥ upper . By taking candidate values in Λ, that is, lower = 1 < 2 < ⋅ ⋅ ⋅ < = upper , lasso generally employs 5-fold or 10-fold cross-validation to select an optimal value of , say opt . Then, the variables which have nonzero coefficient estimation under opt are determined as important variables. Although lasso with opt being specified in this way has good prediction performance, much evidence [14,19,21] has shown that it tends to choose more variables than necessary (i.e., higher FDR).
To eliminate this drawback of lasso, Meinshausen and Bühlmann [19] developed stability selection which works by choosing variables whose selection probabilities are large as important ones. In reality, the selection probability can be estimated by running lasso on multiple different sets. These sets can be obtained via subsampling from the given set. Specifically, stability selection first estimates the probability that variable ( = 1, 2, . . . , ) is important for each regularization parameter 1 , . . . , , and then takes the maximum probability over Λ = { 1 , 2 , . . . , } as the important measure for . Eventually, it selects important variables by a preset threshold thr . The detailed steps of stability selection algorithm for the Cox model are listed in Algorithm 1.
As argued by Meishausen and Bühlmann [19], the prominent advantage of stability selection is to control FDR under finite sample size and simultaneously to weaken the theoretical assumptions that are required to achieve variable selection consistency (i.e., the probability that the fitted model includes only truly important variables is tending to one when → ∞). Let be the number of falsely selected variables with stability selection; Meinshausen and Bühlmann [19] have proved that, under some mild assumptions, for arbitrary thr ∈ (1/2, 1), the expectation of satisfies where Λ represents the average number of variables selected by base learner. Roughly speaking, we can set any two parameters of Λ , thr , and ( ) and determine the remaining one according to the above inequality. For example, let ( ) ≤ 4 and thr = 0.7; then Λ can be specified as Λ = ⌈(1.6 ) 1/2 ⌉ in which ⌈ ⌉ denotes taking the smallest integer larger than or equal to . As stated in [19], thr is recommended to take value in the range of thr ∈ [0.6, 0.9] and the results tend to be similar. As far as ( ) is concerned, it can be set by users according to the level of FDR that they would like to control. In general, small ( ) means to control FDR strictly so that less noise variables are falsely included. Nevertheless, too small ( ) may cause some truly important variables omitting in the final model. On the other hand, ( ) can be larger if one can accept a little higher FDR to make sure that all important variables can be included. Regarding Λ , it should be no less than the number of truly important variables. Because we have no means to know the number of truly important variables in advance, however, one can first specify ( ) and thr and let Λ be determined automatically.
As mentioned earlier, the crucial role of stability selection is to reduce the FDR of lasso (i.e., to exclude noise variables more reliably). Intuitively, it is still difficult to identify the true sparse model if too much noise variables are falsely included every time. Thus, a minimum value of (or min ) needs to be specified for stability selection so that every time at most Λ variables are chosen when ≥ min . Subsequently, only the 's lying in the interval [ min , upper ] are taken as candidate values of to implement lasso in each trial.
According to our experience, the setting of min as well as Λ is crucial to the success of stability selection. However, we cannot find any detailed instruction in related literature [19,27,28] about how to set them. Moreover, all the existing literature related to stability selection has not discussed how to apply it in Cox models. Here, we would like to provide an explicit way to cope with this problem in the framework of Cox models. According to the proposal in [30], we can first set upper for lasso in a Cox model as in which ] . 4 Computational Intelligence and Neuroscience Input y: an × 1 response vector containing survival times for observations. : an × 1 vector containing censoring indicators for observations. X: × design matrix. Λ: regularization parameter set, i.e., Λ = { 1 , 2 , . . . , }.
thr : a pre-set threshold. Output: an index set I for selected variables.

Main process of stability selection
(1) For = 1, 2, . . . , (a) Randomly draw a subset (X ( ) , y ( ) , ( ) ) of size ⌊ /2⌋ without replacement from (X, y, ). Here, ⌊ ⌋ stands for the largest integer less than or equal to . Here, is the number of subjects (observations) at risk at time and is the set of indices, , with < (i.e., the times for which observation is still at risk). Subsequently, we can set lower = upper with = 0.05 for < and = 0.0001 for ≥ . In order to create + 1 candidate values for ∈ [ lower , upper ], we can set = upper ( lower / upper ) / for = 0, . . . , .
Next, the parameter min in stability selection can be determined by Equation (7) implies that min must be chosen to ensure that lasso selects at most Λ variables for each ∈ Λ = [ min , upper ]. Specifically, one can begin with = upper and decrease gradually until lasso detects Λ variables as important (i.e., Λ variables having nonzero coefficients). The value of obtained at this point is exactly min defined in (7). Then, only the candidate values lying in [ min , upper ] are considered as the candidate values for in lasso to execute variable selection.

Experimental Studies
With simulated data, some experiments are conducted in this section to investigate the impact of min on the behavior of stability selection in a Cox model and to compare it with several other variable selection approaches. In order to maintain consistency and comparability, we set ensemble size as 200.
Each simulation was run 100 times to estimate the evaluation of a method. To simplify notations, we abbreviated stability selection as StabSel. Regarding lasso, we made use of 10fold cross-validation to determine its optimal regularization parameter.

Simulation 1: Influence of min . Meinshausen and
Bühlmann [19] stated that the threshold value thr is a tuning parameter whose influence is small as long as it is in the range of (0.6, 0.9). According to our experience, min has more significant effect in comparison with the parameter thr . When and thr are fixed, small min will make lasso select more variables in each path. As a result, some noise variables may be falsely considered as important ones (i.e., high false positive rate). On the other hand, the noise variables can be safely filtered out by setting a large min . However, this may lead us to miss some signal variables (i.e., high false negative rate). Thus, min plays a role in controlling the trade-off between false positive rate and false negative rate of StabSel. Due to this consideration, we fixed thr = 0.6 and report results for several values of min in the first experiment.
Suppose that there are = 8 variables, x 1 , x 2 , . . . , x 8 , with each generated from the standard normal distribution (0, 1). Furthermore, the variables are correlated with ( , ) = 0.5 | − | for all ̸ = ( , = 1, . . . , 8). The response y was generated from an exponential distribution whose hazard function is where the true coefficient vector = (3, 1.5, 0, 0, 2, 0, 0, 0) . Clearly, only three variables x 1 , x 2 , x 5 are truly important Computational Intelligence and Neuroscience 5 and the remaining ones are unimportant. We took = 50 and conducted three experiments with censoring rates 0%, 20%, and 40%, respectively. For the censoring mechanism, a censoring time is generated independently and uniformly from [0, ] for each observation. If > , we replaced with and then let = 0. Here, the parameter was chosen to achieve some desired censoring rates. For example, = 45 corresponds to 20% censoring rate and = 4 corresponds to 40% censoring rate. Aiming at evaluating the performance of StabSel for a given min , we computed the selection frequency of StabSel in each case. Specifically, the selection frequency was calculated as, among 100 simulations, the minimum, median, and maximum number of times that the important and unimportant variables ( and ) are selected by StabSel, respectively. Interested readers can refer to [26] for the detailed definition of selection frequency. Table 1 summarizes the results for the cases with different centering rates.
The results in Table 1 demonstrate that StabSel using a relatively large min performs slightly better in excluding unimportant variables. However, the side effect is that it more likely misses some truly important variables. In other words, StabSel controls false discovery rate (or false positive rate) quite effectively with a relatively large min , but this will cause it to behave poorly in terms of catching important variables. To improve its selection accuracy, we must reduce min . Nevertheless, this inevitably allows more false discoveries. In practice, it is worthy of choosing an appropriate value for min depending on whether our emphasis is more on false positive rate or false negative rate. Moreover, we need to pay more attention to the tuning of min if the censoring rate is high.

Simulation 2: Performance Comparison on a Cox Model with High-Dimensional Data.
In this subsection, we concentrated on applying StabSel and lasso to a Cox model with high-dimensional data. To generate the design matrix, the following two simulated datasets were generated by following the strategy in [19]. Moreover, we created sparse regression vectors by setting = 0 for all = 1, . . . , , except for a small variable set . For all ∈ , we chose the coefficient independently and uniformly in [0, 1] and let the size = | | varying between 4 and 10. Here, we employed the method used in Section 3.1 to achieve the censoring rates 0% and 20%. Then, a Cox model was constructed by (8).
To compare the power of StabSel and lasso to ranking variables, we adopted the strategy utilized by [19], that is, focusing on the probability that variables in can be recovered correctly, where ∈ {0.1, 0.3}. For lasso, this means that there is a regularization parameter such that at least ⌈ ⌉ variables in are selected while all variables in = {1, . . . , } \ are not selected. For stability selection, it stands for the fact that ⌈ ⌉ variables with highest selection frequency are all in . In this example, we fixed the threshold value thr = 0.6 and Λ = ⌈(0.8 ) 1/2 ⌉ to determine a proper value for min .
The top two subplots in Figure 1 correspond to the situation of = 0.1 while the bottom two subplots illustrate the results for = 0.3. Notice that the latter task is more challenging than the former one. When the covariates are independent in Case 1, lasso performs satisfactorily and the advantage of StabSel is not significant. In Case 2, the dominance of StabSel over lasso to identify important variables more correctly can be clearly seen, especially when faced with censored data. In the more challenging task in which more important variables are required to be ranked ahead (i.e., = 0.3), the superiority of StabSel is more significant.
In conclusion, this experiment shows that StabSel is indeed helpful to enhance the ranking ability of lasso.
As for the variables other than x 5 , x 10 , x 15 , the coefficient is zero. Altogether, three simulation studies were conducted with censoring rates 0%, 20%, and 40%, respectively. For StabSel, we fixed thr = 0.6 and Λ = ⌈(1.6 ) 1/2 ⌉. As mentioned in Section 2, the number of variables that lasso selects in each trial should be at least larger than the number of truly important variables. Thus, we increased the factor multiplying in Λ because is small in this simulation. We compared it with traditional stepwise search as well as some VSE techniques including BSS [20], PGA [18], RSMA [24], and ST2E [22]. The parameters involved in these methods were set according to the related literature. Table 2 summarizes the selection frequencies of IV and UIV for each approach. The results demonstrate that although PGA performs better to exclude unimportant variables, it may miss some truly important variables. On the other hand, RSMA, ST2E, and StabSel can identify almost the same number of important variables; the difference only lies in the exclusion of unimportant ones. In this aspect, StabSel is observed to behave the best. As for BSS, its ability to guard against noise variables seems to be worse than the others although it works well to identify IVs.
In order to see more clearly the differences among the considered approaches, we computed the average selection rate of IV and UIV. For IV, it was computed as the selection probabilities averaged over all important variables. The metric was similarly estimated for UIV. The results are illustrated in Figure 2. The top three subplots are IVs while the bottom three ones are for UIVs. From Figure 2, we can come to some conclusions similar to those drawn from Table 2.
At the same time, we also utilized several other metrics to extensively evaluate each method. First, we computed the selection success rate [13]. Given an algorithm, it refers to the fraction of times among 100 runs that the algorithm correctly identifies the true model (i.e., the model only includes {x 5 , x 10 , x 15 }). Second, the true positive rate (TPR) and true negative rate (TNR) of each method were considered. In particular, TPR and TNR are as follows:   wherê= (̂1 , ,̂2 , , . . . ,̂, ) is the estimated coefficient vector in the th simulation. In addition, |IV| and |UIV| represent the size of IV and UIV, respectively. The method "Oracle" corresponds to fitting a Cox model with only variables x 5 , x 10 , and x 15 . Usually, a good variable selection method should produce results as close as possible to those of Oracle. It can be seen from Table 3 that stepwise method is hopeless to select variables since it can hardly find the true model. Among the VSE algorithms, StabSel always reaches the largest selection success rate, especially when the censoring rate is high. On the other hand, StabSel tends to achieve a model size closest to that of Oracle. As far as the prediction performance is concerned, StabSel almost always outperforms the other approaches.

Real-World Applications
In this section, we applied the compared VSE techniques to three real-world datasets, that is, PBC [31], Lung [32], and Rats [33]. These real datasets were taken from the R package survival. For the original PBC and lung sets, we simply ignored the observations containing missing data. In these situations, there are no means to know which variables are truly important or not. Aiming at evaluating the selection behavior of each method, we took the original variables as truly important ones (i.e., IVs). Then, some irrelevant variables were artificially added to these sets by following the strategy used in [25,34]. These irrelevant variables were generated from a uniform distribution on the interval [0, 1]. Table 4 lists the main characteristics of the used three datasets.

8
Computational Intelligence and Neuroscience  Analogous to the situation of simulation studies, the ensemble size was set as = 200. The parameters involved in each method were set similarly to those used in simulations. For each dataset, the experiment was repeated 100 times. In each replication, a training set was randomly drawn from the given set with size being specified in Table 4. The rest of observations was then used as a test set to evaluate the prediction performance measured with C-index [35] (i.e., concordance index). In particular, we applied each algorithm to the training set to perform variable selection. Based on the selected variables, the parameters in the corresponding model were estimated and the C-index was estimated on the test set. Table 5 shows the results obtained with each algorithm.
In terms of selection rate, it can be observed from Table 5 that BSS performs well to identify IVs. Nevertheless, it behaves worse to exclude UIVs. On the contrary, PGA shows the lowest selected rate of UIVs while it has the lowest selected rate of IVs. Therefore, BSS and PGA are not ideal selection methods. For the remaining methods, StabSel, RSMA, and ST2E behave similarly in identifying IVs. But when compared with StabSel, RSMA and ST2E include more irrelevant variables. In conclusion, StabSel achieves better performance on variable selection when being evaluated with selection rate.
Furthermore, the results of C-index in Table 5 reveal that the prediction performance of StabSel is competitive although it is not the best one. Furthermore, almost all ensemble methods tend to have low TPR values in these three real datasets. This is largely due to the fact that we directly consider all the original covariates as IVs among which some are actually uninformative.

Conclusions
As an ensemble method, StabSel [19] is the marriage of subsampling with a variable selection algorithm such as lasso. Due to its property of controlling false discovery rate, StabSel has a flexible manner to choose a proper amount of regularization. Another superiority of StabSel over lasso is that it requires less assumptions to achieve variable selection consistency. In this article, we extended StabSel to the Cox model. The specification of min significantly affects the performance of StabSel since it controls the balance between false positive rate and false negative rate. We provide an explicit way to set a proper value for min in the situation of Cox models. In comparison with other VSE techniques including PGA, BSS, RSMA, and ST2E, StabSel exhibits better selection ability to correctly identify important variables in a high-dimensional Cox model. At the same time, StabSel has satisfactory prediction performance. When the censoring rate is high, its advantage is even more significant. Therefore, StabSel can be considered as an alternative to explore the relationship between covariates and survival times in survival analysis.

Conflicts of Interest
The authors declare that they have no conflicts of interest.