Survival Prediction and Feature Selection in Patients with Breast Cancer Using Support Vector Regression

The Support Vector Regression (SVR) model has been broadly used for response prediction. However, few researchers have used SVR for survival analysis. In this study, a new SVR model is proposed and SVR with different kernels and the traditional Cox model are trained. The models are compared based on different performance measures. We also select the best subset of features using three feature selection methods: combination of SVR and statistical tests, univariate feature selection based on concordance index, and recursive feature elimination. The evaluations are performed using available medical datasets and also a Breast Cancer (BC) dataset consisting of 573 patients who visited the Oncology Clinic of Hamadan province in Iran. Results show that, for the BC dataset, survival time can be predicted more accurately by linear SVR than nonlinear SVR. Based on the three feature selection methods, metastasis status, progesterone receptor status, and human epidermal growth factor receptor 2 status are the best features associated to survival. Also, according to the obtained results, performance of linear and nonlinear kernels is comparable. The proposed SVR model performs similar to or slightly better than other models. Also, SVR performs similar to or better than Cox when all features are included in model.


Introduction
In the literature, different models have been proposed for survival prediction, such as Cox proportional hazard model and the accelerated failure-time model [1]. Although Cox proportional hazards model is the most popular survival model, its proportional hazards assumption may not always be realistic. Also, this model is applicable when the factors affect the outcome linearly. To overcome these issues, other models such as artificial neural networks and Support Vector Machines (SVM) are applied in the literature [2,3].
The Support Vector Regression (SVR) model has been applied extensively for response prediction [4][5][6]. However, there are few studies that use SVM for survival analysis. The traditional SVM models require complete response values, whereas survival data usually include the censored response values. Different attempts have been made to extend the standard SVM model making it applicable in survival analysis. Shivaswamy et al. [7] proposed a modified SVR algorithm for survival analysis. Ding [8] discussed possible application of SVR for censored data. Van Belle et al. [3,9] proposed a new SVR model using ranking and regression constraints. Another proposal regarding application of SVR in survival analysis was made by Khan and Bayer-Zubek [10]. The authors compared SVR with Cox for some clinical data 2 Computational and Mathematical Methods in Medicine sets. Van Belle et al. [11] proposed a new survival modeling technique based on least-squares SVR. Their proposed model was compared with different survival models on a clinical BC data set. Mahjub et al. evaluated and compared performance of different SVR models using data sets with different characteristics [12].
SVM does not include automatic feature selection. Feature selection methods for SVM are categorized into two types: filter and wrapper [13]. Wrapper methods are based on SVM, while filter methods are general and not specifically based on SVM.
When dealing with uncensored data, various studies use SVM for feature selection. Many of these studies use feature selection methods for SVM in classification problems [14][15][16][17][18]. For regression problems majority of available approaches propose filter methods [19,20]. Few studies can be found which propose a wrapper feature selection method for regression [19,21,22]. Among these studies is the Recursive Feature Elimination (RFE) method proposed by Guyon et al. [23] for gene selection in cancer diagnosis. Camps-Valls et al. [24] used SVR for prediction and analysis of antisense oligonucleotide efficacy. They applied feature selection using three methods: correlation analysis, mutual information, and SVM-RFE. Becker et al. [18] developed some feature selection methods and used SVM-RFE as a reference model for comparison.
In survival analysis, different feature selection methods are applied in the literature [1,[25][26][27][28][29][30]. However, using SVR for identifying factors associated with response is rare and to the best of our knowledge, there is no previous study in which SVR is used for evaluating subsets of features. Liu et al. [31] applied SVR with L1 penalty for survival analysis on large datasets. They identified survival-associated prognosis factors for patients with prostate cancer. In costrast to current study, they directly used SVR weights to select the reduced features. Du and Dua [32] discussed the strength of SVR in BC prognosis. They used two feature selection methods neither of which are based on SVR.
The purpose of this study is two-fold. First, we predict the survival time for patients with BC using different SVR models including all available features. We propose a new SVR model using a two-sided loss function for censored data. Performance of Cox model is compared to SVR models by means of three performance measures.
Second, for feature selection, we assess the relationship between every feature and the predicted survival time using relevant statistical tests. We further employ univariate feature selection and the SVM-RFE method to select the best subset of features. Univariate feature selection based on concordance index is proposed in current study. To the best of our knowledge, in the field of survival analysis, there are no previous studies that employ SVM-based feature selection methods. In this paper, all employed feature selection methods are based on SVR.

Materials and Methods
In this section, first the BC dataset and two available datasets are introduced. Then, different SVR models, Cox regression for censored data, and three feature selection methods are described. April 2004 to April 2011, a  historical cohort study was performed on 573 patients with  BC, who visited the Oncology Clinic in Institute of Diagnostic and Therapeutic Center orphanage of Hamadan province, in west of Iran. Inclusion criteria were as follows: (i) the patient must be woman, (ii) treated with chemotherapy and radiotherapy, (iii) has undergone one of the related surgery types: breast conservative surgery (lumpectomy), simple mastectomy, radical mastectomy, and partial removal of breast tissue (a quarter or less of the breast) (segmental mastectomy). 31 patients are not eligible and are eliminated from the study. 11 features are selected as independent variables. Table 1  Two publicly available datasets (data are available at https:// vincentarelbundock.github.io/Rdatasets/datasets.html) are also used for further analysis. The first data set was derived from one of the trials of adjuvant chemotherapy for colon cancer [33]. The dataset includes data of 929 patients. In this experiment (CD), the event under study is recurrence. The variables are as follows: type of treatment, sex, age, time from surgery to registration, extent of local spread, differentiation of tumor, number of lymph nodes, adherence to nearby organs, perforation of colon, and obstruction of colon by tumor. The second data set was obtained from Mayo Clinic trial in primary biliary cirrhosis of the liver [34]. The data set includes 312 randomized participants. There are two events for each person. In a first experiment (PT), the event is transplantation and in a second one (PD) the event is death. There are 17 variables which are as follows: stage, treatment, alkaline phosphatase, edema, age, sex, albumin, bilirubin, cholesterol, aspartate aminotransferase, triglycerides, urine copper, presence of hepatomegaly or enlarged liver, platelet count, standardized blood clotting time, presence of ascites, and blood vessel malformations in the skin.

Dataset Description. From
Some artificial experiments were also conducted to evaluate the effect of number of features included in the model. These experiments are generated with different number of features: 10, 20, . . . , 120. 50 datasets with 200 training and 500 test observations are generated for each feature value. Features values are simulated from a standard normal distribution. Similar methods of dataset generation have been previously applied in the literature [9]. The event time is exponentially distributed with a mean equal to , where is the feature vector and is the weight vector which is zero for half of features and is generated from a standard normal distribution for the rest of the features [31,35,36]. The censoring time is also exponentially distributed with a mean equal to , where determines the censoring rate. Other studies applied similar methods of dataset simulation [35,36]. denotes a -dimensional independent variable vector, is survival time, and is the censorship status. If an event occurs is 1, and if the observation is right censored is 0. In standard SVR, the prognostic index, that is the model prediction, is formulated as where ( ) denotes a linear combination of a transformation of the variables ( ) and is a constant. To estimate the prediction, SVR is formulated as an optimization problem and loss function is minimized subject to some constraints [9]. Shivaswamy et al. [7] extended the SVR model for censored data by modifying the constraints of standard SVR. In this paper, the standard SVR model for censored data is named model 1 and formulated as follows.

Description of the Formulas Used in SVR Models
Model 2 min , , , * , SVR-MRL min , , , * , , subject to ∀ = 1, . . . , , where and are positive regularization constants, , * , and are slack variables, and is sample size. The prognostic index for a new point * is computed aŝ where and * indicate the Lagrange multipliers. ( ) ( ) can be formulated as a positive definite kernel: Commonly used kernels are linear, polynomial, and RBF [9]. Recently, an additive kernel has been used for clinical data defined as ( , ) = ∑ =1 ( , ) in which (⋅, ⋅) is separately defined for continuous and categorical variables [9,37].

Survival-SVR Using Ranking and Regression Constraints
(Model 2). The SVR model proposed by Van Belle et al. [3,9] which includes both ranking and regression constraints is called model 2 in this paper and is detailed in "Description of the Formulas Used in SVR Models". Recall that the previous model (model 1) includes only regression constraints. In model 2, comparable pairs of observations are identified. A data pair is defined to be comparable whenever the order of their event times is known. Data in a pair are comparable if both of them are uncensored, or only one is censored with the censoring time being later than the event time. The number of comparisons is reduced by comparing each observation with its comparable neighbor that has the largest survival time smaller than , which is indicated with ( ) [9].

A New SVR Model Using Mean Residual Lifetime.
The survival SVR models, discussed in previous subsections, uses a one-sided loss function for errors arising from prediction of censored observations. We propose a new SVR model which uses a two-sided loss function. This model assumes that the event time for a censored observation is equal to sum of Computational and Mathematical Methods in Medicine 5 its censoring time and mean residual lifetime (MRL). For individuals of age , MRL measures their expected remaining lifetime using the following formula [1]: where ( ) is the survival function. Kaplan-Meier estimator is a standard estimator of the survival function which is used in the current study. Suppose the th individual is censored and shows her/his censoring time. Therefore, it is only known that she/he is alive until . Since MRL measures her/his expected remaining lifetime from onwards, event time for this individual can be estimated using sum of censoring time and MRL. Thus, the model is also penalized for censored observations when they are predicted to be greater than this sum. In this paper, this model is called SSVR-MRL and is formulated in "Description of the Formulas Used in SVR Models".
The parameters of SVR models described in this subsection and previous subsections are tuned using the three-fold cross-validation criterion. We compare different SVR models with the Cox model [1].
The different survival models are compared using three performance measures: concordance index (c-index) [3,7,9], log rank test 2 statistic [3,9,37], and hazard ratio [3,9,10]. Standard SVR (model 1) and SVR with ranking and regression constraints (model 2) based on different kernels are trained on all datasets. In each experiment 2/3 of the dataset is used as the training set and the rest is used as the test set. Each experiment is repeated 100 times with random division of training and testing subsets. The training set is used for estimating different models. The performance of models is evaluated based on the test data.

Feature
Selection. The first feature selection method proposed in current study is not a feature subset selection method. Considering that the prognostic index in SVR regression is directly correlated to survival time the prognostic index is calculated for all patients and the relationship between this index and each feature is assessed using a relevant statistical test.
The second feature selection method used in this study is univariate feature selection, which employs a univariate standard linear SVR for evaluating the relationship between each feature and survival time. After implementing experiments for all features, they are arranged in increasing order of c-index. Then, we pick out the p top ranked variables and include them in a linear standard SVR model. In the BC dataset, to select the best value of , both SVR and Cox models are fitted for every value of (1 − 11), and a subset with highest performance is selected.
The SVM-RFE algorithm, proposed by Guyon et al. [23], is also used for feature selection. This method identifies a subset with size among m features ( < ) which maximizes the performance measures of model. In this algorithm, the square of feature weight values in linear SVR was used as the ranking criterion [13,23].

Results
All reported performance measures are obtained using the median performance on 100 randomizations between training and test sets for prediction and 50 randomizations for feature selection. All SVR models are implemented in Matlab using the Mosek optimization toolbox for Matlab and Yalmip toolbox. Also we use "R", Version 3.1.2, for running the Cox model and calculating some performance measures.

Prediction of Survival.
All measures for the BC dataset indicated that SVR with linear kernel outperformed SVR with nonlinear kernels. The linear SVR models outperformed other models. Using the Wilcoxon rank sum test, statistically significant differences between linear SVR (model 1) and other models are indicated in Table 2. SVR-MRL slightly outperformed linear SVR (model 1), but differences of performances between the two models were not significant.
As the linear standard SVR model significantly outperformed other models, the prognostic indices were obtained based on this model. By comparing the mean value of prognostic index of subgroups of variables, the following results were obtained ( Table 1). The survival time decreased as the age increased. The patients who developed metastasis survived less than other patients. The mean of prognostic index in subgroup with smaller tumor size was higher than the mean of subgroup with larger tumor size. The relationship between other variables and survival time can be evaluated similarly.
The results of CD experiment showed that SVR-MRL and SVR with clinical kernel outperformed other models but the difference between their performances and linear SVR (model 1) was not significant. In the PD experiment, SVR-MRL and SVR with polynomial kernel slightly outperformed other models but their performance differences with Linear SVR were not significant. Linear SVR significantly outperformed Cox. In the PT experiment, performances of all models were almost similar except that Cox and SVR with clinical kernel performed significantly worse than linear SVR.

Feature Selection.
In the BC dataset, based on the first feature selection method, the relationship between the prognostic index and each feature was evaluated and values of the tests are presented in Table 1. Seven variables were significant ( value < 0.05). For univariate feature selection, Figure 1 indicates that the subset with five features has the best performance. In SVR-RFE method, eleven replications were implemented for finding all subset of features. According to the performance measures of subsets, the subset with three features was selected as the best. In this figure, it is clearly shown that when the model involves large number of features, SVR outperforms Cox.
In the CD experiment, RFE and univariate methods selected subsets with six and five features respectively, which had five common features (Figure 1). In PT experiment, univariate method selected the subset of five features and RFE method selected the subset of three features among which two features were similar (Figure 2). In the PD experiment, univariate method selected the subset with eight features. RFE selected a subset with five features all of which were observed in the subset selected by RFE. However, in these experiments, the best subset may be not unified. The first feature selection method found most features as significantly associated with the survival time. The selected features had noticeable overlap with two other methods. It seemed that in subsets with large number of features, SVR outperformed Cox. The results of simulated datasets also indicated similar results so that when the number of features was more than 60, SVR significantly outperformed Cox (Table 3).

Discussion
The results indicated that, in the BC dataset, linear SVR models outperformed SVR with nonlinear kernel. For other      datasets, linear SVR and SVR with nonlinear kernels were comparable. The SVR-MRL model performed similar to or slightly better than other models. This model is based on additivity assumption. The results of this study indicated that SVR model performed similar to or better than Cox when all features were included in model. When feature selection was used, the performance of SVR and Cox was comparable. Although Cox model is based on proportional hazards assumption, an extension of the model was proposed for time-dependent variables when this assumption is not satisfied [1,38,39]. However, the additivity of the proposed MRL-method is also a restriction. Using BC, PT, and PD datasets, the results showed that when the number of included features in model was large, SVR outperformed the Cox model. The results of simulated datasets also indicated that, in models with large number of features, SVR significantly outperformed the Cox model. Du and Dua [32], also using a breast cancer dataset, indicated that SVR performs better than Cox on initial dataset and performs similar to Cox when feature selection is conducted. Van Belle et al. [3,9] also indicated that the SVR model outperforms the Cox model for high-dimensional data, while for clinical data the models have similar performance.
The experiments indicated that the difference between the simple standard SVR model (model 1) and the complicated SVR model (model 2) was not significant. Using five clinical data sets, Van Belle et al. [9] also found that the differences between two models are not significant.
The results of BC dataset indicated that feature selection methods selected subsets of features with different sizes. However, they were subsets of each other. The three features selected by all methods were as follows: metastasis status, progesterone receptor (PR) status, and human epidermal growth factor receptor 2 (HER2) status. The results indicated that the patients that developed metastasis had shorter life time compared to patients who did not develop metastasis. Some other studies also indicated similar results [25,27,40,41]. Also the results indicated patients with negative HER2 status survived longer than patients with positive HER2 status, which is consistent with results of some previous studies [40,41]. In addition, the present study yielded that patients with positive PR status had less survival time than the patients with negative PR status. Some other studies indicated contrary result [40,41] which may be due to the missing values for this variable. However, Faradmal et al. [28] presented similar result for the PR status.
There are other techniques for feature selection. Some studies used statistical characteristics such as correlation coefficients or Fisher score for feature ranking [15,24,32]. These criteria are simple, neither being based on training algorithm. In the current study, square of feature weight is used as the ranking criterion which is based on SVR. This criterion was previously applied in other studies [13,23]. Chang and Lin [15] also ranked each feature by considering how the performance (accuracy) is affected without that. Their method is more time-consuming compared to the feature ranking method used in our study. Also, they found that feature ranking using weights of linear SVR outperforms the feature expelling method in terms of the final model performance. There are some techniques which ranked features based on their influence on the decision hyperplane and were able to also use nonlinear kernels for feature selection [14,16].
In this study, subset selection methods were used to assess subsets of variables according to their performance in the SVR model. These methods are often criticized due to requiring massive amounts of computation. However, they are robust against overfitting [42]. Also, the better performance of SVM-RFE for feature selection compared with correlation coefficient and mutual information was shown in [23,24].
Some feature selection methods incorporate variable selection as part of the training using a penalty function [18,31]. Becker et al. [18] proposed a combination of SCAD and ridge penalties and found it being a robust tool for feature selection. Most of the mentioned techniques were applied for classification problems. It is suggested that, for future studies, these techniques are extended to survival regression models and different feature selection methods are used, evaluated, and compared.
To the best of our knowledge, there is no previous study employing feature selection based on SVR for survival analysis. The univariate method used in this study is a performance-based method which has the advantage of being applicable to all training algorithms. As opposed to this method, SVR-RFE takes into account possible correlation between the variables. For the BC dataset, SVR-RFE method yielded 11 subsets of features one of which was exactly the same as the subset obtained by the first method used in this study. RFE-SVR is an iterative procedure which requires much time for feature selection but the proposed feature selection method is trained once and the results can simultaneously be used for two purposes, prediction, and feature selection. SVR models have some limitations, as well. They require more time for tuning the parameters of the model and the required time for analysis increases with increasing the number of parameters of the model.

Conclusion
The results for BC data indicated that SVR with linear kernel outperformed SVR with nonlinear kernels. For other datasets, linear SVR and SVR with nonlinear kernels were comparable. Performance of the model improved slightly but not significantly by applying a two-sided loss function. However, the additivity of this model is a restriction.
SVR performed similar or better than Cox when all features were included in model. When feature selection was used, performance of two models was comparable. The result of simulated datasets indicated that when the number of features included in model was large, SVR model significantly outperformed Cox.
Univariate feature selection based on concordance index was proposed in current study which has the advantage of being applicable to all training algorithms. Another method (a combination of SVR and statistical tests) in contrast to univariate and RFE methods is not iterative and can be used for pilot feature selection. For all datasets, the employed methods selected subsets of features with different size, but with high overlaps. Based on the three feature selection methods, metastasis status, progesterone receptor status, and human epidermal growth factor receptor 2 status were the best features associated to survival.

Disclosure
The current study is a part of Ph.D. thesis in biostatistics field.