JCDSA: a joint covariate detection tool for survival analysis on tumor expression profiles

Background Survival analysis on tumor expression profiles has always been a key issue for subsequent biological experimental validation. It is crucial how to select features which closely correspond to survival time. Furthermore, it is important how to select features which best discriminate between low-risk and high-risk group of patients. Common features derived from the two aspects may provide variable candidates for prognosis of cancer. Results Based on the provided two-step feature selection strategy, we develop a joint covariate detection tool for survival analysis on tumor expression profiles. Significant features, which are not only consistent with survival time but also associated with the categories of patients with different survival risks, are chosen. Using the miRNA expression data (Level 3) of 548 patients with glioblastoma multiforme (GBM) as an example, miRNA candidates for prognosis of cancer are selected. The reliability of selected miRNAs using this tool is demonstrated by 100 simulations. Furthermore, It is discovered that significant covariates are not directly composed of individually significant variables. Conclusions Joint covariate detection provides a viewpoint for selecting variables which are not individually but jointly significant. Besides, it helps to select features which are not only consistent with survival time but also associated with prognosis risk. The software is available at http://bio-nefu.com/resource/jcdsa.


Background
Due to the limited effectiveness of current clinical diagnoses, expression profiles are utilized for informing variables, which are not only associated with the categories of patients with different survival risks but also consistent with survival time [1]. Commonly, Cox proportional hazards regression analysis is used to seek relevant variables considering the continuity of the patients' survival outcomes with right censoring [2]. As to small sample data with high dimension, Cox proportional hazards regression has to be combined with methods using dimension reduction or shrinkage such as partial least squares [3] and principal component analysis [4]. However, these approaches only provide a combination of variables. Besides, treestructured survival analysis [5], random survival forests [6] and that associated with hazards regression [7] are proposed for selection of features associated with survival *Correspondence: zhaoxudong@nefu.edu.cn 1 College of Information and Computer Engineering, Northeast Forestry University, No.26 Hexing Road, 150001 Harbin, China Full list of author information is available at the end of the article outcomes. Anyway, these top-down strategies provide so many variable candidates that the real features which may reveal the possible molecular cause of different survival risks are inevitably submerged.
In contrast, univariable hazards regression analyses have been placed firmly in the mainstream. Bottom-up strategies with different constraints such as least-angle regression [8] and sparse kernel [9] are utilized for providing variables associated with survival time. To the best of our knowledge, we are the first to present joint covariate detection [1] that combines significant variables consistent with survival time and associated with the categories of patients. Other than individually significant variables, we concentrate on bottom-up enumeration of feature tuples, each component of which is either individually significant or not. This thought is inspired by Integrative Hypothesis Testing [10], which is used for selecting features differentially expressed between different groups of patients. Unlike Integrative Hypothesis Testing, joint covariate detection is faced with continuous survival time other than labels representing different categories of patients.
In this paper, we further divide the provided feature selection into two steps, i.e., selection of variables associated with survival outcomes and further feature selection for discrimination between patients with different survival risks. In addition, we develop a joint covariate detection tool for survival analysis on tumor expression profiles (i.e. JCDSA), which helps to conveniently select significant features either on a cluster or a workstation, even on a personal computer. Matlab R2012b and Python 3  Compared with the prevailing method named as random survival forests (i.e. RSF), JCDSA shows better experimental results, which demonstrates the effectiveness of our method.

Implementation
In order to elucidate joint covariate detection in brief, a schematic diagram is illustrated in Fig. 1 (Notations: x (i) and β denote the expression levels of sample i and the regression coefficients of the detected variables, respectively. The summation in the denominator is over all subjects in the risk set at ordered survival time t (i) , denoted by R(t (i) ). z 0 k denotes a null statistics by a random rearrangement of survival outcomes. The estimator of the expected number of deaths in high-risk group is denoted byê 1i , expressed asê 1i = n 1i d i n i , where n i and d i represent the number at risk and of deaths at the observation of ordered survival time t (i) , n 1i denotes the number at risk in high-risk group. The estimator of the variance of d 1i on the hypergeometric distribution is defined asv 1i = , where n 0i denotes the number at risk in lowrisk group. Q 0 r denotes a null statistics by a random rearrangement of survival outcomes). Input data is considered as expression profiles with survival time and censoring states of patients. Output data refers to selected features. Joint covariate detection corresponds to two-step feature selection, i.e., selection of features associated with survival outcomes and selection of features for discriminating between two risk groups.

Features associated with survival outcomes
We first consider to select features associated with survival time. A bottom-up enumeration on k-tuple with k variables is made. As to each k-tuple, Cox proportional hazards regression analysis [2] is introduced. By making the maximum partial likelihood estimation on the partial The k-tuple with each component corresponding to a significant p value is regarded as a candidate feature associated with survival outcomes. More details can be seen in [1].

Features for discriminating between two risk groups
We then intend to select features for discriminating between low-risk and high-risk group of patients, which conforms to doctors' daily decision making process. As to each patient, a risk score which is the linear portion of the expression values using the Cox regression coefficients is calculated. A preassigned risk score is utilized as a cut-off value for stratification between high-risk and low-risk group of patients. Log-rank test is made. Furthermore, a permutation test is presented on each tuple, which has been selected to be associated with survival outcomes. The k-tuple with a significant p value is regarded as a candidate feature for discriminating between two risk groups. More details can be also seen in [1].

Brief overview of the software
Our software, which is implemented in Matlab R2012b or other later versions, can work on different computational platforms (e.g., a cluster, a workstation, even a personal computer). Therefore, it contains two parts, i.e., client and server. Selection of features associated with survival outcomes is accomplished by two Matlab m-files (i.e., '/Client/S1_feature_selection.m' and '/Server/S1_feature_selection_on_server.m'). A further selection of features for stratification of patients is fulfilled by a Matlab m-file 'Client/S2_plot_draw.m' . If this program is implemented on a workstation or a personal computer, only the client part is needed. That is to say, users only need to concentrate on two GUIs (i.e., '/Client/S1_feature_selection.m' and 'Client/S2_plot_draw.m') on the client part. Otherwise, the server part is also in demand. Data communications and environment configurations are actualized using Python 3. More details can be seen in the user's guide on the website: http://bio-nefu.com/resource/jcdsa.

Results
According to the presented two-step feature selection strategy, we first consider selecting features associated with survival outcomes. Figure 2 illustrates this step. Cancer type can be selected or input by clicking the right side arrow if it is not supported in the type list. Other selections in the setting frame can be also made, details of which are listed in user's guide. Before running at full speed, JCDSA estimates the finishing seconds which helps to make a further decision. After its completion, the result which records p value(s) of each k-tuple is stored in '/Client/Data/S1' . Figure 3 further illustrates the step of selecting features associated with survival outcomes (i.e., Step 2.1). By setting the threshold of the p value corresponding to permutation test on Wald statistic, features associated with survival outcomes are selected.
Using the miRNA expression data (Level 3) of 548 patients with GBM as an example, individually significant miRNAs and significant miRNAs in pairs are listed in Tables 1 and 2, respectively. After making careful comparisons between Tables 1 and 2, we conclude that significant features in high dimension may not be composed of individually significant miRNAs. Taking the significant pair miR-10b and miR-222 as an example, miR-10b is not listed in Table 1, which shows that it is not individually significant. This phenomenon reveals the advantage of using joint covariate detection.
Together, Figs. 3, 4 and 5 illustrate the feature selection step for discriminating between two risk groups. In Fig. 3, after choosing the files that represent the original data and the result corresponding to significant features associated with survival time at Step 2.2, the software runs to Step 2.3 and Step 2.4.
As shown in Fig. 4, Kaplan-Meier analysis with parameters derived from log-rank test and Harrell's concordance index is made for further selection of features, which helps to discriminate between high-risk and low-risk group of patients. Meanwhile, the result of risk score analysis is illustrated in Fig. 5. Correspondingly, results which refer to significant features are stored in 'Client/Data/S2/S2_3' and 'Client/Data/S2/S2_4' , respectively.
In order to show the effectiveness of our method, we implemented the prevailing method named as random survival forests (i.e. RSF) on the miRNA expression data (Level 3) of 548 patients with GBM for comparison. 1000 binary survival trees were made, with each terminal node containing a minimum of d 0 =10 unique deaths. We made 1000 permutations on each variable, and obtained the variable importance (VIMP) for each variable. The result is listed in Table 3.
After making careful comparisons between Tables 2 and 3, we find that miR-10b is still unimportant, as it is not listed in Table 3. This phenomenon reveals the advantage of using joint covariate detection other than RSF. In fact, the individually significant miR-222 keeps a p=0.0012 corresponding to log-rank test with 10000 rounds of permutation. As to significant pair (i.e., miR-222 and miR-10b), it keeps a p=0.0002 which corresponds to log-rank test with 10000 rounds of permutation. As to miR-10b, it keeps a p=0.285, which is individually insignificant.
We simulated data under 40 independent dimensions, from which we assigned two to be significant. That is, the survival time S is defined as S = exp(−Xβ + ε), where X is the simulated gene expression matrix and β = [ 0.9, 0.1, 0.001, ..., 0.001] 40 denotes the coefficient parameter. ε ∼ N(0, 2). The sample size n is 50. The censoring states are generated, and yield 10 percent censoring for the simulated data.
The experimental results on simulated data are listed in Tables 4, 5 and 6, respectively. The significant pair closely associated with simulated survival outcomes are selected out, as shown in Table 5. In contrast, miRNA-alternative 2 which is in absence in Table 4 shows insignificant (p=0.939), and illustrates less important in Table 6. These results demonstrate the effectiveness of our method. The simulated data and full tables corresponding to Tables 4, 5 and 6 can be downloaded on the website: http://bio-nefu. com/resource/jcdsa.
In order to show that selected variables are improbable false positive or false negative ones, we repeated the simulations above for 100 times with an enlarged sample size (n=500). The experimental results are illustrated in Fig. 6. Figure 6a denotes the p values (p < 1e − 3) of the significant pair through 100 times of simulation. However, miRNA-alternative 2 individually shows less important, as illustrated in Fig. 6b. Comparisons between   Fig. 6a and b indicate that the significant features are probably not composed of individually significant univariables. Figure 6c and d report the number of positive pairs and individuals through 100 times of simulation, respectively. No false negative results are discovered. In Fig. 6c, the maximum number of false positive pair is three, which indicates a small probability of false positive pair 0.0038 (i.e., 3/C 2 40 ). As to Fig. 6d, the maximum number of false positive individual is also three; yet, the probability of false positive individual is 0.075 (i.e., 3/40).

Discussion
There are several states needed to be discussed. First, it is the significant multi-variable other than combinations of individually significant uni-variables that contributes to selection of features not only consistent with survival outcomes but also associated with stratification of patients under different survival risks. This fact has been demonstrated by our experimental results in this paper. Second, components of each significant multivariable may keep a low correlation. This phenomenon has been discovered when experiments on the simulated data were made. Further evidence is still needed. Third, the correction for multiple hypothesis testing is absent, considering the computational cost of calculating FDR, q value, the adjusted p values, etc. on each pair or each high-dimension tuple of variables. However, simulations are made, which demonstrate the effectiveness of our method.

Conclusion
Our joint covariate detection for survival analysis provides a new viewpoint for selecting variable candidates which are not individually but jointly significant. Following a two-step variable selection strategy, we propose a software (i.e., JCDSA) in order to help users to select features which are not only consistent with survival time but also associated with prognosis risk. JCDSA can be adapted for many categories of cancer. Users can easily operate it and conveniently obtain the experimental results for subsequent biological experimental validation.