Exploring the key genes and signaling transduction pathways related to the survival time of glioblastoma multiforme patients by a novel survival analysis model

This study is to explore the key genes and signaling transduction pathways related to the survival time of glioblastoma multiforme (GBM) patients. Our results not only showed that mutually explored GBM survival time related genes and signaling transduction pathways are closely related to the GBM, but also demonstrated that our innovated constrained optimization algorithm (CoxSisLasso strategy) are better than the classical methods (CoxLasso and CoxSis strategy). We analyzed why the CoxSisLasso strategy can outperform the existing classical methods and discuss how to extend this research in the distant future.


Background
Glioblastoma multiforme (GBM) is the most common and malignant brain tumor [1][2][3]. Since GBM is high invasive and is mixed together with the healthy brain tissue, it is almost impossible to remove the tumor without causing serious consequences [4]. Moreover, GBM is very easy to relapse [5,6]. The median survival and progression free survival time of GBM are 14.6 and 6.9 months, respectively. And the 5 year survival rate was 9.8 % [7]. Previous studies [8][9][10] indicated that gene mutation is one of the most important factors for GBM development. Therefore, gene expression analysis can not only be used to discover the underlying abnormality of gene expression associated with the GBM gene mutation, but also be employed to discover gene signatures which could help us to investigate the related signaling transduction pathways. Results from the pathway analysis can lay the foundation for the GBM cancer targeted drug research in the future.
As one of the important survival analysis methods, the cox proportional hazards model [11] is broadly employed to investigate the connections between various covariates and the length of life. However, the classical cox proportional hazards model [12] can only process such survival data that the dimension of the factors (P) are less than the number of samples (N) [13] (we call it as P < <N type of data), but it is not able to handle the survival data that the dimension of the factors are greater than the number of samples such as the gene expression data [13] (we call it as P> > N type of data). To process P> > N type of data, Tibshirani et al., [14] integrated the Lasso algorithm, one of the constrained optimization methods, into the classical Cox * Correspondence: tinalee@swu.edu.cn; zhanglcq@swu.edu.cn 6 College of Mathematics and Statistics, Southwest University, Chongqing 400715, People's Republic of China proportional hazards model [15] to select the key predictors. However, Fan et al., [16] pointed out if the number of predictors is much greater than the sample size (P> > N), a pre-cleaning step by a computationally expedient screening procedure is often preferred to increase the accuracy of the algorithm. Thus, Fan et al., [16] developed the sure independence screening (SIS) method by fitting marginal Cox regression models for each covariate and screening out several covariates by a pre-specified threshold. Nevertheless, reported by Hong et al., [17], marginal screening may encounter the difficulty in identifying these hidden and jointly important variables to incur false negatives. Therefore, Hong et al., [17] proposed a conditional SIS method to explore the potential predictors for the regular linear system, but not consider the survival data. On the other hand, developing a systematic approach to identify the target generic drug for the cancer treatment already becomes a popular research field [18,19]. However, to our knowledge, there is no recent research discussing the incoherent connection between survival time and the target generic drug in detail.
To overcome the shortcomings of these previous research, this study proposed a multi-scale genes and signaling transduction pathways exploration platform ( Fig. 1) with the following three innovations. Firstly, we innovatively analyzed the clinical GBM gene expression and survival time data [20] to investigate the incoherent relation between the signatures of genes and the survival time of GBM patient. Secondly, we not only integrated the constrained optimization method such as Lasso [15] into classical Cox proportional hazards model [13] to explore survival time related key genes by processing the P> > N type of data, but also used the SIS algorithm to improve the predictive accuracy. Thirdly, we employed KOBAS database [21] and hypergeometric test [22] to investigate the correlated GBM signaling transduction pathways regarding the explored survival time related key genes. And then, these survival time related signaling transduction pathway could help us to bridge the relation between the targeted drugs and the survival time for GBM patients.
The clinical GBM gene expression and survival data set used in this study is downloaded from the Georgetown Database of Cancer G-DOC [20], which has 54,675 features (P) and 227 samples (N). To handle such a P> > N type of data, we developed the CoxSisLasso strategy. It firstly integrated constrained optimized methods such as Lasso into the classical cox regression model to select the prior genes with potentially great impact on the patients' survival time. Secondly, conditioned on these genes selected by Lasso, conditional SIS method [23] is used to re-select the possible genes from these genes screened out in the first step. To bridge the relation between the targeted drugs and the survival time for GBM patients, we employed the KOBAS [21] application and the explored GBM survival time related key genes to investigate which signaling transduction pathways closely correlate with the GBM survival time. In general, this study developed a multi-scale genes and signaling transduction pathways exploration algorithm that can not only investigate the molecular mechanism between the key genes and cancer patients' survival time, but also employ hypergeometric distribution based database (KOBAS) to look for the related signaling pathways in the proteomics level for the future targeted cancer therapy [24,25]. Manually-reviewed experimental evidences showed that mutually explored GBM survival time related genes [26][27][28][29][30][31][32][33][34][35][36][37][38] and signaling transduction pathways [39][40][41][42][43][44][45][46][47][48][49][50][51][52] are closely related to the GBM. In addition, the research results demonstrate that our proposed CoxSisLasso strategy has the best predictive power and model fitting capacity compared to the CoxLasso and CoxSis strategy developed by Tibshirani et al., [14] and Fan et al., [16], respectively. Finally, we theoretically analyze why the CoxSisLasso strategy outperforms CoxLass and CoxSis and discuss the further research.

Materials
We used a multi-study microarray database of GBM expression profiles (n = 227) from the Georgetown Database of Cancer G-DOC [20], based on the Affymetrix U133 plus 2.0 GeneChip microarray platform. The microarray datasets of GBM are listed by Table 1.

Data filtering
The original microarray datasets are normalized and preprocessed by R software package [53]. After preprocessing step, there are 227 samples and 54,675 genes left in the data matrix. Next, the interquartile range (IQR) threshold [54] is employed to screen out the genes with small variance value. After that, there are only 227 samples and 10,992 genes left in the GBM gene expression and survival time data matrix.

Cox proportional hazards model
Survival analysis [11,55] works for the analysis of time duration until one or more events happen. As one of the widespread used survival analysis methods, the Cox proportional hazards model [13] is used to analyze the time-to-event data with both censored data and covariates, which assumes a semi-parametric form for the hazard as Eq. 1.
where h i (t) is the hazard for patient i at time t, h 0 (t) is a shared baseline hazard function, β is an unknown p-dimensional regression coefficient vector and x i is a vector of potential predictors for the i th individual. Based on the available samples, the estimator of the unknown parameter coefficientsβ , can be obtained by maximizing the log-partial likelihood function as Eq. 2 where D is the set of indices of the events and R k denotes the set of indices of the individuals at risk at time t k . Since this study encounters the P> > N type of data, it is impossible to employ classical Cox proportional hazard regression method [13] to analyze the GBM gene expression data matrix directly. Therefore, the following sections propose three variable selection strategies to obtain the sparse regression coefficient.

Combined Cox and Lasso (CoxLasso) strategy
To obtain the sparse solution for the parameter β in the Cox proportional hazards model (Eq. 1), we have to integrate constrained optimization methods such as Lasso proposed by Tibshirani et al., [14] into classical Cox proportional hazards model to minimize the negative log partial likelihood subject to the sum of the absolute values of the parameters being bounded by a constant as Eq. 3.
It is equivalent to the following optimization problem where λ is the tuning parameter to control the sparsity of the estimator. This research used the R package tool glmnet developed by Friedman et al., [56] to implement the combined Cox and Lasso (CoxLasso) strategy (Fig. 2a) by using cross validation to choose the tuning parameter.

Combined Cox and SIS (CoxSis) strategy
Though directly integrating Lasso method into Cox model can process P> > N type of data, it may encounter problems with speed, stability, and accuracy, once the dimension of the covariates is ultra-high [23] . Therefore, it is often preferred to employ a simple and computationally efficient screening procedure to reduce the data dimensionality to a moderate size before using Lasso method. The combined Cox and SIS (CoxSis) strategy is illustrated by the following steps: Step 1: Fit a marginal Cox regression model for each covariate x m to obtainβ m by Eq. 5.1 Step 2: Rank the magnitudes ofβ j ; j ¼ 1; 2; …; p in decreasing order and keep the number of d top ranked covariates.
Step 3: Denote the index of selected covariates by Θ. Implement Lasso with the selected d covariates by minimizing Eq. 5.2 This study employs R package of SIS developed by Fan et al., [16] to implement the combined Cox and SIS (CoxSis) strategy (Fig. 2b).

Combined Cox, SIS and Lasso (CoxSisLasso) strategy
Recently, Barut et al., [57] proposed a conditional screening approach (Conditional SIS) to enhance the accuracy of SIS by using the prior knowledge of the key factors to select the predictors. Regarding to our P> > N type of data and the limitation of Lasso method in the stability and accuracy, this study proposed a combined Cox, SIS and Lasso (CoxSisLasso) strategy (Fig. 2c) to increase the predictive accuracy of the model as follows: Step 1: Implement Lasso for the data. Denote the index of selected covariates with Lasso by C 0 .
Step 2: Conditioned on the selected subset of covariates C 0 , for each covariate x m , m ∉ C 0 , fit the following Cox regression model by maximizing Eq. 6 Step 3: For a given threshold γ, keep the variables x m , m ∉ C 0 ifβ m ≥γ. Denote C 1 ¼ m∉C 0 ;β m ≥γ n o . Then the augmented selected predictors are C 0 ∪ C 1 .
Step 4: Implementing Lasso with the covariates in the set C 0 ∪ C 1 to select the final predictors.
For the threshold γ, Barut et al., [57] proposed two procedures by controlling FDR and random decoupling to choose the proper level of threshold. Motivated by Zhao and Li [23], this study sets the threshold γ = 1/p, and p is the total number of all the covariates. Once the p-value of the Z-test of the covariate x m , m ∉ C 0 is less than the γ, we keep it as one of the important predictors.
Investigate potential signaling pathway regarding to the candidate genes related to the GBM survival time After obtaining the explored GBM survival time related key genes by previous strategies, it is interesting for us to investigate which potential signaling pathways are closely related to these genes. And the potential pathways will be employed for the targeted drug therapy to treat the GBM cancer in the future.
KOBAS is a signaling transduction pathway database to identify statistically significantly enriched pathways using hypergeometric test [11]. In statistics, the hypergeometric test uses the hypergeometric distribution (Eq. 7) to calculate the statistical significance.
where N is the population size, K is the number of success states in the population, n is the number of draws, k is the number of observed successes.

Results
The explored GBM survival time related key genes by CoxLasso, CoxSis and CoxSisLasso strategy, respectively Here, Table 2 shows the explored GBM survival time related key genes by CoxLasso, CoxSis and CoxSisLasso strategy, respectively. Also, the Venn plot ( Fig. 3) indicates there are four common genes (AEBP1, GDNF, IL17RC and EIF3A) mutually selected by these three strategies, which closely correlate with the survival time of GBM patient validated by the manually-reviewed experimental evidences [26][27][28][29][30][31][32][33][34][35][36][37][38]. Firstly, AEBP1 (Adipocyte enhancer binding protein 1) was discovered as a transcriptional repressor [26]. It not only expresses at different levels in different organ and tissue types and its expression is relatively strong in brain [27], but also it can interact with tumor suppressor protein PTEN and inhibit its tumor-suppressing function [28]. AEBP1 can also negatively regulate IkB, resulting in the up-regulation of NF-kB and enhanced inflammatory response [29]. It is well known that both PTEN and NF-kB, closely related to the AEBP 1, are important players in GBM cancer progression. Moreover, previous research identified several genomic targets of AEBP1 playing vital roles in the survival of glioma cells [30].
Secondly, GDNF is a Glial Cell derived neurotrophic factor which promotes survival of neurons [31]. GDNF is not Table 2 The explored genes for strategy CoxLasso, CoxSis and CoxSisLasso only identified as an important factor in macrophage infiltration into GBM, contributing to GBM progression [32], but also it can promote glioma cell invasion through its receptors that are present on invasive GBM cells [33].
Thirdly, EIF3A (Eukaryotic translation initiation factor subunit 3A) is not only expressed in all tissue types in human body and its expression is up-regulated in some type of cancers [34], but also it is important in regulating the expression of proteins involved in DNA repair pathway which is essential in drug sensitivity and resistance in cancer treatment [35,36]. Especially, EIF3A is found to be overexpressed in some glioma patients [37].
Fourthly, Inlerleukin-17 receptor C (IL17RC) is a key molecule mediating interleukin 17 signaling. It is important in immune response and inflammation which are important in GBM progression [38].

Predictive performance comparison of survival time for each strategy
This study employs the idea of time-dependent receiver operating characteristic curve (ROC) for the censored data and the area under the curve (AUC) [58,59] to quantify the predicative accuracy for each strategy, when the outcome of interest is the survival time. The ROC curve depicts the sensitivity (Eq. 8.1) versus 1-specificity (Eq. 8.2) at each time t for any risk score function x T β with c being the cut-off value and δ(t) is the event indicator at time t.
Figures 4 and 5 depicts the ROC curve at a specific predicted time 30 and AUC over a period of time respectively to quantify the performance of the three strategies to predict the survival time of the GBM patients. It demonstrates that CoxLasso and CoxSis strategy shares the similar predictive performance, whereas our proposed CoxSisLasso strategy has the best predictive accuracy since it not only has the greatest value of the sensitivity and 1-specificity (Fig. 4), but also the largest  AUC value (Fig. 5). Furthermore, to assess the generalization ability of the proposed model, we randomly select 120 samples as the training samples and the rest 68 samples as the test samples. Figure 6 shows the AUCs of the three strategies for the testing samples. Both Figs. 5 and 6 turn out that our proposed CoxSisLasso method provides the largest AUC value with the best performance. Table 3 summarizes the Cox regression results with the key genes selected by three strategies. R 2 is the statistic of the goodness of fit measure [60]. The concordance index [61] is a valuable measure of model discrimination in analyses involving survival time data. Greater R 2 and concordance index value imply better model fitting performance. Table 3 shows that R 2 and concordance index value of CoxSisLasso strategy (Table 3C) outperforms the other two (Table 3A & B). Moreover, by comparing results of Table 3C (CoxSisLasso) with the results of  Table 3A (CoxLasso) and Table 3B (CoxSis), we found that CoxSisLasso not only can preserve the genes selected by the CoxLasso and CoxSis, but also it can introduce several statistically significant genes, which are potential for us to explore their relationships with GBM in the distant future.

Model fitting performance comparison for each strategy
The explored GBM survival time related signaling transduction pathways by CoxLasso, CoxSis and CoxSisLasso strategy, respectively Here, Table 4 lists the explored GBM related signaling transduction pathways by CoxLasso, CoxSis and CoxSi-sLasso strategy, respectively. Also, the Venn plot (Fig. 7) indicates three explored GBM related signaling transduction pathways mutually selected by these three strategies.
Firstly, mTOR (Mammalian target of rapamycin) is an important mediator of phosphatidyl-inositol-3 kinase (PI3K) pathway. And previous research turned out that constitutive activation of PI3K signaling is found in the majority of GBM patients [39]. Moreover, PI3K-akt-mTOR axis plays essential role in cell growth and proliferation [40]. Signaling of mTOR pathway is vital for cancer cell growth and survival in GBM patients [41]. Currently mTOR pathway inhibitors are under active investigation in preclinical experiments and in clinical trials for GBM treatment [42].
Secondly, TGF-beta (Transforming growth factor beta) is a secreted cytokine which signals through specific   6 AUCs for test samples for strategy CoxLasso, CoxSis and CoxSisLasso receptors and exerts its effect via intracellular Smad family proteins [43]. TGF-beta pathway controls GBM cell proliferation [44]. Its signaling contributes to the maintenance of tumor-initiating cells in GBM [45]. TGF-beta pathway is also involved in tumor invasion and metastasis in GBM patients [46]. Inhibition of TGFbeta pathway signaling reduced GBM cell proliferation and invasion in preclinical cell-based assays [47]. TGFbeta pathway inhibitors showed promising results in improving GBM patient survival in clinical trials [48].
Thirdly, IRE (Internal Ribosomal Entry) pathway is involved in the synthesis of some proteins during which protein synthesis is initiated from a start codon near an IRE site rather than by scanning the Kozak sequence. This IRE pathway is used in the translation of many eukaryotic genes including growth factors such as VEGF, FGF2 and PDGF [49] and transcription factors such as c-myc and hypoxia induced factor [50,51] . Indeed, upregulated expression of proto oncogene c-Jun in human GBM is mediated through a potent internal ribosomal entry site (IRES) in the 5′UTR of the c-Jun mRNA, and the upregulation of c-Jun contributes to the malignant properties of GBM cells [52].

Discussion
This study developed a multi-scale gene and signaling transduction pathway exploration platform based on the classical Cox proportional hazard model [12], constrained optimization method [14][15][16] and hypergeometric test to analyze P> > N type of GBM gene expression and survival time data (Table 1). Compared to the previous research  On the one hand, manually reviewed experimental evidences validate that both mutually explored key genes [26][27][28][29][30][31][32][33][34][35][36][37][38] (Table 2) and signaling transduction pathways [39][40][41][42][43][44][45][46][47][48][49][50][51][52] ( Table 4) are closely related to GBM. On the other hand, since CoxLasso strategy may encounter problems with speed, stability, and accuracy for processing high dimensional data [23], the CoxSis strategy is developed by employing a simple and computationally efficient screening procedure to reduce the dimensionality of the data to a moderate size before using Lasso method based on the previous work of Fan et al., [16]. Though classic marginal screening approach based CoxSis is theoretically proved to be capable of selecting all important predictors [56], it is difficult to identify these hidden predictors which jointly correlate with the response variable but not marginally. For this reason, we proposed the CoxSisLasso strategy, which not only uses the CoxLasso strategy to obtain a prior set of important predictors, but also incorporates the SIS [16] approach to select the important predictors regarding to the previous results. Figure 5 and 6 turned out that CoxSi-sLasso strategy has the best predictive power and model fitting capacity than both CoxLasso and CoxSis.

Conclusions
In general, this study innovatively developed a CoxSi-sLasso strategy to interrogate the connections between GBM gene expression and GBM patients' survival time as well as employed the KOBAS database [21] and hypergeometric test [21] to investigate the incoherent signaling transduction pathways and the survival time of GBM patient. Though the research results demonstrated the advantages of our algorithm, the current research still has several shortcomings such as the theoretically proof for the CoxSisLasso strategy, simulation study for the gene and pathway selection platform and so on. In the distant future, we will not only need improve our current CoxSi-sLasso algorithm, but also will employ the related pathway analysis theory [63] to explore the GBM survival time related proteins for the target drug study.