Assessing the clinical utility of genomic expression data across human cancers

Cancer molecular profiling provides better understanding of tumor mechanisms and helps to improve the existing cancer management. Here we present the gene expression signatures from ∼9000 human tumors with clinical information across 32 malignancies from The Cancer Genome Atlas project (TCGA). Major predictors from the RNA sequencing data that were significantly correlated with cancer survival were identified. The expression level of these prognostic genes revealed significant genomic pathways that were clinically relevant to survival outcomes across human cancers. Furthermore, it is shown that in most cancer types, combinations of these genomic signatures with clinical information might yield improved predictions. Thus, with respect to clinical utility, our study reveals the promising values of genomic data from the pan-cancer perspective.


INTRODUCTION
Cancer is a global health burden and the second leading cause of death [1]. Despite various detection method and treatment options, survival rates for most cancers are still very low.
Genomic features emerge as promising biomarkers for cancer [2]. Of the various molecular data, the gene expression value from RNA sequencing revealed detailed molecular features with prognostic associations [3]. However, due to high cost, most studies only focused on several pre-selected genes, or based on small sample sizes [4,5].
The cancer genome atlas (TCGA) project "motivated large-scale genomic efforts to obtain the complete catalogs of the genomic alterations in cancer" [6]. Besides the rich molecular features (genomic, transcriptomic, epigenomic and proteomic) of each tumor, it also provides valuable clinical information. However, the clinical utility of these data has not been fully elucidated.
In the present study, we depicted the global pan-cancer prognostic landscape by analyzing the expression signatures from ~9000 human tumors across 32 malignancies from TCGA data sets. Furthermore, the clinical utility of survival predictions was evaluated by combining the genomic data with clinical information.
Data analysis steps and clinical characteristics of the cancer patients were shown in Figure 1A-1E. In total, there were 9175 patients of the 32 tumor types, in which, 49.4% were male and 50.6% were female. Their median age was 60 years old. With respect to tumor stage, 30.7% of the patients were in stage 1, 30.9% were in stage 2, 26.7% were in stage 3, and 11.7% were in stage 4. At the time of analysis, 77.4% of the patients remained alive, and 22.6% were deceased.

Pan-cancer prognostic genes and risk scores
To explore the pan-cancer prognostic signatures, pan-cancer dataset was built by combining all the cancer patients. Samples were randomly assigned into two groups, where 80% of the samples were assigned as the training group and 20% as the testing group. By cox regression analysis for the training group, the top ten adverse genes (B3GNT5, SLC11A1, ELF4, GALNT2,  PA2G4P4, SKP2, S100A9, FOXM1, PSMB2, ARL6IP6)  and top ten favorable prognostic genes (TADA2B, CBX7,  CIRBP, MAGED2, CRY2, CREBL2, TMED8, XPC, SECISBP2, GPD1L) were identified (Figure 2A). Based on these top prognostic genes, risk scores were calculated ( Table 1). The risk score was defined as the weighted sums of the independent prognostic gene values (1 for high expression, and 0 for low expression), weighted with their regression coefficients from the cox models ( Figure 2B).
To assess the clinical utility of the risk score, correlation of the risk score with the clinical variables in the training group was explored. In the analysis, higher scores were associated with male patients, patients with older age, and patients with advanced tumor stages ( Figure 2C). Further cox analysis and log rank test also confirmed that the poor survival outcome in patients with higher risk scores in different tumor stages ( Figure 2D-2E). In the testing group, similar relationship between the risk score and clinical variables was also shown ( Figure 2F). Notably, with respect to survival analysis, higher risk scores in the testing group also indicated higher risks of prognosis, suggesting that the risk score showed valuable clinical utility ( Figure 2G-2H).

Evaluation of the prognostic genes and risk scores
The RNA-seq data demonstrated great value for cancer prognosis. Risk scores specific for each type of cancer were shown in Table 1, which were calculated by applying the method used in the whole cancer population. However, at this moment, these prognostic models are limited by the sample size to be of clinical value.
To evaluate the effect of different normalization method for the RNA-seq data, quantile data were transformed into z-score or being applied the voom normalization. As shown in supplementary Figure 1A-1B, after quantile normalization of the RNA-seq data, the z-score transformation or voom normalization doesn't change much of the prognostic genes (based on z values of cox regression), with the pearson r value of 0.98 and 0.97, respectively. After various normalization method, top prognostic genes remained mostly the same, which was shown in the supplementary Figure 1C. Thus, applying Z-score transformation or voom normalization yield limited value for the survival analysis.     For the above prognostic model, the high or low expression for prognostic model was determined by the median expression level of each gene. The gene was divided as binary categorization such as 1 for high expression (> median value) and 0 for low expression (< median value).
Here we also applied the z-score (continuous variable) directly to propose a prognostic model that can reflect the values of gene expression (Table 1). Cox regression results showed that the continuous prognostic model have an hazard ratio of 1.22, which means the death risk increases by 22% if the patients get a risk score increased by 1.
The prognostic genes (in each cancer type and in the whole cancer population) were filtered by a specific cutoff (|z| > 3.09, or nominal one-sided p < 0.001). As an investigation of the relationship of different prognostic gene across different cancer types, prognostic genes in each cancer type were compared with the whole cancer population. As shown in the supplementary Figure 1D, for the prognostic gene identified in the study on single cancer type, most of them were also found in the pooled analysis. For example, 66% of the prognostic genes in the ACC also had prognostic values in the whole cancer population.

Pathway analysis in patients with different prognosis
Based on the prognostic risk score, patients were stratified into two different survival groups of a positive risk score and a negative risk score. This unsupervised cluster analysis showed obvious distinctions between the stratified survival groups, both in the training group and the testing group ( Figure 3A, 3E). To link the observed gene expression changes with molecular pathways that may impact the differential survival between high-and low-risk groups, gene set enrichment analysis (GSEA) was performed. As shown in Figure 3B and 3F, pathways such as E2F targets, MYC targets, G2M checkpoint, mTORC1 signaling and interferon gamma response were significantly enriched in the patients of higher risk scores, with good consistency between the training group and the testing group.
In order to assess possible effects of different pathways, the GSEA for every sample were evaluated using the single sample gene set enrichment analysis (ssGSEA). Based on the calculated scores for each pathway, cox analysis was performed to evaluate their prognostic effects. Results showed that most of the significant pathways from the GSEA output showed positive correlations with the survival outcome ( Figure 3C, 3G). In addition to the cox analysis, positive correlations were detected between the pathway ssGSEA scores and the prognostic risk scores. In Figure 3D and 3H, correlation analysis were shown in the most significant pathway (E2F targets), in both the training group and testing group.

Assessment of prognostic power of gene expression data
Since the gene expression analysis and pathway analysis showed great prognostic values in the study, prognostic power of gene expression data were further explored. C-index was applied to assess the predictive power of the gene expression data alone or combined with clinical information. To improve accuracy, cancer types that don't have enough death events (< 20 deaths or < 10% mortality) were excluded. Cancer patients were randomly split into 80% training and 20% testing for 100 times to calculate the final C-index. As shown in Figure 4A and 4B, the predictive power of gene expression data alone varied across cancer types. In KIRC and GBMLGG, the prognostic power was much higher when compared with other cancer types. levels after unsupervised hierarchical clustering in the training set and testing set, respectively. Expression levels are indicated on a low-tohigh scale (green-black-red). Two clusters are defined, namely the high risk group and low risk group. B, F. GSEA analysis was performed in the training set and testing set, respectively, to identify biological pathways associated with survival outcome. FWER-p values are indicated on a low-to high scale (lightblue-darkblue). The number of significant genes in each gene set is indicated by the circle size. C, G. Forest plots of pathway score association with cancer mortality in the training set and testing set, respectively. D, H. Scatter plots of correlations between risk scores and the E2F pathway scores in the training set and testing set, respectively.
To explore any additional prognostic power, the gene expression data was combined with clinical information. Significant clinical features (correlated with survival) were applied as baseline to build the cox model. A featureselection step against the residuals was utilized to include the gene features that better fit the model. Results showed that the most gene expression data alone (18 out of 20 cases) had significant predictive power (C-index > 0.5).

DISCUSSION
In this study, we assessed the clinical utility of genomic expression data from ~9000 cancer patients of 32 tumor types. The prognostic power across different cancer types was also evaluated [7,8].
Currently, only a few gene expression-based markers are routinely used in clinical practice [9][10][11][12]. The clinical utility of genomic expression has not been fully explored. Yuan et al. reported that for cancer patients, incorporating molecular features with clinical information yields significantly improved predictions. However, they only focused on 4 cancer types (KIRC, GBM, OV, LUSC), and no conclusions could be drawn for the whole cancer population [13]. Recently, Gentles et al. described the genomic prognostic landscape across human cancers, highlighting the promise of genomic expression data as biomarkers for clinical outcomes [14]. In our study, besides illuminating the prognostic landscape of genomic expression, pathway analysis based on these prognostic genes was also evaluated. In addition, the C-index was calculated from the prognostic models across tumor types, to assess the prognostic power of gene expression data.
Based on the genomic expression data in the whole cancer population, the top prognostic genes were identified, such as FOXM1, CBX7, CREBL2 and SKP2, which were consistent with previous studies [14][15][16]. Notably, when building the risk scores based on these top prognostic genes, significant stratification in survival outcomes were shown, both in the training and validation cohorts, indicating the robustness of the predicting effect of the prognostic genes.
Because of heterogeneity, many statistical methods have been developed to analyze cancer genomics, based on gene sets, pathways and network modules [17][18][19]. For the first time, our study described the prognostic landscape of biological pathways in the whole cancer population. Gene set enrichment of the differentially expressed genes revealed significant prognostic pathways, such as the E2F targets, MYC targets, G2M checkpoint, interferon gamma response, and so on. Mostly, these pathways are correlated with cell cycle, proliferation and inflammation, which is consistent with the biological mechanisms of tumor progression [20,21].
To explore prognostic power, our results showed that combining the clinical and molecular information could improve the predictive power of the gene expression data in most cancer types. Although the absolute magnitude gains were limited, the gene-expression signatures provide new biological insights into the process of cancer progression and metastasis that can help to improve the prediction power [22]. Actually, some of the gene-based prognostic signatures have already been demonstrated to be clinically useful for predicting the risk of tumor recurrence, such as the 70-gene and 76-gene signatures in breast cancer [23][24][25][26].
It is also important to realize that gene expression information is just one of the abundant molecular data (genomic, transcriptomic, epigenomic and proteomic) revealing the biological complexity of cancer. Other molecular information will also improve our understanding of the genotype-phenotype relationships involved in cancer. On the other hand, regarding the reliability and the reproducibility of the clinical use of molecular data, future technology, statistical and analytical methods are in great need to catch up with clinical needs [22].
In conclusion, our gene analysis and pathway analysis showed significant values for the prediction of survival outcomes for cancer patients. Additionally, it was found that by combining clinical information with molecular data, the model performance could be boosted statistically in most cancer types. However, further efforts would be needed to generate prognostic models ready for clinical use in the future.

Data set compilation
Clinical and survival data were acquired from the TCGA Data Portal (https://tcga-data.nci.nih.gov/tcga/). RNA sequencing data was obtained from the GDAC Firehose System (http://gdac.broadinstitute.org/). To maintain data consistency, only the RNA sequencing data from the platform of Illumina HiSeq 2000 RNA Sequencing V2 was included. Patients who have a complete clinical and RNA sequencing data were screened for further analysis. For each cancer data set, patients were split into two groups randomly: 80% as the training set and 20% as the testing set. For the pan-cancer study, all RNA sequencing data were combined by intersecting the common genes across different cancer types.

Prognostic genes and construction of the prognostic model
For RNA sequencing data, all "raw count" values were divided by the 75th percentile of the same patient (after removing zeros) and multiplied by 1000, to get the quantile normalization for survival analysis. Furthermore, quantile data were also transformed to the z-score or normalized by "voom" to evaluate the effects of different normalization method. The Z-score was calculated as "(tumor expression -mean expression in reference) / standard deviation of expression in reference". The voom normalization was applied using the R package "limma". It estimates the mean-variance relationship of the log-counts and generates a precision weight for each observation.
The association of each gene expression with survival outcomes was assessed via cox proportional hazards regression using the 'coxph' function of the R 'survival' package. Cox coefficients, hazard ratios with 95% confidence intervals, p values, and z-scores were obtained for each array probe. Top prognostic genes were identified by the values of z-scores. Based on these top prognostic genes, risk scores were built and it was defined as the weighted sums of the independent prognostic gene values (1 for high expression, and 0 for low expression). They were weighted with their regression coefficients from the cox models. Based on the prognostic risk score, further cox regression analysis and correlation analysis with clinical variables were performed.

Differential expression analysis and clustering analysis
Differential expression analysis was done using R "limma" package. Based on the limma output for the most differentially expressed genes, unsupervised hierarchical clustering analysis was used to discover the gene expression patterns of these groups sharing common characteristics. Heatmap was constructed using the R "gplots" package.

Gene set enrichment analysis
Prognostic gene sets are groups of genes that share common biological function. The evaluation of prognostic gene sets was performed using gene set enrichment analysis (GSEA) [27], where gene sets were obtained from the Molecular Signatures Database (mSigDB) [28]. In addition, a variant of GSEA, termed single sample gene set enrichment analysis (ssGSEA) was applied to calculate separate enrichment scores for each pairing of a sample and gene set [29]. Further cox regression analysis and correlation analysis were performed based on the enrichment scores of each gene set.

Performance evaluation of gene expression data
Performance evaluation of gene expression data was conducted based on the method suggested by Yuan et al [13]. Firstly, univariate cox was applied to the training set to select the top features correlated with survival, which were then converged by the LASSO using the R package "glmnet". The model was then applied to the testing set for prediction. Concordance index (C-index) was estimated from 100 randomizations using the R package "survcomp". To explore the predictive power of integrating gene expression data with clinical information, we used the significant clinical features (correlated with survival) as baseline to build the cox model. Then a feature-selection step against the residuals was applied to combine the gene features that better fit the model.