An integrated lncRNA, microRNA and mRNA signature to improve prognosis prediction of colorectal cancer

Although the outcome of patients with colorectal cancer (CRC) has improved significantly, prognosis evaluation still presents challenges due to the disease heterogeneity. Increasing evidences revealed the close correlation between aberrant expression of certain RNAs and the prognosis. We envisioned that combined multiple types of RNAs into a single classifier could improve postoperative risk classification and add prognostic value to the current stage system. Firstly, differentially expressed RNAs including mRNAs, miRNAs and lncRNAs were identified by two different algorithms. Then survival and LASSO analysis was conducted to screen survival-related DERs and build a multi-RNA-based classifier for CRC patient stratification. The prognostic value of the classifier was self-validated in the TCGA CRC cohort and further validated in an external independent set. Finally, survival receiver operating characteristic analysis was used to assess the performance of prognostic prediction. We found that the multi-RNA-based classifier consisted by 12 mRNAs, 1miRNA and 1 lncRNA, which could divide the patients into high and low risk groups with significantly different overall survival (training set: HR 2.54, 95%CI 1.67-3.87, p<0.0001; internal testing set: HR 2.54, 95%CI 1.67-3.87, p<0.0001; validation set: HR 5.02, 95% CI 2.2–11.6; p=0·0002). In addition, the classifier is not only independent of clinical features but also with a similar prognostic ability to the well-established TNM stage (AUC of ROC 0.83 versus 0.74, 95% CI = 0.608-0.824, P =0.0878). Furthermore, combination of the multi-RNA-based classifier with clinical features was a more powerful predictor of prognosis than either of the two parameters alone. In conclusion, the multi-RNA-based classifier may have important clinical implications in the selection of patients with CRC who are at high risk of mortality and add prognostic value to the current stage system.


INTRODUCTION
Despite advances in screening, diagnosis, and curative resection, colorectal cancer (CRC) is still the third most common epithelial malignancy and the fourthleading cause of mortality around the world [1][2][3]. At present, the tumor, node, and metastasis (TNM) staging is the only well-recognized stratification system used in clinical practice to guide therapy decision and predict CRC patients' prognosis [4]. However, TNM staging fails to evaluate survival outcomes accurately in many patients undergoing surgical resection [5]. Conflict outcomes even existed among patients with same stage category [6,7]. Besides that, although the version of TNM staging has been continuously updating in the past decade, the prognosis value does not increase significantly [8]. All the www.impactjournals.com/oncotarget/ Oncotarget, 2017, Vol. 8, (No. 49), pp: 85463-85478 Research Paper above factors highlight the urgent need to identify reliable prognostic factors for a more precise prediction in CRC patients [9][10][11].
The accumulation of basic research revealed that certain molecules which intimately associated with tumor phenotype and clinical behavior, maght be with better predictive value than clinicopathological features [12]. Indeed, many previous studies had confirmed that the discovery and application of individual biomarkers could facilitate the prognostic evaluation [9][10][11]. But due to their specificity and sensitivity, individual molecular alone or in combination with clinical features also do not predict the survival of CRC patients accurately [13].
Given the heterogeneity of CRC and the multitude of variables influencing clinical progress, the multimolecular signatures provide a more comprehensive prognostic information. Expression levels of thousands of molecular are now widely evaluated simultaneously by microarray, sequencing and mass spectrometry due to huge breakthrough of high-throughput technology during the last decade [14]. Therefore, expression profiling especially mRNA, miRNA and lncRNA, has shown great prospect in clinical practice to predict the long-term outcome of individual patient. Moreover, many researches have demonstrated that, notwithstanding the importance of individual RNA, intrinsic multi-RNA profiles of CRC may have greater prognostic value. Ramon Salazar et al. reported an expression profiling study and screened 18 mRNAs that could significantly improve the prognostic accuracy in patients with stage II and III CRC [15]. In another study, by integrating 7 genes into a single model, Anita Sveen et al constructed a prognostic classifier for stage III CRC patients and validated that the classifier is a clinically feasible prognostic predictor of poor prognosis [16]. Besides that, MicroRNAs (miRNAs) and long noncoding RNAs (lncRNAs) as key ingredients of expression profiling involved in directly regulating approximately 30% of protein-encoding genes [17]. Although functions of miRNAs are far from being fully understood, growing evidences indicated that aberrant expression of miRNAs meet the requirement of ideal biomarkers for cancer detection [18], which were not only stable in plasma and cuticle at detectable levels [19], but also showed a good sensitivity and specificity [20]. Thus, the prognostic value of individual miRNA in CRC was continuously being reported [21,22] since it was first discovered by Lee in 1993 [23]. Similarly, recent investigations on various human cancers demonstrated that lncRNAs may be an overlooked source of biomarkers and therapeutic targets [24]. Although only a limited number of lncRNAs have been well characterized on biological mechanism, accumulating evidences have suggested that lncRNAs may have significant prognostic value in many types of cancers [25] including breast cancer [26], prostate cancer [27] and CRC [28,29].
Despite much was known about RNAs in CRC, previous studies mainly focused on them separately. It is still unknown whether combining different types of RNAs could substantially increase the prognostic value. Therefore, the aims of this present work was to construct a multi-RNA-based classifier based on exploring the lncRNA, miRNA and mRNA profiles of CRC patients. The prognostic value of the classifier was investigated in training cohort and further confirmed in independent validation cohort. Our findings suggest that the multi-RNA-based classifier could effectively stratify CRC patients who are at high risk of mortality and add prognostic value to the current stage system.

Clinicopathological features of patients in the TCGA and validation CRC cohort
Two CRC cohort and corresponding clinical data were downloaded from the publicly available TCGA and GEO database, respectively. After removal of the samples with inadequate clinical information, a total of 663 CRC patients including 338 females (mean age 66.04 ± 13.76 years) and 325 males (mean age 66.63 ± 11.43 years), were analyzed in the present study (median follow-up: 23.98 months). All the included patients were pathologically diagnosed with CRC and undergone surgical resection, in which 598 patients from TCGA database were randomly divided into training set (n=498) and internal testing set (n=100) separately, and 65 patients from GEO database (GSE29623) possessing lncRNA, miRNA and mRNA profiles simultaneously were set as validation set [30]. Of note, 50 patients with expression profiles derived from adjacent non-tumor tissues were specifically assigned to training set for analyzing the differential expression of RNAs. Demographic and clinical data for the training, internal testing and independent validation set were summarized in Table 1. As we expected, no significant difference was observed in the major clinicopathological characteristics. However, owing to the high censoring rate in the TCGA CRC cohort(67.15%), remarkable difference existed in overall survival status(p<0·0001). Thus, Kaplan-Meier tests were subsequently conducted to evaluate the accuracy of the survival data. As clearly indicated in Figure 1, although containing a majority of censored data, the survival information in the TCGA CRC cohort was significantly related to the well-established TNM stage, which means its accuracy was appropriate for use in further studies.

Construction of prognostic classifier from the training sets
Aberrant expression of RNAs which mediated tumor initiation, progression, and metastasis is the potential prognostic biomarker. To obtain robust and reproducible differentially expressed RNAs (DERs), global expression profiling was conducted on TCGA training set including 498 CRC specimens and 50 adjacent normal tissues, by using redgeR and limma respectively. A total of 1786 DERs were preliminarily screened by both the algorithms with the threshold of |log2FC| >2 and adj.P.Value < 0.05, in which the number of mRNA, miRNA and lncRNA was 1247, 163, 376 respectively ( Figure 2A). The details of DERs was comprehensively displayed in circos plot ( Figure 2B). Moreover, as shown in Figure 2C, using unsupervised hierarchical clustering based on those DERs, the samples of tumor and normal were clearly separated into two discrete groups which indicated that the DERs identified in the present study were credible. Based on univariate survival analysis, the DERs in which not significantly correlated with overall survival was further filtered out. Then, a relative regression coefficient was calculated by LASSO analysis based on the survivalrelated DERs. By forcing the sum of the absolute value of the regression coefficients to be less than a fixed value, certain coefficients were shrunk exactly to zero and the most powerful prognostic marker of all the CRC survival-associated DERs were selected including 12 mRNAs, 1miRNA and 1 lncRNA ( Figure 2A). As  summarized in Table 2 , ten were associated with high risk (FAM132B, CPB1, NXPH4, OSR1, PCOLCE2,  RNF112, TNNT2, GABRD, MIR27A, HOTAIR, HR>1) and four were shown to be protective (MMP1, MS4A1, IZUMO2, GIF, HR<1). Combine the relative expression levels of the DERs in the classifier and corresponding LASSO coefficients, a risk score (RS) was calculated for each patient in the TCGA training set. The cutoff point of RS for dividing the high-risk and low-risk patients was generated according to the optimum sensitivity (74.27%) and specificity (84.68%) based on ROC curve for predicting 5-year survival. Patients with a RS greater than or equal to 0.2835 were assigned to high-risk group and the rest patients belonged to low-risk group (Figure The distribution and variation trend of each DERs on chromosomes was shown in Circos plots. Color gradient from red to blue represent the logFC of DERs, the gene symbol of each DERs was displayed in outermost region and been pointed to a specific location on chromosome by a connecting line. (C) Two-way hierarchical clustering of 498 tumour tissues and 50 adjacent normal mucosa with the 2114 differentially expressed RNAs using Euclidean distance and average linkage clustering. Every row represents an individual gene, and each column represents an individual sample. color gradient from green to red indicate expression levels from low to high on a normalized value(-10 to 10). The clustering of samples were mainly divided into two major clusters, one was the normal tissue samples and the other was cancer tissue samples. www.impactjournals.com/oncotarget 3B). As shown in Figure 3C and 3D, we found that patients with high RS tended to express high-risk RNAs, whereas tumors with low RS incline to express protective RNAs.

Validation of the integrated classifier for survival prediction in the TCGA and validation CRC cohort
To investigate the relationship between RS and survival status of CRC patients, Kaplan-Meier analysis and log-rank test were conducted on the training sets. Obviously, patients with higher RS generally had a significantly worse overall survival (OS) than those with lower RS (p<0.0001; Figure 4A ). As the majority of events occurred within 5 years, time-dependent ROC curves were used to assess the prognostic power based on OS at 1, 3, 5 years respectively ( Figure 4A) Figure 2A). In addition, we did the same analyses on the internal testing set. As shown in Figure 4B, the results were similar to what we observed in the training set (HR 2.54, 95%CI 1.67-3.87; p<0.0001).
To validate whether our prognostic classifier also had similar predictive ability in different populations, we applied it to the independent set. There were only a limited number of patients (N=65) in the independent set because of lack of published datasets that have the mRNA, miRNA and lncRNA profiles for the same CRC patients. However, the variables (N= 15) to sample size (n=65) ratio was enough to accurately estimate the regression coefficients based on ridge regression. By using the same prognostic model and cutoff value (RS = 0.2835), the classifier can also successfully subdivide patients into high risk (n = 28) or low risk (n = 37) groups with remarkable differences in overall survival (HR 5.02, 95% CI 2.2-11.6; p=0·0002; Figure 4C). In consistence with the results demonstrated above, 5-year OS was 68.4% (95% CI 51.6-90.6) for the low-risk group, whereas it was only 27.2% (95% CI 12.8 -57.4 ) for the high-risk group.

Prognostic value of the integrated classifier is independent of clinical feature
To assess whether the prognostic classifier represents an independent indicator in CRC patients, the effect of each clinicopathologic feature on survival was analyzed by Cox regression. As shown in Table 3 , after multivariable adjustment, the multi-RNA-based classifier remained a powerful and independent factor in training set, testing set and independent set. Additionally, timedependent receiver operating characteristic (ROC) was applied to compare the predictive accuracy between the multi-RNA-based classifier and the other independent clinical factors ( Figure 5). By calculating the area under the curve (AUC) of ROC, we found that the multi-RNAbased classifier had significantly higher prognostic accuracy than any other factors except TNM stage (0.83 versus 0.74, 95% CI = 0.608-0.824, P =0.0878). Although TNM stage is a well-recognized prediction system for prognosis, conflict outcome existed among patients with same stage category. To investigate whether the multi-RNA-based classifier adds prognostic value to the current system, data stratification analysis was performed. As shown in Figure 6, within each stratum (stage II-IV), our classifier could further subdivide the patients into longer survival and shorter survival group ( Figure 6B-6D). More importantly, the multi-RNA-based classifier combined with clinical features achieved the greatest area under the curve of ROC which was significantly greater than classifier alone (0.89 versus 0.83, 95% CI 0.801-0.926, P=0.0106). These results demonstrated that combining the multi-RNA-based classifier with clinical features could further improve the capacity for predicting outcome. Therefore, we constructed a nomogram that integrated both the classifier and clinicopathological features to predict survival probability of patients who had undergone surgical resection ( Figure 7A). Calibration plot showed that predicting 3-year and 5-year survival probabilities corresponded closely to the actual observed proportions ( Figure 7B).

DISCUSSION
The increasing evidences are improving the understanding that, despite the importance of individual molecules, tumorigenesis and prognosis is strictly controlled by the interactions between the myriad of cellular constituents including DNA, RNA, proteins and small molecules [31]. Even a discrete biological function can rarely be ascribed to an individual molecule [32]. Therefore, based on an oversimplified model, which has dominated medical research for the past century, no longer applies to the current situation of data explosion in medical. The breakthrough of high-throughput technology was now powerful enough to measure each component of transcript in the tissue or cell at any given time [33,34]. According to the estimate, the amount of bioinformation in the world doubles every 20 month, which promises to promote understanding of disease at the integration level and add new dimensions to our ability to predict the prognosis of an individual patient [35].
As being recognized previously, ribonucleic acid (RNA) is a polymeric molecule essential in various    biological roles and many of which have been identified as prognostic biomarkers [18,25]. Given the tumor heterogeneity and the multitude of variables involved in influencing clinical progress, combination of multiple RNA provides a more comprehensive prognostic information. Indeed, many former researches have revealed the great prospect in clinical utility of multi-RNA-based classifiers [36][37][38]. For example, using miRNA microarrays, Zhang et al. analysed 40 paired colon tumor and adjacent normal tissues and found that a six-miRNA-based classifier could predict disease recurrence and serve as an indicator of efficacy for adjuvant chemotherapy [39]. More importantly, the six-miRNA-based classifier even had better prognostic value than TNM stage and mismatch repair status. Additionally, it has been clarified recently that lncRNA is another dimension of transcription-regulatory networks [25,37,40,41]. Aberrant expression of lncRNA is associated with tumorigenesis, tumor progression and metastasis [41]. Until now, although only a few lncRNAs have been investigated in CRC, existing results demonstrated that lncRNAs may be ideal prognostic biomarkers [25]. Thus, Hu et al. performed lncRNA expression profiling in large CRC cohorts from GEO and established a set of six lncRNAs that may be an efficacy tool for clinical prognosis evaluation [28]. Duo to the great power of multi-RNA-based risk stratification, its clinical utility has recently been approved by the FDA for making treatment decisions in early-stage breast cancer(MammaPrint; Agendia, Amsterdam, the Netherlands) [42,43].
Although much is known about RNAs in CRC, previous studies mainly focused on them individually. The prognostic value of combining different type of RNAs is still not elucidated. More importantly, as the third most common epithelial malignancy, prognosis evaluation of patients with CRC based on current prognostic system still presents challenges [8]. Therefore, in the present study, we first constructed a novel multi-RNA-based classifier consisting of 12 mRNA, 1 miRNA and 1 lncRNA, which was validated as an independent predictor for CRC patient survival. Our data revealed that this classifier can successfully subdivide CRC patients into high and low risk groups with remarkable differences in overall survival. The results was further validated by a internal set and an independent external set, which reflects the good reproducibility of the classifier. In addition, even stratified by TNM stage, the CRC patients could also be divided into longer survival and shorter survival group by the multi-RNA-based classifier. And this further indicated that our classifier could act as an independent factor for CRC prognosis. More importantly, it is well known that CRC prognosis is highly stage dependent, however, dilemmas still exist in making appropriate treatment decisions, especially for a stage II patient. Therefore, identification of high-risk subgroup among stage II CRC patients by a reliable indicator is of great clinical need. Our data demonstrated that the multi-RNA-based classifier could be a promising biomarker to stratify stage II patients into distinct risk subgroup and guide individualized therapy choices.
Moreover, it was fascinating to find that the multi-RNA-based classifier and TNM stage had a similar prognostic ability and were independent of each other, which means the classifier may be used to refine the current prognostic model and facilitate further stratification of patients with CRC in the same TNM stage. Indeed, our ROC analysis indicated that a stronger power for prognostic prediction could been achieved by a combination of the multi-RNA-based classifier and clinicopathologic risk factors, which at least in part confirmed above conclusion. Intuition tells us that integrating different types of survival-related RNAs into a single model, instead of study on them separately, was expected to increase prognostic value substantially. Comparing the AUC of the ROC curve, we clearly found that removing any RNA type would significantly decrease the predictive ability. Therefore, single selective type of RNA was difficult to construct an sufficient precise prognosis model.
In this study, we evaluated the correction between survival and all the DERs, only CRC survival-related DERs were chosen to further analysis in LASSO algorithm. Finally, in our multi-RNA-based classifier, only 12 mRNA, 1 miRNA and 1 lncRNA were retained. Among them, MIR27A [44,45], HOTAIR [46], MMP1 [47], MS4A1 [48], GIF [49] were previously reported to be related with CRC patient prognosis. As the only miRNAs, MIR27A directly downregulated the tumor suppressor FBXW7 and mediate selective activation of NOTCH, JUN and MYC signaling [45]. Moreover, long non-coding RNA HOTAIR in our classifier also silenced metastasis suppressor genes by recruiting the PRC2 complex to specific target genes [50]. These basic findings may, in part, account for the risky role of the two RNAs in classifier(HR > 1). Moreover, to our knowledge, we are the first to report the prognostic value of the other 9 RNAs(FAM132B, CPB1, NXPH4, OSR1, RNF112, TNNT2, IZUMO2, OSR1, PCOLCE2), which may provide valuable directions for the future research.
In summary, we constructed a powerful multi-RNAbased classifier which could effectively stratify CRC patients into groups at low and high risk of mortality. Further analysis revealed that our classifier was not only independent of clinical features but also with a similar prognostic ability to the well established TNM stage. Furthermore, to help clinician to evaluate survival probability of CRC patients, we integrated the multi-RNA-based classifier with traditional clinicopathological risk factors to make a quantitative nomogram. Base on our knowledge, this is the first report that combines multiple type of RNA to improve the current CRC prognosis. However, there were still some limitations in this research. In particular, some clinical information was incomplete, which made our study susceptible to the inherent biases. Moreover, the censoring rate of TCGA dataset was too high , which may affecte the reliability of this study. Clearly, our results should be further validated by prospective study in multicentre clinical trials.

Data collection
All 635 patients' data of CRC including RNA expression (mRNA, miRNA and lncRNA) and corresponding clinical information were retrieved from The Cancer Genome Atlas (TCGA) data portal. Both the expression profiles and clinical outcome are publicly available and open-access. Among the TCGA CRC cohort, 50 patients have expression data from both normal and tumor tissue simultaneously were used to screen differentially expressed RNAs (DERs). To validate the prognostic value of the integrated classifier obtained from the TCGA CRC cohort, external published datasets which have mRNA, lncRNA and miRNA profiles for the same CRC patients were retrieved from Gene Expression Omnibus(GEO). Finally, independent expression datasets with a total of 65 CRC patients were downloaded. Owing to the data were separately stored in different files, the barcodes of each sample were used to crossreference the expression profiles and clinical outcome. The data collection was conducted in compliance with the publication guidelines and policies for the protection of human subjects provided by TCGA and GEO.
Before analyze the downloaded data, mRNA and miRNA expression profiles were annotated based on Refseq transcript ID and/ or Ensembl gene ID as previously described [2]. In addition, LncRNA expression profiles from patients in TCGA were retrieved from the Atlas of Non-coding RNAs in Cancer (TANRIC, http:// bioinformatics.mdanderson.org/main/TANRIC:Overview) database [51], and any lncRNAs that overlapped with any given mRNAs were filtered out. By analysis of the download data, some patients do not meet the following criteria were eliminated in the present study: (1) a histological diagnosis of CRC (2) patients with fully clinical features including sex, age, tumor location, local invasion, lymph node metastasis, distal metastasis, differentiation grade, pathologic stage, survival information (Table 1); (3) patients were still alive at least 1 month after initial pathologic diagnosis.

Identification of differentially expressed RNAs between CRC and normal tissue
The analysis was carried out in training set which contain 50 adjacent normal and 498 CRC sample, by using the R/Bioconductor package of redgeR and limma respectively, as previous described in detail. The expression differences were characterized by logFC (log2 fold change) and associated adj.P.Value. LogFC indicates the fold change in expression of each miRNA from CRC to normal tissue. Down and up-regulated RNAs were assigned a logFC < −2 and logFC >2 respectively, with adj.P.Value < 0.05. The RNAs identified to be differentially expressed by both of the algorithms were selected as DERs. In order to further assess the accurate of the DERs, hierarchical clustering analysis was also applied to categorize the data based on the similar expression patterns by using heatmap.2 function of the R/package gplots with complete linkage.

Survival analysis
The differences clinical features including sex, age, tumor location, local invasion, lymph node metastasis, distal metastasis, differentiation grade, pathologic stage, survival status between training set, internal testing set and independent validation set were analyzed using the chi-square test. A univariate Cox model was performed to investigate the relationship between the continuous expression level of each DERs and OS, and for preliminary screening of clinical variables that were correlated with the OS of patients with CRC. Hazard ratios (HRs) and P-value from univariate Cox regression analysis were used to identify candidate survival-related DERs. DERs with HR for death > 1 were defined as high-risk RNAs a and those with HR < 1 were defined as a protective RNAs. The common DERs meet criteria of P-value <0.05 were selected as survival-related DERs and further analyzed in LASSO regression to identity the most powerful prognostic markers. Finally, a multi-RNA-based classifier was constructed for predicting the OS. In order to quantify the risk of OS, a standard form of risk score(RS) for each CRC patient was calculated combine the relative expression levels of the RNAs (Exp i ) www.impactjournals.com/oncotarget and LASSO coefficients (L i ), Risk score = ∑ × = Exp L i i n i 1 . To divide the patients into the high or low risk group, the best cutoff RS was determined when the sensitivity and specificity in the ROC curve achieved optimum for predicting 5-year survival of the training set. Kaplan-Meier curves were used to estimate the survival for patients in training, testing and validation set between high risk and low risk group. As a powerful predictive factor, whether the prognostic value of the multi-RNA-based classifier is independent of clinical feature was assessed by multivariate Cox regression model. More importantly, to investigate the performance of the prognostic classifier and clinical features in predicting CRC patient outcome, the area under the curve (AUC) of the receiver-operator characteristic (ROC) was calculated and compared.
We used R software version 3.3.3 and the "TimeROC" package to do the time-dependent ROC curve analysis. Moreover, "glmnet" package was used to do the LASSO Cox regression model analysis. Nomogram plots were done with the rms package. All the other statistical tests were done with R software version 3.3.3 and corresponding fundamental package.