Prognosis prediction model based on competing endogenous RNAs for recurrence of colon adenocarcinoma

Colon adenocarcinoma (COAD) patients who develop recurrence have poor prognosis. Our study aimed to establish effective prognosis prediction model based on competing endogenous RNAs (ceRNAs) for recurrence of COAD. COAD expression profilings downloaded from The Cancer Genome Atlas (TCGA) were used as training dataset, and expression profilings of GSE29623 retrieved from Gene Expression Omnibus (GEO) were set as validation dataset. Differentially expressed RNAs (DERs) between non-recurrent and recurrent specimens in training dataset were screened, and optimum prognostic signature DERs were revealed to establish prognostic score (PS) model. Kaplan-Meier survival analysis was conducted for PS model, and GEO dataset was used for validation. Prognosis prediction efficiencies were evaluated by area under curve (AUC) and C-index. Meanwhile, ceRNA regulatory network was constructed by using signature mRNAs, lncRNAs and miRNAs. We identified 562 DERs including 42 lncRNAs, 36 miRNAs, and 484 mRNAs. PS prediction model, consisting of 17 optimum prognostic signature DERs, showed that high risk group had significantly poorer prognosis (5-year AUC = 0.951, C-index = 0.788), which also validated in GSE29623. Prognosis prediction model incorporating multi-RNAs with pathologic distant metastasis (M) and pathologic primary tumor (T) (5-year AUC = 0.969, C-index = 0.812) had better efficiency than clinical prognosis prediction model (5-year AUC = 0.712, C-index = 0.680). In the constructed ceRNA regulatory network, lncRNA NCBP2-AS1 could interact with hsa-miR-34c and hsa-miR-363, and lncRNA LINC00115 could interact with hsa-miR-363 and hsa-miR-4709. SIX4, GRAP, NKAIN4, MMAA, and ERVMER34–1 are regulated by hsa-miR-4709. Prognosis prediction model incorporating multi-RNAs with pathologic M and pathologic T may have great value in COAD prognosis prediction.


Background
Colon cancer is the first leading cancer type in digestive system, and approximately 101,420 new cases of colon cancer will be diagnosed in United States in 2019 with the estimated deaths of 51,020 [1]. Colon adenocarcinoma (COAD) counts for more than 80% of all colon cancers, and the remaining small proportion of colon cancers are related to sarcomas and squamous cell carcinomas [2,3]. Retrospectively study conducted by Cass et al. has revealed that 37% patients developed local recurrence and distant metastases after complete primary resection, and local recurrence without clinical evidence of distant metastases was the most common cause of death within 5 years [4]. Even though survival of patients with colon cancer has been improved along with the developing of new chemotherapeutics and advanced techniques, the prognosis of patients developed recurrence still remains poor and early recurrence have the possibility to be cured by salvage surgery [5,6]. Thus, it is urgently to find the prognostic markers and construct effective prognosis prediction model for the recurrence of COAD.
It is widely acknowledged that competing endogenous RNAs (ceRNAs) could interact with mRNAs by competing with miRNAs, and miRNA-mediated interactions between long non-coding RNAs (lncRNAs) and mRNAs exist in the progressing of various diseases [7][8][9]. The lncRNA H19 has been reported to function as a ceRNA for miR-138 and miR-200a to suppress the target gene expressions of Vimentin, ZEB1, and ZEB2, thereafter to promote epithelial to mesenchymal transition in colorectal cancer [10]. Moreover, lncRNA H19 could also up-regulate cancer-related mRNA expressions by functioning as a ceRNA to influence phosphatidylinositol-3-kinase (PI3K)/Akt signaling pathway, and lncRNA H19 is associated with poor prognosis in colorectal cancer [11]. The lncRNA ATB could inhibit the expression of E-cadherin, thus contributing to the colon cancer progression via epithelial to mesenchymal transition, and higher levels of lncRNA ATB are related to poor prognosis [12]. Recently, the ceRNA regulatory networks of lncRNAs, miRNAs and mRNAs have been also constructed in many studies to reveal the molecular mechanisms involved in initiation and progression of human colon adenocarcinoma [13][14][15]. However, there are few studies about ceRNAs based prediction models of COAD recurrence. In the study of Chen et al., they established ceRNA network for differentially expressed RNAs (DERs) between 15 non-recurrent and 98 recurrent COAD patients from The Cancer Genome Atlas (TCGA) database [16]. Meanwhile, Chen et al. identified the prognostic lncRNA markers by using multivariate regression and established a nomogram for recurrence prediction based on the identified lncRNA biomarkers and clinical covariates [16].
In order to further reveal the recurrence prediction markers, the expression profilings of 367 COAD samples (279 non-recurrent and 88 recurrent ones) downloaded from TCGA were used as training dataset, and expression profilings of 65 COAD samples from Gene Expression Omnibus (GEO) were considered as validation dataset. Optimum prognostic signature DERs were identified via the L1 penalized lasso estimation in addition to multivariate regression analysis. Prognosis prediction efficiency of the ceRNAs-based prognostic score prediction model was compared with that of clinical prognostic prediction model.

Data source and preprocessing
The mRNA, lncRNA and miRNA sequencing data of COAD samples based on the platform of Illumina HiSeq 2000 RNA Sequencing were downloaded from TCGA (https://portal.gdc.cancer.gov/projects/TCGA-COAD). The mRNA along with lncRNA expression profilings (RNA-seq data) of 512 COAD samples and miRNA expression profilings (miRNA-seq data) of 461 COAD samples were downloaded from TCGA A total of 367 COAD samples that had both of mRNA expression profilings and miRNA expression profilings were obtained by mapping the clinical data for each sample. These expression profilings of the 367 COAD samples, including 279 non-recurrent and 88 recurrent specimens, were used as training dataset.
Meanwhile, RNA expression profilings of human colon adenocarcinoma were retrieved by searching the keywords of "colon adenocarcinoma" and "Homo sapiens" from National Center for Biotechnology Information (NCBI) GEO (https://www.ncbi.nlm.nih.gov/geo/) for validation. The datasets were retained when meet the following criteria: expression profilings of tumor tissue samples from COAD patients; containing lncRNA, mRNA and miRNA expression profilings; with corresponding available clinical information about recurrence and prognosis. Subsequently, GSE29623 [17] consisted of lncRNA, mRNA and miRNA expression profilings of 65 colon adenocarcinomas based on the platforms of GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array and GPL11162 NIH Taqman Human MicroRNA Array v.2 (https://www.ncbi.nlm.nih.gov/geo/query/acc. cgi?acc=GSE29623) was used as validation dataset.

Construction and validation of prognosis prediction model based on prognostic score
The DERs related to prognosis were firstly uncovered by using the univariate Cox regression analysis in Survival package (Version2.41-1, http://bioconductor.org/packages/survivalr/) [22]. Then, multivariate Cox regression analysis was conducted to screen the independent prognostic DERs with the threshold of log-rank p value < 0.05. Thereafter, the optimum prognostic signature DERs were identified via the L1 penalized lasso estimation based Cox-Proportional Hazards (Cox-PH) model in penalized package (Version0.9-50, http://bioconductor.org/packages/penalized/) of R3.4.1. When the maximal cross-validation likelihood run 1000 times, the optimal lambda was determined. The prognostic score was calculated based on the identified optimum prognostic signature DERs according to the following formula: Where β DERs referred to the LASSO coefficients of signature DERs, and Exp DERs denoted the expression levels of signature DERs.
The PS value of each sample in training dataset was calculated, and samples were divided into high risk group (PS > median value) and low risk group (PS < median value) by using median value as the cut-off point. The Kaplan-Meier curves were plotted by survival package (Version2.41-1) of R3.4.1 [22]. At the same time, the PS value of each sample in validation dataset was also calculated using the expression levels, and samples were divided into high risk group and low risk group with median value as the cut-off point. The Kaplan-Meier curves for samples in validation dataset were plotted by survival package (Version2.41-1) of R3.4.1 [22].

Screening independent prognostic clinical factors and prognosis prediction efficiencies
The independent prognostic clinical factors for training dataset were screened by using the univariate and multivariate Cox regression analyses in Survival package (Ver-sion2.41-1, http://bioconductor.org/packages/survivalr/) [22] with the threshold of log-rank p value < 0.05. Afterwards, the clinical prognostic prediction model was constructed based on the screened independent prognostic clinical factors.
The target genes of differentially expressed miRNAs were predicted based on the starBase (Version 2.0, http://starbase.sysu.edu.cn/) [27], which contains prediction data from five databases including targetScan, picTar, RNA22, PITA and miRanda. The regulatory interactions predicted in at least one of the five databases were remained. Signature mRNAs identified above were mapped into the predicted target genes to construct the signature miRNA-mRNA regulatory network, which was also visualized by Cytoscape (Version 3.6.1, https://cytoscape.org/) [26]. Subsequently, lncRNA-miRNA regulatory network was integrated with signature miRNA-mRNA regulatory network to construct the ceRNA regulatory network by Cytoscape (Version 3.6.1, https://cytoscape.org/) [26].

Screened DERs between non-recurrent and recurrent COAD samples
After annotation, 13,834 mRNAs, 539 lncRNAs and 540 miRNAs were obtained in training and validation datasets. In total, 562 DERs were identified by Limma between 279 non-recurrent and 88 recurrent COAD samples in the TCGA training dataset, including 42 lncRNAs (20 up-regulated and 22 down-regulated ones), 484 mRNAs (393 up-regulated and 131 down-regulated ones), and 36 miRNAs (30 up-regulated and 6 downregulated ones) (Fig. 1a, b). Meanwhile, the hierarchical clustering analysis results showed the expression levels of differentially expressed lncRNAs, miRNAs or mRNAs could well distinguish the non-recurrent COAD samples from recurrent COAD samples (Fig. 1b).
The PS value of each sample in training dataset was calculated based on the LASSO coefficients (Fig. 2a) and expression levels of the 17 signature DERs. Samples were divided into high risk group and low risk group with the median value of 1.218 as the cut-off point (Fig. 2b). The survival time and recurrent status of patients in high risk group and low risk group showed that patients in low risk group had obviously longer survival time and lower proportion of recurrence (Fig. 2c). Meanwhile, expression levels of signature DERs in each sample revealed that the 3 protective RNAs were up-regulated in low risk group, while the other 14 risky RNAs were up-regulated in high risk group (Fig. 2d).
The Kaplan-Meier curves of training dataset showed high risk group and low risk group classified by PS prediction model had significantly different prognosis (p = 1.062e-10) (Fig. 3a). The AUC of 1-year, 3-year, and 5-year ROC of training dataset was 0.971, 0.949, and 0.951, respectively. With regarding to the validation dataset, high risk group also showed significantly lower survival ration in comparison with the low risk group (p = 3.491e-02) (Fig. 3b). Meanwhile, the AUC of 1-year, 3-year, and 5-year ROC of validation dataset was 0.841, 0.837, and 0.909, respectively.

Independent prognostic clinical factors and prognosis prediction efficiencies of different models
Three independent prognostic clinical factors, i.e. pathologic distant metastasis (M) (p = 2.65E-02), pathologic primary tumor (p = 2.23E-02), and PS model status (p = 1.29E-05), were found according to univariate and multivariate Cox regression analyses ( Table 2). Moreover, the Kaplan-Meier curves of pathologic M and pathologic T showed low pathologic M or pathologic T stage had significantly higher survival ratio (Fig. 4a, b). The prognosis prediction model constructed by both of pathologic M and pathologic T had better prognosis prediction efficiency (5-year AUC = 0.712, C-index = 0.680) than prediction model of pathologic M or pathologic T alone. The prognosis prediction model constructed by mRNAs (C-index = 0.746), or multi-RNAs (C-index = 0.788) had a C-index value larger than 0.7, suggesting acceptable discriminatory power. Moreover, prognosis prediction model constructed by multi-RNAs combined pathologic M and pathologic T had a 5-year AUC of 0.969 and a C-index of 0.812, suggesting excellent discriminatory power (Fig. 4c).

ceRNA regulatory network
Based on the DIANA-LncBasev2 database, the regulatory interactions between signature lncRNAs and miR-NAs were obtained. The lncRNA NCBP2-AS1 could interact with hsa-miR-34c and hsa-miR-363, and lncRNA LINC00115 could interact with hsa-miR-363 and hsa-miR-4709. The target genes of signature miR-NAs were predicted by starBase and regulatory interactions involving signature mRNAs were remained. SIX4, GRAP, NKAIN4, MMAA, and ERVMER34-1 are regulated by hsa-miR-4709. All these regulatory interactions were used for ceRNA regulatory network construction.

Discussion
Colon adenocarcinoma is one of the most common causes of death in adults, causing a major public health problem over the world [1]. The current strategies for treating colon cancer are mainly radical surgery and chemotherapy, while the prognosis of colon cancer patients who develop distant metastasis and local recurrence after treatments would become worse [28,29]. Therefore, effective novel biomarkers and reliable prognosis prediction model for COAD recurrence are of great help in selecting treatment strategies for patients according to the recurrence risk. In our study, 562 DERs (42 lncRNAs, 36 miRNAs, and 484 mRNAs) were identified between 279 non-recurrent and 88 recurrent COAD sample profilings downloaded from the   [30]. Another prognostic model composed of six significant prognostic factors (age, first-degree relative cancer history, differentiation grade, vessels/nerves invasion, TNM stage and HALP) has a 5-year AUC of 0.73 for patients with locally advanced colorectal cancer [31]. In our study, prognosis prediction model constructed by incorporating multi-RNAs with pathologic M and pathologic T has a 5-year AUC of 0.969 and a Cindex of 0.812, suggesting this prognosis prediction model may have great value in COAD.
In the constructed ceRNA regulatory network, lncRNA NCBP2-AS1 could interact with hsa-miR-34c and hsa-miR-363, and lncRNA LINC00115 could interact with hsa-miR-363 and hsa-miR-4709. SIX4, GRAP, NKAIN4, MMAA, and ERVMER34-1 were regulated by hsa-miR-4709. It has been reported that lncRNA LINC00115 might be a prognostic lncRNA in lung adenocarcinoma, and functions as a ceRNA by interacting with hsa-miR-7 to regulate FGF2 [32]. Zhang et al. have identified the differentially expressed lncRNAs related to cancer recurrence, and revealed that lncRNA LINC00115 is one of the markedly up-regulated lncRNAs in cancer associated with disease-free survival [33]. In the study of Li et al., hsa-miR-4709 was identified to be significantly up-regulated in colon cancer and high levels of hsa-miR-4709 was associated with poor prognosis [34]. As one member of the sine oculis homeobox (SIX) homolog family, e SIX4 have been found up-regulated in colorectal patients, which is significantly related to lymph node metastasis, advanced Tumor Node Metastasis (TNM) stages, and unfavorable prognosis [35]. Moreover, knocking down of SIX4 could suppress colorectal cell metastasis via in-activation of the PI3K/Akt signaling pathway. Additionally, SAM68 (Src-associated in mitosis, 68 kDa) has been reported overexpressed in colorectal cancer and higher expressions were related with poorer prognosis in colorectal cancer [36]. SAM68 could interact with GRAP to active oncogenic pathways, such as epidermal growth factor and PI3K/Akt signaling pathways, thereby contributing to cancer progressing [37,38]. Thus, lncRNA LINC00115 identified as an optimum prognostic signature DERs in our study, may highly correlated with COAD recurrence via functioning as a ceRNA by interacting with hsa-miR-4709 to regulate expressions of SIX4, GRAP, NKAIN4, MMAA, and ERVMER34-1. Moreover, in order to be further used in clinical, the prediction efficacy of the constructed prognosis prediction model, which incorporating multi-RNAs with pathologic M and pathologic T, should be considered to be validated by using a larger number of clinical samples.

Conclusion
In conclusion, 562 DERs (42 lncRNAs, 36 miRNAs, and 484 mRNAs) were identified between 279 non-recurrent and 88 recurrent COAD samples downloaded from the TCGA. PS prediction model based on 17 optimum