A 6 lncRNA-Based Risk Score System for Predicting the Recurrence of Colon Adenocarcinoma Patients

Colon adenocarcinoma (COAD) is a common type of colon cancer, and post-operative recurrence and metastasis may occur in COAD patients. This study is designed to build a risk score system for COAD patients. The Cancer Genome Atlas (TCGA) dataset of COAD (the training set) was downloaded, and GSE17538 and GSE39582 (the validation sets) from Gene Expression Omnibus database were obtained. The differentially expressed RNAs (DERs) were analyzed by limma package. Using survival package, the independent prognosis-associated long non-coding RNAs (lncRNAs) were selected for constructing risk score system. After the independent clinical prognostic factors were screened out using survival package, a nomogram survival model was constructed using rms package. Furthermore, competitive endogenous RNA (ceRNA) regulatory network and enrichment analyses separately were performed using Cytoscape software and DAVID tool. Totally 404 DERs between recurrence and non-recurrence groups were identified. Based on the six independent prognosis-associated lncRNAs (including H19, KCNJ2-AS1, LINC00899, LINC01503, PRKAG2-AS1, and SRRM2-AS1), the risk score system was constructed. After the independent clinical prognostic factors (Pathologic M, pathologic T, and RS model status) were identified, the nomogram survival model was built. In the ceRNA regulatory network, there were three lncRNAs, four miRNAs, and 77 mRNAs. Additionally, PPAR signaling pathway and hedgehog signaling pathway were enriched for the mRNAs in the ceRNA regulatory network. The risk score system and the nomogram survival model might be used for predicting COAD recurrence. Besides, PPAR signaling pathway and hedgehog signaling pathway might affect the recurrence of COAD patients.


INTRODUCTION
As a common malignancy occurring in the colon, colon cancer is divided into adenocarcinoma, undifferentiated carcinoma, and mucinous adenocarcinoma (1). Colon cancer is induced mainly by a high-fat low-fiber diet, and its symptoms include hematochezia, purulent stools, the change of bowel habits, abdominal mass, ileus, bellyache, and anemia (2). Colon cancer ranks third among the gastrointestinal cancers, and its morbidity and mortality rates in men are higher than those in women (3). The 5years survival rate of colon cancer after radical operation is about 60-70%, and the main reasons leading to the failure of surgical treatment are post-operative recurrence and metastasis (4,5). Therefore, the discovery of valuable molecular markers is important for the early prediction and treatment of colon cancer recurrence.
Long non-coding RNA (lncRNA) exerts a key regulatory effect in the development and progression of some diseases (6). In recent years, a variety of lncRNAs have been reported to be dysregulated in colon cancer. For example, the lncRNA small ubiquitin-like modifier 1 pseudogene 3 (SUMO1P3) is overexpressed in colon cancer, and is positively related to angiogenesis, metastases, advanced histological stages, and unfavorable outcome of colon cancer patients (7). Through activating the Wnt/β-catenin signaling, the lncRNA breast cancer anti-estrogen resistance 4 (BCAR4) accelerates the progression of colon cancer by promoting cell proliferation and suppressing cell apoptosis (8). The lncRNA cytoskeleton regulator RNA (CYTOR) facilitates epithelial-to-mesenchymal transition (EMT) and metastasis in colon cancer via interacting with β-catenin, which predicts poor prognosis and may be promising therapeutic target of colon cancer (9,10). These results suggested that lncRNAs might be used as biomarkers for predicting the prognosis of colon cancer.
Along with the development of bioinformatics, several researchers have developed lncRNA-related signatures for predicting prognosis of colon cancer. Xue et al. proposed a two-lncRNA expression signature to predict survival of patients with colon adenocarcinoma (11). Xing et al. developed a 14-lncRNA prognostic signature for patients with colon adenocarcinoma (12). Lv et al. identified a five-lncRNA prognostic signature to predict the survival of colon cancer patients (13). These studies proved the strong power of prognostic prediction value of the lncRNA signatures for colon cancer patients. However, seldom studies were conducted to develop lncRNA signatures for predicting recurrence of colon cancer.
As a competitive endogenous RNA (ceRNA), lncRNA can mediate gene expression through acting as microRNA (miRNA) sponge (14,15). The up-regulation of the lncRNA metastasis associated lung adenocarcinoma transcript 1 (MALAT1) mediates high mobility group box 1 (HMGB1) expression in colon cancer via competing with miR-129-5p, and the MALAT1/miR-129-5p/HMGB1 axis serves as a key prognostic marker in the development of the tumor (16). In this study, the lncRNAs, miRNAs, and mRNAs with differential expression between recurrence and non-recurrence colon adenocarcinoma (COAD) samples were analyzed. The recurrence prognosis-associated lncRNAs were screened, and then the independent prognosis-associated lncRNAs were further selected for constructing risk score system. Moreover, nomogram survival model construction, ceRNA regulatory network construction and enrichment analysis were conducted. Our findings might be conducive to predicting the recurrence of COAD patients.

Differential Expression Analysis
A total of 13,834 mRNAs, 827 lncRNAs, and 1,037 miRNAs were annotated from the TCGA transcriptomic RNA and miRNA datasets. The 310 COAD samples in the TCGA dataset were classified into recurrence (66 samples) and non-recurrence (244 samples) groups. Under the defined thresholds, 404 DERs (including 357 DE-mRNAs (122 down-regulated and 235 up-regulated), 26 DE-lncRNAs (eight down-regulated and 18 upregulated), and 21 DE-miRNAs (eight down-regulated and 13 upregulated) between recurrence and non-recurrence groups were screened out ( Figure 1A). Based on the expression of the DERs, the clustering heatmap is drew and presented in Figure 1B.
Subsequently, the risk score system based on the independent prognosis-associated lncRNAs was constructed, and the relevant formula was as follow: The median of the RSs of samples were calculated, and the samples in the TCGA dataset and the validation sets separately were divided into high and low risk group. Then, the correlation between the grouping and the actual recurrence prognosis information was evaluated using KM curves. The results showed that the grouping based on the risk score system had significant correlations with the actual recurrence prognosis in the TCGA dataset and the validation sets (Figure 2).

Establishment and Validation of Nomogram Survival Model
By using univariable and multivariable Cox regression analyses, pathologic M, pathologic T, and RS model status were selected as the independent clinical prognostic factors in the COAD samples ( Table 3). The COAD samples in lower pathologic M and pathologic T stages had better recurrence prognosis, which was consistent with clinical facts (Figure 3).
Moreover, a nomogram survival model was built based on the three independent prognostic factors ( Figure 4A). The 3-years survival probability and 5-years survival probability of patients could be easily calculated based on their pathologic M, pathologic T and RS. Additionally, the nomogram-predicted 3-years survival probability/5-years survival probability was further compared with the actual 3-years survival probability/5-years survival probability recorded in TCGA and results suggested that there is high agreement between nomogram-predicted probability of recurrence free survival and the actual recurrence free survival ( Figure 4B), indicating the good performance of nomogram survival model.

CeRNA Regulatory Network Construction and Enrichment Analysis
The lncRNA-miRNA regulatory network was constructed, which had seven nodes (three lncRNAs and four miRNAs) ( Figure 5). Meanwhile, the miRNA-mRNA regulatory network was built, involving 81 nodes (four miRNAs and 77 mRNAs) (Figure 6). Based on the lncRNA-miRNA-mRNA relationships, the ceRNA regulatory network (84 nodes, including three lncRNAs (two up-regulated and one down-regulated), four miRNAs (two upregulated and two down-regulated), and 77 mRNAs (21 upregulated and 56 down-regulated) was constructed (Figure 7).

DISCUSSION
In the present study, a total of 404 DERs (including 357 DE-mRNAs, 26 DE-lncRNAs, and 21 DE-miRNAs) between recurrence and non-recurrence groups were identified. After the 21 recurrence prognosis-associated lncRNAs were screened out, six independent prognosis-associated lncRNAs (including H19, KCNJ2-AS1, LINC00899, LINC01503, PRKAG2-AS1, and SRRM2-AS1) were further selected. Based on the independent prognosis-associated lncRNAs, the RS system was constructed. Pathologic M, pathologic T, and RS model status were identified as the independent clinical prognostic factors, and then the nomogram survival model was built. This nomogram survival model might be used to predict 3/5 years recurrence free survival in future clinical practice, by combining pathologic M, pathologic T and the RS calculated from the expression level of the 6 lncRNAs detected by qRT-PCR from surgical specimens. High mobility group AT-hook 1 (HMGA1) is inhibited by H19 short hairpin RNA (shRNA) and is promoted by miR-138 inhibitor, and the H19-miR138-HMGA1 pathway plays important roles in mediating the invasion and migration of colon cancer (17). H19 expression is significantly up-regulated in immunodeficient mice induced by colon cancer cells, and H19 may be taken as a novel therapeutic target in colon cancer (18). H19 can suppress vitamin D receptor (VDR) expression via miR-675-5p, and increased H19 leads to the resistance to 1,25(OH)2D3 treatment in the advanced colon cancer (19). LINC00899 is elevated in the serum and bone marrow of acute myeloid leukemia (AML) patients, therefore, serum LINC00899 may be a promising marker for the early diagnosis and prognosis of AML (20). These suggested that H19 and LINC00899 might be involved in the recurrence prognosis of COAD. Thirteen lncRNAs (including KCNJ2-AS1) are selected as prognostic biomarkers, based on which a prognostic signature is constructed for predicting the disease free survival in gastric cancer patients (21). Potassium voltage-gated channel subfamily J member 2 (KCNJ2)/Kir2.1 belongs to the inwardly rectifying potassium channel family, which regulates drug resistance and cell growth through mediating mitochondrial 37S ribosomal protein MRP1 (MRP1)/ATP binding cassette subfamily C member 1 (ABCC1) expression in small-cell lung cancer (22). Up-regulated LINC01503 contributes to cell proliferation, growth, invasion, and migration in esophageal squamous cell carcinoma (ESCC), which may be used as a marker of aggressive ESCC (23).
Overexpressed LINC01503 regulates forkhead box K1 (FOXK1) expression by competing with miR-4492, which promotes cell proliferation and invasion in colorectal cancer (24). Therefore, KCNJ2-AS1 and LINC01503 might also be correlated with the recurrence prognosis of COAD. The genetic variation in a candidate pathway contributes to the risk of both colon and rectal cancers, and protein kinase AMP-activated non-catalytic subunit gamma 2 (PRKAG2), mechanistic target of rapamycin kinase (FRAP1), TSC complex subunit 2 (TSC2), and protein kinase AMP-activated catalytic subunit alpha 1 (PRKAA1) genes involved in the pathway are significantly related to colon cancer (25,26). Through influencing the alternative splicing of downstream genes, the S346F mutation in serine/arginine repetitive matrix 2 (SRRM2) promotes the susceptibility of papillary thyroid carcinoma (27). Erb-b2 receptor tyrosine kinase 2 (ERBB2) overexpression is correlated with increased metastasis of human cancers, and the depletion of SRRM2, splicing factor, arginine/serine-rich 1 (SFRS1), SFRS9, and SFRS10 proteins reduced the migration rate of ovarian cancer cells overexpressing ERBB2 (28). These indicated that PRKAG2-AS1 and SRRM2-AS1 might play roles in the recurrence prognosis of COAD.
Through interacting with T cell transcription factor-4 (Tcf-4) and beta-catenin, peroxisome proliferator activated receptor gamma (PPAR gamma) may determine colon cell fate and serve as a target of the Wnt pathway in colon cancer cells (29). Hedgehog signaling pathway functions in gastrointestinal development and affects the formation of multiple tumors, which predicts unfavorable outcomes in colon cancer patients (30). The mRNAs in the ceRNA regulatory network were involved in PPAR signaling pathway and hedgehog signaling pathway, suggesting that PPAR signaling pathway and hedgehog signaling pathway might be related to the recurrence prognosis of COAD. (C) The KM curves for the validation set GSE39582; (D) The ROC curves for the TCGA dataset and the validation sets; In KM curves, blue and red curves separately represent low risk group and high risk group. In ROC curves, red, black, and blue curves represent the TCGA dataset, the validation set GSE17538, and the validation set GSE39582, respectively. HR, hazard ratio; AUC, area under the receiver operating characteristic curve.     However, there are some limitations in this study. First, the differential expression of the 6 lncRNAs was identified from RNA sequence data. Experiment validation of these lncRNAs in colon cancer patients should be conducted in further research. Second, the ceRNAs related with these lncRNAs should also be validated in in vitro and in vivo studies.
In conclusion, 357 DE-mRNAs, 26 DE-lncRNAs, and 21 DE-miRNAs were identified between recurrence and nonrecurrence groups. Besides, the risk score system (involving H19, KCNJ2-AS1, LINC00899, LINC01503, PRKAG2-AS1, and SRRM2-AS1) and the nomogram survival model might be useful for the prediction of COAD recurrence. Moreover, PPAR signaling pathway and hedgehog signaling pathway might be correlated with the recurrence prognosis of COAD patients.

Data Source
From The Cancer Genome Atlas (TCGA) database (https:// cancergenome.nih.gov/), the transcriptomic RNA (including 512 samples) and miRNA (including 461 samples) expression data of COAD (downloaded in May 15, 2019; platform: Illumina HiSeq 2000 RNA Sequencing) was extracted. The two groups of samples were paired according to the sample numbers, and then were corresponded to the downloaded clinical information. Finally, a total of 310 paired COAD samples with recurrence prognosis information were screened and used as the training set.
Taking "colon adenocarcinoma" and "Homo sapiens" as searching words, the eligible datasets were selected from Gene

Construction of Risk Score System
Based on the recurrence information of the 310 COAD samples in the TCGA dataset, the DE-lncRNAs having significant associations with recurrence prognosis were selected combined with the univariable Cox regression analysis in the survival package (version 2.41-1, http://bioconductor.org/ packages/survivalr/) in R (34). The threshold was set as log-rank p-value < 0.05. From the recurrence prognosis-associated lncRNAs, the independent prognosis-associated lncRNAs were further selected using the multivariable Cox regression analysis in the survival package in R (34). Afterwards, the risk score system was built based on the expression levels and independent prognostic coefficients of the independent prognosis-associated lncRNAs. The risk scores (RSs) of the COAD samples were calculated using the formula below: Risk score (RS) = β lncRNA ×Exp lncRNA β lncRNA indicates the independent prognostic coefficient of independent prognosis-associated lncRNA, and Exp lncRNA stands for the expression level of independent prognosisassociated lncRNA. The median of the RSs of the COAD samples in the TCGA dataset were calculated and used to divide the samples into high risk group and low risk group. Using the Kaplan-Meier (KM) curve in the R package survival (34), the association between the grouping and the actual recurrence prognosis information was analyzed. Similarly, the prediction efficiencies of the risk score system in the validation sets were evaluated.

Construction of Nomogram Survival Model
Using the univariable and multivariable Cox regression analyses in the R package survival (34), the independent clinical prognostic factors in the COAD samples in the TCGA dataset were identified. The log-rank p-value < 0.05 was selected as the significant threshold. Using the R package rms (version 5.1-2, https://cran.r-project.org/web/packages/rms/index.html) (35), nomogram survival model was constructed based on the independent clinical prognostic factors and the predicted risk information.
The starBase database (version 2.0, http://starbase.sysu.edu. cn/, including the information of RNA22, PITA, targetScan, picTar, and miRanda databases) (38) was applied for searching the targets of the miRNAs implicated in the lncRNA-miRNA regulatory network. The regulatory relationships included by at least two of RNA22, PITA, targetScan, picTar, and miRanda databases were taken as the miRNA-target pairs. Subsequently, the pairs involving negatively correlated miRNAs and DE-mRNAs were further screened. Moreover, Cytoscape software (37) was utilized to visualize the miRNA-mRNA regulatory network.
The lncRNA-miRNA-mRNA regulatory relationships were obtained through integrating the lncRNA-miRNA and miRNA-mRNA pairs. Based on the lncRNA-miRNA-mRNA regulatory relationships, the ceRNA regulatory network was built by Cytoscape software (37). Using DAVID online tool (version 6.8, https://david.ncifcrf.gov/) (39), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and Gene Ontology (GO) functional enrichment analyses for the mRNAs in the ceRNA regulatory network were carried out. The p-value < 0.05 was selected as the threshold of enrichment significance.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in The Cancer Genome Atlas (https://portal.gdc.cancer.gov/); the NCBI Gene Expression Omnibus (GSE17538, GSE39582).