Integrative Analysis and Identication of an Excellent lncRNA Signature to Predict Prognosis in Patients with COAD

Backgroud: Tumour recurrence and metastasis lead to poor prognosis incolon cancer(COAD). Therefore We aimed to identify a lncRNA signature through an integrative analysis of copy number variation, mutation and transcriptome data to predict prognosis and explore its internal mechanism. Methods: The lncRNA expression prole were collected fromThe Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO). TCGA data was randomly divided 3:1 intotraining andtesting cohort. In the training, weperformed integrated analyses of three candidate lncRNA sets that correlated with prognosis, copy number variations and mutations to establish a signature through Cox regression analysis. The robustness was determined in the testing and GEO. Results: An 11-lncRNA signature that was signicantly associated with prognosiswas constructed in the training (P<0.0001, HR=2.014) , And this signature was validated in the testing(P=0.0019, HR=3.374) and GSE17536(P=0.0076, HR=1.864). The signature is signicantly related to MSI status and clinical prognostic factors. The prognostic-relatedrisk scores were signicantly excellent than the other ve models have been reported. Furthermore, GSEA suggested that the signature was involved in COAD development and metastasis-related pathways. Conclusions: We identiedansignature has strong robustness and can stably predict the prognosis of COAD in different platformsand may be implicated in COAD pathogenesis and metastasis and applied clinically as a prognostic marker.


Introduction
Colorectal cancer (CRC) is the third most commonly diagnosed cancer [1]. As the treatment technology for CRC continues to advance, its mortality rate has declined for several decades. However, CRC is still the second most common cause of cancer-related death worldwide because of its high rates of tumour recurrence and metastasis, which cause a poor prognosis [2]. Therefore, it is important to nd a molecular model that can effectively identify and predict the risk of CRC recurrence and metastasis. Zhou Yiming [3] identi ed and structured a prognostic signature based on ve candidate genes, REG1B, TGM6, NTF4, PNMA5, and HOXC13, that could recognize COAD patients at a high risk of metastasis.
Long noncoding RNA (lncRNA) is a major type of noncoding RNA (ncRNA) de ned as an RNA transcript of more than 200 base pairs in length. Recently, several studies have shown that the aberrant expression of lncRNAs is closely related to the development of many human tumours, including CRC [4][5][6]. After the lncRNA CCAT1 (CARLo-5) was identi ed as an oncogene in COAD [7], a number of lncRNAs that are dysregulated in COAD have been identi ed [8], and the role of lncRNAs has recently received increasing attention. lncRNAs are characterized as oncogenes or tumour suppressor genes [9,10] and play a role in different biological processes in COAD. lncRNAs participate in transcriptional and epigenetic regulation by interacting with genomic DNA, transcription factors, chromatin, spliceosomes, chromatin regulators and other nuclear proteins [11], and lncRNAs are always involved in posttranscriptional, translational and posttranslational regulatory processes [12]. Various lncRNAs are involved in carcinogenesis and progression by regulating COAD cell proliferation, migration and invasion [7,13].
High-through put multi-omics sequencing data have laid a solid foundation for identifying genes associated with cancer prognosis. Multi-omics data analysis can reveal the mechanism of cancer development from multiple perspectives. To improve the predictive value and accuracy for COAD prognosis, we need to identify a robust lncRNA signature through an integrative analysis of prognosis-related lncRNA, copy number variation, mutation and transcriptome data with the help of multi-omics data analysis technology. In this study, We collected the data of the copy number variation, mutation and transcriptome of COAD tissues from The Cancer Genome Atlas (TCGA, n = 478) and the GEO(GSE17536 dataset, n = 177).
A signature was established through integrated analyses, and validated in the diferent platforms of the testing cohort and the GSE17536 cohort. The microsatellite instability (MSI) status and clinical independence of the lncRNA signature were analysed. We also explored the pathways associated with the development and metastasis of COAD enriched in the lncRNA signature (Fig. 1). Furthermore, Compared with other rognostic-related risk signatures, the signature was better of predicting prognosis in COAD patients.

Results
1 Comprehensive analysis of multi-omics data to obtain lncRNAs related to COAD prognosis 1.1 Lncrnas That Are Closely Related To Coad Prognosis According to the univariate Cox regression analyse, we identi ed a total of 483 candidate prognostic lncRNAs from the training cohort, and information on the top 20 lncRNAs is shown in Table 2. We obtained lncRNAs that are closely related to gene copy number variation. A total of 137 lncRNAs that were signi cantly ampli ed on each fragment of the COAD genome ( Fig. 2A), including LINC00392 on the 13q22.1 segment (q = 8.17E-12), LINC01598 on the 20q11.21 segment (q = 5.75E-09) and LOC730183 on the 16p11.2 segment (q = 0.0014485), were identi ed. On the other hand, a total of 261 lncRNAs that were signi cantly deleted on each fragment in the COAD genome ( Fig. 2B), including LINC00681 in the 8p22 segment (q = 2.74E-45), LOC101928728 in the 1p36.11 segment (q = 2.02E-09), and LINC00491 in the 5q22.2 segment (q = 1.48E-05), were identi ed.

lncRNAs that are closely related to gene mutations
Through MutSig2, we identi ed a total of 41 genes with signi cant mutation frequencies. The distribution of synonymous mutations, missense mutations, framework insertions or deletions, framework movements, nonsense mutations, cleavage sites and other nonsynonymous mutations in these 41 genes in TCGA COAD patient samples are shown in Fig. 3. We identi ed 41 genes, some of which have been reported to be closely related to the development of cancer, such as KRAS, TP53, APC, PIK3CA, and FBXW7. Among these 41 genes, we identi ed lncRNAs associated with gene mutations using each of the genes to mutate into a tag, and a rank-sum test was used to detect the difference between the expression of each lncRNA in the mutant and nonmutant groups. lncRNAs with a P-value < 0.01 were considered to be associated with a gene mutation; thus, we obtained 2712 lncRNAs related to gene mutations.

Identi cation Of An 11-lncrna Signature For Coad Survival
The comprehensive analysis revealed 147 lncRNAs associated with ampli cations, deletions, and mutations from a total of 483 candidate prognostic lncRNAs. We analysed the change trajectory of each independent variable (Fig. 4A). It can be seen that with the gradual increase in lambda, the number of independent coe cients becomes closer to zero. We used three-fold cross-validation to build the model. The con dence interval under each lambda is shown in Fig. 4B. As shown in the gure, the model was optimal when lambda = 0.04078231. For this reason, we selected the lncRNAs obtained when lambda = 0.04078231 as the target lncRNAs to construct the model.
After Lasso Cox regression narrowed the scope, we obtained 21 target lncRNAs that were used to construct the model. A multivariate Cox survival analysis was performed on 21 lncRNAs, and the 11 lncRNAs with the lowest AIC values (AIC = 767.27) were used to construct the nal model. Details of the 11 lncRNAs are shown in Table 3. The 11-lncRNA signature was then tested for its ability to predict survival in COAD patients. The risk score was calculated as the sum of the above gene expression values * the ordinal, and then we selected 0.9892846 as the cutoff (median risk score) and divided the samples into high-risk and low-risk groups. Finally, 247 patients were classi ed as low risk, and 110 patients were classi ed as high risk; signi cantly different OS rates were observed between the two groups in the training cohort (log-rank P < 0.0001, HR = 2.014) Fig. 5C). We acquired a ve-year AUC of 0.83 according to the ROC curve for predicting survival in COAD patients (Fig. 5B). As the patient's risk score increased, the OS rate signi cantly decreased, and the number of deaths in the high-risk group increased signi cantly (Fig. 5A). According to the changes in the expression levels of the 11 different lncRNAs in the signature observed with increases in the risk score, the expression of ENSG00000246627 was correlated with a low risk of COAD, and the other 10 lncRNAs were identi ed as risk factors based on their high expression and correlation with a high risk of COAD.
4 Validation of the 11-lncRNA signature in the testing and GSE17536 cohorts First, the 11-lncRNA signature was validated in the testing cohort; 88 patients were classi ed as low risk, and 31 patients were classi ed as high risk. There was a signi cant difference in OS between the two groups (log-rank P = 0.0019, HR = 3.374) (Fig. 6C). The ve-year AUC was 0.66 according to the ROC curve (Fig. 6B). The results of the testing cohort were similar to those of the training cohort; as the patient's risk score increased, the OS time decreased signi cantly, and the number of deaths in the high-risk group increased signi cantly (Fig. 6A). Moreover, 10 lncRNAs (all lncRNAs except ENSG00000246627) were identi ed as risk factors.
Similarly, we used the same model and the same cut-off from the training cohort and veri ed the model's robustness using an external independent data cohort (GSE17536). Ultimately, 99 patients were classi ed as low risk, and 78 patients were classi ed as high risk. A signi cant difference in OS was observed between the two groups (log-rank P = 0.0076, HR = 1.864) (Fig. 7C). The ve-year AUC was 0.71 according to the ROC curve (Fig. 7B). The GSE17536 cohort showed similar results to the TCGA training cohort. As the risk score increased, the survival time decreased signi cantly, and the number of deaths in the high-risk group increased. The expression of the 11 different signature lncRNAs also increased with the increase in the risk score, indicating that high expression of the 11 lncRNAs is associated with a high risk of COAD and could serve as a risk factor (Fig. 7A).
5 Independent predictive power of the 11-lncRNA signature according to the MSI status, tumour stage and clinicopathological characteristics The patients were subdivided into a high-risk subgroup and a low-risk subgroup based on different MSI statues, and the 11-lncRNA signature was used to predict OS; the OS rate was signi cantly different between the high-risk and low-risk subgroups in the MSI-L and MSS groups (excluding MSH) ( Fig. 8A-C). Based on their tumour stage, patients were subdivided into a high-risk group and a low-risk group in each stage, and the 11-lncRNA signature revealed no signi cant difference in OS at the II, III and IV stages (all stages except stage I) between the two groups ( Fig. 8D-G). Furthermore, these results demonstrate that the 11-lncRNA signature model can better predict the OS of patients with different MSI statuses and tumour stages.
We systematically analysed the clinical information of the TCGA and GSE17536 patients, including age, sex, lymph node invasion status, pathology (T, N, and M classi cations), tumour stage, and the 11-lncRNA signature grouping information (Table 4). In the TCGA training cohort, we found signi cant survival-related correlations in the clinicopathological characteristics, with the exception of age and sex, according to the univariate Cox regression analysis, but we found that only the risk score (HR = 4.39, 95% CI = 2.58-7.46, P = 4.26E-08), age, and stage III/IV vs sage I/II were signi cantly related to survival according to the multivariate Cox regression analysis. The 11-lncRNA signature was veri ed to be clinically independent.
In the TCGA testing cohort, the risk score, N1/N2 vs N0, M1 vs M0, and stage III/IV vs stage I/II were signi cantly associated with survival according to the univariate Cox regression analysis, but only the risk score (HR = 3.02, 95% CI = 1.13-8.08, P = 0.027) and M1 vs M0 were signi cantly related to survival according to the multivariate Cox regression analysis. The 11-lncRNA signature was also veri ed to be clinically independent.
In the GSE17536 cohort, the risk score and stage III/IV vs stage I/II were signi cantly associated with survival according to the univariate Cox regression analysis, but only the risk score (HR = 1.65, 95% CI = 1.02-2.67, P = 0.039), age and stage III/IV vs stage I/II were signi cantly associated with survival according to the multivariate Cox regression analysis.
In conclusion, the 11-lncRNA signature is a prognostic indicator independent of other clinical factors and shows independent predictive performance with clinical application value.
6 Identi cation of the 11-lncRNA signature-associated biological pathways in the training cohort The signalling pathways associated with the 11-lncRNA signature signi cantly enriched in the TCGA training cohort were detected by GSEA ( Table 5). The signalling pathways that were signi cantly enriched in the high-risk and low-risk groups, were the Notch signalling pathway, the VEGF signalling pathway, the P53 signalling pathway and the cell cycle; all were signi cantly associated with the development and metastasis of COAD (Fig. 9). The ROC and OS KM curves of the ve models are shown in Fig. 10. Signi cantly different OS rates were observed between the high-risk and low-risk groups using the six-lncRNA signature established by Zhao [17] (log-rank P = 0.0014,HR = 2.03) (Fig. 10B). We acquired a ve-year AUC of 0.65 and a ten-year AUC of 0.67 according to the ROC (Fig. 10A).
Signi cantly different OS rates were also observed between the two groups using the two-lncRNA signature established by Xue[18] (log-rank P = 0.018, HR = 1.73) (Fig. 10D), and we obtained a ve-year AUC of 0.54 and a ten-year AUC of 0.47 (Fig. 10C). Signi cantly different OS rates were also observed with the 14-lncRNA signature reported by Xing [19] (Fig. 10F), and we found a ve-year AUC of 0.66 and a ten-year AUC of 0.53 (Fig. 10E). In addition, the six-lncRNA signature established by Fan [20] (Fig. 10H) yielded a ve-year AUC of 0.64 and a ten-year AUC of 0.41 (Fig. 10G), and the 15-lncRNA signature obtained by Wang [21] (Fig. 10J) resulted in a ve-year AUC of 0.78 and a ten-year AUC of 0.67 (Fig. 10I). The nal comparison showed that our model was slightly better than the 15-lncRNA model and signi cantly better than the other four models.

Discussion
CRC is a common digestive tract tumour that is a serious threat to the health of patients. According to recent statistics, there are approximately 1.45 million new cases of CRC each year, which made it the third most prevalent cancer in 2018 [1] and the second most prevalent cancer in American males in 2019 [22]. Approximately 694,000 deaths have been reported every year, making the second most common cause of cancer-related death worldwide [2]. The recurrence and metastasis of CRC seriously affect the e cacy and prognosis of treatment. Identifying the risk of recurrence and metastasis can help us guide early intervention for the treatment of CRC, ultimately improving the prognosis. Therefore, it is very urgent to study and identify one or more e cient molecular models for predicting prognosis to guide treatment options and to improve the survival quality of CRC patients.
lncRNAs are a major class of ncRNAs with a length of more than 200 base pairs and are not translated into proteins [23,24]. Over the past decade, it has become clear that certain lncRNAs have strong potential, and an in-depth study is needed to elucidate their mechanism of action. lncRNAs not only control the nuclear structure[25] but also regulate the expression of adjacent genes and act as ampli ers, with remarkable tissue speci city through various mechanisms [26]. lncRNAs have been shown to directly regulate gene expression at the transcriptional, posttranscriptional and epigenetic levels. lncRNAs have long nucleotide chains and intricate secondary structures and can interact with genomic DNA, chromatin, transcription factors, chromatin regulators, spliceosomes and other nuclear proteins [27]. lncRNAs are known to serve as tumour regulators and participate in complex networks of biological regulation [28,29]. Therefore, the identi cation of lncRNAs closely related to tumour prognosis and the establishment of a signature that can predict the risk of tumour prognosis will be helpful for improving the prevention and treatment of tumours. Zhang GH [15] identi ed a novel four-lncRNA signature using Cox regression analysis to identify lncRNAs that correlated with the prognosis of 111 laryngeal cancer patients from the TCGA. The signature was shown to predict the prognosis of patients with laryngeal cancer and may in uence the prognosis of laryngeal cancer through many pathways, such as regulating immunity and tumour apoptosis. Jie Li [30] also identi ed a ve-lncRNA signature to predict the risk of tumour recurrence in breast cancer (BC) patients and found that it was independent of clinical prognostic factors, such as BC subtypes and adjuvant treatments.
Recently, an increasing number of researchers have focused on the role of lncRNAs in the development and progression of CRC and its signi cance to clinical prognosis [4,6]. Many lncRNAs, such as MALAT1 and HOTAIR, have also been used as biomarkers in CRC [31,32]. Several lncRNAs have not only been reported as markers in CRC diagnosis but also been shown to be correlated with patient prognosis (e.g., CCAL, PURPL and lnc-GNAT1-1) [33][34][35]. Therefore, this method was also used to identify a CRC-related lncRNA signature for predicting the prognosis of CRC [17][18][19][20][21] by analysing different datasets.
At the same time, as high-throughput multi-omics sequencing data have laid a solid foundation for identifying genes associated with cancer prognosis, multi-omics data analysis can reveal the mechanisms of cancer development from multiple perspectives [36]. We identi ed closely related genomics, gene mutations, epigenetics, and functions of lncRNAs that are inherently regulated by COAD [4,6]. We show that dysregulated lncRNAs may be new prognostic and diagnostic biomarkers or therapeutic targets for clinical applications. Therefore, we identi ed a robust lncRNA signature through an integrative analysis of prognosis-related lncRNA, copy number variation, mutation and transcriptome data with the help of multi-omics data analysis technology. In this study, we fully integrated and analysed lncRNAs that are related not only to prognosis but also to copy number variations, mutations and transcriptome regulation in 359 COAD samples from the TCGA training cohort. We developed an 11-lncRNA signature that was validated in testing and GSE17536 cohorts. We also found that the signature had good robustness and good predictive ability of the prognosis risk in other independent datasets. Moreover, we compared the 11-lncRNA signature with other COAD prognostic signatures [17][18][19][20][21] to validate the good prediction of the prognosis risk of the 11-lncRNA signature. We found that the AUC of the 11-lncRNA signature was better than that of the other ve signatures. Because the researchers did not con rm that the lncRNAs directly regulate gene expression at the transcriptional, post-transcriptional and epigenetic levels, we focused on the predictive role of lncRNAs in tumour prognosis and established lncRNA signatures to predict prognosis merely by analysing and identifying lncRNAs associated with prognosis. However, the intrinsic interactive relationship between lncRNAs, genomics, gene mutations and epigenetics was not investigated. Ultimately, the studies failed to identify a good lncRNA signature to predict the risk of CRC prognosis. Signi cant differences in OS outcomes between the high-risk and low-risk groups were obtained using the described method [37].
Furthermore, we found that the 11-lncRNA signature had independent predictive power from the MSI status and tumour stage (e.g., MSI-L and MSS patients and patients in stages II, III and IV, excluding stage I), which may be due to a good survival rate in COAD patients with stage I [22]. This study also showed that the 11-lncRNA signature was a prognostic indicator independent of other clinical factors and had independent predictive performance with clinical application value.
Additionally, we identi ed the 11-lncRNA signature-associated biological pathways that were signi cantly enriched in COAD patients as detected by GSEA. In summary, we determined that the Notch signalling pathway, the VEGF signalling pathway, the P53 signalling pathway and the cell cycle are signi cantly associated with the development and metastasis of COAD [38,39].
In summary, we constructed a novel 11-lncRNA signature that can be used to predict the prognosis of patients with COAD and exploited the possible underlying mechanisms involved. The 11-lncRNA signature may indicate the potential roles of lncRNAs in COAD pathogenesis. The results will provide molecular diagnostic markers and therapeutic targets with clinical implications in COAD patients. Finally, we hope that all of the above results will be veri ed in basic experiments and clinical trials in further studies.

Conclusion
In our study, we integrated analysis of the lncRNAs were not only related to the prognosis, but also related to copy number variation, mutation and transcriptome data to identify and constructe a lncRNA signature to predict prognosis in COAD patients. And the signature was validated in the testing and GSE17536 cohorts and was independent of the MSI status and clinical prognostic factors. The signature was signi cantly better than the other ve lncRNA signatures have been reported in the COAD.

Patients and data collection
We downloaded COAD RNA-Seq data, clinical follow-up information and copy number variation data from the SNP 6.0 chip in the TCGA database from the UCSC cancer browser (https://xenabrowser.net/datapages/), and we downloaded the mutation comment le (MAF) from the GDC client. We also downloaded the GSE17536 dataset, which included COAD expression pro le data and clinical information, from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/).
We preprocessed the downloaded data, downloaded the fragments per kilobase of transcript per million mapped reads (FPKM) RNA-Seq data from the TCGA. The R package "caret" was used to randomly divide the samples into the training cohort (359 samples) and the testing cohort (119 samples). The lncRNA expression pro le data is extracted according to the Ensemble ID of lncRNA in the GENCODE database. SeqMap was used to compare GSE17536 expression data (no mismatch was allowed). A total of 5,076 probes were annotated on the lncRNAs. All selected expression datasets in the training, testing and GSE17536 cohorts were log 2 transformed for standardization.
Ultimately, a total of 478 samples from the TCGA were randomly divided at a ratio of 3:1 into a training cohort (n = 359) and a testing cohort (n = 119), and 177 samples were obtained from the GEO database (GSE17536). We obtained clinical pathology data, including the patient age, survival status, sex, lymph node metastasis status, T classi cation, N classi cation, M classi cation and tumour stage (Table 1), from the three cohorts and found no signi cant differences among the groups. 3 Identi cation of lncRNAs that are closely related to gene copy number variation We used GISTIC 2.0 to identify genes with signi cant ampli cation or deletion from copy number variation data in the TCGA training cohort. We set a parameter threshold for fragments with ampli cation or deletion lengths greater than 0.1 and signi cant P-values (P < 0.05). The signi cantly ampli ed fragments in the genome and the genes ampli ed on each of the fragments were recorded and the signi cantly deleted fragments in the genome and the genes that were signi cantly deleted on each fragment were recorded and incorporated to establish the lncRNAs associated with copy number variation.
4 Identi cation of lncRNAs that are closely related to gene mutations We used MutSig2 to identify genes with signi cant mutations from the mutation annotation data of the TCGA training cohort, and with a threshold of P < 0.05. We identi ed lncRNAs associated with gene mutations using the rank-sum test to detect the difference in the expression of each lncRNA between the mutant and nonmutant groups, and each gene mutation was used as a label. lncRNAs with signi cant P values (P < 0.01) were considered to be associated with a gene mutation. Finally, the lncRNA dataset related to gene mutations was established.
5 Development and validation of the robustness of the lncRNA signature By intersecting these three lncRNA sets (Prognosis-related lncRNAs, copy number variation-related lncRNAs and mutationrelated lncRNAs), we obtained the targeted lncRNAs. The least absolute shrinkage and selection operator (Lasso) method, which is a compression estimate, was used to narrow the lncRNA range. We used the R package glmnet for the Lasso Cox regression analysis [14,15]. Furthermore, we performed a multivariate Cox regression analysis, and stepwise regression was used to reduce the number of lncRNAs again. The lowest Akaike information criterion (AIC) value as the nal model. RiskScore = coe cient*exp lncRNA1 +oe cient*exp lncRNA2 +oe cient*exp lncRNA3 +....+coe cient*exp lncRNAn .
The risk score of each sample in the TCGA training cohort was then calculated. Based on the median risk score, the COAD patients were divided into two groups: a high-risk group and a low-risk group. A receiver operating characteristic (ROC) curve was used to test the accuracy of lncRNA signature to predict prognosis. Finally, the lncRNA signature was veri ed in the testing cohort. The GSE17536 cohort used the same model and the same cutoff as the TCGA training cohort.  [21], for comparison with our lncRNA signature. To make the models comparable, we performed a multivariate Cox regression analysis to calculate the risk scores of the training set samples based on the corresponding genes in the three models. We evaluated the ROC of the ve models and then divided the samples into high-risk and low-risk groups according to the median risk score and analysed the difference in OS between the two groups. Figure 1 Work ow of identifying a COAD survival-related 11-lncRNA signature.