Construction of competitive endogenous RNA network related to circular RNA and prognostic nomogram model in lung adenocarcinoma

Early researches have revealed that circular RNA (circRNA) had the potential of biomarkers and could affect tumor progression through regulatory networks. However, few research focused on the function of circRNA in lung adenocarcinoma and the regulation mechanism of competitive endogenous RNA. In present study, through differential expression analysis, 10 circRNAs, 98 miRNAs(microRNA) and 2497 mRNAs were screened. Based on the 10 circRNAs and related databases, a competitive endogenous RNA regulatory network (ceRNA network) containing 7 circRNAs, 13 miRNAs and 147 mRNAs was constructed. KEGG and GO analysis suggested that 147 mRNAs were obviously enriched in biological pathway related to LUAD. By constructing a PPI network, 12 hub genes were identified by MCODE. The result of survival analysis showed that 10 hub genes (BIRC5, MKI67, CENPF, RRM2, BUB1, MELK, CEP55, CDK1, NEK2, TOP2A) were significantly related to the survival of LUAD. We randomly divided 483 clinical data into two parts: train set and validation set. The train set was used for Cox regression analysis, 3 prognostic factors (stage, T, CDK1) were screened. The nomogram model was constructed based on stage, T and CDK1. The model was evaluated by ROC curve, calibration chart, Kaplan-Meier (KM) curve and validation set data. The results indicated that the model has good accuracy. Our study elucidated the regulatory mechanism of circRNA in lung adenocarcinoma, and the nomogram model also provided insight for the clinical analysis of lung adenocarcinoma.


Introduction
There are many kinds of malignant tumors, but lung cancer is the most common one. It mainly causes cancer morbidity and death, causing 18.4% of the cancer deaths [1]. There are many types of lung cancer, of which lung adenocarcinoma (LUAD) is the most common, which is often caused by tumor cell driven oncogene aberrations [2]. Notably, the incidence rate of LUAD significantly raised and it constituted nearly 40% of the incidence of all lung malignancies [3]. Although experts have made headway in therapy and diagnosis, the survival rate in 5 years of patients with LUAD is less than 20% due to its complex mechanism [4]. Therefore, understanding the potential regulatory mechanism of LUAD, looking for new biomarkers and establishing an effective prognostic nomogram model are essential for the prognosis and treatment of LUAD.
In 2011, Salmmena et al. put forward the hypothesis of competitive endogenous RNA (ceRNA), believing that noncoding RNA could indirectly regulate mRNA expression by competitively binding miRNA [5]. CircRNA is a new noncoding RNA molecule with complete closed-cycle structure [6]. Unlike linear RNAs that are terminated with 5' caps and 3' tails, circRNA is not affected by RNA exonuclease and its expression is more stable [7]. Studies have shown that circRNA could affect the generation and evolution of tumors. For example, studies have reported that circRNA plays a pivotal role in the pathogenesis of lung cancer and hold important clinical relevance [8]. CircRNA can participate in the proliferation, invasion, metastasis and apoptosis of gastric cancer [9]. CircRNA expression has an extensive effect on the biological behavior of hepatocellular carcinoma [10]. Numerous researches have indicated that circRNA is enriched and stable in exosomes and can be taken as biomarkers of cancer [11][12][13]. Recently, it has been found that circRNA is abundant in miRNA connection sites and can play the part of miRNA sponges, which makes circRNA become hot point in research of ceRNA family. CircRNA can indirectly influence the steadiness or translation of targeted mRNA by the way of competing with miRNA, thereby regulating gene expression [14]. For example, CiRS-7 could promote the metastasis and development of oral squamous cell carcinoma through the RAF-1/PIK3CD pathway [15]; circRNA hsa_circ_0012673 promoted the proliferation of lung adenocarcinoma by interacting with miR-22 [16]. Circ_0001588 promoted the malignant progression of lung adenocarcinoma by binding miR-524-3p to /NACC1 signaling [17]. Previous studies have reported the possible existence of ceRNA network in LUAD. For example, Liang et al. constructed ceRNA network associated with hsa_circ_0072088 and has_circ_0008274 [18]; Liu et al. suggested that hsa_circ_0005962 and hsa_circ_0086414 might be biomarkers of LUAD [19]. However, the research on circRNA and its ceRNA network in lung adenocarcinoma is still insufficient. Many circRNAs and related ceRNA network still need to be discovered.
In present study, circRNAs, miRNAs and mRNAs with abnormally expression in cancer and normal samples were extracted by differential analysis. Then, by pairwise predicting, a circRNA-miRNA-mRNA network was structured. The GO and KEGG enrichment analysis were carried out to understand the latent biological functions of mRNAs. Hub gene was screened by MCODE from PPI network. Verified by survival analysis, circRNA-miRNA-hub gene subnetwork was structured. Cox regression analysis was used to filter factors related to survival rate of LUAD. With the prognostic factors, a prognostic nomogram model for patients with LUAD was constructed and verified.

Data collection
CircRNA data of LUAD were acquired from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/), and two datasets GSE101586 (5 cancerous samples, 5 normal samples) and GSE101684 (4 cancerous samples, 4 normal samples) were selected. The expression matrices of the two datasets were combined by taking the intersection with respect to the name of circRNAs. The Combat function in the sva package was used to eliminate the batch effects according to the number of samples in the control group and the experimental group in the two datasets. The Cancer Genome Atlas (TCGA) database (https://xenabrowser.net, updated by the end of April, 2020) is a database for acquiring gene expression data where we obtained miRNA expression data (518 cancerous samples, 46 normal samples) and mRNA expression data (526 cancerous samples, 59 normal samples). Clinical data (age, gender, stage, T, N, OS time and OS) were also download from TCGA database (513 patient samples). A total of 483 patient samples were retained after deleting the samples with missing clinical characteristics.

Differential expression analysis of three kinds of RNA
Based on the combined dataset after batch effect normalization, the expression file of circRNA were further analyzed to obtain differentially expressed circRNA (DEcircRNA) by using the limma package. The screening thresholds were P < 0.01 and |log2FC| > 2. The differentially expressed miRNAs (DEmiRNAs) and differentially expressed mRNAs (DEmRNAs) were screened through the edgeR package. The screening thresholds were False-Discovery Rate < 0.05 and |log2(FC)| > 2.

Construction of circRNA-associated ceRNA network
Cancer-Specific CircRNA Database (CSCD) [20] (http://gb.whu.edu.cn/CSCD/) is a database for studying of tumor specific circRNA, from which we can predict the target miRNA of circRNA. The predicted miRNA and DEmiRNA were crossed, and the ceRNA network was constructed using the miRNA shared by the two datasets.
Using miRTarbase [21] (http://mirtarbase.mbc.nctu.edu.tw/) and Targetscan [22] (http://www.targetscan.org/) to collect the target mRNA of miRNA. Only mRNAs that existed in two databases were finally selected. The predicted mRNA and DEmRNA were crossed, and the ceRNA network was constructed using the mRNA shared by the two datasets.

Functional enrichment analysis of mRNAs
For exploring the latent biological mechanism and functions of mRNAs in the ceRNA network, the DAVID database [24,25] (https://david.ncifcrf.gov/)was used to perform GO function enrichment analysis and KEGG pathway enrichment analysis. The results of enrichment analysis were visualized by ggpolt2 R package.

Construction of PPI network and screening of hub genes
To further study the interaction among mRNAs in the ceRNA network, PPI network was analyzed by Search Tool for the Retrieval of Interacting Genes [26] (STRING, version 11, https://string-db.org/) database. The STRING database has a combined score for each PPI relationship pair that is distributed between 0 and 1. The higher the total score, the more reliable the PPI relationship. In current study, we chose combined score greater than 0.9 as the threshold. Then, the Molecular Complex Detection (MCODE), a clustering algorithm identifying densely connected regions of a molecular interaction network [27], was used to screen modules of hub genes from the PPI network.

Survival analysis of hub genes
GEPIA [28] (http://gepia.cancer-pku.cn/) integrated gene expression data from GTEX and TCGA projects which can be used for multiple data analysis and visualization. The prognostic value of hub genes in LUAD were analyzed by survival analysis tool in GEPIA. UALCAN (http://ualcan.path.uab.edu/index.html ) [29] is a web resource for analyzing cancer transcriptome data based on TCGA database, which can perform biomarker identification, expression spectrum analysis, survival analysis and so on. To ensure the accuracy of the results, we further verified the prognostic value of hub gene in TCGA (through the survival analysis package in R software) and the UALCAN database. Then, based on the hub genes with prognostic value, circRNA-miRNA-hub gene subnetwork was identified to explore important regulatory module in the ceRNA network.

Screening of prognostic factors by Cox regression analysis
According to the proportion of 7:3, all LUAD samples were randomly chopped into train set and validation set. Fisher exact test and wilcox test were used to test whether there was significant difference between two datasets. Univariate Cox regression analysis of the train set was performed using Survival package of R software to identify the factors significantly associated with the survival of LUAD. In the univariate Cox analysis, the factors with p value < 0.05 were further included in multivariate Cox regression analysis. Multivariate Cox regression analysis was used to screen independent prognostic factors, and hazard ratio (HRS) and regression coefficient were calculated.

Construction and validation of nomogram model
Based on the prognostic factors, the nomogram model was established by rms package of R software. In the nomogram model, we first set the scoring standard according to the regression coefficient of each influence factor, and then give the score value of each influence factor, thereby calculating the total score of each patient [30]. Finally, the conversion between the probability of occurrence and the total score is calculated by a function to evaluate the 1-year, 3-year and 5-year OS of LUAD patients. The ROC curves of train set and validation set were drawn by survivalROC package, and the differentiation of the model was assessed by calculating the area under ROC curve (AUC). Calibration chart of train set and validation set was used to evaluate the calibration degree of the model by comparing the difference between prediction probability and survival probability of nomogram model. Considering the median of the total scores as the cut-off value, the LUAD samples were divided into high risk group and low risk group. Then, we analyzed the evaluation ability of the model for high risk and low risk populations through survival analysis.

Enrichment analysis of mRNAs
GO and KEGG analysis were applied to explore the functions of the 147 mRNAs. KEGG enrichment analysis identified 4 KEGG pathways and GO enrichment analysis identified 58 GO Terms. The results of 4 KEGG pathways and top 15 Go terms (5 biological processes, 5 cell components and 5 molecular functions) were shown in Figure 2. As can be known from Figure 2 that the most enriched KEGG pathway terms were correlated with ECM-receptor interaction, cell cycle, protein digestion and absorption and p53 signaling pathway. The most enriched GO terms were correlated with singlestranded DNA binding, ATP binding, sequence-specific DNA binding, protein kinase activity, extracellular matrix structural constituent, caveola, apical plasma membrane, collagen trimer, midbody, chromosome and centromeric region, cartilage development involved in endochondral bone morphogenesis, cholesterol homeostasis, cell division, mitotic nuclear division and skeletal system development.

Construction of PPI network and screening of hub genes
To further explore the relationship among mRNAs, PPI network was constructed by Cytoscape software which contained 39 nodes and 230 edges. A total of 12 hub genes (BIRC5, MKI67, CENPF, AURKA, KIF23, RRM2, BUB1, MELK, CEP55, CDK1, NEK2 and TOP2A) were screened by MCODE plugin from the PPI network ( Figure 3).

Survival analysis of 10 hub genes
GEPIA analysis results showed that 12 hub genes were closely related with the survival of lung adenocarcinoma (Figure 4). The survival analysis result of R based on TCGA dataset indicated that 11 genes were significantly correlated with the survival of LUAD, except for AURKA (Supplementary Figure 1). The result of survival analysis in UACLAN database indicated that 11 genes were significantly associated with the survival of LUAD, except for KIF23 (Supplementary Figure 2). Therefore, 10 genes (BIRC5, MKI67, CENPF, RRM2, BUB1, MELK, CEP55, CDK1, NEK2 and TOP2A) in the intersection of the three methods were considered as important prognostic genes for the survival of the patients with LUAD. Based on the 10 hub genes, we built circRNA-miRNA-hub gene regulation subnetwork ( Figure 5).

Screening of prognostic factors by Cox regression analysis
The clinical characteristics and test result of LUAD samples were shown in Table1.The two datasets were not obviously different because all P values were greater than 0.05. The Cox regression analysis results of train set were shown in Table 2. Univariate analysis showed that the P values of stage, T and 9 hub genes(except for AURKA) were all less than 0.05, indicating that these variables were connected with the survival of LUAD patients. Then, we incorporated these variables into the multivariate Cox analysis and the result suggested that stage, T and CDK1 were statistically significant (P < 0.05), which implied these variables were independent prognostic factors.

Establishment and verification of nomogram model
On the basis of 3 prognostic factors (stage, T and CDK1), a prognostic nomogram model ( Figure 6) was constructed to study the 1-year, 3-year and 5-year survival rates of LUAD. In the ROC curve of the train set ( Figure 7A), the AUC value of the 1-year, 3-year and 5-year were 0.739, 0.727 and 0.665, respectively. While in that of validation set ( Figure 7B), the AUC value of the 1 year, 3 year and 5 year were 0.693, 0.625 and 0.677, respectively. The results showed that the model had good discrimination. The calibration charts in train set and validation set ( Figure 8) did not deviate from the ideal line. The KM survival curves ( Figure 7C,D) in train set and validation set showed that the survival rate of the high risk group is obviously lower than that of the low risk group, which further verified the validity and applicability of the model.

Discussion
Previous studies have shown that circular RNA could be used as biomarker of cancer, and could promote the invasion, immigration and propagation of tumor cells through regulatory mechanisms. In the current study, 10 circRNAs were extracted by differential analysis. Through the mutual analysis of various databases, a ceRNA regulatory network containing 7 circRNAs, 13 miRNAs, and 147 mRNAs was finally constructed. The 7 circRNAs were hsa_circ_0043278, hsa_circ_0043256, hsa_circ_0000026, hsa_circ_0082950, hsa_circ_0004592, hsa_circ_0020390, hsa_circ_0019390. Studies have confirmed that hsa_circ_0043278 (alias circTADA2A) could promote the proliferation and migration of non-small cell lung cancer cells by regulating the miR 638/KIAA0101 signaling pathway [31]; propofol could inhibit canceration of lung cancer cells through hsa_circ_0043278/miR-455-3p/FOXM1 axis [32]. Meanwhile, hsa_circ_0043278 has also been confirmed by experiments to play a key role in breast cancer and glioma. [33,34]. In addition, studies have reported that hsa_circ_0043256/miR-1252/ITCH axis might suppress Wnt /b-catenin pathway activities, resulting in apoptosis of non-small cell lung cancer [35]; hsa_circ_0000026(alias circ-EIF4G3) could promoted the proliferation and migration of gastric cancer via sponging miR-335 [36]. Studies have also shown that hsa_circ_0020390 (alias Circ_DOCK1) could inhibit the growth and metastasis of colorectal cancer cells and increase cell apoptosis by regulating the miR-132-3p/USP11 axis [37]. However, hsa_circ_0004592, hsa_circ_0082950 and hsa_circ_0019390 have not been reported yet. Therefore, these identified circRNAs may be important circRNAs for lung adenocarcinoma. Hopefully these circRNAs could be further verified by biological experiments in future.
MiRNA is a type of single-stranded RNA molecule encoded by endogenous genes. It is known that miRNA could regulate the expression of target genes by interacting with them [38]. In this study, we found 13 miRNAs through the ceRNA network. Most of these miRNAs have been shown to be involved in the progression of lung adenocarcinoma, non-small cell lung cancer or lung cancer. For example, overexpression of miR-183-5p might play an oncogenic role in LUAD through involvement in the regulatory networks of its target genes [39]；miR-3607 promoted LUAD proliferation by inhibiting APC expression [40]; miR-767-3p could inhibit growth and migration of LUAD cells by regulating CLDN18 [41]; miR-708 participated in tumor growth and disease progression as a tumor gene by directly down regulating Wnt signaling pathway in lung cancer [42]; miR-605 inhibited the oncogenicity of non-small-cell lung cancer by directly targeting Forkhead Box P1 [43]; miR-503 inhibited non-small-cell lung cancer by targeting PDK1/PI3K/AKT pathway [44]; LncRNA HCG11 inhibited cell viability, migration and invasion in non-small-cell lung cancer by functioning as a ceRNA of miR-522-3p to upregulate SOCS5 [45]; Circular RNA circ-RAD23B promoted cell growth and invasion by miR-653-5p/TIAM1 pathways in non-small-cell lung cancer [46]; miR-205 could act as a promising biomarker in the diagnosis and prognosis of lung cancer [47]; miR-1911-3p could interact with the targeted mEAK-7 to inhibit mTOR signal transduction in human lung cancer cells [48].
The GO and KEGG analysis results suggested that these selected mRNAs in the ceRNA network were significantly enriched in biological functions and pathways related to LUAD. For example, Chromatin patterns were associated with lung adenocarcinoma progression [49]; the score of cell cycle progression could be used as a marker of specific death risk in patients with LUAD [50]; UBE2T could promote autophagy via the p53/AMPK/mTOR signaling pathway in lung adenocarcinoma [51]; ATP binding cassette E1 could enhance viability and invasiveness of lung adenocarcinoma cells in vitro [52]; inhibitor of DNA-binding protein 4 could suppress LUAD metastasis through the regulation of epithelial mesenchymal transition [53].

Conclusions
In this study, we found the important circRNAs in LUAD, constructed the ceRNA regulatory network and discovered important regulatory modules, which provided potential curative targets and molecular mechanisms for LUAD. Meanwhile, a nomogram model was established to forecast the survival of LUAD, which will be helpful to the prognosis and therapy of LUAD patients.