Development and validation of a novel prognostic signature in gastric adenocarcinoma

Competing endogenous RNA networks have attracted increasing attention in gastric adenocarcinoma (GA). The current study aimed to explore ceRNA-based prognostic biomarkers for GA. RNA expression profiles were downloaded from TCGA and GEO databases. A ceRNA network was constructed based on the most relevant modules in the weighted gene coexpression network analysis. Kaplan-Meier (KM) survival analysis revealed prognosis-related RNAs, which were subjected to the multivariate Cox regression analysis. The predictive accuracy and discriminative ability of the signature were determined by KM analyses, receiver operating characteristic curves and area under the curve values. Ultimately, we constructed a ceRNA network consisting of 55 lncRNAs, 17 miRNAs and 73 mRNAs. Survival analyses revealed 3 lncRNAs (LINC01106, FOXD2-AS1, and AC103702.2) and 3 mRNAs (CCDC34, ORC6, and SOX4) as crucial prognostic factors; these factors were then used to construct a survival specific ceRNA network. Patients with high risk scores exhibited significantly worse overall survival than patients with low risk scores, and the AUC for 5-year survival was 0.801. A total of 112 GA specimens and the GSE84437 dataset were used to successfully validate the robustness of our signature by qRTPCR. In summary, we developed a prognostic signature for GA, that shows better accuracy than the traditional TNM pathological staging system.


INTRODUCTION
Despite technological advances in diagnosis and treatment, stomach cancer remains the fourth most common cancer with the second highest mortality rate [1]. Although some prognostic factors, such as genes and the tumor microenvironment, have been evaluated, the exact mechanisms involved remain unclear [2][3][4]. The development of a prognostic marker that can accurately predict clinical results will better serve as a guide in the clinic. Long noncoding RNAs (lncRNAs) are more than 200 bp in length and lack coding ability.
AGING Several studies suggest that lncRNAs participate in the regulation of tumor progression and tumor biological behavior by interacting with microRNAs (miRNAs) or messenger RNAs (mRNAs). LncRNAs containing miRNA response elements can compete with miRNA target genes and regulate their expression by reducing free functional miRNAs. This kind of lncRNA is called a competitive endogenous RNA (ceRNA) [5][6][7][8]. This hypothesis has attracted increasing attention [9]. For example, Chen et al. explored whether the lncRNA PVT1 promotes tumor progression by regulating the miR-143/HK2 axis in gallbladder cancer [10]. Additionally, Wang et al. found that a novel lncRNA, MCM3AP-AS1, promotes the growth of hepatocellular carcinoma by targeting the miR-194-5p/FOXA1 axis [11]. Gastric cancer development in attributed to an imbalance between protein-coding and noncoding genes, and the regulatory mechanism of ceRNAs may be involved in this pathogenic process.
To identify ceRNAs associated with prognosis in GA and guide clinical applications, we integrated RNAseq data from TCGA and GEO datasets and 112 GA specimens to establish a signature based on ceRNAs. Functional enrichment analysis and gene set enrichment analysis (GSEA) were performed to predict the potential functions of the genes in the ceRNA network.

Data acquisition and preprocessing
We collected gene matrix and clinical information from a TCGA data set, including 376 tumor tissues and 32 normal tissues. We excluded patients with a survival time of zero or incomplete clinical and pathological data. After screening, 349 tumor and 32 normal samples remained. After annotation, we obtained 19,754 mRNAs and 14,848 LncRNAs. The GSE84437 dataset was used as a validation cohort and contains 431 GA patients with a nonzero survival time.
Moreover, from June 2017 to August 2019, a total of 112 frozen, surgically resected tumor tissues were obtained from patients with a pathological diagnosis of GA at the Department of Pathology, Chengdu Third People's Hospital. The clinicopathological data of the TCGA and real-time quantitative PCR (qRT-PCR) datasets are presented in Table 1.

Differential expression analysis
After obtaining the expression data, we identified differentially expressed genes among tumorous and normal GA samples in the TCGA dataset using the software package edgeR and selected genes that were at least 2-fold higher in GA samples than in normal samples (Poisson model FDR < 0.05). Ultimately, we obtained 4721 differentially expressed mRNAs/lncRNAs (Supplementary Figure 1A, 1B). For miRNAs, if their expression deviated by more than 1.2 among these samples, they were subjected to WGCNA. Ultimately, we obtained 486 miRNAs for subsequent analyses.

WGCNA
We used the expression profiles of 4721 mRNAs/ lncRNAs and 486 mRNAs/lncRNAs to construct a coexpression network with the WGCNA software package in R software. In the coexpression network analysis, the β values of lncRNAs/mRNAs and miRNAs were 3 and 8, respectively ( Figure 1A, and 1B). Ultimately, we obtained 17 and 10 modules in the coexpression network of lncRNAs/mRNAs and miRNAs, respectively ( Figure 1C and 1D). Moreover, we calculated and plotted the relationship between each module and clinical features. Figure 1E shows a strong negative correlation between the turquoise module and tumor characteristics (module-trait weighted correlation =-0.61). However, as shown in Figure 1F, there is a significant positive correlation between the brown module and tumor characteristics (module-feature weighted correlation = 0.75).

CeRNA network in GA
Through the prediction of online database and screening of matrix internal relationship pairs, 234 pairs of lncRNAs-miRNAs were obtained, including 55 lncRNAs (12 upregulated and 43 downregulated) and 17 miRNAs. Ninety pairs of miRNAs-mRNAs were also obtained, including 73 mRNAs (11 upregulated and 62 downregulated) (Supplementary Figure 2).

Functional enrichment analysis
Functional enrichment analysis was performed to explore the Gene Ontology (GO) database terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with genes in the turquoise module. The results indicated that the enriched biological processes mainly involved nuclear division, DNA replication, chromosome segregation, organelle fission and so on (Figure 2A). The cell components that were correlated with the resulting terms included chromosome, centromeric region, condensed chromosome, spindle and so on ( Figure 2B). The results also showed that the molecular functions were related to DNA helicase activity, catalytic activity, acting on DNA, DNA-dependent ATPase activity, actin binding and so on ( Figure 2C). KEGG pathway functional AGING enrichment analysis showed that Cell cycle, p53 signaling pathway, cAMP signaling pathway, cGMP-PKG signaling pathway and DNA replication were the main pathways related to the genes in this module ( Figure 2D).

Analysis of survival-related biomarkers in the GEPIA2 database
Compared with normal gastric tissues, SRY-box transcription factor 4 (SOX4) is significantly overexpressed in GA tissues ( Figure 4A). SOX4 also appears to be upregulated in various tumor tissues compared with corresponding normal tissues ( Figure  4B). Similar situations were also obtained for CCDC34 and ORC6 ( Figure 4C-4F). The three lncRNAs (LINC01106, FOXD2-AS1, and AC103702.2) also showed highly similar results (Supplementary Figure 3).

Construction of the survival -specific ceRNA network and prognostic signature
As shown in Figure 5A, the upregulated lncRNAs (LINC01106, FOXD2-AS1, and AC103702.2) associate with the same upregulated mRNAs (CCDC34, ORC6, and SOX4) via hsa-miR-17-5p and hsa-miR-7-5p. To AGING further verify the functions of the hub RNAs, we conducted multivariate Cox regression analysis and calculated the risk score. The results are presented in Table 2. All samples were randomly separated into highand low-risk groups with the median risk score as the cutoff value. Patients in the high-risk group had significantly worse OS than those in the low-risk group ( Figure 5C). In addition, receiver operating characteristic (ROC) curves were used to explore whether the prognostic ability of the ceRNA-based signature was better than that of the traditional TNM pathological staging system. The AUC values of the signature assessed for 5-year (AUC = 0.801) and 7-year (AUC = 0.853) OS were more accurate than those of the pathological stage (5-year AUC = 0.609) ( Figure 5B and 5D). Figure 5E-5I, Cell cycle and P53 signaling pathway were enriched in the high-risk group. In  AGING addition, tumor-related pathways such as the MAPK signaling pathway, pathways in cancer and cytokine receptor interaction, were also enriched.

Validation of the prognostic value of the ceRNAbased signature
To determine the stability of the nomogram, we performed a similar process in the qRT-PCR validation cohort (n = 112). First, KM analysis was performed to identify the prognostic value of FOXD2-AS1, LINC01106 and ORC6 ( Figure 6A-6C). Then, with the median risk score as the cut-off point, the patients were divided into the high-risk group (n = 56) and the lowrisk group (n = 56) ( Figure 6D). The AUC value for 3and 4-year OS reached 0.809 and 0.820, respectively ( Figure 6E), which were still higher than the AUC values of the traditional TNM pathological staging   AGING system ( Figure 6F). The KM survival curves suggested that the OS of patients in the high-risk group was significantly worse than that of patients in the low-risk group ( Figure 6G).
We also validated the robustness of the signature in the GSE84437 dataset (n = 431). The KM OS curves showed that the high expression of FOXD2-AS1 and SOX4 could be related to a poor prognosis ( Figure 7A and 7B). Moreover, the median risk score was used as the cut-off point to divide patients into the high-risk group (n = 215) and the low-risk group (n = 216) ( Figure 7C). The AUC value for 5-year OS reached 0.755 ( Figure 7D). The KM OS curves indicated that the OS of patients in the highrisk group was significantly worse than that of patients in the low-risk group ( Figure 7E). These results suggest that the new signature can effectively evaluate the prognosis of GA patients.

DISCUSSION
Despite advances in diagnosis, prognosis, and treatment, GA remains a worldwide public health concern. While some lncRNAs and mRNAs dysregulated in GA and their clinical value as potential biomarkers for prognosis have been previously reported [12,13], this study provided additional data on two novel lncRNAs (LINC01106, and AC103702.2) and two novel mRNAs (CCDC34 and ORC6) contributing to GA and constructed a ceRNA-based signature that can be used to predict the prognosis of GA. Most importantly, the qRT-PCR validation cohort of 112 GA patients successfully verified its stability.
In the present study, we identified a turquoise module and a brown module (Brown) that were markedly associated with the GA tumor status by WGCNA. Next, we constructed a ceRNA network to identify potential prognostic lncRNA and mRNA biomarkers. The results revealed that LINC01106, FOXD2-AS1, AC103702.2, CCDC34, ORC6, and SOX4 whose high expression may indicate a poor OS. We also revealed the ceRNA relationship between these lncRNAs and mRNAs, that likely represents its mechanism of action in GA. Next, multivariate Cox regression analysis was carried out among these molecules, and the risk score was calculated. All samples were randomly separated into high-and low-risk groups with the median risk score as the cut-off value. KM analyses, ROC curves and AUC values showed that the signature based on the ceRNAs objectively and accurately predicted the prognosis of patients with GA. The results were successfully verified in the qRT-PCR validation cohort and the GSE84437 dataset. AGING LINC01106 is broadly distributed in the brain (RPKM 31.9), stomach (RPKM7.0) and 17 other tissues. Sun et al. [14] found that LINC01106 was overexpressed in colon cancer, and survival analysis showed that LINC01106 was strongly associated with the OS of colorectal cancer patients. Similarly, we found that LINC01106 was overexpressed in GA tissue and associated with a poor prognosis.
The protein encoded by ORC6 is a subunit of the ORC complex, which includes a core complex consisting of ORC2, ORC3, ORC4, and ORC5, loosely interacting with ORC6. Gene silencing studies with siRNAs demonstrated that this protein plays an essential role in coordinating chromosome replication and segregation with cytokinesis. [15] Research has also shown that the expression of ORC6 in colon cancer tissues is high and associated with invasion depth [16]. Moreover, a reduction in ORC6 expression sensitizes human colon cancer cells to 5-fluorouracil and cisplatin [17]. Our results suggest that ORC6 overexpression in GA likely causes poor OS.
Coiled-coil domain containing 34 (CCDC34), also known as renal carcinoma antigen NY-REN-41, is a proteincoding and disease related gene. Current research have shown that CCDC34 is upregulated in a variety of tumors and contributes to the malignant behaviors of cancer cells [18][19][20][21][22]. Recently, Gong et al. [19] found that CCDC34 was overexpressed in bladder cancer and promoted cell proliferation and migration. Lin et al. [21] found that it was also upregulated in hepatocellular carcinoma and contributes to cell proliferation and metastasis. However, its expression and role in GA have not been reported. In our study, we found that CDCC34 was highly expressed in GA tissues and not conducive to prognosis. However, little is known about its specific role in GA. Current studies have confirmed that SOX4 and FOXD2-AS1 are overexpressed in GA tissues and promote the proliferation and migration of GA cells [23][24][25]. These results further validate the authenticity of the current study. In addition, by constructing a ceRNA network, we also found that FOXD2-AS1 may increase the expression of SOX4, ORC6 and CCDC34 through hsa-miR-17-5p and hsa-miR-7-5p, respectively, and thus play a regulatory role in the growth, differentiation and migration of GA cells. In turn, SOX4 may be increased by LINC01106, FOXD2-AS1 and AC103702.2 through hsa-miR-17-5p, and activate the biological behavior of GA.
To validate our conclusion, we conducted qRT-PCR experiments to demonstrate that LINC01106, FOXD2-AS1, AC103702.2, CCDC34, ORC6, and SOX4 are differentially expressed in GA and corresponding normal tissues. Our results showed that the expression of LINC01106, FOXD2-AS1, AC103702.2, CCDC34, ORC6, and SOX4 was higher in GA tissues than in corresponding normal tissues (Supplementary Figure 4). We also found that SOX4, ORC6, and CCDC34 were more highly expressed in adenocarcinoma tissue than in normal gastric tissue using Western blot (WB) assays and immunohistochemistry (IHC). Moreover, SOX4, ORC6, and CCDC34 are located mainly in the nucleus of adenocarcinoma tissues (Supplementary Figure 5). It is worth noting that the lncRNA-miRNA relationship pair were constructed based on the correlation of the internal expression of the matrix and not the results of the analysis of the online databases miRcode and starBase, because very few relationship pairs remained.
In summary, we identified two novel lncRNAs (LINC01106 and AC103702.2) and two novel mRNAs (ORC6 and CCDC34) related to the prognosis of GA.
We also constructed a survival-specific ceRNA network between these lncRNAs and mRNAs, and developed a prognostic signature for GA, that shows better accuracy than the traditional TNM pathological staging system. Our research will help further understand the molecular mechanism of GA and provide new insights for the treatment and prognosis of GA.

Data acquisition and preprocessing
Open data sets were downloaded from the TCGA (https://portal.gdc.cancer.gov/) and GEO (https:// www.ncbi.nlm.nih.gov/gds/) databases, including the lncRNA, mRNA and miRNA expression profiles of GA specimens and the corresponding clinical follow-up data. The RNA expression profile was downloaded in fragments per kilobase million (FPKM) format. The data used in this study met the following criteria: (1) RNAs with nonzero expression levels (accounting for 75% of all samples); (2) the median and SD of the RNAs were larger than 1.2, and (3) exact follow-up times.

Differential expression analysis
Differentially expressed RNAs were identified by the edgeR package in R software [26]. Significantly expressed RNAs were identified by setting the adjusted P value to < 0.05 and the |log 2 FC (fold change) | > 1(|log 2 FC > 1| and adjusted FDR < 0.05).

Construction of the weighted gene coexpression network
The WGCNA package implemented in R software [27] was used to build a gene coexpression network based on the gene, lncRNA and miRNA expression characteristics. A scale-free plot was used to evaluate whether AGING the network exhibited scale-free topology. The power value of the soft threshold of the adjacency matrix met the scale-free topology criterion. On this basis, we built a scale-free network and topological overlap matrix (TOM). The dynamic tree cutting method was used to generate modules with the following main parameters: deepSplit of 2 and min module size of 30 (for miRNAs, the min module size was set as 10). The height cut-off was set to 0.25, and if the module's similarity was > 0.8, the modules were merged. Based on Pearson's tests, we further determined the association between module eigengenes (MEs) and external clinical information, including sample status. If the P-value was < 0.05 and the correlation coefficient was>0.9, it was considered a significant correlation.

CeRNA network construction and analysis
According to the results of the WGCNA, we selected all the mRNAs/lncRNAs in the most negative correlation module and the miRNAs in the most positive correlation module to build the ceRNA network. In short, the related ceRNA network in GA was constructed through four stages: First, we identified the internal network relationship based on the lncRNA/mRNA and miRNA expression matrices via the WGCNA package in R software, in which the weight coefficient was controlled to be greater than 0.15. Second, miRTarBase (http://mirtarbase. cuhk.edu.cn/) [28], TargetScan (www.targetscan.org/), miRDB (mirdb.org/) [29] and miRWalk (http:// mirwalk.umm.uni-heidelberg.de/) [30] were used to predict the target genes of the miRNAs. Then, to further ensure the reliability of the constructed ceRNA network, we compared the predicted relationship pairs with the internal network, and we retained only the data of overlapping interaction pairs for further analysis. Finally, Cytoscape 3.7.2 software was used to construct and visualize the ceRNA network based upon the remaining interaction pairs.

Module function annotation
GO and KEGG analyses were realized through the org.Hs.eg.db package and clusterProfiler in R software. GO consists of three terms: biological process (BP), molecular function (MF), and cellular composition (CC). All important GO terms and KEGG pathways were filtered according to a P < 0.05 and at least two associated mRNAs.

Survival analysis
KM survival analysis was performed to evaluate the association between disease prognosis and lncRNAs and mRNAs. We also analyzed the clinical data of 349 GA patients with a nonzero survival time from the TCGA database and drew the survival curves of all lncRNAs and mRNAs in the ceRNA network using the R package "survival". When the P-value of lncRNA or mRNA was < 0.05, it was considered statistically significant and indicated that the lncRNA or mRNA has potential prognostic value.

Analysis of survival-related biomarkers in the GEPIA2 database
GEPIA2 is an updated version of GEPIA that can be used to analyze the RNA sequencing expression data of 9736 tumor samples and 8587 normal samples from the TCGA and the Genotype-Tissue Expression (GTEx) projects, using a standard processing pipeline [31]. We used this online tool to analyze the expression levels of six survival-related biomarkers in various organs and tissues based on the TCGA and GTEx databases. The differential method was based on the "limma" R package, and we used log2 (TPM + 1) for log-scale. The |log2FC| cut-off was set as 1, while the q-value cutoff was set as 0.05. We also drew a box plot of the expression levels of the survival-related biomarkers in GA tissue via internal tools.

Construction of the survival-specific ceRNA network and prognostic signature
We obtained the connections between 3 lncRNAs (LINC01106, FOXD2-AS1, and AC103702.2) and 3 mRNAs (CCDC34, ORC6, and SOX4), whose P-value was < 0.05 in the survival analysis, from the lncRNA-miRNA and mRNA-miRNA relationship pairs. The results were visualized using Cytoscape 3.7.2 software. Next, multivariate Cox regression was used to identify the corresponding coefficients of the GA prognostic signature and to calculate the risk score (risk score = expGene1 ×βGene1 + expGene2 × βGene2 + expGenen × βGenen (exp, prognostic gene expression level; β, multivariate Cox regression model regression coefficients)) by using the R packages "glmnet", "survminer" and "survival". All samples were randomly separated into high-and low-risk groups with the median risk score as the cut-off value. Survival for each group was evaluated by the KM analysis and the logrank test. The ROC curve and AUC were drawn with the R package "timeROC".

qRT-PCR
Total RNA was dreverse transcribed into cDNA with random primers using a Transcriptor First Strand cDNA Synthesis Kit (Roche, Penzberg, Germany) following the manufacturer's instructions. The expression levels of the RNAs were measured by qRT-PCR using AGING FastStart Essential DNA Green Master Mix (Roche, Penzberg, Germany) on a Roche LightCycler 480 (Roche, Penzberg, Germany). RNA expression was normalized to that of GAPDH. All quantitative PCRs were conducted in triplicate. Divergent primers, rather than the more commonly used convergent primers, were designed for the RNAs. We verified the specificity of the PCR primers using BLAST. A single peak in the melting curve indicated that the PCR products were specific. The primers used in the study are presented in Supplementary Table 1

Western blot analysis
Total protein was extracted from tissues using rotor and radio immunoprecipitation assay (RIPA) lysis buffer.
Equal amounts of protein were separated by SDS-PAGE in a 12% gel and transferred to a nitrocellulose membrane. The proteins were detected using an enhanced chemiluminescence system according to the manufacturer's instructions. Membranes were incubated overnight with the following primary antibodies: anti-ORC6 (Proteintech), anti-SOX4 (Abcam), anti-CCDC34 (invitrogen).