Systematic Construction and Validation of an RNA Binding Proteins-Based Signature for Prognostic Prediction in Gastric Cancer

Background: Gastric cancer (GC) is one of the most common cancers, with high incidence and mortality worldwide. Recently, RNA-binding proteins (RBPs) have drawn increasing attention for their roles in cancer pathophysiology. In this study, we aimed to explore the function and clinical implications of RBPs in GC. Methods: RNA sequencing data and the corresponding clinical information of GC patients were downloaded from The Cancer Genome Atlas (TCGA) database. Differentially expressed RNA-binding proteins (DERBPs) between tumor and normal tissues were identified with the 'limma' package. Functional enrichment analysis and a protein-protein interaction (PPI) network were used to explore the function and interaction of DERBPs. Next, univariate and multiple Cox regression were applied to screen prognosis-related hub RBPs and to construct a signature for GC. Meanwhile, a nomogram was built based on the same RBPs. Results: A total of 296 DERBPs were found, most of which were related to post-transcriptional regulation of RNA and to ribonucleoprotein. A PPI network of DERBPs was constructed, consisting of 262 nodes and 2567 edges. A prognostic signature was built from seven prognosis-related hub RBPs that could divide GC patients into high- and low-risk groups. Survival analysis showed that the high-risk group had a worse prognosis than the low-risk group, and the time-dependent receiver operating characteristic (ROC) curves suggested that the signature had moderate predictive capacity for the survival of GC patients. Similar results were obtained from another independent set, GSE84437, confirming the robustness of the signature. Calibration plots showed good consistency between overall survival (OS) predicted by the nomogram and actual observation. Conclusion: The findings of this study provide evidence of the effect of RBPs on GC and offer novel potential biomarkers for prognosis prediction and clinical decision-making for GC patients.


Introduction
Gastric cancer (GC), one of the most frequently occurring digestive tract tumors, originates in the gastric mucosal epithelium. GC is a leading cause of cancer-associated death worldwide, with an estimated 1,000,000 new cases and approximately 700,000 deaths per year according to the cancer statistics of 2018 [1]. Despite advances in the diagnosis and treatment of GC, the prognosis for patients with GC remains poor because most cases are diagnosed at middle to late stages [2,3]. Therefore, exploring the molecular mechanisms behind the occurrence and progression of GC and discovering new biomarkers are urgently required for early diagnosis and improved prognosis of GC.
RNA-binding proteins (RBPs) are inherently pleiotropic proteins that interact with many types of RNA, including mRNAs, tRNAs, miRNAs and ncRNAs [4,5]. RBPs have central roles in RNA structure, localization, stability and translatability, and they regulate gene expression post-transcriptionally as well as other cellular functions [6]. Post-transcriptional deregulation has emerged as a frequent pathological mechanism in numerous diseases, which demonstrates the crucial function of RBPs in human cellular processes [7]. To explore the structure and function of RBPs, they must first be comprehensively identified and annotated. Owing to rapid advances in high-throughput sequencing technologies, over 1500 RBPs have been found and deposited into databases [4]. Numerous studies have indicated that RBPs play essential roles in tumor occurrence and development [8,9]. For example, the RBP RNPC1 regulates the stability of the P63 gene to inhibit tumor initiation and progression [10]. The RBP U2AF1 affects pre-mRNA splicing of many oncogenic drivers to promote tumorigenesis. Overexpression of the RBP LIN28A accelerates cell-cycle progression from S to G2/M to enhance colon cancer cell proliferation [11]. A systematic study of RBPs may be conducive to understanding their contribution to tumors and may help discover the potential diagnostic or prognostic biomarkers that are currently lacking for GC.
In this study, differentially expressed RNA-binding proteins (DERBPs) between tumor and normal tissues were investigated. Subsequently, we performed functional enrichment analysis of DERBPs to explore their biological functions and constructed a co-expression network to reveal the relationships between them. Moreover, we built a model to appraise the predictive value of RBPs for the survival of GC patients; some of these RBPs may serve as biomarkers for the diagnosis and prognosis of GC in the future.

Methods
Data collection and DERBPs analysis
RNA sequencing data and corresponding clinical information were downloaded from The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/) database, which contained 375 GC tissue samples and 32 paracancerous tissue samples. The gene expression data included more than 60,000 genes annotated by Ensembl Genes 82. We applied the 'limma' package in R version 4.0.0 to standardize RBP expression and to identify DERBPs between GC tissues and adjacent noncancerous tissues, using FDR < 0.01 and |fold change| > 0.5 as thresholds.
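The thresholding step described above can be illustrated with a minimal Python sketch. It does not reproduce the moderated statistics of 'limma'; it only shows how a Benjamini-Hochberg FDR adjustment and the two cutoffs could be combined. The function names, the input layout (gene mapped to an effect size and a raw p-value) and the example values are hypothetical and are not taken from the study data.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (FDR) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_derbps(
    stats: Dict[str, Tuple[float, float]],
    fdr_cutoff: float = 0.01,
    effect_cutoff: float = 0.5,
) -> Dict[str, str]:
    """Keep genes with FDR < fdr_cutoff and |effect| > effect_cutoff.

    `stats` maps a gene name to (effect size, raw p-value). The effect
    size stands in for the fold-change statistic (an assumption here).
    """
    genes = list(stats)
    fdr = benjamini_hochberg([stats[g][1] for g in genes])
    selected = {}
    for gene, q in zip(genes, fdr):
        effect = stats[gene][0]
        if q < fdr_cutoff and abs(effect) > effect_cutoff:
            selected[gene] = "up" if effect > 0 else "down"
    return selected


# Hypothetical toy input, not study data.
example_stats = {
    "RBP_A": (1.2, 0.0001),
    "RBP_B": (-0.9, 0.002),
    "RBP_C": (0.3, 0.0005),
    "RBP_D": (0.8, 0.2),
}
example_derbps = select_derbps(example_stats)
```

In this toy input, RBP_C is dropped because its effect size is below the cutoff and RBP_D because its adjusted p-value is too large.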

Gene ontology and pathway enrichment analysis
To identify the biological functions and pathways most associated with DERBPs, we performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis with the 'clusterProfiler' package in R version 4.0.0. The GO term enrichment analysis included three categories: molecular function (MF), cellular component (CC), and biological process (BP). Both P-value < 0.05 and FDR < 0.05 were used as the screening criteria.
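Over-representation analysis of this kind is typically based on a hypergeometric test. The following Python sketch shows that test for a single term; it is an illustration under stated assumptions and not the 'clusterProfiler' implementation. The function name and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(
    population: int, annotated: int, drawn: int, overlap: int
) -> float:
    """Upper-tail hypergeometric p-value P(X >= overlap).

    population: number of background genes
    annotated:  background genes annotated to the term
    drawn:      size of the gene list under test (e.g. the DERBPs)
    overlap:    genes in the list that are annotated to the term
    """
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical toy counts: 10 background genes, 4 annotated to the term,
# 3 genes in the list, 2 of them annotated.
example_p = hypergeom_enrichment_p(10, 4, 3, 2)
```

The per-term p-values would then be adjusted for multiple testing before the P-value and FDR criteria stated above are applied; the choice of background gene set is an assumption that the analyst has to make explicitly.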
Establishment of protein-protein interaction (PPI) network and modules selection
DERBPs were uploaded to the Search Tool for the Retrieval of Interacting Genes (STRING) database (http://www.string-db.org/) [12] to provide a global perspective on their interactions. The protein-protein interaction (PPI) network was constructed and visualized using Cytoscape 3.7.0 software, and the most important modules, defined as those with both a Molecular Complex Detection (MCODE) score ≥ 4 and ≥ 6 nodes, were selected with the MCODE plug-in [13]. A P-value of less than 0.05 was considered statistically significant.

Construction and validation of overall survival (OS) risk prognostic model
Prognosis-related candidate hub RBPs were identified from the DERBPs in the PPI network by univariate Cox regression analysis and multiple stepwise Cox regression. Subsequently, a multivariate Cox proportional hazards regression model was built based on the hub RBPs. The risk score of every patient was calculated according to the formula: Risk score = EXP1 × β1 + EXP2 × β2 + … + EXPx × βx, where EXP represents the expression level of each RBP and β represents the corresponding regression coefficient from the multivariate Cox proportional hazards regression model. The patients were assigned to high- and low-risk groups using the median risk score as the cutoff value. The time-dependent receiver operating characteristic (ROC) curve was applied to evaluate the prognostic power of the model. Additionally, another cohort, GSE62254, which includes 300 patients from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62254), was used to assess whether the prognostic value of the model was credible. Statistical significance was defined as P-value < 0.05.
Finally, the nomogram and calibration plots were generated using the "rms" package in R.
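The risk score formula and the median split described in this section can be sketched in Python as follows. The gene names, coefficients and expression values are hypothetical placeholders, not the fitted model. Assigning a score exactly equal to the median to the low-risk group is also an assumption, since the handling of ties is not specified.

```python
from statistics import median
from typing import Dict


def risk_score(expression: Dict[str, float], coefficients: Dict[str, float]) -> float:
    """Sum of expression level times Cox coefficient over the signature genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def stratify(scores: Dict[str, float]) -> Dict[str, str]:
    """Label each patient 'high' or 'low' relative to the median risk score."""
    cutoff = median(scores.values())
    # Assumption: scores equal to the median fall into the low-risk group.
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients and expression values for illustration only.
example_coefficients = {"RBP_1": 0.5, "RBP_2": -0.25}
example_patients = {
    "P1": {"RBP_1": 2.0, "RBP_2": 4.0},
    "P2": {"RBP_1": 4.0, "RBP_2": 2.0},
    "P3": {"RBP_1": 1.0, "RBP_2": 8.0},
    "P4": {"RBP_1": 6.0, "RBP_2": 4.0},
}
example_scores = {
    pid: risk_score(expr, example_coefficients)
    for pid, expr in example_patients.items()
}
example_groups = stratify(example_scores)
```

For an external cohort, the same coefficients would be applied to that cohort's expression values; in this sketch the median is recomputed within each cohort passed to `stratify`.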

Verifying expression and prognostic value of the hub RBPs
The prognostic value of the hub RBPs in GC was assessed with the online Kaplan-Meier plotter database (http://kmplot.com/analysis/) [14]. Meanwhile, the Human Protein Atlas (HPA) database (http://www.proteinatlas.org/) was used to examine their expression at the translational level in GC tissue.

Results
Identification of DERBPs in GC
In total, 296 DERBPs were found between tumor and paracancerous tissue using the 'limma' package at the threshold of P-value < 0.01 and |fold change| > 0.5, among which 160 RBPs were up-regulated and 133 RBPs were down-regulated (Fig. 1a). The volcano plot shows the distribution of DERBPs (Fig. 1b).

Functional and pathway enrichment analysis of DERBPs
To investigate the function and potential mechanisms of DERBPs, GO and KEGG analyses were performed using the 'clusterProfiler' package in R. In the BP category, DERBPs were primarily enriched in ncRNA processing, rRNA metabolic process, rRNA processing, ribosome biogenesis and regulation of translation (Fig. 2a). In the MF category, DERBPs were notably enriched in catalytic activity on RNA, mRNA 3'-UTR binding, translation regulator activity, ribonuclease activity and single-stranded RNA binding (Fig. 2a). In the CC category, DERBPs were mostly enriched in cytoplasmic ribonucleoprotein granule, ribonucleoprotein granule, preribosome, nucleolar part and small-subunit processome (Fig. 2a). Moreover, DERBPs were significantly enriched in the KEGG pathways Ribosome biogenesis in eukaryotes, RNA transport, RNA degradation, mRNA surveillance pathway and Spliceosome (Fig. 2b).

PPI network analysis and identifying key module
A PPI network can reflect direct physical interactions and potential molecular functions between genes. We built a PPI network of DERBPs using the STRING database and Cytoscape software, consisting of 262 nodes and 2567 edges (Fig. 3a). Subsequently, the MCODE plug-in was used to screen the PPI network for important cluster modules, and the key modules selected incorporated 76 nodes and 1071 edges (Fig. 3b). According to the results of enrichment analysis, the RBPs in the top module were commonly involved in ribosome biogenesis, rRNA processing, preribosome, small-subunit processome, snoRNA binding, RNA helicase activity and Ribosome biogenesis in eukaryotes (Table 1).

Construction and validation of the prognostic signature
We carried out univariate Cox regression analysis of the key RBPs in the PPI network to find prognosis-related RBPs, and obtained 19 candidate genes associated with prognosis for further analysis (Fig. 4a). Then, multiple stepwise Cox regression was used to explore the independent prognostic impact of the 19 candidate genes in GC. As a result, we obtained seven hub RBPs and established a signature for OS based on them (Fig. 4b). GC patients were ranked by risk score and split into high- and low-risk groups using the median risk score as the cutoff point (Fig. 5a). Kaplan-Meier survival plots revealed that the low-risk group had a higher OS than the high-risk group (Fig. 5c). In addition, a time-dependent ROC curve showed that the seven-RBPs signature achieved an area under the curve (AUC) of 0.692 for five-year OS (Fig. 5e). To determine whether the seven-RBPs signature has similar prognostic value in another GC cohort, we applied the same formula to the GSE84437 set (Fig. 5b). Kaplan-Meier curves and the log-rank test indicated that the low-risk group had a longer survival time than the high-risk group in the GSE84437 set (Fig. 5d). Similarly, time-dependent ROC curves indicated that the model had considerable predictive performance for GC patients (Fig. 5f).
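The Kaplan-Meier curves used above to compare risk groups rest on the product-limit estimator, which the following Python sketch illustrates. It is a minimal illustration and not the software used in the study. The function name and the example follow-up times are hypothetical, and an event flag of 1 is assumed to mean death while 0 means censoring.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Product-limit estimate: (time, survival probability) at each event time.

    events[i] is 1 if the patient died at times[i] and 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= removed
    return curve


# Hypothetical follow-up data for one risk group.
example_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 0, 1])
```

Calling the function once per risk group gives the two curves that a log-rank test would then compare.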

Establishment of a nomogram based on hub RBPs
To provide clinicians with a quantitative method for predicting the prognosis of GC patients at different time points, we integrated the hub RBPs to construct a nomogram. The nomogram-based score was calculated from the seven RBPs on the point scale. Hence, the 1-year, 3-year and 5-year OS of each GC patient could be predicted from the total points in the nomogram (Fig. 6a). Importantly, calibration plots showed that the nomogram agreed well with the ideal model for predicting 3-year and 5-year OS of GC (Fig. 6b, c). Furthermore, we explored the prognostic value of the seven-RBPs signature and clinical factors for patients with GC from the TCGA set by Cox regression analysis. Univariate Cox regression analysis found that age, stage, pN and risk score were significantly related to OS (Table 2). However, only age and risk score remained significantly related to OS in the multivariate Cox regression analysis (Table 2).

Validation for hub RBPs prognosis and expression
To examine the association of the hub RBPs with the survival of GC patients, we performed the log-rank test for the hub RBPs through the online Kaplan-Meier plotter database (http://kmplot.com/analysis/). All the hub RBPs were significantly correlated with OS of GC except for SETD7 (Fig. 7). To verify whether the proteins encoded by the hub RBP genes can be detected in GC, immunohistochemistry (IHC) data were retrieved from the HPA web portal. The IHC results revealed that SETD7 protein expression was increased in GC tissue, whereas MSI2, RNASE1 and RNASE3 protein expression was decreased in GC tissue (Fig. 8). However, BOLL protein expression did not differ between GC and normal tissues (Fig. 8).

Discussion
GC is one of the most commonly diagnosed cancers in the world and is highly heterogeneous, seriously endangering human health and life [15]. Therefore, exploring its molecular mechanisms is critical for understanding the pathogenesis of GC and for improving its diagnosis and treatment. RBPs participate in almost all steps of the post-transcriptional regulatory layer, regulating the expression and function of each transcript in the cell and ensuring stable maintenance of the intracellular environment. In view of the central role of RBPs in gene expression, dysregulation of RBPs may lead to several diseases, including cancers [16,17]. Several studies have provided evidence that RBP dysregulation is common in various cancers [18][19][20]. However, few papers have explored the expression and specific functional role of RBPs in GC. In this study, we identified the DERBPs between GC tissue and adjacent tissue from the TCGA database. We then performed gene ontology and pathway enrichment analysis and constructed a PPI network of DERBPs. In addition, we used univariate Cox regression analysis and multiple stepwise Cox regression to screen prognosis-related hub RBPs and to build a seven RBPs-based signature for predicting the survival time of GC patients. Meanwhile, the log-rank test and time-dependent ROC analysis were applied to evaluate the prognostic value of the signature. These findings may provide potential biomarkers and help elucidate the pathological mechanisms of GC.
GO and KEGG analysis indicated that DERBPs were mainly enriched in ncRNA processing, rRNA metabolic process, rRNA processing, catalytic activity on RNA, mRNA 3'-UTR binding, translation regulator activity, cytoplasmic ribonucleoprotein granule, ribonucleoprotein granule, preribosome, ribosome biogenesis in eukaryotes, RNA transport, RNA degradation, mRNA surveillance pathway and spliceosome. Post-transcriptional regulation, including RNA processing, RNA degradation and translation, is an essential aspect of the regulation of gene expression. Previous studies demonstrated that disordered post-transcriptional regulation is associated with the occurrence and development of various cancers [21][22][23]. Many RBPs bind to sequence-specific motifs or RNA secondary structures through unique modular arrangements of individual RNA-binding domains to contribute to the homeostatic regulation of gene expression [24,25], which influences the progression of many diseases. For example, RBP MSI2a expression reduces the invasive ability of triple-negative breast cancer by stabilizing TP53INP1 mRNA and inhibiting ERK1/2 activity [26]. RBP CASC9 interacts with hnRNPL to form a complex that regulates DNA damage signaling and the PI3K/AKT signaling pathway, influencing tumor cell proliferation and apoptosis in vivo [27]. Ribonucleoprotein is the underlying basis for synthesizing all cellular proteins in all living organisms. Some research findings have shown that ribonucleoprotein is also involved in tumor initiation and progression [28,29]. We constructed the PPI network of DERBPs with Cytoscape software and used the MCODE tool to select key modules. The functional and pathway enrichment analysis showed that the key module was related to ribosome biogenesis, rRNA processing, preribosome, small-subunit processome, snoRNA binding and Ribosome biogenesis in eukaryotes.
In addition, the seven prognosis-related hub RBPs (RNASE3, RNASE1, SETD7, BOLL, ADARB1, PPARGC1B and MSI2) were screened by univariate Cox regression analysis and multiple Cox regression analysis. Compared with single biomarkers, integrating multiple biomarkers into a single model may considerably increase predictive accuracy [30]. Therefore, we built a signature based on the seven hub RBPs using multivariate Cox proportional hazards regression analysis for useful and sensitive prognosis of GC patients. The signature is capable of separating GC patients into high- and low-risk groups and may serve as an independent prognostic factor in GC. The time-dependent ROC curve indicated that the signature has moderate performance in survival prediction for GC patients. More importantly, the risk stratification capability of the signature was confirmed in another independent set. Next, a nomogram was drawn to quantitatively predict survival time for clinical use. Analysis with the Kaplan-Meier plotter showed that most of the seven hub RBPs are associated with the survival of GC patients. The protein expression levels of the seven hub RBP genes in human tissues were obtained from the HPA database, and the results indicated that the majority of the RBPs were differentially expressed between tumor and normal tissues.
In the present study, the seven-RBPs signature was shown to be significantly associated with OS of GC. Many of these hub RBPs have been reported to be closely related to tumor occurrence and development. SETD7 is the only member of the lysine methyltransferase seven family that can methylate transcription factors [31]. SETD7 could be used as a prognostic indicator for breast cancer; its downregulation suppressed the expression of antioxidant enzymes and destabilized the redox status [32]. SETD7 and ISL1 may combine to form a complex on the ZEB1 promoter to promote tumorigenesis in GC cells [33]. BOLL, an ancestral gene of the Deleted in Azoospermia family, maintains normal sperm function [34,35]. BOLL functions as an oncogene by enhancing proliferation and migration in colon cancer cells, and BOLL protein expression was upregulated in colorectal cancer tissue [36]. ADARB1 was expressed at high levels in endometrial cancer, and its increased expression correlated positively with the degree of invasion [37]. Another study reported that ADARB1 could inhibit glioblastoma cell growth via regulation of the CDC14B/Skp2/p21/p27 axis [38]. MSI2 has been widely studied in digestive system tumors. MSI2 promoted hepatocellular carcinoma progression via the Wnt/β-catenin signaling pathway and may serve as an indicator to predict the outcome of patients with hepatocellular carcinoma [39]. MSI2 expression levels were upregulated in GC tissues and associated with poor prognosis of GC. Furthermore, MSI2 induced migration, invasion and angiogenesis to increase the proliferation and invasiveness of GC cells [40].
However, several limitations of this study should not be ignored. First, there is a large gap between the number of tumor samples and the number of adjacent samples, which may affect the accuracy of the results. Second, this study had a retrospective design; prospective experimental and clinical data are needed to further confirm these findings. Moreover, some well-known and potentially significant clinical information, such as treatment plan, vascular invasion and perioperative data, is not provided in the TCGA database. Finally, the population in the database mainly came from western countries, which may introduce observation bias.

Conclusions
In summary, we comprehensively analyzed the expression, function, interaction and prognostic value of RBPs in GC through bioinformatic analysis. In addition, our research not only established an RBPs-based signature but also generated a nomogram to predict the prognosis of GC patients. Our findings may provide new insights into the roles of RBPs in GC and help develop potential markers for guiding treatment and prognosis.

Figure 1 The significantly altered RBPs in GC samples. a Heatmap. b Volcano plot.