Identification of significant genes signatures and prognostic biomarkers in cervical squamous carcinoma via bioinformatic data

Background Cervical squamous cancer (CESC) is an intractable gynecological malignancy because of its high mortality rate and the difficulty of early diagnosis. Several biomarkers have been found to predict the prognosis of CESC using bioinformatics methods, but they still lack clinical effectiveness. Most existing bioinformatic studies focus only on changes in oncogenes and neglect differences at the protein level, and molecular biology validation is rarely conducted. Methods Gene expression datasets from the NCBI-GEO database were used to compare gene and protein levels between normal and cancer tissues through significant pathway selection and core gene signature analysis, in order to screen potential clinical biomarkers of CESC. Subsequently, the molecular and protein levels in clinical samples were verified by quantitative reverse transcription PCR, western blotting and immunohistochemistry. Results Three differentially expressed genes (RFC4, MCM2, TOP2A) were significantly associated with survival (P < 0.05) and were highly expressed in CESC tissues. Molecular biological verification using quantitative reverse transcription PCR, western blotting and immunohistochemistry showed significant differences in the expression of RFC4 between CESC and para-cancerous tissues (P < 0.05). Conclusion This study identified three potential biomarkers (RFC4, MCM2, TOP2A) of CESC, which may help clarify the underlying mechanisms of CESC and predict the prognosis of CESC patients.


INTRODUCTION
Cervical cancer is now the fourth most prevalent cancer and the most common gynecological cancer in developing countries (Vu et al., 2018). Despite the increasing incidence of cervical adenocarcinoma, cervical squamous carcinoma (CESC) is still the most common type of cervical cancer (Wang et al., 2004; Galic et al., 2012). A large number of gene mutations have been shown to be related to the pathogenesis of cervical cancer and can be used as biomarkers for early detection, such as DNA mutations in tumor protein 53 (TP53) (Crook et al., 1992) and phosphatase and tensin homolog (PTEN) (Yang et al., 2015). However, because early detection and diagnosis remain difficult, the survival rate of CESC patients remains low. Studies have also shown that some biological markers can explain the pathogenesis of CESC and predict its outcome (Mao et al., 2019). Therefore, more reliable biological markers should be explored to comprehensively understand the pathogenesis of CESC and to guide treatment and prognosis.
With the development of bioinformatics and statistical analysis, potential marker genes can be detected effectively. These methods have shown great strength in the discovery and prediction of tumor markers and play a guiding role in the treatment and prognosis of disease (Banwait & Bastola, 2015). Some biomarkers have been identified in cervical cancer, such as microRNA-425-5p and microRNA-489, which have been proposed for prognostic prediction (Juan et al., 2018).
However, the biomarkers currently available for clinical application are far from sufficient, and most previous bioinformatics studies focused only on changes in oncogenes, which increases the possibility of clinical inefficacy. In addition to examining genes differentially expressed between cancer tissues and normal tissues, this study analyzed and compared the difference in protein level between cancer and normal tissues, which provides stronger evidence for the validity of the biomarkers identified in our bioinformatic research.

Information of the microarray data
NCBI-GEO (Gene Expression Omnibus) is a free public database of microarray cohorts. The gene expression profiles of GSE27678, GSE39001 and GSE7803 were obtained for this study. The three datasets were based on the GPL570, GPL201 and GPL96 platforms and included 14 normal cervical tissues and 28 CESC tissues, 12 normal cervical tissues and 43 CESC tissues, and 10 normal cervical tissues and 21 CESC tissues, respectively.

Identification of differentially expressed genes
The differentially expressed genes (DEGs) were analyzed with GEO2R to obtain the numbers of up-regulated and down-regulated genes (Barrett et al., 2013). Genes with |log Fold Change| ≥ 2 and P < 0.05 were screened as differentially expressed genes.
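The screening rule above can be sketched in a few lines of Python. This is only an illustration of the thresholding step, not the GEO2R workflow itself: the `GeneStat` record, the `select_degs` helper and the example values are hypothetical, and it is assumed that the fold change is a log2 value for tumor versus normal tissue. The text does not say whether the raw or the adjusted P value was used, so the sketch simply takes one P value per gene.

```python
from dataclasses import dataclass


@dataclass
class GeneStat:
    symbol: str
    log_fc: float    # assumed: log2 fold change, tumor vs. normal
    p_value: float   # raw or adjusted P value (not specified in the text)


def select_degs(stats, lfc_cutoff=2.0, p_cutoff=0.05):
    """Split genes passing |logFC| >= cutoff and P < cutoff into up/down lists."""
    up, down = [], []
    for g in stats:
        if g.p_value >= p_cutoff or abs(g.log_fc) < lfc_cutoff:
            continue  # fails at least one of the two criteria
        (up if g.log_fc > 0 else down).append(g.symbol)
    return up, down


# Hypothetical values, for illustration only
example = [
    GeneStat("GENE_A", 2.6, 0.001),
    GeneStat("GENE_B", -2.1, 0.03),
    GeneStat("GENE_C", 1.2, 0.0004),  # fold change too small
    GeneStat("GENE_D", 3.0, 0.20),    # not significant
]
up, down = select_degs(example)
```

Here `up` holds the up-regulated symbols and `down` the down-regulated ones; genes failing either criterion are dropped.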

Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway analyses
Gene Ontology (GO) is an international standardized classification system of gene function, which provides a dynamically updated database to describe the attributes of genes and gene products in organisms (Ashburner et al., 2000). The main biological functions of differentially expressed genes can be determined by GO functional enrichment analysis. GO terms with q < 0.05 were considered to be significantly enriched in DEGs.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) database is a bioinformatics resource for linking genomes to life and the environment (Kanehisa et al., 2017). Based on the KEGG database, pathway enrichment analysis of the DEGs was carried out to identify the important pathways.
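The text does not state which tool or test produced the enrichment q values. As a rough sketch of how such an analysis commonly works, the following assumes a one-sided hypergeometric over-representation test followed by Benjamini-Hochberg adjustment; both choices, the function names and the example numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of drawing at least k annotated genes when n genes
    are sampled from a background of N genes, K of which carry the term."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvals):
    """Return BH-adjusted q values in the same order as the input P values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest to smallest P
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical numbers: 25 DEGs, 5 of them in a term annotated to 200 of 20,000 genes
p_term = hypergeom_sf(k=5, N=20000, K=200, n=25)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A term would then be reported as enriched when its q value falls below 0.05, matching the cutoff given above.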

PPI & module analysis
Cytoscape 3.8.0 was used for the visualization and analysis of complex networks (Shannon et al., 2003). The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) is an application for protein interaction, genome and proteome research (Doncheva et al., 2019). By mapping the DEGs to STRING, we evaluated the protein-protein interaction (PPI) information of the DEGs. Experimentally validated interactions with a combined score > 0.4 were selected. Subsequently, we used Molecular Complex Detection (MCODE), a tool embedded in Cytoscape, to cluster the constructed PPI network into functional modules (Bader & Hogue, 2003). Modules with an MCODE score greater than 10 and more than 6 nodes were retained. Functional and pathway enrichment analyses of the DEGs in the modules were also conducted, and P < 0.05 was considered statistically significant.
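The score-based edge selection can be illustrated with a small Python sketch. It only covers the thresholding step and a simple degree ranking; it does not reimplement STRING or the MCODE clustering algorithm. The helper names, protein labels and scores are hypothetical, and the scores are assumed to be on a 0 to 1 scale.

```python
from collections import defaultdict


def build_ppi(edges, min_score=0.4):
    """Build an undirected adjacency map from (protein_a, protein_b, score)
    tuples, keeping only interactions whose combined score exceeds min_score."""
    adj = defaultdict(set)
    for a, b, score in edges:
        if score > min_score and a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree_ranking(adj):
    """List nodes from most to least connected (ties broken alphabetically)."""
    return sorted(adj, key=lambda node: (-len(adj[node]), node))


# Hypothetical interactions, for illustration only
edges = [
    ("P1", "P2", 0.90),
    ("P1", "P3", 0.70),
    ("P2", "P3", 0.45),
    ("P3", "P4", 0.35),  # below the 0.4 cutoff, so it is dropped
]
network = build_ppi(edges)
n_edges = sum(len(neighbours) for neighbours in network.values()) // 2
ranked = degree_ranking(network)
```

In this toy network `P4` disappears because its only interaction falls below the cutoff, which mirrors how low-confidence edges are excluded before module detection.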

Survival analysis of significant genes in CESC and RNA expression of core genes
The Kaplan-Meier (K-M) method is widely used for estimating the survival rate of cancer patients, and the "survival" package was applied in RStudio (Rich et al., 2010). To compare the magnitude of the difference in survival between the two patient groups, a univariate Cox hazard ratio (HR) was calculated. The clinical significance of each gene was also evaluated by single-gene survival analysis within the survival-related gene sets. A log-rank test was used to assess the statistical significance of the survival difference between the two groups, and P < 0.05 was considered significant.
Gene Expression Profiling Interactive Analysis (GEPIA) is a visualization tool for gene expression research (Tang et al., 2017). In this study, GEPIA was applied to analyze the RNA expression of selected genes on the basis of thousands of samples from the TCGA database.
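To make the two survival statistics concrete, the following is a minimal pure-Python sketch of the Kaplan-Meier estimator and a two-group log-rank test. It is not the R code used in the study; the function names and the follow-up data are hypothetical, events are coded 1 for death and 0 for censoring, and the Cox hazard ratio is not covered.

```python
from math import erfc, sqrt


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    data = sorted(zip(times, events))
    at_risk, surv, curve, i = len(data), 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # events and censored cases both leave the risk set
    return curve


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-sided log-rank P value (chi-square with 1 degree of freedom)."""
    pooled = zip(times_a + times_b, events_a + events_b)
    event_times = sorted({t for t, e in pooled if e})
    obs_minus_exp = var = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / var
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square, 1 df


# Hypothetical follow-up times (months) and event flags, for illustration only
high_t, high_e = [5, 8, 12, 20, 24], [1, 0, 1, 0, 0]
low_t, low_e = [2, 3, 4, 6, 7], [1, 1, 1, 1, 0]
curve_high = kaplan_meier(high_t, high_e)
p_value = logrank_p(high_t, high_e, low_t, low_e)
```

In practice a maintained library would be preferable; the sketch is only meant to show what the curve and the log-rank P value are computed from.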

Specimen collection
Tissues or cells from CESC patients were collected at Xiangya Hospital of Central South University in order to verify the high expression of RFC4 in tumor tissues at the molecular and protein levels. This study was approved by the Medical Ethics Committee of Xiangya Hospital (No. 201912542). CESC patients and their kin signed a consent form agreeing to the use of cervical tissue for scientific research.

Molecular biological verification of differences in gene expression
CESC tissues and para-cancerous tissues (para-CT) were collected from CESC patients for the molecular validation of RFC4. The expression levels of RFC4 in CESC patients with different pathological stages were also compared. Stages I and II were regarded as early stage, which included 4 IB1 patients, 7 IB2 patients, 3 IB3 patients, 3 IIA1 patients and 1 IIA2 patient. Stage III was regarded as advanced stage, and 17 patients in stage IIIC1 were included. Total RNA was extracted from CESC tissues and para-CT using Trizol Reagent (RNAiso Plus, TaKaRa, 9109) according to the manufacturer's protocols, and reverse transcribed into cDNA using a PrimeScript™ RT reagent Kit with gDNA Eraser (TaKaRa, RR047A-1). Gene expression levels were assessed by quantitative reverse transcription PCR (qRT-PCR) with TB Green™ Premix (Tli RNaseH Plus, TaKaRa, RR820A) and specific primers: RFC4 forward: 5′-GGCAGCTTTAAGACGTACCATGG-3′; RFC4 reverse: 5′-TCTGACAGAGGCTTGAAGCGGA-3′. β-actin expression was used as the normalization control. Relative mRNA levels were analyzed using the 2^−ΔΔCt method.
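The relative quantification step can be written out as a short calculation. The sketch below assumes the common 2^−ΔΔCt scheme, with β-actin as the reference gene and para-CT as the calibrator; the function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of the target gene in the sample relative to the control,
    normalized to a reference gene (2^-ddCt)."""
    delta_sample = ct_target_sample - ct_ref_sample      # dCt in tumor tissue
    delta_control = ct_target_control - ct_ref_control   # dCt in para-CT
    return 2 ** -(delta_sample - delta_control)


# Hypothetical Ct values, for illustration only
fold_change = relative_expression(
    ct_target_sample=24.0, ct_ref_sample=18.0,    # RFC4 and beta-actin, tumor
    ct_target_control=26.0, ct_ref_control=18.0,  # RFC4 and beta-actin, para-CT
)
```

With these made-up values the target reaches threshold two cycles earlier in the tumor sample after normalization, which corresponds to a four-fold higher relative expression.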

Verification of differences in protein expression
Cancerous tissues and para-CT from CESC patients were used to analyze differences in protein expression by western blotting (WB). The samples for WB analysis were separated by SDS-PAGE and transferred onto a PVDF membrane (Roche), which was blocked with 5% nonfat milk in Tris-buffered saline and incubated overnight at 4 °C with primary antibodies against the following proteins: anti-RFC4 antibody (ab156780, Abcam) and anti-β-actin antibody (ab115777, Abcam). After three washes with PBST (10 min each), the membrane was incubated with species-appropriate HRP-conjugated secondary antibodies, and the fluorescent signals were detected using the SageCapture™ imaging system (SAGECREATION).
Immunohistochemistry (IHC) assays were also performed to detect protein levels in CESC tissues and para-CT. Formalin-fixed, paraffin-embedded tissues were cut into 5-µm-thick sections. Subsequently, these sections were deparaffinized with xylene and rehydrated with graded ethanol, followed by heating in antigen retrieval solution (EDTA, pH 9.0) and inactivation of endogenous peroxidase with 3% H2O2. After blocking, the samples were incubated overnight at 4 °C with anti-RFC4 antibody (1:100, ab156780, Abcam). The slides were then treated with the HRP-conjugated secondary antibody and stained with 3,3′-diaminobenzidine until brown granules appeared in the membrane, cytoplasm, or nucleus. Finally, the sections were counterstained with hematoxylin at room temperature.

Screening for DEGs
In total, 92 cancer tissues and 36 normal tissues were selected from the three datasets. Using GEO2R, 211, 134 and 260 DEGs were extracted from GSE39001, GSE7803 and GSE27678, respectively. A Venn diagram was then drawn with Venn diagram software to identify the DEGs common to all three datasets. The results showed 25 common DEGs in total, of which 18 were down-regulated and 7 were up-regulated (Fig. 1 and Table 1).
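The overlap shown in a three-way Venn diagram is simply the intersection of the three DEG lists. A minimal sketch, using a hypothetical helper and made-up gene symbols rather than the actual DEGs:

```python
def common_degs(*gene_lists):
    """Return the genes present in every list, sorted for stable output."""
    sets = [set(genes) for genes in gene_lists]
    return sorted(set.intersection(*sets))


# Hypothetical gene symbols, for illustration only
dataset_1 = ["G1", "G2", "G3", "G4"]
dataset_2 = ["G2", "G3", "G5"]
dataset_3 = ["G3", "G2", "G6", "G7"]
shared = common_degs(dataset_1, dataset_2, dataset_3)
```

Applied to the three real DEG lists, the same operation would yield the genes in the central region of the diagram.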

Significant pathways identified in CESC
We investigated the up-regulated and down-regulated DEGs to identify the most significantly enriched pathways in each group by GO and KEGG pathway analysis (Fig. 2 and Table 2). GO analysis indicated that (1) for biological processes (BP), the most significantly enriched terms of the DEGs were epidermis development, positive regulation of cell proliferation, peptide cross-linking, regulation of cell proliferation, positive regulation of cellular process, epidermal cell differentiation, skin development, keratinocyte differentiation, positive regulation of nuclear division, and positive regulation of mitotic nuclear division; (2) for molecular function (MF), they were chemokine activity, chemokine receptor binding, calcium ion binding, collagen binding, CXCR chemokine receptor binding, growth factor activity, integrin binding, cytokine activity, peptidase activity, acting on L-amino acid peptides, and CCR chemokine receptor binding; (3) for …
The results of KEGG analysis demonstrated that the most significant signaling pathways of the DEGs were cell cycle, pathways in cancer, ECM-receptor interaction, arrhythmogenic right ventricular cardiomyopathy (ARVC), melanoma, PI3K-Akt signaling pathway, focal adhesion, vascular smooth muscle contraction, DNA replication and oocyte meiosis (Table 3).

Systematic analysis of core genes by PPI network
The PPI network was used to investigate the systematic interactions between the DEGs obtained above. In total, 25 DEGs (7 up-regulated and 18 down-regulated) were mapped to the PPI network, which contained 99 nodes and 270 edges. Cytoscape MCODE was then applied for further analysis of the DEGs in the PPI network, and 15 particular nodes were identified, all of which were up-regulated DEGs (Fig. 3).

Analysis of core gene signature in CESC using K-M plotter and GEPIA
To investigate the survival data of the 15 genes identified, K-M plotter analysis was performed. It indicated that three of them (TOP2A, RFC4, MCM2) were significantly associated with survival, whereas the other 12 genes were not (P > 0.05) (Fig. 4 and Table 4). The expression of TOP2A, RFC4 and MCM2 in normal tissue and CESC tissue was examined with GEPIA. The results showed that the expression of these three genes in CESC tissue was higher than that in normal tissue (P < 0.05) (Fig. 5).

RFC4 is validated to be overexpressed in CESC
By analyzing mRNA expression data for CESC patients from the NCBI-GEO database, the RFC4 gene was identified as overexpressed in CESC patients. Paired tissues from 35 CESC patients were collected for qPCR, paired tissues from 6 CESC patients were used for WB, and 9 pairs of CESC tissues and 4 normal cervical tissues were used for IHC. To validate our finding, total RNA was extracted from the 35 paired CESC tissues and para-CT, and qRT-PCR was conducted to measure the expression level of the RFC4 gene. The results showed that the expression level of RFC4 in CESC tissues was significantly higher than that in para-CT (P = 0.0197) (Fig. 6). In addition, the expression of RFC4 in early-stage CESC was significantly higher than that in advanced CESC (P = 0.0314) (Fig. 7). The same result was obtained from WB, which indicated that RFC4 was overexpressed in CESC tissues compared with para-CT (Fig. 8). IHC also showed a higher level of RFC4 expression in CESC tissues, with the RFC4 protein mainly concentrated in the nucleus (Fig. 9).
Table 4 Prognostic analysis information for the 15 core DEGs.

DISCUSSION
To identify more effective prognostic biomarkers in CESC, we used different bioinformatics methods to analyze three datasets from the NCBI-GEO database, comprising 92 CESC tissues and 36 normal tissues. A total of 25 DEGs were selected with GEO2R and Venn software, including seven up-regulated genes and 18 down-regulated genes. GO and KEGG pathway analyses were then conducted, and the results indicated that the selected DEGs were significantly enriched in various cellular pathways. Previous research has reported that genes from these pathways could be associated with the pathogenesis and progression of cervical cancer. Nucleolar and spindle associated protein 1 (NUSAP1), a gene from a spindle-associated pathway, was reported to promote the metastasis of cervical cancer by activating Wnt/β-catenin signaling (Li et al., 2019). Studies have also shown that the CXCL12/CXCR4 pathway is associated with HPV infection as a co-factor, which implies a high risk of cervical cancer (Meuris et al., 2016). Genes involved in epidermis development were also associated with high-risk HPV infection (Zhang et al., 2018; Chatterjee et al., 2019). A PPI network was then constructed using STRING, MCODE analysis was conducted, and 15 particular DEGs were identified. Furthermore, K-M plotter analysis identified three of the 15 DEGs that were significantly associated with better survival. The results of GEPIA showed that the expression levels of the three selected genes in CESC tissues were higher than those in normal tissues. For further validation, we performed molecular biology experiments on RFC4, and the results showed that RFC4 was highly expressed in CESC tissues compared with normal tissues.
Replication Factor C (RFC) is a structure-specific DNA-binding protein that acts as a primer recognition factor for DNA polymerase (Zhou & Hingorani, 2012) and comprises five subunits (RFC1-5). Among the five subunits of the RFC complex, RFC4 has been reported to play an important role in DNA damage checkpoint and DNA replication pathways (Ellison & Stillman, 2003). Arai et al. (2009) reported that RFC4 was closely related to the prognosis of liver cancer. Besides liver cancer, RFC4 has been reported to be associated with several types of cancer, including prostate cancer, colon cancer, non-small cell lung cancer and leukemia (LaTulippe et al., 2002; Jung, Choi & Kim, 2009; Erdogan et al., 2009; Barfeld et al., 2014). Up-regulated RFC4 expression has been found in neck squamous cell carcinoma, where it was 3.4-fold higher than in normal tissues (Slebos et al., 2006). Garnett et al. (2012) showed that RFC4 can be regulated by mutated RB1 in several types of cancer, suggesting that RFC4 could be a potential biomarker associated with the occurrence and prognosis of various cancers. Moreover, RFC4 was reported as an independent predictor of overall survival in breast cancer (Fatima et al., 2017; Niu et al., 2017).
In this study we identified RFC4 as a potential independent prognostic biomarker in CESC, and our results suggested that CESC patients with a higher expression level of RFC4 may have better overall survival. A possible reason is that RFC4 is highly expressed throughout the cell cycle of proliferating cells, and tumor proliferation in situ slows as the disease develops (Szymanska et al., 2018; Chaplain & Sleeman, 1993), which implies a decrease in the expression of RFC4. Therefore, highly expressed RFC4 may indicate early-stage CESC, which is associated with better overall survival.
Several studies have shown that these three genes are associated with numerous types of cancer, but studies of RFC4 in CESC are rare, and very few have conducted molecular biology validation. Therefore, our study shows that RFC4 is a potential biomarker for predicting the prognosis of CESC and provides a direction for further study of CESC. It should be noted that this study has some limitations. Clinical samples from a single hospital may reflect regional or racial differences. The expression level of RFC4 in different stages of CESC should be examined further, and clinical investigations should be conducted in future studies to validate our results.
Figure 9 IHC test of CESC. IHC showed that, in general, the RFC4 protein is highly expressed in tumor tissue sections and is mainly concentrated in the nucleus, whereas its expression is low in normal cervical tissue and para-cancerous tissues. DOI: 10.7717/peerj.10386/fig-9

CONCLUSIONS
In conclusion, using bioinformatics analysis of three microarray datasets, we identified three genes (TOP2A, RFC4, MCM2). These three genes were suggested to have a significant effect on the prognosis of CESC and could be key factors in the occurrence and progression of CESC. High expression of RFC4 in CESC tissues was validated using clinical samples. Although further investigation and experiments need to be conducted, the genes identified in our study could act as clinical biomarkers that help us better understand the pathological process and predict the prognosis of CESC.