Identication of hub genes and their clinical value for predicting the development of prostate cancer from benign prostate hyperplasia by bioinformatic analysis


 BackgroundProstate cancer (PCa) and benign prostate hyperplasia (BPH) are commonly encountered diseases in elderly males. The two diseases have some commonalities: both are growth depend on hormone and respond to antiandrogen therapy. Some studies have shown that genetic factors are responsible for the occurrences of both diseases. There may be a correlation between BPH and PCa. MethodsThe GEO database can help determine the differentially expressed genes (DEGs) between BPH and PCa. Gene Ontology (GO) term enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were utilized to find pathways in which the DEGs were enriched. The STRING database can provide a protein–protein interaction (PPI) network, and Cytoscape software can find hub genes in PPI network. GEPIA can be used to analyze expression and survival data for hub genes. R software was used to progress regression analysis, decision curve analysis and built nomograph. UALCAN and The Human Protein Atlas was utilized to test the results. Finally, we made clinical and cell experiments to verify the results.ResultsSixty DEGs, consisting of 15 up-regulated and 45 down-regulated genes, were found based on the GEO database. Using Cytoscape, we found 7 hub gene in the PPI network. The hub gene expression was tested on TCGA database. Except CXCR4, all hub genes expressed differently between tumor and normal samples. Meanwhile, all hub genes exclude CXCR4 has diagnostic value in predicting PCa and their mutations are risk factors leading to PCa. The expression of CSRP1, MYL9 and SNAI2 changed in different tumor stage. CSRP1 and MYH11 could affect the disease-free survival (DFS). The same results reflected in different database. In addition, we also chose three hub gene, MYC, MYL9, and SNAI2, to validate their functions in clinical specimens and cells.ConclusionThese identified hub genes can help us to understand the process and mechanism by which BPH develops into PCa and provide achievable targets for predicting which BPH patients may later develop PCa.

between BPH and PCa was rst noted in the 1950s in studies of prostate glands. During the past 60 years, some studies have shown that BPH and PCa have some de nite associations. Sommers rst studied BPH and PCa on cadavers. In their work, BPH was found in 80% and 45% of cadavers with or without PCa, respectively [2]. Later, another study covered alike results [3]. In 2002, Hammarsten and Hogstedt reported that BPH which has a faster growing speed may be hazards for developing clinical PCa [4]. Another study proved that the volume of the prostate may be one of the reasons for the aggressiveness of PCa, and PCa located in small glands is more aggressive than that located in larger glands [5]. This means that BPH may affect the degree of malignancy of PCa. Orsted  were expressed differently between BPH and PCa samples [7]. Some studies have reported that gene expression could be a causal factor in the development of PCa from BPH and may even affect the degree of malignancy of PCa [8,9]. Microarray technology and bioinformatic analysis have been extensively wielded to analyze differentially expressed genes (DEGs) and can be used to nd functional pathways that will help us to better understand PCa [10]. Through gene expression pro ling, some investigations have found many DEGs that play critical roles in the process of the development of PCa from BPH [8,9]. However, dependence on a single microarray analysis may lead to inaccurate results. Therefore, examination of different microarray analyses can provide more reliable results.
The Gene Expression Omnibus (GEO) database can provide many microarray datasets. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis and proteinprotein interaction (PPI) network analyses have been performed to help us understand the potential mechanisms for the occurrence and progression of diseases. Thus, we used bioinformatic analysis to nd key genes that may be important for development of PCa from BPH. Then, based on genetic and clinical data from The Cancer Genome Atlas (TCGA) database, the diagnostic model was built to predict the hub genes of diagnostic value for PCa. Tumor staging and survival time were also analyzed. Logistic regression was used for predicting the mutation of hub genes leading to PCa. The UALCAN and The Human Protein Atlas databases could test the expression of hub gene in normal specimens and PCa tissues. Finally, we validated the function of three DEGs, MYC, MYL9, and SNAI2, in PCa based on clinical specimens and C4-2 PCa cells.

Microarray data
The GEO database (http://www.ncbi.nlm.nih.gov/geo/) at the National Center for Biotechnology Information (NCBI) is a communal database that provides a genomics data repository of gene expression, chip, and microarray data [11]. The criteria for GSE data included in the study as follow: 1

Data handling and DEGs searching
The primary data were got and normalized by R language. According to comments of the documents, the expression matrix including probe ID was substituted by the corresponding gene ID, and if there were multiple probes that corresponded to the same gene, the average value was calculated using the R software for further study [12]. Then all genes of each data set were searched using the limma R package, and genes with an adjusted P-value < 0.05 and |log2fold change (FC)|>1 were considered DEGs. Then, we used the online web tool, Venn diagrams (http://bioinformatics.psb.ugent.be/webtools/Venn/) to nd the integrated DEGs. In addition, the up-regulated and down-regulated genes were downloaded for further study.

GO and KEGG pathway analysis of DEGs
The Database for Annotation, Visualization, and Integrated Discovery (DAVID) v6.8 (https://david.ncifcrf.gov/) was used to perform GO functional and KEGG pathway analyses of the integrated DEGs [13]. The GO functional analysis of integrated DEGs involved three parts: biological processes (BP), cell components (CC), and molecular functions (MF). P < 0.05 was considered statistically statistical differences [14].

PPI network and module analysis
A PPI network of the integrated DEGs was structured by online tool the Search Tool for the Retrieval of Interacting Genes (STRING) database with the default medium con dence (0.4) (http://www.stringdb.org/) [15]. It helped us to nd the key genes and critical gene modules participated in the promotion of BPH to PCa. Cytoscape software was used for reconstructing the PPI network, and module and GO analyses were carried out by two plug-ins in Cytoscape, Molecular Complex Detection (MCODE) and Biological Network Gene Ontology tool (BiNGO), to clarify the biological signi cance of gene modules from BPH to PCa. P < 0.05 indicated a signi cant difference, and these genes were designated as hub genes [16].

Construction of risk prediction model and survival analysis
Hub gene expression between normal prostate specimens and PCa tissues was compared using gene expression pro ling interactive analysis (GEPIA; http://www.gepia.cancer-pku.cn/) dependent on TCGA database [17]. Logistic regression was performed to screen the hazard ratios of hub genes changes leading to PCa. A nomogram was built to predict the risk value of the hub genes. A forest map was utilized to show the hazard ratios more intuitively. Moreover, the prognostic value of gene was enucleated by GEPIA. Then, overall survival (OS), disease-free survival (DFS) was analyzed too.

Construction of the diagnostic model and decision curve analysis
To further analyze the hub genes' diagnostic value for PCa, we collected the gene expression of hub genes and clinical data from TCGA databases (https://portal.gdc.com) [18]. After all data were collected, we used GraphPad Prism 7 (GraphPad Software, Inc., San Diego, CA) to draw the receiver operating characteristic (ROC) curve to re ect the accuracy of the model. Then decision curve analysis (DCA) was carried out by R software.

Expression of hub genes at different tumor stages
Tumor-Node-Metastasis (TNM) classi cation of malignant tumors is commonly used to assess the tumor severity. American Joint Commission on Cancer (AJCC) has another classi cation on tumor [19]. Study the relationship between altered gene expression and tumor can help understanding the effect of genetic changes on tumor progression. Clinical data such as TNM classi cation and ACJJ tumor classi cation was got from TCGA database. Then, we analyzed the hub genes' expression correlation with these clinical data.

Validation of hub genes based on different databases
Because the normal tissue samples from GEPIA including sequencing from both TCGA and GTEx database. So, we further analyzed the hub genes' expression in normal prostate samples and PCa specimens on the UALCAN database (http://ualcan.path.uab.edu/) and The Human Protein Atlas (https://www.proteinatlas.org/) [20,21].

Clinical specimen validation
Quantitative real-time PCR (RT-qPCR), Western blotting (WB), and immunohistochemistry (IHC) were used for analyzing clinical specimens. The total RNA was extracted from tumor and para-cancerous tissues of 6 PCa patients who were previously diagnosed with BPH utilizing TRIzol Reagent (Sigma-Aldrich, St. Louis, MO, USA). cDNA was transcribed using the reverse transcription kit (Advantage® RT-for-PCR Kit, Takara Bio Inc., Kusatsu, Japan). Finally, we measured the volume of cDNA using real-time PCR reagents and a kit (TB Green® Premix Ex Taq™ II, Takara Bio Inc.) according to the manufacturer's descriptions. The following primers of c-MYC, MYL9, SNAI2 and GAPDH were shown in Table 1. The 2 −ΔΔCt method was used to quantify mRNA expression levels. . The next day, 1xTBST was used to wash the membranes three times (10 min. each). Then, the membranes were incubated at room temperature for 1 h with a matched secondary antibody (HRP-labeled Goat Anti-Human IgG (H + L), Beyotime Biotechnology, Shanghai, China). Lastly, the membranes were exposed to X-ray lm.
Then, the expression of MYC, MYL9, and SNAI2 in 2 clinical patients' specimens was detected by IHC. Prostate sections were depara nized with xylene and then hydrated through a stepwise ethanol gradient. After antigen repair, endogenous peroxidase activity was blocked by incubation with 3% H 2 O 2 , and nonspeci c binding was blocked using 5% goat serum. Sections were incubated with anti-MYC, anti-MYL9, and anti-SNAI2 antibodies (Abcam UK) at 4°C overnight and then with horseradish peroxidaselabeled secondary antibody. Slices were scanned, and images were processed using an Olympus microscope (Olympus Corp., Tokyo, Japan) and Olympus cellSens software (Olympus Corp.).
Validation in C4-2 prostate cancer cell lines C4-2 PCa cells were transfected with a MYC breakdown plasmid, a MYL9 overexpression plasmid, and a SNAI2 overexpression plasmid (constructed by fenghbio company, Hunan, China). Then transfected C4-2 cells were used in invasion and proliferation experiments. Approximately 1*10 5 C4-2 cells and 150 uL 2% fetal bovine serum (FBS) (gibco, Thermo Fisher Scienti c, Waltham, MA, USA) + 1640 culture medium (sigma Darmstadt, Germany) was put in the upper chamber, and 10% FBS + 1640 culture medium was placed in the lower cubicle. After 36h, cells were xed with 4% paraformaldehyde xative solution. The cells were stained with crystal violet and observed by an Olympus microscope (Olympus Corp.). ImageJ was utilized to count cell numbers. About 1000 C4-2 cells were placed in each well of a 96-well plate. Each set was repeated three times. The proliferation of cells in 0h, 24h, 48h and 72 h were detected by Cell Counting Kit-8 (CCK-8). The optical density (OD) at 450nm was measured by enzyme labeling.

Statistical analysis
The matrix data was handled with R version 4.0.2 (Institute for Statistics and Mathematics, Vienna, Austria; https://www.r-project.org) and the statistical analysis was nished by SPSS 25.0 (IBM Corp., Armonk, NY, USA). For descriptive statistics, mean ± standard deviation was used for continuous variables with normal distributions, whereas the median (range) was used for continuous variables with abnormal distributions. Categorical variables were described by counts and percentages. Hazard ratios (HRs), the 95% con dence interval (95% CI), and P values were used as statistical metrics. Two-tailed P < 0.05 was deemed as statistically signi cant.

Identi cation of DEGs
Gene expression datasets GSE5377, GSE107479, and GSE30994 were acquired from the GEO database. All datasets were normalized by R language, and then the DEGs were screened using the limma R

GO and KEGG pathway analysis of DEGs
Try to nd the pathways which DEGs enriched, we used DAVID to nd the pathways. Up-regulated genes were enriched in genes involved in the regulation of the plasma membrane. The down-regulated genes were mainly enriched in genes involved in negative regulation of nitric oxide biosynthesis, regulation of extracellular exosome expression, and regulation of glutathione binding. In the KEGG pathway analysis, we also found that the down-regulated genes were mainly enriched in proteoglycans in cancer and focal adhesion. The KEGG pathways which DEGs enriched was shown in Fig. 2C (Fig. 2C). The GO analysis was shown in Fig. 2D. (Fig. 2D).

PPI network and the most signi cant module analysis
The PPI network of DEGs was constructed to help nd the hub genes. The most signi cant module was obtained using Cytoscape software, and we found 7 hub genes. (Fig. 2A). Moreover, the contact of the 7 hub genes was also analyzed. (Fig. 2B) Expression of hub genes in TCGA database To further con rm that the hub genes could be important factors leading to PCa, we used the GEPIA database to compare the hub genes' expression in normal samples and PCa specimens. As shown in Fig. 3, the hub genes were expressed differently in normal samples and PCa specimens, except for MYC and CXCR4. (Fig. 3) Then, we assessed the clinical data of all PCa patients in TCGA database. (Table 2)  . 4A and Table 3) This means that almost all the hub genes were meaningful for the diagnosis of PCa. In addition, decision curve analysis (DCA) is another model used in predicting diagnose value. We made DCA to evaluate all hub genes value in predicting PCa. We found that hub genes together have better diagnose value in predicting PCa. (Fig. 4B) Hub gene expression at different tumor stages Using the clinical data downloaded from TCGA database, we then analyzed hub gene expression at different tumor stages. We analyzed the hub genes' expression in different TNM classi cation and ACJJ classi cation of malignant tumors. We found that some hub genes expression will also change when PCa progression. For example, the expression of CXCR4 will increase signi cantly from T2 tumor stage to T3 tumor stage (P = 0.028). In addition, CSRP1 and MYL9 will decrease when tumor progressed from T2 to T4 (P = 0.032 and P = 0.047). The expression level of CXCR4 and SNAI2 will change when node metastasis happened (P = 0.022 and P = 0.012). In addition, MYC expressed differently in different ACJJ T1 and T2 stages (P = 0.034). (Fig. 5)

Risk prediction model and survival analysis
Logistic regression can be used to predict risk factors which can lead to disease. As a type of generalized linear model, logistic regression has been widely used in disease diagnosis. Here, we made logistic analysis to predict hub genes' expression in resulting PCa. The nomogram was constructed to forecast the probability of hub gene mutation leading to PCa. (Fig. 6A) A calibration curve was made to verify the nomogram and compare the nomogram risk and actual risk. (Fig. 6B) Single factor and multi-factor regression showed the hub genes' mutation risk in causing PCa. (Fig. 6C-D) In single factor logistic regression forest map, we found that except CXCR4, (P = 0.848) all other hub genes may be risk factors in the occurrence of PCa. However, in the multiple factor logistic regression forest map, we found that only SNAI2 (P = 0.04) and MYH11 (P = 0.024) may be risk factors in leading to PCa. The residuals plot and the normal P-P plot of standardized residuals of logistic regression were used to test the effect of the regression model. (Figure S1A-B) We further studied the relationship between patients' survival and hub gene expression. Some hub genes like CSRP1 and MYH11 could affect the length of disease-free survival (DFS). (Fig. 7) However, no hub genes had an effect on overall survival time. (Figure S2) Validation using other databases To verify the above results, we used the UALCAN and The Human Protein Atlas databases to compare the hub genes' expression in normal and tumor specimens. Like the results found above, the other 6 hub genes except CXCR4 were expressed differently in normal prostate tissue and PCa samples. (Fig. 8)

Validation in clinical specimens
Finally, we analyzed the hub genes' expression in our clinical specimens. MYC and CXCR4 was upregulated in tumor tissues than normal tissues However, CXCR4 expressed no difference in GEPIA and UALCAN. In addition, CXCR4 did not has diagnose value, we choose MYC, as the upregulation gene, for experiment validation. In addition, there are 5 hub genes down-regulated. All the down-regulated genes expressed differently in TCGA. In ROC curve analysis, we found that MYL9 and SNAI2 has the lowest AUC among 5 down-regulated hub genes. That means the two hub genes may have minimum authenticity among 5 down-regulated hub genes in predicting PCa. So, we nally choose MYC, MYL9 and SNAI2 as the target genes for further study. We included 6 patients who were diagnosed with PCa with previous diagnoses of BPH. We collected the cancer and para-cancerous tissues. We carried out RT-qPCR and Western blot experiments to compare the mRNA and protein expression of MYC, MYL9, and SNAI2.In addition, 2 paired samples from PCa patients were used in IHC experiment. We found that MYC was truly increased in both mRNA and protein level when PCa happened. (Fig. 9A-B) MYL9 expressed lower in tumor tissues than normal tissues at mRNA and protein level. (Fig. 9C-D). Meanwhile, SNAI2 was also down-regulated in PCa sample tissues than para-cancerous normal tissues. (Fig. 9E-F) At the same time, the IHC results re ect the same trend with the results from RT-qPCR and Western blot. (Fig. 9G-I) Hub genes in uence on C4-2 prostate cancer cell lines To further verify whether the three hub genes we found can affect cell ability, we constructed shMYC plasmid, oeMYL9 and oeSNAI2 plasmid. We verify the mRNA and protein expression level after cell transfected plasmid. We successfully knock-down MYC in C4-2 cell. (Fig. 10A-B). The oeMYL9 and oeSNAI2 C4-2 cell was constructed successfully, too. (Fig. 10C-F) After experiment cell line constructed, we made transwell and CCK8 experiment to compare the invasion and proliferation ability between normal control cell line and experiment cell line. We found that after transfected by shMYC, oeMYL9 and oeSNAI2 plasmid, the C4-2 prostate cancer cell invasion ability will decrease. (Fig. 10G). The same results re ect in CCK8 experiments, too. When C4-2 prostate cancer cell transfected with shMYC, oeMYL9 and oeSNAI2 plasmid, the cell proliferation will decrease. (Fig. 10H)

Discussion
PCa has a higher and higher incidence among all tumors in men, especially those over 70. BPH is another common encountered disease and affects nearly seventy percent of men older than 70 [22]. Therefore, both PCa and BPH are threats to the health of older men. In recent decades, because of changes in the global population structure and aging, elderly males have increased in number [23,24]. Thus, the incidence of PCa and BPH will become more common. Understanding the relationship between PCa and BPH may help better predict the occurrence of PCa and may relieve pressure on the medical system. Although PCa primarily arises in the epithelial cells in the peripheral zone, and BPH arises in the transition zone, PCa and BPH have important factors in common, such as growth depend on hormone and responsiveness to antiandrogen therapy [25]. Moreover, in ammation could be an underlying cause of both BPH and PCa. In a study of 180 men with suspected PCa who were biopsied at baseline and after 5 years of follow-up, the 5-year PCa incidence was 20% for men with biopsy specimens showing in ammation at baseline compared with 6% for men with no evidence of in ammation in baseline biopsies [26]. The Reduction by Dutasteride of Prostate Cancer Events (REDUCE) trial found that patients with BPH all have chronic in ammatory in ltration [27]. These studies all provide evidence of a relationship between BPH and PCa. Sommers rst reported the coexistence of BPH and PCa in 1957 [2].  [29]. Vascular smooth muscle contraction can also affect PCa development [30].
Hub genes, MYC, CXCR4, CSRP1, SNAI2, MYL9, ACTG2, and MYH11, were found by analyzing the PPIs of the DEGs. This result indicates that mutations in these genes may play signi cant roles in the development of PCa from BPH. MYC (MYC proto-oncogene) affects PCa progression due to a high-fat diet and plays a positive role in regulating the androgen receptor and androgen-receptor splice variants in PCa [31,32]. In addition to PCa, MYC is also correlated with the occurrence of multiple tumors. Furthermore, MYC activation can promote bladder cancer proliferation and invasion [33]. The higher expression of MYC was found in colorectal cancer patients, and it may promote tumorigenesis in colorectal cancer through regulation of the Wnt-and Ras-dependent signal transduction pathway [34,35]. MYC regulated by BRD4 can promote gastric cancer progression [36]. CXCR4 (C-X-C motif chemokine receptor 4) may promote PCa metastasis through regulation of phosphatidylinositol 4-kinase IIIα (PI4KIIIα) and SAC1 phosphatase [37]. In addition, CXCL12/CXCR4 can increase the malignancy of breast cancer and cervical cancer [38,39]. SNAI2 (snail family transcriptional repressor 2) can regulate prostate tumor progress, angiogenesis, and metastasis potentially by modulating the GSK-3β/β-catenin signaling pathway [30]. Moreover, research has shown that SNAI2 can promote the invasion and migration of clear cell renal cell carcinoma. Du et al. found that transactivation of SNAI2 and c-MET could promote colorectal cancer metastasis [40]. SNAI2 also participates in regulation of the initiation and metastasis of breast cancer cells [41]. MYL9 (myosin light chain 9) can predict malignant progression and poor biochemical recurrence-free survival of PCa [42]. MYL9 is also associated with the recurrence of colorectal cancer [43]. ACTG2 (actin gamma 2, smooth muscle) and MYH11 (myosin heavy chain 11) have an affection in the development of PCa [44,45]. ACTG2 can also affect hepatocellular carcinoma cell migration and tumor metastasis, and MYH11 may play a pivotal function in the progression of lung cancer and bladder cancer [46,47].
When analyzing hub gene expression, 2 up-regulated hub genes, MYC and CXCR4, were not expressed differently between normal prostate samples and PCa samples in the GEPIA database. These results may due to that the normal sequence from GEPIA includes data from both TCGA and GTEx database. Therefore, an additional analysis was conducted; we analyzed hub genes' expression in tumor and normal samples in UALCAN which include the sequence data only from TCGA database. We found that CXCR4 was not expressed differently in normal and PCa samples in UALCAN. We analyzed the value of hub genes in diagnosing PCa and tumor staging. According to the results, we found that all the hub genes, except CXCR4, could be used as diagnostic markers for PCa progression. In addition, some hub gene can also change when PCa progress. That means these hub genes can be indicators to predict disease progression. In addition, the hub genes were expressed differently in BPH and PCa samples, indicating these genes are potential predictors of PCa development from BPH. Some studies have reported that patients who develop PCa from BPH would have higher cancer-related mortalities.
Therefore, at the same time, we also carried out a survival analysis to determine whether the expression of these hub genes affect patients' survival times. We found that the hub genes do not impact overall survival time, but they can affect progression-free survival time.
In addition, we constructed regression model to value the mutation risk of hub genes in leading to PCa. In the model, we found that all the hub genes may be risk factors in causing PCa except CXCR4. In nomogram, we found that the mutation of CSRP1 is the highest risk factor in leading to PCa among all 7 hub genes. Furthermore, the hub genes function in tumor progression also be analyzed. Some hub genes such as CSRP1, MYL9 and SNAI2 will change when PCa progressed. These results re ect that the expression level change of these hub genes can be potential signal of disease progression.
Finally, we utilized our clinical specimens and C4-2 PCa cells to validate the hub genes' functions in promoting PCa development. We chose MYC, MYL9 and SNAI2. The same results were found in this experiment. The expression of MYC, MYL9 and SNAI2 were obviously changed at the mRNA and protein levels. IHC analysis veri ed this result. Experiments from C4-2 prostate cancer cell re ects that the 3 hub genes can also affect cell invasion and proliferation ability.         Immunohistochemical results of MYC, MYL9 and SNAI2 in 2 paired patients' specimens from both cancer tissues and paracancerous normal tissues. The expression level of mRNA and protein was used GAPDH as inner control. * represent P<0.05, ** represent P<0.01, *** represent P<0.001.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.