Identifying breast cancer subtypes associated modules and biomarkers by integrated bioinformatics analysis

Abstract Breast cancer is the most common form of cancer afflicting women worldwide. Patients with breast cancer of different molecular classifications need varied treatments. Since it is known that the development of breast cancer involves multiple genes and functions, identification of functional gene modules (clusters of the functionally related genes) is indispensable as opposed to isolated genes, in order to investigate their relationship derived from the gene co-expression analysis. In total, 6315 differentially expressed genes (DEGs) were recognized and subjected to the co-expression analysis. Seven modules were screened out. The blue and turquoise modules have been selected from the module trait association analysis since the genes in these two modules are significantly correlated with the breast cancer subtypes. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment show that the blue module genes engaged in cell cycle, DNA replication, p53 signaling pathway, and pathway in cancer. According to the connectivity analysis and survival analysis, 8 out of 96 hub genes were filtered and have shown the highest expression in basal-like breast cancer. Furthermore, the hub genes were validated by the external datasets and quantitative real-time PCR (qRT-PCR). In summary, hub genes of Cyclin E1 (CCNE1), Centromere Protein N (CENPN), Checkpoint kinase 1 (CHEK1), Polo-like kinase 1 (PLK1), DNA replication and sister chromatid cohesion 1 (DSCC1), Family with sequence similarity 64, member A (FAM64A), Ubiquitin Conjugating Enzyme E2 C (UBE2C) and Ubiquitin Conjugating Enzyme E2 T (UBE2T) may serve as the prognostic markers for different subtypes of breast cancer.


Introduction
Breast cancer, which affects approximately one of every nine women globally, is one of the most widespread cancers among female malignant tumors [1]. It is well known that breast cancer has high heterogeneity at the molecular level. According to the PAM50 molecular typing model, there are luminal A subtype, luminal B subtype, human epidermal growth factor receptor 2 (her2) positive subtype, and basal-like breast cancer subtypes [2,3]. Breast cancer is a complex disease with changed genetic and molecular characteristics. Understanding the nature of these changes can provide opportunities for an individualized approach to treatment. At the present time, whole genomic analysis is one of the most efficient ways of studying the diseases. Many researchers focus on prognostic or therapeutic markers for identification using differential expression analysis. In addition to differential gene expression, gene co-expression networks have been extensively explored for high-throughput sequencing data analysis [4]. One of the most frequently applied gene co-expression analysis methods is Weighted gene co-expression network analysis (WGCNA) which can explore the patterns of co-expressed genes from  a systematic level. This method provides unique insight into the structure and behavior of molecular interactions than individual gene. WGCNA has been used in many studies to identify disease-related modules and marker genes including cancer [5][6][7][8]. In the present study, RNA sequencing profiles of breast cancer from luminal A subtype, luminal B subtype, her2 positive subtype, and basal-like subtype were downloaded from The Cancer Genome Atlas (TCGA) database [9]. Further, differentially expressed genes (DEGs) were screened for the pairwise comparisons of the four groups (six comparison pairs). After screening, the union of the six lists of DEGs was subjected to the WGCNA to identify gene co-expression modules. The module trait association analysis and the gene functional enrichment reveals that the blue module plays a pivotal role across various subtypes of breast cancer. Further, hub genes for the blue module were identified and validated by the external database, constituting potential prognostic and therapeutic biomarkers for breast cancer subtypes.

Data process
The breast cancer RNA-Seq data and the corresponding clinical data were downloaded from the TCGA database [9].

Screening of DEGs
The edgeR package [10] was performed for differential expression analysis for breast cancer subtype pairs. Genes with fold changes ≥ 2, and fdr values <0.01 that adjusted for multiple testing from P-values by the Benjamini-Hochberg adjustment method [11] were identified to be differentially expressed.

WGCNA
A scale free, co-expression network was built by the R package WGCNA on the ground of RNA-Seq data. In this network, the interactions between gene pairs were represented by the Pearson's correlation coefficient values. Subsequently, the matrix of network topological overlap measure (TOM) was calculated to measure the connectivity of the pair of genes [12]. Hierarchical clustering of average linkage was preformed to identify gene co-expression modules on the basis of the network topology overlap [13]. The subtype representative modules were screened out on the ground of relationships between the modules and external clinical traits (correlation between the eigengene and sample subtypes) and the correlations between gene significance (GS) and module membership (MM) values. Significantly high correlation with P-value <0.01 for GS and MM values together with the correlation >0.75 between the module eigengene (ME) and the external trait were used to identify the subtype representative modules. Hub genes were recognized by GS > 0.2 and MM > 0.8 in a specific module.

Functional enrichment analysis for DEGs
The R package clusterProfiler process [14] was performed for the gene functional enrichment including gene ontology (GO) and KEGG pathway using fdr < 0.05 as the cutoff criteria.

Survival analysis
The website of GEPIA (http://gepia.cancer-pku.cn/) [15] was applied to analyze the RNA transcriptional data of tumors against normal samples in TCGA database (https://portal.gdc.cancer.gov/). The survival analysis of the hub genes was performed in GEPIA in which the samples were divided into two groups according to the gene expression level. Then the log-rank test algorithm was performed to determine the significance of expression for specific gene on survival. The log-rank test (P-value <0.05) was used to determine whether the expression of the hub gene was correlated with the patient's survival rate.

Validation of hub genes by independent dataset and quantitative real-time PCR
The differential expression of hub genes were validated in an independent dataset with accession number of GSE45827. There are 130 primary invasive breast cancer (41 Basal like, 30 Her2, 29 Luminal A and 30 Luminal B subtype) as well as 11 normal tissue samples. The survival analysis for hub genes were validated by Kaplan-Meier Plotter (https://kmplot.com/analysis/) which is an online tool to discover and validate the survival biomarkers [16].
In this study, we validated the vital genes by gene chip datasets. Furthermore, we used quantitative real-time PCR (qRT-PCR) to explore the expression level of hub genes. Human breast cancer MCF-7 cells were grown in Dulbecco's modified Eagle's medium (DMEM) added with 10% fetal bovine serum. Human normal MCF-10A cells were maintained in the mammary epithelial growth medium DMEM-F12 containing insulin, hydrocortisone, epidermal growth factor, horse serum, and cholera toxin. All cells were incubated at 37 • C in a humidified chamber with 5% CO 2 . Total RNA was extracted from cells using RNAiso Plus (Takara, Japan). Complementary DNA (cDNA) was synthesized using the One-Step PrimeScript miRNA cDNA Synthesis Kit (Takara, Japan) according to the manufacturer's instructions. qRT-PCR was performed using SYBR Premix ExTaq II (Takara, Japan). The relative expression of genes were calculated using the 2 − C t method, normalized to GAPDH, respectively. The primers used in the present study are shown in Supplementary Table S1.

Functions exploration for each hub gene in breast cancer
The functions for the hub genes were evaluated by the database of GeneMANIA (http://www.genemania.org) which features several bioinformatics methods: physical interaction, gene coexpression, gene colocation, gene enrichment, and website prediction [17,18]. In order to explore the functions for each hub gene in breast cancer, samples of dataset GSE45827 were divided into low (n=65) and high (n=65), according to each hub gene expression. The R package limma was used to identify DEGs between the two groups. DEGs with absolute log2FC > 1 and FDR < 0.01 were considered significant. GO enrichment, KEGG pathway analysis and single gene gene-set enrichment analysis were performed using R package clusterProfiler [14]. FDR < 0.05 was considered as significantly significant.

DEGs screening
Using the edgeR package, with threshold values of adjusted P-value (FDR) <0.01 and fold change ≥ 2, we identified the DEGs between each breast cancer subtype groups. A total of 6315 DEGs were identified by the six pairwise comparisons of the samples from the luminal A subtype, the luminal B subtype, the her2 positive subtype, and the basal-like subtype of breast cancer (Table 1; Supplementary Figure S1).

Construction of the co-expression network and identification of modules
WGCNA was performed to recognize modules connected with a variety of breast cancer subtypes. The 6315 DEGs were subject to the co-expression analysis. We analyzed the soft threshold power of the network topology with a β value from 1 to 20, and calculated the scale independence and mean connectivity of the co-expression network. Based on the research data, a suitable threshold of 5 was selected ( Figure 1A). The node degree distribution was calculated according to a power-law distribution, indicating that the network was scale-free ( Figure 1B). Next, the β = 5 was used to produce a hierarchical clustering gene tree (Figure 2A). The genes' expression level for a module is represented by ME which is the principal component for a specific module. There are two methods testing the interaction between each module and the corresponding breast cancer subtypes. And two modules were screened out with  Figure 3A,B demonstrate the interactions between the modules turquoise and blues as well as the genetic significance.

Functional enrichment analysis of genes within the identified modules
Gene functional enrichment was applied to the turquoise and blue modules. We observed that genes in the turquoise group are involved in hormone transport, hormone secretion, the estrogen signaling pathway, and cation transmembrane transporter activity related functions (Figure 4). The functional enrichment results demonstrate that the genes in the blue module are significantly associated with cancer such as cell cycle, oocyte meiosis, DNA replication, p53 signaling pathway, and so on ( Figure 5).

Survival and expression analysis of the hub genes
During the research process, a total of 96 hub genes were identified under GS > 0.2 and MM > 0.8 in the blue module. GEPIA was performed to analyze the overall survival and P<0.05 was considered to be statistically significant. Eight genes showed significant associations with the prognosis of patients ( Figure 6) and showed increased expression with the aggressive degrees (

Validation of hub genes by independent dataset and qRT-PCR
An independent dataset GSE45827 from GEO database was used to validate the expression status of the eight hub genes. All the eight hub genes showed significant up-regulation in breast cancer compared with normal, except UBE2T ( Figure 8A). They also exhibited increased expression with the aggressive degrees in the independent dataset ( Figure  8B). The survival analysis of the eight vital genes were validated by Kaplan-Meier Plotter by the microarray datasets which are in accord with the results from the GEPIA using the TCGA data ( Figure 9). Besides external datasets  validation, we also used qRT-PCR to validate the expression level of hub genes in breast cancer. The results showed the hub genes were significantly up-regulated in breast cancer cell line, except FAM64A ( Figure 10).

Potential functions for each hub gene in breast cancer
The database of GeneMANIA was used to investigate the functions of the hub genes ( Figure 11A). The results showed that these genes were associated with mitosis (FDR = 7.47e-10), nuclear division (FDR = 1.28e-08), organelle fission (FDR = 2.16e-08), G 2 /M transition of mitotic cell cycle (FDR = 4.99e-06), cell cycle G 2 /M phase transition (FDR = 4.99e-06), negative regulation of cell cycle process (FDR = 3.42e-05), and cell cycle checkpoint (FDR = 3.42e-05). For the functions of each hub gene in breast cancer, GO, KEGG pathway enrichment analysis, and GSEA were performed in GSE45827 and the results revealed that the major biological processes and KEGG pathways where hub gene participated were cell cycle, DNA replication, cell cycle checkpoint, and other cell cycle functions closely associated with the proliferation of tumor cells (Table 2 and Figure 11B).

Discussion
Breast cancer is a variety of cancers with high biological heterogeneity. The expression level of estrogen receptors (ERs), progesterone receptors (PRs), and human epidermal growth factor receptors 2 (HER2) can affect the survival rate of patients, and may help guide patient treatment and predict prognosis of breast cancer. Although breast cancer treatment measures have ameliorated in the last few decades, the capability to treat the different subtypes, especially Her2 positive and basal-like breast cancer, is still limited by the lack of precise molecular targets. Therefore, it is urgent to continue to explore the molecular mechanisms involved in various subtypes of breast cancer. In the present study, we applied gene transcriptional datasets from the TCGA database to identify potential biomarkers for the cancer subtypes and validated the results using the external datasets and qRT-PCR experiment. WGCNA was performed to explore gene co-expression modules associated with breast cancer subtypes. With a total of 6315 DEGs being evaluated to construct a co-expression network, 7 co-expressed modules were identified. The blue and turquoise modules in the present study were found to have the highest association with tumor subtypes. Further gene functional enrichment of each module revealed that the blue module is involved in functions of cell cycle, DNA replication, p53 signaling pathway and other functions commonly associated with cancer. Additionally, 96 hub genes with high connectivity were screened out from the blue module. Among them, eight genes including CCNE1, CENPN, CHEK1, PLK1, DSCC1, FAM64A, UBE2C, and UBE2T, were significantly negatively associated with the overall patient survival with log-rank test P-value <0.05 and exhibited increased expression with the aggressive degrees. The independent data and qRT-PCR showed that they were significantly up-expressed and associated with poor outcome in breast cancer except UBE2T and FAM64A. Furthermore, single gene functional analysis for each hub gene in breast cancer was conducted and revealed that they were all involved in cycle cycle process whose aberration was reported as the hallmarks of cancer [19] and were validated to be anti-cancer therapeutic targets [20]. CCNE1 is an oncogene and is one of the members of the cyclin family. Genes in this family are marked by the dramatic periodicity in protein abundance through the cell cycle. CCNE1 combines and activates the expression of CDK2 to regulate the cell cycle G 1 /S transition. Overexpression of this gene has been observed in many tumors, which results in chromosome instability, and thus may contribute to tumorigenesis [21]. Nakayama et al. [22] reported that the amplification of gene CCNE1 is related to poor survival rates in ovarian cancer patients. In the present study, CCNE1 is significantly overexpressed in luminal B, Her2 positive, and basal-like subtypes compared with normal samples (Figure 8), which suggests that the expression of CCNE1 is associated with the relatively aggressive degrees of cancer. Han et al. [23] showed that miR-497 and miR-34a can down-regulate CCNE1 by targeting the 3 -UTR of CCNE1 in lung cancer. CENPN forms part of the nucleosome-associated complex and is important for kinetochore assembly. This protein is bound to kinetochores during S phase and G 2 , and recruits other proteins to the centromere. CENPN is an important predictor for both breast carcinoma recurrence and mortality among smokers. Elevated expression of this gene is associated with significantly increased mortality and risk of recurrence in smokers in contrast with non-smokers [24]. CHEK1 is a cancer-related gene. Protein encoded by CHEK1 is a member of the Ser/Thr protein kinase family. It is necessary for checkpoint-mediated cell cycle arrest in response to DNA damage or the presence of unreplicated DNA. This protein can integrate signals from two cell cycle proteins (ATM and ATR) related to DNA damage responses, that also associate with chromatin in meiotic prophase I [25]. Tarek et al. [26] reported that at the mRNA level, high CHEK1 was associated with poor survival in the whole cohort of breast cancer and in the ER+ subgroup according to data analysis. PLK1 is a main mitotic kinase playing a pivotal role in the eukaryotic cells division [27]. Many researchers have demonstrated that PLK1 is highly expressed in various tissues for tumor, such as breast cancer, ovarian cancer, and prostate cancer [28,29]. A high expression of PLK1 is associated with aggressive characteristics, such as invasion of vascular, proliferative activity markers and detectable ERs shortage [29,30]. Similarly, analysis revealed that PLK1 was significantly up-regulated in the subtypes except luminal A, which is less aggressive. King et al. [31] showed a significant association between increased PLK1 levels in breast cancer cells and clinical outcome. PLK1 expression was associated almost exclusively with tumor grade. The survival data clearly validated that patients expressing increased PLK1 levels appear significantly reduced survival, which is in accordance with the finding that high levels of PLK1 are associated with poor patient outcome. DSCC1 is a component of an alternative replication factor C complex that loads proliferating cell nuclear antigens on to DNA during S-phase of the cell cycle [32]. In the present study, DSCC1 is significantly up-expressed in each breast cancer subtype except luminal A, compared with normal samples. This fact indicates that overexpression probably promotes the cell cycle transition and cell division, and consequently promotes cell growth. UBE2C is a cellular proto-oncogene and is essential for the mitotic cyclins destruction, for the anaphase-promoting complex regulation, and for progression of cell cycle [33]. Overexpressed UBE2C can cause mis-segregation of chromosome and the cell cycle profile alteration facilitating the proliferation of cell [34]. UBE2C overexpressed in various human cancers, including breast cancer and is correlated with tumor malignancy [35][36][37]. Although, FAM64A and UBE2T were not validated both in independent dataset and qRT-PCR, they still play important roles in cancer. FAM64A is reported to be associated with several cancers [38][39][40], aberrant expression was found in tumor samples, and was significantly correlated with patients' survival in multiple cancers, including breast cancer [41]. FAM64A can be combined with Twist1 to form a complex, which could bind with an E-cadherin promoter to inhibit the expression of E-cadherin, and thus promote the process of cell epithelial-mesenchymal transition. It was specifically up-regulated in triple-negative breast cancer (TNBC) which showed more aggressive clinical behavior [42]. To our best knowledge that MCF-7 cell line had a low degree of malignancy and the expression level was not significantly changed by qRT-PCR, we speculated that FAM64A was associated with the aggressive degree in breast cancer. UBE2T engage in the DNA repair pathway through catalyzing the Fanconi Anemia Complex monoubiquitination [43]. UBE2T has been reported to be associated with cancer progression and poor outcomes in several solid tumors [44,45]. It has been demonstrated that UBE2T has a crucial role in the regulation of BRCA1 in breast cancer [46]. Overexpressed UBE2T can facilitate cell cycle progression and avoid DNA repair by inducing the degradation of key regulators of functions such as p21 and BRCA1 [47]. Although eight hub genes were identified to be potential prognosis biomarkers for breast cancers, the further experiments need to be conducted to investigate the mechanisms of them in the cell cycle process in breast cancer.

Conclusion
Using the WGCNA method, an important gene co-expression module blue was identified which is closely associated with different subtypes of breast cancer. In addition, eight hub genes of CCNE1, CENPN, CHEK1, PLK1, DSCC1, FAM64A, UBE2C, and UBE2T were identified and validated to be the potential prognostic and therapeutic biomarkers for breast cancer subtypes.

Data Availability
The RNA-seq datasets were downloaded from TCGA; http://cancergenome.nih.gov. The GSE45827 was downloaded from GEO website; https://www.ncbi.nlm.nih.gov/gds/?term=GSE45827. Other data are supplied in the paper and the supplementary data.

Competing Interests
The authors declare that there are no competing interests associated with the manuscript.

Funding
This work was supported by the Scientific Research Management Project from Health Committee of Gansu Province [grant number GSWSKY2019-73].

Author Contribution
Yanwei Wang and Baohong Liu contributed to the design and conception of the present study, conducted computational experiments, performed and interpreted data, and drafted the manuscript. Yu Li revised the manuscript. Ailin Song conceived this project and participated in its design, helped in interpreting the data, and drafted and revised the manuscript. All authors read and approved the final manuscript.