Identification of breast cancer risk modules via an integrated strategy

Breast cancer is one of the most common malignant cancers among females worldwide. This complex disease is not caused by a single gene, but resulted from multi-gene interactions, which could be represented by biological networks. Network modules are composed of genes with significant similarities in terms of expression, function and disease association. Therefore, the identification of disease risk modules could contribute to understanding the molecular mechanisms underlying breast cancer. In this paper, an integrated disease risk module identification strategy was proposed according to a multi-objective programming model for two similarity criteria as well as significance of permutation tests in Markov random field module score, function consistency score and Pearson correlation coefficient difference score. Three breast cancer risk modules were identified from a breast cancer-related interaction network. Genes in these risk modules were confirmed to play critical roles in breast cancer by literature review. These risk modules were enriched in breast cancer-related pathways or functions and could distinguish between breast tumor and normal samples with high accuracy for not only the microarray dataset used for breast cancer risk module identification, but also another two independent datasets. Our integrated strategy could be extended to other complex diseases to identify their risk modules and reveal their pathogenesis.


AGING
Disease risk modules could be detected directly from undirected networks, i.e. gene/protein interaction networks and co-expression networks, by available tools. For example, Cytoscape plugin MCODE has been applied to disease risk module detection from interaction networks for lung adenocarcinoma [9], non-small-cell lung cancer [10], colorectal cancer [11] and inflammatory bowel diseases [12]. In co-expression networks, disease risk modules were detected using GraphWeb for pancreatic ductal adenocarcinoma [13], or by weighted gene co-expression network analysis (WGCNA) for atopic dermatitis [14] and Alzheimer's disease [15].
Disease risk modules could also be detected by merging or extending cliques. In graph theory, a clique in a network is a fully connected subgraph where every two nodes are connected by an edge [16]. Cliques with disease-related genes are associated with complex diseases and are of great value to uncovering disease pathogenesis since perturbation of any gene in a clique will directly destroy the function of its neighbors. Disease risk modules were detected by merging cliques highly overlapped for ankylosing spondylitis [17], congenital heart defects in Down syndrome [18] and narcolepsy [19], or by extending cliques for multiple diseases [20].
Mathematical programming has been used to solve network reconstruction problems to achieving globally optimal number of edges and number of quantitative differences [21]. Such kind of programming could also be applied to module identification for optimal disease association. Thus, in this paper, according to a multiobjective programming model for two similarity criteria maximization as well as significance in module score based on Markov random field (MRF), consistency score of functions and difference score for Pearson correlation coefficient (PCC), an integrated disease risk module identification strategy was proposed to identify breast cancer risk modules from a breast cancer-related interaction network. The association of these risk modules with breast cancer was validated by confirmation rate of literature review, functional enrichment analysis, and classification accuracy ( Figure 1).

Primary modules
After evaluating differential information of genes with our two measurements (see Materials and Methods), 218 differentially expressed genes (DEGs) and 1209 differential expression variance genes (DEVGs) were obtained. Hence, 1382 differential genes were screened out from the microarray dataset. 161 cliques containing DEGs or DEVGs with >4 nodes were mined from the breast cancer-related interaction network. After merging cliques/subgraphs with extent of overlapping for seed genes in different cliques S > 0.8 (see Materials and Methods), 6 primary modules were discovered (Table 1).

Candidate modules
Corresponding to primary modules, 6 candidate modules were discovered based on two criteria maximization using a multi-objective programming model (see Materials and Methods, Figure 2).
Since only seed genes were contained in Module 4 and 6, Module 1-3 and 5 were candidate modules for breast cancer risk module identification.

Breast cancer risk modules
Module score W based on MRF, consistency score F of functions and difference score for PCC were calculated for candidate modules and 1,000 random modules with the same number of genes. According to the significance of permutation tests for candidate modules (see Materials and Methods), 3 breast cancer risk modules were identified ( Figure 3) involving 16 seed genes and 44 nonseed genes ( Figure 4). 8 non-seed genes were in all three breast cancer risk modules.

Validation of breast cancer risk modules
The association of these risk modules with breast cancer was evaluated during the validation process from three aspects (see Materials and Methods).

Literature review
A literature review was carried out using the online database PubMed (https://www.ncbi.nlm.nih.gov/ pubmed) for non-seed genes in 3 breast cancer risk modules. It was showed that 75% (33/44) non-seed genes were associated with breast cancer (~70% for each module, Table 2), demonstrating disease association of risk modules identified by our integrated strategy.  Primary module 1  91  15  76  Primary module 2  61  14  47  Primary module 3  59  13  46  Primary module 4  6  5  1  Primary module 5  7  5  2  Primary module 6  4  3  1 In literature, 5 of 8 common non-seed genes in all three breast cancer risk modules were verified. Ling et al. found that the mRNA expression level of PIK3C3 was steadily increased during breast cancer progression and elevated at stage IV [22]. MAPK14, as one of the hub target genes in a PPI network constructed by Wang et al., had the potential to be used as candidate targets for breast cancer treatment [23].

Number of non-seed genes
For other non-seed genes in breast cancer risk modules, a transcriptomic signature of BMP4 signaling exhibited by primary ER + breast tumor patients was predictive of improved disease outcome or an improved biologic response to the treatment. This highlighted BMP4 and its downstream pathway activation as a therapeutic opportunity in ER + breast cancer [24]. Considerable evidence has implicated WT1 in the development, pathogenesis and therapy of breast cancer [25]. For example, WT1 expression levels in breast cancers were significantly higher than in control tissue [26]. WT1 has also been linked with in breast cancer malignant transformation, and its overexpression associated with reduced susceptibility to drug treatment [27]. ETS1 has versatile roles during the cellular processes of various types of cancers. It was often highly expressed in breast cancer samples and mediated migration and invasion of human breast cancer cells [28].
These non-seed genes in our risk modules played vital roles in the development, pathogenesis or signaling of breast cancer. Therefore, these genes could act as potential breast cancer genes.

Functional enrichment analysis
To assess the functional information of genes in breast cancer risk modules, genes in each module were tested for enrichment against KEGG pathways and GO functions (GO-terms of biological process and molecular function) using the Enrichr tool, respectively.
Breast cancer risk modules were enriched in numerous pathways and functions, especially breast cancer-related    ones ( Figure 5). It was worth noting that genes from all three risk modules, including seed genes and non-seed genes, were enriched in the "breast cancer" pathway ( Figure 6), indicating their roles in the process of breast cancer. Different locations of genes from different modules demonstrated various functions of modules.
Other pathways and functions enriched by single breast cancer risk module or all risk modules were also closely associated with breast cancer. "Positive regulation of cell migration" was also a breast cancer-related biological process, which was enriched by only non-seed genes in Module 1. Bisphosphonates, antiresorptive drugs, might be developed as a therapeutic option for breast cancer since it significantly decreased cancer cell migration in a dose-dependent manner [29]. Oxidative stress, the process of oxidative damage caused by Module-2enriched function "Regulation of reactive oxygen species metabolic process" [30], and "Regulation of inflammatory response" enriched by genes in Module 3, were both associated with breast cancer development [31].
For functions enriched by all risk modules, "PI3K-Akt signaling pathway" is an important signal transduction pathway in cells, which was closely associated with the lymph node metastasis of breast cancer, and could affect breast cancer progression and patient prognosis [32,33]. "Cellular senescence" is a complex process that was found to be a tumor-suppressive mechanism leading to suppressed breast cancer cell proliferation by inhibiting cell proliferation [34,35]. "Positive regulation of transcription" played significant roles in breast cancer development since it was the function enriched in by genes identified from many breast cancer-related researches [36]. It was also revealed that via participating in regulation of transcription biological processes, biological elements were involved in the progression of breast cancer [37]. An aberrant apoptotic process can lead to several pathological conditions, such as tumorigenesis and cancer metastasis [38]. Thus, mediating through active "regulation of the apoptotic process", drugs could effect on breast cancer cells [39]. DNAdependent protein kinase has an important role with DNA double-strand break repair. DNA-dependent "protein kinase activity" of peripheral blood lymphocytes is associated with risk of breast cancer since the activity in breast cancer patients was significantly lower than that in normal [40].

Figure 5. Pathways and functions enriched by breast cancer risk modules. AGING
Most genes in our breast cancer risk modules, especially non-seed genes, were enriched in breast cancer-related pathways or functions, some of which were also related to cancer hallmark-associated GO terms, such as "HALLMARK_APOPTOSIS" and "HALLMARK_ PI3K_AKT_MTOR_SIGNALING", which indicated the disease association of our risk modules.

Classification performance
With genes in breast cancer risk modules as features, breast tumor and normal samples were classified by the SVM classifier. Using LOOCV, all three risk modules achieved an accuracy of approximate 85% for breast tumor and normal samples of GSE15852, although only ~33% genes in each module were differential genes.
In order to compare the classification performance of breast cancer risk modules and that of only seed genes in these modules, the classification accuracy was also computed based on SVM classifier with seed genes as features. The classification accuracy was ~83% for seed genes in each risk module, which was inferior to that of breast cancer risk modules (Figure 7).
These results exhibited that our risk modules with seed genes and non-seed genes could distinguish between breast tumor and normal samples with higher accuracy than with only seed genes.
Additionally, to assess classification significance of risk modules, 100 random gene sets of the same size were selected from the breast cancer-related interaction network. AUC values were calculated utilizing SVM classifier with genes of random gene sets as features. The classification accuracy of breast cancer risk modules outperformed the accuracy of random gene sets of the same size (AUC = ~0.82). This showed that risk modules could classify breast tumor and normal samples effectively with significantly better performance than only seed genes or random selected ones. www.aging-us.com

AGING
Then, the classification with genes in breast cancer risk modules as features was conducted on another two datasets, GSE70947 from another platform and samples collected from The Cancer Genome Atlas (TCGA), the accuracy of which was compare with that of seed genes in risk modules (Table 3). Since the size of breast tumor in TCGA was much larger than that of normal samples, tumor samples with the same number as normal ones (113) were randomly selected. The genes in breast cancer risk modules could also classify breast tumor and normal samples of the same size accurately (>0.86).
Similar results that our risk modules with both seed genes and non-seed genes could distinguish between breast tumor and normal samples with higher accuracy than with only seed genes were also obtained for these two datasets.

DISCUSSION
In this paper, an integrated strategy was proposed to identify breast cancer risk modules according to a multi-objective programming model and significance in three scores. A total of 3 breast cancer risk modules were identified. ~70% non-seed genes in these risk modules were confirmed to play vital roles in the development, pathogenesis or signaling of breast cancer and could act as potential breast cancer genes. Most genes in our risk modules, including unconfirmed non-seed genes, were enriched in breast cancer-related pathways or functions. These risk modules could distinguish between breast tumor and normal samples with higher accuracy than seed genes in risk modules. These results indicated the disease association of breast cancer risk modules identified by our integrated strategy.
In order to illustrate the robustness of our risk modules, risk modules were re-identified using 90% samples randomly selected from GSE15852. The process was repeated 100 times. Genes of risk modules from random samples were compared with those of breast cancer risk modules from all samples ( Figure 8). More than 90% genes in risk modules from all samples were reidentified by random samples.  For the purpose of primary modules detection, cliques of the breast cancer-related interaction network were mined using Cytoscape MClique in our integrated strategy. MCODE and GraphWeb were also taken into consideration for primary module detection. It was showed that the number of genes in cliques/modules mined by MClique, MCODE and GraphWeb had great difference, while classification accuracy varied among the three module sets (Figure 9). Since cliques mined by MClique were smaller with more connections, larger AUC values and more similar genes, they were used for primary module detection in our integrated strategy.
A multi-objective programming model based on maximization of two criteria, MI and PCC, was employed for candidate module discovery. Candidate modules based on individual criterion, MI or PCC, were also discovered. Breast tumor and normal samples were classified with these modules as features ( Figure 10). Candidate modules discovered using both criteria could classify samples with higher accuracy and fewer genes than those using individual criterion in most cases, and were used for breast cancer risk module identification. To further demonstrate the effectiveness of our integrated strategy, genes removed from primary modules were compared to non-seed genes remained in breast cancer risk modules by classification accuracy. The accuracy of non-seed genes remained in breast cancer risk modules was higher than that of genes removed from primary modules ( Table 4). These results indicated that non-seed genes in breast cancer risk modules were more related to breast cancer than the removed ones, supporting the effectiveness of our integrated strategy, especially the multi-objective programming model.

Results of breast cancer risk modules and their
validation demonstrated the effectiveness of our choice for different steps of our integrated strategy. Nevertheless, there are some potential limitations to our study. First, known disease-associated genes were required for disease-related interaction network construction. Second, expression similarity and difference as well as functions of genes were exploited for disease risk module identification in our integrated strategy, which might be affected by datasets used in the process. The other breast cancer microarray dataset GSE70947 was used to screen differential genes and identify breast cancer risk modules with our integrated strategy. Another two risk modules were identified. With more differential genes in these modules (~55%), they had a high classification accuracy (~0.85), which was still inferior to that of our risk modules identified from GSE15852 (>0.89) for GSE70947. This showed the effectiveness of our integrated strategy and breast cancer risk modules we identified. With other types of information available, they could also be integrated into our integrated strategy to improve its performance or reveal the molecular mechanism of metastasis [41].
Subtype (Basal, Her2, Luminal A (LumA) and Luminal B (LumB)) classification for breast cancer is of great significance for its clinical diagnosis and treatment. To assess the subtype classification ability of breast cancer risk modules, expression values of genes in these risk modules were used to classify different subtypes. All risk modules could distinguish between subtypes with high accuracy (Table 5). AGING    In summary, breast cancer risk modules identified by our integrated strategy were confirmed to play critical roles in breast cancer by literature review, functional enrichment analysis, and classification accuracy. Our integrated breast cancer risk module identification strategy could be extended to other complex diseases for researchers to gain more thorough understanding of their pathogenesis.

Data
The microarray dataset GSE15852 (GPL96) was downloaded from Gene Expression Omnibus (GEO) database, which was composed of 43 human breast tumor tissues and their 43 paired normal tissues.  [42]. TGDB contains information about genes which are targets for cancer-causing mutations with their historical relevance [43]. ONGene is a literature-based database for human oncogenes [44]. The latest version of NCG contains information of cancer genes from manually curated publications [45]. 32 breast cancer-associated genes from at least two databases were referred to as seed genes in our analysis to increase the confidence of our seed genes ( Table 6).
A complete gene/protein interaction network is of fundamental importance for the understanding of diseases [46]. Human protein interaction data were integrated from the HPRD [47], STRING [48], and KEGG [49] databases. All products of seed genes were used to determine a breast cancer-related interaction network by extracting direct interactions between seed and other proteins. The resulting network was centered on seed genes with 13136 interaction relationships between 5202 genes.

Breast cancer risk module identification strategy
An integrated disease risk module identification strategy was proposed and used to identify breast cancer risk modules. First, differential genes were screened from a breast cancer microarray dataset, and primary modules were detected by merging cliques containing differential genes from a breast cancer-related interaction network. Then, candidate modules were discovered using a multiobjective programming model to maximize two similarity criteria. Finally, breast cancer risk modules were identified according to significance in module score based on MRF, consistency score of functions and difference score for PCC.

Detection of primary modules
Two measurements were used to evaluate differential information of genes: (1) after preprocessing, the significance analysis of microarrays (SAM) program was used for screening DEGs. The false discovery rate AGING Table 6. Breast cancer-associated genes and their source databases.
The p value was evaluated by the number of times V of 1000 random genes exceed that of the interested one. Genes with FDR-adjusted p value<0.05 were screened as DEVGs.
DEGs and DEVGs were differential genes for further analysis.
Cliques containing 4-8 genes were mined from the breast cancer-related interaction network using Cytoscape MClique. Only those containing differential genes were taken into consideration. These cliques were also centered on seed genes that overlapped with seed genes in other cliques. The extent of overlapping for seed genes in different cliques was evaluated by Simpson index: ∩ is common genes in cliques A and B, A and B indicate the number of genes in module A and B, respectively.
Cliques with 4 genes could be merged if the cutoff of Simpson index S was set to ~0.75. In this case, subgraphs with more than 2000 genes were obtained. Therefore, to identify more rational modules, cliques with 5-8 genes were used for clique merging. That is, each pair of cliques/subgraphs with S>0.8 were merged to form larger subgraphs until S for no subgraph pair was larger than 0.8. Subgraphs obtained at this step were named primary modules.

Discovery of candidate modules
Using a multi-objective programming model to maximize two criteria, candidate modules were discovered, under the hypothesis that genes more similar to seed genes may tend to be disease-related. The similarity of non-seed genes with seed ones in primary modules was measured by the sum of mutual information (MI) and the average of PCCs: Through multiple iterations that calculations were performed for each non-seed gene y against each seed gene x in primary modules, candidate modules that satisfied the requirements could be obtained.

Identification of breast cancer risk modules
Breast cancer risk modules were identified based on significance of permutation tests for three scores of candidate modules.

Module score W based on MRF
For a candidate module with m genes, a multivariate random variable was defined as the expression difference of these genes between tumor and normal samples. It was assumed that the expression difference formed a MRF, and thus, the expression difference of a gene only depended on the expression difference of its direct interacting neighbor genes.
Gibbs distribution was employed to specify the joint probability of f: where K is a constant that guarantees the probability sum to be 1, T is a temperature parameter controlling the distribution sharpness, and ( ) ( ) which represents the differential level of seed genes with the similarity between non-seed genes of a candidate module.
Therefore, the module score based on MRF was defined as ( ) where k is the number of interactions in the module M, C1 and C2 are the set of seed genes and non-seed genes in the module, fu and fv are expression differences assessed by the t-test between tumor and normal samples, and du and dv are the degree of non-seed genes u and v, respectively.

Consistency score F of functions AGING
The consistency scores of candidate modules were calculated based on functional consistency between seed genes and non-seed ones: Candidate modules with large values for all the three scores indicated their disease association.
To obtain the significance of permutation tests for each candidate module, 1,000 random modules with the same number of genes were constructed. All of the three scores were calculated individually for these random modules. Scores significantly greater than the random ones (permutation tests, p <0.05) were considered significant. Breast cancer risk modules were identified as candidate modules significant in all 3 scores.

Validation of breast cancer risk modules
Validation for association of risk modules identified using our integrated strategy with breast cancer was evaluated from three aspects: 1) confirmation rate by literature review, 2) functional enrichment analysis (adjusted p < 0.05 was considered statistically significant), including Gene Ontology (GO) function and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, performed through a latest enrichment tool Enrichr (http://amp.pharm. mssm.edu/Enrichr/) [50], and 3) classification accuracy determined by area under the receiver operating characteristic curve (AUC) evaluated with a leave-oneout cross-validation (LOOCV) strategy after distinguishing breast tumor and normal samples by means of a support vector machine (SVM) classifier with genes in risk modules as features. The classification was conducted on not only the microarray dataset we used to identify breast cancer risk modules, but also another two independent datasets: one was another microarray dataset GSE70947 (GPL13607) downloaded from GEO composed of 148 human breast tumor tissues and 148 paired adjacent normal breast tissue, and the other was the expression data of 1102 breast tumor and 113 normal samples collected from the TCGA database (https://portal.gdc.cancer.gov/) [51].

AUTHORS CONTRIBUTIONS
W.L., G.D., J.Z., and L.C. contributed to the study design. G.D. and J.Z. contributed to data collection. W.L., G.D., and J.Z. performed statistical analysis, interpretation, and drafted the manuscript. E.H., Y.H., J.L., X.S., K.W., and L.C. revised the manuscript. All authors contributed to critical revision of the final manuscript and approved the final version of the manuscript. W.L., X.S., K.W., and L.C. provided financial support and study supervision.