Co-occurrence and Mutual Exclusivity Analysis of DNA Methylation Reveals Distinct Subtypes in Multiple Cancers

Co-occurrence and mutual exclusivity (COME) of DNA methylation refer to two or more genes that tend to be positively or negatively correlated in DNA methylation among different samples. Although COME of gene mutations in pan-cancer have been well explored, little is known about the COME of DNA methylation in pan-cancer. Here, we systematically explored the COME of DNA methylation profile in diverse human cancer. A total of 5,128,332 COME events were identified in 14 main cancers types in The Cancer Genome Atlas (TCGA). We also identified functional epigenetic modules of the zinc finger gene family in six cancer types by integrating the gene expression and DNA methylation data and the frequently occurred COME network. Interestingly, most of the genes in those functional epigenetic modules are epigenetically repressed. Strikingly, those frequently occurred COME events could be used to classify the patients into several subtypes with significant different clinical outcomes in six cancers as well as pan-cancer (p-value ≤ = 0.05). Moreover, we observed significant associations between different COME subtypes and clinical features (e.g., age, gender, histological type, neoplasm histologic grade, and pathologic stage) in distinct cancers. Taken together, we identified millions of COME events of DNA methylation in pan-cancer and detected functional epigenetic COME events that could separate tumor patients into different subtypes, which may benefit the diagnosis and prognosis of pan-cancer.


INTRODUCTION
DNA methylation (DNAm) is a major epigenetic modification, which is considered as an approach for disease diagnosis. An increasing number of studies have indicated that aberrant DNAm plays an important role in diverse diseases, especially cancers (Delpu et al., 2013;Stefansson et al., 2015;Tahara and Arisawa, 2015). For example, the hypermethylation of CpG island in promoter region of tumor suppressor genes have been observed in pediatric acute myeloid leukemia (Tao et al., 2014), bladder (Garcia-Baquero et al., 2014) and adult brain tumors (Hill et al., 2011) as well as hepatocellular carcinoma (Revill et al., 2013), which may lead to proliferative advantages and aggressive phenotypes during tumorigenesis (Suva et al., 2013).
Previous studies showed that the co-occurrence of gene mutations is frequently observed in two or more genes that tend to have mutations simultaneously in cancer patients (Kang et al., 2008;Zhang et al., 2017). Genes that have mutually exclusive mutations are generally involved in the same biological process (Szczurek and Beerenwinkel, 2014). Genomic alterations targeting similar biological processes could be mutually redundant, with one alteration being able to disrupt the affected process, thus identifying mutual exclusive events may facilitate discovering unknown functional interactions (Zhang et al., 2017). Detecting such patterns is crucial for identifying related novel cancerous pathways and potential treatment targets (Szczurek and Beerenwinkel, 2014). However, to date, co-occurrence (CO) and mutually exclusivity (ME) of DNA methylation in human cancers are less explored. Co-methylation has been reported as a new indicator for discovering functional associations between gene pairs in breast cancer (Akulenko and Helms, 2013). Recently, a number of algorithms have been developed for estimating the significance of ME and CO patterns between two genes (Canisius et al., 2016;Hua et al., 2016;Kim et al., 2017). Some of those tools can be used on DNA methylation data (Canisius et al., 2016), making it possible to comprehensively investigate the CO and ME events of DNAm in diverse The Cancer Genome Atlas (TCGA) cancers.
In this study, we first detected the CO and ME events of DNAm in 14 distinct cancers and explored the relationship between related gene pairs at gene expression and DNA methylation level. Then, we constructed a pan-cancer network with filtered Co-occurrence and mutual exclusivity (COME) events and identified functional epigenetic modules consisting of genes in the zinc finger family. We also found that the selected CO and ME events could be used to classify different types of tumors including pan-cancer into several subtypes with significantly different progression-free interval (PFI). Interestingly, the subtypes determined by COME events are significantly correlated with distinct clinical features, including age, gender, histological type, neoplasm histologic grade, and pathologic stage. Our results suggest that the COME events of DNA methylation could play important roles in tumorigenesis and may benefit the prognosis of different cancers.

Data Source
The gene expression and DNA methylation data of the TCGA project were downloaded from UCSC Xena 1 and preprocessed as we previously described (Ding et al., 2019;Ji et al., 2019). The clinical data matrix of TCGA cancers was downloaded from the 1 http://xena.ucsc.edu TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR) (Liu et al., 2018). The statistics of clinical information in 14 cancer types are listed in Supplementary Table S1.

Definition of Methylated and Unmethylated Genes
We first assigned the DNAm values for each gene with the average beta value of the probes mapped to promoter region, including TSS1500 (from −1,500 to −200 bp upstream of the TSS), TSS200 (region from −200 bp upstream to the transcription start site (TSS) itself), 1stExon (the first exon), and 5'UTR in order as previously described (Jiao et al., 2014;Sharma et al., 2016). According to previous studies (Sproul et al., 2011(Sproul et al., , 2012Heyn et al., 2016), a beta value threshold of 0.3 was used to separate methylated from unmethylated probes. In this study, we defined methylated (average CpG DNAm beta values within gene promoter >0.3) and unmethylated (average CpG DNAm beta values within gene promoter <0.3) genes at the threshold of 0.3 in the lack of a better way to dichotomize continuous DNA methylation beta values.

Identification of Co-occurrence and Mutual Exclusive Gene Pairs
We first convert the DNA methylation profile to a binary matrix, in which methylated genes were set to 1 in corresponding patients, and unmethylated genes were set to 0. Then we used DISCOVER (Canisius et al., 2016), a novel statistical independence test that assesses both COME gene pairs by counting how many samples have an alteration in both genes and comparing this to the number of samples expected to have such an overlap by chance if these alterations were independent. DISCOVER algorithm accepts a binary matrix that each row represents a gene and each column represents a patient as an input, then output the result of significant CO or ME gene pairs.

Filtration of COME Events
Firstly, Fisher's exact test was performed, for each COME event E i , a contingency table (a, b, c, d) was created as bellow:

Occurred Not Occurred
Tumor a b Normal c d In the table, a and b denote the number of tumor samples in which event E i occurred and not occurred, respectively, whereas c and d separately represent the number of normal samples in which event E i occurred and not occurred respectively. Then Fisher's exact test (SciPy package in Python) p-value was calculated to evaluate whether E i was significantly differentially occurred in this cancer type. Finally, frequently occurred COME events were defined as the events that were significantly differentially occurred in at least three different cancer types.

Construction of FEM Models
The FEM algorithm (Jiao et al., 2014) is a functional supervised algorithm, which uses a network of relations between genes (in our case, is frequently occurred COME network) to identify subnetworks where a significant number of genes are associated with a phenotype of interest (POI, in our case, is the differential expression and differential methylation). Differential expression and differential methylation analysis were implemented inside the FEM algorithm.

Unsupervised Consensus Clustering and Survival Analysis
K-means clustering in R package ConsensusClusterPlus (Wilkerson and Hayes, 2010) was used to perform consensus clustering. The optimal cluster number k was chosen depending on the elbow and CDF curve (Senbabaoglu et al., 2014). For survival analysis of the pan-cancer, the best cluster number was chosen as the one with the maximum average silhouette coefficient. Python package lifelines 2 was implemented in survival analysis, and the log-rank test was used to estimate the significance of different groups.

Gene Ontology and KEGG Pathway Enrichment Analysis
Gene Ontology (GO) biological process and KEGG pathway enrichment analysis were performed using the web-based gene annotation tools DAVID (Huang da et al., 2009a,b) and ToppGene (Chen et al., 2009), the terms with FDR ≤ = 0.05 were considered as significant.

Statistical Analysis
All statistical analyses were performed with Python3.5.2 on anaconda3-4.0.0. Kruskal-Wallis H-test and Chi-square test were performed with Python package SciPy (Jones et al., 2014).

Overview of Co-occurrence and Mutual Exclusivity Network of DNA Methylation in Different Cancers
To construct the COME network of DNA methylation in cancers, we first dichotomized the DNA methylation beta values in every sample with threshold of 0.3, the genes with average beta value ≥ 0.3 in promoter region are designated as methylated while the genes with average beta value lower than 0.3 in promoter regions were considered as unmethylated (see section "Materials and Methods"). Thus, a binary matrix was built for 2 http://lifelines.readthedocs.io/en/latest/index.html each of 14 different cancers, in which 1 represents methylated and 0 for unmethylated. Then DISCOVER algorithm (Canisius et al., 2016) was employed to detect the CO and ME events based on the binary alteration matrix of DNA methylation. A total of 2,670,651 CO and 2,457,681 ME events that were identified as significant by DISCOVER in 14 cancers (q-value ≤ 0.05, q-value was calculated by DISCOVER). The expression correlation between genes pair of CO is significantly higher than those of ME ( Figure 1A, p-value <0.001, independent Student's t-test). Moreover, gene pairs of CO events were mainly positively correlated at the DNA methylation level, whereas gene pairs of ME events were negatively correlated ( Figure 1B). In addition, co-methylated gene pairs tend to co-expressed ( Figure 1C, Pearson's correlation = 0.32, p-value = 0).
Then, to screen COME events that are associated with the tumor, Fisher exact test was performed in each cancer to screen the COME events that were significantly enriched in tumor or normal samples (p-value ≤ 0.05, see section "Materials and Methods"). After filtration, a total of 1,385,366 COME events (involving 7334 unique genes) were retained for further analyses [including 1,029,686 CO (involving 6894 genes) and 355,680 ME events (involving 6924 genes), the distribution of COME events in 14 cancers is shown in Figure 1D]. To explore the associations between COME events and cancers, we calculated the fraction of tumor suppressor genes (TSGs) and disease genes against all genes set involved in COME events. Strikingly, over 45% of the COME genes are cancer genes or TSGs ( Figure 1E). We further constructed pan-cancer networks based on CO and ME events. Interestingly, 5 out of the top 10 hub genes in the network are cancer genes, including ELF3, SLC10A4, ANXA9, DEFB118, KRT8 ( Figure 1F). Specifically, ELF3 was annotated as a cancer gene for rectum adenocarcinoma, colorectal neoplasms, hematologic diseases and breast neoplasms in the database of DriverDB , CoReCG (Agarwal et al., 2016), and DDMGD (Bin Raies et al., 2015) respectively. Moreover, aberrant methylation of hub gene AGR2 was reported to be associated with ovarian cancer (Sung et al., 2014), while MT3 was a putative tumor suppressor gene in pediatric acute myeloid leukemia (Tao et al., 2014).

Gene Pairs of COME Events Tend to Be Significantly Correlated Between DNA Methylation and Gene Expression
To build a reliable network of COME events in each cancer, we further screened frequently occurred events from the above 1,385,366 COME events that were significantly enriched in tumor or normal samples, as frequently occurred events, we considered events that were differentially occurred in at least three different cancer types. After filtering, we found that gene expression correlation and DNA methylation correlation between gene pairs of COME events tend to be more correlated than that of before-filtering (Pearson's correlation r = 0.55, p-value = 0, Supplementary Figure S1A). The correlations between gene expression and DNA methylation of most genes involved in COME gene pairs tend to be more negatively correlated than that of randomly generated random gene sets ( Figure 2A, random gene set was generated randomly with the same number of genes in each cancer). Gene functional enrichment analysis showed that the genes involved in CO events were mainly enriched in the pathways of Neuroactive ligandreceptor interaction, Nicotine addiction, Morphine addiction, cAMP signaling, Calcium signaling and the biological processes of chemical synaptic transmission, cell adhesion, neuropeptide signaling pathway and so on (Figure 2B and Supplementary  Table S2). While the genes of ME events were mainly associated with the development of the central nervous system and brain ( Figure 2C and Supplementary Table S2). We further built a pan-cancer cooperative network by merging the networks of each cancer (Supplementary Figures S1B-D). Intriguingly, most of the genes with a high degree (have the largest number of links) in the pan-cancer network showed enrichment for known cancerrelated genes or tumor suppressor genes ( Figure 2D). Aberrant DNA methylation of some of those top 10 genes with the highest degree has been reported to be associated with neoplasms, such FIGURE 2 | The features and functions of gene pairs in COME events. (A) Box plot of Pearson's correlation coefficient between gene expression and DNA methylation for the genes involved in COME events compared with that of random genes set (p-value was calculated by independent Student's t-test). (B,C) Enriched KEGG pathways and biological processes for the genes involved in CO and ME events (DAVID online web server). (D) Fraction of cancer genes and tumor suppressor genes (TSG) against the top N genes ranked by the degree in pan-cancer network. (E) Degree distribution of top 10 genes in pan-cancer network for each cancer and pan-cancer.

Zinc Fingers Gene Family Is Enriched in Functional Epigenetic Modules
To identify functional epigenetic modules, we integrated the gene expression and DNA methylation data from TCGA and frequently occurred COME networks constructed in each cancer using the FEM algorithm (Jiao et al., 2014), which can be used to effectively identify gene modules of coordinated differential methylation and differential expression in the context of a network. Many functional epigenetic modules were identified by FEM. Remarkably, six modules enriched in the zinc fingers gene family were identified in 6 distinct cancer types (Figures 3A-F). Most genes in these modules were hypermethylated and downregulated, indicating that genes of zinc fingers family may tend to be co-methylated and transcriptionally suppressed.
To further explore the associations between aberrant DNA methylation of zinc fingers gene family and neoplasms, we found that those genes were significantly enriched in regulation of transcription, DNA binding transcription factor activity, RNA polymerase II regulatory region sequence-specific DNA binding, Neuroactive ligand-receptor interaction, and so on (Supplementary Table S3). Besides, many of the genes in these modules were enriched in cytoband of 19q13.43, transcription factor binding sites of ZNF274 and they tend to have similar DNA methylation patterns.
We also examined whether those genes in the zinc fingers gene family can be used to distinguish tumor samples from normal samples in the above 14 cancers. Six genes (ZIK1, ZNF471, ZNF229, ZFP28, ZNF677, and ZNF582) shared among 6 modules were selected to build a logistic regression module. Compared with the models based on gene expression or DNA methylation along (AUC = 0.76), the module integrated both FIGURE 3 | Functional epigenetic modules identified from CO and ME network. The color of core indicate significant DNA methylation changes, color of border represent significant gene expression changes, edges represent COME events between two genes, and edge color indicates different event type (co-occurrence or mutual exclusivity). (A-F) Functional epigenetic modules identified from corresponding cancer type.
DNA methylation and gene expression data of the 6 genes showed better performance in distinguishing tumor samples from normal (AUC = 0.86, Supplementary Figure S2). Moreover, the clustering result also indicates that gene expression and DNA methylation profile of those 6 genes can effectively separate tumor samples from normal samples (Supplementary Figure S3). Most of those 6 genes are involved in function of nucleic acid binding, and some of those genes have been reported to be aberrantly methylated in tumors [such as ZIK1 (Borinstein et al., 2010), ZNF471 (Bhat et al., 2017)] and may serve as a marker of FIGURE 4 | Kaplan-Meier plot of PFI for distinct subtypes of 6 different cancers. Consensus cluster plot (top) and Kaplan-Meier survival plots (bottom) were separately shown for 6 disparate cancers, c1-c4 represent cluster 1 -cluster 4, and p-values were calculated by log rank test.
cancer [e.g., ZNF677 (Heller et al., 2015), ZNF582 (Lin et al., 2014)]. Consequently, our finding demonstrates that combing DNA methylation and gene expression data of those genes from zinc fingers family may be associated with tumorigenesis.

Co-occurrence and Mutual Exclusive Events Contribute to Prognosis in Human Cancers
To investigate whether the COME events are associated with cancer prognosis, Frequently occurred COME events (occurred in at least 3 distinct cancer types, Supplementary  Table S4) were used to perform consensus cluster and Kaplan-Meier survival analysis in each of those 13 cancers (no frequently occurred COME event was left in pancreatic adenocarcinoma, PAAD). Strikingly, we observed significantly different PFI among disparate subtypes in 6 different cancer types (p-value <0.05, logrank test, Figure 4), as well as in BRCA and lung squamous cell carcinoma (LUSC) (p-value = 0.07, Supplementary Figure S4).
Next, we performed Kruskal-Wallis H-test and Chi-square test to explore the association between COME subtypes and different clinical information (including age, gender, histological type, FIGURE 5 | COME subtypes correlate with distinct clinical features. (A,B) Age distribution of different COME subtypes identified in BRCA and UCEC, respectively. (C-J) Distribution of gender, histological type, neoplasm histologic grade and pathologic stage against COME subtypes in corresponding TCGA cancers. p-values for continuous variable were calculated by using Kruskal-Wallis H-test (A,B), and p-values for categorical variable were calculated by performing Chi-square test (C-J). p-value were calculated between all different groups.
neoplasm histologic grade, and pathologic stage). Interestingly, significant difference was observed between different clinical features and different subtypes in 5 cancers (Figure 5). For example, the distribution of age was significantly different in different COME subtypes in BRCA and uterine corpus endometrial carcinoma (UCEC) (Figures 5A,B, Kruskal-Wallis H-test, p-value <0.001), which is in accordance with previous studies that DNA methylation is associated with age (Horvath, 2013;Johnson et al., 2014). In kidney renal papillary cell carcinoma (KIRP), cluster 3 is enriched with female whereas cluster 2 is enriched with male ( Figure 5C). Moreover, cluster 2 of LUSC is significantly enriched with female, which is opposite to clusters 1, 3, and 4 ( Figure 5D). We also found that different COME subtypes have distinct distribution of histological types in BRCA and UCEC (Figures 5E,F). Furthermore, cluster 2 of UCEC is enriched with the histological type of serous endometrial adenocarcinoma and cluster 2 has a shorter PFI compared to other clusters (Figure 4). As for neoplasm histologic grade (Figures 5G,H), cluster 3 of kidney renal clear cell carcinoma (KIRC) is enriched in grade G3 and G4, and poorer prognosis was observed in this cluster (Figure 4). Similarly, G3 and high-grade patients are enriched in cluster 2 of UCEC, while cluster 2 has poorer survival probability compared to other subtypes (Figure 4). Patients of stage III and stage IV are enriched in cluster 3 and has the poorest clinical outcome in KIRC, while cluster 1 is enriched with the patients of stage III and stage IV and has the shortest PFI in KIRP (Figures 4, 5I,J). Collectively, the subtypes determined by COME pattern are correlated with various clinical features (age, gender, histological type, neoplasm histologic grade, and pathologic stage), which may explain why distinct COME subtypes have significantly different PFI.
To further explore the COME events in pan-cancer, we performed consensus clustering based on the frequently occurred COME events in 5442 tumor samples of 13 types of cancer. Thirteen clusters were identified by maximizing the average silhouette coefficient. The top 15 significantly enriched COME events in each cluster are shown in Figure 6A. Most of these clusters were significantly correlated with cancer tissue of origin (p-value <0.0001, Chi-square test, Supplementary Table S5). For example, clusters of C7, C9, C10, C11, and C13 were significantly enriched with patients from LIHC (liver hepatocellular carcinoma), HNSC (head and neck squamous cell carcinoma), COAD (colon adenocarcinoma), UCEC and PRAD (prostate adenocarcinoma), respectively (p-value <0.0001, hypergeometric test). While some other clusters, such as C1, contained a mixture type of BLCA (bladder urothelial carcinoma), BRCA, LUAD (lung adenocarcinoma), and LUSC (p-value <0.0001, hypergeometric test). Furthermore, those 13 clusters exhibited significantly different PFI via Kaplan-Meier analysis (p-value <0.0001, logrank test, Figure 6B). Among the 13 clusters, cluster C5 exhibited the best prognosis and the co-occurrence of CRMP1-GRM6 was the most significantly enriched event in this cluster (p-value <0.001, hypergeometric test). CRMP1 (collapsin response mediator protein 1) has been reported to be associated with medulloblastoma (Li et al., 2015) and gliomas (Mukherjee et al., 2009). Hypermethylation of the CpG sites on GRM6 (glutamate metabotropic receptor 6) was reported to be a hallmark of CIMP in clear cell renal cell carcinomas (Arai et al., 2012). In contrast, cluster C7 was associated with the poorest prognosis and co-methylation of GRB7-SLC45A4 was enriched in this group (p-value <0.001, hypergeometric test). GRB7 (growth factor receptor bound protein 7) was reported to play an important role in breast cancer progression (Lim et al., 2014). We further performed Kaplan-Meier survival analysis in the pan-cancer to verify whether the co-methylation of CRMP1-GRM6 and GRB7-SLC45A4 was associated with clinical outcome in pan-cancer. The result shows that patients with co-methylation of CRMP1-GRM6 have better outcomes (p-value <0.0001, logrank test, Figure 6C), whereas patients with co-methylation of GRB7-SLC45A4 exhibit significantly poorer prognosis (p-value <0.0001, log-rank test, Figure 6D).

DISCUSSION
In our study, we first identified 2,670,651 CO and 2,457,681 ME gene pairs in 14 different cancers based on the methylation profile from the TCGA project. Interestingly, the genes in functional epigenetic modules identified in six cancer types were mainly from the zinc finger gene family, and most of those genes were epigenetically repressed. Although several studies have reported the epigenetic silencing of the zinc finger gene family (Lleras et al., 2011;Severson et al., 2013;Gaykalova et al., 2015), we are the first to identify functional epigenetic modules of the zinc finger gene family in six cancer types by integrating gene expression and DNA methylation data in the context of COME networks. Methylation was reported to be the main mechanism for downregulation of tumor cell growth suppressor ZNF677 in non-small cell lung cancers (NSCLCs) and the methylation of ZNF677 could be used in the prognosis of NSCLCs (Heller et al., 2015). Furthermore, we identified a set of COME events that can divide tumor patients into different subtypes with significantly different clinical outcomes. Different COME subtypes were found to be significantly associated with distinct clinical features, such as age, gender, histological type, neoplasm histologic grade and pathologic stage. We also found that COME events could be used to divide tumor samples of pan-cancer into different subtypes with significantly different outcomes, which may benefit the prediction of the prognosis for pan-cancer.
This study is just the beginning to investigate and characterize the roles of COME of DNA methylation in human cancers. Our findings may contribute to the diagnosis and prognosis of human pan-cancer. The underlying mechanism and function of COME events in diverse cancers are still needed to be further studied in the future.

DATA AVAILABILITY STATEMENT
The TCGA data analyzed in this study was obtained from UCSC Xena (http://xena.ucsc.edu). This study fully complies with the TCGA publication requirements (http://cancergenome.nih.gov/ publications/publicationguidelines).

AUTHOR CONTRIBUTIONS
TS and WD designed the study and wrote the manuscript. WD and YH conducted the data analysis. WD, GC, and TS revised and finalized the manuscript. All authors read and approved the final manuscript.