MACMIC Reveals Dual Role of CTCF in Epigenetic Regulation of Cell Identity Genes

Numerous studies of relationship between epigenomic features have focused on their strong correlation across the genome, likely because such relationship can be easily identified by many established methods for correlation analysis. However, two features with little correlation may still colocalize at many genomic sites to implement important functions. There is no bioinformatic tool for researchers to specifically identify such feature pair. Here, we develop a method to identify feature pair in which two features have maximal colocalization but minimal correlation (MACMIC) across the genome. By MACMIC analysis of 3,385 feature pairs in 15 cell types, we reveal a dual role of CTCF in epigenetic regulation of cell identity genes. Although super-enhancers are associated with activation of target genes, only a subset of super-enhancers colocalized with CTCF regulate cell identity genes. At super-enhancers colocalized with CTCF, the CTCF is required for the active marker H3K27ac in cell type requiring the activation, and also required for the repressive marker H3K27me3 in other cell types requiring the repression. Our work demonstrates the biological utility of the MACMIC analysis and reveals a key role for CTCF in epigenetic regulation of cell identity.

and H3K27me3 width (H3K4me3&H3K27me3 broad colocalization) (Fig. S1E). We also 2 2 3 defined the top 500 genes by the rank product of H3K4me3 width and H3K27ac width 2 2 4 (H3K4me3&H3K27ac broad colocalization) (Fig. S1E). For the genes associated with 2 2 5 H3K4me3&H3K27ac broad colocalization, only 7 genes were not captured by broad 2 2 6 H3K4me3 or broad H3K27ac (Fig. S1C). However, for the genes associated with 2 2 7 H3K4me3&H3K27me3 broad colocalization, 421 (84.2%) genes were not captured by 2 2 8 broad H3K4me3 or broad H3K27me3 (Fig. S1D). Further, for the 2168 pathways with broad H3K4me3 or broad H3K27me3 (Fig. S1F). These pathways were mainly 2 3 2 related to somatic cell lineage specification (Fig. S1G), which agreed with the reported were not associated with the localization of each of these features. Here, we used mutual information as an indication for correlation because mutual 2 4 0 information is more general than other methods such as linear correlation. A large 2 4 1 mutual information value will indicate strong correlation that can be either positive or  To develop a simple method to prioritize feature pairs that have minimal correlation but a 2 4 6 maximal number of colocalizations, we first performed a systematic analysis of the 1 1 225 feature pairs derived from 6 chromatin features in 15 cell types (Table S1). We 2 4 9 analyzed 6 features, which formed 15 pairs with each other in each cell type and thus 2 5 0 resulted in 225 feature pairs in 15 cell types (Table S3). Most feature pairs displayed a 2 5 1 positive correlation between the mutual information value and the number of 2 5 2 colocalizations (Spearman correlation coefficient 0.46) (Fig. 1A). Similar results were PCA value (Fig. S2). However, there were a few feature pairs that displayed a large 2 5 5 number of colocalizations but small mutual information value (Fig. 1A). We therefore 2 5 6 developed a regression model that employed the mutual information value to determine 2 5 7 an expected number of colocalizations, and next utilized the normalized discrepancy of the MACMIC (Fig. 1B). We calculated the MACMIC scores for the 225 individual 2 6 0 feature pairs and found that the large MACMIC scores effectively prioritized feature pairs 2 6 1 that possessed large number of colocalizations but weak correlations (Fig. 1C). We successfully prioritized the feature pairs with minimal mutual information but substantial  scores between H3K4me3 and H3K27me3 were low in all the 14 somatic cell types 2 7 6 (from -0.68 to 0.67) ( Fig. 2A). Therefore, MACMIC analysis successfully recaptured 2 7 7 bivalent domains that were known to play a key role in embryonic stem cells. We next tested whether MACMIC analysis could successfully identify new feature pairs 2 8 0 that possess a large number of functionally important colocalizations but low correlation. We ranked a set of 79 chromatin features in H1 cells by the MACMIC scores between 2 8 2 the enhancer feature H3K27ac and each of these features (Fig. 2B) interaction, the CTCF 25 and its binding partner RAD21 26 , topped in the rank list (Fig. 2B).

8 7
We further performed analysis in 14 human somatic cell types that each had ChIP-seq 2 8 8 datasets for a set of 6 chromatin features from the ENCODE project 17 ( Table S1). The results showed that the MACMIC score between H3K27ac and the binding of CTCF was 2 9 0 significantly larger than MACMIC scores between H3K27ac and the other 4 features 2 9 1 including H3K27me3, H3K4me3, H3K9me3 and H3K79me2 (Fig. 2C). Moreover, 2 9 2 colocalization analysis for CTCF and H3K27ac found that CTCF binding sites had the 2 9 3 largest number of colocalizations with the broadest H3K27ac peaks (super-enhancers) 2 9 4 ( Fig. 2D). To test whether this higher frequency of colocalization was simply due to the 2 9 5 longer DNA sequences of super-enhancers, we performed a normalization by lengthening typical enhancers at the two ends of each enhancer, so that the DNA 2 9 7 sequences assigned to typical enhancers had equivalent sizes to those of super- Since super-enhancers were reported to regulate cell identity genes 24 , we determined to 3 0 4 investigate the role of CTCF in this regulation. We divided super-enhancers into two CSEs (or OSEs) marked genes Intriguingly, only the genes marked by CSEs were 3 0 9 significantly enriched in the pathways associated with cell lineage specifications, e.g., gene FOXG1 28 in neural cells (Fig. 3C, D). In contrast, some other genes, although also 3 1 6 displaying broad enrichment of H3K27ac, were depleted of CTCF binding sites, e.g., at 3 1 7 the gene ARF1 in endothelial cells and the gene PON1 in neural cells (Fig. 3C, D).

1 8
Intriguingly, there were typically multiple binding sites of CTCF located within the active with CSEs encoded transcription factors, whereas we did not observe this phenomenon 3 2 3 for the genes associated with OSEs (Fig. 3E). Further, the genes associated with CSEs whereas the numbers of connected network edges were similar for genes associated 3 2 6 with OSEs and random control genes (Fig. 3F). The differences between CSEs and 3 2 7 OSEs in their association with genes in cell lineage pathways were highly reproducible in 3 2 8 the other 15 primary cell types that we have analyzed (Fig. 3G). It was reported that the 3 2 9 establishment of cell type specific chromatin loops were important during cell 3 3 0 differentiation 29 . Consistently, we found that CSEs were enriched near chromatin loops 3 3 1 (Fig. S3A) and the boundaries of topologically associating domains (TADs) (Fig. S3B), whereas no significant differences in the sizes of the associated TADs were observed  To understand how CTCF regulates enhancer activity and in turn regulates cell identity,

CSE-and OSE-associated genes have similar expression levels and cell type
we first compared the expression levels of associated genes between CSEs and OSEs. Intriguingly, similar expression levels were observed between CSE-marked genes and OSEs genes of the same cell type had similar expression levels and cell type  We next compared the H3K27ac levels between CSEs and OSEs, as H3K27ac is a 1 6 binding of CTCF in 476 (96%) neural cell OSEs were retained in both HUVECs and 3 7 8 embryonic stem cells ( Fig. S4 bottom right). These results indicated that although the 3 7 9 loss of the activation state of CSEs may not require the loss of CTCF bindings, the 3 8 0 bindings of CTCF were required for the activation of CSEs and their associated genes. binding of CTCF might be also important for the repressions of these CSEs in the other 3 9 0 cell types. We first defined CSEs, OSEs, and a set of random control genes in HUVECs, and HUVECs. We found that the H3K27me3 signals in HUVEC showed a similar pattern at 3 9 6 the HUVEC CSEs as at the HUVEC OSEs, but are substantially weaker than at the 3 9 7 random control genes (Fig. 5A top). Intriguingly, only the CSEs of HUVECs, not the 3 9 8 OSEs of HUVECs or the random control genes, were marked by strong H3K27me3 3 9 9 signals in embryonic stem cells (Fig. 5A middle). These trends observed for H3K27me3 4 0 0 in embryonic stem cells were the same for H3K27me3 in neural cells (Fig. 5A bottom).

0 1
Similar results were observed when we defined CSEs, OSEs, and a set of random substantially weaker at the random control genes (Fig. 5B bottom). However, only the 4 0 6 CSEs of neural cells, not the OSEs of neural cells or the random control genes, 4 0 7 possessed strong H3K27me3 signals in embryonic stem cells (Fig. 5B middle). These HUVECs (Fig. 5B top). OSEs. Next, we analyzed these CSEs and OSEs in H3K27ac ChIP-Seq datasets from the CSEs and OSEs (Fig. S5C). However, the CSEs were associated with significant 4 2 0 enrichment of H3K27me3, whereas the OSEs showed little enrichment of H3K27me3, 4 2 1 when the H3K27me3 was analyzed in cell types different from the cell types that defined 4 2 2 these CSEs and OSEs (Fig. S5D). These analyses indicated that the CSEs, but not the H3K27me3 is also among the top feature pairs ranked by MACMIC score in H1 hESC genes of somatic cell types, the NR2F2 27 of endothelial cells (Fig. 5C left) and the 4 3 5 FOXG1 28 of neural cells (Fig. 5C right), were substantially attenuated after auxin 4 3 6 induced degradation of CTCF, and recovered after auxin was washed off (Fig. 5C). The  up regulated (Fig. 5D, Fig. 5E, Fig. S7 right). Taken together, these results suggested 4 4 6 that the CTCF in a given cell type was required for the repression of genes associated  was hard to capture the significance of such colocalizations on the basis of conventional 4 5 5 correlation analysis. In this study, we provide a new method to identify MACMIC, which 4 5 6 effectively prioritize the feature pairs with low genome-wide correlation but substantial modifications, e.g., H3K4me3, and the repressive histone modifications, e.g., H3K27me3.    identity genes was not known. repressions, consistent with the notion that epigenetic repression of cell identity genes of 4 8 7 a given cell type is critical in other somatic cell types (Fig. 5, S3). In response to the loss the CSEs of somatic cell types were dramatically reduced but restored after recovery of 4 9 0 CTCF function (Fig. 5). Intriguingly, the CTCF binding sites in embryonic stem cells at Recently, many epigenetic regulators have been proven to interact with CTCF in 4 9 8 different biological processes. For instance, BRD2 has been reported to directly interact 4 9 9 with CTCF during Th17 cell differentiation 45 . This report suggested that CTCF might be that CTCF might affect H3K27me3 modification by a process other than the spreading.

0 5
Considering that CTCF was reported to regulate Igf2 expression by direct interaction to limited availability of public datasets for human, we defined genes associated with 5 0 9 CSEs and OSEs in human HUVEC and neural cells, and analyzed homolog genes in and CTCF-AID conditions. To further validate our results, we used Interestingly, among the top-ranked feature pairs in H1-hESC, there are many pairs that 5 1 8 each was formed by a factor associated with chromatin structure and a factor associated combinations that each include a factor of cohesion complex and a factor associated which we found later is also very important for the cell identity regulation.  All public datasets used in this study are listed in Table S1 for ENCODE database and  Table S2 for GEO database. The code for MACMIC is available at the website GitHub, downloaded from Table S4.    Table S3. Datasets of MACMIC score and p value.  Table S4. Datasets of CSE-and OSE-associated genes.