A convolutional neural network-based regression model to infer the epigenetic crosstalk responsible for CG methylation patterns

Background Epigenetic modifications, including CG methylation (a major form of DNA methylation) and histone modifications, interact with each other to shape their genomic distribution patterns. However, the entire picture of the epigenetic crosstalk regulating the CG methylation pattern is unknown especially in cells that are available only in a limited number, such as mammalian oocytes. Most machine learning approaches developed so far aim at finding DNA sequences responsible for the CG methylation patterns and were not tailored for studying the epigenetic crosstalk. Results We built a machine learning model named epiNet to predict CG methylation patterns based on other epigenetic features, such as histone modifications, but not DNA sequence. Using epiNet, we identified biologically relevant epigenetic crosstalk between histone H3K36me3, H3K4me3, and CG methylation in mouse oocytes. This model also predicted the altered CG methylation pattern of mutant oocytes having perturbed histone modification, was applicable to cross-species prediction of the CG methylation pattern of human oocytes, and identified the epigenetic crosstalk potentially important in other cell types. Conclusions Our findings provide insight into the epigenetic crosstalk regulating the CG methylation pattern in mammalian oocytes and other cells. The use of epiNet should help to design or complement biological experiments in epigenetics studies. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04272-8.

Then, in the postnatal ovary, an oocyte-specific CG methylation pattern is established by de novo DNA methyltransferase DNMT3A [3]. In vitro studies revealed that DNMT3A contains protein domains that potentially interact with histone modifications and regulate recruitment of this protein to target loci [4,5]. Indeed, reverse genetics approaches began to reveal the impact of histone modifications, such as H3K36me3, on the establishment of CG methylation in mouse oocytes [6], but the entire picture of the epigenetic crosstalk regulating CG methylation patterns is unknown. This is partly due to the scarcity of oocyte samples available for molecular studies and the high cost required for a genome-wide epigenetic modification analysis. Previous studies showed that machine learning approaches are useful for predicting CG methylation patterns based on limited measurable features (mainly DNA sequence information) and thus may complement biological experiments [7][8][9][10]. However, most of the existing approaches aim at finding DNA sequences responsible for the CG methylation patterns and were not tailored for studying the epigenetic crosstalk.
In this study, we built a machine learning model named epiNet to predict CG methylation patterns based on other epigenetic features. Using epiNet, we identified biologically relevant epigenetic crosstalk between histone H3K36me3, H3K4me3, and CG methylation in mouse oocytes. This model also predicted the altered CG methylation pattern of mutant oocytes having perturbed histone modification and was applicable to cross-species prediction of the CG methylation pattern of human oocytes. It also identified the epigenetic crosstalk potentially important in other cell types. The use of epiNet should help to design or complement biological experiments in epigenetics studies.

Outline of epiNet: prediction of CG methylation patterns based on other epigenetic features
To aid in the identification of epigenetic features regulating CG methylation patterns, we developed a convolutional neural network-based regression model named epiNet, which predicts CG methylation patterns based on other available genome-wide epigenetic features. Contrary to the previous machine learning approaches, epiNet does not use DNA sequences as input, in an attempt to focus on the interactions between the epigenetic features. epiNet consisted of four layers, including one convolutional and one fully connected layer, and built a model for the prediction of CG methylation levels of 1-kilobase (kb) bins (genomic segments) based on the fragment per kb per million mapped reads (FPKM) values of input data ( Fig. 1a and "Methods" section). Datasets of the following features from fully-grown mouse oocytes (FGOs) or metaphase II oocytes were used as input: six histone modifications, chromatin accessibility, and transcription [6,[11][12][13][14][15][16][17] (Additional file 1: Table S1). To incorporate the effect of the epigenetic state of the neighboring regions, the mean of FPKM values of 10 nearby (5 upstream and 5 downstream) bins was also used (Additional file 1: Fig. S1a, b). CG methylation data were obtained from whole genome bisulfite sequencing (WGBS) [6,12].

Prediction of the epigenetic crosstalk responsible for CG methylation patterns by epiNet models
To examine to what extent the features cooperate to impact CG methylation, all possible combinations (n = 255) of the eight features were respectively used as input. We set aside datasets from chromosome 1 for performance test and used those from all other chromosomes for training and validation. As a result, a combination of all eight features (N = 8) gave the highest predictive power (correlation with the actual CG methylation data (R) = 0.96) (Fig. 1b, c). We then sought combinations of a smaller number of features that achieved good performance and found that even H3K36me3 alone had high predictive power (R = 0.88) (Fig. 1b, c). Our model based on H3K36me3 outperformed the baseline method of linear regression (R = 0.83) (Additional file 1: Fig. S1b), especially at gene bodies and intergenic regions, but not at CpG islands (Additional file 1: Fig. S1c). Importantly, when a randomly shuffled H3K36me3 dataset was used, there was no correlation with the actual data (R = 0.00) (Additional file 1: Table S2). A combination of H3K36me3 and H3K4me3 showed a higher correlation (R = 0.94), which was close to the value achieved by all of the features (R = 0.96) (Fig. 1b, c). This is consistent with the fact that, in FGOs, the genomic distribution of CG methylation shows the highest correlation with that of H3K36me3 (R = 0.83) and highest anti-correlation with that of H3K4me3 (R = − 0.50) (Fig. 1d). The predicted CG methylation patterns showed that the addition of H3K4me3 greatly improved the prediction at CpG islands (Additional file 1: Fig. S1d), which are rich in H3K4me3 [18]. Consistent with this, H3K4me3 alone showed a high predictive power at CpG islands (R = 0.85) (Additional file 1: Fig. S1c) but not in the entire genome (R = 0.58) (Additional file 1: Table S2). The further addition of H3K9me3, H3K9me2 and H3K27me3, which were either correlated or anti-correlated with CG methylation (R = 0.52, − 0.46 and − 0.40), slightly improved the prediction (R = 0.95) (Fig. 1b).
Are H3K36me3 and H3K4me3 biologically relevant to CG methylation? Previous gene knockout (KO) studies on histone modification enzymes showed that the depletion of H3K36me3 (by Setd2 KO, see below) causes a genome-wide loss of CG methylation (with occasional local gains) in oocytes [6] and that the depletion of H3K4me3 (by Mll2 KO) causes local changes in CG methylation (more losses than gains) [11] (Additional file 1: Fig. S2) (see also Fig. 2b for H3K36me3 depletion). These results suggest that the two features, which showed the highest contribution to the in silico prediction of the CG methylation pattern, are biologically relevant. This in turn suggests that prediction using epiNet may aid experimental biologists in tasks such as the selection of KO targets among the epigenetic modification enzyme genes.

Prediction of the altered CG methylation patterns of mutant oocytes having perturbed histone modifications
We next investigated whether epiNet, when trained with the wildtype dataset, would predict the CG methylation changes upon perturbation of an epigenetic feature. Only one previous study, an oocyte-specific KO of Setd2 (the only H3K36me3 enzyme) [6] provided a dataset that could be used to answer this question (Additional file 1: Table S1). In this analysis, we used 50-kb bins due to the resolution limit of the data. When H3K36me3 and H3K4me3 from Setd2 KO FGOs were used as input, we observed a moderate correlation between the predicted and actual CG methylation patterns (R = 0.71). However, the performance dropped when H3K36me3 (R = 0.33) or H3K4me3 alone was used (R = 0.15) (Fig. 2a). A closer examination of the predicted pattern revealed that a combination of H3K36me3 and H3K4me3 reproduced not only the global loss, but also the local gains, of CG methylation in Setd2 KO FGOs (Fig. 2b). Although the local gains in CG methylation coincided with the reductions in H3K4me3, H3K27me3 and H3K27ac in KO FGOs [6], the inclusion of the H3K27me3 and H3K27ac data caused a deterioration in predictive performance (R = 0.65-0.68) (Fig. 2c).

Cross-species application of epiNet models to human oocytes
We next examined whether epiNet could be used to predict the CG methylation pattern of human oocytes, which are precious and not easily accessible. The model would be especially useful if it performs reasonably well based on a limited number of epigenetic features or a limited amount of the feature data. Other than CG methylation data Cross-species application of epiNet models to human oocytes. a Pearson correlation coefficients between the predicted and actual CG methylation patterns of human FGOs. The features used to predict the CG methylation pattern were H3K4me3, H3K27me3 and chromatin accessibility. The species from which the training and test data originated are indicated. When training and testing were performed in the same species, the test data were from chromosome 1 and training and validation data were from the rest of the chromosomes. For cross-species testing, H3K4me3, H3K27me3 and chromatin accessibility data from all mouse chromosomes other than chromosome 1 were used to build an epiNet model. Then, H3K4me3, H3K27me3 and chromatin accessibility data from the entire human genome were used to predict the human CG methylation pattern. b A representative genome browser shot showing predicted CG methylation patterns of human FGOs. The actual CG methylation, H3K4me3 enrichment, H3K27me3 enrichment, and chromatin accessibility of human FGOs are shown for comparison. RefSeq genes are indicated at the bottom [19], only H3K4me3, H3K27me3 and chromatin accessibility data were available from human FGOs [20] (Additional file 1: Table S1). We first tested whether the same features of mouse FGOs could predict the mouse CG methylation pattern and observed good performance (R = 0.87) (Fig. 3a, b). We then used the human dataset for testing (chromosome 1) and training/validation (all other chromosomes) and observed similarly good predictive performance (R = 0.86). Notably, when the human dataset (the entire genome) were applied to an epiNet model trained with the mouse dataset (chromosomes other than chromosome 1), we still observed a reasonably high correlation between the predicted and actual CG methylation patterns (R = 0.77) (Fig. 3a, b). This suggests that the cross-species application of an epiNet model is possible within the mammalian class.

Application of epiNet to other cell types
We then investigated whether epiNet can be applied to cells other than oocytes. We focused on three cell types: mouse embryonic stem cells, human embryonic stem cells, and human neuronal progenitor cells, all of which have de novo CG methylation activity (and actively create a CG methylation pattern) as oocytes [21,22]. Importantly, a common set of data comprising CG methylation and five histone modifications was available for these cell types (Additional file 1: Table S1) [23][24][25]. When all possible combinations (n = 63) of the five features (histone modifications) were respectively used as input, we found that a combination of all five features (N = 5) gave the highest correlation with the actual CG methylation data in these cell types (R = 0.79-0.89) (Fig. 4a, b, c, d). We then found that even H3K4me3 alone had a high predictive power (R = 0.71-0.81). This contrasts with the highest performance achieved with H3K36me3 in oocytes but is consistent with the fact that, in all three cell types, the genomic distribution of CG methylation  Table S3). A combination of H3K4me3 and the best second feature (either H3K27me3 or H3K36me3, depending on the cell type) showed a higher correlation (R = 0.76-0.86), which was close to the value achieved by all of the features.
Are these features biologically relevant to CG methylation in these cell types? In mouse embryonic stem cells, a H3K4me3 depletion (by Mll2 KO) and a H3K27me3 depletion (by Eed KO) respectively caused a genome-wide gain and redistribution of CG methylation [26,27]. Thus, the two features showing the highest contribution to the in silico prediction were indeed biologically relevant. While the biological relevance of the features in human embryonic stem cells and neural progenitor cells awaits experimental validation, our results from epiNet will give us a hint to design future studies.

Discussion
In this study, we built epiNet, a machine learning model to predict genome-wide CG methylation patterns of mammalian oocytes based on a limited number of other epigenetic features. epiNet captured the crosstalk between the epigenetic features and found that a combination of H3K36me3 and H3K4me3 can predict the CG methylation patterns of mouse oocytes quite accurately. Are the two histone modifications biologically relevant to CG methylation? DNMT3A, an enzyme responsible for the de novo CG methylation in oocytes, contains a PWWP domain, which recognizes H3K36me3, and an ADD domain, which recognizes H3K4me0 (regions devoid of H3K4me3) [4,5]. Previous gene knockout (KO) studies showed that the depletion of H3K36me3 in oocytes causes a genome-wide loss of CG methylation (with occasional local gains) [6] and that the depletion of H3K4me3 causes local changes in CG methylation (more losses than gains) [12]. These results suggest that the two features are indeed biologically relevant. This in turn suggests that prediction using epiNet may aid experimental biologists in tasks such as the selection of KO targets among the epigenetic modification enzyme genes. The epiNet model is also applicable to cross-species prediction of the CG methylation pattern in human oocytes, although the predictive power seems somewhat lower.
The CG methylation establishment in mouse oocytes has been viewed as a transcription-coupled event, based on the previous studies on individual loci [28] and the entire genome [16]. While the CG methylation level is generally high in transcribed regions, our results suggest that the transcriptome data is not directly associated with the CG methylation pattern in mouse oocytes (Fig. 1b, d). Rather, a transcription-coupled histone modification, H3K36me3, is the major contributor to the CG methylation prediction in oocytes. This is consistent with the recent report that transcription likely impacts CG methylation through histone modifications and chromatin remodeling in mouse oocytes [29].
Lastly, we confirmed that epiNet is applicable to mammalian cells other than oocytes, such as embryonic stem cells and neuronal progenitor cells, the outcome of which suggested that different histone modifications play dominant roles in the formation of the CG methylation pattern in different cell types. The biological relevance of the prediction was confirmed by the gene KO studies in mouse embryonic stem cells [26,27]. While the biological relevance in other cell types awaits experimental validation, epiNet should help to design biological experiments in future studies.

Conclusions
The epiNet models could predict the CG methylation patterns of mammalian oocytes and embryonic stem cells accurately based on datasets from a limited number of other epigenetic features and could infer the crosstalk between the features and CG methylation. The available gene KO data suggest that the crosstalk inferred by epiNet is indeed biologically relevant. It also has the advantage of cross-species application. Our findings provide insight into the epigenetic crosstalk in mammalian oocytes and embryonic stem cells and demonstrate the usefulness of machine learning approaches in designing or complementing biological experiments in epigenetics studies.

Data processing
The datasets used in this study are summarized in Additional file 1: Table S1. The sequence reads were trimmed to remove low quality bases and adaptor sequences using Trim Galore 0.3.3 [30]. To obtain a CG methylation pattern, WGBS reads of mouse oocytes were mapped to mouse genome mm10 using Bismark 0.20.0 [31]. The WGBS data of human oocytes were downloaded from the NBDC human database [19]. The methylation levels of CG sites covered by 5-100 reads were extracted for downstream analyses. Bins with less than five informative CG sites were excluded. To obtain a histone modification or a chromatin accessibility pattern, ChIP-seq or CUT&RUN reads (for histone modification) or DNase-seq or ATAC-seq reads (for chromatin accessibility) of mouse and human oocytes were mapped to mm10 or human genome hg19 by bowtie2 2.2.9 [32]. Duplicate and low-quality reads (MapQ < 5) were removed using Picard 2.6.0 [33]. To obtain a transcription profile, RNA-seq reads of mouse oocytes were mapped to mm10 by HISAT2 2.0.5 [34]. The WGBS and histone modification data of human cell types (other than oocytes) were downloaded from the NIH Roadmap Epigenomics database [35]. In order to be used by epiNet, CG methylation levels (between 0 and 100%) were scaled between 0 and 1. FPKM values of other features were scaled between 0 and 1 by assigning the 95th percentile value as 0.95.

The structure and application of epiNet
The model predicts CG methylation levels of bins from FPKM values of epigenetic features. It was implemented on python 3.6.8 [36], Tensorflow 1.14.0 [37] and Keras 2.2.4 [38]. The model consisted of four layers, including one convolutional layer and one fully connected layer. The genome was divided into 1-kb or 50-kb bins. Scaled CG methylation levels and FPKM values of input features were calculated for these bins. For each bin with N features, we constructed an N × 2 feature matrix s where s i1 is the FPKM value of the ith feature in the bin and s i2 is the mean FPKM value of the ith feature from 10 nearby bins. The input matrix s is first transformed by a twodimensional convolutional layer with a kernel size of 2 for each feature, followed by the rectified linear unit (ReLU) function. The output, X fi , at the filter f (total 64 filters) for the ith feature is represented as follows: where w fk is the weight of convolutional filter f of kth element of input matrix s i of the ith feature. All outputs in the previous layer are compiled into a single N × 64 matrix. This single matrix is flattened to a 64 N-dimensional vector Y. Y is transformed to a 256-dimensional vector Z: where W (2) and b (2) are the 256 × 64 N weight matrix and 1 × 64 N bias matrix, respectively. The final output layer transforms Z to predicted CG methylation level C: where W (3) and b (3) are a 1 × 256 weight matrix and 1 × 256 bias matrix respectively.
Otherwise noted, we used data from chromosome 1 as the test set, data from chromosomes 2 and 3 as the validation set, and data from all other chromosomes as the training set for both mouse and human. Training was performed by fitting the model on the training set with a batch size of 100 and by optimizing the hyperparameters through minimizing the mean squared error F in total V bins of the validation set, until there was no more reduction in F for 20 epochs: where C v and C v are the actual and predicted CG methylation levels of the vth bin respectively. The final model performance was evaluated on the test dataset, by calculating the Pearson correlation coefficient between the actual and predicted CG methylation patterns.

Data visualization and statistical analysis
Genome browser shots were generated using Integrative Genomics Viewer [40]. Pearson correlation coefficients between the eight input features and CG methylation were determined by deepTools 3.3.1 [41].

Supplementary Information
The online version contains supplementary material available at https:// doi. org/ 10. 1186/ s12859-021-04272-8. 1: Fig. S1. Incorporation of the mean FPKM value of the neighboring region around the bin as input. Fig. S2. The CG methylation patterns of mouse FGOs depleted of H3K36me3 or H3K4me3. Table S1. Data used in this study. Table S2. Performance of epiNet based on actual versus randomly shuffled data. Table S3. Pearson correlation of histone modfications with CG methylation in cell types other than oocytes.