A DNA methylation state transition model reveals the programmed epigenetic heterogeneity in human pre-implantation embryos

During mammalian early embryogenesis, expression and epigenetic heterogeneity emerge before the first cell fate determination, but the programs causing such determinate heterogeneity are largely unexplored. Here, we present MethylTransition, a novel DNA methylation state transition model, for characterizing methylation changes during one or a few cell cycles at single-cell resolution. MethylTransition involves the creation of a transition matrix comprising three parameters that represent the probabilities of DNA methylation-modifying activities in order to link the methylation states before and after a cell cycle. We apply MethylTransition to single-cell DNA methylome data from human pre-implantation embryogenesis and elucidate that the DNA methylation heterogeneity that emerges at promoters during this process is largely an intrinsic output of a program with unique probabilities of DNA methylation-modifying activities. Moreover, we experimentally validate the effect of the initial DNA methylation on expression heterogeneity in pre-implantation mouse embryos. Our study reveals the programmed DNA methylation heterogeneity during human pre-implantation embryogenesis through a novel mathematical model and provides valuable clues for identifying the driving factors of the first cell fate determination during this process.

determination. For example, expression heterogeneity of Carm1 in the 4-cell stage mouse embryo has been reported to cause later imbalanced expression of Sox21, which is critical in the determination of cell differentiation into ICM or TE cells [6,7]. Two recent studies have indicated that the cause of Carm1 expression heterogeneity can be traced back to the 2-cell stage and is associated with the expression heterogeneity of two long non-coding RNAs, LincGET [8] and Neat1 [9]. Considering the robustness of early embryogenesis, such expression heterogeneity has been suggested to be the determinate result of a programmed process [10,11], i.e., a set of cellular instructions. However, to the best of our knowledge, the suggested program causing such heterogeneity is largely unexplored. In addition to gene expression levels, DNA methylation levels have been reported to be heterogeneous among cells in pre-implantation embryos [12], likely as a result of a programmed process. As DNA methylation plays important roles in gene transcriptional regulation, the expression heterogeneity that emerges during early embryogenesis could at least partially be explained by DNA methylation. In contrast to gene expression, DNA methylation exists in a binary state for each CpG dyad, and the transition of its state during each cell cycle can be described quantitatively [13]. In addition, DNA methylation can be measured at a single CpG resolution in single cells [12,[14][15][16]. Therefore, it is promising to elucidate the set of instructions causing such epigenetic heterogeneity during early embryogenesis by building a mathematical model.
Several mathematical models have been built to quantitatively describe the DNA methylation state transition across cell cycles [17][18][19][20][21][22][23][24]. Two models built in the early 1990s specified that the DNA methylation state in a CpG dyad can be one of three states, unmethylated, hemi-methylated, and methylated, and that the type of DNA methylation state transition can be either de novo methylation or methylation maintenance [17,18]. Genereux et al. and Sontag et al. improved the models by considering de novo methylation on different DNA strands separately [19,20]. Fu et al. and Arand et al. considered the discriminates among the DNA methyltransferases (DNMTs) based on hairpin bisulfite sequencing (hairpin-BS-seq) data [21,22]. McGovern et al. and von Meyenn et al. further incorporated 5-hydroxymethylcytosine (5hmC) into models to reflect the contribution of active demethylation to the DNA methylation state transition [23,24]. Recently, Busto-Moner et al. enabled genome-wide quantification of subcell cycle kinetics of methylation maintenance using replication-associated bisulfite sequencing (Repli-BS-seq) data [25]. Researchers have proposed two types of methods to estimate the parameters of the models. One type of method relies on the acquisition of a DNA methylation equilibrium state. For example, Genereux et al. applied a maximum likelihood to estimate the parameters when the equilibrium state was reached [19], and Sontag et al. used a Markov chain model with a defined steady state [20]. The other type relies on the availability of specific genomics data. For example, Arand et al. calculated the efficiencies of DNMTs, i.e., the parameters of the transition model, based on hairpin-BS-seq data upon knockout (KO) of each DNMT [22]. von Meyenn et al. applied partial differential equations to estimate the parameters by using reduced representation bisulfite sequencing (RRBS) data, hairpin-BS-seq data, and TET-assisted bisulfite sequencing (TAB-seq) data [24]. Busto-Moner et al. used maximum likelihood estimation (MLE) to estimate the parameters quantifying the remethylation kinetics of nascent DNA post-replication relied on Repli-BS-seq data [25].
Although several mathematical models are available to describe the DNA methylation state transition, none of them can be applied to study DNA methylation heterogeneity during mammalian early embryogenesis using single-cell DNA methylome data, mainly for three reasons. First, the DNA methylation level changes dramatically across each cell cycle during early embryogenesis, demonstrating that the DNA methylation level is far from an equilibrium state. Therefore, models that incorporate parameter estimation methods based on an equilibrium state are not suitable. Second, active removal of DNA methylation is prevalent during mammalian early embryogenesis [1][2][3]; thus, models that do not consider active demethylation cannot be applied to this process. Third, due to technical limitations, technologies for profiling 5-methylcytosine (5mC) and 5hmC states together within the same cell are still not available, so models requiring both 5mC and 5hmC information are not suitable. Overall, a new DNA methylation state transition model is needed to study DNA methylation heterogeneity during mammalian early embryogenesis.
In this study, we developed a DNA methylation state transition model, MethylTransition, for characterizing methylation changes during one or a few cell cycles at singlecell resolution. MethylTransition introduces a methylation state ratio vector with 5 discrete states to describe the overall pattern of DNA methylation states for a given cell. To link the two methylation state ratio vectors before and after a cell cycle, MethylTransition uses a transition matrix comprising 3 parameters separately representing the probabilities of DNA methylation maintenance, active demethylation, and de novo methylation. The model estimates the parameters via a matrix approximation strategy with the Newton-Raphson method. Unlike previous models, MethylTransition does not need an equilibrium state or a 5hmC profile. Therefore, it is suitable for the elucidation of DNA methylation heterogeneity during mammalian early embryogenesis based on single-cell bisulfite sequencing (scBS-seq) data. We applied MethylTransition to scBS-seq data during human pre-implantation embryogenesis, and we found that the DNA methylation heterogeneity that emerges is largely determined by the initial DNA methylation state at the zygote stage (i.e., the 1-cell stage) and by a set of instructions represented by the model parameters. We further discussed the potential regulatory effects of DNA methylation heterogeneity in early embryos. The source code for Methyl-Transition can be found at https://github.com/TongjiZhanglab/MethylTransition [26].

Computational framework of MethylTransition
To quantitatively characterize the DNA methylation state transition across a single cell cycle based on scBS-seq data, we developed MethylTransition, a probabilistic model with robustly estimated parameters. MethylTransition relies on the assumption that changes in the DNA methylation state at a CpG site across a single cell cycle occur in three steps: passive demethylation during DNA replication, active DNA methylation alteration by DNA methylation-modifying enzymes, and DNA methylation state combination during the combination of non-sister chromatids (Fig. 1a). A CpG site contains two CpG dyads in a diploid cell, while a CpG dyad contains two complemented CpG dinucleotides (Additional file 1: Fig. S1a). To illustrate the state transition probability at each step, we used 1 as a symbol to represent methylated CpG dinucleotides and 0 to represent unmethylated CpG dinucleotides. Thus, the DNA methylation state for a CpG dyad can be one of the following types: unmethylated (0-0), hemi-methylated (0-1 or 1-0), or fully methylated (1-1). In the first step, i.e., DNA replication, MethylTransition assumes that the newly synthetized strand is unmethylated. The transition probabilities among DNA methylation states for a CpG dyad are displayed in the upper panel of Fig. 1b (see the "Methods" section for details), where a represents the probability of parental DNA in one of two daughter strands and is 0.5 for the symmetric division. In the second step, i.e., enzyme action, the DNA methylation state in a CpG dyad can be changed by different enzymes. According to the distinct roles of DNA methylation-modifying enzymes, MethylTransition denotes the probability of de novo methylation on an unmethylated CpG dyad as u, the probability of methylation maintenance on a hemi-methylated CpG dyad as p, and the probability of active Fig. 1 Computational framework of MethylTransition. a Schematic overview of DNA methylation changes during mitosis. MethylTransition relies on the assumption that changes in the DNA methylation state at a CpG site across a single cell cycle occur in three steps: passive demethylation during DNA replication, active DNA methylation by DNA methylation-modifying enzymes, and DNA methylation state combination during the combination of non-sister chromatids. The open circle represents a CpG dinucleotides with an unmethylated C, and the solid circle represents a CpG dinucleotides with a methylated C. In the first step, DNA replication, the newly synthetized strand (gray) is unmethylated. In the second step, enzyme action, the DNA methylation state in a CpG dyad can be changed by different enzymes. In the third step, nonsister chromatid combination, the observed DNA methylation state of a CpG site in a single cell is the combination of the DNA methylation states from both homologous chromosomes for diploid organisms. b Transition probabilities among DNA methylation states during DNA replication and enzyme action. We used 1 as a symbol to represent a CpG dinucleotides with a methylated C and 0 to represent a CpG dinucleotides with an unmethylated C. The DNA methylation state for a CpG dyad can be unmethylated (0-0), hemi-methylated (0-1 or 1-0), or fully methylated (1-1). The upper panel shows the transition probabilities among states during DNA replication, where a represents the probability of parental DNA being present in one of two daughter strands and is 0.5 for the symmetric division. The lower panel shows the transition probabilities among states during enzyme action, where u is the probability of de novo methylation on an unmethylated CpG dyad, p is the probability of methylation maintenance on a hemi-methylated CpG dyad, and d is the probability of active demethylation on a methylated or hemi-methylated CpG dyad demethylation on a methylated or hemi-methylated CpG dyad as d. The transition probabilities among DNA methylation states for a CpG dyad are shown in the lower panel of Fig. 1b (see the "Methods" for details). In the third step, i.e., non-sister chromatid combination, the observed DNA methylation state of a CpG site in a single cell is the combination of the DNA methylation states of both homologous chromosomes for diploid organisms. There are five possible methylation states at each CpG site: unmethylated (labeled S1), quarter-methylated (S2), half-methylated (S3), three-quarter-methylated (S4), and fully methylated (S5) (Additional file 1: Fig. S1a; see the "Methods" section for details). Then, MethylTransition generates a matrix to describe the transition probability among the five states in this step (see the "Methods" section for details). Overall, MethylTransition constructs a probability matrix (T A ) with introduced parameters to quantitatively characterize the DNA methylation state transition across a single cell cycle.
Within the framework of MethylTransition, a transition frequency matrix could theoretically be calculated by measuring the DNA methylation state of a CpG site before and after a cell cycle, and then the parameters u, d, and p could be estimated. However, due to the high dropout rate of public scBS-seq data, the percentage of CpG sites both sequenced in two scBS-seq datasets is generally less than 20% (Additional file 1: Fig.  S1b), and it is difficult to accurately assign a methylation state to a CpG site based on scBS-seq data. To overcome the above difficulties, we propose two assumptions. First, the DNA methylation states of CpG sites in the promoter of a gene can be represented by the average DNA methylation level in this promoter. This assumption is largely true for the mammalian genome [27], and the various DNA methylation levels in promoters are much smaller than those in other regions (Additional file 1: Fig. S1c). Second, the activity levels of DNA methylation-modifying enzymes in genomic regions with similar chromatin status are largely consistent. Based on these two assumptions, for genes with similar chromatin status at their promoters, each gene's promoter can be assigned one of the five methylation states (S1 to S5) using scBS-seq data, and the methylation state ratio vector of the genes in each methylation state can be used to present the overall pattern of methylation states for those genes within a given cell (see the "Methods" section for details). Given two separate methylation state ratio vectors before and after a cell cycle, the transition frequency matrix T B linking the two vectors can be calculated. Then, MethylTransition can estimate the parameters u, d, and p by minimizing the difference between the transition probability matrix T A and the calculated transition frequency matrix T B , while a cost function is constructed and optimized by using the Newton-Raphson method [28] (see "Methods" section for details). In addition to estimating the transition parameters with given initial and terminal DNA methylation states, MethylTransition can also calculate the methylation state ratio vector after a cell cycle with given initial DNA methylation states and a set of transition parameters. MethylTransition has been implemented as an R library (Additional file 1: Fig. S1d), and the source code has been released on GitHub (https://github.com/TongjiZhanglab/ MethylTransition).

Performance evaluation of MethylTransition
To evaluate the performance of MethylTransition, we used three different strategies. First, we tested the robustness of parameter estimation with respect to cell-to-cell variation. We applied MethylTransition to all available high-quality scBS-seq data at different stages of human pre-implantation embryos [12], and the parameters were estimated for each pair of two cells from the adjacent stages. The variances of parameters estimated by using different cells from the same stage were small (Fig. 2a), confirming the robustness of MethylTransition in parameter estimation with respect to cell-to-cell variation.
Second, we examined the robustness of parameter estimation for different dropout rates of scBS-seq data. We applied MethylTransition to a pair of scBS-seq datasets from two adjacent human pre-implantation stages, the zygote stage and the 2-cell stage, with average promoter DNA methylation levels of 16.8% and 15.0%, respectively. When all 22,056 promoters with sequenced CpG sites in both datasets were used to construct the methylation state ratio vectors, the parameters were estimated as u = 0.03, d = 0.23, and p = 0.76. To evaluate the effect of dropout on parameter estimation, we randomly sampled fractions of the 22,056 promoters to represent dropout rates from 10 to 90% (with 10% as the interval), and the subsets of genes were used to construct the Each dot represents the parameter value estimated using a pair of scBS-seq data. All available high-quality published scBS-seq data (more than 20% of all CpG sites detected) were used. b Box plots showing the robustness of MethylTransition's parameter estimation for datasets with simulated dropout rates from 10 to 90%. c Graph showing the biological meaning of MethylTransition's parameter estimation for a series of simulated DNA methylation transition conditions (gradually changes from decreasing condition to equilibrium condition and then to increasing condition) methylation state ratio vectors and then to estimate the parameters (see "Methods" section for details). Across the different dropout rates, the estimated parameters were consistent, with slightly increased standard deviations at high dropout rates (Fig. 2b). Our results demonstrate that the parameter estimation method of MethylTransition is robust even with a high dropout rate of scBS-seq data.
Third, we evaluated whether the estimated parameters could represent the levels of corresponding DNA methylation-modifying activities. The dramatic DNA methylation decrease between the human zygote stage and 2-cell stage was characterized by a higher probability of active demethylation (d = 0.23) and a lower probability of methylation maintenance (p = 0.76) than that in the equilibrium condition (d = 0.09 and p = 0.90, respectively), consistent with the findings of previous studies [29]. To confirm the biological meanings of the estimated parameters, we gradually altered the methylation state ratio vector at the 2-cell stage to represent a series of DNA methylation transition conditions (from decrease to equilibrium, then to increase), and the altered vectors were used to estimate the parameters (see the "Methods" section for details). Consistent with our expectations, the probability of active demethylation (d) decreased dramatically from DNA methylation decreasing conditions to the equilibrium condition; this decrease in d was coupled with rapid increases in the probability of methylation maintenance (p) and the probability of de novo methylation (u), which clearly increased from the equilibrium condition to increasing conditions (Fig. 2c). To further evaluate the interpretability of parameters, we extended the application of MethylTransition to bulk cell-level BS-seq data with methylation enzymes knockout (see the "Methods" section for details). We collected the BS-seq data in wild-type and two types of enzyme-knockout mouse embryos (Dnmt1 −/− , Dnmt3a/b −/− ) at the E8.5 stage [30]. The Dnmt1 −/− embryos displayed the smallest p, and the Dnmt3a/b −/− embryos showed the smallest u, which are in agreement with the known function of these DNMTs (Additional file 1: Fig. S2a). In addition, we collected the BS-seq data in wild-type and two types of enzyme-knockout zebrafish embryos (Tet3 −/− , Tet1/2/3 −/− ) at 24 h post-fertilization (hpf) [31]. Both Tet3 −/− embryos and Tet1/2/3 −/− embryos showed smaller d than wild-type embryos, and Tet1/2/ 3 −/− embryos showed the smallest d (Additional file 1: Fig. S2b). Our results confirmed that the estimated parameters can reflect the levels of DNA methylation-modifying activities.

Programmed DNA methylation heterogeneity in human pre-implantation embryos
To comprehensively investigate the cause of DNA methylation heterogeneity in early embryogenesis, we reanalyzed scBS-seq data in human pre-implantation embryos. We observed that 31.7~42.6% of the promoters showed DNA methylation state polymorphism among cells within a 4-cell stage embryo, and the percentage increased to 54.4~65.7% for 8-cell stage embryos, confirming that DNA methylation states are quite heterogeneous among cells in pre-implantation embryos [12]. To quantitively measure such heterogeneity, we used public scBS-seq data from a single embryo for each stage and calculated a heterogeneity score for each promoter based on the DNA methylation states from different cells in the same embryo (Additional file 1: Fig. S3; see the "Methods" section for details). When classifying the promoters into five classes based on their DNA methylation states in the zygote stage, we found that the DNA methylation heterogeneity in 4-cell and 8-cell stage embryos was significantly different among classes, while promoters with heavier DNA methylation states in the zygote stage tended to have higher DNA methylation heterogeneity in the 4-cell and 8-cell stages (Fig. 3a, b). Given such an association between the initial methylation states and the DNA methylation heterogeneity at later stages, we hypothesized that the DNA methylation heterogeneity in early embryos might be an intrinsic output of a programmed process following a set of cellular instructions. Student's t test was performed for comparisons between adjacent classes (***p value < 0.001; **p value < 0.01; *p value < 0.05;p value > 0.05). The embryo with the largest number of high-quality scBS-seq data (more than 20% of all CpG sites detected) at each stage were used to calculate the DNA methylation heterogeneity. c Estimated parameters of the first three cell cycles during human embryogenesis. The lines link the mean values of the parameters in each cell cycle. The gray shaded areas represent the 95% confidence intervals around the mean values. d, e Box plots of the predicted DNA methylation heterogeneity of promoters in the 4-cell stage (d) and 8-cell stage (e) during human embryogenesis. Promoters were classified into 5 classes based on their DNA methylation states in the zygote stage. For each box, the point indicates the average DNA methylation heterogeneity of the class. Student's t test was performed for comparisons between adjacent classes (***p value < 0.001). f 3D plots of predicted DNA methylation heterogeneity scores at the 8-cell stage for simulated values of p and d for five classes of promoters. Here, the parameter u is a constant (u = 0.03). The colors represent the predicted heterogeneity scores To test the above hypothesis, we applied MethylTransition to the first three cell cycles of human embryogenesis using scBS-seq data from a single embryo for each stage. To simplify the calculation, we used all genes' promoters to estimate the parameters. For each combination of cell pairs between embryos in adjacent stages, the parameters u, d, and p were estimated, with substantial robustness among combinations (Fig. 3c). The highest value of the parameter d, i.e., the probability of active demethylation, occurred during the transition between the 2cell and 4-cell stages, while the highest value of the parameter p, i.e., the probability of methylation maintenance, occurred during the transition between the 4cell and 8-cell stages. The parameter u, i.e., the probability of de novo methylation, was consistently low during this process. Given the initial methylation states of the promoters at the zygote stage and the estimated parameters of the first three cell cycles, MethylTransition was applied to predict the DNA methylation heterogeneity for each promoter in the 4-cell and 8-cell stages, respectively (Fig. 3d, e; see the "Methods" section for details). The promoters with heavier DNA methylation states in the zygote stage clearly displayed a higher predicted DNA methylation heterogeneity in the 4-cell and 8-cell stages, consistent with the trends in observed heterogeneity among classes with distinct initial methylation states. Such predictable heterogeneity demonstrates that the DNA methylation heterogeneity in early embryos is largely an intrinsic output of a programmed process with given initial methylation states and a few DNA methylation state transition parameters.
To investigate the impacts of initial methylation states and DNA methylation state transition parameters on this programmed heterogeneity, we calculated the predicted scores of DNA methylation heterogeneity in the 8-cell stage with simulated values of MethylTransition parameters for five classes of promoters with distinct initial methylation states (Fig. 3f). To simplify the simulation, we kept u = 0.03, which was supported by the absence of nuclear DNMT3a and DNMT3b during this process [32], and used the same values of p or d for all three cell cycles. We observed that for any given class, the predicted scores of DNA methylation heterogeneity varied dramatically for different combinations of p and d, while with a given set of p and d values, the predicted scores tended to be higher in classes with heavier initial methylation states. In a typical mammalian cell, most promoters are either unmethylated (S1) or fully methylated (S5); in addition, during a typical mammalian cell cycle, the probability of methylation maintenance (p) is very high, while the probability of active demethylation (d) is quite low [33,34]. From the simulation, we observed low predicted heterogeneity (score < 0.2) in the S1 and S5 classes for combinations of high p (> 0.94) and low d (< 0.05) values, indicating that DNA methylation heterogeneity is unlikely to be introduced during the cell cycle in a typical mammalian cell. Unlike typical mammalian cells, the cells in early embryos display clearly higher d and lower p values, which can intrinsically cause initial methylation state depended on DNA methylation heterogeneity during early embryogenesis. Taken together, our results indicate that the largely programmed DNA methylation heterogeneity in early embryos is a natural result of the unique probabilities of methylation maintenance and active demethylation that occur during this process.

Association of sequence and epigenetic features with DNA methylation heterogeneity
Although the DNA methylation heterogeneity in early embryos was largely predictable by MethylTransition with a few parameters, some promoters showed substantially higher or lower scores of observed methylation heterogeneity than predicted, especially promoters in the S1 and S2 classes (Fig. 4a, Additional file 1: Fig. S4a), indicating that additional features may have significant impacts on the DNA methylation heterogeneity of promoters in these classes. In order to investigate which sequence or epigenetic Fig. 4 Impacts of sequence and epigenetic features on DNA methylation heterogeneity. a Differences between the observed and predicted DNA methylation heterogeneity scores in human 8-cell stage embryos. Promoters were classified into 5 classes based on their DNA methylation states in the zygote stage. The size of each solid circle represents the number of promoters. Each gray box shows the 1st and 3rd quartiles of the predicted heterogeneity scores in each class. b Bar plot of the importance of sequence and epigenetic features for DNA methylation heterogeneity at the 8-cell stage, as determined by random forest analysis. The features include CpG ratio, chromatin accessibility, and H3K4me3 and H3K27me3 levels at the 8-cell stage. c Box plots of the differences in CpG ratio between different categories of promoters in the S1 and S2 classes. The promoters in the S1 and S2 classes were divided into three categories: higherheterogeneity gene promoters (HHGs; with observed heterogeneity higher than the 3rd quantile of the predicted heterogeneity score of the class), model-predictable heterogeneity gene promoters (MHGs), and lower-heterogeneity gene promoters (LHGs; with observed heterogeneity lower than the 1st quantile of the predicted heterogeneity score of the class). Student's t test was performed for comparisons between adjacent categories (***p value < 0.001). d Estimated parameters for different groups of promoters with distinct CpG ratios during the first three cell cycles of human embryogenesis. The promoters were grouped into three groups with high (≥ 0.6), moderate (between 0.4 and 0.6), or low (< 0.4) CpG ratios. The lines link the mean values of the parameters for each cell cycle. The gray shaded areas represent the 95% confidence intervals around the mean values. Red indicates the parameter u, blue indicates the parameter d, and yellow indicates the parameter p. e Box plots of the differences in the mean squared error between the original prediction obtained using uniform sets of u, d, and p for all promoters and three conditional predictions based separately on CpG ratio grouping, chromatin accessibility grouping, and H3K4me3 signal grouping. The mean squared error was calculated as the average squared difference between the predicted methylation heterogeneity score and the observed score. Student's t test was performed for comparisons between the original prediction and each of the conditional predictions (***p value < 0.001; **p value < 0.01) features may strongly associate with DNA methylation heterogeneity, we performed random forest analysis to identify relevant features from sequence patterns (CpG ratio), chromatin accessibility, and histone modification patterns (H3K4me3, H3K27me3) in human pre-implantation embryos (see the "Methods" section for details). Among these features, the CpG ratio showed much greater relevance to promoter DNA methylation heterogeneity than other features. Both chromatin accessibility and H3K4me3 signals showed moderate relevance (Fig. 4b). For the S1 and S2 classes, the promoters with lower heterogeneity scores than predicted showed significantly higher CpG ratio, chromatin accessibility, and H3K4me3 signals than those with heterogeneity scores similar to the predicted scores (Fig. 4c, Additional file 1: Fig. S4b, c), suggesting that the limited performance of MethylTransition for the S1 and S2 classes is likely due to these sequence and epigenetic features. To examine whether DNA methylation state transition parameters are variable among promoters with distinct CpG ratios, we divided promoters into three groups with high (≥ 0.6), moderate (between 0.4 and 0.6), or low (< 0.4) CpG ratio and applied MethylTransition to the first three cell cycles of human embryogenesis based on the three groups of promoters separately. The sets of the estimated parameters u, d, and p were different among groups of promoters with distinct CpG ratios (Fig. 4d). The group of promoters with a high CpG ratio showed lower values of u and p, together with higher values of d, than the other two groups, consistent with the findings of previous studies that CpG islands tend to be protected from DNA methylation [35]. We further predicted the DNA methylation heterogeneity for each promoter in the 8-cell stage using parameter values estimated based on CpG ratio grouping, and the predicted heterogeneity showed a significantly smaller mean squared error to the observed than the original prediction obtained using uniform u, d, and p values for all promoters (Fig. 4e). In addition to assessing DNA methylation state transition parameters among promoters with distinct CpG ratio, we also calculated the parameters among promoters with distinct chromatin accessibility or distinct H3K4me3 signals (Additional file 1: Fig. S4d, e). The heterogeneity predicted using parameters estimated based on chromatin accessibility or H3K4me3 signal grouping also showed a significantly smaller mean squared error to the observed than the original prediction (Fig. 4e). Taken together, our results demonstrate that the features of CpG ratio, chromatin accessibility, and distinct H3K4me3 signals are strongly associated with DNA methylation heterogeneity by diverse DNA methylation state transition parameters.

Potential regulatory effects of DNA methylation heterogeneity
As DNA methylation at promoters plays important roles in gene transcriptional regulation, we hypothesized that the largely programmed DNA methylation heterogeneity at promoters can affect the heterogeneity of gene expression. To test this hypothesis, we first calculated the correlation between promoter DNA methylation heterogeneity scores and expression heterogeneity scores in human 8-cell stage embryos (see the "Methods" section for details), and the two types of heterogeneity scores indeed showed a positive correlation (cor = 0.18; Fig. 5a). Next, we investigated the temporal association between promoter DNA methylation heterogeneity and expression heterogeneity. The genes with high DNA methylation heterogeneity promoters at the 4-cell stage showed significantly larger expression heterogeneity increasement from that stage to the 8-cell stage than other genes (Additional file 1: Fig. S5a). Differently, the genes with high expression heterogeneity at the 4-cell stage showed similar promoter DNA methylation heterogeneity increasement from that stage to the 8-cell stage with other genes (Additional file 1: Fig. S5b), which implied that DNA methylation heterogeneity at promoters might affect the heterogeneity of the gene expression. Due to restrictions preventing the editing of human embryos, we then used mouse embryos to further clarify the effects of DNA methylation heterogeneity on expression heterogeneity. Similar to the case in human embryos, the DNA methylation heterogeneity that emerged during mouse early embryogenesis was also largely determined by the initial DNA methylation state at the zygote stage (Additional file 1: Fig. S6). In this study, we generated singlecell RNA-seq data for cells in the 8-cell stage from Stella −/− mouse embryos, which have been reported to have much higher DNA methylation levels than normal embryos [36]. Upon comparing the expression heterogeneity scores of Stella −/− mouse embryos to those of normal embryos, we identified 6.97-fold more genes with significantly elevated heterogeneity scores in Stella −/− embryos than genes with reduced scores including genes encoding reported lineage-determining factors of ICM and TE (Neat1 [9], Tfap2c [37], Yap1 [38,39]) (Fig. 5b). The genes with elevated expression heterogeneity scores in Stella −/− embryos showed significantly lower initial DNA methylation levels at promoters in the zygote stage in normal mouse embryos than other genes (Fig. 5c), indicating that increases in initial DNA methylation levels can increase the expression heterogeneity in later stages, especially for genes with initially minimally methylated promoters. Considering that promoters with heavier DNA methylation in the zygote stage tended to have higher DNA methylation heterogeneity in later stages, the increased expression heterogeneity in Stella −/− embryos might be mediated by the largely programmed DNA methylation heterogeneity at promoters. We further hypothesized that the largely programmed DNA methylation heterogeneity at promoters might contribute to the first cell fate determination. To test this hypothesis, among 7471 expressed genes in ICM or TE, we defined 899 ICM-or TEspecific genes as the list of genes related to the first cell fate determination in humans (see the "Methods" section for details). At the 8-cell stage, that list of genes showed significantly higher DNA methylation heterogeneity at promoters, together with higher expression heterogeneity, than other expressed genes (Fig. 5d, e), indicating that the separation between the ICM and TE could at least partially be explained by the emerged expression heterogeneity and by largely programmed DNA methylation heterogeneity at promoters. Among the list of genes related to the first cell fate determination, 64 genes showed both high DNA methylation heterogeneity at promoters and expression heterogeneity at the 8-cell stage ( Fig. 5f and Additional file 2: Table S1), including the known ICM marker genes GDF3 [40], CUBN [41], and IGF1 [42] and the known TE marker gene GCM1 [43]. The clear association between DNA methylation heterogeneity and the first cell fate determination suggests that consideration of DNA methylation heterogeneity may provide clues for identifying driving factors of driving cell fate determination during human pre-implantation embryogenesis.

Discussion
Although expression and epigenetic heterogeneity occurring before the first cell fate determination in mammalian early embryogenesis has been suggested to be a determinate result of a program [10][11][12] (i.e., a process with initial states and a set of cellular instructions), to the best of our knowledge, no studies have been designed to illustrate such a program. In this study, we took advantage of the quantitative features of scBSseq data to elucidate the set of instructions causing DNA methylation heterogeneity during early embryogenesis by building MethylTransition, a mathematical model that enables characterization of the DNA methylation state transition during single or a few cell cycles. The results of this study clearly demonstrate that DNA methylation heterogeneity at promoters in early mammalian embryos is largely an intrinsic output of a programmed process with given initial methylation states and a few DNA methylation state transition parameters. It should be clarified that the cell cycle itself does not naturally result in DNA methylation heterogeneity. Calculations based on MethylTransition showed that DNA methylation heterogeneity is unlikely to be introduced during the cell cycle in typical adult mammalian cells, which have different initial methylation state distributions and DNA methylation state transition parameters than cells in early mammalian embryos. This study illustrates that the largely programmed DNA methylation heterogeneity in early mammalian embryos is a determinate result of the unique probabilities of methylation maintenance and active demethylation that occur during early embryogenesis.
The process of the first cell fate determination in early mammalian embryos is associated with distinct expression and epigenetic patterns between ICM and TE cells [6,7]. This study showed a clear association between DNA methylation heterogeneity and the first cell fate determination, and the results suggest that the expression heterogeneity that emerges before the first cell fate determination could be at least partially caused by largely programmed DNA methylation heterogeneity. In addition to DNA methylation, multiple layers of factors participate in transcriptional regulation. A previous model study based on mouse embryogenesis data proposed that the first cell fate determination-related expression heterogeneity is initiated by random segregation at each cleavage division and strengthened by competing transcriptional circuits [5]. In human embryos, the major phase of zygotic genome activation begins at the 8-cell stage, much later than in mouse embryos (in which it occurs at the 2-cell stage) [44], indicating that humans have a larger time window for the accumulation of DNA methylation heterogeneity before the large-scale transcription begins. As a supplement to that model study, this study suggests that the initial source of expression heterogeneity in human embryos could be at least partially mediated by the largely programmed DNA methylation heterogeneity at promoters. We suspect that consideration of DNA methylation heterogeneity could provide valuable clues for identifying driving factors of the first cell fate determination during human pre-implantation embryogenesis.
Unlike other mathematical models, MethylTransition was designed to characterize the methylation state changes that occur during one or a few cell cycles at singlecell resolution, and its application could be extended to other biological processes with dramatic DNA methylation changes, such as oogenesis [45]. Despite its unique features, MethylTransition has two technical limitations. First, due to the high dropout rates of public scBS-seq data, MethylTransition relies on the assumption that the activity levels of DNA methylation-modifying enzymes in a cell are constant across the genome, which is inconsistent with numerous reports that both sequence and epigenetic features affect the activity of these enzymes [25,[46][47][48][49]. This limitation could be partially resolved by applying this framework to a subset of promoters with similar sequence or epigenetic features. Second, MethylTransition assumes that de novo methylation, methylation maintenance, and active demethylation are three independent events. However, this may not always be true, as crosstalk between de novo methylation and methylation maintenance has been reported [50][51][52]. Those studies indicated that the functions of the DNA methylation-modifying enzymes are complex and not independent. Furthermore, some factors, for example, Stella, have distinct roles in the modulation of DNA methylation at different mouse development stages [36,53]. Although the parameters of MethylTransition integrated the function of various enzymes to represent the overall levels of DNA methylation-modifying activities, incorporating the crosstalk of those activities may further improve the interpretability of this model. Future versions of MethylTransition could be extended to consider the conditional probabilities among the three DNA methylation-modifying activities.

Conclusions
To quantitatively characterize the DNA methylation state transition during mammalian early embryogenesis, we presented MethylTransition, a novel DNA probabilistic model for characterizing methylation changes during one or a few cell cycles at single-cell resolution. By applying MethylTransition to scBS-seq data during human pre-implantation embryogenesis, we elucidated that the DNA methylation heterogeneity that emerges at promoters is a determinate result of the unique probabilities of methylation maintenance and active demethylation that occur during early embryogenesis. This study further suggested that consideration of programmed DNA methylation heterogeneity provides valuable clues for identifying driving factors of the first cell fate determination.

Computational framework of MethylTransition
We used 1 as a symbol to represent methylated CpG dinucleotides and 0 to represent unmethylated CpG dinucleotides. Thus, the DNA methylation state for a CpG dyad in one chromosome could be one of the following types: unmethylated (0-0), hemi-methylated (0-1 or 1-0), or fully methylated (1-1). The calculated ratios of CpG dyads in these four methylation states are x 1 , x 2 , x 3 , and x 4 , and the ratio vector M s is: MethylTransition relies on the assumption that changes in the DNA methylation state at a CpG site across a single cell cycle occur in three steps. In the first step, subsequent to the passive demethylation of DNA with DNA replication in the cell, the ratio vector M s changes to M r , which can be represented as: where a represents the probability of parental DNA in one of two daughter strands and is 0.5 for the symmetric division. M r is the ratio vector after DNA replication.
In the second step, the DNA methylation state in a CpG dyad can be altered by DNMTs or by DNA demethylases. We denoted the probability of maintaining methylation at a hemi-methylated CpG site as p, the probability of de novo methylation at a CpG site as u, and the probability of active demethylation at a methylated CpG site as d. The change in DNA methylation under the actions of various enzymes can be expressed as: where M e represents the ratio vector after actions have been exerted by various enzymes.
In the third step, for diploid organisms, the observed DNA methylation state of a CpG site in a single cell is the combination of the DNA methylation states from both homologous chromosomes. Such combination results in five possible DNA methylation states at each CpG site: unmethylated (labeled S1), quartermethylated (S2), half-methylated (S3), three-quarter-methylated (S4), and fully methylated (S5). These states can be experimentally measured, and the ratios of CpG sites in these five methylation states (S 1 , S 2 , S 3 , S 4 , and S 5 ) can be calculated. Thus, the DNA methylation state changes from mother cell to daughter cell are represented by: where S 1 ′ , S 2 ′ , S 3 ′ , S 4 ′ , and S 5 ′ represent the ratios of CpG sites in the five DNA methylation states in a daughter cell, and x ij indicates the probability that the DNA methylation states are changed from state j in the mother cell to state i in the daughter cell. The element in the ith row and the jth column of the probability matrix T re • T sr described above is denoted as t ij . The relationships between x ij and t ij are calculated as follows: Based on the above three steps, the changes in DNA methylation states during one cell cycle can be represented by the above probability matrix T A with a few parameters. Then scBS-seq data are used to infer the ratio of each DNA methylation state within a single cell. To imitate the DNA methylation state transition before and after the cell cycle, the state transition frequency matrix (T B ) is calculated using scBS-seq data of a cell from the early stage (mother cell) and a cell from the adjacent late stage (daughter cell) as follows: where o ij is the observed frequency at which the DNA methylation state is changed from state j in the mother cell to state i in the daughter cell. MethylTransition estimates the parameters u, d, and p by minimizing the difference between the transition probability matrix T A and the calculated transition frequency matrix T B . We defined a cost function as the square of the Euclidean distance between T A and T B [28]: The best parameter was obtained by minimizing J(u, d, p). Although it is unrealistic to expect an algorithm to find global minima, many numerical optimization techniques can be applied to find local minima. According to the definition of probability, the parameters u, d, and p range from 0 to 1. Considering the reliability and efficiency of the algorithm, we used the Newton-Raphson method as the optimization algorithm.
As the estimates of parameters with different initial values in the Newton-Raphson method are distributed around the true value, MethylTransition randomly selects three numbers in the range of 0 to 1 as the initial values of u, d, and p to estimate the parameters and repeats the analysis 100 times. The parameters that minimize the cost function value are prioritized.

Calculation of the DNA methylation state ratio vector
We assigned a DNA methylation state S to each gene promoter region (± 2 kb around the transcription start site (TSS)) for a given cell, as follows: where m is the average methylation level of the promoter. To avoid the missing value problem of scBS-seq data, the promoters with state as NA are removed from the parameter estimation. The DNA methylation state ratio vector contains 5 items representing the fractions of the promoters assigned as S1, S2, S3, S4, and S5, and the vector is used to present the overall pattern of methylation states for a given cell.

Performance evaluation for the robustness of subsampling
We calculated the DNA methylation state vectors of two adjacent human preimplantation stages based on two scBS-seq datasets (GSM2481558 and GSM2481533).
To evaluate the effects of dropout on parameter estimation, we used a bootstrapping approach. Briefly, we randomly sampled fractions of 22,056 promoters to represent the dropout rates from 10 to 90% (with 10% as the interval) with 100 rounds of replacements for each dropout rate. In each sampling, the selected promoters were used to construct a methylation state ratio vector and a transition frequency matrix, which were used to estimate a set of parameters.

Performance evaluation for biological meanings
We used the original transition frequency matrix calculated based on the two scBS-seq datasets (GSM2481558 and GSM2481533) as the decreased methylation transition condition, and we used an identity matrix to represent the equilibrium condition. To simulate a gradual change in transition conditions from decreasing to equilibrium, we generated 9 transition frequency matrixes to represent the linear switch from the original transition frequency matrix to an identical matrix with equal intervals. We further transposed and scaled the original transition frequency matrix as the increased transition condition. To simulate a gradual change in transition conditions from equilibrium to increasing, we generated 9 transition frequency matrixes to represent the linear switch from the identical matrix to the transposed and scaled matrix with equal intervals. We estimated the sets of parameters based on the original transition frequency matrix, the identical matrix, the transposed and scaled matrix, and 18 generated transition frequency matrixes.

Extension of MethylTransition on bulk-cell BS-seq data
Each high coverage CpG site (coverage > 4) in a given bulk cell BS-seq data was assigned a DNA methylation state S, as follows: where m b is the methylation level of the CpG site. The sites with state as NA are removed from the parameter estimation. The transition frequency matrix was calculated using two paired BS-seq data. We assigned an approximate number of cell cycles based on the knowledge of development timing (n = 5 for 8-cell to E8.5 mouse embryo and n = 10 for epiboly to 24 hpf zebrafish embryo). The logarithm transformation of the transition frequency matrix was used to estimate the parameters during multiple cell cycles, with the assumption that the parameters of each cell cycle are consistent.

Calculation of DNA methylation heterogeneity and expression heterogeneity
The DNA methylation heterogeneity of a given promoter in a pre-implantation embryo represents the promoter's polymorphism of methylation states across cells in the same embryo. For each promoter, we calculated the proportions of cells belonging to each of the five methylation states. Then, we calculated the inequality of the methylation state distribution combined with the Gini index to quantitively measure the heterogeneity (Additional file 1: Fig. S3). If x i is the proportion of the methylation state i, and there are n states (n = 5), then the DNA methylation heterogeneity H methyl is given by: The expression heterogeneity of a given gene in a pre-implantation embryo represents the polymorphism of gene expression across cells. For each gene, we calculated the expression heterogeneity H expr as the squared coefficient of variation (CV 2 ) of the expression level across the cells in the same embryo.

Prediction of DNA methylation heterogeneity
With given parameters (u, d, p) and an initial DNA methylation state ratio vector (M i ), the DNA methylation transition matrix (T (u, d, p) ) and the terminal DNA methylation state ratio vector (M t ) can be calculated by the equations mentioned in the "Computational framework of MethylTransition" section. For each promoter, the probability that its DNA methylation state at the zygote stage will change to any of the five states in the 2-cell stage can be calculated. Then, a random sampling method (based on the R function sample()) was used to assign each promoter a state in each of the 2-cell stage cells based on the calculated probability and the initial DNA methylation state in the zygote stage cell. For different cells in the same stage, different random seeds were used. The procedures for the prediction of the state transition from the 2-cell stage to the 4-cell stage and those for prediction of the state transition from the 4-cell stage to the 8-cell stage were the same. For each promoter, the DNA methylation heterogeneity H methyl was calculated as mentioned above with the predicted DNA methylation states for each cell in 4-cell or 8-cell stage embryos.

Identification of relevant features of DNA methylation heterogeneity
The CpG ratio was calculated for each promoter, as previously described [54]. The chromatin accessibility on each promoter was also calculated using a published assay for transposase-accessible chromatin using sequencing (ATAC-seq) data for human 8cell stage embryos [55]. The H3K4me3 and H3K27me3 signals on each promoter were derived from published ChIP-seq data from human 8-cell stage embryos [56]. Based on the differences between the observed and predicted DNA methylation heterogeneity scores for the gene promoters, the promoters in classes S1 and S2 were divided into three categories: higher-heterogeneity gene promoters (HHGs; with observed heterogeneity higher than the 3rd quantile of the predicted heterogeneity score of the class), model-predictable heterogeneity gene promoters (MHGs), and lower-heterogeneity gene promoters (LHGs; with observed heterogeneity lower than the 1st quantile of the predicted heterogeneity score of the class). With the values for the above sequence and epigenetic features and the category labels, we performed random forest analysis using the RandomForestRegressor function in the Python package sklearn [57] to evaluate the importance of each feature to the DNA methylation heterogeneity.

Identification of human first cell fate determination-related genes
We first excluded genes that were not expressed in blastocysts (genes with FPKM values less than 1 in ICM and TE cells). We then identified differentially expressed genes between ICM and TE cells using DESeq2 with a threshold of p value < 0.05 and fold change > 4. These differentially expressed genes were considered as the genes related to human first cell fate determination.
Single-cell RNA-seq library generation and sequencing of Stella −/− mouse embryos Stella −/− mice have been described previously [36]. RNA was isolated from each single cell of two 8-cell stage Stella −/− embryos with the Smart-seq2 protocol. To generate RNA sequencing libraries, a KAPA Hyper Prep Kit was used following the manufacturer's instructions. The adapters used were from a KAPA Single-Indexed Adapter Kit. Paired-end 150-bp sequencing was further performed on a HiSeq X Ten platform (Illumina) at Novogene. The animal experimental procedures were performed according to the Tongji University Guide for the use of laboratory animals.