ChromDMM: a Dirichlet-multinomial mixture model for clustering heterogeneous epigenetic data

Abstract Motivation Research on epigenetic modifications and other chromatin features at genomic regulatory elements elucidates essential biological mechanisms including the regulation of gene expression. Despite the growing number of epigenetic datasets, new tools are still needed to discover novel distinctive patterns of heterogeneous epigenetic signals at regulatory elements. Results We introduce ChromDMM, a product Dirichlet-multinomial mixture model for clustering genomic regions that are characterized by multiple chromatin features. ChromDMM extends the mixture model framework by profile shifting and flipping that can probabilistically account for inaccuracies in the position and strand-orientation of the genomic regions. Owing to hyper-parameter optimization, ChromDMM can also regularize the smoothness of the epigenetic profiles across the consecutive genomic regions. With simulated data, we demonstrate that ChromDMM clusters, shifts and strand-orients the profiles more accurately than previous methods. With ENCODE data, we show that the clustering of enhancer regions in the human genome reveals distinct patterns in several chromatin features. We further validate the enhancer clusters by their enrichment for transcriptional regulatory factor binding sites. Availability and implementation ChromDMM is implemented as an R package and is available at https://github.com/MariaOsmala/ChromDMM. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
For over a decade, next-generation sequencing technologies have produced massive data amounts to quantify chromatin features, including nucleosomal histone modification locations, transcriptional regulatory factor (TRF) binding sites and chromatin accessibility (Boyle et al., 2008;Mardis, 2007;Park, 2009). These chromatinfeature signals are routinely formed as counts of aligned sequencing reads at consecutive non-overlapping genomic windows (or bins) along the entire genome or a short DNA stretch. These coverage signals at regulatory elements, such as enhancers, are often investigated to understand the underlying biological mechanisms in the regulation of gene expression. Moreover, the signals can be visualized as heatmaps by aligning them within a genomic window centred at the loci (see Fig. 4 as an example). The average aggregate patterns of the coverage signals, illustrated on top of the heatmaps, reveal the positional correlations and recurrent patterns in the signals. However, the set of analysed genomic regions can be biologically heterogeneous; in other words, it consists of multiple unknown subclasses. Therefore, the aggregate plot derived from all regions falsely displays the superposition of several different chromatin signatures. Consequently, we need a clustering method to reveal the subclasses.
The clustering method must consider the following properties of the chromatin-feature data. First, the data are heterogeneous containing sparse count data as well as varying coverage intensities and patterns. Second, the anchor positions of regulatory elements are typically uncertain; genomic regions need shifting, that is, the coverage signals need alignment with respect to each other to refine the aggregate patterns. Third, chromatin features can be asymmetric concerning the anchor points due to directional biomolecular mechanisms, such as transcription. Therefore, the coverage signals need strand-orientation (flipping).
Several methods have been proposed for the epigenetic data clustering, such as hierarchical clustering (Kundaje et al., 2012;Nielsen et al., 2012) and k-means (Groux and Bucher, 2019;Heintzman et al., 2007;Ye et al., 2011). The hierarchical clustering tool CAGT by Kundaje et al. (2012) groups chromatin profiles at functional genomic elements into clusters using k-medians algorithm. This procedure is followed by merging redundant clusters through the hierarchical agglomerative clustering utilizing either correlation or Euclidean distance. CAGT implements profile flipping but no shifting. Some clustering methods such as ChromaSig (Hon et al., 2008), CATCHprofiles (Nielsen et al., 2012) and ChExMix (Yamada et al., 2019) examine the chromatin-feature enrichment in the entire genome instead of clustering predefined sets of genomic elements. ChromaSig (Hon et al., 2008) is a clustering method that implements both shifting and flipping, and it assumes that the read counts are normally distributed. CATCHprofiles (Nielsen et al., 2012) is another hierarchical clustering approach combined with pairwise alignment. CATCHprofiles merges, aligns and orients the most similar profile pairs to remaining profiles iteratively based on correlation or Euclidean distance. This results in a very exhaustive search. ChExMix (Yamada et al., 2019) is designed to cluster the ChIP-seq or the higher resolution ChIP-exo (Rhee and Pugh, 2011) and ChIPnexus (He et al., 2015) read count footprints together with the DNA sequence information at the TRF binding sites. The footprints and DNA motifs are not equal at all binding sites, for example, due to the TRF of interest interacting with distinct sets of other regulatory proteins. ChExMix models read counts as being generated by a mixture of binding events and their subtypes along the entire genome. The model is formulated as a probabilistic mixture model with multinomial component distributions and assuming sparsityinducing Dirichlet priors on the mixture weights and binding event subtypes. The multinomial parameters, that is, the binding event positions, are assumed to follow a Bernoulli distribution. The model parameters are estimated using expectation-maximization (EM) algorithm, and as a result one obtains the responsibility of each binding subtype at each binding event in generating each sequenced read. ChExMix also considers the orientation of the footprints and allows small shifting. A probabilistic mixture model designed to cluster chromatinfeature signals at regulatory elements was introduced by Nair et al. (2014). The model is denoted as ChIP-partitioning and it considers the above-mentioned requirements for the chromatin-feature clustering method. ChIP-partitioning models the statistical variation in the coverage signals using independent Poisson distributions. Nair et al. (2014) demonstrated that ChIP-partitioning outperforms the hierarchical and k-means clustering methods, particularly when clustering low-coverage count data. However, next-generation sequencing data are typically overdispersed; the data variation is larger than expected by the Poisson distribution. Therefore, many overdispersed models have been proposed, for example, for RNA-seq data analysis (Robinson et al., 2010). Moreover, previous studies on clustering chromatin features do not provide rigorous probabilistic methods for clustering multiple chromatin features simultaneously. The previous methods also lack a principled method for determining the unknown number of clusters.
We propose a probabilistic clustering method ChromDMM that exploits the discrete, sparse, heterogeneous and overdispersed nature of the sequencing data. ChromDMM builds on the mixture of Dirichlet-multinomial compound distributions originally proposed for clustering microbial data (Holmes et al., 2012). We extend the model to account for the presence of multiple epigenetic coverage signals at the same genomic locus, so that each mixture component exhibits a set of Dirichlet-multinomial compound distributions. We also extend the model with the profile shifting and flipping features that can probabilistically account for the inaccuracies in the positions and strand-orientations of the clustered genomic elements. In addition, owing to the regularization of the mixture component parameters, ChromDMM can smooth the chromatin-feature patterns at successive bins along the genomic regions. Finally, our probabilistic model can naturally utilize the well-known model selection methods to determine the optimal number of clusters. The following Section 2 presents the ChromDMM model and its inference in detail. Section 3 analyses the performance of ChromDMM on simulated and real chromatin-feature data and compares its performance against ChIP-partitioning (Nair et al., 2014) and SPar-K (Groux and Bucher, 2019).

Materials and methods
The data for a chromatin feature across N genomic loci is represented as a N Â L matrix X ¼ ½x 1 ; . . . ; x N T , where x i ¼ ½x i1 ; . . . ; x iL T denotes the data for the ith genomic window. The length of x i is defined by the size of the genomic locations W and resolution B as L ¼ W=B. For example, data extracted in W ¼ 2000 base pair (bp) windows centred at the anchor points with the resolution B ¼ 40 bps results in coverage signals of length L ¼ 50. The element x ij denotes the number of sequencing reads whose starting position (5 0 end) is aligned to bin j of locus i. To be exact, x ij denotes, for example, the histone modification ChIP-seq read counts minus the sequencing-depth normalized control counts (see Supplementary Section S4.2). Collectively, the data for M chromatin features are represented as a N Â ML matrix X Ã ¼ ½X ð1Þ ; X ð2Þ ; . . . ;

Product Dirichlet-multinomial mixture model
The read counts x across the L bins are naturally modelled by the multinomial distribution with parameters p ¼ ½p 1 ; . . . ; p L T ( P L j¼1 p j ¼ 1). We further assume the multinomial parameters p are distributed according to a conjugate Dirichlet distribution with hyperparameters a. Marginalizing out the multinomial parameters from the joint distribution of x and p results in an overdispersed Dirichlet-multinomial compound distribution parameterized by a. The Dirichlet-multinomial distributions can be utilized as the component distributions in a mixture model for the probabilistic clustering.
Compared with the standard Dirichlet-multinomial mixture model (Holmes et al., 2012), we implement two extensions. First, we assume that the likelihood of x Ã is a product multinomial distribution, each multinomial with the chromatin-feature-specific parameters p ðmÞ . This enables modelling several chromatin features simultaneously. Second, we assume the parameters p Ã ¼ ðp ð1Þ ; . . . ; p ðMÞ Þ to have a mixture prior with K mixture components; each component k is a product of M Dirichlet distributions again with the chromatin-feature-specific hyperparameters a k ¼ ½a Dirichlet ðp ðmÞ ja ðmÞ k Þ; where p ¼ ðp 1 ; . . . ; p K Þ denotes the mixture weights. Holmes et al. (2012) showed that compounding the multinomial distribution with the Dirichlet mixture prior results in an analytically tractable likelihood. Similarly, in the case of the product-multinomial with the product-Dirichlet mixture prior, the parameters of the productmultinomial can also be marginalized analytically to derive a closedform expression for the likelihood of X Ã as (see Supplementary Section S1.4 for a detailed derivation) Instead of seeking to obtain the maximum-likelihood estimates for the model parameters, we adopt the Bayesian approach by introducing a prior distribution for the component parameters a Ã .
To account for the correlations between the (expected) read counts at consecutive bins along the chromatin signal, we define a regularized Gamma hyperprior for the mixture component parameters a Ã as where all a ðmÞ kj have their own independent Gamma prior with fixed shape g and rate parameters, and the regularization terms also have their own independent Gamma prior with shape g h and rate h parameters. Inclusion of the regulatory terms in the prior favours smooth mixture component parameters. A more detailed expression of the proportional distribution of the prior pða Ã Þ is shown in Supplementary Equation (S9) and an example of the effect of the regularization is demonstrated in Supplementary Section S1.6. Mixture models involve the latent cluster membership variables z; each observed x Ã i is associated with a corresponding unobserved categorical latent variable z i . The variable The proposed model is parameterized by h ¼ ða Ã ; pÞ and is presented as a directed acyclic graph in Supplementary Fig. S1 together with the distributions of individual components.

The EM algorithm
The posterior log pðX Ã jhÞ þ log pðhÞ cannot be maximized directly. Instead, the MAP estimates for h and the probabilistic cluster assignments are obtained by an iterative approach, the EM algorithm (Bishop, 2006). For the derivation of the EM algorithm assume a distribution for Z, qðZÞ. Then, the Jensen's inequality provides a lower bound for the posterior distribution where log pðX Ã ; ZjhÞ is the complete data log-likelihood and the constant term is independent on h. Assuming some initial estimates for the parameters h old and defining qðZÞ ¼ pðZjX Ã ; h old Þ, the lower bound (without the constant term) is where the expectation is wrt the posterior probabilities of the cluster assignments conditional on the current parameter estimates h old , that is, The likelihood term pðx ðmÞ i jhÞ is the Dirichlet-multinomial compound distribution for the mth chromatin feature. For a detailed derivation, see Supplementary Section S1.8.In the EM algorithm, E-steps and an M-steps are repeated, until convergence of the lower bound Qðh; h old Þ. In the Estep, the posterior probability that a sample i belongs to a cluster k given the current parameter estimates h old is obtained using the standard Bayes rule as i jz ik ¼ 1; h old Þ is the likelihood of the sample i conditioned with cluster k, that is, the product Dirichlet-multinomial distribution. The term pðz ik ¼ 1jh old Þ corresponds to the mixture weight p k . In the M-step, as a closed-form solution of a Ã that maximizes Qðh; h old Þ is unattainable, the lower bound is maximized wrt a Ã using Broyden-Fletcher-Goldfarb-Shanno (BFGS) method provided in R (Broyden, 1970). In addition, the component parameters a ðmÞ kj are constrained to be positive by a reparameterization k ðmÞ k ¼ log a m k and by re-defining the prior for k Ã accordingly using the multivariate change of variables method. For more details on deriving the equations for the model inference, see Supplementary Sections S1.7-S1.15. The steps of the EM algorithm are summarized in Algorithm 1. In the initilization, the cluster membership probabilities E½z ik are obtained with the soft k-means clustering (MacKay, 2003) on concatenated chromatin features (see details in Supplementary Section S4.1). For a given number of clusters, the EM algorithm is run multiple times each with different random initialization. Note that it is trivial to parallelize the computation across the multiple runs as well as across varying numbers of clusters.

Chromatin feature profile shifting and flipping
We extend the product Dirichlet-multinomial mixture model with shifting and flipping features. For profile shifting, we first define the maximum amount of shifting, for example, 400 bp, both upstream and downstream. With a given bin size (e.g. B ¼ 40bp), this results in S ¼ 2Â400bp 40bp þ 1 ¼ 21 possible shift states, where the shift state s ¼ Sþ1 2 corresponds to no shift. In addition, the length of the Dirichlet parameters a ðmÞ k is extended from L to L þ S À 1. When evaluating the likelihood model from Equation (1) for a shift state s, we use the corresponding L-length subset of the extended Dirichlet parameters for each mixture component k, denoted as a Ã ks ¼ ða Ã k;s ; a Ã k;sþ1 ; . . . ; a Ã k;sþLÀ1 Þ. For profile flipping, we either compute the likelihood model definition with a shift state s (using again the L-length subset of the Dirichlet parameters) if f ¼ 1, or reverse the order of the Dirichlet parameters if f ¼ 2. Formally, we denote the shifting and flippingaware likelihood model for a single genomic locus as pðx Ã ja Ã sf ; pÞ. For each locus, we can define prior probabilities for the shift and flip states. The prior shift state probabilities for the genomic locus i are denoted as n i ¼ ðn i1 ; . . . ; n iS Þ, where P S s¼1 n is ¼ 1. If the genomic loci and their anchor points are defined using ChIP-seq summits, then the prior for shift states can be defined, for example, as a pyramid-shaped prior that has the highest probability at the ChIPseq peak summit (corresponding to no-shift state) and linearly decreasing the prior to zero beyond the maximum shift state. Similarly, we can define prior flip state probabilities f i for each locus In ChromDMM with the shifting and flipping features, the latent cluster membership variables are re-defined as follows: z iksf ¼ 1 if the sample i originates from the cluster k, has shift state s and has strand-orientation f; otherwise, z iksf ¼ 0. These latent variables are stored in N Â K Â S Â 2 matrix (or tensor) Z. We show in Supplementary Section S2.3 that the EM algorithm can be derived similarly as in Section 2.2, resulting in the following lower bound for the posterior distribution of parameters h //E-step: 8 Compute pðZjX; h old Þ, i.e., Eðz ik Þ; 9 //M-step: 10 k ðÃ;newÞ ¼ argmax k Ã Qðh; h old Þ using BFGS 11 Update mixing weights p: where Note that the above mixture model can be applied (i) only with shifting, (ii) only with flipping or (iii) with both shifting and flipping. In the case of (i), we simply drop the index f and the corresponding sums and in the case of (ii), we simply drop the index s and the corresponding sums. For more details on the derivations of equations needed to infer the shifting and flipping-aware model, see Supplementary Section S2.
After learning the model parameters with the EM algorithm, we infer the final cluster assignmentk i for each sample i by marginalizing the shift and flip states. Similarly, we choose the final flip and shift states,f i ;s i , that maximize the posterior given the optimal clusterk i by marginalizing the shift and flip states, respectively,

Choosing the number of clusters and identifiability aspects
For probabilistic clustering methods, the Bayesian model selection is commonly used to guide the selection of an appropriate number of clusters K. While the exact computation of the marginal likelihood is impractical, we can directly apply the commonly used approximative methods, such as the Bayesian information criterion (BIC) (Schwarz, 1978) or the Akaike information criterion (AIC) (Akaike, 1973). There are inherent unidentifiability issues in ChromDMM results. Firstly, as in any clustering method, the inferred cluster labels can be switched between two clusters without affecting the clustering accuracy. Secondly, unless informative prior for strandorientation is provided, the flip state indexes (1 or 2) can always be reversed. For biological interpretation, the aligned and flipped profiles need to be visualized after clustering and compared with the underlying directionality of the genomic region, such as direction of transcription 3 0 ! 5 0 or 5 0 ! 3 0 , if known. The learned shift state is also affected by the learned flip state. While evaluating the performance of ChromDMM and other methods on simulated data, we consider these aspects.

Data simulation and choice for hyperparameters
We used simulated data to investigate the clustering accuracy of ChromDMM, ChIP-Partitioning and SPar-K when the data contain varying number of chromatin features and varying read coverages. For comparison, we repeated some experiments on the simulated data presented by Nair et al. (2014). We simulated data containing two clusters using the R-code presented in Nair et al. (2014) (see their Supplementary Material, page 13). We generated 1000 samples per cluster using low-coverage parameter values f ¼ 0.5 and f ¼ 1. We found that at these low-coverage parameter values, most of the simulated profiles were zero vectors. Thus, we first generated 10 000 samples for both clusters and randomly selected 1000 non-zero vectors for both clusters. The data generation was repeated 100 times. An example of simulated dataset is presented in Supplementary Fig. S8. The cluster-wise aggregate profiles are Gaussian-shaped with varying location of the mean and variance.
To generate more realistic simulated data, we first clustered data for four chromatin features (H3K4me1, H3K27ac, RNA polymerase II and MNase-seq) at 1000 enhancers from the ENCODE project We performed hyperparameter sweeps on the simulated data to determine robust default values for the ChromDMM model. Briefly, for the Gamma prior for the Dirichlet parameters a ðmÞ kj (parameterized by hyperparameters g and ), we observed that the results were not sensitive to hyperparameter values and conclude that the choice of g ¼ 1:1 and ¼ 0:1 results in a good performance ( Supplementary Fig. S5). We also performed the prior predictive checks using the ancestral sampling of the data from prior hyperparameters and demonstrated that the amount of variation generated from the prior is comparable to the variation in the real data ( Supplementary Fig. S6). Similarly, we chose the hyperparameter values for the regularization term h m k . The Gamma prior with mean 1 and variance 0.1 corresponding to hyperparameters g h ¼ h ¼ 10 resulted in a robust clustering performance (see Supplementary Fig.  S7 and Supplementary Section S4.5 for more details). We set the above hyperparameter values as defaults, but a user can, for example, perform prior predictive checks for his/her data and adjust the hyperparameters, if necessary.
We investigated the ability of AIC and BIC to choose the correct number of clusters (two) for the simulated data. We fitted ChromDMM with varying the number of clusters (from 1 to 3). The proportions of cluster numbers selected by AIC and BIC in 100 simulated datasets are presented in Supplementary Fig. S16a for 1000 samples and in Supplementary Fig. S16b for 6000 samples. We conclude that AIC and BIC detect the correct number of clusters more reliably when the coverage of the chromatin modifications and/or the number of samples increases, although BIC tends to underestimate the number of clusters. The computation times of the three methods (with default parameters) to cluster simulated data containing two clusters and two chromatin features both with coverage 100 were: half an hour (ChromDMM), ca. 10 min (ChIP-Partitioning) and seconds (SPar-K).

ChromDMM infers accurate clusters
We compared ChrommDMM against ChIP-partitioning and SPar-K (both applied with the default parameters) in clustering simulated data that were generated as in Nair et al. (2014). The clustering performance of ChromDMM exceeded the performance of ChIPpartitioning and SPar-K (Fig. 1). Similarly as in Nair et al. (2014), we also report the Pearson correlation coefficients between the true cluster-wise aggregate patterns and the inferred aggregate patterns (Supplementary Fig. S9). In general, the correlations are similar to values obtained by Nair et al. (2014). For the first cluster, the correlations obtained by ChromDMM are lower than for ChIPpartitioning, whereas for the second cluster, the correlations obtained by ChromDMM are slightly higher.
We compared ChrommDMM against ChIP-partitioning and SPar-K in clustering more realistic simulated data that contained two clusters and two chromatin features (H3K4me1 and RNA POL II). The clustering performance of ChromDMM exceeded the performance of ChIP-Partitioning ( Fig. 2 and Supplementary Fig. S10). SPar-K performed poorly in these comparisons, particularly when the coverages were low (Supplementary Fig. S11). Similar results were obtained on data containing only a single feature ( Supplementary  Fig. S12). For comparison, Figure 2 and Supplementary Fig. S10 present also results for an experiment where ChromDMM was fitted either on concatenated chromatin profile data or without the regularization term. The regularization improved the clustering performance especially when the coverage for the first chromatin feature (H3K4me1) was low, whereas the use of non-concatenated chromatin profile data resulted in only a marginal improvement in this simulation setting.

Clustering accuracy improves with the number of features
We experimented with the number of chromatin features; beginning from a single feature (H3K4me1), the number of features was increased to four by adding RNA Pol II, H3K27ac and MNase-seq. Supplementary Fig. S13a shows how the clustering accuracy increases together with the number of chromatin features for a signal coverage of 10. Supplementary Fig. S13b presents similar results for a varying coverage, where the coverage of the first set of chromatin features was 10 (H3K4m1, H3K27ac) and the coverage of the second set of chromatin features was 50 (RNA Pol II, MNase-seq). We conclude that for ChromDMM and ChIP-partitioning, the clustering performance increases as a function of the number of chromatin features, whereas for SPar-K the improvement is less consistent. Regardless of the number of chromatin features, ChromDMM obtains the best performance.

ChromDMM infers accurate shift and flip states
The simulated data were also artificially shifted and flipped as described in Supplementary Section S4.4. Briefly, the random shifts were constrained to be multiple of the data resolution (B ¼ 40 bp) and drawn from the Skellam distribution with mean zero and a variance that included the randomly sampled shifts between -400 and þ400 bp. Similarly, the flip states were sampled randomly with equal probability for both strand-orientations. For more details, see Supplementary Algorithm S3.
The clustering accuracy of ChromDMM, ChIP-partitioning and SPar-K was demonstrated on the randomly shifted and flipped simulated data. We experimented with four versions of ChromDMM: (i) ChromDMM with the regularization term and with the pyramidshaped shift state prior, (ii) same as (i) but with concatenated chromatin features, (iii) ChromDMM without the regularization term and with the pyramid-shaped shift prior and (iv) ChromDMM with a uniform prior for the shift states and with the regularization term. The clustering accuracies of the methods are presented in Fig. 3a. Methods considering the concatenated chromatin features, including ChIP-partitioning and SPar-K, performed poorly in these comparisons and notably they failed to improve their performance while increasing the coverage values. ChromDMM outperformed other methods and its clustering performance was further improved by both the informative shift state prior and the regularization, particularly when the coverage of either chromatin feature was low.
The methods were compared with their accuracy to correctly infer the shift and the flip states of the genomic regions. ChromDMM and ChIP-partitioning infer the most probable shift and flip states for each sample i from the latent variable probabilities shown in Equations (5) and (6), respectively, whereas SPar-K outputs the inferred shift and flip states separately. The flip state error was defined as the proportion of incorrectly inferred flip states in a given experiment (recall the identifiability aspects from Section 2.4). Similarly, the shift error for each sample was computed as the absolute difference between the true shift and the inferred shift in nucleotides. The average shift error over all N samples was reported as the final shift error for a single experiment.
The flip errors for the simulated shifted and flipped data containing two chromatin features and two clusters are presented in Supplementary Fig. S14. On average, the flip errors decreased as the coverages increased and they were lower for the ChromDMM methods compared with ChIP-partitioning and SPar-K. The flip errors were only slightly affected by whether the ChromDMM fit was inferred without the shift prior or without the regularization. The resulting shift errors for the simulated data are shown in Fig. 3b. Again, the average shift errors decreased as the coverages increased and the shift errors were lowest for the ChromDMM methods. The shift state inference of ChromDMM was further improved by the informative shift prior and the regularization of the mixture component parameters a Ã . In contrast to the other methods, the shift errors for ChIP-partitioning remained high even with large coverage values. This likely results from the cluster-assigned patterns drifting from the profile centre positions, that is, ChIP-partition selects a profile whose unimodal peak or valley between the two-modal peak is shifted far from the profile centre and aligns the rest of the profiles  Fig. S15d). The cluster patterns inferred by SPar-K also drift ( Supplementary Fig. S15e), whereas ChromDMM centres the peaks and valleys to the profile centres ( Supplementary Fig. S15c). This desirable behaviour of ChromDMM stems partly from the robustness of the probabilistic treatment and the shift state prior. Finally, we investigated the ability of AIC and BIC to choose the correct number of clusters (two) in the simulated shifted and flipped data ( Supplementary Fig. S17). In contrast to the simpler model studied in Section 3.1, the more complex ChromDMM model with a high number of parameters is heavily penalized by AIC and BIC, and thus require higher coverage and larger number of samples to detect the correct number of clusters.

ChromDMM reveals distinctive enhancer clusters
We applied ChromDMM, ChIP-partitioning and SPar-K with the flip and shift state inference to cluster ENCODE data containing 10 chromatin features extracted at enhancer regions. For the details of the data, preprocessing and the definition of the enhancers, see Supplementary Section S4.2 and Osmala and Lä hdesmä ki (2020). Based on the ChromDMM fit, the enhancers were assigned to the most probably clusters and their profiles were re-aligned based on the inferred shift and flip states, for example for visualization. As a result, ChromDMM separated enhancers into six clusters, each with distinctive and refined combinations of chromatin feature patterns. Three of the six clusters are visualized as heatmaps and aggregated patterns in Fig. 4 (the full set of clusters and chromatin features are presented in Supplementary Fig. S18). In contrast, ChIP-partitioning and SPar-K failed to identify distinctive patterns and to refine the profile alignment and strand-orientation (Supplementary Figs. S20 and S21).
The ChromDMM enhancer clusters possess characteristic combinations of chromatin feature pattern shapes, spacings and signal strengths. The first cluster has symmetric and high enrichment of histone modification and MNase-seq signals with a steep decline of the signals in the middle of the profiles, indicating a nucleosome-free region. In addition, the nucleosome-free region is surrounded by a regular array of well-positioned nucleosomes. In contrast, in the clusters 2, 3 and 4, the nucleosome-free region and the wellpositioning of the nucleosomes are obscured compared with the other clusters. Thus, the enhancers in these clusters may possess closed chromatin or mobile nucleosomes. The clusters 4-6 have asymmetricity in histone modification enrichment (clusters 4 and 5), in nucleosome positioning (clusters 5 and 6) and in RNA POL II occupancy (clusters 4 and 5). The asymmetricity in the RNA POL II ChIP-seq signal may reflect the direction of transcription. In addition, in the asymmetric clusters, the histone modifications are enriched on either of the two nucleosomes immediately flanking the anchor position (cluster 5) or spread widely (clusters 4 and 6).

Biological validation of the inferred clusters
The enhancer clusters revealed by ChromDMM, ChIP-partitioning and SPar-K were investigated for the enrichment of the binding sites of transcription factors (TFs) and other regulatory proteins, collectively referred to as TRFs. The ChIP-seq peaks for 220 TRFs were downloaded from ENCODE. For each TRF-cluster pair, a significance test for the enrichment of a given TRF at the cluster was performed by the GAT tool (Heger et al., 2013). A large majority of the enrichments were significant according to the q-value threshold 0.01. To reveal differences in the TRF enrichment between clusters, the fold enrichments were visualized as a heatmap, where the enrichments corresponding to q-value larger than 0.01 were masked out (see Supplementary Fig. S19 for ChromDMM clusters). The fold enrichments for TRFs which were significantly enriched in at least one ChromDMM cluster and simultaneously not enriched in at least one another cluster are presented in Fig. 5. For more details, see Supplementary Method S4.3. The enrichment of TRFs in ChromDMM enhancer clusters reveals the potential biological significance of the distinctive chromatin feature patterns. Clusters 3 and 4 with obscured nucleosomefree regions and nucleosome positioning have less enrichment of TRFs than the four other clusters. In contrast, the first cluster with symmetric and strong signals has an enrichment for a large number of TRFs. Similarly to cluster 1, cluster 5 with strong asymmetry in the histone modification and RNA POL II signals has a high TRF enrichment. Asymmetric cluster 6 with strong average H3K27ac, H3K9ac and DNase-seq signals differs from the other clusters with unique enrichment for RNA binding and processing-related proteins (HNRNPK, FUS) and TFs SMAD2 and YBX3. In addition, clusters 1 and 6 have enrichment for the largest component and core scaffold of the TFIID basal TF complex (TAF7). Moreover, clusters 1, 5 and 6 are enriched for Scaffold attachment factor B1 (SAFB), a protein that binds DNA regions that are bound to the nuclear scaffold. Interestingly, SAFB may be involved in attaching the base of the chromatin loops to the nuclear scaffold and serving as a molecular base to assemble a transcriptosome complex in the vicinity of the actively transcribed genes (Nayler et al., 1998). For comparison, the TRF enrichments at ChIP-partitioning and SPar-K clusters are visualized in Supplementary Fig. S22 and S23.

Conclusions
Exploring epigenetic datasets provides crucial information on key biological mechanisms such as gene regulation. An example of such data mining is the clustering of epigenomic signals and other chromatin features at regulatory elements, such as enhancers, to reveal the combinations of chromatin features with varying signal magnitudes and profile shapes. To appropriately account for the sparse, discrete, heterogeneous and overdispersed nature of the chromatin-feature data, probabilistic clustering methods have been developed.
We have proposed ChromDMM, a product Dirichletmultinomial mixture model that provides a probabilistic method to cluster multiple chromatin-feature coverage signals extracted from the same locus. By employing simulated data, we demonstrated that the accuracy of ChromDMM increases with the increasing number of chromatin features. This indicates the need for a principled approach that considers the multiple chromatin features simultaneously when clustering regulatory elements. Moreover, we demonstrated that ChromDMM outperforms the previous methods ChIPpartitioning and SPar-K in clustering accuracy, particularly when the chromatin-feature coverages are low. In addition, ChromDMM learns the shift and flip states more accurately compared with ChIPpartitioning and SPar-K. The accuracy of ChromDMM to infer the clusters and shift states is further improved by mixture component parameter regularization and an informative shift state prior. Finally, we confirmed that BIC and AIC can detect the correct number of clusters.
We illustrated that ChromDMM identifies clusters with distinct epigenetic patterns when applied to ENCODE data containing 10 chromatin features quantified at enhancers. Moreover, the identified clusters are enriched for different sets of TRFs, suggesting that the clusters may vary in their biological characteristics. ChromDMM may therefore be a valuable method to reveal potential functionally distinct subclasses of regulatory elements.