Inference of 3D genome architecture by modeling overdispersion of Hi-C data

Abstract Motivation We address the challenge of inferring a consensus 3D model of genome architecture from Hi-C data. Existing approaches most often rely on a two-step algorithm: first, convert the contact counts into distances, then optimize an objective function akin to multidimensional scaling (MDS) to infer a 3D model. Other approaches use a maximum likelihood approach, modeling the contact counts between two loci as a Poisson random variable whose intensity is a decreasing function of the distance between them. However, a Poisson model of contact counts implies that the variance of the data is equal to the mean, a relationship that is often too restrictive to properly model count data. Results We first confirm the presence of overdispersion in several real Hi-C datasets, and we show that the overdispersion arises even in simulated datasets. We then propose a new model, called Pastis-NB, where we replace the Poisson model of contact counts by a negative binomial one, which is parametrized by a mean and a separate dispersion parameter. The dispersion parameter allows the variance to be adjusted independently from the mean, thus better modeling overdispersed data. We compare the results of Pastis-NB to those of several previously published algorithms, both MDS-based and statistical methods. We show that the negative binomial inference yields more accurate structures on simulated data, and more robust structures than other models across real Hi-C replicates and across different resolutions. Availability and implementation A Python implementation of Pastis-NB is available at https://github.com/hiclib/pastis under the BSD license. Supplementary information Supplementary data are available at Bioinformatics online.

Because little experimental data is available to characterize the true population of 3D DNA structure, we first compare the different structural inference methods by using simulated data. We construct three ensembles of datasets with varying coverage, dispersion, and counts-to-distance mapping. All simulations use a consensus architecture obtained by running Pastis-MDS, applied to the first chromosome of the 75 th replicate of the KBM7 nearly haploid human cell line data from Rao et al. Rao et al. [2014] at 100 kb.
We generate simulated datasets using the model C ij ∼ NB(βd α ij , βr) . Note that we do not simulate biased data requiring ICE normalization, to focus on the architecture inference part. Since d ij is given by the consensus architecture used to simulate the counts, for all pairs of beads (i, j), the total number of counts in the count matrix (a.k.a. coverage) is on average K = β i,j d α ij ; hence we control the coverage in a simulation with the β parameter by setting β = K/ i,j d α ij . We generate the first ensemble of 100 datasets to study the influence of coverage. We use α = −3, which yields a count-to-distance mapping consistent with the one obtained from polymer physics theory [Grosberg et al., 1988, Lieberman-Aiden et al., 2009. We vary the parameter β such that the expected number of reads ranges between 10% and 100% of the original dataset (10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%), and we set the dispersion parameter r to be the one fitted as described above to the KBM7 Hi-C dataset (r = 49.9), obtaining 100 datasets by repeating each configuration 10 times with 10 random seeds.
The second collection of 100 simulated datasets is to study the influence of overdispersion. We keep α = −3 and set the parameter β such that the expected number of reads is 100% of the original dataset. We set the dispersion parameter r to be the one estimated on the original KBM7 contact maps multiplied by γ, where γ ∈ {0.1, 0.2, . . . , 1}. Varying γ thus varies the dispersion, and the smaller the dispersion parameter is, the more overdispersed the datasets are, and thus the harder the inference is likely to be. For each set of parameters, we generate 10 datasets using 10 different random seeds, thus yielding 100 datasets.
Finally, we generate an ensemble of 70 datasets to measure how well methods perform when provided with an incorrect counts-to-distances mapping. To do so, we vary α ∈ {−1.5, −.2, · · · − 4., −4.5}, fix β for each simulation so that the number of reads is as the original dataset, keep the dispersion parameter r as fitted on the KBM7 Hi-C dataset, and repeat each simulation 10 times using 10 random seeds. This third ensemble of datasets enables us to compare metric methods, for which the counts-to-distance mapping is fixed or provided by the user a priori, versus non-metric methods, for which the counts-to-distance mapping is inferred jointly with the structure from the data.

Real datasets
We also apply our method to publicly available Hi-C data from the chronic myelogenous leukemia cell-line KBM7 [Rao et al., 2014], and whole-genome (already processed) Hi-C datasets from S. cerevisiae [Duan et al., 2010], D. melanogaster [Sexton et al., 2012], and A. thaliana [Feng et al., 2014].
The KBM7 cell line has the nice property of being nearly haploid: apart from chr 8 and a small part of chr 15, all chromosomes are haploid. We downloaded the first two replicates (experiment 75 and 76) and processed the data with HiC-Pro [Servant et al., 2015] to obtain intra-chromosomal maps at 1 Mb, 500 kb, 250 kb, 100 kb, and 50 kb.

Measures of 3D model similarities
To assess how well methods work, one needs to use similarity measures of 3D structures. In this work, we use two such measures: (1) the root mean square deviation (RMSD) and the Spearman correlation.
The RMSD is a commonly used metric to compare two structures X ∈ R 3×n and X ∈ R 3×n . It is defined as where s ∈ R is a scaling factor, R ∈ R 3×3 is a rotation matrix, and t ∈ R 3 is a translation factor. The RMSD is sensitive to the scale of the structure X. Thus, when the scale of X is not known, we apply a robust rescaling strategy to X, such that 99% of the beads fit into a sphere of a predefined diameter. Note that the RMSD will be greatly affected by the presence of outlier beads. High RMSD indicates a poor reconstruction. The second measure of similarity we use is the Spearman correlation of pairwise distances of the two structures. The Spearman correlation metric is much more robust to outliers than RMSD. Thus, some methods could have high RMSD in the presence of outliers but still achieve a high Spearman correlation.  Supplementary Table 1: 3D inference methods Incomplete list of methods to perform 3D structure inference from Hi-C data (1) is it a consensus or a ensemble based inference? (2) What type of methods is it (MDS-based, statistical modeling, or others)? (3) is the software available or not (to the best of our knowledge)? (4) What programming language is it written in? Software with availability marked with a checkmark and an asterisk indicates either a problem with the link provided in the article or partial availability (i.e., SuperRec's code is not publicly available, but a binary can be downloaded and used on linux platforms). Rows highlighted in yellow corresponds to methods comparable in scope with Pastis-NB (infers non-diploid consensus structures from bulk HiC data and are none-organisms specific). In bold are the methods selected for comparison in this paper. The raw contact count matrix c suffers from many biases, some technical (from the sequencing and mapping) and others biological (inherent in the physical properties of chromatin) [Imakaev et al., 2012, Yaffe andTanay, 2011]. Imakaev et al. [2012] proposed a simple iterative correction and eigenvalue decomposition method called ICE to estimate the biases and normalize the data. This method relies on two primary assumptions. First, the biases of each entry c ij can be written as the product of two biases b i and b j associated with loci i and j; thus, if c N is the normalized contact count matrix, then c ij = b i b j c N ij . Second, the total number of contacts associated with each locus should be equal. From these assumptions, we can formulate a nonconvex optimization problem to estimate the vector of biases b. The problem can be solved exactly using an iterative procedure akin to the Sinkhorn algorithm. We apply the ICE method to all of the data used in this study. Prior to normalization, we filter out rows and columns having the fewest number of contacts to avoid degeneracies during the iterative correction (see Supplementary Table 2).