A minimal G$\bar{\textrm{o}}$-model for rebuilding whole genome structures from haploid single-cell Hi-C data

We present a minimal computational model, which allows very fast, on-the-fly construction of three dimensional haploid interphase genomes from single cell Hi-C contact maps using the HOOMD-blue molecular dynamics package on graphics processing units. Chromosomes are represented by a string of connected beads, each of which corresponds to 100,000 base pairs, and contacts are mediated via a structure-based harmonic potential. We suggest and test two minimization protocols which consistently fold into conformationally similar low energy states. The latter are similar to previously published structures but are calculated in a fraction of the time. We find evidence that mere fulfillment of contact maps is insufficient to create experimentally relevant structures. Particularly, an excluded volume term is required in our model to induce the formation of chromosome territories. We also observe empirically that contact maps do not capture the chirality of the underlying structures. Depending on starting configurations and protocol details, one of two mirror images emerges. Finally, we analyze the occurrence of knots in a particular chromosome. The same knot appears in (almost) all structures irrespective of minimization protocols or even details of underlying potentials providing further evidence for the existence of knots in interphase chromatin.


Introduction
Knowledge of the three-dimensional structure of interphase chromatin is crucial for the understanding of gene regulation [1] and other important processes such as cellular differentiation [2] and the cell cycle [3]. At the large scale this information was first obtained from fluorescence in-situ hybridization (FISH) experiments showing that interphase chromosomes are usually compact and form territories within the nucleus [1,4,5]. The detailed structure, however, has only become accessible recently with the rise of chromosome conformation capture techniques such as 3C [6] and derived methods like Hi-C [7]. The latter applies high-throughput sequencing techniques to determine (non-sequential) spatial contacts between DNA strands from proximity ligation of DNA by sequencing ends and mapping them back onto a reference genome. With the contact map of single-cell Hi-C [8,9,10,11,12,13], the folded structure of two meters of DNA into the confines of a nucleus a few microns wide was revealed at scales down to 20,000 base pairs [13].
So far, there is still no physical model which is generally agreed upon as the currently accessible resolution is still above the nucleosome size and persistence length. Various methods have been discussed recently to reconstruct structures from single cell Hi-C maps such as approaches based on Bayesian inference [14], manifold based optimization [15] or the minimization of polymer models [8,9,10,16,17,18,19]. Note that numerous similar approaches have also been used on bulk Hi-C data i.e., contact maps derived from multiple cells, working under the assumption that contacts and characteristic features of structures appear simultaneously in all or at least many cells (see, e.g., [20,21,22] and references therein.) Our single cell ansatz is based on a polymer model and aims at fulfilling two specific objectives. Firstly, we would like to construct a minimal model with as simple a parametrization as possible. Secondly, we want to optimize the minimization procedure and place it on a modern computing platform using graphics processing units (GPUs). The implementation of these two goals allows us to rebuild whole genome structures (as opposed to single chromosomes) with a resolution of 100,000 base pairs almost instantaneously.
In the following sections we describe our choice of potentials, simulation protocols and analysis tools to obtain the three-dimensional structure of haploid mouse interphase chromatin from the contact map. We find that structures generated by our procedure are rather similar and belong to either one of two chiralities depending on starting conditions and protocol details. This is an indication that a given contact map in conjunction with our choice of potentials leads to a ground-state like conformation with a two-fold chirality. We find that the inclusion of an excluded volume term in our model is essential for the emergence of chromosome territories and plays a major role in the amount of self-entanglements observed. By choosing a rather large excluded volume term we are able to reduce but not prevent knots from occurring in the final structures.

Model and simulation protocols
The task of obtaining a three-dimensional structure from a single-cell contact map is essentially an optimization problem which we approach by applying vari-ous concepts from computational statistical physics. Chromosomes are modeled as chains of connected monomers, each corresponding to 100, 000 base pairs. This resolution was chosen to have enough beads with one or more contacts and ensure that in turn most of the structure is well-defined. The haploid mouse genome is therefore represented by 20 chains of lengths ranging from N ≈ 500 to N ≈ 2000 monomers. A general assumption of these models is that the chromatin path at a given scale can be approximated by a chain of uniform spherical monomers. Naturally, the smaller-scale conformation of the chromatin backbone is highly convoluted and there will be a vast distribution of underlying non-spherical conformations across the model. Nonetheless, the uniform, spherical bead approximation is a reasonable average: given it is isotropic, can accommodate a degree of uncertainty using harmonic distance restraints and because interphase chromosomes are suspected to have fractal-like chain crumpling. The latter is indicated by Hi-C contact probabilities [7] and suggests there is some degree of self-similarity over a wide range of length scales with an overall compaction greater than a more extended, equilibrated polymer. Beads which are in contact according to the experimentally determined contact pairs are also connected by harmonic springs which drives the system towards a native state. This concept is borrowed from protein folding simulations of so-called Gō-models [23,24,25,26]. In 1975, Gō and coworkers suggested a two-dimensional lattice model which captures aspects of protein folding. They defined "bonds" between monomers which were to end up in spatial vicinity in the native state in an attempt to study dynamics of folding with Monte Carlo simulations. Nowadays, the term is typically used for a class of generic polymer models in which "native bonds" are defined between amino acids which are known to be in "contact", i.e. close spatial proximity according to structural information from the Protein Data Bank [27]. However, while the latter aim at revealing the folding trajectory towards a previously known native state we apply this idea to chromatin for which contact data is available to reveal the unknown "native" state. The actual trajectory is of no physical relevance in this case. For our particular application, using Hi-C data from single cells, it is further assumed that the contacts can be satisfied by having spherical beads effectively touch, but not superimpose. This encompasses the uncertainty in the underlying chain path, as there can be many ways that a DNA ligation can be accommodated between two beads, and leads to solutions of fairly uniform density (that may be compared with microscopy [9]) where the bulk of the spherical objects do not mix, even where they are known to have short-range contacts. This ansatz is by no means original and is a central element of almost all polymer-based minimization methods for the determination of chromosome structures because the contact potential is essentially responsible to enforce contacts. Note that with this approach we strive to obtain a global structure based on local information. Non-local information on distances between various base pairs would likely improve the integrity of the global structure and could be implemented naturally in this class of models. Our simulation protocols allow for a maximal exploration of conformations. The gradual increase of excluded volume from a random starting configuration in protocol 1, for example is inspired by equilibration techniques for polymer melts [28]. Our approaches are also comparable to methods such as simulated annealing [29] which also aim at locating low energy conformations or ground states. Potentials are chosen to facilitate this optimization task and should not necessarily be regarded as realistic coarse-grained representations of chromatin. To this end we define rather stiff bonds between adjacent beads and beads in contact which ensure that contacts and bonds are enforced. In addition, a rather large penalty is enforced in our excluded volume term for overlaps, which pushes beads without contacts away from each other but still allows for bond crossings in the expansion phase.
Specifically, we use a generic bead-spring polymer model [30] in which adjacent monomers are connected by a harmonic potential ( Fig. 1): where κ = 2000 is the force constant and r 0,B = 1 the preferred bond length. In all cases we refer to standard simulation units and do not mention them explicitly. Contacts between beads are also enforced via a "native" harmonic potential with κ = 2000. The preferred distance, however, is somewhat larger (r 0,C = 1.5) which increases similarities of final structures in terms of the root mean squared deviation (RMSD). Contacts are obtained from Hi-C data available at [9] and binned to sizes of 100, 000 base pairs. For the remainder of the paper we will focus on cell 2 of this data set. From the binned data self-contacts and contacts between adjacent monomers are removed. Multiple contacts between two beads are also not considered as the contact potential of our minimal model does not depend on the number of contacts. This reduces the number of contacts in cell 2 of the data set from 79, 569 to 32, 243. A Gaussian pair potential is acting between all particles that are neither adjacent nor in contact and pushes non-bonded monomers away from each other: where = 100, whereas σ = 1 and cut-off radius r cut = 3.5 ( Fig. 1). In the intermediate step of our simulation protocol the excluded volume is reduced (σ = 0.1 and r cut = 0.4).
It should be noted that energies in the model do not represent physical energies arising in chromatin. The sole purpose of these models and protocols is to find a ground-state-like conformation (similar to the native state of a protein) which fulfills all (or almost all) contacts. In this sense the whole method can be considered as a kind of black-box algorithm to generate a structure which is compatible with the contact matrix of a given cell. Therefore, the resulting structure is as far away from equilibrium as the original structure in the cell on which it is based upon. Furthermore, reference [7] shows that the contact probability power law of the genome structures (which are very close to ours), and which also match bulk Hi-C, is inconsistent with a fully equilibrated polymer. Constants in our potentials are selected to facilitate this purpose. Bond and contact potentials are harmonic which is typical for generic bead-spring polymers. Prefactors are chosen to be extraordinarily large to strongly enforce contacts and bonds, but the exact choice of κ = 2000 is somewhat arbitrary. The bond length of r 0,B = 1 of our potential is a standard choice, again, while we opted to increase the optimal contact distance r 0,C slightly to 1.5 to introduce a second length-scale, which we find lowers the RMSD of the resulting structures. Another motivation for this particular choice of r 0,B/C was to enable comparisons with the model of [9], which features similar optimal distances. Below we will define two variants of our simulation protocol. In contrast to more sophisticated methods [31] which aim at generating random walks in a confined volume, in protocol 1 we simply place 25, 719 particles in a unit cube using a random, uniform distribution. To ensure maximum randomness particles are connected randomly to form 20 linear chains corresponding to 20 chromosomes with lengths ranging from 582 and 1925 particles. The number of particles was not derived directly from the contact pairs, but obtained from [9] to enable a comparison with the reference structures. (Trajectories can be obtained upon request.) From the starting configuration we proceed through the following steps: • 80, 000 time steps with bond and contact potentials but no excluded volume • 50, 000 time steps with bond, contact and reduced excluded volume interactions • 50, 000 time steps with bond, contact and full excluded volume interactions The number of time steps were chosen such that the structure can easily reach an energy plateau in each simulation step. Note that this number may need to be adjusted if different resolutions are considered or the underlying contact data is changed. The cycle is iterated over and over again. The last frame of each cycle is saved and constitutes a new and independent configuration. The first or the first couple of frames are discarded because it typically takes two runs for the energy to converge. Note that having no or reduced repulsion in the early stages of the calculations has also been fairly common in simulated annealing of protein structures with distance restraints (e.g. from NMR [32]). The expansion with reduced excluded volume in the second step was initially introduced to limit the amount of GPU memory used in conjunction with neighbor lists. Fig. 2 shows the starting configuration and snapshots taken at the end of each step in the first cycle. Different colors indicate different chromosomes. For comparison, we also test a second protocol. Here we do not de-or inflate our configuration but turn on and off the contacts instead. For this protocol we start with a random walk of 25, 719 beads which is cut into 20 chromosomes. The actual cycle only consists of two steps: • 100, 000 time steps with bond and excluded volume interactions but no contact potentials • 100, 000 time steps with all interactions All simulations use HOOMD-blue [34,35] with a Langevin thermostat (γ = 1, k B T = 1) and a time step of 0.001. We choose the implementation of the LBVH tree [36] as our neighbor list which scales with particle number as opposed to the system volume.
One cycle of protocol 1 only takes about 100 seconds on a desktop PC with a NVIDIA 1050 graphics card and allows us to generate three dimensional structures from Hi-C contact matrices almost at once. Note that this time includes a setup time for each step of the cycle which adds up to about 20 − 30s. The time for the actual integration can of course be reduced further with a more powerful GPU. (With a RTX 2070 on our cluster one cycle only takes about 30s). Even though this is not a typical application of HOOMD-blue which thrives on long runs with hundred of thousands of particles on high-end GPUs, in this case it enables on-the-fly model and code development on an inexpensive desktop PC.

Analysis of resulting structures
Before analyzing structures in more detail we would like to discuss apparent features arising from Fig. 2. In the first step of the cycle (Fig. 2b) bonds and contacts are enforced, but monomers do not have any volume and can freely pass through each other with no costs involved. While such a configuration fulfills all contacts by construction it does not exhibit chromosome territories. In a sense a globular intertwined configuration such as this is already a trivial (non-unique) solution to our optimization problem without the apparent features observed in experiments. This not only indicates that fulfillment of the contact map is insufficient to obtain experimentally relevant structures, it also suggests that we need to reformulate our task to account for beads that are not in contact. In our case this additional constraint is effectively enforced by the introduction of excluded volume which pushes beads away from each other if they are neither adjacent nor in contact. As can be seen in Fig. 2c, adding a rather short-ranged excluded volume interaction already leads to the emergence of chromosome territories which become more pronounced when full excluded volume interactions are enforced (Fig. 2d). Note that structures obtained as a solution of our optimization problem thus depend to some degree on the choice of our model parameters. As bead sizes increase the number of short-ranged contacts (not accounted for in the contact map) decreases naturally. Enforcement of non-contacts via excluded volume interactions is, unfortunately, not entirely unproblematic as Hi-C contact pairs typically contain only a small fraction of potential contacts that actually exist in the real cell. These missing contacts are, at least in a good data set, often a consequence of sequence mapping problems with repetitive genomic DNA and especially apparent at the centromers. The outward reaching arms which emerge in the final structure are also mostly a result of missing contacts.
In the following we will focus on the discussion of cell 2 of the data set provided by [9]. For simplicity, we will use the first simulation protocol to generate structures, i.e., we will initialize the system once and run the cycle over and over again. For cell 2 (and most other cells in the data set) this approach preserves the chirality of structures across cycles and simplifies our analysis. Chirality, which typically arises using protocol 2 will be discussed in the next section. Fig. 3a shows the potential energy of the final structure at the end of each cycle as a function of the number of cycles. At the end of the first cycle the energy is still somewhat larger which is why this frame should be discarded before the analysis. 2 In subsequent cycles the energy of the last frame of the cycle has converged and jumps between two conformations which only differ slightly. Fig. 3b shows the distribution of bonds and contacts. Both peaks and averages are shifted a little to the right of the potential minimum at r 0,B/C indicating that some of the bonds and contacts are overstretched. (This might be expected when approximating chromatin with spherical monomers.) Nevertheless, our step potentials manage to enforce constraints and the fraction of bonds and contacts which are overstretched beyond 2 · r 0,B/C is below 3 · 10 −4 and 10 −4 , respectively, which is significantly lower than reference simulations of cell 2 in [9]. As our algorithm essentially enforces all contacts and bonds, all experimental contacts are also present in a matrix based on the generated structure. However, as experimental contacts only amount to a small fraction of actual contacts in the cell, we expect to find more contacts in our model structure. In Fig. 4 we analyze the similarity of structures by measuring deviations from an average structure. To this end we determine the RMSD with respect to a pre-averaged reference structure using the RMSD trajectory tool of VMD [33]: where N atoms is the number of monomers and r i (t j ) are the positions of the i th monomer of frame j and r i (t 0 ) the corresponding position in the pre-averaged reference frame. Figure 4: Root mean squared deviation from a pre-averaged structure for the final configuration of each frame (protocol 1).
As illustrated by the low average RMSD of 0.88 all structures are remarkably similar demonstrating that our method is indeed converging to very similar ground-state like conformations after the first cycle has been discarded. The inset of Fig. 4 reveals that the two competing final states are also similar in terms of RMSD. This result can be compared to reference simulations in [9], which obtain a slightly smaller RMSD of 0.67, however not without pre-selecting structures. (The hierarchical annealing protocol of the latter would discard outlier structures at an intermediate resolution/bin size which was required for the most sparse data sets when chromosomes sometimes adopted a chirality opposite to the rest of the structure and got stuck.) In general, we recommend to regard resulting configurations as an ensemble of optimization runs and select only the states with the lowest energies. If, for example, we only consider the ten percent of all configurations with the lowest energies the RMSD drops to 0.76 for protocol 1.

Chirality
In this section we will discuss chiral structures which arise naturally if protocol 2 is applied or if we compare structures resulting from different starting configurations using protocol 1. To detect chiral structures we compute the RMSD with respect to a specific, but randomly chosen configuration. As shown in Fig. 5a, structures cluster into two groups with small and large RMSD, respectively. If one multiplies, for example, all x-coordinates of one group by −1, both groups can be mapped onto each other (Fig. 5b) indicating that they are in fact mirror images. (For protocol 2 the variation of structures is somewhat larger resulting in an average RMSD of 1.02. If we combine the 1, 000 runs from protocol 1 and 2 the average RMSD increases slightly to 1.172 indicating that structures from protocol 1 and 2 are not identical but very similar.) More formally one can also construct a trihedron from the center of masses of the overall structure and the first three chromosomes. The (somewhat arbitrarily defined) chirality can then be determined, for example, by computing the sign of the corresponding vector product. To summarize, our empirical observation indicates that information Figure 5: a) RMSD arising from protocol 2 with respect to a specific configuration. b) RMSD after mirroring transformation. from contact maps is insufficient to determine the chirality of the underlying structure which reflects the inherent symmetry of the associated matrix. The emergence of mirror images in the modeling of interphase chromatin has already been described in [16] for single chromosomes, but to our knowledge not for whole genome structures. Note that the true chirality of the cell may always be obtained by other means, for example, if a microscope image of the same cell exists.

Knots
It is generally assumed that interphase chromosomes are (mostly) unknotted. It has been argued that knots may lead to breakage of DNA [37] or hinder RNA transcription [38], and recently, mechanisms have been suggested to actively unknot chromosomes via type II DNA topoisomerases [39]. The theoretical foundation for the presumed absence of knots is provided by the so-called crumpled globule model, a concept developed in the 1980 s [40,41] in the context of polymer physics. Essentially, interphase chromosomes are believed to be akin to an unknotted polymer which after a collapse from an extended conformation remains unknotted in a non-equilibrium metastable state. The main argument for the latter is the power-law decay of the contact probability which exhibits a similar exponent as a crumpled globule [7]. It has also been shown that chains with an extensive degree of knottedness can be constructed while still preserving the correct exponent [42].
Recently, knots have been observed directly in circular minichromosomes of yeast [43], and we have analyzed three-dimensional structures of haploid interphase chromosomes of mice based on the same data and a similar model as the one discussed in this paper. This analysis revealed that these structures are knotted to a substantial degree [44]. However, different optimization runs based on data from the same cell resulted in similar, but somewhat different topologies. Even though evidence was presented that at least some of these knots are likely real and not artifacts of the modeling, the question whether or not knots exist in interphase chromosomes could not be resolved conclusively. Other modeling attempts based on data from multiple cells have also detected knots, however, to a somewhat lesser degree [22,17]. In the following paragraph we would like to shed some more light on this controversial issue.
To gauge the emergence of knots we will focus on chromosome 14 from cell 2 which has already been discussed in great detail in our previous publication [44]. Table 1 displays the occurrence of knots in the structures constructed with protocol 1 and 2 in comparison to our reference model [9]. The latter is based on the same Hi-C data but uses somewhat different potentials and minimization protocols. Knots are determined by closing the ends via an extension of two lines through the center of mass and the termini of the chromosome before computing a variant of the Alexander polynomial as described in detail in Ref. [44]. Note that details of the closure may have a minor influence on the probabilities [45,46] of the observed knots. We have opted for a rather simple closure which has been used successfully in the context of proteins [46]. However, we have also checked that results do not change qualitatively if more sophisticated closures such as variants of the statistical closure [46,47] are applied in agreement with previous results [45,46]. Structures can in principle also be analyzed using a dedicated web server which was set up to check models of chromatin for knots [48].
As in the reference model we are not able to steer the minimization towards a unique topological state but observe a variety of simple knots. Note however, that this task is far from trivial as a small displacement of a single bond suffices to change knot types as demonstrated in [44]. Nevertheless, we observe very similar knotting probabilities and a noteworthy absence of unknots, figure-eight and five-fold knots in all cases. Surprisingly, the simple trefoil knot located between monomer number 290 and 340 (Fig. 7) is present in (almost) all structures (last row in Table 1), irrespective of the simulation protocol and even the underlying model. This not only provides further evidence that this particular knot actually existed in the cell on which the Hi-C map is based upon, but also for the consistency of our approach. We have also checked that this particular knot is not present in any other cell lines published in [9]. Together with the seemingly different structures emerging from the different cells in [9], this may serve as an indication that single-cell data is actually required to obtain reliable structural representations. Finally, we would like to point out an observation from our simulations. The amount and complexity of knots strongly depends on the excluded volume contribution of the potential. States with no or little excluded volume as produced by steps 1 and 2 of protocol 1 are knotted beyond the capabilities of our algorithm to distinguish them or in fact any algorithm we are aware of. The additional constraints due to contacts increase knotting substantially beyond ideal chains or equilibrium globules [45]. Only a strong excluded volume term guarantees the modest knotting observed in Table 1. From this point of view knotting does depend on the details of the potentials and Table 1 represents in a way the lowest degree of knotting we were able to obtain with our ansatz. Also note that an essentially unknotted starting configuration (such as created from a random walk in the first few time steps of step 1 of protocol 2) does not guarantee that final configurations are unknotted as long as bond crossings can occur.

Summary and Outlook
In this paper we present a minimal structure-based model, which in conjunction with a highly efficient minimization protocol, allows for the instantaneous determination of three-dimensional structures of whole genomes from haploid Hi-C contact maps at the level of 100,000 base pairs per monomer. Irrespective of protocol details and starting conditions, structures appear to converge to conformationally similar low energy states as long as bond crossings are allowed. We find evidence that the mere fulfillment of contact maps is insufficient to create experimentally meaningful structures and that an excluded volume potential which essentially enforces non-contacts is required for the formation of chromosome territories. Furthermore, contact maps also do not contain information about the chirality of the underlying structures, so that one of two mirror images may emerge from our minimization runs. Finally, we also investigate self-entanglements in a particular chromosome. We observe that the same trefoil knot is present in (almost) all of our structures as well as in structures which employ a completely different optimization strategy and potentials, providing further evidence for the consistency of our approach and for the existence of knots in interphase chromatin.
In future we would like to extend our modeling ansatz to higher resolutions while ensuring consistency across all levels of coarse-graining. We also intend to apply our approach to the conceptually challenging modeling of diploid chromosomes [13].