Computing the Rearrangement Distance of Natural Genomes

The computation of genomic distances has been a very active field of computational comparative genomics over the past 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double cut and join distance by Yancopoulos et al. in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao et al. relaxed this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they gave an integer linear programming (ILP) solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes that have equal numbers of duplicates of any marker. Therefore, it still requires a delicate preprocessing of the input data in which excess copies of unbalanced markers have to be removed. In this article, we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao et al. to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis. Any marker sequence can directly be used for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few ten thousand markers, which we demonstrate on simulated and real data.


Introduction
The study of genome rearrangements has a long tradition in comparative genomics. A central question is how many (and what kind of) mutations have occurred between the genomic sequences of two individual genomes. In order to avoid disturbances due to minor local effects, often the basic units in such comparisons are syntenic regions identified between the genomes under study, much larger than the individual DNA bases. We refer to such regions as genomic markers, or simply markers, although often one also finds the term genes.
Following the initial statement as an edit distance problem [16], a comprehensive trail of literature has addressed the problem of computing the number of rearrangements between two genomes in the past 25 years. In a seminal paper in 1995, Hannenhalli and Pevzner [12] introduced the first polynomial time algorithm for the computation of the inversion distance of transforming one chromosome into another one by means of segmental inversions. Later, the same authors generalized their results to the HP model [11] which is capable of handling multi-chromosomal genomes and accounts for additional genome rearrangements. Another breakthrough was the introduction of the double cut and join (DCJ) model [2,20], that is able to capture many genome rearrangements and whose genomic distance is computable in linear time. The model is based on a simple operation in which the genome sequence is cut twice between two consecutive markers and re-assembled by joining the resulting four loose cut-ends in a different combination.
A prerequisite for applying the DCJ model in practice to study rearrangements in genomes of two related species is that their genomic marker sets must be identical and that any marker occurs exactly once in each genome. This severely limits its applicability in practice. Linear time extensions of the DCJ model allow markers to occur in only one of the two genomes, computing a genomic distance that minimizes the sum of DCJ and insertion/deletion (indel) events [5,9]. Still, markers are required to be singular, i.e., no duplicates can occur. When duplicates are allowed, the problem is more intricate and all approaches proposed so far are NP-hard, see for instance [1,6,7,14,17,18]. From the practical side, more recently, Shao et al. [18] presented an integer linear programming (ILP) formulation for computing the DCJ distance in the presence of duplicates, but restricted to balanced genomes, where both genomes have equal numbers of duplicates. A generalization to unbalanced genomes was presented by Lyubetsky et al. [13], but their approach does not seem to be applicable to real data sets, see Section 6.1 for details.
In this paper we present the first feasible exact algorithm for solving the NP-hard problem of computing the distance under a general genome model where any marker may occur an arbitrary number of times in any of the two genomes, called natural genomes. Specifically, we adopt the maximal matches model where only markers appearing more often in one genome than in the other can be deleted or inserted. Our ILP formulation is based on the one from Shao et al. [18], but with an efficient extension that allows counting runs of markers that are under- or over-represented in one genome with respect to the other, so that the pre-existing model of minimizing the distance allowing DCJ and indel operations [5] can be adapted to our problem. With this extension, once we have the genome markers, no other restriction on the genome configurations is imposed.
The evaluation of our approach shows that it can be used to analyze genomes with up to a few ten thousand markers, provided the number of duplicates is not too large. The complete source code of our ILP implementation and the simulation software used for generating the benchmarking data in Section 6.2 are available from https://gitlab.ub.uni-bielefeld.de/gi/ding.

Preliminaries
A genome is a set of chromosomes and each chromosome can be linear or circular. Each marker in a chromosome is an oriented DNA fragment. The representation of a marker m in a chromosome can be the symbol m itself, if it is read in direct orientation, or the symbol m̄, if it is read in reverse orientation. We represent a chromosome S of a genome A by a string s, obtained by the concatenation of all symbols in S, read in any of the two directions. If S is circular, we can start to read it at any marker and the string s is flanked by parentheses.
Given two genomes A and B, let U be the set of all markers that occur in A or in B, and let G ⊆ U be the set of common markers, occurring in both genomes. For each marker m, denote by Φ_A(m) and Φ_B(m) its number of occurrences (multiplicity) in A and in B, respectively, and let ∆Φ(m) = Φ_A(m) − Φ_B(m). Since the genomes can have distinct multiplicities of markers, we also need to consider insertions and deletions of segments of contiguous markers [5,9,21]. We refer to insertions and deletions collectively as indels. For example, the deletion of segment 5 2 6 2 from linear chromosome 1 2 3 5 2 6 2 4 results in 1 2 3 4. Indels have two restrictions: (i) only markers m with ∆Φ(m) > 0 can be deleted; and (ii) only markers m with ∆Φ(m) < 0 can be inserted.
In this paper, we are interested in computing the DCJ-indel distance between two genomes A and B, denoted by d_DCJ^id(A, B), which corresponds to the minimum number of DCJs and indels required to sort A into B. We separate the instances of the problem into three types:
1. Singular genomes: the genomes contain no duplicate markers, that is, each common marker is singular in each genome. Formally, for each m ∈ G, we have Φ_A(m) = Φ_B(m) = 1. The distance between singular genomes can be easily computed in linear time [2,5,9].
2. Balanced genomes: the genomes contain no exclusive markers, but can have duplicates, and the number of duplicates in each genome is the same. Formally, we have U = G and, for each m ∈ U, Φ_A(m) = Φ_B(m). Computing the distance for this set of instances is NP-hard, and an ILP formulation was given in [18].
3. Natural genomes: these genomes can have exclusive markers and duplicates, with no restrictions on the number of copies. Since natural genomes generalize balanced genomes, computing the distance for this set of instances is also NP-hard. In the present work we present an efficient ILP formulation for computing the distance in this case.
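The three instance types can be told apart directly from the marker multiplicities. The following sketch illustrates the classification; the helper name is our own and genomes are given simply as one marker name per occurrence, ignoring orientation and chromosome structure:

```python
from collections import Counter

def classify_instance(markers_a, markers_b):
    """Classify a pair of genomes as 'singular', 'balanced' or 'natural'."""
    phi_a, phi_b = Counter(markers_a), Counter(markers_b)
    common = phi_a.keys() & phi_b.keys()   # the set G of common markers
    # Singular: every common marker occurs exactly once in each genome.
    if all(phi_a[m] == phi_b[m] == 1 for m in common):
        return "singular"
    # Balanced: no exclusive markers (U = G) and equal multiplicities.
    if phi_a == phi_b:
        return "balanced"
    return "natural"
```

For example, `classify_instance(['1', '1', '2'], ['2', '1', '1'])` reports a balanced instance, while any multiplicity mismatch yields a natural one.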

DCJ-indel distance of singular genomes
First we recall the problem when common duplicates do not occur, that is, when we have singular genomes. We will summarize the linear time approach to compute the DCJ-indel distance in this case that was presented in [5], already adapted to the notation required for presenting the new results of this paper.

Relational diagram
For computing a genomic distance it is useful to represent the relation between two genomes in some graph structure [2,3,5,10,11]. Here we adopt a variation of this structure, defined as follows. For each marker m, denote its two extremities by m^t (tail) and m^h (head). Given two singular genomes A and B, the relational diagram R(A, B) has a set of vertices V = V(A) ∪ V(B), where V(A) has a vertex for each extremity of each marker of genome A and V(B) has a vertex for each extremity of each marker of genome B. Due to the 1-to-1 correspondence between the vertices of R(A, B) and the occurrences of marker extremities in A and B, we can identify each extremity with its corresponding vertex. It is convenient to represent the vertices of V(A) in an upper line, respecting the order in which they appear in each chromosome of A, and the vertices of V(B) in a lower line, respecting the order in which they appear in each chromosome of B.
If the marker extremities γ_1 and γ_2 are adjacent in a chromosome of A, we have an adjacency edge connecting them; similarly, if the marker extremities γ_1 and γ_2 are adjacent in a chromosome of B, we have an adjacency edge connecting them. Marker extremities located at chromosome ends are called telomeres and are not connected to any adjacency edge. In contrast, each extremity that is not a telomere is connected to exactly one adjacency edge. Denote by E_adj^A and by E_adj^B the adjacency edges in A and in B, respectively. In addition, for each common marker m ∈ G, we have two extremity edges, one connecting the vertex m^h from V(A) to the vertex m^h from V(B) and the other connecting the vertex m^t from V(A) to the vertex m^t from V(B). Denote by E_γ the set of extremity edges. Finally, for each occurrence of an exclusive marker in U\G, we have an indel edge connecting the vertices representing its two extremities. Denote by E_id^A and by E_id^B the indel edges in A and in B. Each vertex is then connected either to an extremity edge or to an indel edge.
All vertices have degree one or two, therefore R(A, B) is a simple collection of cycles and paths. A path that has one endpoint in genome A and the other in genome B is called an AB-path. In the same way, both endpoints of an AA-path are in A and both endpoints of a BB-path are in B. A cycle contains either zero or an even number of extremity edges. When a cycle has at least two extremity edges, it is called an AB-cycle. Moreover, a path (respectively cycle) of R(A, B) composed exclusively of indel and adjacency edges in one of the two genomes corresponds to a whole linear (respectively circular) chromosome and is called a linear (respectively circular) singleton in that genome. Actually, linear singletons are particular cases of AA-paths or BB-paths. An example of a relational diagram is given in Fig. 1.
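To make the construction concrete, here is a small sketch that builds R(A, B) for two singular genomes and classifies its components. The encoding of genomes and all function names are our own illustrative choices, not part of the formal development:

```python
def extremities(marker, sign):
    """Left and right extremity of an oriented marker occurrence."""
    return ((marker, 't'), (marker, 'h')) if sign > 0 else ((marker, 'h'), (marker, 't'))

def relational_diagram(genome_a, genome_b):
    """Build R(A, B) for two singular genomes and classify its components.

    A genome is a list of chromosomes; a chromosome is a pair
    (circular, [(marker, sign), ...]) with sign in {+1, -1}.  Returns the
    counts of AB-cycles, paths by type, and circular singletons.
    """
    adj = {}                                   # vertex -> neighbouring vertices

    def add_edge(u, v):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    markers = {tag: {m for _, chrom in genome for m, _ in chrom}
               for tag, genome in (('A', genome_a), ('B', genome_b))}
    common = markers['A'] & markers['B']

    for tag, genome in (('A', genome_a), ('B', genome_b)):
        for circular, chrom in genome:
            ends = [(tag,) + e for m, s in chrom for e in extremities(m, s)]
            for v in ends:                     # register every vertex
                adj.setdefault(v, [])
            for i in range(1, len(ends) - 1, 2):   # adjacency edges
                add_edge(ends[i], ends[i + 1])
            if circular and ends:
                add_edge(ends[-1], ends[0])
            for m, _ in chrom:                 # indel edges of exclusive markers
                if m not in common:
                    add_edge((tag, m, 't'), (tag, m, 'h'))
    for m in common:                           # extremity edges
        add_edge(('A', m, 't'), ('B', m, 't'))
        add_edge(('A', m, 'h'), ('B', m, 'h'))

    counts = {'AB-cycles': 0, 'AB-paths': 0, 'AA-paths': 0,
              'BB-paths': 0, 'singletons': 0}
    seen = set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]           # collect one component
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v])
        seen |= comp
        endpoints = sorted(v[0] for v in comp if len(adj[v]) < 2)
        crosses = any(v[0] != w[0] for v in comp for w in adj[v])
        if not endpoints:                      # a cycle
            counts['AB-cycles' if crosses else 'singletons'] += 1
        else:                                  # a path, classified by its ends
            counts[{('A', 'A'): 'AA-paths', ('A', 'B'): 'AB-paths',
                    ('B', 'B'): 'BB-paths'}[tuple(endpoints)]] += 1
    return counts
```

For two identical linear chromosomes with markers 1 and 2, the sketch reports one AB-cycle and two AB-paths, matching the expected diagram of equal genomes.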

The numbers of telomeres and of AB-paths in R(A, B) are even. The DCJ-cost [5] of a DCJ operation ρ, denoted by ‖ρ‖, is defined as follows. If it either increases the number of AB-cycles by one, or the number of AB-paths by two, ρ is optimal and has ‖ρ‖ = 0. If it does not affect the number of AB-cycles and AB-paths in the diagram, ρ is neutral and has ‖ρ‖ = 1. If it either decreases the number of AB-cycles by one, or the number of AB-paths by two, ρ is counter-optimal and has ‖ρ‖ = 2.

Runs and indel-potential
The approach that uses DCJ operations to group exclusive markers for minimizing indels depends on the following concepts.
Given two genomes A and B and a component C of R(A, B), a run [5] is a maximal subpath of C, in which the first and the last edges are indel edges, and all indel edges belong to the same genome. It can be an A-run when its indel edges are in genome A, or a B-run when its indel edges are in genome B. We denote by Λ(C) the number of runs in component C. If Λ(C) ≥ 1 the component C is said to be indel-enclosing, otherwise Λ(C) = 0 and C is said to be indel-free.
While sorting components separately with optimal DCJs only, runs can be merged (when two runs become a single one) and also accumulated (when all its indel edges alternate with adjacency edges only and the run can be inserted or deleted at once) [5]. The indel-potential of a component C, denoted by λ(C), is the minimum number of indels derived from C after this process and can be computed directly from Λ(C):

λ(C) = 0 if Λ(C) = 0, and λ(C) = ⌈(Λ(C) + 1)/2⌉ if Λ(C) ≥ 1.

In the example of Fig. 2, after an optimal DCJ that creates a new cycle, one A-run is accumulated (between edges e4 and e3 there is only an adjacency edge) and two B-runs are merged (e2 is in the same run with e5 and e6). Indeed the indel-potential of the original BB-path is three.
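Using the closed form above, the indel-potential is a one-liner. A sketch with our own naming, where a component is summarized by its number of runs Λ(C):

```python
import math

def indel_potential(num_runs):
    """lambda(C) as a function of the number of runs Lambda(C):
    0 for an indel-free component, ceil((Lambda(C) + 1) / 2) otherwise."""
    return 0 if num_runs == 0 else math.ceil((num_runs + 1) / 2)
```

Note that merging two runs into one only helps in pairs: the potential grows by one for every two additional runs.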
Let λ_0 and λ_1 be, respectively, the sum of the indel-potentials of the components of the relational diagram before and after a DCJ ρ. The indel-cost of ρ is then ∆λ(ρ) = λ_1 − λ_0, and the DCJ-indel cost of ρ is defined as ∆d(ρ) = ‖ρ‖ + ∆λ(ρ). While sorting components separately, it has been shown that by using neutral or counter-optimal DCJs one can never achieve ∆d < 0 [5]. This gives the following result:

Lemma 1 (from [2,5]). Given two singular genomes A and B, whose relational diagram R(A, B) has c AB-cycles and i AB-paths, we have

d_DCJ^id(A, B) ≤ n − c − i/2 + Σ_{C ∈ R(A,B)} λ(C),

where n = |G| is the number of common markers.

Distance of circular genomes
For singular circular genomes, the graph R(A, B) is composed of cycles only. In this case the upper bound given by Lemma 1 is tight and leads to a simplified formula [5]:

d_DCJ^id(A, B) = n − c + Σ_{C ∈ R(A,B)} λ(C).
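As a sketch (our own helper, assuming each cycle of R(A, B) is summarized by its number of runs), the simplified formula reads:

```python
def dcj_indel_distance_circular(n, run_counts_per_cycle):
    """d(A, B) = n - c + sum of lambda(C) for singular circular genomes,
    where n is the number of common markers and every component of
    R(A, B) is a cycle, given here by its number of runs."""
    c = len(run_counts_per_cycle)
    # (r + 2) // 2 is ceil((r + 1) / 2), the indel-potential for r >= 1
    lam = sum(0 if r == 0 else (r + 2) // 2 for r in run_counts_per_cycle)
    return n - c + lam
```

Two equal circular genomes with three common markers yield three indel-free cycles and distance 0.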

Recombinations and linear genomes
For singular linear genomes, the upper bound given by Lemma 1 is achieved when the components of R(A, B) are sorted separately. However, there are optimal or neutral DCJ operations, called recombinations, that act on two paths and have ∆d < 0. Such path recombinations are said to be deducting. The total number of types of deducting recombinations is relatively small. By exhaustively exploring the space of recombination types, it is possible to identify groups of chained recombinations, so that the sources of each group are the original paths of the graph. In other words, a path that is a resultant of a group is never a source of another group. This results in a greedy approach that optimally finds the value to be deducted, as we describe in the following.

Deducting recombinations. In a recombination, the two paths on which the cuts are applied are called sources and the paths obtained after the joinings are called resultants. Any recombination whose sources are an AA-path and a BB-path is optimal. A recombination whose sources are two different AB-paths can be either neutral, when the resultants are also AB-paths, or counter-optimal, when the resultants are an AA-path and a BB-path. Any recombination whose sources are an AB-path and an AA- or a BB-path is neutral [4,5]. Let A (respectively B) be a sequence with an odd (≥ 1) number of runs, starting and ending with an A-run (respectively B-run). We can then make any combination of A and B, such as AB, that is a sequence with an even (≥ 2) number of runs, starting with an A-run and ending with a B-run. An empty sequence (with no run) is represented by ε. Then each of the notations AA_ε, AA_A, AA_B, AA_AB ≡ AA_BA, BB_ε, BB_A, BB_B, BB_AB ≡ BB_BA, AB_ε, AB_A, AB_B, AB_AB and AB_BA represents a particular type of path (AA, BB or AB) with a particular structure of runs (ε, A, B, AB or BA). By convention, an AB-path is always read from A to B.
These notations were adopted due to the observation that, besides the DCJ type of the recombination (optimal, neutral or counter-optimal), the only properties that matter are whether the paths have an odd or an even number of runs and whether the first run is in genome A or in genome B [5]. An example of a deducting recombination is given in Fig. 3.
The complete set of path recombinations with ∆d ≤ −1 is given in Table 1. In Table 2 we also list recombinations with ∆d = 0 that create at least one source of recombinations of Table 1. We denote by • an AB-path that cannot be a source of a recombination in Tables 1 and 2, such as AB_ε, AB_A and AB_B.

Table 1. Path recombinations that have ∆d ≤ −1 and allow the best reuse of the resultants. (Columns: sources, resultants, ∆λ, ‖ρ‖, ∆d.)

The two sources of a recombination can also be called partners. Looking at Table 1 we observe that all partners of AB_AB and AB_BA paths are also partners of AA_AB and BB_AB paths, all partners of AA_A and AA_B paths are also partners of AA_AB paths, and all partners of BB_A and BB_B paths are also partners of BB_AB paths. Moreover, in some cases deducting recombinations are chained, that is, resultants from deducting recombinations in Tables 1 and 2 are sources of other deducting recombinations, as shown in Fig. 4. These observations allow the identification of groups of chained recombinations, as listed in Table 3.
Each group is represented by a combination of letters, where:
- W (in its three typographic variants) represents an AA_AB, an AA_A and an AA_B, respectively;
- M (in its three typographic variants) represents a BB_AB, a BB_A and a BB_B, respectively;
- Z represents an AB_AB and N represents an AB_BA.
Although some groups have reusable resultants, those are actually never reused: if a group lower in the table used as sources resultants from higher groups, the sources of all groups involved would already have been consumed by groups occupying even higher positions in the table. Due to this fact, the number of occurrences of each group depends only on the initial number of each type of component.
The deductions shown in Table 3 can be computed with an approach that greedily maximizes the groups in P, Q, T, S, M and N, in this order. The part P contains only one operation and is thus very simple. The same happens with Q, since the two groups in this part are exclusive after applying P. The four subparts of T are also exclusive after applying Q. (Note that the groups of the forms WWM and MMW in T are simply subgroups of Q.) The groups in S correspond to the simple application of all possible remaining operations with ∆d = −1. After applying the operations of types ZN and WM (in its variants), the remaining operations in S are all exclusive. After S, the two groups in M are exclusive, and then the same happens to the six groups in N (which are simply subgroups of M).
We can now write the theorem that gives the exact formula for the DCJ-indel distance of linear singular genomes.

Table 3. Chained recombination groups obtained from Tables 1 and 2. The column scr indicates the contribution of each path to the distance decrease (the table is sorted in descending order with respect to this column). (Columns: id, sources, resultants, ∆d, scr.)

Theorem 1 (from [5]). Given two singular linear genomes A and B, whose relational diagram R(A, B) has c AB-cycles and i AB-paths, we have

d_DCJ^id(A, B) = n − c − i/2 + Σ_{C ∈ R(A,B)} λ(C) − δ,

where δ = 2P + 3Q + 2T + S + 2M + N and P, Q, T, S, M and N here refer to the numbers of deductions in the corresponding chained recombination groups.

DCJ-indel distance of natural genomes
Based on the results presented so far, we develop an approach for computing the DCJ-indel distance of natural genomes A and B. First we note that it is possible to transform A and B into matched singular genomes A‡ and B‡ as follows: for each common marker m ∈ G, match min{Φ_A(m), Φ_B(m)} of its occurrences in A with as many of its occurrences in B; all unmatched occurrences are then treated as exclusive markers. The matched occurrences receive the same identifier (for example, by adding the same index) in A‡ and in B‡. Examples are given in Fig. 5 (top and center). Observe that, after this procedure, the number of common markers between any pair of matched genomes A‡ and B‡ is n* = Σ_{m ∈ G} min{Φ_A(m), Φ_B(m)}. Let M be the set of all possible pairs of matched singular genomes obtained from natural genomes A and B. The DCJ-indel distance of A and B is then defined as

d_DCJ^id(A, B) = min_{(A‡, B‡) ∈ M} { d_DCJ^id(A‡, B‡) }.
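Under the maximal matches model, n* depends only on the multiplicities. The following sketch (our own naming, genomes given as one marker name per occurrence) computes it:

```python
from collections import Counter

def num_matched_markers(markers_a, markers_b):
    """n* = sum over common markers m of min(Phi_A(m), Phi_B(m)),
    the number of common markers of any matched pair (A', B')."""
    phi_a, phi_b = Counter(markers_a), Counter(markers_b)
    return sum(min(phi_a[m], phi_b[m]) for m in phi_a.keys() & phi_b.keys())
```

For instance, genomes with occurrence lists 1 1 2 and 1 2 2 share one matched copy of each marker, so n* = 2.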

Multi-relational diagram
While the original relational diagram clearly depends on the singularity of common markers, when they appear in multiple copies we can obtain a data structure that integrates the properties of all possible relational diagrams of matched genomes. The multi-relational diagram MR(A, B) of two natural genomes A and B also has a set V(A) with a vertex for each of the two extremities of each marker occurrence of genome A and a set V(B) with a vertex for each of the two extremities of each marker occurrence of genome B. Again, sets E_adj^A and E_adj^B contain adjacency edges connecting adjacent extremities of markers in A and in B. But here the set E_γ contains, for each marker m ∈ G, an extremity edge connecting each vertex in V(A) that represents an occurrence of m^t to each vertex in V(B) that represents an occurrence of m^t, and an extremity edge connecting each vertex in V(A) that represents an occurrence of m^h to each vertex in V(B) that represents an occurrence of m^h. Furthermore, sets E_id^A and E_id^B contain, for each occurrence in A of a marker m with ∆Φ(m) > 0 and for each occurrence in B of a marker m with ∆Φ(m) < 0, an indel edge connecting the vertices representing its two extremities.

A consistent decomposition D of MR(A, B) is a selection of its edges that corresponds to the relational diagram R(A‡, B‡) of some pair of matched genomes (A‡, B‡) ∈ M: the chosen extremity edges induce a matching of the occurrences of each common marker, unmatched occurrences keep their indel edges, and every vertex has degree one or two, so that D decomposes into cycles and paths. For a consistent decomposition D we have

d_DCJ^id(D) = n* − c_D − i_D/2 + Σ_{C ∈ D} λ(C) − δ_D,

where c_D and i_D are the numbers of AB-cycles and AB-paths in D, respectively, and δ_D is the optimal deduction of recombinations of paths from D. Since n* is constant for any consistent decomposition, we can separate the part of the formula that depends on D, called the weight of D:

w(D) = c_D + i_D/2 − Σ_{C ∈ D} λ(C) + δ_D,

so that d_DCJ^id(D) = n* − w(D).

Theorem 2. Given two natural genomes A and B, the DCJ-indel distance of A and B can be computed by the following equation:

d_DCJ^id(A, B) = min_{D ∈ D} { d_DCJ^id(D) } = n* − max_{D ∈ D} { w(D) },

where D is the set of all consistent decompositions of MR(A, B).
Proof. Since a consistent decomposition corresponds to matching the duplicates in both genomes, it is clear that d_DCJ^id(A, B) ≤ min_{D ∈ D} {d_DCJ^id(D)}. Now suppose that d_DCJ^id(A, B) < min_{D ∈ D} {d_DCJ^id(D)}. Then an optimal rearrangement scenario from A to B implies a matching between the markers of A and the markers of B that gives rise to a consistent decomposition D' of MR(A, B) such that d_DCJ^id(D') < min_{D ∈ D} {d_DCJ^id(D)}, which is a contradiction.
A consistent decomposition D such that d_DCJ^id(D) = d_DCJ^id(A, B) is said to be optimal. Computing the DCJ-indel distance between two natural genomes A and B or, equivalently, finding an optimal consistent decomposition of MR(A, B), is an NP-hard problem. In Section 6 we will describe an efficient ILP formulation to solve it. Before that, we need to introduce a transformation of MR(A, B) that is necessary for our ILP.

Capping
The ends of linear chromosomes produce some difficulties for the decomposition. Fortunately there is an elegant technique to overcome this problem, called capping [11]. It consists of modifying the genomes by adding artificial singular common markers, also called caps, that circularize all linear chromosomes, so that their relational diagram is composed of cycles only, but, if the capping is optimal, the genomic distance is preserved.

Capping of canonical genomes
When two singular genomes A and B have no exclusive markers, they are called canonical genomes.
The graph R(A, B) of canonical genomes A and B has no indel edges, and the indel-potential of any component C is λ(C) = 0. In this case, the upper bound given by Lemma 1 is tight, and the distance formula can be simplified to d_DCJ^id(A, B) = |G| − c − i/2, as was already shown in [2]. Also, obtaining an optimal capping of canonical genomes is quite straightforward [4,11,20], as shown in Table 4: the caps should guarantee that each AB-path is closed into a separate AB-cycle, and each pair composed of an AA- and a BB-path is closed into an AB-cycle by linking each extremity of the AA-path to one of the two extremities of the BB-path (there are two possibilities of linking, and either of the two is optimal). If the numbers of linear chromosomes in A and in B are different, there will be some AA- or BB-paths remaining. For each of these, an artificial adjacency between caps is created in the genome with fewer linear chromosomes, and each artificial adjacency closes each remaining AA- or BB-path into a separate AB-cycle.
Let κ_A be the total number of linear chromosomes in A and κ_B be the total number of linear chromosomes in B. The difference between the numbers of AA- and BB-paths is equal to the difference between κ_A and κ_B. In other words, if R(A, B) has a AA-paths, b BB-paths and i AB-paths, the number of artificial adjacencies in such an optimal capping is exactly a* = |κ_A − κ_B| = |a − b|. Moreover, the number of caps to be added is p* = max{κ_A, κ_B}.

Table 4. Linking paths from R(A, B) of canonical genomes. The symbol Γ_A represents an artificial adjacency in A and the symbol Γ_B represents an artificial adjacency in B. The value ∆d corresponds to ∆n − ∆c − ∆(i/2).

We can show that the capping described above is optimal by verifying the corresponding DCJ-indel distances. Let the original genomes A and B have n markers and let R(A, B) have c AB-cycles, besides the paths. Then, after capping, the circular genomes A• and B• have n' = n + p* markers and R(A•, B•) has c' = c + i + max{a, b} AB-cycles and no path. Since 2κ_A = 2a + i and 2κ_B = 2b + i, we have p* = max{κ_A, κ_B} = max{a, b} + i/2, so that

d_DCJ^id(A•, B•) = n' − c' = (n + p*) − (c + i + max{a, b}) = n − c − i/2 = d_DCJ^id(A, B).

An example of an optimal capping of two canonical linear genomes is given in Fig. 6.
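The two quantities of an optimal capping follow directly from the numbers of linear chromosomes; a trivial sketch (our own naming):

```python
def capping_counts(kappa_a, kappa_b):
    """Caps and artificial adjacencies of an optimal capping:
    p* = max(kappa_A, kappa_B), a* = |kappa_A - kappa_B|."""
    return max(kappa_a, kappa_b), abs(kappa_a - kappa_b)
```

For example, three linear chromosomes in A against one in B require three caps per genome and two artificial adjacencies in B.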

Singular genomes: correspondence between recombinations and capping
When exclusive markers occur, we can obtain an optimal capping by simply finding caps that properly link the sources of each recombination group (listed in Table 3) into a single AB-cycle. Indeed, in Table 5 we give a linking that achieves the optimal ∆d for each recombination group, followed by the optimal linking of the remaining paths. The remaining paths are treated exactly as in the linking of paths of canonical genomes. By greedily linking the paths following a top-down order of Table 5 we clearly obtain an optimal capping that transforms A and B into circular genomes A• and B• with d_DCJ^id(A•, B•) = d_DCJ^id(A, B). See an example in Fig. 7. Furthermore, similarly to the case of canonical genomes, the numbers of artificial adjacencies and caps in such a capping are respectively a* = |κ_A − κ_B| and p* = max{κ_A, κ_B}, as we will show in the following.
In Table 5 we can observe that there are two types of groups: (i) balanced groups, which contain the same number of AA- and BB-paths, and (ii) unbalanced groups, in which the numbers of AA- and BB-paths are distinct. Unbalanced groups require some extra element to close the cycle. This element can be an indel-free AA- or BB-path (of the type that is under-represented in the group) or, if no such path exists, an artificial adjacency either in genome A or in genome B (again, of the genome that is under-represented in the group). We then need to examine these unbalanced groups to determine the number of caps and of artificial adjacencies that are required for an optimal capping.

Proposition 1. After identifying the recombination groups, either we have only unbalanced groups that are over-represented in genome A or we have only unbalanced groups that are over-represented in genome B.
Table 5. Linking sources of chained recombination groups from Table 3. The symbol Γ_A represents an artificial adjacency in A and the symbol Γ_B represents an artificial adjacency in B. The notation AA_ε ≺ Γ_A means that an AA_ε-path is preferred to close the cycle, but if it does not exist, we take an artificial adjacency in A. In order to give the correct order of linking, we sometimes need to represent a path AB_AB by BA_BA and a path AB_BA by BA_AB. The value ∆d corresponds to ∆n − ∆c − ∆(i/2) + ∆λ. Unbalanced groups over-represented in genome A are marked with a "∪", while unbalanced groups over-represented in genome B are marked with a "∩". (Columns: id, sources, linking, AB-cycle, T, ∆n, ∆c, ∆(i/2), ∆λ, ∆d.)

Proof. It is clear that, after P and until N, we have either only groups of type W* (over-represented in A) or only groups of type M* (over-represented in B). The question is whether groups in N that are over-represented in B are compatible with previous groups of type W* and, symmetrically, whether groups in N that are over-represented in A are compatible with previous groups of type M*.
Let us examine the case of group ZZW. (i) At first glance one could think that this group is compatible with MMW. However, if all components of these two unbalanced groups were in the graph, we would instead have two occurrences of group MZW, which is balanced and located before the two other groups in the table (observe that 2 × MZW has a smaller ∆d than ZZW + MMW). (ii) When we test the compatibility of ZZW with MMW, we see that with the same components we would get MMWW, which is balanced and located before the two other groups in the table (observe that MMWW has the same ∆d as ZZW + MMW).
With a similar analysis we can show that for all cases either we have only unbalanced groups that are over-represented in genome A or we have only unbalanced groups that are over-represented in genome B.

Proposition 2. When an unbalanced group is being linked, either there is a remaining AA- or BB-path (of the genome that is under-represented), which is then used to link the group, or there is no such remaining path and an artificial adjacency links the group.
Proof. First we observe that, after distributing all paths of the relational diagram among the recombination groups, following the top-down greedy approach, there could be AA- and/or BB-paths remaining that were not assigned to any group, and these might be useful to link unbalanced groups. We will now examine the procedure of linking the unbalanced groups either with those remaining paths or with artificial adjacencies.
A particular case is that of the unbalanced groups from T. Since all unbalanced groups in T have analogous compositions, without loss of generality suppose a group over-represented in genome A, of type WWM, is being linked. If, at this point, there is a remaining indel-enclosing BB-path, it cannot be a BB_AB or a BB_B, otherwise with the components of the group being linked and the existing remaining path we could form a balanced group that appears in a higher position of the table, with at least the same ∆d, which is a contradiction. We could, however, have an extra BB_A. In this case we would take the alternative solution of linking each pair AA_AB + BB_A into a separate cycle, that is, twice group WM of S, achieving the same ∆d. If no BB_A remains, we have the standard linking of the three paths into a single cycle including either an indel-free BB-path or an artificial adjacency in B.
The unbalanced groups from S or N are easier to analyze: if one of these groups, over-represented in genome A (respectively in genome B), is being linked, there cannot be any remaining indel-enclosing BB-path (respectively AA-path). We can verify this by supposing, without loss of generality, that an unbalanced group over-represented in genome A is being linked. If, at this point, there is a remaining indel-enclosing BB-path, then with the components of the group being linked and the existing remaining path we could form a balanced group that appears in a higher position of the table, with at least the same ∆d, which is a contradiction.

Propositions 1 and 2 prove the following result.
Theorem 3. Let κ_A and κ_B be, respectively, the total numbers of linear chromosomes in singular genomes A and B. We can obtain an optimal capping of A and B with exactly p* = max{κ_A, κ_B} caps and a* = |κ_A − κ_B| artificial adjacencies between caps.

Capped multi-relational diagram
We can transform MR(A, B) into the capped multi-relational diagram MR•(A, B) as follows. First we create 4p* new vertices, each one representing a cap extremity. Each of the 2κ_A telomeres of A is connected by an adjacency edge to a distinct cap extremity among the 2p* cap extremities assigned to A. Similarly, each of the 2κ_B telomeres of B is connected by an adjacency edge to a distinct cap extremity among the 2p* cap extremities assigned to B. In the genome with fewer linear chromosomes, the cap extremities that are not connected to a telomere are joined in pairs by artificial adjacency edges. All these new adjacency edges and artificial adjacency edges are added to E_adj^A and E_adj^B, respectively. We also connect each cap extremity of A to each cap extremity of B by a cap edge. Note that the 2p* cap extremities added to each genome correspond to p* implicit caps. Furthermore, the number of artificial adjacency edges added to the genome with fewer linear chromosomes is a* = |κ_A − κ_B|. Since each pair of matched singular genomes (A‡, B‡) ∈ M can be optimally capped with this number of caps and artificial adjacencies, it is clear that at least one optimal capping of each (A‡, B‡) corresponds to a consistent decomposition D ∈ D•. An example of a capped multi-relational diagram is given in Figure 8.

An algorithm to compute the DCJ-indel distance of natural genomes

An ILP formulation for computing the distance of two balanced genomes A and B was given by Shao et al. in [18]. In this section we describe an extension of that formulation for computing the DCJ-indel distance of natural genomes A and B, based on consistent cycle decompositions of MR•(A, B). The main difference is that here we need to address the challenge of computing the indel-potential λ(C) for each cycle C of each decomposition. Note that a cycle C of R(A, B) has either zero, or one, or an even number of runs, therefore its indel-potential can be computed as follows:

λ(C) = 0 if Λ(C) = 0; λ(C) = 1 if Λ(C) = 1; and λ(C) = Λ(C)/2 + 1 if Λ(C) ≥ 2.

The formula above can be redesigned into a simpler one that is easier to implement in the ILP.
First, let a transition in a cycle C be an indel-free segment of C that lies between a run in one genome and a run in the other genome, and denote by ℵ(C) the number of transitions in C. Observe that, if C is indel-free, then obviously ℵ(C) = 0. If C has a single run, then we also have ℵ(C) = 0. On the other hand, if C has at least 2 runs, then ℵ(C) = Λ(C). Our new formula is then split into a part that simply tests whether C is indel-enclosing and a part that depends on the number of transitions ℵ(C): λ(C) = r(C) + ℵ(C)/2, where r(C) = 1 if C is indel-enclosing and r(C) = 0 otherwise. Note that ∑_{C∈D} r(C) = c^r_D + s_D, where c^r_D and s_D are the numbers of indel-enclosing AB-cycles and of circular singletons in D, respectively. Now we need to find a consistent decomposition D of MR•(A, B) maximizing its weight w(D) = c̃^r_D − ∑_{C∈D} ℵ(C)/2 − s_D, where c̃^r_D = c_D − c^r_D is the number of indel-free AB-cycles in D.
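The equivalence of the two formulas for the indel-potential can be checked with a short sketch (illustrative only, not part of our implementation; recall that Λ(C) is 0, 1, or an even number ≥ 2):

```python
def indel_potential(num_runs):
    """Piecewise formula for lambda(C), with num_runs = Lambda(C)."""
    if num_runs == 0:
        return 0
    if num_runs == 1:
        return 1
    return num_runs // 2 + 1  # num_runs is even and >= 2

def indel_potential_via_transitions(num_runs):
    """Reformulation lambda(C) = r(C) + aleph(C)/2."""
    r = 1 if num_runs >= 1 else 0             # r(C): is C indel-enclosing?
    aleph = num_runs if num_runs >= 2 else 0  # aleph(C) = Lambda(C) if >= 2 runs
    return r + aleph // 2

# the two formulas agree on all valid run counts
for runs in (0, 1, 2, 4, 6, 8):
    assert indel_potential(runs) == indel_potential_via_transitions(runs)
```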

ILP formulation
Our formulation (shown in Algorithm 1) searches for an optimal consistent cycle decomposition of MR•(A, B) = (V, E), where the edge set E is the union of the disjoint sets of the distinct types of edges. In the first part we use the same strategy as Shao et al. [18]. A binary variable x_e (D.01) is introduced for every edge e, indicating whether e is part of the computed decomposition. Constraint C.01 ensures that adjacency edges are in all decompositions, Constraint C.02 ensures that each vertex of each decomposition has degree 2, and Constraint C.03 ensures that an extremity edge is selected only together with its sibling. Counting the number of cycles in each decomposition is achieved by assigning a unique identifier i to each vertex v_i that is then used to label each cycle with the numerically smallest identifier of any contained vertex (see Constraint C.04, Domain D.02). A vertex v_i is then marked by variable z_i (D.03) as representative of a cycle if its cycle label y_i is equal to i (C.06). However, unlike Shao et al., we permit each variable y_i to take on value 0, which, by Constraint C.05, will be enforced whenever the corresponding cycle is indel-enclosing. Since the smallest identifier of any vertex is 1 (cf. D.02), any cycle with label 0 will not be counted.
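The bookkeeping behind this counting scheme can be illustrated with a plain-Python sketch (a hypothetical helper, not the ILP constraints themselves), assuming vertex identifiers start at 1 as in Domain D.02:

```python
def count_indel_free_cycles(cycles):
    """cycles: list of (vertex_id_set, is_indel_enclosing) pairs.

    Each cycle receives as label the smallest identifier of its vertices
    (the y-value); indel-enclosing cycles are forced to label 0.  A vertex
    is a representative (z = 1) iff its own identifier equals the cycle
    label, so exactly the indel-free cycles are counted.
    """
    count = 0
    for vertices, indel_enclosing in cycles:
        label = 0 if indel_enclosing else min(vertices)   # y-value
        count += sum(1 for v in vertices if v == label)   # z-variables
    return count

# two indel-free cycles and one indel-enclosing cycle -> count is 2
cycles = [({1, 4, 7}, False), ({2, 3}, False), ({5, 6}, True)]
```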
The second part is our extension for counting transitions. We introduce binary variables r_v (D.04) to label runs. To this end, Constraint C.07 ensures that each vertex v is labeled 0 if v is part of an A-run; otherwise it is labeled 1, indicating its participation in a B-run. Transitions between A- and B-runs in a cycle are then recorded by binary variables t_e (D.05). If a transition occurs between any neighboring pair of vertices u, v ∈ V of a cycle, Constraint C.08 causes transition variable t_{u,v} to be set to 1. We avoid an excess of co-optimal solutions by canonizing the locations in which such transitions may take place. More specifically, Constraint C.09 prohibits label changes in adjacencies not directly connected to an indel, and Constraint C.10 does so in edges other than adjacencies of genome A, resulting in all A-runs containing as few vertices as possible.
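The quantity recorded by the t-variables can be sketched as follows (a hypothetical helper, not part of the ILP): walking once around a cycle and counting the positions where the run label r_v changes yields the number of transitions, which for a cycle with at least 2 runs equals Λ(C).

```python
def count_transitions(run_labels):
    """run_labels: r_v values (0 = A-run, 1 = B-run) in cyclic order.

    Counts the neighboring pairs with differing labels, including the
    wrap-around pair, since the labels lie on a cycle.
    """
    n = len(run_labels)
    return sum(run_labels[i] != run_labels[(i + 1) % n] for i in range(n))

# cyclic label sequence with 4 runs -> 4 transitions
# count_transitions([0, 0, 1, 1, 0, 1]) == 4
```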
In the third part we add a new constraint and a new domain to our ILP, so that we can count the number of circular singletons. Let K be the set of circular chromosomes in both genomes and E^k_id be the set of indel edges of a circular chromosome k ∈ K. For each circular chromosome k we introduce a decision variable s_k (D.06) that is 1 if k is a circular singleton and 0 otherwise. A circular chromosome is a singleton if and only if all its indel edges are selected (see Constraint C.11).
The objective of our ILP is to maximize the weight of a consistent decomposition, which is equivalent to maximizing the number of indel-free cycles, counted by the sum over variables z_i, while simultaneously minimizing the number of transitions in indel-enclosing AB-cycles, calculated as half the sum over variables t_e, and the number of circular singletons, calculated as the sum over variables s_k.

Algorithm 1 ILP for the computation of the DCJ-indel distance of natural genomes
Implementation. We implemented the construction of the ILP as a Python application, available at https://gitlab.ub.uni-bielefeld.de/gi/ding.
Comparison to the approach by Lyubetsky et al. As mentioned in the Introduction, another ILP for the comparison of genomes with unequal content and paralogs was presented by Lyubetsky et al. [13]. In order to compare our method to theirs, we ran our ILP using CPLEX on a single thread on the two small artificial examples given on page 8 of that paper. The resulting DCJ distances were the same. A comparison of running times is presented in Table 6.

Table 6. Comparison of running times and memory usage to the ILP in [13].

Performance benchmark
For benchmarking purposes, we used gurobi 9.0 as solver. In all our experiments, we ran gurobi on a single thread.
Generation of simulated data. Here we describe our simulation tool, which is included in our software repository (https://gitlab.ub.uni-bielefeld.de/gi/ding) and used for evaluating the performance of our ILP implementation. Our method samples marker order sequences over a user-defined phylogeny; in this study, however, our simulations are restricted to pairwise comparisons generated over rooted, weighted trees with two leaves. Starting from an initial marker order sequence of user-defined length (i.e., number of markers), the simulator samples Poisson-distributed DCJ events with expectation equal to the corresponding edge weight. Likewise, insertion, deletion, and duplication events of one or more consecutive markers are sampled, yet their frequency is additionally dependent on a rate factor that can be adjusted by the user. The length of each segmental insertion, deletion, and duplication is drawn from a Zipf distribution whose parameters can also be adjusted by the user. At each internal node of the phylogeny, the mutational operations are performed in the following order: DCJ operations, duplications, deletions, insertions. To this end, cut points, as well as locations of insertions, deletions, and duplications, are drawn uniformly over the entire genome.
In our simulations, we used s = 4 for the Zipf distributions of insertions and deletions, and s = 6 for duplications. Unless specified otherwise, insertion and deletion rates were set to 0.1 and 0.2, respectively. We set the length of the root genome to 20,000 marker occurrences.
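The sampling of event counts and indel lengths described above can be sketched in plain Python (a simplified illustration, not the simulator itself; the truncated Zipf support and the use of Knuth's Poisson sampler are assumptions of this sketch):

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's algorithm: number of events on a tree edge with weight lam."""
    if lam == 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_zipf(s, rng, max_len=1000):
    """Truncated Zipf(s): P(length = k) proportional to k**(-s), k = 1..max_len."""
    support = range(1, max_len + 1)
    weights = [k ** (-s) for k in support]
    return rng.choices(support, weights=weights, k=1)[0]

rng = random.Random(42)
num_dcj = sample_poisson(10.0, rng)    # DCJ events on an edge of weight 10
deletion_length = sample_zipf(4, rng)  # segment length of one deletion (s = 4)
```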
Evaluating the impact of the number of duplicate occurrences. In order to evaluate the impact of the number of duplicate occurrences on the running time, we first keep the number of simulated DCJ events fixed to 10,000 and vary parameters that affect the number of duplicate occurrences.
Our ILP solves the decomposition problem efficiently for real-sized genomes under small to moderate numbers of duplicate occurrences: solving times for genome pairs with fewer than 10,000 duplicate occurrences (∼50% of the genome size), shown in Figure 9, are with few exceptions below 5 minutes and exhibit a linear increase, but solving time is expected to grow dramatically with higher numbers of duplicate occurrences. To further explore the conditions under which the ILP is no longer solvable with reasonable compute resources, we continued the experiment with even higher numbers of duplicate occurrences and instructed gurobi to terminate within 1 hour of computation. We then partitioned the simulated data set into 8 intervals of length 500 according to the observed number of duplicate occurrences. For each interval, we determined the average as well as the maximal multiplicity of any duplicate marker and examined the average optimality gap, i.e., the difference in percentage between the best primal and the best dual solution computed within the time limit. The results are shown in Table 7 and emphasize the impact of duplicate occurrences on solving time: below 14,000 duplicate occurrences, the optimality gap remains small and sometimes even the exact solution is computed, whereas above that threshold the gap widens very quickly.

Evaluating additional parameters. So far we examined only the impact of duplicates on the solving times of our program. However, other parameters of our experiment are expected to have an effect on the solving times, too. We ran three experiments, in each varying one of the following parameters while keeping the others fixed: (i) genome size, (ii) number of simulated DCJs and indels, and (iii) number of chromosomes. The duplication rate was fixed at 0.4 for these experiments and the running time was limited to 1 hour.
The results, illustrated in Figures 10, 11 and 12, indicate that the number of linear chromosomes is a major factor in the solving time. At the same time, solving times vary more widely with increasing chromosome number. The latter has a simple explanation: telomeres, represented as caps in the multi-relational diagram, behave in the same way as duplicate occurrences of the same marker do. Increasing their number (by increasing the number of linear chromosomes) exponentially increases the search space of matching possibilities.
Conversely, the number of simulated DCJs and indels has a minor impact on the solving times of our simulation runs. However, while initially growing linearly, the solving times diverge super-linearly for higher numbers of DCJs and indels. Lastly, the genome size has a negligible effect on solving time within the tested range of 20,000 to 50,000 marker occurrences.

Real data analysis
Recently, the first three high-resolution haplotype-resolved human genomes have been published [8]. The study reports an average number of 156 inversions per genome, of which 121 are characterized as simple and 35 as copy-variable inversions. Here, we demonstrate the applicability of our approach to the study of real data by calculating the DCJ-indel distance between one of these haplotypes (HG00514.h0) and the human reference sequence (GRCh38). After the construction of a genomic marker set, we represented each chromosome of both genomes as a marker sequence, with the largest chromosome (chr. 1) comprising close to 18,000 markers. We then ran our ILP for the computation of the DCJ-indel distance on each pair of chromosomes independently. We were able to obtain exact solutions for 17 chromosomes within a few minutes and for two more within a few days. However, the remaining four comparisons did not complete within a time limit of 3 days. Still, after that time, their optimality gaps were below 0.1%. The calculated DCJ-indel distances ranged between 1.3% and 7.7% of the length of the marker sequences, with the number of runs accounting for at least 48.7% of the distance. Further details on the data set, the construction of the genomic markers, and the calculated DCJ-indel distances are described in Appendix A.

Conclusion
By extending the DCJ-indel model to allow for duplicate markers, we introduced a rearrangement model that is capable of handling natural genomes, i.e., genomes that contain shared, individual, and duplicated markers. In other words, under this model genomes require no further processing or manipulation once genomic markers and their homologies are inferred. Although the DCJ-indel distance of natural genomes is NP-hard, we presented a fast method for its calculation in the form of an integer linear program. Our program is capable of handling real-sized genomes, as evidenced by our simulated and real data experiments. It can be applied universally in comparative genomics and enables uncompromising analyses of genome rearrangements. We hope that such analyses will provide further insights into the underlying mutational mechanisms. Conversely, we expect the model presented here to be extended and specialized in the future to reflect the insights gained by these analyses.

A Analysis of high-resolution human genome data
Recently, the first three high-resolution haplotype-resolved human genomes have been published, representing a Han Chinese (HG00514), a Puerto Rican (HG00733), and a Yoruban Nigerian (NA19240) individual, respectively [8]. Each of these individuals contributes two sets of 23 chromosomes that we call genomes h0 and h1, respectively. The analysis in this work is confined to the comparison of the HG00514.h0 genome with the human reference sequence (GRCh38).
Construction of genomic marker set. The computation of the rearrangement measure proposed here depends on pre-defined genomic markers. To obtain such markers from the studied human genome data set, we used GEESE [15]. GEESE implements a heuristic for the genome segmentation problem [19] and takes as input local pairwise sequence alignments. These were computed by LASTZ with parameter settings "--step=10 --gapped --gfextend --ambiguous=iupac --masking=5 --filter=identity:95 --hspthresh=90000". Furthermore, we used a scoring scheme that was inferred from the genomic data set. We then ran GEESE on the computed alignments to construct genomic markers of length at least 500 bp using the following parameter settings: "--minLength 500 --minIdent 90 --maxGap 50 --minAlnLength 20". An overview of the number of markers obtained for each chromosome is given in Table 8. Figure 13 shows a histogram of the multiplicities of markers in both genomes.
Comparison with structural variation data set. To assess the quality of our marker set, we compare the breakpoints that are introduced by the markers with breakpoints of identified structural variations (SVs) in the data set. Chaisson et al. [8] unified call sets of structural variations from a range of different technologies into a single call set that is made available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20180627_PanTechnologyIntegrationSet. Note that the largest fraction (47%) of called SVs is associated with tandem repeats and other repetitive elements, whose exact number and precise location depend heavily on the parameter choices of the employed detection algorithms, as the authors remark in their paper [8]. Structural variations introduce breakpoints in the pairwise alignment of genomic sequences. We assess the quality of our genomic marker set by quantifying the number of markers that are in agreement with these breakpoints, i.e., that are not disrupted by a boundary of an SV from the call set. To this end, we mapped the SV breakpoints to the reference genome and compared them with the marker set of the reference genome. The outcome of this analysis is summarized in the "SV agreement" column of Table 8.

Table 8. Overview of high-resolution human genome dataset. Columns from left to right: chromosome number, number of markers in each of the twenty-three chromosomes of both genomes, agreement of the marker set of the reference genome with the SV call set of Chaisson et al. [8], solving time of the ILP for the pairwise comparison ("-" indicates that no exact solution was found; instead, the optimality gap is specified in brackets), the DCJ-indel distance of the computed solution, and the reported number of runs.
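The agreement test can be sketched as follows (a hypothetical helper, not our actual pipeline; it assumes markers are given as half-open intervals on the reference and SV breakpoints as positions, and a marker agrees if no breakpoint falls strictly inside it):

```python
import bisect

def markers_in_agreement(markers, sv_breakpoints):
    """markers: list of (start, end) intervals on the reference genome;
    sv_breakpoints: breakpoint positions on the reference genome.

    Returns one boolean per marker: True if the marker is not disrupted
    by any SV boundary, i.e., no breakpoint lies strictly inside it."""
    sv_breakpoints = sorted(sv_breakpoints)
    agree = []
    for start, end in markers:
        # first breakpoint strictly to the right of the marker start
        i = bisect.bisect_right(sv_breakpoints, start)
        disrupted = i < len(sv_breakpoints) and sv_breakpoints[i] < end
        agree.append(not disrupted)
    return agree

# breakpoint at 150 disrupts the second marker but not the first
# markers_in_agreement([(0, 100), (100, 200)], [150]) == [True, False]
```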
DCJ-indel distance computation. Using gurobi 9.0 as solver, we applied the ILP described in Section 6 to each of the twenty-three pairs of chromosomes of HG00514.h0 and GRCh38. Each computation was run on a single thread on a compute cluster populated with different state-of-the-art hardware (e.g., Intel Xeon E7540 processors). We set gurobi to terminate the search for an optimal solution after at most 3 days of running time; otherwise the default parameters were used. The results of these calculations are summarized in the last three columns of Table 8. For chromosomes 5, 9, 15 and 16, no optimal solution was found within the time limit. In those cases, column "solving time [s]" shows the optimality gap in brackets instead. The optimality gap is defined as the difference in percentage between the best primal and the best dual solution so far identified by the solver.