ALGORITHMS FOR THE GENOME MEDIAN UNDER A RESTRICTED MEASURE OF REARRANGEMENT

. Ancestral reconstruction is a classic task in comparative genomics. Here, we study the genome median problem , a related computational problem which, given a set of three or more genomes, asks to find a new genome that minimizes the sum of pairwise distances between it and the given genomes. The distance stands for the amount of evolution observed at the genome level, for which we determine the minimum number of rearrangement operations necessary to transform one genome into the other. For almost all rearrangement operations the median problem is NP-hard, with the exception of the breakpoint median that can be constructed efficiently for multichromosomal circular and mixed genomes. In this work, we study the median problem under a restricted rearrangement measure called 𝑐 4 -distance , which is closely related to the breakpoint and the DCJ distance. We identify tight bounds and decomposers of the 𝑐 4 -median and develop algorithms for its construction, one exact ILP-based and three combinatorial heuristics. Subsequently, we perform experiments on simulated data sets. Our results suggest that the 𝑐 4 -distance is useful for the study the genome median problem, from theoretical and practical perspectives.

the sum of rearrangement distances between itself and the given genomes, is computationally intractable even for highly simplified rearrangement distances.
The distance between two genomes depends on the chosen rearrangement operation. For instance, the number of breakpoints between two genomes, i.e., the number of pairs of genomic markers that appear consecutive in one genome but not in the other, gives rise to a simple rearrangement distance. While, strictly speaking, the breakpoint distance underlies no rearrangement operation [6], other distances do, such as the double-cut-and-join (DCJ) distance [14]. A DCJ operation breaks a genome, represented by a set of sequences of genomic markers, at two arbitrary positions and subsequently reconnects the thus created four open ends in a new combination.
Almost all known rearrangement distances can be computed efficiently in linear time under the assumption that genomic markers appear unique in each genome [2,6,14]. However, considering one step forward, constructing a median of three genomes is NP-hard under almost all rearrangement distances, including DCJ [10], with two notable exceptions: the breakpoint distance and the closely related single-cut-or-join (SCJ) are tractable for multichromosomal circular and mixed genomes [5,10].
The breakpoint graph is a common aid in computing rearrangement distances. When comparing two genomes that are permutations of one another, the graph consists of cycles of even length. The contained number of cycles plays an essential role. For instance, the smallest cycles, i.e., cycles of length 2 represent adjacencies in the breakpoint graph, which are the counterpart of breakpoints. Larger cycles represent the requirement of one or more DCJ operations to transform one genome into the other. Thus, to compute the breakpoint distance, cycles of length 2 are counted (and this quantity is then subtracted from the number or markers of the two genomes). Likewise, to compute the DCJ distance, cycles of any length are counted.
It is natural to also consider intermediary distance measures. In this work we study the 4 -distance between two genomes, which is based on the number of cycles of length up to 4 in the breakpoint graph. In other words, only those cycles are counted that require at most one DCJ operation to be transformed into adjacencies. It is important to highlight that the 4 -distance violates the triangle inequality and therefore is not a metricunlike the breakpoint and DCJ distance. Here, we address the 4 -median problem, i.e., the construction of a new genome that minimizes the sum of pairwise 4 -distances between it and each member of a set of three or more genomes. In particular, we are interested in finding 4 -medians for three genomes.
This work is an extension of the abstract submitted to the 7th Theoretical Computer Science Meeting (VII ETC -Encontro de Teoria da Computação, in portuguese) and the main new contributions are the following: -We present properties and upper bounds to the 4 -distance (Sects. 3, 4); -We improve the first version of the ILP, making it faster, and now it can handle instances with up to 2000 markers with neglegible gaps (Sect. 5.1); -We develop two new combinatorial heuristics to compute the 4 -distance of given genomes (Sects. 5.2, 5.3); -We extend the experiments with different data sets for the new algorithms (Sect. 6). This paper is structured as follows. Section 2 provides basic definitions and notation, including the definition of the median problem. We then present bounds for the median in Section 3. Section 4 addresses the characterization of decomposers, which are building blocks of median genomes. In Section 5, we describe algorithms to compute the 4 -median, including one exact ILP-based and three heuristics. Experimental results on simulated instances are presented in Section 6. Section 7 concludes this paper and presents future works.

Preliminaries
A marker is an oriented DNA fragment. We denote a marker either by or by , depending on its orientation in the DNA strand. A chromosome is a sequence of markers and can be linear or circular. A linear chromosome has two extremities and each one is a telomere. We use a string of markers to represent a chromosome and we add parentheses at the extremities of the represented string to denote a circular chromosome. A genome is a collection of chromosomes.
A marker has two distinct extremities, called tail and head, represented by and ℎ , respectively. An adjacency in a chromosome is conformed by either the extremity of a marker adjacent to a telomere, or by a pair of consecutive marker extremities. As an example, the adjacencies 1 5 ℎ , 5 2 , 2 ℎ 4 , 4 ℎ 3 , 3 ℎ 6 and 6 ℎ 1 ℎ define a circular chromosome. Another representation to it is (5 2 4 3 6 1). A multichromosomal genome is a set of chromosomes such as {(3), 5 2 6, (4 1)}, which is composed of three chromosomes, one linear and two circular.

Classical distances
One can compute a rearrangement distance between two given genomes with support of an equivalent structure known as breakpoint graph [1]. Let M be a set of markers and define M the set of extremities of all markers in M, with |M | = 2 . For two genomes and , each one with markers from M, the breakpoint graph BG( , ) is the multigraph whose vertex set is M and the edges are of two types: -edges and -edges, corresponding to adjacencies in genomes and , respectively. This graph has vertices with degree zero, one or two, and thus it is a collection of paths and cycles. If is the number of common non-telomeric adjacencies between and and is the number of common telomeres, the breakpoint distance [10] between and is Notice that the breakpoint distance is equivalent to the so called single-cut-or-join (SCJ) distance [5]. Since is the number of nontelomeric adjacencies in BG( , ), each one of these adjacencies represents a cycle of length 2 in BG( , ) and one can denote it by 2 = . We call a cycle of length a -cycle.
On the other hand, if is the number of cycles (of any length) and is the number of paths with an even number of edges in BG( , ), the double-cut-and-join (DCJ) distance [14] between and is Breakpoint/SCJ and DCJ distances can be computed efficiently [2,5,10].

Multichromosomal circular genomes
Chromosomes and plasmids of single-celled organisms such as bacteria and archaea, mitochondrial DNA within eukaryotic cells, and chloroplast DNA in plants are examples of circular chromosomes/genomes and motivate the study of the circular genome median. Confining to circular chromosome also avoids a lengthy treatment of distance formulas to account for telomeric adjacencies [7,8,11]. Therefore, from now on we consider only multichromosomal genomes with circular chromosomes.

Median problem
Let Π be a set with ≥ 3 genomes, each one with markers from M. The (genome) median problem on Π asks for finding a genome Γ with markers from M minimizing the pairwise distances between Γ and each genome in Π, under a rearrangement operation. If the operation is the breakpoint/SCJ, then the median can be computed in polynomial time [5]. However, for the DCJ operation, the median problem is NP-hard, even for = 3 [3,10]. Motivated by the searching for where is the boundary of efficiency-hardness of the median problem, one can define a new measure. The 4 -distance, denoted by d 4 , between Π and Π is given by where is the number of markers in M and in both Π and Π , and ℓ is the number of ℓ-cycles in BG(Π , Π ), ℓ ∈ {2, 4}. As mentioned before, the triangle inequality does not hold for d 4 as we can see in a simple example: Given genomes Π 1 = (1 3 2 4), Π 2 = (1 2 3 4) and Π 3 = (1 2 3 4), we have that Let Π = {Π 1 , . . . , Π } be a set of ≥ 3 genomes and Γ be a genome. The 4 -cost (Π, Γ) of Γ given Π is We say that a genome Γ is a 4 -median of a set of ≥ 3 genomes Π if Γ minimizes the 4 -cost (Π, Γ). For a given set of genomes Π, we denote by ⋆ (Π) the value of its 4 -median: ⋆ (Π) = min{ (Π, Γ) : genomes in Π and genome Γ}.
Thus, we can formally state the following: given ≥ 3 genomes in Π, each with markers from M, find a genome Γ with markers from M such that ⋆ (Π) = (Π, Γ).
We are particularly interested in the simplest version of the 4 -Median problem, when = 3.

Graph formulation
We can rephrase the 4 -Median in terms of graphs. We assign each extremity of a marker from the set of markers M to a vertex in a graph and thus | ( )| = 2 . An adjacency in a given genome Π is represented as an edge in with color , that is, an adjacency in Π is an edge in with color . Thus, is a -edge-colored multigraph. Observe that is a generalization of the breakpoint graph [1] for at least three genomes, and we call it an extended breakpoint graph for Π, denoted by BG (Π) = .

Bounds
In this section we give upper bounds to Max-2/4-Cycles, such that the instance is a 3-edge-colorable graph with 2 vertices.
We consider graphs without loops. Let be a graph and , ∈ ( ). The number of edges between and is the multiplicity of , , which is denoted by ( , ). Let be an edge which is incident to and . If there is no harm of confusion, then we write = and we say that is an edge of multiplicity ( , ). An edge of multiplicity 1 is also called a simple edge, and an edge of multiplicity at least 2 is called a multiedge. If ( , ) ≤ 1 for all , ∈ ( ), then is called a simple graph. A 3-regular graph is also called a cubic graph. A cubic graph is 3-edge-colorable if its edges set can be partitioned into three perfect matchings, which are also called the color classes of a 3-edge-coloring of .
Let be a 3-edge-colorable cubic graph of order 2 and Γ be a color class of . Then Γ has -colored 2-cycles and hence, ( Note that every edge of is contained in at most one -colored cycle of Γ and that an -colored 2-cycle contains precisely one edge of and an -colored 4-cycle contains precisely two edges of . Let 3 2 be the unique cubic graph with two vertices. Proof. The first part follows directly from the 4 -distance definition. If = 3 2 , then ( ) = 3. For the other direction, choose Γ such that ( Γ ) = ( ). If ( ) = 3 2 | ( )|, then it follows by the remarks above that Γ has -colored 2-cycles only. Hence, = 3 2 .
Let ≥ 2 and be a path on vertices 1 , . . . , in this order and let ′ be a copy of . Let L( ) be the cubic graph obtained from and ′ by adding the edges ′ for ∈ {1, . . . , } and duplicating edges 1 We call a graph a linear ladder if it is isomorphic to L( ) for an integer ≥ 2. See Figure 2 for an example.
Let 3 4 be the connected cubic graph on 4 vertices which has multiedges. Note that ( 3 4 ) = 5 = | ( 3 4 )|+⌊ 2 ⌋. Therefore, the bound of Theorem 2 is tight for this graph. However, this graph can be characterized by its 4median. Proof. Let be of order 2 . It follows with Theorem 2, that = . Hence, the edges of multiplicity 2 form a perfect matching in . Thus, is obtained from an even cycle by doubling every second edge. Now it easy to see that 3 4 is the only graph with ( ) = 5 2 . The other direction is trivial. As we will see in Section 4, a special type of ladder introduced below, called circular ladder, is a strong decomposer and this is a motivation to show the next results on simple graphs. That is, for simple graphs we can prove some better bounds. For ∈ {0, 2} there are infinitely many graphs which attain the bound of Theorem 2. Maybe, it is true that the family of Figure 2 characterizes the graphs with maximum ( ) and two multiedges. It is unclear whether such graphs exist for > 2. It could be that if a connected cubic graph has more than two multiedges, then ( ) < 2 + ⌊ 2 ⌋.
Let ≥ 3 and be a cycle on vertices 1 , . . . , in this order and let ′ be a copy of . Let 1 ( ) be the cubic graph obtained from and ′ by adding the edges ′ for ∈ {1, . . . , } and 2 ( ) be the cubic graph which is obtained from 1 ( ) − { 1 , ′ ′ 1 } by adding the edges ′ 1 and ′ 1 . Graph 2 ( ) is also called a Möbius ladder.

Decomposers
A popular strategy for constructing solutions to median problems is to decompose into smaller parts and then identify partial solutions thereof. These are subsequently integrated into a complete median [4,[9][10][11][12][13]. In the following, we make use of notation from [12] to characterize such partial solutions for Max-2/4-Cycles.
Decomposers of related problems. In dissecting the computational problem of finding 4 -medians, a straightforward question is whether some decomposers of its related problems, the breakpoint and the DCJ median, are also decomposers of a 4 -median. Here, we look at simple decomposers, in particular 2 2 , a graph of two vertices connected by two distinctly colored parallel edges, and 3 2 , a graph of two vertices connected by parallel edges of all three distinct colors. Proposition 7 ([9]). 2 2 and every connected component in is a decomposer. Further, 3 2 is a strong decomposers of the breakpoint median of three.
While 3 2 is a strong decomposer of the 4 -median as shown above, we observe that 2 2 may be a decomposer, as we can see in Figure 3.
Adequate subgraphs [12] are a family of decomposers of the DCJ median of three genomes: A subgraph ⊂ is an adequate subgraph if ( ) ≥ 3 4 | ( )|.
Proof. Figure 3 is a trivial counterexample. A more illustrative example is the following: Cycles of four vertices

Algorithms
In this section we present four algorithms for the Max-2/4-cycles: one exact ILP-based algorithm and three greedy heuristics.
Suppose we have an instance of Max-2/4-cycles: a graph with 2 vertices, > 0, and a set of three 1-regular graphs Π = {Π 1 , Π 2 , Π 3 } such that each Π has vertex set ( ). Saying in other words, Π has edges, i.e., it is a perfect matching in . Algorithms in this section receive and Π and build and return an 1-regular graph Γ in maximizing the number of 2-and 4-cycles in Γ .

ILP formulation
Our exact ILP algorithm translates the minimization formula for the 4 -median problem in a straightforward way. See Algorithm 1.  We establish that one, and only one, edge in the solution is chosen for each possible in and if the edge in is chosen, then the binary variable receives 1, otherwise it receives 0 (constraint (C.01) and binary variable (D.01)). If the pair of distinct edges and is chosen in , then the binary variable , receives 1. Otherwise, , gets 0 (constraint (C.02), binary variable (D.02)). Finally, we maximize the amount of 2-and 4-cycles with the variables and , , respectively, which correspond to the ILP objective function. Observe that the choices of edges conforming 2-cycles are linked to the counter when an edge is chosen to be part of the solution. And the choice of edge pairs for 4-cycles in the variable , it is performed when the pair is in , with ∈ {1, 2, 3}. The constraint is divided into two parts, since this speeds up the process in the search for the optimal solution.
Lastly, observe that we have ( 2 ) constraints and binary variables in the Algorithm 1.

Induced bicolored cycles
The basic idea behind Algorithm 2 is the following. We take two of the perfect matchings Π and Π from Π in , with , , ∈ {1, 2, 3} and ̸ = , and build an induced graph , = [Π ∪ Π ]. Observe that , is a collection of bicolored cycles, with alternating -and -colored edges. For each bicolored cycle in , , we add edges = in Π to such that both extremities and are vertices of , obtaining the subgraph ′ which is a linear ladder. Notice that ′ may have vertices of degree either 2 or 3. Then we find a set of edges Γ , that maximizes the number of 2-and 4-cycles in ′ . We repeat this process to each pair , ∈ {1, 2, 3}, ̸ = , and return the best of three.

Shrinking adjacencies
Shrinking adjacencies is a greedy strategy to find an 1-regular graph Γ with vertex set ( ) of a given graph defined by three 1-regular graphs Π 1 , Π 2 and Π 3 each one with vertex set ( ). Remember that has 2 vertices and 3 edges. Let , be vertices in ( ). Let and be vertices in ( ) such ̸ = , and , are -colored edges. For each color ∈ {1, 2, 3}, let ′ := + − and set color for the new edge . We call the edge a shrinked adjacency in . Notice that ′ is a 3-edge-colorable graph. Moreover, notice that a shrinking procedure removes three to six edges and adds zero to three new edges from to ′ , depending on how many edges with endpoints and there exist in . Thus, the number of edges in ′ is 3 − 3. Algorithm 3 implements this idea. The criterion in the line 2 of Algorithm 3 is described below.
Shortest cycle criterion: for ℎ ≥ 1, let ℎ be a 3-edge-colorable graph given as an instance of ℎ-th recursive call of Algorithm 3. Notice that each edge in a 3-edge-colorable graph belongs to exactly two bicolored cycles. Thence, in order to describe the shortest cycle criterion, consider a quadruple (vl ( ), rl ( ), sh( ), lg( )), for each edge = ∈ ( ℎ ) such that vl ( ) ∈ {0, 1} denotes the contribution value of the edge ; rl ( ) = T denotes an edge in , and rl ( ) = F otherwise; and sh( ) and lg( ) are the lengths of the two bicolored cycles whose edge belongs to, with sh( ) ≤ lg( ) (longest and shortest cycles).
Initially, in graph = 1 , vl ( ) = 1 and rl ( ) = T for each edge ∈ ( ). Algorithm 3 chooses an -colored edge ∈ ( ℎ ) according to the following. Suppose that = and , are vertices such that ̸ = , = and = are -colored edges. When vertices and are removed from ℎ then, for each , we define an -colored edge = , set rl ( ) = F and vl ( ) = 1 if rl ( ) = rl ( ) = T; otherwise, vl ( ) = 0. Notice that removing vertices and implies in removing edges which, in turn, implies that the quadruple attributes of all remaining edges, of cycles where those edges belong to, must be updated. This means that in each recursive call ( ) edges must have their attributes updated.
Considering the quadruple attributes, we define a total order ⪯ on the set of edges of ℎ . Given two edges 1 and 2 , we say that 1 ⪯ 2 if one and only one of the following conditions holds: We say that ∈ ( ℎ ) is an optimal edge for the shortest cycle criterion if ⪯ for each ∈ ( ℎ ). In each recursion call of Algorithm 3, an optimal edge is chosen.
Finally observe that, in each recursive call, we have to find an edge = according to the shortest cycle criterion, which takes ( ) time, and we have to remove and from ℎ , which takes (1) time. Then, we have to update the quadruple attributes of all edges in cycles involved in this operation and it can be performed in ( ) time. Thus, Algorithm 3 spends time ( ) in each recursive call and therefore its running time is ( 2 ).

Edge scores
Let R be the set of reliable edges, and an edge in belongs to R if ( ) ≥ 2. Let Γ be a 1-regular graph with vertex set ( ). Define the score of an edge in Γ as ( ) = + 1 2 , where is the number of -colored 2-cycles and is the number of -colored 4-cycles such that belongs to them in Γ , with ∈ {1, 2, 3}. It is easy to see that 0 ≤ + ≤ 3. Let (Γ) := ∑︀ ∈Γ ( ) and observe that (Γ) = ( Γ ). The two edges in Γ of an -colored 4-cycle in Γ are called sibling edges. Figure 1b  2 and (2 2 ℎ ) = 0. Algorithm 4 starts choosing reliable edges and arbitrary remaining edges to be part of an initial solution, obtaining a 1-regular graph Γ with vertex set ( ). Then, it computes score and cycle potential for each edge in Γ. The next step is trying to increase the score of an edge, and to decrease the cycle potential of a small subset of edges as well, through local changes. Let be an edge in Γ such that ( ) > 0. Subsequently, for ∈ {1, 2, 3}, there is at least one pair of -colored edges in , say 1 and 1 , such that 1 1 ̸ ∈ Γ. Since Γ is a perfect matching, 1 and 1 are saturated vertices and let 1 2 and 1 2 be these edges in Γ. Now, the algorithm removes 1 2 , 1 2 and adds 1 1 , 2 2 to Γ and the score and cycle potential of the edge must be updated, as well as for the sibling edges of the removed edges 1 2 and 1 2 . Additionally, the score and cycle potential of the new edges 1 1 and 2 2 must be computed. Algorithm allows this operation if and only if 1 2 and 1 2 are not reliable edges. It repeats this process while the sum of the scores of all edges increases from one step to the next and there is an edge with positive cycle potential.

Algorithm 4. Edge scores.
Input: 3-edge-colorable cubic graph obtained by the perfect matchings in Π = {Π1, Π2, Π3} Output: 1-regular graph Γ with vertex set ( ) such that (Γ) is maximum 1: let R be the set of reliable edges of 2: let Γ be a 1-regular graph comprised of R and arbitrary remaining edges Observe that the search in line 4, for an edge in Γ with cycle potential positive, takes ( ) time. Besides that, each line from 5 to 11 can be performed in constant time. Since (Γ) can be increased ( ) times, we have that the running time of Algorithm 4 is ( 2 ).

Experiments and performance evaluation
We simulated multiple genomes in order to (i) sketch the boundaries where our ILP (Algorithm 1) can perform in reasonable time whilst providing an acceptable accuracy, and (ii) evaluate the quality and running times of our heuristics (Algorithms 2-4). Experiments were run using 3.6 GHz CPUs. We implemented the heuristics in Python 3 and used Gurobi 9.0.2 as ILP solver with 8 cores.
The simulated instances were built as follows. Given a root genome with markers, a descendant trio is a set of three genomes, each one generated by simulating independently 100 · random DCJs in the root genome (i.e., is a percentage of the root genome size ranging in {10, 15, 20, 25}).  In order to evaluate the performance of Algorithm 1 for large genomes, we generated root genomes with 1000, 1100, 1200, . . . , 2000 markers distributed in several circular chromosomes and then for each root genome, four descendent trios ranging in {10, 15, 20, 25}. In the solver execution command, we set the time limit to 10 h. For this dataset, the solver exceeded the time limit for genomes with > 1100 markers when = 25. See Figure 5.
Second, in order to stress the heuristics and evaluate their performance, we simulated datasets with large genomes, from 1000 to 10 000 markers and = 25. Figure 6 shows running times and 4 -distances for Algorithms 2-4. Algorithm 4 always returned the smallest 4 -distances, followed by Algorithm 3 with a difference of 0.71%, on average.
Finally, after performing experiments with ILP and heuristics, we compare the returned distances. Due to the ILP limitations, we first generate root genomes with 50, 100, 150, . . . , 500 markers spread over several circular chromosomes and then, for each root genome, four descending triplets with = 25. We then carried out comparisons of the results obtained by the three heuristics with the optimal distances obtained by the ILP. Figure 7 shows that the "Edge scores" heuristic returns distances closer to the optimal ones with a difference of 2.8% on average. Then the heuristics "Shrinking adjacencies" and "Induced bicolored cycles" have, respectively, differences of 5.5% and 15.9% on average in comparison to the optimal distances.

Conclusion
In this work we study the genome median problem under the 4 -distance, a restricted measure of rearrangement. We show bounds and properties concerning this measure, and establish connections with previous works on the breakpoint and the DCJ median problems. We also develop algorithms, one exact ILP-based and three combinatorial heuristics, which allow us to perform experiments on simulated data sets and to provide many insights about the problem. Moreover, this work offers many perspectives for future research as detailed below.
From the theoretical perspective, the computational complexity of the problem is still open, although we conjecture it is NP-hard. Additionally, there is room to deepen the relationships between this and Xu's work [12], especially with respect to decomposers and adequate subgraphs.
Algorithms proposed in this work give a practical perspective to the problem and allow to compare their results to those for the breakpoint and the DCJ median, and we will do so in future work. Particularly, the bounds obtained in Section 3 can give support to design a strategy to speed up Algorithm 1, such as a branch and bound algorithm. On the other hand, these bounds can help us to establish approximations factors to the developed heuristics (Algorithms 2-4). Besides that, a real data analysis, of single-celled organisms or mitochondrial DNA of more complex organisms should give us more information about the behaviour of this measure in practice. Furthermore, we can possibly extend this work to multichromosomal linear genomes, using a classic approach to deal with capping [7,8,11], in which we can safely transform linear into circular chromosomes and preserve the distances.