Generalizations of the Genomic Rank Distance to Indels

The rank distance model, introduced by Zanetti et al. in 2016, represents genome rearrangements in multichromosomal genomes by treating genomes as matrices. So far, this model has only supported comparisons between genomes with the same gene content. We seek to generalize it to genomes with different gene content. In this paper, we approach this generalization from two different angles, both using the same representation of genomes, and both leading to simple distance formulas and sorting algorithms for genomes with different gene content but without duplications.

Proof. Let $x = (AB)^k y$ for $k \geq 0$. We will prove the lemma by induction on $k$.
The base case is $k = 0$, i.e., $x = y$. Then $x - y = 0$, which belongs to $\mathrm{im}(A - B)$. Now suppose that $(AB)^k y - y \in \mathrm{im}(A - B)$ for all $y$ and some $k \geq 0$, and let $x = (AB)^{k+1} y$. Define $z = (AB)^k y$. By the induction hypothesis, $z - y \in \mathrm{im}(A - B)$. Additionally, $x = ABz$. Since $Bz \neq 0$, we can use Lemma 1 to write [...], and therefore $x - z \in \mathrm{im}(A - B)$. Finally, $x - y = (x - z) + (z - y) \in \mathrm{im}(A - B)$.
□

The following result classifies the various types of connected components in the augmented breakpoint graph according to the orbits they contain.

Lemma 3.
1. A cycle contains two orbits, both balanced.
2. If a path is proper, $A$-null, or $B$-null, it contains a single orbit, which is balanced when the path is proper, and unbalanced otherwise.
3. An $AB$-null path contains two orbits, one balanced and one unbalanced.
4. An $AA$-null or $BB$-null path contains two orbits, both unbalanced.

Proof.
1. A cycle in $BG(A, B)$ must have an even number of vertices because its edges alternate between edges in $A$ and edges in $B$. In a cycle $v_1, v_2, \ldots, v_{2k}$, the mapping $AB$ corresponds to walking two steps in one direction, while $BA$ corresponds to walking two steps in the other direction. This means that the odd-numbered vertices are all equivalent to one another, as are all the even-numbered vertices. Since the cycle is even, no odd-numbered vertex is equivalent to an even-numbered vertex. Therefore, we end up with two orbits: $\{v_1, v_3, \ldots, v_{2k-1}\}$ and $\{v_2, v_4, \ldots, v_{2k}\}$. Notice also that [...], showing that each orbit is balanced.
2. In a path $v_1, v_2, \ldots, v_k$, as in the case of a cycle above, all the odd-numbered vertices are pairwise equivalent, as are all the even-numbered vertices. However, if there is a free end in the path, and there are at least two vertices, this free end is equivalent to its neighbor, making all the vertices in the path equivalent. If the path consists of a single vertex, then it is clearly a singleton orbit. In both cases, we have a single orbit. If the path is proper and has at least two vertices, then [...], so the orbit is balanced. A proper path with only one vertex also gives rise to a balanced orbit, because either [...] On the other hand, if the path is $A$-null or $B$-null and $e$ is the null vertex, then [...] since one of these expressions is zero and the other isn't, showing that the orbit cannot be balanced.
3. An $AB$-null path $v_1, v_2, \ldots, v_{2k+1}$ has at least two vertices, an even number of edges, and therefore an odd number of vertices. As in the previous cases, the odd-numbered vertices are pairwise equivalent, as are the even-numbered ones. In this case, however, since there are no free ends, these two sets of vertices constitute separate orbits. Notice that [...], so the odd-numbered vertices form a balanced orbit. On the other hand, [...] since one side of this equation is zero while the other isn't, showing that the even-numbered vertices form an unbalanced orbit.
4. An $AA$-null or $BB$-null path $v_1, v_2, \ldots, v_{2k}$ has an even number of vertices. As in the previous case, the odd-numbered vertices form an orbit, and the even-numbered ones form a distinct orbit, since there are no free ends. Both are unbalanced, since [...], because one side is zero and the other isn't, and also [...] for a similar reason.

□
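As an informal check of this classification on small cases, the following Python sketch computes orbits and tests whether they are balanced. It rests on assumptions that are not spelled out in this appendix: a genome is encoded as a dict (the hypothetical names `A` and `B`) sending each extremity to its adjacent extremity, to itself if it is a free end, and omitting it if it is null in that genome; two extremities are taken to be equivalent when one is reached from the other by applying $AB$ repeatedly; and an orbit $S$ is taken to be balanced when $A\chi(S) = B\chi(S)$. The helper names `orbits` and `is_balanced` are illustrative only.

```python
def orbits(A, B, extremities):
    """Partition extremities into classes linked by y -> A(B(y))."""
    parent = {e: e for e in extremities}

    def find(e):
        # Union-find lookup with path halving.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for y in extremities:
        by = B.get(y)                      # None when y is null in B
        x = A.get(by) if by is not None else None
        if x is not None:                  # ABy is nonzero, so x and y are equivalent
            parent[find(x)] = find(y)

    classes = {}
    for e in extremities:
        classes.setdefault(find(e), set()).add(e)
    return sorted(classes.values(), key=min)


def is_balanced(S, A, B):
    """Assumed test: the images of S under A and under B coincide."""
    image_a = {A[e] for e in S if e in A}
    image_b = {B[e] for e in S if e in B}
    return image_a == image_b


# A cycle with 4 edges: A has adjacencies {0,1},{2,3}; B has {1,2},{3,0}.
cycle_A = {0: 1, 1: 0, 2: 3, 3: 2}
cycle_B = {0: 3, 3: 0, 1: 2, 2: 1}
cycle_orbits = orbits(cycle_A, cycle_B, range(4))

# A proper path 0-1-2: A has adjacency {0,1} and free end 2;
# B has free end 0 and adjacency {1,2}.
path_A = {0: 1, 1: 0, 2: 2}
path_B = {0: 0, 1: 2, 2: 1}
path_orbits = orbits(path_A, path_B, range(3))

# A path with one end missing from A: extremity 1 is null in A.
null_A = {0: 0}
null_B = {0: 1, 1: 0}
null_orbits = orbits(null_A, null_B, range(2))
```

Under these assumptions the cycle yields two balanced orbits, the proper path a single balanced orbit, and the path with a null end a single unbalanced orbit, in line with items 1 and 2.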
We now want to show that the set $K$ of all vectors $\chi(S)$ such that $S$ is a balanced orbit forms a basis for $\ker(A - B)$. To do so, we need to show that:
- For every $v \in K$, we have $(A - B)v = 0$. This follows directly from the definition of balanced orbits.
- $K$ is linearly independent. This comes from the fact that each extremity is present in at most one vector of $K$ (the vectors in $K$ have disjoint supports).
- $K$ generates $\ker(A - B)$. This will be proven below.
Lemma 4. Let $e$ be an extremity such that $Ae = 0$. Then, for every $v \in \ker(A - B)$, we have $(Be)^t v = 0$.
Proof. We have [...]

Lemma 5. [...]
Proof. From Lemma 2, we know that $x - y \in \mathrm{im}(A - B)$. Since $\mathrm{im}(A - B)$ and $\ker(A - B)$ are orthogonal due to the symmetry of $A - B$, we have $(x - y)^t v = 0$, and therefore $x^t v = y^t v$.
□

Lemma 6. If $S$ is an unbalanced orbit, there is an extremity $e \in S$ such that either $Ae$ or $Be$ is a null extremity.
Proof. According to Lemma 3, all unbalanced orbits come from null paths. If $S$ comes from an $A$-null or $B$-null [...] Assume, without loss of generality, that $v_1$ is the null extremity in the path.
Both $v_2$ and $v_{2k-2}$ are adjacent to a null extremity in one of the genomes.
Both orbits also satisfy the lemma, because $Bv_{2k-1} = v_{2k}$ and $Bv_2 = v_1$. A similar reasoning applies to the case of a $BB$-null path. Since there are no other cases of null paths, the lemma is proved.
□

Lemma 7. The set $K$ generates the kernel of $A - B$.
Proof. According to Lemma 5, any $v \in \ker(A - B)$ can be written as [...], where the $S_i$ are the disjoint $AB$-orbits. If $S_i$ is an unbalanced orbit, Lemma 6 states that there is an extremity $e \in S_i$ such that either $Ae$ or $Be$ is a null extremity. For this $e$, by Lemma 4, we have $e^t v = 0$, and, consequently, [...] Therefore, $v$ is a linear combination of vectors $\chi(S)$, where $S$ is a balanced orbit.
□

With Lemma 7, we conclude that the dimension of $\ker(A - B)$ is equal to the number of balanced orbits, and, consequently, we can state the following: [...]

Proof. By counting the number of balanced orbits present in each type of component, according to Lemma 3, we get [...], and the desired result follows immediately from the rank-nullity theorem.

B Proofs for Section 3.3 - Sorting
We now show that the rank distance d(A, B) is equal to the optimum weight of a scenario going from A to B using the basic operations listed in Section 3.2.
Lemma 9. Given two genomes $A$ and $B$, we have [...]
Proof. Let $\mathcal{X} = X_1, X_2, \ldots, X_k$ be a scenario such that $w(\mathcal{X}) = w(A, B)$. Repeatedly applying the triangle inequality to intermediate genomes of the form [...] However, [...] Therefore, [...]

□
We say an operation $X$ on genome $A$ is sorting with respect to genome $B$ [...] When $A$ and $B$ are fixed, we say that an operation is sorting if it falls into one or the other of these categories.
We say a component of $BG(A, B)$ is sorted if it is a proper 0-path or a 2-cycle, that is, a path with 0 edges or a cycle with 2 edges. The relevance of sorted components stems from the fact that when all the components of the breakpoint graph $BG(A, B)$ are sorted, we have $A = B$. Therefore, one strategy to transform $A$ into $B$ is to sort the breakpoint graph one component at a time. This is the approach we take here. In addition, we consider sorting operations in both directions (applied to $A$ and sorting with respect to $B$, or vice versa), because all the basic operations we consider have inverses that are themselves basic. Furthermore, for any genome $X$, a sequence of operations $\mathcal{X}$ starting from $A$ and ending at $X$ and a sequence of operations $\mathcal{Y}$ starting from $B$ and ending at $X$ can be combined into $\mathcal{Z} := \mathcal{X} \bullet -\mathcal{Y}^R$, where $-\mathcal{Y}^R$ contains the inverses of the operations in $\mathcal{Y}$ listed in reverse order, and $\bullet$ denotes list concatenation.
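The combination of two scenarios meeting at an intermediate genome can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: genomes are encoded as symmetric 0/1 matrices (nested lists) with a 1 at position $(x, y)$ for an adjacency and at $(x, x)$ for a free end, an operation $X$ is applied as $A + X$, and its inverse is $-X$. The helper names (`add`, `negate`, `cut`, `apply_scenario`, `combine`) and the specific form of the cut matrix are hypothetical.

```python
def add(M, N):
    """Entrywise sum of two square matrices given as nested lists."""
    return [[m + n for m, n in zip(rm, rn)] for rm, rn in zip(M, N)]


def negate(M):
    """Inverse of an operation under the assumed additive encoding."""
    return [[-m for m in row] for row in M]


def cut(n, x, y):
    """Assumed matrix of the cut of adjacency {x, y}: both ends become free."""
    X = [[0] * n for _ in range(n)]
    X[x][x] = X[y][y] = 1
    X[x][y] = X[y][x] = -1
    return X


def apply_scenario(G, scenario):
    """Apply the operations of a scenario to genome G, in order."""
    for op in scenario:
        G = add(G, op)
    return G


def combine(xs, ys):
    """Z := X . -Y^R: xs followed by the inverses of ys in reverse order."""
    return list(xs) + [negate(op) for op in reversed(ys)]


# A: adjacency {0,1}, free ends 2 and 3.  B: adjacency {2,3}, free ends 0 and 1.
A = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
B = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

xs = [cut(4, 0, 1)]    # from A to the genome with all free ends
ys = [cut(4, 2, 3)]    # from B to the same genome
zs = combine(xs, ys)   # a cut followed by a join, leading from A to B

meeting_point = apply_scenario(A, xs)
result = apply_scenario(A, zs)
```

Here both `xs` and `ys` end at the genome in which every extremity is a free end, and applying `zs` to `A` yields `B`.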
Lemma 10. If $Ax \neq 0$, $Ax \neq x$, and $Bx = x$, then cutting the adjacency $\{x, Ax\}$ in $A$ is always sorting.
Proof. In the breakpoint graph $BG(A, B)$, the node corresponding to the extremity $x$ is the end of a path. Let $P$ be this path. Let $X$ be the cut of adjacency $\{x, Ax\}$. The graph $BG(A + X, B)$ has the same components as $BG(A, B)$, except for $P$. Instead of $P$, there are two paths. The first is a path with all the nodes of $P$ except for $x$. It has the same type as $P$. The second is a proper 0-path with node $x$. Therefore, $d_r(A + X, B) = d_r(A, B) - 1$, because the number of proper paths increases by one, while $n$, $c$, and $p_{AB}$ remain the same.
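Lemma 10 can be checked numerically on a small instance. The sketch below assumes that, for genomes with equal gene content, the distance is the rank of $A - B$, as in the original rank distance model, and that the cut of $\{x, Ax\}$ is the matrix with $+1$ at the two diagonal positions and $-1$ at the two off-diagonal positions. The helpers `rank`, `sub`, `add`, and `cut` are hypothetical names; the rank is computed by exact Gaussian elimination over the rationals.

```python
from fractions import Fraction


def rank(M):
    """Rank of a matrix (nested lists) by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in M]
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            if factor:
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def sub(M, N):
    return [[m - n for m, n in zip(rm, rn)] for rm, rn in zip(M, N)]


def add(M, N):
    return [[m + n for m, n in zip(rm, rn)] for rm, rn in zip(M, N)]


def cut(n, x, y):
    """Assumed matrix of the cut of adjacency {x, y}."""
    X = [[0] * n for _ in range(n)]
    X[x][x] = X[y][y] = 1
    X[x][y] = X[y][x] = -1
    return X


# A: adjacency {0,1}, free ends 2 and 3.  B: free end 0, adjacency {1,2}, free end 3.
# For x = 0 we have Ax != 0, Ax != x, and Bx = x, as required by the lemma.
A = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
B = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

before = rank(sub(A, B))
after = rank(sub(add(A, cut(4, 0, 1)), B))
```

In this instance the cut lowers the rank of the difference from 2 to 1, matching the decrease by one stated in the proof.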
□

Lemma 11. If $BG(A, B)$ has at least one path with at least 3 edges, or one cycle with at least 4 edges, there is a sorting double swap.
Proof. In either case, we can take two edges from the same genome, with one edge from the other genome incident to both, to define a double swap. For the cycle, this double swap splits the cycle into two smaller ones. For the path, this double swap transforms the path into a cycle and a path of the same type as the original path. In both cases, the number $c$ of cycles increases by 1, decreasing the distance by 2. [...]

Proof. Since the deletion is of an entire chromosome of $B$-null extremities, the only adjacencies affected are those between two $B$-null extremities. Therefore, the only components that undergo changes are $BB$-null paths of length 1, each of them turning into two single nodes absent in both genomes (which we defined to be proper paths), and 0-length $B$-null paths that turn into 0-length (also proper) paths. Thus, [...]

Proof. Notice that the right-hand side is $f(A, B)$. Let $X_1, X_2, \ldots, X_m$ be an optimal sequence of operations sorting $A$ into $B$. Define $A_0 = A$, $A_1 = A_0 + X_1$, $A_2 = A_1 + X_2$, and so on, so that $A_m = B$. In order to show that the inequality $d_i(A, B) \geq f(A, B)$ holds, we will first prove that no operation can cause a change in the formula greater than its weight, that is, for any integer $i$ such that [...] This will be done by an exhaustive case analysis. We examine all options for the operation $X_i$ and its impact on the breakpoint graph. Note that, by definition, $\Delta f(A, B; X) = \Delta p_{AB}(A, B; X) - 2\Delta c(A, B; X) - \Delta p_0(A, B; X)$.
- $X_i$ is a cut, so $w(X_i) = 1$.
  • If $X_i$ cuts an edge in a cycle, the cycle turns into a proper path, so $\Delta f(A_{i-1}, B; X_i) = 1$.
Recall that only free ends can be joined, and note that the free ends being joined must be present in both genomes.
  • A double swap involving an $A$-null or $B$-null path on the one hand, and an $AA$-null, $BB$-null or $AB$-null path on the other hand, results in one $A$-null or $B$-null path and one $AA$-null, $BB$-null or $AB$-null path, so $-1 \leq \Delta f(A_{i-1}, B; X_i) \leq 1$.
  • A double swap involving two $AA$-null or $BB$-null paths results in two $AA$-null or $BB$-null paths, or two $AB$-null paths, so $\Delta f(A_{i-1}, B; X_i) = 0$ or $\Delta f(A_{i-1}, B; X_i) = 2$.
  • A double swap involving two $AB$-null paths results in an $AA$-null path and a $BB$-null path, or two $AB$-null paths, so $\Delta f(A_{i-1}, B; X_i) = -2$ or $\Delta f(A_{i-1}, B; X_i) = 0$.
  • A double swap involving a path and a cycle results in a longer path of the same type, so $\Delta f(A_{i-1}, B; X_i) = 2$.
  • A double swap applied to two cycles results in a longer cycle, in which case $\Delta f(A_{i-1}, B; X_i) = 2$.
  • A double swap applied to two edges of the same path either results in a path of the same type with a reversed segment, in which case $\Delta f(A_{i-1}, B; X_i) = 0$, or a path of the same type and a cycle, in which case $\Delta f(A_{i-1}, B; X_i) = -2$.
  • A double swap applied to two edges of the same cycle either results in another cycle, in which case $\Delta f(A_{i-1}, B; X_i) = 0$, or in two shorter cycles, in which case $\Delta f(A_{i-1}, B; X_i) = -2$.
  We conclude that in all these cases we always have $-\Delta f$ [...]
- $X_i$ is an insertion of $k$ markers, so $w(X_i) = 2k$. In this case, our result follows immediately from Lemma 15.
- $X_i$ is a deletion of $k$ markers, so $w(X_i) = 2k$. In this case, our result follows immediately from Lemma 16.

Now we can reason as follows. The expression $f(A, B)$ can be written as a summation of deltas: . . .
Then we can use this fact to conclude:

□
There is a simple way to equalize the gene content of A and B. Examining

□
Lemmas 18 and 19, together with the triangle inequality, which necessarily holds for any distance defined via a set of allowed operations with non-negative weights, give us a lower bound on the indel distance $d_i(A, B)$.