An approximation algorithm for genome sorting by reversals to recover all adjacencies

Zhai, Shanshan; Zhang, Peng; Zhu, Daming; Tong, Weitian; Xu, Yao; Lin, Guohui

doi:10.1007/s10878-018-0346-y

Published: 31 August 2018

Volume 37, pages 1170–1190 (2019)

Journal of Combinatorial Optimization

Shanshan Zhai¹,
Peng Zhang¹,
Daming Zhu¹,
Weitian Tong²,
Yao Xu³ &
…
Guohui Lin ORCID: orcid.org/0000-0003-4283-3396³

Abstract

Genome rearrangement problems have been extensively studied for more than two decades, intended to understand the species evolutionary relationships in terms of the long range genetic mutations at the genome level. While most earlier studies focus on the simplified genomes ignoring gene duplicates, thousands of whole genome sequencing projects reveal that a genome typically carries multiple gene duplicates distributed in various ways along the genome. Given a source genome and a target genome such that one is a re-ordering of the genes in the other, we measure the evolutionary distance by the minimum number of reversals applied on the source genome to recover all the gene adjacencies in the target genome. We define this optimization problem as sorting by reversals to recover all adjacencies, or SBR2RA in short. We show that SBR2RA is APX-hard and uncover some similarities and differences to the classic counterpart, the sorting by reversals problem. From the approximability perspective, we present a $2 \alpha $-approximation algorithm, where $\alpha \in [1, 2]$ is the best approximation ratio for a related optimization problem which is suspected to be NP-hard.

Notes

We note that during this recovering process, existing common adjacencies can be broken, if necessary, and then recovered later.

References

Bafna V, Pevzner PA (1996) Genome rearrangements and sorting by reversals. SIAM J Comput 25:272–289
Bafna V, Pevzner PA (1998) Sorting by transpositions. SIAM J Discrete Math 11:224–240
Berman P, Hannenhalli S, Karpinski M (2002) $1.375$-approximation algorithm for sorting by reversals. In: Proceedings of the 10th annual European symposium on algorithms (ESA’02), pp 200–210
Berman P, Karpinski M (1999) On some tighter inapproximability results. In: Proceedings of the 26th international colloquium on automata, languages and programming (ICALP’99), pp 200–209
Caprara A (1997) Sorting by reversals is difficult. In: Proceedings of the first annual international conference on computational molecular biology, pp 75–83
Chen W, Chen Z, Samatova NF, Peng L, Wang J, Tang M (2014) Solving the maximum duo-preservation string mapping problem with linear programming. Theor Comput Sci 530:1–11
Christie DA (1996) Sorting permutations by block-interchanges. Inf Process Lett 60:165–169
Christie DA (1998) A $3/2$ approximation algorithm for sorting by reversals. In: ACM-SIAM proceedings of the ninth annual symposium on discrete algorithms (SODA’98), pp 244–252
Christie DA, Irving RW (2001) Sorting strings by reversals and by transpositions. SIAM J Discrete Math 14:193–206
Chrobak M, Kolman P, Sgall J (2004) The greedy algorithm for the minimum common string partition problem. In: Proceedings of the 7th international workshop on approximation algorithms for combinatorial optimization problems (APPROX 2004) and the 8th international workshop on randomization and computation (RANDOM 2004), LNCS 3122, pp 84–95
Goldstein A, Kolman P, Zheng J (2004) Minimum common string partition problem: hardness and approximations. In: Proceedings of the 15th international symposium on algorithms and computation (ISAAC 2004), LNCS 3341, pp 484–495
Gu Q-P, Peng S, Sudborough H (1999) A $2$-approximation algorithm for genome rearrangements by reversals and transpositions. Theor Comput Sci 210:327–339
Hannenhalli S, Pevzner P (1995) Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. In: ACM proceedings of the 27th annual symposium on the theory of computing (STOC’95), pp 178–189
Jerrum MR (1985) The complexity of finding minimum-length generator sequences. Theor Comput Sci 36:265–289
Kececioglu JD, Sankoff D (1993) Exact and approximation algorithms for the inversion distance between two permutations. In: Proceedings of the fourth annual symposium on combinatorial pattern matching (CPM’93), LNCS 684, pp 87–105
Kolman P, Waleń T (2007) Approximating reversal distance for strings with bounded number of duplicates. Discrete Appl Math 155:327–336
Rubert DP, Feijão P, Braga MDV, Stoye J, Martinez FHV (2017) Approximating the DCJ distance of balanced genomes in linear time. Algorithms Mol Biol 12:3
Sankoff D (1999) Genome rearrangement with gene families. Bioinformatics 16:909–917
Watterson G, Ewens W, Hall T, Morgan A (1982) The chromosome inversion problem. J Theor Biol 99:1–7

Acknowledgements

PZ is partially supported by the NNSF China Grant 61672323 and the NSF of Shandong Province Grant ZR2016AM28; DZ is partially supported by the NNSF China Grants 61472222, 61732009, and 61761136017; WT is partially supported by the funds from the Office of the Vice President for Research and Economic Development at Georgia Southern University; GL is supported by the NSERC.

Author information

Authors and Affiliations

School of Computer Science and Technology, Shandong University, Jinan, Shandong, China
Shanshan Zhai, Peng Zhang & Daming Zhu
Department of Computer Science, Georgia Southern University, Statesboro, GA, USA
Weitian Tong
Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
Yao Xu & Guohui Lin

Corresponding author

Correspondence to Guohui Lin.

Appendices

A Hardness results

Given a permutation $\pi = (\pi _1, \pi _2, \ldots , \pi _n)$ on $\{1, 2, \ldots , n\}$, a breakpoint of $\pi $ is a pair of adjacent positions $i, i+1$ such that $|\pi _i - \pi _{i+1}| \ne 1$ (Kececioglu and Sankoff 1993).
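The breakpoint definition above can be illustrated with a short sketch. It follows the definition exactly as stated here, without adding sentinel elements at either end of the permutation (some formulations in the literature do add them); the function name `breakpoints` is chosen for illustration only.

```python
def breakpoints(pi):
    """Return the 1-indexed position pairs (i, i+1) where |pi_i - pi_{i+1}| != 1."""
    return [
        (i, i + 1)
        for i in range(1, len(pi))
        if abs(pi[i - 1] - pi[i]) != 1
    ]


# In (3, 1, 2, 5, 4) the pairs (3, 1) and (2, 5) are breakpoints.
print(breakpoints((3, 1, 2, 5, 4)))
# Neither the identity nor its reverse has a breakpoint under this definition.
print(breakpoints((1, 2, 3, 4)), breakpoints((4, 3, 2, 1)))
```

Under this reading, the identity and its reverse are both breakpoint-free, which matches the observation in the proof of Theorem A.1 that recovering all adjacencies of the identity yields either the identity or its reverse.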

Lemma A.1

[Theorem 1 in Berman and Karpinski (1999)] For any $\epsilon ', \epsilon '' \in (0, \frac{1}{2})$, there exists an $m \ge 2$ such that it is NP-hard to decide whether an instance of SBR with 2240m breakpoints has the minimum number of reversals less than $(1236 + \epsilon ')m$ or larger than $(1237 - \epsilon '')m$.

In the following we need values of $\epsilon '$ and $\epsilon ''$ such that $\epsilon ' + \epsilon '' < \frac{1}{2}$, which, together with $m \ge 2$, implies

$$\begin{aligned} (1237 - \epsilon '')m - (1236 + \epsilon ')m = \left( 1 - (\epsilon ' + \epsilon '') \right) m > 1. \end{aligned}$$

(1)

Theorem A.1

The SBR2RA problem is NP-hard.

Proof

We prove the NP-hardness of the SBR2RA problem by contradiction, using a reduction from the SBR problem, as follows. Assume to the contrary that the SBR2RA problem is solvable in polynomial time.

For any $\epsilon ', \epsilon '' \in (0, \frac{1}{2})$ such that $\epsilon ' + \epsilon '' < \frac{1}{2}$, let I denote an instance of the SBR problem as stated in Lemma A.1. That is, I has 2240m breakpoints and it is NP-hard to decide whether its minimum number of reversals is less than $(1236 + \epsilon ')m$ or larger than $(1237 - \epsilon '')m$. Let the permutation in I be $\pi = (\pi _1, \pi _2, \ldots , \pi _n)$, on $\{1, 2, \ldots , n\}$ for some n. We construct an instance $I'$ of SBR2RA with the source genome $A = (\pi _1, \pi _2, \ldots , \pi _n)$ and the target genome $B = (1, 2, \ldots , n)$.

Let $\langle \rho _1, \rho _2, \ldots , \rho _r \rangle $ denote a series of the minimum number r of reversals on the source genome A of the instance $I'$ such that ${{{\mathcal {J}}}}(\rho _r \circ \rho _{r-1} \circ \cdots \circ \rho _1 \circ A) = {{{\mathcal {J}}}}(B)$. Note that the number r (and the series of reversals $\langle \rho _1, \rho _2, \ldots , \rho _r \rangle $) can be obtained in polynomial time by our assumption. Since the target genome B is the identity permutation $(1, 2, \ldots , n)$, the resulting genome $\rho _r \circ \rho _{r-1} \circ \cdots \circ \rho _1 \circ A$ is either B or the reverse of B, i.e., $(n, n-1, \ldots , 1)$. It follows that exactly the same series of reversals $\langle \rho _1, \rho _2, \ldots , \rho _r \rangle $ on the permutation $\pi $ in the instance I, possibly plus one extra reversal $\rho (1, n)$, transform $\pi $ to the identity permutation. Let $r^*$ denote the minimum number of reversals for the instance I. In other words, in polynomial time we determine that $r^* \le r + 1$. Obviously we have $r \le r^*$; and thus together we have

$$\begin{aligned} r \le r^* \le r + 1. \end{aligned}$$

This contradicts the gap in Eq. (1); or equivalently, since r is computed in polynomial time and, under the promise of Lemma A.1, $r < (1236 + \epsilon ')m$ if and only if $r^* < (1236 + \epsilon ')m$ (indeed, if $r^* > (1237 - \epsilon '')m$ then $r \ge r^* - 1 > (1236 + \epsilon ')m$ by Eq. (1)), we can decide in polynomial time for the instance I whether $r^* < (1236 + \epsilon ')m$ or $r^* > (1237 - \epsilon '')m$. This proves that the SBR2RA problem is NP-hard. $\square $
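The sandwich $r \le r^* \le r + 1$ used in the proof can be checked exhaustively on tiny permutations. The sketch below is a brute-force breadth-first search, not the paper's algorithm, and it assumes that ${{{\mathcal {J}}}}(\cdot )$ is the multiset of unordered adjacent pairs of a genome (the formal definition is in the main text, which is not reproduced here). All function names are hypothetical.

```python
from collections import Counter, deque
from itertools import permutations


def adjacencies(genome):
    """Assumed form of J(.): multiset of unordered adjacent pairs."""
    return Counter(tuple(sorted(pair)) for pair in zip(genome, genome[1:]))


def min_reversals(source, is_goal):
    """Breadth-first search over all reversals rho(i, j); tiny inputs only."""
    start = tuple(source)
    if is_goal(start):
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    n = len(start)
    while queue:
        genome, dist = queue.popleft()
        for i in range(n):
            for j in range(i + 1, n):
                nxt = genome[:i] + genome[i:j + 1][::-1] + genome[j + 1:]
                if nxt in seen:
                    continue
                if is_goal(nxt):
                    return dist + 1
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("goal is not reachable from the source")


def sbr_distance(pi):
    """r*: fewest reversals turning pi into the identity permutation."""
    identity = tuple(range(1, len(pi) + 1))
    return min_reversals(pi, lambda g: g == identity)


def sbr2ra_distance(a, b):
    """r: fewest reversals on a that recover every adjacency of b."""
    target = adjacencies(b)
    return min_reversals(a, lambda g: adjacencies(g) == target)


def check_sandwich(n):
    """Verify r <= r* <= r + 1 for every permutation of {1, ..., n}."""
    identity = tuple(range(1, n + 1))
    return all(
        sbr2ra_distance(p, identity)
        <= sbr_distance(p)
        <= sbr2ra_distance(p, identity) + 1
        for p in permutations(identity)
    )


print(check_sandwich(5))
```

The reversed identity shows that both bounds can be attained: it needs one reversal under SBR but none under SBR2RA, since all adjacencies of the identity are already present.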

Using the gap in Eq. (1), the proof of Theorem A.1 can be slightly modified to show the APX-hardness of the SBR2RA problem. In fact, since 1 is insignificant compared to 1236m, one easily sees that the SBR2RA problem cannot be approximated within a factor of

$$\begin{aligned} \frac{1237}{1236} - \epsilon \approx 1.0008 - \epsilon , \ \text{ for } \text{ any } \text{ positive } \epsilon < 0.0008, \end{aligned}$$

which is also an inapproximability lower bound for the SBR problem (Berman and Karpinski 1999).
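One way to spell out the modification is sketched below; it is a reconstruction rather than the authors' own derivation. The symbols $\rho $ (an assumed approximation ratio for SBR2RA) and $r'$ (the value returned by such an algorithm, so that $r \le r' \le \rho r$) are introduced here for illustration, and the final limit assumes that m can be taken arbitrarily large, which Lemma A.1 as quoted does not state explicitly.

$$\begin{aligned} &\text{if } r^* < (1236 + \epsilon ')m: \quad r' \le \rho \, r \le \rho \, r^* < \rho \, (1236 + \epsilon ')m, \\ &\text{if } r^* > (1237 - \epsilon '')m: \quad r' \ge r \ge r^* - 1 > (1237 - \epsilon '')m - 1. \end{aligned}$$

The two cases can be told apart from $r'$ whenever $\rho \, (1236 + \epsilon ')m \le (1237 - \epsilon '')m - 1$, that is, whenever

$$\begin{aligned} \rho \le \frac{1237 - \epsilon '' - 1/m}{1236 + \epsilon '}, \end{aligned}$$

and the right-hand side tends to $1237/1236 \approx 1.0008$ as $\epsilon ', \epsilon '' \rightarrow 0$ and $m \rightarrow \infty $.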

Corollary A.1

The SBR2RA problem is APX-hard.

B The complete proof of Lemma 4.2

Proof

(of Lemma 4.2, continued from the main text)

Case 1.2. $ac \in R_M(A, B)$ but $bd \notin R_M(A, B)$ (or the other way around, which can be argued symmetrically). Let $C_3$ denote the cycle containing the adjacency-edge ac. After the reversal, $a_{i-1} a_j$ becomes saturated, and $|R_{M'}(A, B)| = |R_M(A, B)| + 1$; the ending characters of the blocks remain the same except that $a_i =$ ‘b’ becomes an end and $a_{j+1} =$ ‘d’ becomes an end. Due to $ab, cd \in R_{M'}(A, B)$, we split the pairing ac to pair a with $a_i$ and to pair c with $a_{j+1}$; then the two blocks in which $a_i$ and $a_{j+1}$ reside respectively belong to the same cycle, which contains two ending characters $a_i =$ ‘b’ and $a_{j+1} =$ ‘d’. Furthermore, $C_3$ is merged with this cycle to form a single cycle containing both edges ab and cd, see Fig. 8.

From the non-existence of a saturated adjacency stated in Lemma 4.1, the cycles $C_1$ containing ab, $C_3$ containing the character ‘a’, and the one containing the character ‘b’, denoted as $C_4$, must not all be distinct; for the same reason, the cycles $C_2$ containing cd, $C_3$ containing the character ‘c’, and the one containing the character ‘d’, denoted as $C_5$, must not all be distinct. One can easily check that $C_1, C_2, C_3, C_4, C_5$ actually represent at most three distinct cycles of $G = (V, E)$. This suggests that the reversal operation $\rho (i, j)$ decreases $C_M$ by at most 2, by merging all these cycles into one. Note that in this process only one copy of ac is moved out of $R_M(A, B)$, but ‘a’ and ‘c’ reside in the same cycle afterwards; these imply the non-existence of a saturated adjacency stated in Lemma 4.1.

Case 1.3. $ac \notin R_M(A, B)$ and $bd \notin R_M(A, B)$. After the reversal, $R_{M'}(A, B) = R_M(A, B) \oplus \{ab, cd\}$ ($\oplus $ is the multiset union operation); the ending characters of the blocks remain the same except that all these four characters $a_{i-1}, a_i, a_j, a_{j+1}$ become ends. We pair $a_{i-1}$ with $a_i$ and pair $a_j$ with $a_{j+1}$; this way, the cycle $C_1$ is expanded to contain an adjacency-edge ab and the cycle $C_2$ is expanded to contain an adjacency-edge cd, see Fig. 9.

From the non-existence of a saturated adjacency stated in Lemma 4.1, the cycles $C_1$, the one containing the character ‘a’ denoted as $C_3$, and the one containing the character ‘b’ denoted as $C_4$, must not all be distinct; for the same reason, the cycles $C_2$, the one containing the character ‘c’ denoted as $C_5$, and the one containing the character ‘d’ denoted as $C_6$, must not all be distinct. Since the characters on the cycles $C_1, C_3, C_4$ remain the same after the reversal, as do the characters on the cycles $C_2, C_5, C_6$, the reversal operation $\rho (i, j)$ decreases $C_M$ by at most 2, by merging $C_1, C_3$ and $C_4$ into one and by merging $C_2, C_5$ and $C_6$ into one. Note that in this process no adjacency is moved out of $R_M(A, B)$ and cycles are only merged, so the non-existence of a saturated adjacency stated in Lemma 4.1 is maintained.

Case 2. $a_{i-1}$ and $a_i$ reside in the same block on the cycle $C_1$, while $a_j$ and $a_{j+1}$ are ending characters. In this case, the reversal operation $\rho (i, j)$ breaks the adjacency ab, i.e., $ab \in R_{M'}(A, B)$.

Case 2.1. Both $ac, bd \in R_M(A, B)$. Let $C_2$ denote the cycle containing the adjacency-edge ac and $C_3$ denote the cycle containing the adjacency-edge bd. Applying the non-existence of a saturated adjacency stated in Lemma 4.1 on $a_{i-1} a_i = ab$, we conclude that $C_1, C_2, C_3$ must not all be distinct. After the reversal, $a_{i-1} a_j$ and $a_i a_{j+1}$ become saturated, and $|R_{M'}(A, B)| = |R_M(A, B)| - 1$; the ending characters of all the blocks remain the same, except the disappearance of $a_j =$ ‘c’ and $a_{j+1} =$ ‘d’. Due to $ab \in R_{M'}(A, B)$, we pair the ‘a’ originally paired with $a_j$ and the ‘b’ originally paired with $a_{j+1}$ to form the adjacency ab. At the end, if $C_1, C_2$ and $C_3$ (at most two!) are merged into one cycle, see Fig. 10, then the reversal operation $\rho (i, j)$ decreases $C_M$ by at most 1, and additionally all ending characters ‘a’, ‘b’, ‘c’ and ‘d’ (if any) reside in the same cycle. These imply the non-existence of a saturated adjacency stated in Lemma 4.1.

Otherwise, $C_1, C_2$ and $C_3$ (at most two!) are not merged into one cycle, but lead to k distinct cycles with $k = 2$ or 3. When $k = 2$, $C_M$ remains the same and at most one of ac and bd can satisfy the statement in Lemma 4.1; restoring the non-existence of a saturated adjacency stated in Lemma 4.1 then helps decrease $C_M$ by 1. When $k = 3$, $C_M$ is increased by 1 and both ac and bd can satisfy the statement in Lemma 4.1; restoring the non-existence of a saturated adjacency stated in Lemma 4.1 then helps decrease $C_M$ by 2, for a total effect of decreasing $C_M$ by 1.

In summary, the reversal operation $\rho (i, j)$ decreases $C_M$ by at most 1 and consequently decreases the quantity $|R_M(A, B)| + C_M$ by at most 2.

Case 2.2. $ac \in R_M(A, B)$ but $bd \notin R_M(A, B)$ (or the other way around, which can be argued symmetrically). Let $C_2$ denote the cycle containing the adjacency-edge ac. After the reversal, $a_{i-1} a_j$ becomes saturated, and $|R_{M'}(A, B)| = |R_M(A, B)|$; the ending characters of the blocks remain the same, except the disappearance of $a_j =$ ‘c’ while $a_i =$ ‘b’ becomes an end. We break the edge ac of $C_2$ incident at $a_j =$ ‘c’ and re-connect the ending character ‘a’ with $a_i =$ ‘b’. Applying the non-existence of a saturated adjacency stated in Lemma 4.1 on $a_{i-1} a_i = ab$, we conclude that $C_1, C_2$, and the cycle containing character ‘b’ (if any) denoted as $C_3$, must not all be distinct.

If $C_1, C_2$ and $C_3$ (at most two!) are merged into one cycle, see Fig. 11, then the reversal operation $\rho (i, j)$ decreases $C_M$ by at most 1, and additionally all ending characters ‘a’, ‘b’, and ‘c’ (if any) reside in the same cycle. These imply the non-existence of a saturated adjacency stated in Lemma 4.1.

In the other case $C_1, C_2$ and $C_3$ (at most two!) are not merged into one cycle, but lead to 2 distinct cycles. Then $C_M$ remains the same, and the new adjacency ac can satisfy the statement in Lemma 4.1. Immediately restoring the non-existence of a saturated adjacency stated in Lemma 4.1 helps decrease $C_M$ by 1.

In summary, the reversal operation $\rho (i, j)$ decreases $C_M$ by at most 1 and consequently decreases the quantity $|R_M(A, B)| + C_M$ by at most 1.

Case 2.3. $ac \notin R_M(A, B)$ and $bd \notin R_M(A, B)$. After the reversal, $|R_{M'}(A, B)| = |R_M(A, B)| + 1$; the ending characters of the blocks remain the same except that the two characters $a_{i-1} =$ ‘a’ and $a_i =$ ‘b’ become new ends. By pairing $a_{i-1}$ with $a_i$, the cycle $C_1$ is expanded to contain a new adjacency-edge ab, see Fig. 12. From the non-existence of a saturated adjacency stated in Lemma 4.1, the cycles $C_1$, the one containing the character ‘a’ denoted as $C_2$, and the one containing the character ‘b’ denoted as $C_3$, must not all be distinct; the reversal operation $\rho (i, j)$ decreases $C_M$ by at most 1, by merging $C_1, C_2$ and $C_3$ (at most two!) into one, see Fig. 12. Note that in this process no adjacency is moved out of $R_M(A, B)$ and cycles were only merged, implying the non-existence of a saturated adjacency stated in Lemma 4.1 is maintained.

Case 3. All four $a_{i-1}$, $a_i$, $a_j$, and $a_{j+1}$ are ending characters. Let $C_1$ ($C_2$, $C_3$, $C_4$, respectively) denote the cycle containing ‘a’ (‘b’, ‘c’, ‘d’, respectively).

Case 3.1. Both $ac, bd \in R_M(A, B)$. Note that the ending characters ‘a’ and ‘c’ (‘b’ and ‘d’, respectively) are in a common cycle, that is, $C_1$ and $C_3$ refer to the same cycle ($C_2$ and $C_4$ refer to the same cycle, respectively). If $a_{i-1}$ and $a_j$ are not paired, i.e., not forming an adjacency-edge in the cycle $C_1$, we pair them together to break $C_1$ into two smaller cycles (see Fact 4.1). After the reversal, $a_{i-1} a_j$ and $a_i a_{j+1}$ become saturated, and $|R_{M'}(A, B)| = |R_M(A, B)| - 2$; the ending characters of all the blocks remain the same, except the disappearance of these four ending characters $a_{i-1}$, $a_i$, $a_j$, and $a_{j+1}$, see Fig. 13. In summary, the reversal operation $\rho (i, j)$ does not decrease $C_M$ (but might increase it by at most 2). Note that in this process only one copy of ac (one copy of bd, respectively) is moved out of $R_M(A, B)$, but the new block containing ac and the characters ‘a’ and ‘c’ are in at most two cycles by Fact 4.1 (the new block containing bd and the characters ‘b’ and ‘d’ are in at most two cycles by Fact 4.1, respectively); these imply the non-existence of a saturated adjacency stated in Lemma 4.1.

Case 3.2. $ac \in R_M(A, B)$ but $bd \notin R_M(A, B)$ (or the other way around, which can be argued symmetrically). The same as in Case 3.1, we note that the ending characters ‘a’ and ‘c’ are in a common cycle, that is, $C_1$ and $C_3$ refer to the same cycle. If $a_{i-1}$ and $a_j$ are not paired, i.e., not forming an adjacency-edge in the cycle $C_1$, we pair them together to break $C_1$ into two smaller cycles (see Fact 4.1). After the reversal, $a_{i-1} a_j$ becomes saturated, and $|R_{M'}(A, B)| = |R_M(A, B)| - 1$; the ending characters of the blocks remain the same except the disappearance of the two ending characters $a_{i-1}$ and $a_j$. Therefore, the reversal operation $\rho (i, j)$ does not decrease $C_M$ (but might increase it by at most 1). The non-existence of a saturated adjacency stated in Lemma 4.1 is maintained because the new block containing ac and the characters ‘a’ and ‘c’ are in at most two cycles by Fact 4.1.

Case 3.3. $ac \notin R_M(A, B)$ and $bd \notin R_M(A, B)$. After the reversal, $|R_{M'}(A, B)| = |R_M(A, B)|$; the ending characters of the blocks remain the same. The reversal operation $\rho (i, j)$ does not change $C_M$. Since no adjacency is moved out of $R_M(A, B)$, the non-existence of a saturated adjacency stated in Lemma 4.1 is maintained. $\square $
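The changes in $|R_M(A, B)|$ stated across the cases above follow a simple pattern that can be tallied mechanically. The sketch below rests on an inferred reading of the case structure: Case 1 breaks both ab and cd (suggested by Case 1.3, where $R_M(A, B)$ gains $\{ab, cd\}$), Case 2 breaks only ab, and Case 3 breaks nothing, while sub-cases .1, .2 and .3 recover two, one and no adjacencies respectively. The table names and the function `delta_r` are hypothetical, and Case 1.1 is omitted because it is treated in the main text rather than in this appendix.

```python
# Adjacencies broken by the reversal rho(i, j), keyed by top-level case
# (an inferred reading of the case analysis, not a quotation).
BROKEN = {1: ("ab", "cd"), 2: ("ab",), 3: ()}

# Adjacencies of R_M(A, B) recovered by the reversal, keyed by sub-case.
RECOVERED = {1: ("ac", "bd"), 2: ("ac",), 3: ()}

# Change in |R_M(A, B)| as stated in the appendix for each (case, sub-case).
STATED = {
    (1, 2): +1, (1, 3): +2,
    (2, 1): -1, (2, 2): 0, (2, 3): +1,
    (3, 1): -2, (3, 2): -1, (3, 3): 0,
}


def delta_r(case, subcase):
    """|R_M'| - |R_M|: number of broken minus number of recovered adjacencies."""
    return len(BROKEN[case]) - len(RECOVERED[subcase])


for (case, subcase), stated in sorted(STATED.items()):
    computed = delta_r(case, subcase)
    status = "ok" if computed == stated else "MISMATCH"
    print(f"Case {case}.{subcase}: stated {stated:+d}, computed {computed:+d} ({status})")
```

Every stated value agrees with the broken-minus-recovered count, which gives a quick consistency check on the bookkeeping of $|R_M(A, B)|$ before the separate arguments about $C_M$ are considered.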

Cite this article

Zhai, S., Zhang, P., Zhu, D. et al. An approximation algorithm for genome sorting by reversals to recover all adjacencies. J Comb Optim 37, 1170–1190 (2019). https://doi.org/10.1007/s10878-018-0346-y

Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s10878-018-0346-y
