An approximation algorithm for genome sorting by reversals to recover all adjacencies

Journal of Combinatorial Optimization

Abstract

Genome rearrangement problems have been extensively studied for more than two decades, with the aim of understanding the evolutionary relationships among species in terms of long-range genetic mutations at the genome level. While most earlier studies focus on simplified genomes that ignore gene duplicates, thousands of whole genome sequencing projects reveal that a genome typically carries multiple gene duplicates distributed in various ways along the genome. Given a source genome and a target genome such that one is a re-ordering of the genes in the other, we measure the evolutionary distance by the minimum number of reversals applied on the source genome to recover all the gene adjacencies in the target genome. We define this optimization problem as sorting by reversals to recover all adjacencies, or SBR2RA for short. We show that SBR2RA is APX-hard and uncover some similarities and differences to its classic counterpart, the sorting by reversals problem. From the approximability perspective, we present a \(2 \alpha \)-approximation algorithm, where \(\alpha \in [1, 2]\) is the best approximation ratio for a related optimization problem which is suspected to be NP-hard.


Notes

  1. We note that during this recovery process, existing common adjacencies can be broken, if necessary, and then recovered later.

References

  • Bafna V, Pevzner PA (1996) Genome rearrangements and sorting by reversals. SIAM J Comput 25:272–289


  • Bafna V, Pevzner PA (1998) Sorting by transpositions. SIAM J Discrete Math 11:224–240


  • Berman P, Hannenhalli S, Karpinski M (2002) \(1.375\)-approximation algorithm for sorting by reversals. In: Proceedings of the 10th annual European symposium on algorithms (ESA’02), pp 200–210

  • Berman P, Karpinski M (1999) On some tighter inapproximability results. In: Proceedings of the 26th international colloquium on automata, languages and programming (ICALP’99), pp 200–209

  • Caprara A (1997) Sorting by reversals is difficult. In: Proceedings of the first annual international conference on computational molecular biology, pp 75–83

  • Chen W, Chen Z, Samatova NF, Peng L, Wang J, Tang M (2014) Solving the maximum duo-preservation string mapping problem with linear programming. Theor Comput Sci 530:1–11


  • Christie DA (1996) Sorting permutations by block-interchanges. Inf Process Lett 60:165–169


  • Christie DA (1998) A \(3/2\) approximation algorithm for sorting by reversals. In: ACM-SIAM proceedings of the ninth annual symposium on discrete algorithms (SODA’98), pp 244–252

  • Christie DA, Irving RW (2001) Sorting strings by reversals and by transpositions. SIAM J Discrete Math 14:193–206


  • Chrobak M, Kolman P, Sgall J (2004) The greedy algorithm for the minimum common string partition problem. In: Proceedings of the 7th international workshop on approximation algorithms for combinatorial optimization problems (APPROX 2004) and the 8th international workshop on randomization and computation (RANDOM 2004), LNCS 3122, pp 84–95

  • Goldstein A, Kolman P, Zheng J (2004) Minimum common string partition problem: hardness and approximations. In: Proceedings of the 15th international symposium on algorithms and computation (ISAAC 2004), LNCS 3341, pp 484–495

  • Gu Q-P, Peng S, Sudborough H (1999) A \(2\)-approximation algorithm for genome rearrangements by reversals and transpositions. Theor Comput Sci 210:327–339


  • Hannenhalli S, Pevzner P (1995) Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. In: ACM proceedings of the 27th annual symposium on the theory of computing (STOC’95), pp 178–189

  • Jerrum MR (1985) The complexity of finding minimum-length generator sequences. Theor Comput Sci 36:265–289


  • Kececioglu JD, Sankoff D (1993) Exact and approximation algorithms for the inversion distance between two permutations. In: Proceedings of the fourth annual symposium on combinatorial pattern matching (CPM’93), LNCS 684, pp 87–105

  • Kolman P, Waleń T (2007) Approximating reversal distance for strings with bounded number of duplicates. Discrete Appl Math 155:327–336


  • Rubert DP, Feijão P, Braga MDV, Stoye J, Martinez FHV (2017) Approximating the DCJ distance of balanced genomes in linear time. Algorithms Mol Biol 12:3


  • Sankoff D (1999) Genome rearrangement with gene families. Bioinformatics 16:909–917


  • Watterson G, Ewens W, Hall T, Morgan A (1982) The chromosome inversion problem. J Theor Biol 99:1–7



Acknowledgements

PZ is partially supported by the NNSF China Grant 61672323 and the NSF of Shandong Province Grant ZR2016AM28; DZ is partially supported by the NNSF China Grants 61472222, 61732009, and 61761136017; WT is partially supported by the funds from the Office of the Vice President for Research and Economic Development at Georgia Southern University; GL is supported by the NSERC.

Author information

Corresponding author

Correspondence to Guohui Lin.

Appendices

A Hardness results

Given a permutation \(\pi = (\pi _1, \pi _2, \ldots , \pi _n)\) on \(\{1, 2, \ldots , n\}\), a breakpoint of \(\pi \) is a pair of adjacent positions \(i, i+1\) such that \(|\pi _i - \pi _{i+1}| \ne 1\) (Kececioglu and Sankoff 1993).
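The breakpoint definition above can be illustrated with a minimal Python sketch. It follows the definition exactly as stated here, so the permutation is not framed with the extra elements 0 and \(n+1\) that some formulations add; the function name `breakpoints` is chosen for illustration only.

```python
def breakpoints(pi):
    """Return the breakpoints of pi as pairs of adjacent 1-based positions.

    Positions (i, i+1) form a breakpoint when |pi_i - pi_{i+1}| != 1,
    as in the definition in the text (no framing elements 0 and n+1).
    """
    return [
        (i, i + 1)
        for i in range(1, len(pi))
        if abs(pi[i - 1] - pi[i]) != 1
    ]


# The identity permutation has no breakpoints.
print(breakpoints((1, 2, 3, 4)))  # []
# In (3, 1, 2, 4) the pairs 3|1 and 2|4 are breakpoints.
print(breakpoints((3, 1, 2, 4)))  # [(1, 2), (3, 4)]
```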

Lemma A.1

[Theorem 1 in Berman and Karpinski (1999)] For any \(\epsilon ', \epsilon '' \in (0, \frac{1}{2})\), there exists an \(m \ge 2\) such that it is NP-hard to decide whether an instance of SBR with 2240m breakpoints has the minimum number of reversals less than \((1236 + \epsilon ')m\) or larger than \((1237 - \epsilon '')m\).

In the following we need values of \(\epsilon '\) and \(\epsilon ''\) such that \(\epsilon ' + \epsilon '' < \frac{1}{2}\), which, together with \(m \ge 2\), implies

$$\begin{aligned} (1237 - \epsilon '')m - (1236 + \epsilon ')m = \left( 1 - (\epsilon ' + \epsilon '') \right) m > 1. \end{aligned} \tag{1}$$

Theorem A.1

The SBR2RA problem is NP-hard.

Proof

We prove the NP-hardness of the SBR2RA problem by contradiction, using a reduction from the SBR problem, as follows. Assume to the contrary that the SBR2RA problem is solvable in polynomial time.

For any \(\epsilon ', \epsilon '' \in (0, \frac{1}{2})\) such that \(\epsilon ' + \epsilon '' < \frac{1}{2}\), let I denote an instance of the SBR problem as stated in Lemma A.1. That is, I has 2240m breakpoints and it is NP-hard to decide whether its minimum number of reversals is less than \((1236 + \epsilon ')m\) or larger than \((1237 - \epsilon '')m\). Let the permutation in I be \(\pi = (\pi _1, \pi _2, \ldots , \pi _n)\), on \(\{1, 2, \ldots , n\}\) for some n. We construct an instance \(I'\) of SBR2RA with the source genome \(A = (\pi _1, \pi _2, \ldots , \pi _n)\) and the target genome \(B = (1, 2, \ldots , n)\).

Let \(\langle \rho _1, \rho _2, \ldots , \rho _r \rangle \) denote a series of the minimum number r of reversals on the source genome A of the instance \(I'\) such that \({{{\mathcal {J}}}}(\rho _r \circ \rho _{r-1} \circ \cdots \circ \rho _1 \circ A) = {{{\mathcal {J}}}}(B)\). Note that the number r (and the series of reversals \(\langle \rho _1, \rho _2, \ldots , \rho _r \rangle \)) can be obtained in polynomial time by our assumption. Since the target genome B is the identity permutation \((1, 2, \ldots , n)\), the resulting genome \(\rho _r \circ \rho _{r-1} \circ \cdots \circ \rho _1 \circ A\) is either B or the reverse of B, i.e., \((n, n-1, \ldots , 1)\). It follows that exactly the same series of reversals \(\langle \rho _1, \rho _2, \ldots , \rho _r \rangle \) on the permutation \(\pi \) in the instance I, possibly plus one extra reversal \(\rho (1, n)\), transforms \(\pi \) to the identity permutation. Let \(r^*\) denote the minimum number of reversals for the instance I. In other words, in polynomial time we determine that \(r^* \le r + 1\). Conversely, every series of reversals that transforms \(\pi \) to the identity permutation also recovers all the adjacencies of B, so \(r \le r^*\); together we have

$$\begin{aligned} r \le r^* \le r + 1. \end{aligned}$$

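This contradicts the gap in Eq. (1). Indeed, r is computed in polynomial time; if \(r^* < (1236 + \epsilon ')m\) then \(r \le r^* < (1236 + \epsilon ')m\), while if \(r^* > (1237 - \epsilon '')m\) then \(r \ge r^* - 1 > (1237 - \epsilon '')m - 1 > (1236 + \epsilon ')m\) by Eq. (1). Hence comparing r with \((1236 + \epsilon ')m\) decides in polynomial time for the instance I whether \(r^* < (1236 + \epsilon ')m\) or \(r^* > (1237 - \epsilon '')m\), which is NP-hard by Lemma A.1. This proves that the SBR2RA problem is NP-hard. \(\square \)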
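The sandwich \(r \le r^* \le r + 1\) used in the proof can be checked on small permutations with a brute-force sketch. The code below is only an illustration, not part of the reduction: it uses exhaustive breadth-first search (exponential time), and it assumes that \({{{\mathcal {J}}}}\) is the multiset of unordered adjacent pairs. All function names are hypothetical.

```python
from collections import deque


def reverse_segment(perm, i, j):
    """Apply the reversal rho(i, j) (1-based, inclusive) to a tuple."""
    return perm[:i - 1] + perm[i - 1:j][::-1] + perm[j:]


def adjacencies(genome):
    """Multiset of unordered adjacent pairs, stored as a sorted list."""
    return sorted(tuple(sorted(p)) for p in zip(genome, genome[1:]))


def min_reversals(source, is_goal):
    """Fewest reversals from source to any state accepted by is_goal (BFS)."""
    if is_goal(source):
        return 0
    n = len(source)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        for i in range(1, n):
            for j in range(i + 1, n + 1):
                nxt = reverse_segment(perm, i, j)
                if nxt in seen:
                    continue
                if is_goal(nxt):
                    return dist + 1
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("goal state is unreachable")


def compare_distances(pi):
    """Return (r, r_star): adjacency-recovery distance and sorting distance."""
    pi = tuple(pi)
    identity = tuple(range(1, len(pi) + 1))
    target = adjacencies(identity)
    r = min_reversals(pi, lambda g: adjacencies(g) == target)
    r_star = min_reversals(pi, lambda g: g == identity)
    return r, r_star


# The reversed identity already has all adjacencies of B but is not sorted.
print(compare_distances((3, 2, 1)))  # (0, 1)
print(compare_distances((2, 1, 3)))  # (1, 1)
```

The first example shows that the upper bound \(r^* \le r + 1\) can be tight: one extra reversal \(\rho (1, n)\) is needed exactly when the adjacencies are recovered in the reverse of B.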

Using the gap in Eq. (1), the proof of Theorem A.1 can be slightly modified to show the APX-hardness of the SBR2RA problem. In fact, since 1 is insignificant compared to 1236m, one easily sees that the SBR2RA problem cannot be approximated within a factor of

$$\begin{aligned} \frac{1237}{1236} - \epsilon \approx 1.0008 - \epsilon , \ \text{ for } \text{ any } \text{ positive } \epsilon < 0.0008, \end{aligned}$$

which is also an inapproximability lower bound for the SBR problem (Berman and Karpinski 1999).
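The bound can be checked with a short calculation, sketched here under two assumptions that are not stated in the text: a hypothetical polynomial-time algorithm for SBR2RA with approximation ratio \(\gamma \), whose output on the instance \(I'\) is denoted \(r'\) (so that \(r \le r' \le \gamma r\)), and the availability of hard instances with arbitrarily large m. Using \(r \le r^* \le r + 1\) from the proof of Theorem A.1,

$$\begin{aligned} r^*< (1236 + \epsilon ')m&\;\Longrightarrow \; r' \le \gamma r \le \gamma r^* < \gamma (1236 + \epsilon ')m, \\ r^* > (1237 - \epsilon '')m&\;\Longrightarrow \; r' \ge r \ge r^* - 1 > (1237 - \epsilon '')m - 1. \end{aligned}$$

The two ranges for \(r'\) are disjoint, so that \(r'\) would decide the NP-hard question of Lemma A.1, whenever \(\gamma \le \frac{1237 - \epsilon '' - 1/m}{1236 + \epsilon '}\); the right-hand side approaches \(\frac{1237}{1236}\) as \(\epsilon ', \epsilon ''\) tend to 0 and m grows.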

Corollary A.1

The SBR2RA problem is APX-hard.

B The complete proof of Lemma 4.2

Proof

(of Lemma 4.2, continued from the main text)
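As the case statements below indicate, the four characters around the reversal \(\rho (i, j)\) are \(a_{i-1} =\) ‘a’, \(a_i =\) ‘b’, \(a_j =\) ‘c’ and \(a_{j+1} =\) ‘d’: the reversal breaks the adjacencies ab and cd and creates ac and bd. The following minimal Python sketch illustrates only this local effect; it does not model the cycles, blocks, or \(R_M(A, B)\) from the main text, adjacencies are assumed to be unordered, and the function names are hypothetical.

```python
from collections import Counter


def adjacency_multiset(genome):
    """Multiset of unordered adjacent pairs of a genome (a string)."""
    return Counter(tuple(sorted(p)) for p in zip(genome, genome[1:]))


def reversal_effect(genome, i, j):
    """Apply rho(i, j) (1-based, inclusive) and report the adjacency changes.

    Returns the new genome together with the multisets of adjacencies
    that were broken and created by the reversal.
    """
    after = genome[:i - 1] + genome[i - 1:j][::-1] + genome[j:]
    before_adj = adjacency_multiset(genome)
    after_adj = adjacency_multiset(after)
    broken = before_adj - after_adj
    created = after_adj - before_adj
    return after, broken, created


# a_{i-1} = 'a', a_i = 'b', a_j = 'c', a_{j+1} = 'd' with i = 3, j = 5.
after, broken, created = reversal_effect("eabfcdg", 3, 5)
print(after)            # eacfbdg
print(sorted(broken))   # [('a', 'b'), ('c', 'd')]
print(sorted(created))  # [('a', 'c'), ('b', 'd')]
```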

Case 1.2. \(ac \in R_M(A, B)\) but \(bd \notin R_M(A, B)\) (or the other way around, which can be argued symmetrically). Let \(C_3\) denote the cycle containing the adjacency-edge ac. After the reversal, \(a_{i-1} a_j\) becomes saturated, and \(|R_{M'}(A, B)| = |R_M(A, B)| + 1\); the ending characters of the blocks remain the same except that \(a_i =\) ‘b’ becomes an end and \(a_{j+1} =\) ‘d’ becomes an end. Due to \(ab, cd \in R_{M'}(A, B)\), we split the pairing ac to pair a with \(a_i\) and to pair c with \(a_{j+1}\); then the two blocks in which \(a_i\) and \(a_{j+1}\) reside respectively belong to the same cycle, which contains two ending characters \(a_i =\) ‘b’ and \(a_{j+1} =\) ‘d’. Furthermore, \(C_3\) is merged with this cycle to form a single cycle containing both edges ab and cd, see Fig. 8.

Fig. 8 The effect of the reversal operation \(\rho (i, j)\), indicated by the crossed box, in Case 1.2

From the non-existence of a saturated adjacency stated in Lemma 4.1, the cycles \(C_1\) containing ab, \(C_3\) containing the character ‘a’, and the one containing the character ‘b’, denoted as \(C_4\), must not all be distinct; for the same reason, the cycles \(C_2\) containing cd, \(C_3\) containing the character ‘c’, and the one containing the character ‘d’, denoted as \(C_5\), must not all be distinct. One can easily check that \(C_1, C_2, C_3, C_4, C_5\) actually represent at most three distinct cycles of \(G = (V, E)\). This suggests that the reversal operation \(\rho (i, j)\) decreases \(C_M\) by at most 2, by merging all these cycles into one. Note that in this process only one copy of ac is moved out of \(R_M(A, B)\), but ‘a’ and ‘c’ reside in the same cycle afterwards; these imply the non-existence of a saturated adjacency stated in Lemma 4.1.

Case 1.3. \(ac \notin R_M(A, B)\) and \(bd \notin R_M(A, B)\). After the reversal, \(R_{M'}(A, B) = R_M(A, B) \oplus \{ab, cd\}\) (\(\oplus \) is the multiset union operation); the ending characters of the blocks remain the same except that all these four characters \(a_{i-1}, a_i, a_j, a_{j+1}\) become ends. We pair \(a_{i-1}\) with \(a_i\) and pair \(a_j\) with \(a_{j+1}\); this way, the cycle \(C_1\) is expanded to contain an adjacency-edge ab and the cycle \(C_2\) is expanded to contain an adjacency-edge cd, see Fig. 9.

Fig. 9 The effect of the reversal operation \(\rho (i, j)\), indicated by the crossed box, in Case 1.3

From the non-existence of a saturated adjacency stated in Lemma 4.1, the cycles \(C_1\), the one containing the character ‘a’ denoted as \(C_3\), and the one containing the character ‘b’ denoted as \(C_4\), must not all be distinct; for the same reason, the cycles \(C_2\), the one containing the character ‘c’ denoted as \(C_5\), and the one containing the character ‘d’ denoted as \(C_6\), must not all be distinct. Since the characters on the cycles \(C_1, C_3, C_4\) remain the same after the reversal, so do the characters on the cycles \(C_2, C_5, C_6\), the reversal operation \(\rho (i, j)\) decreases \(C_M\) by at most 2, by merging \(C_1, C_3\) and \(C_4\) into one and by merging \(C_2, C_5\) and \(C_6\) into one. Note that in this process no adjacency is moved out of \(R_M(A, B)\) and cycles were only merged, implying the non-existence of a saturated adjacency stated in Lemma 4.1 is maintained.

Case 2. \(a_{i-1}\) and \(a_i\) reside in the same block on the cycle \(C_1\), while \(a_j\) and \(a_{j+1}\) are ending characters. In this case, the reversal operation \(\rho (i, j)\) breaks the adjacency ab, i.e., \(ab \in R_{M'}(A, B)\).

Case 2.1. Both \(ac, bd \in R_M(A, B)\). Let \(C_2\) denote the cycle containing the adjacency-edge ac and \(C_3\) denote the cycle containing the adjacency-edge bd. Applying the non-existence of a saturated adjacency stated in Lemma 4.1 on \(a_{i-1} a_i = ab\), we conclude that \(C_1, C_2, C_3\) must not all be distinct. After the reversal, \(a_{i-1} a_j\) and \(a_i a_{j+1}\) become saturated, and \(|R_{M'}(A, B)| = |R_M(A, B)| - 1\); the ending characters of all the blocks remain the same, except the disappearance of \(a_j =\) ‘c’ and \(a_{j+1} =\) ‘d’. Due to \(ab \in R_{M'}(A, B)\), we pair the ‘a’ originally paired with \(a_j\) and the ‘b’ originally paired with \(a_{j+1}\) to form the adjacency ab. At the end, if \(C_1, C_2\) and \(C_3\) (at most two!) are merged into one cycle, see Fig. 10, then the reversal operation \(\rho (i, j)\) decreases \(C_M\) by at most 1, and additionally all ending characters ‘a’, ‘b’, ‘c’ and ‘d’ (if any) reside in the same cycle. These imply the non-existence of a saturated adjacency stated in Lemma 4.1.

Fig. 10 The effect of the reversal operation \(\rho (i, j)\), indicated by the crossed box, in Case 2.1

In the other case, \(C_1, C_2\) and \(C_3\) (at most two!) are not merged into one cycle, but lead to k distinct cycles with \(k = 2\) or 3. When \(k = 2\), \(C_M\) remains the same and at most one of ac and bd can satisfy the statement in Lemma 4.1; restoring the non-existence of a saturated adjacency stated in Lemma 4.1 then helps decrease \(C_M\) by 1. When \(k = 3\), \(C_M\) is increased by 1 and both ac and bd can satisfy the statement in Lemma 4.1; restoring the non-existence of a saturated adjacency stated in Lemma 4.1 then helps decrease \(C_M\) by 2, with the total effect of decreasing by 1.

In summary, the reversal operation \(\rho (i, j)\) decreases \(C_M\) by at most 1 and consequently decreases the quantity \(|R_M(A, B)| + C_M\) by at most 2.
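The bookkeeping behind this summary can be written out explicitly. Writing \(C_{M'}\) for the value of \(C_M\) after the reversal (a notation assumed here for illustration), the bound combines \(|R_{M'}(A, B)| = |R_M(A, B)| - 1\) from the beginning of Case 2.1 with \(C_{M'} \ge C_M - 1\):

$$\begin{aligned} |R_{M'}(A, B)| + C_{M'} \ge \left( |R_M(A, B)| - 1 \right) + \left( C_M - 1 \right) = |R_M(A, B)| + C_M - 2. \end{aligned}$$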

Case 2.2. \(ac \in R_M(A, B)\) but \(bd \notin R_M(A, B)\) (or the other way around, which can be argued symmetrically). Let \(C_2\) denote the cycle containing the adjacency-edge ac. After the reversal, \(a_{i-1} a_j\) becomes saturated, and \(|R_{M'}(A, B)| = |R_M(A, B)|\); the ending characters of the blocks remain the same, except the disappearance of \(a_j =\) ‘c’ while \(a_i =\) ‘b’ becomes an end. We break the edge ac of \(C_2\) incident at \(a_j =\) ‘c’ and re-connect the ending character ‘a’ with \(a_i =\) ‘b’. Applying the non-existence of a saturated adjacency stated in Lemma 4.1 on \(a_{i-1} a_i = ab\), we conclude that \(C_1, C_2\), and the cycle containing character ‘b’ (if any) denoted as \(C_3\), must not all be distinct.

If \(C_1, C_2\) and \(C_3\) (at most two!) are merged into one cycle, see Fig. 11, then the reversal operation \(\rho (i, j)\) decreases \(C_M\) by at most 1, and additionally all ending characters ‘a’, ‘b’, and ‘c’ (if any) reside in the same cycle. These imply the non-existence of a saturated adjacency stated in Lemma 4.1.

Fig. 11 The effect of the reversal operation \(\rho (i, j)\), indicated by the crossed box, in Case 2.2

In the other case \(C_1, C_2\) and \(C_3\) (at most two!) are not merged into one cycle, but lead to 2 distinct cycles. Then \(C_M\) remains the same, and the new adjacency ac can satisfy the statement in Lemma 4.1. Immediately restoring the non-existence of a saturated adjacency stated in Lemma 4.1 helps decrease \(C_M\) by 1.

In summary, the reversal operation \(\rho (i, j)\) decreases \(C_M\) by at most 1 and consequently decreases the quantity \(|R_M(A, B)| + C_M\) by at most 1.

Case 2.3. \(ac \notin R_M(A, B)\) and \(bd \notin R_M(A, B)\). After the reversal, \(|R_{M'}(A, B)| = |R_M(A, B)| + 1\); the ending characters of the blocks remain the same except that the two characters \(a_{i-1} =\) ‘a’ and \(a_i =\) ‘b’ become new ends. By pairing \(a_{i-1}\) with \(a_i\), the cycle \(C_1\) is expanded to contain a new adjacency-edge ab, see Fig. 12. From the non-existence of a saturated adjacency stated in Lemma 4.1, the cycles \(C_1\), the one containing the character ‘a’ denoted as \(C_2\), and the one containing the character ‘b’ denoted as \(C_3\), must not all be distinct; the reversal operation \(\rho (i, j)\) decreases \(C_M\) by at most 1, by merging \(C_1, C_2\) and \(C_3\) (at most two!) into one, see Fig. 12. Note that in this process no adjacency is moved out of \(R_M(A, B)\) and cycles were only merged, implying the non-existence of a saturated adjacency stated in Lemma 4.1 is maintained.

Fig. 12 The effect of the reversal operation \(\rho (i, j)\), indicated by the crossed box, in Case 2.3

Case 3. All four \(a_{i-1}\), \(a_i\), \(a_j\), and \(a_{j+1}\) are ending characters. Let \(C_1\) (\(C_2\), \(C_3\), \(C_4\), respectively) denote the cycle containing ‘a’ (‘b’, ‘c’, ‘d’, respectively).

Case 3.1. Both \(ac, bd \in R_M(A, B)\). Note that the ending characters ‘a’ and ‘c’ (‘b’ and ‘d’, respectively) are in a common cycle, that is, \(C_1\) and \(C_3\) refer to the same cycle (\(C_2\) and \(C_4\) refer to the same cycle, respectively). If \(a_{i-1}\) and \(a_j\) are not paired, i.e., not forming an adjacency-edge in the cycle \(C_1\), we pair them together to break \(C_1\) into two smaller cycles (see Fact 4.1). After the reversal, \(a_{i-1} a_j\) and \(a_i a_{j+1}\) become saturated, and \(|R_{M'}(A, B)| = |R_M(A, B)| - 2\); the ending characters of all the blocks remain the same, except the disappearance of these four ending characters \(a_{i-1}\), \(a_i\), \(a_j\), and \(a_{j+1}\), see Fig. 13. In summary, the reversal operation \(\rho (i, j)\) does not decrease \(C_M\) (but might increase it by at most 2). Note that in this process only one copy of ac (one copy of bd, respectively) is moved out of \(R_M(A, B)\), but the new block containing ac and the characters ‘a’ and ‘c’ are in at most two cycles by Fact 4.1 (the new block containing bd and the characters ‘b’ and ‘d’ are in at most two cycles by Fact 4.1, respectively); these imply the non-existence of a saturated adjacency stated in Lemma 4.1.

Fig. 13 The effect of the reversal operation \(\rho (i, j)\), indicated by the crossed box, in Case 3.1

Case 3.2. \(ac \in R_M(A, B)\) but \(bd \notin R_M(A, B)\) (or the other way around, which can be argued symmetrically). The same as in Case 3.1, we note that the ending characters ‘a’ and ‘c’ are in a common cycle, that is, \(C_1\) and \(C_3\) refer to the same cycle. If \(a_{i-1}\) and \(a_j\) are not paired, i.e., not forming an adjacency-edge in the cycle \(C_1\), we pair them together to break \(C_1\) into two smaller cycles (see Fact 4.1). After the reversal, \(a_{i-1} a_j\) becomes saturated, and \(|R_{M'}(A, B)| = |R_M(A, B)| - 1\); the ending characters of the blocks remain the same except the disappearance of the two ending characters \(a_{i-1}\) and \(a_j\). Therefore, the reversal operation \(\rho (i, j)\) does not decrease \(C_M\) (but might increase it by at most 1). The non-existence of a saturated adjacency stated in Lemma 4.1 is maintained because the new block containing ac and the characters ‘a’ and ‘c’ are in at most two cycles by Fact 4.1.

Case 3.3. \(ac \notin R_M(A, B)\) and \(bd \notin R_M(A, B)\). After the reversal, \(|R_{M'}(A, B)| = |R_M(A, B)|\); the ending characters of the blocks remain the same. The reversal operation \(\rho (i, j)\) does not change \(C_M\). Since no adjacency is moved out of \(R_M(A, B)\), the non-existence of a saturated adjacency stated in Lemma 4.1 is maintained. \(\square \)

About this article

Cite this article

Zhai, S., Zhang, P., Zhu, D. et al. An approximation algorithm for genome sorting by reversals to recover all adjacencies. J Comb Optim 37, 1170–1190 (2019). https://doi.org/10.1007/s10878-018-0346-y

  • DOI: https://doi.org/10.1007/s10878-018-0346-y
