Phase transition in the computational complexity of the shortest common superstring and genome assembly

L. A. Fernandez, V. Martin-Mayor, and D. Yllanes

Phys. Rev. E 109, 014133 – Published 24 January 2024

Abstract

Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the “easy” phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.

Received 17 April 2023
Accepted 11 December 2023

DOI: https://doi.org/10.1103/PhysRevE.109.014133

Physics Subject Headings (PhySH)

Bioinformatics, Computational complexity, NP-hard problems, Phase transitions, Sequencing analysis

Monte Carlo methods

Physics of Living Systems, Statistical Physics & Thermodynamics

Authors & Affiliations

L. A. Fernandez¹,², V. Martin-Mayor¹,², and D. Yllanes³,²

¹Departamento de Física Teórica, Universidad Complutense, 28040 Madrid, Spain
²Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50018 Zaragoza, Spain
³Chan Zuckerberg Biohub — SF, 499 Illinois Street, San Francisco, California 94158, USA

Issue

Vol. 109, Iss. 1 — January 2024

Images

Figure 1
Performance of common algorithms for the shortest common superstring problem. (Top) Probability of finding a successful solution (see text) using a greedy algorithm as a function of the coverage $W$, Eq. (1), for several values of the number of fragments (reads) $N_{frag}$ and for fragment length $ℓ_{frag} = 100$. For large coverage values, the algorithm always succeeds. (Bottom) In terms of the correct scaling variable $x$, Eq. (4), based on the ratio between the average maximum distance between fragments and $ℓ_{frag}$, the $p_{success}$ curves for different $N_{frag}$ cross, which we interpret as the onset of a phase transition at some critical $x_{c}$. The value of $x_{c}$ is algorithm dependent, but the qualitative behavior is the same for more sophisticated methods. As a demonstration, we also show the results using Velvet, which employs an algorithm based on de Bruijn graphs.
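As a rough illustration of what a greedy algorithm for the shortest common superstring problem can look like, the following minimal sketch repeatedly merges the two fragments with the largest suffix-prefix overlap. It is the textbook greedy heuristic, assumed here for illustration only; the caption does not specify which greedy variant the authors used, and the names `overlap` and `greedy_superstring` are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Textbook greedy heuristic (an assumption, not necessarily the paper's variant)."""
    # Drop duplicates and reads fully contained in another read.
    frags = list(dict.fromkeys(reads))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(frags)), 2):
            k = overlap(frags[i], frags[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the shared part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)

    return frags[0] if frags else ""


if __name__ == "__main__":
    print(greedy_superstring(["ACGG", "GGTT", "TTCA"]))  # ACGGTTCA
```

The pair search is quadratic in the number of fragments at every merge, so this form is only suitable for small toy inputs.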
Figure 2
Location of the critical point. The critical point can be determined by looking for the value of $\langle x \rangle$ where fluctuations are largest. We plot the correlation coefficient $r$ between the scaling variable $x$ and the success probability for single realizations of the $N_{frag}$ reads. The absolute value of $r$ has a maximum at the critical point $x_{c}$.
Figure 3
The success probability goes to zero in the hard phase. If the behavior shown in Fig. 1 corresponds to a phase transition, $p_{success}$ should tend to zero as $N_{frag} \to \infty$ for $\langle x \rangle < x_{c}$. This figure shows that $p_{success}$ indeed decays at least as fast as a power law in $1 / N_{frag}$ in the hard phase, while in the easy phase our results are already compatible with $p_{success} = 1$ for finite sizes.
Figure 4
Varying the fragment length hardly makes any difference. Our previous results have always considered $ℓ_{frag} = 100$. It turns out that the dependence on this parameter is residual and rapidly vanishes as $ℓ_{frag}$ grows, according to Eq. (6). That is, the curves of $p_{success}$ as a function of $\langle x \rangle$ can be collapsed if we subtract the scaling correction caused by finite $ℓ_{frag}$. We also show that the results for a natural genome (namely that of the swinepox virus) are indistinguishable from those for random sequences. In this case, since $L$ is fixed, we have a single value of $p_{success}$ for each $ℓ_{frag}$, all of which fall on the rescaled curve.
Figure 5
A Monte Carlo algorithm for the hard cases. We propose a segment-swap Markov-chain Monte Carlo algorithm (sketched in the left panel) that outperforms common deterministic methods in the hard regime (see right panel). We represent the permutation of the reads as an ordered sequence of fragments (the arrows indicate the sense in which the sequence should be toured). The elementary move of the algorithm is composed of the following three steps. First, randomly choose three independent pairs of consecutive fragments (depicted with grey circles in the plot), $\dots \to α_{1} \to α_{2} \to \dots \to β_{1} \to β_{2} \to \dots \to γ_{1} \to γ_{2} \to \dots$. Mind the pair ordering: when one tours the circular sequence starting from fragment $α_{1}$, fragments $γ_{1}$ and $γ_{2}$ are not found earlier than $β_{1}$ and $β_{2}$ (the choices $α_{2} = β_{1}$ and/or $β_{2} = γ_{1}$ are acceptable). Second, consider the rewired sequence $\dots \to α_{1} \to β_{2} \to \dots \to γ_{1} \to α_{2} \to \dots \to β_{1} \to γ_{2} \to \dots$ (there is an unacceptable reconnection, indicated by NO in the figure, that would split the sequence into three disconnected cycles). Third, if the new cycle is not longer than the original one, the segment swap is accepted. As we show in the right panel, segment swap is more effective than other algorithms for $-1 < x < 0.5$. For $x < 0$ the shortest common superstring (SCS) problem no longer corresponds to a full assembly, since there are gaps between reads. The segment-swap algorithm, however, always finds superstrings that satisfy our success criterion ($ℓ \leq ℓ_{ordered}$).
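The three steps of the elementary move can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cycle length is assumed to be the sum, over consecutive fragments in the circular order, of each fragment's length minus its overlap with the next one, and the names `overlap`, `cycle_length`, `segment_swap`, and `monte_carlo` are hypothetical. The cuts after list positions `i < j < k` play the role of the pairs $(α_{1}, α_{2})$, $(β_{1}, β_{2})$, and $(γ_{1}, γ_{2})$.

```python
import random


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def cycle_length(order: list[int], reads: list[str]) -> int:
    """Assumed cost: each read adds its length minus its overlap with the next read."""
    n = len(order)
    return sum(
        len(reads[order[t]]) - overlap(reads[order[t]], reads[order[(t + 1) % n]])
        for t in range(n)
    )


def segment_swap(order: list[int], i: int, j: int, k: int) -> list[int]:
    """Exchange the adjacent segments order[i+1..j] and order[j+1..k] (i < j < k).

    This yields ... a1 -> b2 ... g1 -> a2 ... b1 -> g2 ..., keeping a single cycle.
    """
    return order[: i + 1] + order[j + 1 : k + 1] + order[i + 1 : j + 1] + order[k + 1 :]


def monte_carlo(reads: list[str], steps: int, seed: int = 0) -> tuple[list[int], int]:
    """Run `steps` trial moves from a random order; needs at least three reads."""
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)
    cost = cycle_length(order, reads)
    for _ in range(steps):
        # Step 1: choose three distinct cut positions in tour order.
        i, j, k = sorted(rng.sample(range(len(order)), 3))
        # Step 2: build the rewired sequence.
        trial = segment_swap(order, i, j, k)
        trial_cost = cycle_length(trial, reads)
        # Step 3: accept if the new cycle is not longer than the old one.
        if trial_cost <= cost:
            order, cost = trial, trial_cost
    return order, cost
```

Recomputing the full cycle length at every step keeps the sketch short; since only three junctions change in a move, an incremental update of those three overlaps would be the natural optimization.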

Physical Review E

covering statistical, nonlinear, biological, and soft matter physics