Optimal gap-affine alignment in O(s) space

Abstract Motivation Pairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA’s O(s2) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement. Results In this article, we present the bidirectional WFA algorithm, the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA’s time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MBs aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long while maintaining competitive execution times. Availability and implementation All code is publicly available at https://github.com/smarco/BiWFA-paper. Supplementary information Supplementary data are available at Bioinformatics online.


Proof of the correctness lemma
In order to reason about the properties of the WFA dynamic programming structures, it is helpful to invoke certain properties of the Needleman-Wunsch dynamic programming matrices. Accordingly, we will provide the recursions here to introduce the notation. D i,j = min{M i−1,j + o + e, D i−1,j + e} where I is the indicator function that evaluates to 1 if its argument is true and 0 otherwise. The base case of the recursion is M 0,0 = 0. We also adopt the convention that D 0,j = I i,0 = ∞ for all i and j. An optimal alignment can be identified with a traceback path through these matrices: a sequence of cells that indicate which of the options from the recursion achieved the minimum score.
Before proving the correctness lemma, we prove two useful properties of the Needleman-Wunsch matrices.
Lemma 1. M is monotonically non-decreasing along each diagonal.
Proof. Choose integers i and j such that 0 ≤ i < m and 0 ≤ j < n, and we will show that M i,j ≤ M i+1,j+1 , which is sufficient to prove the claim. M i+1,j+1 corresponds to the score of an optimal alignment of q 0:i and t 0:j . Any traceback path of this alignment must include a coordinate (i, y) with y ≤ j or (x, j) with x ≤ i. Without loss of generality, assume that there is an optimal alignment path that includes (i, y), and choose y to be the maximal such value within this path. We consider two cases: 2. y < j. Then there must be at least j −y horizontal transitions on the traceback path following (i, y) for it to end in diagonal i − j. Moreover, since y is chosen to be maximal, (i, y + 1) is not on the traceback path, and there must therefore be at least one gap opened after (i, y). This implies M i+1,j+1 ≥ M i,y + o + (j − y)e. We also have M i,j ≤ M i,y + o + (j − y)e, since it is possible to reach (i, j) by taking j − y horizontal transitions starting from (i, y).
Lemma 2. D and I are monotonically non-decreasing along each diagonal, excluding the boundaries D 0,· and I ·,0 .
Proof. The proofs for I and D are essentially identical, so we will prove the claim only for I. The argument will be proved by induction on decreasing values for the diagonal k. The base case k = m − 1 is trivially true because there is only one cell in I in this diagonal (excluding the boundary). Consider i and j such that 0 ≤ i < m and 0 < j < n, and assume that the induction hypothesis holds for all diagonals k > i − j. We will show that I i,j ≤ I i+1,j+1 , which is sufficient prove the induction claim for k = i − j. Consider two cases.
1. The path is in M at (i, j). Then the path up to (i, j) and the path after (i, j) correspond to partial alignments in the forward and reverse direction respectively, and their scores are s f = M i,j and s r = s opt − M i,j . Taking k = i − j, we know that the f.r. points in the k-th diagonal must be at least as far as this coordinate in their respective directions: Because adjacent positions in an optimal traceback path can differ by at most p, we have both |s f − s opt /2| ≤ p/2 and |s r − s opt /2| ≤ p/2. These imply |s f − s r | ≤ p by the triangle inequality. 2. The path is in I at (i, j) and not also in M at (i, j). Then (i, j) is part of a gap that begins at (i, j ′ ) for some j ′ < j and ends at (i, j ′ + ℓ) where j ′ + ℓ > j, else the path is also in M at (i, j). Consider the quantity x = (s opt − 2M i,j ′ )/2e across three cases.
These correspond to the scores of the partial alignments before and after (i, j ′ ), respectively. Therefore we take k = i − j ′ , and, as previously, the f.r. points within this diagonal must obey the inequality − → M k,s f ≥ i ≥ ← − M k,sr . Note that M i,j ′ ≤ s opt /2 else I i,j ′ would not achieve the minimum difference from s opt /2. This implies x ≥ 0, and in particular |x| ≤ 1/2. Therefore, 2.2. 1/2 < x < ℓ − 1/2. Let x * be the nearest integer to x, and let s f = I i,j ′ +x * and s r = s opt − I i,j ′ +x * + o. These correspond to the scores of the partial alignments before and after (i, j ′ + x * ), respectively. Therefore we take k = i − j ′ − x * , and, as previously, the f.r. points within this diagonal must obey the inequality − → I k,s f ≥ i ≥ ← − I k,sr . Noting that |x − x * | ≤ 1/2 by construction, we also have 2.3. x ≥ ℓ − 1/2. Let s f = I i,j ′ +ℓ and s r = s opt − I i,j ′ +ℓ + o. These correspond to the scores of the partial alignments before and after (i, j ′ + ℓ), respectively. Therefore we take k = i − j ′ − ℓ, and, as previously, the f.r. points within this diagonal must obey the inequality − → I k,s f ≥ i ≥ ← − I k,sr . Noting that s opt /2 ≤ I i,j ′ +ℓ else j ≥ j ′ + ℓ, and also that I i,j ′ +ℓ = M i,j ′ + o + ℓe, we can obtain These together imply |s f − s r | ≤ o + e ≤ p.
3. The path is in D at (i, j) and not also in M at (i, j). Same as the previous case.
(⇐) We consider the three conditions separately.
1. Let (i 1 , j 1 ) be the coordinates in M corresponding to − → M k,s f and likewise (i 2 , j 2 ) for ← − M k,sr .
The partial alignments corresponding M i 2 ,j 2 and ← − M k,sr can be concatenated into a full alignment with score M i 2 ,j 2 + s r . By Lemma 1, this score is at most M i 1 ,j 1 + s r = s f + s r = s.