Longest common substrings with k mismatches

The longest common substring with $k$-mismatches problem is to find, given two strings $S_1$ and $S_2$, a longest substring $A_1$ of $S_1$ and $A_2$ of $S_2$ such that the Hamming distance between $A_1$ and $A_2$ is $\le k$. We introduce a practical $O(nm)$ time and $O(1)$ space solution for this problem, where $n$ and $m$ are the lengths of $S_1$ and $S_2$, respectively. This algorithm can also be used to compute the matching statistics with $k$-mismatches of $S_1$ and $S_2$ in $O(nm)$ time and $O(m)$ space. Moreover, we also present a theoretical solution for the $k = 1$ case which runs in $O(n \log m)$ time, assuming $m\le n$, and uses $O(m)$ space, improving over the existing $O(nm)$ time and $O(m)$ space bound of Babenko and Starikovskaya.


Introduction
String matching is an important task in many scientific fields such as text mining, detecting plagiarism or bioinformatics.Depending on the application, the matching can be either exact or approximate.In the approximate case, there exist many different metrics to measure the closeness of a match.Popular examples are the edit distance, the Damerau distance and the Hamming distance.In the Hamming distance, the distance between two strings of the same length is equal to the number of positions in the strings at which there is a mismatch between the corresponding symbols.Consider for example the field of bioinformatics, where the genetic code, in form of DNA, is compared.Mismatches may occur naturally since current sequencing technologies often incorrectly read some of the bases [2] or simply because the DNA of two different sources is compared to get a measure of closeness.For an overview of applications in computational biology see for example [3].In this paper we study the longest common substring (or factor ) with k-mismatches problem (k-LCF for short 4 ) which consists in finding the longest common substring of two strings S 1 and S 2 , while allowing for at most k mismatches, i.e., the Hamming distance between the two substrings is ≤ k.This problem is a generalization of the Longest Common Substring problem [3,4,5] and is similar to the threshold all-against-all problem defined by Gusfield [3] and to the local alignment problem.In the threshold all-against-all problem the goal is to find all the pairs of substrings of S 1 and S 2 such that the corresponding edit distance is less than a given number d.The difference in the k-LCF problem is that the distance used is the Hamming distance rather than the edit distance, and that we are interested in the pairs of substrings of maximal length only.In the local alignment problem, which can be solved in O(|S 1 | • |S 2 |) time using the Smith-Waterman algorithm [6], the goal is to compute a pair of substrings of S 1 and S 2 such that the corresponding similarity, according to a suitable scoring function, is maximum over all the pairs of substrings.In particular, if the scoring function is such that the score of a match is 1, the score of a mismatch is 0 and gaps are not allowed, a solution of the local alignment problem is comparable to one of the k-LCF problem, with the difference that there is no bound on the number of mismatches.Babenko and Starikovskaya [1] studied the case of 1 mismatch only and presented an algorithm for the 1-LCF problem which runs in A closely related problem is the one of computing the matching statistics with k mismatches.The matching statistics, introduced by Chang and Lawler [7] for the approximate string matching problem, is an array ms of |S 2 | integers such that ms[i] is the length of the longest prefix of the suffix of S 2 starting at position i that exactly matches a substring of S 1 , for i = 0, . . ., |S 2 | − 1.A natural generalization consists in relaxing the definition so that the matching is approximate with respect to the Hamming distance.Recently, Leimeister and Morgenstern [8] presented a greedy heuristic for the computation of the matching statistics with k mismatches, which runs in O(|S 1 | • k • z) time, where z is the maximum number of occurrences in S 2 of a string of maximal length which occurs in both S 1 and S 2 .In this paper we present two novel contributions.Our first result is an efficient algorithm for the k-LCF problem which runs in quadratic time in the length of the strings, that is, in time O(|S 1 | • |S 2 |) and only requires a constant amount of space.This algorithm can also be used to compute the matching statistics with k mismatches with no overhead in the time complexity, i.e., in O(|S 1 | • |S 2 |) time, and using O(|S 2 |) space.Our second result is an algorithm for the 1-LCF problem, i.e., for the k = 1 case.We show how to solve this instance in a more time efficient manner by using results from Crochemore et al. [9] for finding the longest repeat(s) with a block of k don't cares.Our algorithm takes time

Basic definitions
Let Σ be a finite alphabet of symbols and let Σ * be the set of strings over Σ.Given a string S ∈ Σ * , we denote with |S| the length of S and with S[i] the i-th symbol of S, for 0 ≤ i < |S|.Given two strings S and S ′ , S ′ is a substring of S if there are indices 0 then S ′ is a prefix (suffix) of S. We denote by S[i..j] the substring of S starting at position i and ending at position j.For i > j we obtain the empty string ε.The suffix tree T (S) of a string S is a rooted directed tree with |S| leaves and edge labels over Σ * \ {ε}.Each internal node has at least two children and is such that the edge labels of the children have different first symbols.For each leaf i, the concatenation of the edge labels on the path from the root to leaf i is equal to S[i..|S| − 1].Assuming a constant size alphabet, the suffix tree can be built in O(|S|) time [3].For any node u in T (S) depth(u) denotes the length of the string labeling the path from the root to u.For any pair of nodes u, v in T (S), LCA(u, v) denotes the lowest common ancestor of u and v, i.e., the deepest node in T (S) that is ancestor of both u and v.The suffix tree can be preprocessed in O(|S|) time so as to answer LCA queries in constant time [10].We denote with B(S) the binary suffix tree obtained by replacing each node u in T (S) with out-degree > 2 with a binary tree with d − 1 internal nodes and d − 2 internal edges, where the d leaves are the d children of u.The binary suffix tree can be built in O(|S|) time [9].

The longest common substring with k mismatches problem
Let S 1 and S 2 be two strings with n = |S 1 |, m = |S 2 |.W.l.o.g.we assume that n ≥ m.Given an integer k, let φ(i, j) be the length of the longest substring of S 1 and S 2 ending at position i and j, respectively, such that the two substrings have Hamming distance at most k.Formally, φ(i, j) is equal to the largest integer l ≤ min(i, j) + 1 such that for 0 ≤ i < n, 0 ≤ j < m.The longest common substring with k-mismatches problem consists in, given two strings S 1 and S 2 and an integer k, finding the length of the longest substrings of S 1 and S 2 with Hamming distance at most k, i.e., max i,j φ(i, j).

A practical algorithm for arbitrary k
In this section we present a practical algorithm for the k-LCF problem.By definition, φ(i, j) is also the length of the longest suffixes of S 1 [0..i] and S 2 [0..j] with Hamming distance at most k.Our algorithm computes all the values φ(i, j) based on this alternative formulation.The idea is to iterate over the φ matrix diagonal-wise and compute, for a fixed (i, j) ∈ {(0, 0), (0, 1), . . ., (0, m − 1)} ∪ {(1, 0), (2, 0), . . ., (n−1, 0)}, the values φ(i+l, j +l), for 0 ≤ l < min(n−i, m−j), i.e., the diagonal starting at (i, j), in O(m) time.Let Q be a (empty) queue data structure and s = 0, for a given pair (i, j).The algorithm iterates over l maintaining the invariant that l − s is the length of the longest common suffix of At the beginning the invariant holds since Q is empty, l − s = 0 and S 1 [i + s..i + l − 1] = S 2 [j + s..j + l − 1] = ε.Suppose that the invariant holds up to position l.If S 1 [i + l] = S 2 [j + l] then the invariant trivially holds also for l + 1 with s ′ = s and Q ′ = Q.Otherwise, we have a mismatch between S 1 [i + l] and S 2 [j + l].If |Q| < k, then the invariant also holds for l + 1 with s ′ = s and Q ′ equal to Q after an enqueue(Q, l) operation.Instead, if |Q| = k, the pair of suffixes S 1 [i + r..i + l] and S 2 [j + r..j + l], for r = s, . . ., min Q, match with k + 1 mismatches and r = min Q + 1 is the minimum position for which the corresponding suffixes match with k mismatches.Hence, in this case the invariant also holds for l + 1 with s ′ = min Q + 1 and Q ′ equal to Q after a dequeue operation followed by an enqueue(Q, l) operation.
The algorithm maintains the largest length found up to the current iteration and the starting positions of the corresponding substrings in S 1 and S 2 , such that the position in S 1 is minimal, in three integers ℓ, r 1 , and r 2 .Each time l − s > ℓ it updates their values accordingly.The code of the algorithm is shown in Figure 1.The time complexity of one iteration of the algorithm is O(1) if the queue operations take constant time, which yields O(m) time for a fixed i and O(nm) time in total.The space complexity is O(k), as the queue contains at most k elements at any iteration.
The algorithm can also be modified to use O(1) space at the price of a constant factor in the running time.We replace the queue with one integer q, encoding the number of mismatches (number of elements in the queue).The dequeue and enqueue operations then become q ← q − 1 and q ← q + 1, respectively.The update of s requires the computation of min Q + 1, which, by definition, is equal to the smallest position s ′ > s such that To this end, we simply scan S 1 and S 2 from position i + s and j + s, respectively, until we find a mismatch.As each symbol of S 1 and S 2 is looked up at most twice, the time complexity does not change.In practice, using an explicit queue is preferable, as it allows one to avoid rescanning the already scanned parts of the strings.
Observe that the pairs (i, j) corresponding to the largest value of φ must belong to K, as otherwise φ(i + 1, j + 1) > φ(i, j).Based on this observation, we now describe a sparse variant of our algorithm that runs in O(n + m + |K|) time, which is preferable if |K| = o(nm), at the price of O(n + m) space.Let LCE(i, j) be the longest common extension of S 1 [i..|S 1 | − 1] and S 1 [j..|S 2 | − 1], i.e., the length of the longest common prefix of the i-th suffix of S 1 and of the j-th suffix of S 2 .The idea is to iterate over the pairs in K ∪ K ′ only using LCE queries, where To this end, we add the following instructions at the beginning of the while loop at line 9: Let l ′ = l + max(0, γ − 1).If γ = 0, then l ′ = l and (i + l ′ , j + l ′ ) ∈ K ′ .Otherwise, by definition, (i + l + r, j + l + r) / ∈ K, for 0 ≤ r < γ − 1, and (i + l ′ , j + l ′ ) ∈ K. Thus, at each iteration of the while loop, the above code jumps to the next point in K ∪ K ′ along the current diagonal starting from the last processed point.Observe that the invariant is maintained in the jump from l to l ′ , i.e., l ′ − s = φ(i + l ′ − 1, j + l ′ − 1), because S 1 [i + l + r] = S 2 [j + l + r], for 0 ≤ r < γ − 1.Moreover, if γ > 0, S 1 [i + l ′ ] = S 2 [j + l ′ ], i.e., the point (i + l ′ , j + l ′ ) corresponds to a match, and either l ′ + 1 = min(n − i, m − j) or (i + l ′ + 1, j + l ′ + 1) ∈ K ′ , i.e., the next point to be processed is a mismatch.LCE queries can be answered in constant time using O(n + m) space after a linear time preprocessing of S 1 and S 2 by means of the generalized suffix tree of S 1 and S 2 , preprocessed so as to answer LCA queries in constant time [3].Hence, we obtain the claimed bound.
Finally, we describe how to compute the matching statistics with k mismatches of S 2 with respect to S 1 .The matching statistics with k mismatches of S 2 w.r.t.S 1 is an array ms k of m integers such that ms k [i] is the length of the longest prefix of S 2 [i..m − 1] that matches a substring of S 1 with at most k mismatches, for i = 0, . . ., m − 1.Using the algorithm described above, the array ms k can be computed in O(nm) time and O(m) space as follows: first, we Finally, we denote with S r = S[|S| − 1]S[|S| − 2] . . .S[0] the reverse of the string S.

Fig. 1 .
Fig. 1.The algorithm to compute the longest common substring up to k-mismatches of two strings.