All-pairs suffix/prefix in optimal time using Aho-Corasick space

traversal of the Aho-Corasick machine, and it thus requires space linear in the size of the machine. © 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
The all-pairs suffix/prefix (APSP) problem is a classic problem in computer science. It has many applications in bioinformatics because it is the first step in genome assembly [6]. Given a set R = {S_1, …, S_k} of k strings of total length n, the APSP problem asks us to find, for each string S_i, i ∈ [1, k], its longest suffix that is a prefix of string S_j, for all j ≠ i, j ∈ [1, k]. Gusfield et al. presented an algorithm running in the optimal O(n + k^2) time for solving APSP [7]. The algorithm is based on the generalized suffix tree [17] of R. Ohlebusch and Gog [12] gave another optimal algorithm, which is based on the generalized suffix array [11] of R. Tustumi et al. [15] gave yet another optimal algorithm based on the generalized suffix array of R. Thus the common denominator of all existing optimal algorithms for APSP is that they rely on sorting the suffixes of all strings in R, and so they require Ω(n) space in any case and for any alphabet.
There also exists a large body of work devoted to implementing algorithms for APSP that are suboptimal but practically fast on real-world datasets; see [5,14,9] and references therein for some of the state-of-the-art implementations. For a parallel implementation of the algorithm by Tustumi et al., see [10].
In this paper, we formalize the parameterized version of the APSP problem, denoted by ℓ-APSP, in which we are asked to output only the pairs in R whose suffix/prefix overlap is of length at least ℓ. ℓ-APSP is more attractive to study from both a theory and a practical perspective. From a theory perspective, it is more interesting to have an algorithm that is optimal with respect to the actual size of the output. Specifically, even when ℓ = 0, many pairs of strings may have no suffix/prefix overlap, and so the size of the meaningful output could be asymptotically smaller than Θ(k^2). From a practical perspective, we stress that most papers studying the APSP problem in fact considered the ℓ-APSP problem in their experiments.
The aforementioned algorithms solving APSP in optimal time do not explicitly consider ℓ-APSP. Gusfield et al. provide an extension of their main algorithm that solves 1-APSP in the optimal time [7]. We observe that an extra, trivial modification of this extended algorithm solves ℓ-APSP in the optimal O(n + |OUTPUT_ℓ|) time, where OUTPUT_ℓ is the set of output pairs: instead of considering every internal node of the generalized suffix tree, we must only consider internal nodes of string depth at least ℓ. It is perhaps less clear how one could modify the suffix-array-based algorithms presented in [12,15] to solve ℓ-APSP in the optimal time.
Here we give an algorithm running in the optimal O(n + |OUTPUT_ℓ|) time using O(n) space for solving ℓ-APSP. Our algorithm is thus optimal for the APSP problem as well by setting ℓ = 0. Notably, our algorithm is fundamentally different from all optimal algorithms for solving APSP. Specifically, our algorithm does not resort to suffix sorting. It relies on a novel traversal of the Aho-Corasick (AC) machine [1]; a finite-state machine that directly encodes all pairwise suffix/prefix overlaps (not only the longest one per string pair). It is thus somewhat surprising that the AC machine has not been used to obtain an optimal algorithm for the APSP problem; as we detail next, the AC machine has been employed for solving other APSP versions. In particular, our algorithm uses a tree induced from the AC machine, which we term the failure transition tree. We annotate, decompose, and carefully traverse this tree to infer only the longest suffix/prefix overlap per pair. Our algorithm thus requires space linear in the size of the AC machine, which may be asymptotically smaller than n if prefix redundancy in R is non-negligible, and is Θ(n) in the worst case.
Other related work. In addition to ℓ-APSP that is formulated in this paper, there are two other versions of APSP that have been studied in the literature. The first version consists in enumerating all pairwise suffix/prefix overlaps (not necessarily the longest ones) in decreasing order of their lengths. This version of the problem was solved by Ukkonen [16], who used this solution as the crux of his classic linear-time implementation of the greedy algorithm for constructing approximate shortest common superstrings. Ukkonen's solution is based on a reversed BFS traversal of the AC machine. Note, however, that by enumerating all such suffix/prefix overlaps in decreasing order of their length, we cannot guarantee that we will enumerate the longest ones per pair within the optimal O(n + |OUTPUT_ℓ|) time. Thus, in some sense, this version of APSP is computationally harder than ℓ-APSP. The second APSP version studied consists in enumerating the set of longest suffix/prefix overlaps (not, however, their association with the corresponding pairs of strings) [2]. Since any suffix/prefix overlap in this set is a prefix of some input string, the size of this set is in O(n). This version of the problem was solved in the optimal O(n) time, independently, by Park et al. [13] and by Khan [8]. Note, however, that by enumerating all such longest suffix/prefix overlaps, we cannot trivially infer their association with the corresponding pairs of strings in optimal time. Thus, in some sense, this version of APSP is computationally easier than ℓ-APSP. Park et al.'s solution is based on the efficient assessment of set sizes along failure transition paths in the AC machine. Our work employs ideas that are similar to the ones of Ukkonen [16] and to the ones of Park et al. [13]. Khan's solution is based on a careful simulation and adaptation of Gusfield et al.'s algorithm for APSP [7] on the AC machine.
Paper organization. Section 2 presents some preliminaries and our main result. Section 3 presents the algorithm we develop. Section 4 concludes the paper.

Preliminaries and main result
An alphabet Σ is a finite nonempty set whose elements are called letters. A string S of length m = |S| is a sequence S[1..m] of letters over Σ. The empty string ε is the string of length 0. The concatenation of two strings S and T is the string composed of the letters of S followed by the letters of T; it is denoted by S · T or simply by ST. For 1 ≤ i ≤ j ≤ m, S[i] denotes the ith letter of S, and the fragment S[i..j] denotes an occurrence of the underlying substring S[i] ⋯ S[j]. We say that a string P occurs at (starting) position i of S if P = S[i..i + |P| − 1]. A prefix of S is a fragment of the form S[1..j] and a suffix of S is a fragment of the form S[i..m]; a prefix or suffix of S is proper if it is not equal to S. Given two strings S and T, a suffix/prefix overlap of S and T is a suffix U of S that is a prefix of T; when U is the longest such suffix, then U is called the maximal suffix/prefix overlap of S and T. We now define the main problem investigated in this paper.
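For illustration, the maximal suffix/prefix overlap of two strings can be computed by brute force as follows (a minimal Python sketch; the function name `maximal_overlap` is ours, and this quadratic check is of course not the algorithm developed in this paper):

```python
def maximal_overlap(S, T):
    """Length of the longest suffix of S that is a prefix of T (brute force)."""
    for length in range(min(len(S), len(T)), 0, -1):
        if S[-length:] == T[:length]:
            return length
    return 0
```

For example, on the strings of Fig. 1, the maximal suffix/prefix overlap of baba and abaa is aba, of length 3.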

All-Pairs Suffix/Prefix of Length at Least ℓ (ℓ-APSP)
Input: A set R = {S_1, …, S_k} of k strings of total length n, and an integer ℓ ≥ 0.
Output: For each string S_i, i ∈ [1, k], the maximal suffix/prefix overlap of S_i and S_j, for all j ≠ i, provided that this overlap has length at least ℓ.
We denote the output of ℓ-APSP for all i ∈ [1, k] by OUTPUT_ℓ. Our main result is the following.

Theorem 1. The ℓ-APSP problem can be solved in the optimal O(n + |OUTPUT_ℓ|) time using O(n) space.
The AC machine is a finite-state machine which was introduced in [1] to find all occurrences of a set R = {S_1, …, S_k} of k > 1 strings within an input text. For conceptual convenience, we assume that no string in R is a prefix of another string in R (i.e., that R is prefix-free). If this is not the case, we simply append a terminal letter # ∉ Σ to every string in R. Let us assume that we need to find all occurrences of the strings in R = {abaa, abac, abb, abcb, baba, bbaa, bbba} in some other string (the text). To perform this by a single scan of the text, we can employ the AC machine in Fig. 1. The AC machine has two types of transitions: goto transitions and failure transitions. The goto transitions are shown as solid arrows in Fig. 1. The AC machine goes from a state s to another state t when a letter a of the alphabet is consumed. This is represented by a goto transition, denoted by g(s, a) = t. An AC machine can have a goto transition g(s, a) = t for each state s of the machine and each letter a of the alphabet. If the AC machine is at a state s and there is no goto transition from s for some letter of the alphabet, then a failure transition is made. The start state of the AC machine has one goto transition for each letter of the alphabet and thus no failure transition. On the contrary, every other state has exactly one failure transition to another state (including to the start state). In Fig. 1, a dashed arrow shows a failure transition to a state other than the start state 0, while all failure transitions that go to the start state are omitted. If the AC machine is in state s and makes a failure transition to state t, we write f(s) = t.
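The construction of the goto and failure transitions can be sketched in Python as follows (an illustrative sketch, not the implementation of [1,3]; the name `build_ac` and the state numbering are ours, and the start state's goto function is left partial rather than complete over the alphabet, with missing transitions treated as leading back to state 0):

```python
from collections import deque

def build_ac(patterns):
    """Build the goto trie and failure transitions of the AC machine."""
    goto, fail, terminal = [{}], [0], [None]   # state 0 is the start state
    for idx, p in enumerate(patterns):
        s = 0
        for a in p:
            if a not in goto[s]:
                goto.append({}); fail.append(0); terminal.append(None)
                goto[s][a] = len(goto) - 1
            s = goto[s][a]
        terminal[s] = idx                      # s is the terminal state of p
    # Set failure transitions by BFS: for t = g(s, a), follow failure
    # transitions from s until a state with a goto transition on a is found.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for a, t in goto[s].items():
            f = fail[s]
            while f and a not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(a, 0)
            queue.append(t)
    return goto, fail, terminal
```

On the running example of Fig. 1, the state reached by reading abcb fails to the state reached by reading b (the state numbers themselves differ from the figure).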
We say that state s corresponds to string U if and only if U is the string spelled out by following the shortest path of goto transitions from the start state to s. Aho and Corasick [1] proved the following key lemma on suffix/prefix overlaps.
Lemma 1 (Aho-Corasick lemma [1]). Let state s correspond to string U and state t correspond to string V in the AC machine of a set R of strings. Then, we have that f(s) = t if and only if V is the longest proper suffix of U that is also a prefix of some string in R.
For example, in the AC machine of Fig. 1, let s = 16 and t = 6. Then, f(16) = 6. Note that V = b is the longest proper suffix of U = abcb that is also a prefix of some string in R.
The Aho-Corasick lemma implies that the depth of the states on a failure transition path is monotonically decreasing. Let state s correspond to string U. Further let f(s) = t be the failure transition of s, and let S_j ∈ R be any string corresponding to a goto path containing t. Then U has a suffix/prefix overlap V with S_j, where V is the string corresponding to t. Thus, all suffix/prefix overlaps between U and the strings in R can be found by following the failure transition from s to another state s′, then following again the failure transition from s′ to another state s′′, and so on. Let U be some string from R. The state corresponding to U is termed a terminal state. By repeatedly following the failure transitions from the terminal states of the AC machine, we can find all pairwise suffix/prefix overlaps of all strings in R. This observation was used by Ukkonen in [16].
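The failure-link traversal just described can be sketched as follows (illustrative Python only; the names `all_maximal_overlaps` and `on_path` are ours, and the explicit sets L(s) make this an Ukkonen-style enumeration rather than the optimal algorithm developed in this paper):

```python
from collections import deque

def all_maximal_overlaps(patterns):
    """Maximal suffix/prefix overlap per ordered pair via failure links."""
    # Build the goto trie; on_path[s] records which strings pass through s.
    goto, fail, depth, on_path = [{}], [0], [0], [set()]
    term = {}
    for i, p in enumerate(patterns):
        s = 0
        for a in p:
            if a not in goto[s]:
                goto.append({}); fail.append(0)
                depth.append(depth[s] + 1); on_path.append(set())
                goto[s][a] = len(goto) - 1
            s = goto[s][a]
            on_path[s].add(i)
        term[i] = s                        # terminal state of pattern i
    # Failure transitions by BFS (Aho-Corasick lemma).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for a, t in goto[s].items():
            f = fail[s]
            while f and a not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(a, 0)
            queue.append(t)
    # From each terminal state, walk the failure path; the first state on
    # the path lying on S_j's goto path gives the maximal overlap with S_j.
    out = {}
    for i, s in term.items():
        s = fail[s]
        while s:
            for j in on_path[s]:
                if j != i and (i, j) not in out:
                    out[(i, j)] = depth[s]
            s = fail[s]
    return out
```

On the set R of Fig. 1, this reports, e.g., that baba has maximal overlap aba (length 3) with abaa and abac, and overlap b (length 1) with abb and abcb.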

The algorithm
Our new optimal algorithm has two main stages. In the first one, we show that any instance of ℓ-APSP can be reduced to some instance of an abstract tree problem in linear time. In the second stage, we solve the tree problem in optimal time and linear space. A full running example is provided along the way.

Reducing the ℓ-APSP problem to the ULIT problem
By [i, j]_d we denote an integer interval [i, j] labeled by an integer d. We write that [i_1, j_1]_{d_1} intersects [i_2, j_2]_{d_2} if and only if [i_1, j_1] ∩ [i_2, j_2] ≠ ∅, and that [i_1, j_1]_{d_1} contains [i_2, j_2]_{d_2} if and only if [i_2, j_2] ⊆ [i_1, j_1]. Note that the labels of the intervals play no role in the intersection or containment relationship. Given a collection I of r labeled intervals, we define the union U(I) of I as the set of labeled elements i_d such that i belongs to some interval of I and d is the largest label of an interval of I containing i. Note that U(I) can be represented more compactly as a set of labeled intervals; e.g., we can represent the union of {[1, 4]_1, [3, 4]_2}, which is {1_1, 2_1, 3_2, 4_2}, as {[1, 2]_1, [3, 4]_2}. We denote the output of ULIT (defined below) for all K leaf nodes by OUTPUT_K.
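The union and its compact representation can be illustrated with a naive element-by-element computation (a Python sketch; the function names are ours, and the element-wise expansion is for exposition only, not the linear-time machinery used later):

```python
def union_of_labeled_intervals(intervals):
    """intervals: list of (i, j, d), each encoding a labeled interval [i, j]_d.
    Returns U(I) as a dict: element -> largest label d covering it."""
    out = {}
    for i, j, d in intervals:
        for x in range(i, j + 1):
            out[x] = max(out.get(x, 0), d)
    return out

def compact(union):
    """Merge runs of consecutive, equally labeled elements into intervals."""
    res = []
    for x, d in sorted(union.items()):
        if res and res[-1][1] == x - 1 and res[-1][2] == d:
            res[-1][1] = x                 # extend the current interval
        else:
            res.append([x, x, d])          # open a new interval
    return [tuple(t) for t in res]
```

For example, `compact(union_of_labeled_intervals([(1, 4, 1), (3, 4, 2)]))` yields the compact representation {[1, 2]_1, [3, 4]_2} from the text.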
The first key observation is that every state of the AC machine (except the start state) has a single failure transition. This, together with the fact that the depth of the states on a failure transition path is monotonically decreasing, leads naturally to the following definition, which forms the basis of our reduction.
Definition 1 (Failure transition tree (FTtree)). Given a set R of strings, the failure transition tree (FTtree, for short) of R is the rooted tree induced by the set of (reversed) failure transitions of the AC machine of R.
We prove the following lemma.

Lemma 2. Any instance of the ℓ-APSP problem can be reduced to some instance of the ULIT problem in O(n) time.
Proof. We first construct the AC machine of R = {S_1, …, S_k}. We then obtain the FTtree of R, which we prune by excluding nodes of string depth smaller than ℓ. For conceptual convenience, we also prune leaf-to-root branchless paths for leaf nodes that do not correspond to terminal states. Finally, using a DFS traversal, we decorate each non-root node of string depth d of the FTtree with the labeled interval [i, j]_d, where the strings of lexicographic ranks i through j (cf. array A) are exactly the strings that share the length-d prefix corresponding to the node. We denote the residual tree by T.
We now prove that T is a valid input tree to ULIT. The fact that d_u < d_v, when u is an ancestor of v, follows by the Aho-Corasick lemma. Consider the case when node u at string depth d_u is an ancestor of node v at string depth d_v and the intervals of u and v share some element s. By construction, all elements in the interval I_u of u share a common prefix of length d_u and all elements of the interval I_v of v share a common prefix of length d_v. Since the intersection of the two intervals is non-empty, and all elements in I_v have a common prefix of length d_v with s, by transitivity, the union of the elements of I_u and I_v must have a common prefix of length d_u < d_v. Thus, when u is an ancestor of v, I_u must contain I_v. By the Aho-Corasick lemma we know that the failure transition path from a terminal state to the root encodes all suffix/prefix overlaps between the string corresponding to the terminal state and all other strings in R. Thus, in order to maintain the maximal suffix/prefix overlaps for a terminal state, we need to have the union of the collection of labeled intervals decorating every node that lies on such a path. This is precisely the output of ULIT: every element i_d ∈ U_w output by ULIT with T as input tree is in one-to-one correspondence with the maximal suffix/prefix overlap of length d between the string represented by w and the string represented by A[i]. The AC machine can be constructed in O(n) time. The FTtree can be constructed from the AC machine in O(n) time and decorated with the labeled intervals in O(n) time. This completes the proof.
Remark 1. Let us remark that the reduction arguments we provide in Lemma 2 resemble the correctness arguments of Park et al. in [13]. However, our encoding (labeled intervals) and the ULIT problem are novel, as they serve to solve a problem which is computationally harder than the problem studied in [13].

Solving the ULIT problem
A branching node of T is a non-root node with at least two children. A branchless subpath is an upward maximal path of nodes starting from a branching node or a leaf node and ending at the node right before the next branching node (or right before the root). The algorithm for ULIT proceeds in three steps:
1. Decompose T into branchless subpaths. For every non-root node u with labeled interval [i, j]_{d_u}, construct a tuple consisting of the identifier of the branchless subpath containing u and the labeled interval of u.
2. Sort all tuples constructed in Step 1 together. For each branchless subpath, take the compact representation of the union of its set of labeled intervals, which are now sorted. Inspect Table 1.
3. Visit the branching and leaf nodes using a BFS traversal on T. From node u to node v, such that v is the child of u, take the union of the two sets of labeled intervals (one from u and one from v), and update the union associated with v. At the leaf nodes we have the output sets. Inspect Fig. 4.
A stack is maintained to obtain the largest label per element (or interval). Note that the compact representation of the union of any two labeled intervals is of size at most three. The following lemma together with Lemma 2 implies Theorem 1.
Step 3 can be done in O(N + |OUTPUT_K|) time: taking the union of the two sets of intervals can be done in time linear in their total size, as the sets are sorted. The unions of labeled intervals, which are stored at branching nodes, correspond to leaf nodes, and so their total size is in O(|OUTPUT_K|). The space complexity of the algorithm described above is upper bounded by the time complexity.
Let us now explain how we can improve the space complexity of this algorithm to O(N). Instead of BFS in Step 3, we perform a DFS (Euler tour) on T, maintaining the union of labeled intervals associated with the currently active node u. This represents the union of labeled intervals from the root to u. Since we perform a DFS traversal, we make a union operation when going downward and a set difference operation when going upward. A vector of K stacks is maintained to obtain the currently largest label per element (or interval) during this process. As an example, consider visiting node 7 from active node u = 1 (downward) during the DFS. When u = 1 is active, its associated union is {[1, 4]_1} (see Fig. 4). The union stored for branchless subpath 7 is {[5, 5]_2} (see Table 1). The resulting union of the two is thus {[1, 4]_1, [5, 5]_2}.

Final remarks
All existing optimal algorithms for APSP rely on sorting the suffixes of all strings in R. Constructing the generalized suffix tree [17] or the generalized suffix array [11] of R requires Θ(n) space by definition, in any case and for any alphabet Σ, for storing the tree or the array, respectively. The AC machine of R can be constructed in O(n log |Σ|) time without suffix sorting [3], and its size, and hence the space used by our algorithm, may be asymptotically smaller than n when the strings in R share long prefixes.

Fig. 1. The AC machine over R = {abaa, abac, abb, abcb, baba, bbaa, bbba}. Solid arrows correspond to goto transitions and dashed arrows to failure transitions. The lexicographic ranks of the strings in R are the indices of array A.
The AC machine for a set R = {S_1, …, S_k} of k strings of total length n over an ordered alphabet Σ can be constructed in O(n log |Σ|) time without suffix sorting [3]. Dori and Landau showed that, when Σ = {1, …, n^{O(1)}}, the AC machine can be constructed in O(n) time using suffix sorting [4]. For each state s, let L(s) be the set of indices i such that the goto path that spells out string S_i contains state s. Let d(s) denote the string depth of state s, that is, the length of the string corresponding to s. By performing a DFS traversal on the AC machine using the goto transitions, we can construct in O(n) time an array A = A[1..k], which stores A[r] = s, r ∈ [1, k], if and only if for state s and string S_i, L(s) = {i}, d(s) = |S_i|, and S_i has lexicographic rank r in R. This array is A = [4, 17, 5, 16, 9, 12, 14] in Fig. 1; e.g., abac has lexicographic rank 2 in R.
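The construction of array A can be sketched as follows (an illustrative Python version; the name `terminal_states_by_rank` is ours, the state numbering differs from Fig. 1, and R is assumed prefix-free so that terminal states are exactly the leaves of the goto trie):

```python
def terminal_states_by_rank(patterns):
    """Build the goto trie and return (A, term): A lists the terminal states
    in lexicographic order of the patterns; term maps each terminal state to
    its pattern index."""
    goto, term = [{}], {}
    for i, p in enumerate(patterns):
        s = 0
        for a in p:
            if a not in goto[s]:
                goto.append({})
                goto[s][a] = len(goto) - 1
            s = goto[s][a]
        term[s] = i
    # DFS over goto transitions in sorted letter order: terminal states are
    # then reached in lexicographic order of the corresponding strings.
    A, stack = [], [0]
    while stack:
        s = stack.pop()
        if s in term:
            A.append(s)
        for a in sorted(goto[s], reverse=True):  # reverse so pops are in order
            stack.append(goto[s][a])
    return A, term
```

Mapping the states in A back to their strings recovers R in lexicographic order, regardless of the input order of the patterns.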
Such a representation by maximal, equally labeled intervals is called the compact representation of U(I). Let us now define the following auxiliary abstract problem.
Union of Labeled Intervals on a Tree (ULIT)
Input: A rooted tree T of size N with K leaf nodes. Every non-root node u of T is labeled with an interval I_u = [i, j]_{d_u}, where i, j ∈ [K] and d_u ∈ [N]. For any two labeled intervals I_u and I_v, such that u is an ancestor of v, we have that d_v > d_u and either I_u contains I_v or I_u and I_v do not intersect.
Output: For each leaf node w of T, return the union U_w of the labeled intervals on the path from w to the root of T.
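The ULIT semantics can be made concrete with a brute-force solver that simply walks each leaf-to-root path (a Python sketch; the name `ulit_bruteforce` and the parent-dictionary encoding are ours, and this takes no advantage of the containment structure, unlike the O(N + |OUTPUT_K|)-time algorithm developed in this section):

```python
def ulit_bruteforce(parent, intervals):
    """parent[u]: parent of node u (parent[root] is None).
    intervals[u] = (i, j, d): the labeled interval [i, j]_d of non-root u.
    For each leaf w, return U_w as a dict: element -> largest label d."""
    leaves = [u for u in parent if u not in parent.values()]
    out = {}
    for w in leaves:
        union, u = {}, w
        while parent[u] is not None:      # stop before the unlabeled root
            i, j, d = intervals[u]
            for x in range(i, j + 1):
                union[x] = max(union.get(x, 0), d)
            u = parent[u]
        out[w] = union
    return out
```

Since labels strictly increase downward and intervals are nested or disjoint, taking the largest label per element indeed keeps, for each element, the deepest interval on the path that covers it.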

Fig. 2. FTtree T_1 for ℓ = 1. The output set per leaf node is in a squared box.

Fig. 3. FTtree T_2 for ℓ = 2. The output set per leaf node is in a squared box. Every element i_d ∈ U_w output by ULIT with T as input tree is in one-to-one correspondence with the maximal suffix/prefix overlap of length d between the string represented by w and the string represented by A[i].

Lemma 3.
The ULIT problem can be solved in O(N + |OUTPUT_K|) time using O(N) space.
Proof. That the presented algorithm is correct is trivial. We next analyze the time complexity. Step 1 can be done in O(N) time because T is of size N, and thus we have no more than N tuples to create. Step 2 can be done in O(N) time: the sorting can be done in O(N) time using radix sort. By definition, for any two labeled intervals I_u and I_v, such that u is an ancestor of v, we have that d_v > d_u and either I_u contains I_v or I_u and I_v do not intersect. Thus taking the compact representation of the union of the set of labeled intervals can be done in O(N) time, because the labeled intervals are sorted and their total number is in O(N), since we have one interval per node of the branchless subpath.