Efficient Construction of the BWT for Repetitive Text Using String Compression

We present a new semi-external algorithm that builds the Burrows–Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce the computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space, but it also speeds up the required computations. Our experiments show important savings in both space and computation time when the text is repetitive. On average, we are 3.7x faster than the baseline compressed approach, while maintaining a similar memory consumption. These results make our method stand out as the only one (to our knowledge) that can build the BCR BWT of a collection of 25 human genomes (75 GB) in about 7.3 hours, and using only 27 GB of working memory.


Introduction
The Burrows-Wheeler transform (BWT) [6] is a reversible string transformation that reorders the symbols of a text T according to the lexicographical ranks of its suffixes. The features of this transform have turned it into a key component for text compression and indexing [32,25]. In addition to being reversible, the reordering produced by the BWT reduces the number of equal-symbol runs in T , thus improving compressibility. On the other hand, its combinatorial properties [10] enable the creation of self-indexes [9,28] that support pattern matching in time proportional to the pattern length. Popular bioinformatic tools [18,21] rely on the BWT to process data, as collections in this discipline are typically massive and repetitive, and the patterns to search for are short.
There are several algorithms in the literature that produce the BWT in linear time [34,1,23,8,3]. Nevertheless, the computational resources their implementations require when the input is large are still too high for practical purposes. This problem is particularly evident in genomics applications, where the amount of data is growing at an astronomical rate [35].
Although genomic collections are becoming more and more massive, the effective information they contain remains low compared to their sizes [27]. A promising solution to deal with this kind of data is then to design BWT algorithms that scale with the amount of information in the collection, not with its size.
Motivated by these ideas, some authors have developed new BWT algorithms that exploit text repetitions to reduce the computational requirements [14,5,13,15,4]. Their approach

▶ Definition 3. An LMS substring is (i) a substring T [i, j] with both T [i] and T [j] being LMS symbols, and there is no other LMS symbol in the substring, for i ≠ j; or (ii) the sentinel itself.
SA-IS is a recursive method. In every recursion i, it initializes an empty suffix array A i for the input text T i (with T 1 = T ). Then, it scans T i from right to left to classify the symbols as L-type, S-type or LMS-type. As it moves through the text, the algorithm records the text positions of the LMS substrings in A i . More specifically, if T i [j] = a is the first symbol of an LMS substring, it inserts j in the right-most empty position of the bucket a of A i . After scanning T i , SA-IS sorts the LMS substrings in A i using ISS. This procedure only requires two linear scans of A i (we refer the reader to Nong et al. [30] for further detail).
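The right-to-left classification step can be sketched in a few lines (a minimal Python sketch; the function name and list-based output are our choices, not from the paper):

```python
def classify(text):
    """Classify each position of `text` as 'L', 'S', or 'LMS'.

    Standard SA-IS rules: the last symbol (the sentinel) is S-type;
    T[i] is S-type if T[i] < T[i+1], L-type if T[i] > T[i+1], and
    inherits the type of T[i+1] on ties.  An S-type position whose
    left neighbour is L-type becomes LMS-type.
    """
    n = len(text)
    types = ['S'] * n  # the sentinel at position n-1 is S-type
    # right-to-left scan to assign L/S types
    for i in range(n - 2, -1, -1):
        if text[i] > text[i + 1]:
            types[i] = 'L'
        elif text[i] < text[i + 1]:
            types[i] = 'S'
        else:
            types[i] = types[i + 1]
    # an S-type position preceded by an L-type position is LMS-type
    for i in range(1, n):
        if types[i] == 'S' and types[i - 1] == 'L':
            types[i] = 'LMS'
    return types
```

For instance, in "banana$" (with '$' smaller than any letter), positions 1, 3 and 6 come out LMS-type, matching Definitions 2 and 3.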
ISS sorts the LMS substrings in an order that differs slightly from the lexicographic one; we refer to it as the ≺LMS ordering. In particular, if an LMS substring T i [a, b] is a prefix of another LMS substring T i [a ′ , b ′ ], then T i [a, b] gets the higher order. However, the higher rank of T i [a, b] implies that the suffix T i [a..] is lexicographically greater than the suffix T i [a ′ ..]. The cause of this property is explained in Section 2 of Ko and Aluru [17].
The idea now is to use the sorted LMS substrings to induce the order of the suffixes in T i that are not prefixed by LMS substrings. Still, LMS substrings with the same sequence remain unsorted in A i . Nong et al. solve this problem by creating a new string T i+1 in which they replace the distinct LMS substrings with their orders in A i , and use T i+1 as input for another recursive call i + 1. The base case for the recursion is reached when all the suffixes in A i are prefixed by different symbols, in which case the call returns A i without further processing.
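The construction of the reduced string T i+1 can be illustrated as follows (a Python sketch; for brevity it ranks the distinct LMS substrings by plain lexicographic order, whereas SA-IS derives the analogous ranks from A i via ISS, following the ≺LMS order discussed above — the two orders differ only when one substring is a prefix of another):

```python
def lms_positions(text):
    """Return the positions of LMS-type symbols in `text` (SA-IS rules)."""
    n = len(text)
    stype = [False] * n
    stype[n - 1] = True  # the sentinel is S-type
    for i in range(n - 2, -1, -1):
        stype[i] = text[i] < text[i + 1] or (text[i] == text[i + 1] and stype[i + 1])
    return [i for i in range(1, n) if stype[i] and not stype[i - 1]]

def reduce_text(text):
    """Replace each LMS substring of `text` by the rank of its sequence
    among the distinct LMS substrings (plain lexicographic ranks here)."""
    pos = lms_positions(text)
    # an LMS substring spans two consecutive LMS positions (inclusive)
    subs = [text[pos[j]:pos[j + 1] + 1] for j in range(len(pos) - 1)]
    subs.append(text[pos[-1]:])  # the sentinel itself
    rank = {s: r for r, s in enumerate(sorted(set(subs)))}
    return [rank[s] for s in subs]
```

On "banana$" the LMS substrings are "ana", "ana$" and "$", so the reduced string has three symbols, one per LMS substring occurrence.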
When the (i + 1)th recursive call ends, the suffixes of T i prefixed by the same LMS substrings are completely sorted in A i+1 , so SA-IS proceeds to complete A i . To do so, it resets A i , inserts the LMS substrings arranged as their respective symbols appear in A i+1 , and performs ISS again to reorder the unsorted suffixes of T i . Once it finishes, it passes A i to the previous recursion i − 1. The final array A 1 is the suffix array of T .

Definitions
We consider a collection T = {T 1 , T 2 , . . . , T k } of k strings over the alphabet Σ = [2, σ]. The input for our algorithm is thus the sequence T = T 1 $ T 2 $ . . . T k $ of total length n = |T | that represents the concatenation of T . The symbol $ is a sentinel that we use as a boundary between consecutive strings in T . We map $ to Σ[1] = 1, the smallest symbol in the alphabet.
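A minimal sketch of this input encoding (the integer mapping and function names are our own illustrative choices):

```python
SENTINEL = 1  # $ maps to the smallest symbol of the alphabet

def concatenate(collection):
    """Encode a string collection as the single sequence
    T = T1 $ T2 $ ... Tk $ over the integer alphabet [2, sigma],
    reserving symbol 1 for the sentinel $."""
    # assign symbols 2..sigma to the distinct characters, in order
    chars = sorted(set("".join(collection)))
    code = {c: i + 2 for i, c in enumerate(chars)}
    seq = []
    for s in collection:
        seq.extend(code[c] for c in s)
        seq.append(SENTINEL)  # boundary between consecutive strings
    return seq, code
```

For the collection {"ab", "ba"} this yields the sequence 2 3 1 3 2 1.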

Overview of Our Algorithm
We call our algorithm for computing the BCR BWT of T grlBWT. This method relies on the ideas developed by Nong et al. in the SA-IS algorithm (Section 2.3), but includes elements of grammar and run-length compression (Section 2.2). These new features reduce the space usage of the temporary data that grlBWT maintains in memory, thus decreasing both working memory and computing time. We now give a brief overview of our approach.

Our method works in two phases: the parsing phase and the induction phase. The parsing phase is similar to the recursive steps of SA-IS. In every iteration i (or parsing round), we first scan the input string T i (T 1 = T ) to build a dictionary D i with the phrases that occur as LMS substrings. We also record the frequency of every phrase, i.e., the number of times it occurs as an LMS substring in T i . Subsequently, we use the phrases in D i and their frequencies to construct a preliminary BWT for T i (pBW T i ), which we complete in the induction phase. We say pBW T i is partial because it has empty spaces we cannot fill with just the information in D i . To make the completion more efficient during the induction phase, we encode pBW T i using run-length compression and D i using a technique similar to grammar compression. Finally, we store pBW T i and D i on disk, and create a new text T i+1 for the next parsing round i + 1. We construct T i+1 by replacing the LMS substrings of T i with their associated symbols in D i . The parsing phase finishes when no new dictionary phrases can be extracted from the input text T i (see Section 3.3).
Let h be the number of iterations of the parsing phase of grlBWT on T . The induction phase starts by building the BWT for T h . After obtaining BW T h , we start a new iterative process in which we revisit the data we dumped to disk during the parsing phase in reverse order (i.e., from round h − 1 to round 1). In every iteration i, the BWT BW T i+1 of T i+1 is already computed, and we use it along with the compressed version of D i to induce the order of the symbols in the empty entries of pBW T i . Once we finish the induction, we compact pBW T i using run-length encoding to create BW T i . The final BCR BWT of T is thus in BW T 1 .
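The two-phase structure can be summarized with the following control-flow sketch (the three callables are hypothetical stand-ins for the round procedures described above, and an in-memory list plays the role of the data stored on disk):

```python
def grl_bwt_skeleton(text, parse_round, base_bwt, induce):
    """High-level control flow of grlBWT's two phases (structural
    sketch only; the real procedures work over compressed data)."""
    stack = []                        # pBWT/dictionary pairs "on disk"
    t = text
    while True:                       # parsing phase: rounds 1..h
        pbwt, dictionary, nxt = parse_round(t)
        if nxt is None:               # no new phrases: stop parsing
            break
        stack.append((pbwt, dictionary))
        t = nxt
    bwt = base_bwt(t)                 # BWT of the last round's text
    while stack:                      # induction: rounds h-1 down to 1
        pbwt, dictionary = stack.pop()
        bwt = induce(pbwt, dictionary, bwt)
    return bwt
```

The stack reproduces the reverse-order revisit of the parsing rounds: the data of round h − 1 is popped first, that of round 1 last.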

The Parsing Phase
In this section, we explain the steps we perform during the ith iteration of the parsing phase of grlBWT. Assume we receive as input a text T i over the alphabet Σ i = [1, σ i ]. We first initialize a hash table H i that we will use to construct the dictionary D i . The keys in H i will be the phrases that occur as LMS substrings in T i while the values of H i will be the frequencies of the keys, i.e., the number of times the keys occur as LMS substrings in T i .
The basic idea to fill H i consists of scanning T i from right to left to classify its symbols according to the definitions of Section 2.3, and hash an LMS substring every time we reach an LMS-type symbol. This mechanism is almost the same as the one described by Nong et al. [30] to detect the LMS substrings (except for the hashing). However, we add an extra consideration. The detection of LMS substrings is oblivious of the fact that T 1 encodes a string collection rather than a single string. More precisely, in any parsing round i > 1, we could have an LMS substring of T i whose expansion 1 produces a substring of T 1 that covers two or more strings of T . These border phrases make the computation of the BCR BWT a bit more difficult as we need to treat them differently. We avoid this problem by maintaining a bit vector B i [1, σ i ] that marks which symbols in T i expand to suffixes of strings in T . Thus, during the right-to-left scan of T i , each time we reach a position j whose symbol is marked in B i , we cut the phrase we are forming so that it ends at T i [j]. Notice that the truncated strings are not LMS substrings by definition, but they do not affect our algorithm (the reasons are explained in Definition 4 and Lemma 3 of Díaz-Domínguez et al. [7]). We receive B i as input along with T i at the beginning of the parsing round i, and we compute the next B i+1 [1, σ i+1 ] when we finish the round.
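The filling of H i can be sketched as follows (a simplified Python sketch over character strings; it ignores the border-phrase treatment with B i described above):

```python
from collections import Counter

def lms_phrase_freqs(text):
    """Hash-table H_i analogue: the distinct sequences that occur as
    LMS substrings of `text`, mapped to their frequencies."""
    n = len(text)
    stype = [False] * n
    stype[n - 1] = True  # the sentinel is S-type
    for i in range(n - 2, -1, -1):
        stype[i] = text[i] < text[i + 1] or (text[i] == text[i + 1] and stype[i + 1])
    # LMS positions: S-type preceded by L-type
    pos = [i for i in range(1, n) if stype[i] and not stype[i - 1]]
    # consecutive LMS positions delimit the LMS substrings (inclusive)
    phrases = [text[pos[j]:pos[j + 1] + 1] for j in range(len(pos) - 1)]
    phrases.append(text[pos[-1]:])  # the sentinel itself
    return Counter(phrases)
```

Note that on "banana$banana$" the sketch produces the phrase "$ba", which crosses a string boundary: this is exactly the border-phrase situation that B i is designed to handle.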
For practical reasons, we change the representation of D i , encoded in H i for the moment, to a more convenient data structure. First, we concatenate all the keys of D i in one single vector R i . We mark the boundaries of consecutive phrases in R i with a bit vector L i in which we set L i [j] = 1 if R i [j] is the first symbol of a phrase, and set L i [j] = 0 otherwise. We also augment L i with a data structure that supports rank 1 queries [33] to map each symbol R i [j] to its corresponding phrase. We store the values of D i (the frequencies) in another vector N i , maintaining the relative order so that N i [o] is the frequency of the oth phrase we inserted into R i . For simplicity, we will refer to the representation (R i , L i , N i ) simply as D i . We still need H i to construct the parse T i+1 , so we do not discard it but store it on disk.
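The (R i , L i , N i ) encoding can be sketched as follows (the linear-scan rank is for illustration only; a real implementation would use a succinct rank structure [33], and the function names are ours):

```python
def pack_dictionary(freqs):
    """Encode the dictionary D_i as (R, L, N): R concatenates the
    phrases, L marks with 1 the first symbol of each phrase, and
    N[o] is the frequency of the o-th phrase inserted."""
    R, L, N = [], [], []
    for phrase, freq in freqs.items():
        for j, sym in enumerate(phrase):
            R.append(sym)
            L.append(1 if j == 0 else 0)
        N.append(freq)
    return R, L, N

def phrase_of(L, j):
    """rank_1(L, j) - 1: the phrase (0-based) that position j of R
    belongs to.  A linear scan stands in for the rank structure."""
    return sum(L[:j + 1]) - 1
```

For example, packing {"ana": 2, "$": 1} places the two phrases back to back in R, with L = 1 0 0 1 marking their first symbols.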
The next step is to build pBW T i from D i . For that purpose, we use the following observations:

▶ Lemma 4. Let X [1, x] and Y [1, y] be two different strings over the alphabet Σ i , with lengths x > 1 and y > 1 (respectively). Assume both occur as suffixes in one or more phrases of D i . Let X be the list of positions in T i where X occurs as a suffix of an LMS substring, and let Y be the analogous list for Y . If X has a higher ≺LMS rank than Y (Section 2.3), then all the suffixes of T i starting at positions in X are lexicographically greater than the suffixes starting at positions in Y.
Proof. Assume first that X is not a prefix of Y (and vice versa). We compare the sequences of these strings from left to right until we find a mismatching position u (i.e., X[u] ≠ Y [u]). We know that the symbols X[u] and Y [u] define the lexicographical order of the suffixes in X relative to the suffixes in Y. In the other scenario, when one string is a prefix of the other, we cannot use this mechanism as we will not find a mismatching position X[u] ≠ Y [u]. For this case, we resort to the symbol types of Section 2.3. We assume for this proof that X is a prefix of Y , but the other way is equivalent. We know that X[x] and Y [x] have different types. X[x] is LMS type because X is a suffix of an LMS substring. On the other hand, Y [x] is L type because if it were S type, then it would also be LMS type, and thus Y [1, x] would be an occurrence of X. Due to these types, the occurrences of X in X are always followed in T i by symbols that are greater than Y [x + 1], meaning that the suffixes of T i starting at positions in X are lexicographically greater than the suffixes starting at positions in Y. This observation does not hold when X or Y have length one: X[x] equals Y [1] and both are LMS type, so there is not enough information to decide the lexicographical order of the suffixes in X and Y. ◀

The consequence of Lemma 4 is that the suffixes of length > 1 in D i induce a partition over SA i (the suffix array of T i ):

▶ Lemma 5. Let S be the set of distinct strings of length > 1 that occur as suffixes of the phrases in D i , and let O be the corresponding lists of positions in T i where each string of S occurs as a suffix of an LMS substring. The lexicographical sorting of the suffixes of T i does not interleave in SA i suffixes that belong to different lists of O.

Proof. We demonstrate the lemma by showing that the lexicographical sorting does not interleave suffixes of T i in SA i that belong to different lists of O. Assume a string S u ∈ S, associated with the list O u ∈ O, is a prefix of another string S u ′ ∈ S, which in turn is associated with the list O u ′ ∈ O.
Even though we do not know the symbols that occur to the right of S u in its occurrences of O u , we do know that both S u and S u ′ are suffixes of LMS substrings, and by Lemma 4, we know that all the suffixes of T i in O u are lexicographically greater than the suffixes in O u ′ . Hence, the interleaving of suffixes in SA i from different lists of O is not possible, even if S is not a prefix-free set. ◀

Lemma 5 gives us a simple way to construct the preliminary BWT for T i (pBW T i ). We consider for the moment pBW T i to be a vector of lists to simplify the explanations. We first sort the strings of S in ≺LMS order. Then, for every oth string S ∈ S in ≺LMS order, we insert in the list pBW T i [o] the symbols that occur to the left of S in D i . There are three cases to consider for this task.

▶ Lemma 6. Let S ∈ S be a string whose suffix occurrences in the phrases of D i are preceded by more than one distinct symbol. Then the list pBW T i [o] associated with S contains more than one distinct symbol, and it is not possible to decide the relative order of those symbols with the information of D i .
Proof. Let X and Y be two phrases of D i where S occurs as a suffix. Assume the left symbol of S in X is x ∈ Σ i and the left symbol in Y is y ∈ Σ i . In this scenario, the relative order of x and y is not decided by S, but by the sequences that occur to the right of X and Y in T i . However, those sequences are not accessible directly from D i . Hence, it is not possible to decide the order of x and y in pBW T i [o]. ◀

The problem is that we do not store O, so we do not know the value of l in a pair (s, l). Nevertheless, we do have the frequencies of the phrases of D i in the vector N i . In this way, we can compute l by summing the frequencies in N i for the phrases of D i where S occurs as a suffix. Now that we have covered all the theoretical aspects of the parsing phase, we proceed to describe our procedure to build pBW T i .


Constructing the Preliminary BWT for the Parsing Round
The computation of pBW T i starts with the construction of a generalized suffix array SA D i for D i . We say SA D i is generalized because it only considers the suffixes of the dictionary phrases. If a string S ∈ S appears as a suffix in two or more phrases, those occurrences maintain in SA D i the relative order in which their enclosing phrases appear in D i . In practice, the values we store in SA D i are the positions in R i , the vector storing the concatenated phrases of D i (see the encoding of D i in Subsection 3.3).
We compute SA D i using a modified version of the ISS method mentioned in Subsection 2.3. The first difference is that, in step one, we insert in SA D i the position in R i of the last symbol of each phrase. Put another way, if R i [j], with L i [j + 1] = 1, is the last symbol of a phrase F , then we insert j in the right-most available cell of the bucket R i [j] of SA D i . Step one in the original ISS puts LMS-type symbols at the end of the buckets; in our case, the last symbol of a phrase is, by definition, LMS type in T i , so the operation is homologous. The second difference of our ISS variation is that, during steps two and three, we skip the positions whose suffixes we do not need to sort.

The next step is to scan SA D i from left to right to compute pBW T i . From now on, we consider pBW T i to be a run-length compressed vector instead of a vector of lists. As we move through the suffix array, we search for every range SA D i [j, j ′ ], with j ′ − j + 1 ≥ 1, that encodes suffixes with the same sequence 2 . Nevertheless, we consider only the ranges that either represent suffixes of length > 1 or suffixes of length 1 that expand to suffixes of T . Recall that the left-most symbol of an LMS substring is the same as the right-most symbol of the LMS substring that precedes it. Hence, considering all the suffixes in D i would produce a redundant (and incorrect) BWT. The only exception to this rule are the LMS substrings at the beginning of the strings of T , as they do not share a symbol with the LMS substring to their left. This kind of substring only appears when we cross from T u+1 to T u in T i , with T u , T u+1 ∈ T . We can detect this situation using B i , the bit vector marking the symbols in Σ i that expand to suffixes of strings in T (see Section 3.3).
We define the length l of a range SA D i [j, j ′ ] as the sum of the frequencies of the phrases where the suffixes in SA D i [j, j ′ ] occur. If all the suffixes of SA D i [j, j ′ ] are preceded by the same symbol s ∈ Σ i , we append (s, l) to pBW T i (see Lemma 8). Otherwise, we append (*, l). The symbol * represents an empty entry and lies outside Σ i . We will resolve (*, l) in the next phase of grlBWT (see Lemmas 6 and 7). After scanning SA D i , we store pBW T i on disk and discard SA D i .
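The per-range appending of runs to pBW T i can be sketched as follows (a simplified sketch; we represent the empty symbol as the character '*', and the caller is assumed to supply the left symbols and the length of each SA D i range):

```python
EMPTY = '*'  # placeholder for an empty entry, outside the alphabet

def append_run(pbwt, left_symbols, length):
    """Append one run to the run-length encoded pBWT: if every suffix
    in the current SA_D range is preceded by the same symbol s, the
    run is (s, length); otherwise it is the empty run (*, length)."""
    s = left_symbols[0] if len(set(left_symbols)) == 1 else EMPTY
    if pbwt and pbwt[-1][0] == s:          # extend the previous run
        pbwt[-1] = (s, pbwt[-1][1] + length)
    else:
        pbwt.append((s, length))
    return pbwt
```

Merging consecutive runs with equal symbols keeps pBW T i in strict run-length form.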

Grammar Compression and Next Parsing Round
Once we finish constructing pBW T i , the next step in the parsing round i is to store D i in a compact form to use it later during the induction phase of grlBWT. We first explain why we need D i during the induction phase and then describe the format we choose to encode it.
Broadly speaking, the induction process consists of scanning BW T i+1 from left to right, mapping every symbol BW T i+1 [j] ∈ Σ i+1 back to the phrase F ∈ D i from which it originated, and then checking which of the proper suffixes of F produced empty entries in pBW T i (see Lemmas 6 and 7). Assume the suffix F [u..] = S ∈ S produced an empty entry; then, we append F [u − 1] to the BWT range associated with S (see Lemma 5).
The process described above requires D i and a mechanism to map the left-maximal suffixes in D i back to the empty entries they produce in pBW T i . We solve the problem by encoding D i with a representation that is similar to grammar compression (Section 2.2).
We start by discarding N i and the rank 1 data structure, as they are no longer necessary (see the current encoding of D i in Section 3.3). For our method to work, we also need each string S ∈ S associated with an empty entry (*, l) of pBW T i to be a member of D i . This property might not hold when the string S that produced an empty entry meets Lemma 6. The problem arises if S always appears as a proper suffix in D i , not as a full phrase. If that is the case, we create 3 a new independent entry for S in D i .
After expanding D i , we create a hash table M i in which we insert each phrase F ∈ D i occurring as a left-maximal suffix. If F has ≺LMS rank b in D i , then we insert the pair (F, b) into M i , where F is the key and b is the value. Once we construct M i , we reorder the way in which the phrases of D i are concatenated in R i according to their ≺LMS ranks.
The next step consists of compressing D i . We scan R i from left to right, and for each phrase F = R i [j, j ′ ] we look up its longest proper suffix S = F [u..] in M i ; if the lookup succeeds and returns a value b ′ , we replace the sequence of F with the pair F [u − 1]·b ′ . After updating the sequence of F , we mark the symbols in R[j + 2, j ′ ] as discarded if j ′ − j + 1 = |F | ≥ 2. When we finish the scan of the dictionary, we left-contract R i by removing the discarded symbols. This process reduces the phrases in D i to strings of length two, so the vector L i is no longer necessary.

Now we explain the rationale of our encoding. We develop our argument as a chain of implications. Consider again the phrase F , which we replaced with the sequence F [u − 1]·b ′ . We obtained b ′ ∈ Σ i+1 when we performed a lookup operation of S = F [u..] in M i during the compression of D i . The value b ′ that the lookup returned is the ≺LMS order of S in D i . The membership of S to the keys of M i implies that S appears as a left-maximal suffix in D i , which in turn implies that S is a full phrase in D i too (we enforced this property when we expanded the dictionary). Additionally, the left-maximal condition of S implies that there were at least two suffix occurrences of S preceded by different symbols. This is why S produces an empty entry in pBW T i . Now, recall that we sorted the phrases of D i in ≺LMS order in R i . Therefore, if we want to access S, we have to go to the substring R i [2b ′ − 1, 2b ′ ]. This substring does not encode the full sequence of S, but its longest left-maximal suffix R i [2b ′ ] ∈ Σ i+1 (which is also a left-maximal suffix of F ) along with the left-context symbol for that suffix (R i [2b ′ − 1] ∈ Σ i ). Recursively, the longest left-maximal suffix of S is not a sequence either, but a pointer to another position of R i . We access this nested left-maximal suffix by setting b ′ = R i [2b ′ ] and updating the values R i [2b ′ − 1, 2b ′ ]. We continue applying this idea until we reach a range whose right symbol belongs to Σ i , which implies that we reached the last suffix of F . This last range will store the sequence *·F [|F | − 1], where * is a dummy symbol.
Notice that the right symbol in this case is not the last symbol of F but its left context F [|F | − 1]. This is because the LMS substrings overlap by one character in T i , so F [|F |] is redundant as it also appears as a prefix in another phrase. However, we need F [|F | − 1], as we will append it to one of the empty entries of pBW T i in the next phase of grlBWT. The only exception to this rule is when F expands to a suffix of a string in T . In that case, we store F [|F |] instead, as it is not a prefix in any other phrase. On the other hand, we need the dummy symbol * to maintain the invariant that all the phrases encoded in R i have length two. Once we finish the compression, we store R i on disk. From now on, we use D i to refer to R i .

The final step for the parsing round i is to create the new text T i+1 . We first reload from disk the hash table H i we produced at the beginning of the iteration, and replace its values with the keys' ≺LMS orders. More specifically, if a key F ∈ D i of H i has ≺LMS order b among the other strings that produced empty entries in pBW T i , then we update the value of F in H i to b. Notice that the strings we stored as keys in H i are not the same as those we have now in D i because we compressed them. Therefore, we cannot look up the phrases of D i in the keys of H i to update the hash table values. Still, we can overcome this problem if we modify H i after sorting D i in ≺LMS order but before we compress it.
Once we update H i , we construct T i+1 by scanning T i again and replacing the LMS substrings with their associated values in H i . If T i+1 has length k (the number of strings in T ), then we stop the parsing phase as all the strings in T are now compressed to one symbol. An example of the parsing step is depicted in Figure 1.

The Induction Phase
The induction phase starts with the computation of BW T h , the BCR BWT for the text T h of the last parsing round h. This step is trivial as each symbol in T h encodes a full string of T (see the ending condition of the parsing phase). Hence, the left context of every symbol is the symbol itself, and BCR BWT maintains the relative order of the strings in T .

We now describe the steps we perform during every induction step i < h. In this case, assume we receive as input (i) BW T i+1 from the previous phase, (ii) D i , and (iii) pBW T i . Before explaining our procedure, we describe some important properties of BW T i+1 . If we generalize Lemma 9 to x ≥ 1 occurrences of S, then we can compute the sequence for the empty entry of pBW T i generated by S when S always appears as a proper suffix in the phrases of D i . When S matches a full phrase (see Lemma 7), there is no left-context symbol for S we can extract from D i . Nevertheless, there is only one phrase F ∈ D i where S can be a non-proper suffix, and because S comes from BW T i+1 , F has to be an LMS substring in T i . This observation implies that F maps to a symbol b ∈ Σ i+1 in T i+1 . Hence, we can extract the left-context symbols of S from the range in BW T i+1 that corresponds to the bth bucket of SA i+1 . We explain how to carry out this process in the next subsection.

The Induction Algorithm
Let p be the sum of the lengths of the empty entries of pBW T i . These lengths correspond to the second field in the run-length representation of pBW T i . We start the induction by creating a vector P i [1, p], which we logically divide into σ i+1 buckets (recall that σ i+1 matches the number of empty entries in pBW T i ). Every bucket b of P i will be of size l b , the length of the bth empty entry (from left to right) of pBW T i . Subsequently, we perform a scan over BW T i+1 from left to right. For each symbol BW T i+1 [j] = b ∈ Σ i+1 , we first check if its associated LMS substring F ∈ D i (the string from which we obtained the symbol b during the parsing round i) exists as a suffix in other phrases of D i . This information is already encoded in a bit vector V i [1, σ i+1 ] we constructed during the parsing round i. When F occurs as an LMS substring and as a proper suffix in other dictionary phrases (V i [b] = 1), we append a dummy symbol in the bucket b of P i . This is the situation we described at the end of the previous subsection. After processing b, we decompress the left-maximal suffixes of its phrase F from the compressed representation of D i .
The decompression of F begins by accessing the range D i [2b − 1, 2b] (see Section 3.3.2). If o = D i [2b] belongs to Σ i+1 , then the symbol o encodes a string F [u..] = S, with u > 1, whose sequence is a left-maximal suffix in D i . During the parsing phase of grlBWT, we inserted S into D i as an independent string as it yields an empty entry for pBW T i (see Lemma 6). The decompression of F stops when the right symbol of the current range belongs to Σ i , which means we reached the last symbol of F . For the moment, we do not know for which phrase of D i this symbol is the left context. Hence, we set BW T i+1 [j] to that symbol and leave this position on hold to process it later. After finishing the scan of BW T i+1 , its symbols are now over the alphabet Σ i . These values are the ones we have to insert in the dummy positions of P i . Notice that the entries in P i and BW T i+1 are already sorted by their right contexts in T i . Hence, the completion of the dummy positions reduces to a merge of two sorted lists.
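The pointer-chasing decompression can be sketched as follows (a simplified, 0-based sketch: we pass an explicit predicate to distinguish pointers from plain symbols instead of testing alphabet membership, and we gloss over the dummy symbol and border cases):

```python
def decompress_chain(R, b, is_pointer):
    """Follow the chain of left-maximal suffixes of phrase b in the
    compressed dictionary.  Phrase b occupies the pair R[2b] (its
    left-context symbol) and R[2b+1] (a pointer to the next nested
    suffix when `is_pointer` says so, a plain symbol otherwise).
    Returns the left-context symbols met along the chain plus the
    final plain symbol."""
    out = []
    while True:
        left, nxt = R[2 * b], R[2 * b + 1]
        out.append(left)
        if not is_pointer(nxt):
            out.append(nxt)   # last suffix of the phrase reached
            return out
        b = nxt               # descend to the nested suffix
```

In the toy dictionary below, integers play the role of pointers (symbols of Σ i+1) and characters the role of plain symbols (Σ i), so one phrase chains into another before terminating.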
The last step in the induction round i consists of merging pBW T i , BW T i+1 and P i into BW T i . We scan pBW T i and append its entries to BW T i as long as they are not empty. Then, when we reach an empty entry (*, l), we proceed as follows: assume the current pair (*, l) is the bth empty entry of pBW T i . We check if the phrase F ∈ D i that produced this entry (the one with ≺LMS order b) only occurs as a full LMS substring in T i (V i [b] = 0). If that is the case, we append the next l symbols of BW T i+1 into BW T i . On the other hand, when F occurs as an LMS substring, but also as a proper suffix in other phrases of D i (V i [b] = 1), the next l symbols of BW T i are a mix of entries from the bucket b of P i and BW T i+1 . We append symbols from the bucket b of P i as long as they are not dummy. When we reach a dummy symbol in P i , we change the list and append the next x symbols of BW T i+1 into BW T i , where x is the number of consecutive dummy symbols we saw in P i .
Once we process the x entries of BW T i+1 , we go back to P i and continue back and forth between P i and BW T i+1 until we process all the symbols in the bucket b of P i .
The last case we have to cover for the merge is when F always occurs as a proper suffix in D i (i.e., it is not an LMS substring of T i ). This situation is simple as we marked F in V i (V i [b] = 1). Hence, we just copy the content of the bucket b of P i into BW T i . Notice this bucket will not have dummy entries as b does not appear as a symbol in BW T i+1 . We obtain occurrences of b while decompressing other phrases of D i whose ≺LMS ranks do appear as symbols in BW T i+1 , and where F is a proper left-maximal suffix. Once we complete all the induction rounds, the final BCR BWT is in BW T 1 .
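The merge just described can be sketched as follows (a simplified sketch over plain symbol lists rather than run-length encoded ones; None plays the role of the dummy symbol in the buckets of P i ):

```python
def merge(pbwt, bwt_next, P, V):
    """Merge step of induction round i.  `pbwt` is a list of runs
    (s, l), with s == '*' for the b-th empty entry (in order);
    `bwt_next` holds the symbols of BWT_{i+1}; P[b] is the bucket of
    the b-th empty entry, with None marking dummy slots; V[b] == 0
    means the phrase only occurs as a full LMS substring."""
    out = []
    nxt = iter(bwt_next)
    b = 0                              # index of the current empty entry
    for s, l in pbwt:
        if s != '*':                   # non-empty run: copy it verbatim
            out.extend([s] * l)
            continue
        if V[b] == 0:                  # take l symbols from BWT_{i+1}
            out.extend(next(nxt) for _ in range(l))
        else:                          # interleave P[b] with BWT_{i+1}
            for sym in P[b]:
                out.append(next(nxt) if sym is None else sym)
        b += 1
    return out
```

Each dummy slot of a bucket pulls exactly one symbol from BW T i+1 , which reproduces the back-and-forth alternation between the two sources.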

Speeding up the Induction with Compression
If we run-length encode BW T i+1 , the induction becomes more efficient. Every position of BW T i+1 is then not a single symbol b, but a pair (b, l) that represents l consecutive copies of b. Thus, instead of decompressing l times the left-maximal suffixes of the phrase associated with b, we decompress them only once, and copy the result l times to the different buckets of P i .
Maintaining P i as a run-length encoded sequence also improves efficiency. A compact representation of P i reduces the working memory, which in turn reduces the number of cache misses. The only problem with this idea is that we do not know before the induction how many equal-symbol runs P i will have. There are two solutions to this problem. First, we could represent P i as a dynamic vector [2]. Its initial size would be σ i+1 , and we would then expand the buckets as we insert new equal-symbol runs into them. The second option is to perform a preliminary scan of BW T i+1 to compute the size of the run-length compressed version of P i , and then scan BW T i+1 again to perform the induction.
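A minimal run-length encoder for this purpose (a generic sketch, not grlBWT's actual on-disk format):

```python
def run_length_encode(seq):
    """Run-length encode a sequence as (symbol, length) pairs; keeping
    BWT_{i+1} in this form lets each phrase's left-maximal suffixes be
    decompressed once per run rather than once per symbol."""
    runs = []
    for s in seq:
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((s, 1))              # start a new run
    return runs
```

The more repetitive the text, the fewer runs BW T i+1 has, and the larger the savings during the induction.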

Complexity of our Method
We show that the construction of the BWT remains linear, even though we perform compression during the intermediate steps.

Proof. The complexities in the theorem were already proved for the linear-time algorithms that construct the suffix array [30] and the BWT [34] using ISS. We show that these complexities are not altered by our compression scheme. Let n i = |T i | be the length of the input text we receive at parsing round i. Hashing the dictionary phrases from T i runs in O(n i ) expected time and requires O(n i ) bits of space. The construction of SA D i runs in O(n i ) time and space as we use ISS to build it, and the number of symbols in D i is never greater than n i . The extra steps of the parsing round only require a constant number of linear scans over SA D i . During the induction phase, we only perform linear scans over BW T i and pBW T i . We still have the cost of accessing the left-maximal suffixes of D i when we scan BW T i+1 during the induction phase. However, our simple compressed representation for D i (Section 3.3.2) supports random access to the symbols in O(1) time, and the number of left-maximal suffixes we visit during the scan of BW T i+1 is no more than n i . In every parsing round, the size of T i+1 is at most half the size of T i , so the sum of the text lengths n 1 , n 2 , . . . , n h is O(n) (see Nong et al. [30]). This property also implies that grlBWT never visits more than O(n) left-maximal suffixes during its induction phase.

Table 1 Datasets. The upper rows are the Illumina reads while the lower rows are the human genomes. Columns four and five are the minimum and average string length (respectively) in the collection. The value for r is the number of equal-symbol runs in the BCR BWT of the collection.

Experiments
We implemented grlBWT as a C++ tool, also called grlBWT. This software uses the SDSL-lite library [11] to operate with bit vectors and rank data structures. Our source code is available at https://github.com/ddiazdom/grlBWT. We compared the performance of grlBWT against other tools that compute BWTs for string collections, among them ropebwt2 4 , a variation of the original BCR algorithm of Bauer et al. [1] that uses rope data structures [2]; this method is described by Heng Li [20]. We also considered the tool bwt-lcp-em [3] for the experiments. Still, by default it builds both the BWT and the LCP array, and there is no option to turn off the LCP array, so we discarded it. We compiled all the tools according to their authors' descriptions. For grlBWT, we used the compiler flags -O3 -msse4.2 -funroll-loops.
We considered two common types of genomic data for the experiments: short reads and assembled genomes. We downloaded five collections of Illumina reads produced from different human genomes. We concatenated the strings so that our dataset 1 had one read collection, dataset 2 had two collections, and so on. We named the files ILLX, where X is the number of read collections concatenated. We also downloaded human genome assemblies from NCBI and grouped them for our experiments so that every dataset has five more genomes than the previous one. This setup aims to increase the repetitiveness as the collection size increases. We named each file using the prefix HGA concatenated with the number of genomes it had. The only preprocessing step we performed on the genomes was to put every chromosome in one line and set all the characters to upper case. All our inputs are described in Table 1.
We also investigated the effect of the page cache [24, Ch. 16] on grlBWT. In every parsing round i, we keep T_i on disk and scan its file linearly, loading one 8 MB data chunk from disk to RAM at a time. Similarly, we keep an 8 MB buffer in RAM for T_{i+1}, which we dump to disk every time it gets full. We manipulate BWT_i (respectively, BWT_{i+1}) in the same way. We used the function posix_fadvise to turn off the page cache for T_i, T_{i+1}, BWT_i, and BWT_{i+1}. Then we assessed the performance of grlBWT on the assembled genomes with and without posix_fadvise. We did not evaluate the effect of the page cache on the other tools.
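The chunked scan described above can be sketched as follows. This is a minimal illustration, not grlBWT's actual I/O layer: `make_file` and `scan_sum` are demo names, the byte sum stands in for the real work on T_i, and error handling is reduced to early returns. The key call is posix_fadvise with POSIX_FADV_DONTNEED, which asks the kernel to drop the pages of each chunk once we have processed it, so the linear scan does not keep the file resident in the page cache.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <vector>
#include <cstdint>

static constexpr size_t CHUNK = 8UL * 1024 * 1024;  // 8 MB chunk, as in the text

// Demo helper (not part of grlBWT): create a file of n bytes, all set to 0x01.
bool make_file(const char* path, size_t n) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    std::vector<unsigned char> ones(n, 1);
    bool ok = write(fd, ones.data(), n) == (ssize_t)n;
    close(fd);
    return ok;
}

// Linear scan in CHUNK-sized blocks. After processing each chunk, we tell the
// kernel we will not touch those pages again, so they can leave the cache.
uint64_t scan_sum(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    std::vector<unsigned char> buf(CHUNK);
    uint64_t sum = 0;
    off_t off = 0;
    ssize_t got;
    while ((got = read(fd, buf.data(), buf.size())) > 0) {
        for (ssize_t i = 0; i < got; i++) sum += buf[i];  // "process" the chunk
        posix_fadvise(fd, off, got, POSIX_FADV_DONTNEED); // evict processed pages
        off += got;
    }
    close(fd);
    return sum;
}
```

Note that POSIX_FADV_DONTNEED is only advice: the kernel may keep the pages anyway, which is consistent with the page cache measurements being a best-effort comparison.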
We limited the RAM usage of egap to three times the input size. For BCR_LCP_GSA, we turned off the construction of the data structures other than the BCR BWT and left the memory parameters at their defaults. In the case of gsufsort, we used the flag -bwt to build only the BWT. For ropebwt2, we set the flag -L to indicate that the data was in one-sequence-per-line format, and the flag -R to avoid considering the DNA reverse strands in the BWT. We ran the experiments on the Illumina reads using one thread in all programs, as not all of them support multi-threading. For this purpose, we set the extra flag -P in ropebwt2 to indicate single-thread execution. We tested the human genomes only on ropebwt2, grlBWT, and pfp-ebwt. By default, ropebwt2 uses four working threads, so we set the same number of threads for grlBWT and pfp-ebwt. We did not report results for pfp-ebwt with dataset ILL25 as the execution crashed. We carried out the experiments on a machine with Debian 4.9, 736 GB of RAM, and an Intel(R) Xeon(R) Silver processor @ 2.10GHz with 32 cores.

Results and Discussion
We summarize our experiments in Figures 2 and 3. The results we report for grlBWT do not consider the use of posix_fadvise to turn off the page cache. On the Illumina reads, the fastest method was ropebwt2, with a mean elapsed time of 4.14 hours. It is followed by BCR_LCP_GSA, gsufsort, grlBWT, pfp-ebwt, and egap, with mean elapsed times of 9.58, 9.43, 10.05, 13.08, and 27.30 hours, respectively (Figure 2B). We notice that grlBWT is competitive with BCR_LCP_GSA and gsufsort, and even gets slightly faster than them from input ILL4 onward. We expected this behaviour since the largest datasets are more repetitive.
Regarding the working space, the most efficient was BCR_LCP_GSA, with an average memory peak of 5.73 GB. It is then followed by ropebwt2, with an average memory peak of 26.64 GB. In both cases, the memory consumption increases slowly with the input size.
In the case of grlBWT, the memory peak is more considerable: 42.20 GB on average, with a memory consumption that grows faster than that of the previous methods (see Figure 2A). However, egap, gsufsort, and pfp-ebwt are far more expensive, and their memory consumption grows even faster. The tool egap uses 110.94 GB on average, while pfp-ebwt and gsufsort have similar average memory peaks: 331.98 and 372.68 GB, respectively.

On the repetitive datasets (human genomes), the results changed drastically (see Figure 3). Our tool grlBWT outperformed ropebwt2 and pfp-ebwt in elapsed time, with an average of 4.89 hours versus the 20.95 and 9.55 hours of ropebwt2 and pfp-ebwt, respectively. As expected, the time for grlBWT grows smoothly with the input size, while the time for ropebwt2 grows fast. The time function for pfp-ebwt also grows smoothly, but its results are still slower than those of grlBWT (see Figure 3B). Regarding memory peak, ropebwt2 is the most efficient tool, with a mean of 18.05 GB. Still, grlBWT obtained competitive results, with an average peak of 20.38 GB. In this case, the memory consumption growth of grlBWT is slightly steeper than that of ropebwt2, but it remains smooth. In contrast, pfp-ebwt has a more dramatic growth in memory consumption, with an average memory peak of 156.74 GB (Figure 3A).
We observe that ropebwt2 and pfp-ebwt performed well in one measure, but not in both. In contrast, grlBWT maintained a low footprint in both measures, elapsed time and memory consumption. This result demonstrates that our strategy of keeping the intermediate data of the BWT algorithm in compressed format works well when the text is repetitive.
Our experiments on the page cache showed an average slowdown of 19% in grlBWT when the cache is disabled with the function posix_fadvise. This slowdown factor increases with the input size, being the lowest with HGA05 (12%) and the highest with HGA20 (26%). This result is expected, as we are only using a static buffer of 8 MB. A simple solution would be to set a dynamic buffer that uses, say, 0.5% of the input size instead of the fixed 8 MB.
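The dynamic sizing rule suggested above is a one-liner. This is a hypothetical sketch (the function name `buffer_size` and the lower bound are our own choices, not part of grlBWT): take 0.5% of the input size, but never go below the current 8 MB default.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical buffer sizing: 0.5% of the input (1/200), floored at the
// fixed 8 MB value that grlBWT currently uses.
size_t buffer_size(size_t input_bytes) {
    constexpr size_t MIN_BUF = 8UL * 1024 * 1024;  // current static 8 MB buffer
    return std::max(MIN_BUF, input_bytes / 200);   // 0.5% of the input size
}
```

For a 75 GB input (the largest dataset in the paper), this rule would yield a buffer of about 384 MB instead of 8 MB, amortizing the cost of the extra disk accesses when the page cache is disabled.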

Concluding Remarks
We introduced a method for building the BCR BWT that maintains the data of intermediate stages in compressed form. The representation we chose not only reduces space usage but also reduces computation time. Our experimental results showed that our algorithm is competitive with the state-of-the-art tools in scenarios with little repetitiveness, and that it greatly reduces the computational requirements when the input becomes more repetitive, standing out as the most efficient tool to date (and to our knowledge) in this context.