Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

Summary
We present Raptor, a system for approximately searching many queries, such as next-generation sequencing reads or transcripts, in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure, and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.


INTRODUCTION
The recent improvements of full genome sequencing technologies, commonly subsumed under the term NGS (next-generation sequencing), have tremendously increased the sequencing throughput. Within 10 years, it rose from 21 billion base pairs (Venter et al., 2001; International Human Genome Sequencing Consortium, 2001) collected over months to about 400 billion base pairs per day (the current throughput of Illumina's HiSeq 4000). The cost of producing one million base pairs also dropped from many thousands of dollars to a few cents. As a result of this dramatic development, the number of new data submissions, generated by various biotechnological protocols (ChIP-Seq, RNA-Seq, etc.), to genomic databases has grown dramatically and is expected to continue to increase faster than the cost and capacity of storage devices can keep up. Ongoing projects like the 100,000 Genomes Project (Caulfield et al., 2019) or the American 1,000,000 Genome Project (All of Us (NIH), 2020) will easily produce data in the range of several petabases. This growth not only challenges the storage infrastructures and processing pipelines of public databases because of the sheer data throughput, but also challenges algorithm engineers to improve the efficiency of sequence analysis pipelines and develop new strategies for compression, data parallelism, and concurrent computing.
The main task in analyzing NGS data is to search for sequencing reads or short sequence patterns (e.g., read mapping and variant calling) or to analyze expression profiles in large collections of sequences (i.e., a database). Searching the entirety of such databases is usually only possible by searching the metadata or a set of results initially obtained from the experiment. Searching (approximately) for a specific genomic sequence in all the data has not been possible in reasonable computational time. The demand for solutions can be seen in the various attempts toward enabling sequence searches on large databases (see (Marchet et al., 2019) for an overview). While the NIH SRA provides a sequence search functionality, the search is restricted to a limited number of experiments. Full-text indexing data structures, such as the FM-index, are currently unable to mine data of this scale. Word-based indices, such as those used by internet search engines, are not directly appropriate for edit-distance-based biological sequence searches (Bingmann et al., 2019). The sequence-specific solution Ca-BLAST (Berger and Peng, 2013) and its variants require an index of known genomes, genes, or proteins, and thus cannot search for novel phenomena in raw sequencing files. This also holds true in the field of mapping-based metagenomic binning and quantitation, where the relevant microbial databases grow about as fast as the sequence archives. The NCBI RefSeq database of prokaryotic genomes contains about 30 GiB of sequence, still small enough to build an FM-index for the genomes, which takes about 24 hr and about 50 GiB of memory (Dadi et al., 2018). However, including the draft genomes into the

Related work
The problem of approximately searching queries in ultra-large databases has recently been addressed by several groups, focusing on different applications, but all using methods based on the k-mer content of the databases. In the field of alignment-free metagenomic analysis, which focuses on k-mer based analysis, the size of the data is also slowly becoming prohibitive. For example, Kraken (Wood and Salzberg, 2014) needs ≈147 GiB of RAM for indexing 380 Gbases. For analyzing RNA-Seq data, some groups aimed at searching the raw files directly for a set of transcripts ((Solomon and Kingsford, 2016) and shortly afterward (Sun et al., 2017)). They propose novel solutions to the problem of searching for a transcript of interest in all relevant RNA-Seq experiments. Up until recently, these searches were only based on the sequences themselves; the tool REINDEER (Marchet et al., 2020) is the first approach to also account for sequence abundances.
As a benchmark, all three publications use a dataset of 2,652 RNA-Seq sequencing runs of human blood, breast, and brain tissue (a total of ≈6.5 TiB) in which they search for 214,293 known transcripts. For a single query, their methods need in the range of 2–20 minutes, which is a tremendous improvement and a speedup of orders of magnitude compared to previous methods. Although a breakthrough, the methods presented by the groups need 4 and 0.3–2 days, respectively, for processing the above set of 214,293 queries. Very recently, this time was improved by the Patro group with the tool Mantis (Pandey et al., 2018) to 82 min. Moreover, Bradley et al. (Bradley et al., 2019) propose a Bloom filter based solution that can index about 170 TiB of (repetitive) raw sequence into an index of 1.5 TiB. However, searching, for example, 220 MiB of plasmid sequence takes 11 days using 8 cores. The same group followed up with a newer approach called COBS (Bingmann et al., 2019). Finally, the construction time of an index built on top of 170 TiB of input data was further improved by the tool RAMBO (Gupta et al., 2019), which only needs 14 hr on a cluster of 100 nodes.
Taken together, all approaches are still very demanding in terms of memory consumption and run time.

Our contribution
In this paper, we propose a data structure, called binning directory, that can distribute approximate search queries based on an extension of the recently introduced Interleaved Bloom Filters (IBFs) (Dadi et al., 2018). A binning directory combines a so-called x-partitioned IBF (x-PIBF) with winnowing minimizers and probabilistic thresholding that takes into account the varying number of minimizers in each read. We present our implementation, called Raptor, discuss its capabilities and limitations, and compare it with the state-of-the-art methods Mantis and COBS. Our comparison shows that Raptor is up to two orders of magnitude faster than the competitors in answering approximate string searches with full sensitivity and a very good specificity. In addition, we only use a fraction of the memory and can further trade run time for main memory consumption.

General method
Raptor stores a representative transformation of the k-mer content of the database that is divided up into a number of bins, typically a few hundred to a few thousand (see STAR Methods section for details). The term representative indicates that the k-mer content could be transformed by a function which reduces its size and distribution (for example, using winnowing minimizers on the text and its reverse complement, or using gapped k-mers). In this work, we use ungapped (w,k)-minimizers for computing representative k-mers (see also Chikhi et al. (2016)). A (w,k)-minimizer is essentially the lexicographically minimal k-mer of all k-mers and their reverse complements in a window of size w. The same transformation is applied to the k-mers of the query (see Figure 1 for an example and details). Raptor uses a set membership data structure, the x-PIBF, to retrieve binning bitvectors indicating whether a representative k-mer is in a bin or not. It then combines the binning bitvectors of all representative k-mers in a query into a counting vector and applies a thresholding step to determine the membership of a query in a bin. The process is illustrated in Figures 2-4.
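The minimizer computation described above can be sketched as follows. This is a minimal illustration under the definition given in the text, not Raptor's actual implementation (which operates on 2-bit-encoded k-mers for efficiency):

```python
from typing import List

COMP = str.maketrans("ACGT", "TGCA")

def canonical_kmers(seq: str, k: int) -> List[str]:
    """All k-mers of seq, each replaced by the lexicographic minimum
    of itself and its reverse complement."""
    rc = seq.translate(COMP)[::-1]
    n = len(seq)
    return [min(seq[i:i + k], rc[n - i - k:n - i]) for i in range(n - k + 1)]

def minimizers(seq: str, w: int, k: int) -> set:
    """(w,k)-minimizers: the lexicographically smallest canonical k-mer in
    every window of w bases, i.e., of w - k + 1 consecutive k-mers."""
    kmers = canonical_kmers(seq, k)
    span = w - k + 1  # number of k-mers per window
    return {min(kmers[i:i + span]) for i in range(len(kmers) - span + 1)}
```

For w = k, every canonical k-mer is its own minimizer, which corresponds to the (19,19) setting used in the experiments below.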

Evaluation
In the following, we report our computational experiments for Raptor. First, we will use an artificial dataset to discuss the limitations of binning directories, the impact of compression, the impact of the use of (w,k)-minimizers for different window sizes, and the time/space trade-off when using different partition sizes. We will also compare different binning directories with Mantis and COBS using this dataset.
Secondly, we will evaluate Raptor using a real data set used by several groups to determine the membership of transcripts in RNA-Seq files (Pandey et al., 2018) and compare Raptor with Mantis (Pandey et al., 2018) and COBS (Bingmann et al., 2019).
Lastly, we evaluated the use of Raptor in conjunction with the distributed read-mapper DREAM-Yara (Dadi et al., 2018).
All experiments were conducted on a Dell PowerEdge T640 with an Intel Xeon Gold 6248 CPU using 32 threads and 1 TiB of main memory. All file I/O for the artificial dataset was performed to and from a memory-mapped file system (/dev/shm) to eliminate I/O effects, i.e., disk caching, from the measurements. Unless otherwise stated, all other I/O was performed to and from hard disk drives.

Datasets
We created a random DNA sequence of 4 GiB size and divided it into b bins, which would correspond to b different genomes in, e.g., a metagenomic data set. Using the Mason genome variator (Holtgrewe, 2010), we then generated 16 similar genomes in each bin which differ from each other by about 1% on average. This could be seen as bins containing the genomes of highly homologous species. The total sequence length is hence 64 GiB, containing, however, b groups of highly similar sequences. Finally, we uniformly sampled a total of 2^20 reads of length 100 bp from the genomes and introduced 2 errors in each read to simulate a sequencing experiment. This artificial dataset is very balanced, which is the ideal case for the x-PIBF, since its overall size depends on the largest bin. We will discuss this issue in the discussion section.
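The read sampling described above can be sketched as follows. `sample_reads` is a hypothetical helper for illustration (the genome variation itself was done with Mason); it plants exactly `errors` substitutions per read:

```python
import random

def sample_reads(genomes, n_reads, read_len=100, errors=2, seed=42):
    """Uniformly sample reads from a list of genome strings and plant a
    fixed number of substitution errors in each, mimicking the simulated
    sequencing experiment described in the text."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        g = rng.choice(genomes)
        start = rng.randrange(len(g) - read_len + 1)
        read = list(g[start:start + read_len])
        for pos in rng.sample(range(read_len), errors):  # distinct positions
            read[pos] = rng.choice([b for b in "ACGT" if b != read[pos]])
        reads.append("".join(read))
    return reads
```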
On this data set, we use (19,19)-minimizers and (23,19)-minimizers in conjunction with thresholds derived from the k-mer lemma or our probabilistic thresholding for determining which bin contains the query. The value k = 19 was chosen to make random occurrences seldom (see STAR Methods for a detailed discussion).
In order to evaluate our method on real data, we took the dataset from (Pandey et al., 2018), which consists of 2,568 RNA-Seq experiments. Similarly to (Pandey et al., 2018), we excluded all experiments that have an average read length below 50 bp, because reads shorter than that are rarely relevant in practice. Furthermore, this allowed us to test the minimizer approach with a broader window size. This left us with 1,742 RNA-Seq experiments, which have a size of around 6 TiB (gzipped FASTQ files). All tools were tested on this dataset using k = 20, a value used in the competitors' publications which also results in an effective text ratio near 1 (see Table 8).
For the evaluation of DREAM-Yara, we downloaded the NCBI RefSeq for both archaea and bacteria as of February 14th 2021 and applied TaxSBP to create a taxonomic clustering of the dataset into 64 and 1024 bins. We split the RefSeq according to the clustering into 64 and 1024 gzipped FASTA files containing the sequence information of the respective clusters. Similarly to the artificial dataset experiment, we sampled reads from the genomes.

Figure 1. Examples of (w,k)-minimizers
(A) and (B) show the same k for different window sizes, the former with a window size w = k and the latter with a larger window of 8. Note that the reverse complemented sequences, shown in lower case, have to be read from right to left. The window width is indicated by a dash and the minimizing k-mer is placed within the window. Subsequent windows often share the same minimizer, which we illustrate by showing those as well, although they are only stored once.

OPEN ACCESS
iScience 24, 102782, July 23, 2021

Speed and space consumption of Raptor with (w,k)-minimizers
We start by investigating the false positive (FP) count for different IBF sizes, which in turn affects the size of the individual Bloom filters in the IBF. A Bloom filter has an FP rate (FPR) that depends on the ratio of stored elements to its size. For a fixed number of stored elements, the less space we allocate, the more FPs will occur. In our experiment, this will lead to overcounting k-mers and hence to FP assignments of reads to bins. While one can easily compute the FPR for each individual Bloom filter (i.e., bin) of an IBF, it is harder to evaluate when a high Bloom filter FPR results in an FP assignment of a read to a bin.
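The dependence of the FPR on the ratio of stored elements to filter size follows the standard Bloom filter formula p = (1 − e^(−hn/m))^h. The following sketch (an illustration, not Raptor's code) computes the per-bin FPR of an IBF under the assumption that each of the b bins owns an equal share of the total bits:

```python
from math import exp

def bloom_fpr(m_bits: int, n_elems: int, h: int = 2) -> float:
    """Standard Bloom filter false positive rate for m bits, n inserted
    elements, and h hash functions: (1 - e^(-h*n/m))^h."""
    return (1.0 - exp(-h * n_elems / m_bits)) ** h

def ibf_bin_fpr(total_bits: int, bins: int, n_per_bin: int, h: int = 2) -> float:
    """In an interleaved Bloom filter, every bin owns total_bits / bins bits,
    so the per-bin FPR is that of a Bloom filter of this reduced size."""
    return bloom_fpr(total_bits // bins, n_per_bin, h)
```

Doubling the total IBF size halves the load of each bin's Bloom filter and thus lowers the per-bin FPR, matching the trend reported in Table 1.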
To evaluate this effect, we allocated IBFs of 1, 2, 4, and 8 GiB size and report the used RAM, construction time, search time, and FP bin assignments for b = 64 and b = 1,024 bins, for uncompressed and compressed vectors, using h = 2 hash functions. Also, we use (19,19)-minimizers and the traditional k-mer counting lemma threshold in one experiment and (23,19)-minimizers in conjunction with a new probabilistic threshold in a second experiment. The results are shown in Table 1.
Our experiments show that allocating only 1 GiB for an IBF using (19,19)-minimizers results in a high number of FPs for all values of b: we would have to conduct about 6·10^5 and 9.6·10^6 wrong verifications for the IBF for b = 64 and b = 1,024, respectively. For (23,19)-minimizers, the numbers are over one order of magnitude smaller (2.8·10^4 and 4.4·10^5). This is to be expected, since we store a smaller set of representative k-mers. By doubling the size of the IBF, the number of FPs is already heavily reduced for (19,19)-minimizers. Indeed, there are no more FPs caused by the Bloom filter. Note that the 189 FPs for b = 64 are reads whose minimizer composition actually occurs in a different bin than their original bin by chance. Since distributing the k-mers to more bins reduces the chance of the same minimizer composition being present in different bins, the FP count for b = 1,024 is 0. For (23,19)-minimizers, we can still see 197 − 189 = 8 FPs induced by the Bloom filter for b = 64 and 141 for b = 1,024. This indicates that the distribution of the (23,19)-minimizers is not completely uniform or that our probabilistic threshold for the counting lemma introduced a few FPs. In general, the effect of using minimizers on the FP rate is negligible. For larger IBF sizes, no FP searches induced by the Bloom filter occur for either minimizer set. The FP counts are obviously the same if we apply lossless compression to the bitvector.
Next, we look at the time and space usage for index construction. The construction time for b = 64 is between 11 and 13 min, and for b = 1,024 between 4 and 6 min. For 1,024 bins, the wall clock time is smaller, since we can easily parallelize the construction (in chunks of 64), which is not possible for 64 bins; moreover, for 64 bins, more data is inserted into each single bin. The space needed for construction is the size of the IBF plus thread-local storage for the input sequences. In order to compress the IBF, both the uncompressed and the compressed version must be in memory for a short amount of time, resulting in an increased memory peak.
For (23,19)-minimizers, the construction time is generally lower, since we insert fewer representative k-mers. While for b = 64 the times are comparable, the IBF can be built almost twice as fast for b = 1,024 compared to (19,19)-minimizers.
Now we discuss the time and space usage for the search. For b = 64, Raptor needs about 1 s to search for the 2^20 reads for all IBF sizes. This holds true for both minimizer sets. Although Raptor searches fewer representative k-mers in the case of (23,19)-minimizers, we need to compute the minimizers of the query beforehand, which is additional work. For b = 1,024, we need between 2.2 s for a 2 GiB IBF and 1.6 s for an 8 GiB IBF. The increase for larger b is to be expected, since we need to check all bits in the binning bitvector, which takes longer for larger binning bitvectors. For (23,19)-minimizers, we only see a slight increase and still need about 1 s. The benefit of querying fewer k-mers becomes pronounced, and this IBF is up to twice as fast as the IBF for (19,19)-minimizers.
When searching, it is also interesting how large the memory footprint is if we use compressed bitvectors. For b = 64 and (19,19)-minimizers, we see for the IBF that compression actually increases the memory footprint until we use an IBF of 8 GiB. This means that the bitvectors are not sparse and that the space overhead of the compression algorithm outweighs the benefit of compressing the data.
In addition, the search time increases by a factor of about 4 for b = 64 and about 8–11 for b = 1,024, which makes compression unattractive here.
This changes for (23,19)-minimizers. For b = 64, the search time increases from about 1 s to only 1.6 s, while we can compress the bitvector from 4.3 to 2.7 GiB or from 8.4 to 3.1 GiB. For b = 1,024, the compression is similar since we store the same number of k-mers, but the run time increases by a factor of 5–8. This is due to the need to decompress the 1,024-bit binning bitvectors, which takes longer than for the 64-bit bitvectors. Still, for (23,19)-minimizers and smaller b, using compression offers an attractive time/space trade-off. For querying, we can observe that, in general, a sparser bitvector returns the results faster.
Finally, we investigated the impact of partitioning the IBF into x = 1, 2, 4, 8 parts. Since Raptor cannot directly evaluate the counting vector for each read after having looked at one part, we need to store the intermediate results and check whether we match the threshold after having counted the k-mers in all parts of the partition. To do this, Raptor allocates a buffer vector of size 10^7, where each position holds a vector of b bits that is assigned to one of the reads. After counting the occurrences of the k-mers of a read in one partition, we can add the result to the vector and use the vector for the next batch of reads. We report the construction and search time, as well as the maximum memory allocated by the resulting x-PIBF, for b = 64 and b = 1,024 bins. We use an 8 GiB IBF in these experiments. The results are shown in Table 2.
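The partitioned query logic can be sketched as follows; `owns` and `lookup` are hypothetical stand-ins for the real x-PIBF machinery (a hash-based assignment of k-mers to parts and the interleaved binning bitvector lookup, respectively):

```python
def query_partitioned(parts, read_kmers, bins, threshold):
    """Sketch of an x-PIBF query: each part answers membership only for the
    k-mers it owns; per-bin counts are accumulated across all parts and the
    threshold is checked once at the end. `parts` is a list of
    (owns, lookup) callables; lookup returns an integer bitmask with one
    bit per bin (the binning bitvector)."""
    counts = [0] * bins
    for owns, lookup in parts:
        for kmer in read_kmers:
            if not owns(kmer):      # skip k-mers outside the current part
                continue
            bits = lookup(kmer)     # binning bitvector for this k-mer
            for b in range(bins):
                counts[b] += (bits >> b) & 1
    return [b for b in range(bins) if counts[b] >= threshold]
```

Only one part needs to reside in memory at a time, which is where the memory savings reported in Table 2 come from.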
In general, we observe for all minimizers that the construction and query times grow with the number of parts; for example, the search time increases from 1 s for a 1-PIBF to 4.3 s for an 8-PIBF. When using more parts, the run time increases, but the space needed to hold a single part in memory decreases. While we need 8 GiB of memory to use a 1-PIBF, we only need 1.4 GiB if we use an 8-PIBF. When using (23,19)-minimizers, we see similar trends for b = 64. Furthermore, as in the unpartitioned case, the search is faster. Indeed, for an 8-PIBF we need only 2.68 s for the query, and for a compressed 8-PIBF only 6.9 s, while using only 857 MiB of peak memory.
For b = 1,024 and (19,19)-minimizers, the search time for the IBF increases from 1.69 s to 11.7 s for an 8-PIBF. As before, compression is unattractive in this case, while it pays off for the (23,19)-minimizer version.
In general, the construction time of Raptor's index increases the more parts we create, since we have to stream over our input x times and store x parts on disk. However, this increase is sublinear in x: both constructing and querying an x-PIBF take less than x times the time of a 1-PIBF. The reason for this is that we do not have to access the bitvector for k-mers that are not in the current part.

Impact of probabilistic thresholding on false negatives
In this section, we show that our probabilistic thresholding is crucial in avoiding false negatives. When we use (19,19)-minimizers, the k-mer lemma ensures that we have no false negatives, but this is no longer true when using minimizers with w > k. This is apparent since the number and distribution of (w,k)-minimizers is sequence dependent and hence leads to a different threshold for each read. In the methods section, we describe a method to compute, for a given maximal number of errors, a threshold depending on the parameters w, k, and the number of minimizers a query has. Tools like Mantis and COBS, which use a simple percentage cutoff, would suffer a similar increase in false negatives if they used minimizers. However, they could use our results to adapt their thresholding. In our dataset, the number of minimizers for each read ranges from 15 to 35, while our thresholds range from 4 to 13.
For example, for a query of length 100 with 20 minimizers, the threshold is 6, which is lower than 0.41 · 20 = 8.2. Using our threshold avoids falsely filtering out the query. In general, the percentage of minimizers that need to be present ranges from 26% to 38%. This shows that applying a single fixed threshold is not sufficient. Table 3 shows the resulting FPs and FNs of our adapted lemma for various IBF sizes.
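A minimal sketch of the two thresholding rules discussed above: the k-mer counting lemma applies when all k-mers are kept (w = k), while the fixed-percentage cutoff mimics tools like Mantis and COBS. Raptor's probabilistic threshold for w > k is more involved (see STAR Methods) and is not reproduced here:

```python
def kmer_lemma_threshold(p: int, k: int, e: int) -> int:
    """k-mer counting lemma: a length-p pattern contains p - k + 1 k-mers,
    and each of the e errors destroys at most k of them, so at least
    (p - k + 1) - e*k k-mers must still match."""
    return (p - k + 1) - e * k

def percentage_threshold(n_minimizers: int, frac: float = 0.41) -> float:
    """A single fixed-percentage cutoff applied to the minimizer count,
    as used by percentage-based tools."""
    return frac * n_minimizers
```

For a 100 bp read with k = 19 and 2 errors, the lemma requires 44 of the 82 k-mers to match (≈53%, the threshold used for the artificial dataset comparison below).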

Comparison with other tools
In the following, we compare Raptor with the state-of-the-art tools Mantis (Pandey et al., 2018) and COBS (Bingmann et al., 2019) using the artificial data set and a real data set of 1,742 RNA-Seq experiments also used in (Pandey et al., 2018), as described earlier. Note that the computational experiments for the real data set only use one thread for all tools, the same as was done in the competitors' publications. The effect of parallelization was tested using the artificial data set, where we used 32 threads.
We built an index over the artificial dataset (separated into 64 and 1,024 bins) with COBS and Mantis for a k-mer size of 19. Afterward, we queried the same reads we had already searched with Raptor using binning directories (BDs). Both COBS and Mantis consider a transcript found if the number of k-mers found is greater than or equal to a given threshold. Instead of using the default threshold of 80 percent, we determined a threshold according to the standard k-mer counting lemma, which was 53 percent.
Moreover, we had to adapt our input for the index construction of Mantis by adding random quality scores to our FASTA files, because Mantis only accepts FASTQ files as input. But even with this adaptation, Mantis, or more precisely the helper tool Squeakr, resulted in a segmentation fault for the artificial dataset separated into 64 bins; thus, we only present results for Mantis with 1,024 bins.
As can be seen in Table 4, the construction of COBS and Mantis takes at least three times longer than for Raptor. Furthermore, searching with Raptor only needs a fraction of the space (about 5–8 GiB vs. the 20 GiB COBS and Mantis need) while being as accurate. The most striking difference is the search time. For (19,19)-minimizers, Raptor needs between 0.9 s for b = 64 and 1.6 s for b = 1,024. This is about 144 times faster than COBS and (for b = 1,024) about 30 times faster than Mantis.
Next, we evaluated the tools on the real dataset. We built an index over the dataset (that means for 1,742 bins) with COBS, Mantis, and Raptor using a binning directory for a k-mer size of 20 (since this value was used in (Pandey et al., 2018)). For Raptor, we created two versions, one using a binning directory with an IBF with (20,20)-minimizers and one using a binning directory with an IBF with (40,20)-minimizers.
Mantis uses a cutoff in order to sort out low-frequency k-mers that probably resulted from sequencing errors (Pandey et al., 2018). In order to be comparable, Raptor applied the same cutoffs for both versions of the binning directories. The results are shown in Table 5.
Raptor's construction time of the binning directory is faster than that of Mantis and COBS. The space consumption drops to only 5% of that of Mantis when using (40,20)-minimizers. COBS's construction time and space consumption is nowhere near that of the other two applications, because COBS has no preprocessing step and does not use cutoffs to filter out erroneous k-mers. Therefore, further comparisons to COBS are omitted.

Table note: The resulting threshold is ≈41%. False positives are reads originating from bin i assigned to a bin j ≠ i, neglecting the fact that the read may match bin j when allowing for 2 errors. False negatives are reads originating from bin i not assigned to bin i.
In order to compare the query times, three differently sized sets (100, 1,000, and 10,000 transcripts) were used. Each set was created by randomly picking human gene transcripts. The lengths varied between 46 bp and 101,518 bp.
The FPR was determined by comparing Raptor's results to Mantis, assuming Mantis correctly finds all experiments as it claims to be exact. A similar evaluation was applied in (Pandey et al., 2018). Also, the definition of a found transcript is based on the evaluation of (Pandey et al., 2018). Therefore, both Mantis and Raptor consider a transcript found in an experiment if 80% of its representative k-mers are found.
As shown in Table 6, Raptor using a BD with (40,20)-minimizers is significantly faster (12–58 times) than Mantis and uses only a fraction of the main memory while still being specific, with a low FP rate of about 0.017. Even when using (20,20)-minimizers, Raptor outperforms Mantis in space and time consumption.

Biological case study
To evaluate the use of Raptor in conjunction with other bioinformatics applications, we re-implemented the distributed read mapping tool DREAM-Yara (Dadi et al., 2018). These changes consisted of switching to the IBF implementation used in Raptor as well as incorporating minimizers. Additionally, smaller changes to address inconsistencies in command line usage and bugs were applied to both the original and the new DREAM-Yara.
We will compare the original DREAM-Yara using 19-mers to the new DREAM-Yara using 19-mers and both (23,19)- and (31,19)-minimizers for different IBF sizes.
The steps involved in building the FM-indices used by DREAM-Yara were not affected by the adaptations; hence, the build time and memory consumption for the FM-index in DREAM-Yara are, for both the new and the original version, around 40 min and 85 GiB for 1024 bins and 5 hr 20 min and 89 GiB for 64 bins. As can be seen in Table 7, the memory consumption of both versions of DREAM-Yara is virtually the same when using 19-mers and an IBF of size 16 GiB. The IBF build step uses around 17.1 GiB and 17.4 GiB of memory, while the mapping requires approximately 30 and 23 GiB for 64 and 1024 bins, respectively.
The new DREAM-Yara shows a performance improvement regarding run time, reducing the time needed to build the IBF from 9:17 to 7:37 min and from 9:14 to 7:49 min for 64 and 1024 bins, respectively. The mapping time similarly decreased from 5:15 to 3:26 min and from 7:39 to 6:18 min. While the original version already had 255 (64 bins) and 220 (1024 bins) unmapped reads, the new version increased this by a handful of reads to 257 and 234, respectively.
Applying minimizers allows using smaller IBFs: for (23,19)-minimizers, a 4 GiB IBF is sufficient, and for (31,19)-minimizers, even 2 GiB are enough. This decreases the memory consumption by around the amount of memory saved by the smaller IBF, resulting in around 5 and 3 GiB required to build the IBF. The memory consumption of the mapping step decreases to 17.5 GiB for (23,19)-minimizers and to 15.2 GiB for (31,19)-minimizers when using 64 bins, and to 10.8 and 9 GiB when using 1024 bins. The time needed to build an IBF only varies by a few seconds between 64 and 1024 bins, with 19-mers for the new DREAM-Yara having the biggest difference at 12 s (7:37 and 7:49 for 64 and 1024 bins, respectively), while the difference is 3 s otherwise. When using the same parameters, the new version is around 1.5 min faster to build the IBF (from 9:17 to 7:47). Using minimizers amplifies this effect, resulting in an IBF build time of 4:15 for (23,19)-minimizers and 2:15 for (31,19)-minimizers, due to a smaller IBF size and fewer representative k-mers to process. The new version of the IBF is also able to answer queries faster than the version used in the original DREAM-Yara; hence, the time needed for mapping decreases when switching to the newer version. Comparing for 19-mers, there is a change from 5:15 to 3:25 for 64 bins and from 7:39 to 6:18 for 1024 bins. Since the thresholds for minimizers are heuristic, this generally results in more verifications being necessary, i.e., the mapper needs to map more reads. Hence, while still faster than the original version, the runtime for 64 bins increases to 3:54 and 4:24 for (23,19)- and (31,19)-minimizers, respectively. Likewise, the runtime for 1024 bins increases to 6:59 and 6:52 for (23,19)- and (31,19)-minimizers, respectively.
When using minimizers, and hence a probabilistic thresholding, the number of unmapped reads increased by only 2 and 14 reads, respectively, which is quite acceptable considering the memory savings.

DISCUSSION
In this paper, we presented an approach to answer approximate string queries using a representative set of k-mers of the database and query. We stored a set of (w,k)-minimizers as representative k-mers of the database in a binning directory using a partitioned IBF.
Binning directories could be used in various settings which we discuss below.

Using BDs for metagenomic profiling
Tools like Kraken (Wood and Salzberg, 2014) or Centrifuge (Kim et al., 2016) perform metagenomic binning by querying the k-mer content of genomes using NGS reads and inferring the presence or absence of organisms in the sample using the taxonomy of a phylogenetic tree.
Hence, we could use BDs for a classification based on taxonomic levels (e.g., species or genus) or assembly level, and group the reference genome sequences accordingly. Using the counts for k-mers given by the BD, we can infer the composition of a metagenomic sample. Indeed, (Piro et al., 2020) already applied this idea, as described in (Dadi et al., 2018), for this task, and resolved the problem of uneven bin sizes by applying a preprocessing step to distribute the k-mer content of the bins more evenly.

Using BDs for querying file content
Another application for BDs, which we also used in one benchmark, is to query all existing human RNA-Seq files in the SRA for the presence of transcripts. For this application, the bins would be defined by the respective file content. We would expect that the effective text size n(k) is considerably less than 5 Tbases, since we sample from human genes. Of course, this application scenario is not limited to RNA-Seq files.

Using BDs for read mapping
In the context of read mapping, we can use the BD as follows. The database would be the reference genome(s) we want to map our reads against. Assume we have divided them into bins such that similar parts of the genomes are placed within the same bin. In the context of metagenomic analysis, this could be achieved by using a taxonomic tree (see also (Dadi et al., 2018)); alternatively, the sequences could be clustered based on similarity. The sequences in the bins could then be indexed using a compressed suffix array or other suitable indices, and the BD can be used to distribute the approximate searches. The biological case study in this paper is an example of such an application.
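The distribution scheme described above can be sketched as follows; `bd_query` and `bin_mappers` are hypothetical stand-ins for the binning directory query and the per-bin full-text index mappers:

```python
def distributed_map(reads, bd_query, bin_mappers):
    """Sketch of DREAM-Yara-style distributed mapping: the binning directory
    prefilters candidate bins per read, and only those bins' full-text
    indices (e.g., FM-indices) are searched to verify the matches."""
    results = {}
    for read in reads:
        hits = []
        for b in bd_query(read):               # candidate bins from the BD
            hits.extend(bin_mappers[b](read))  # verify within the bin's index
        results[read] = hits
    return results
```

The prefilter means most reads touch only a few of the per-bin indices, which is what makes the distributed approach pay off.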

Possible extensions
Currently, Raptor stores all representative k-mers, even if some representative k-mers in the reference dataset are ubiquitous, i.e., they appear in all or almost all (e.g., 95% of) bins. While some approaches, like Mantis, exclude certain k-mers from consideration, one could instead exclude them from the IBF and store them in a small lookup table. Whenever such a k-mer is queried, we can increase the counters on all bins and save the lookups in the IBF. This might reduce the size of the IBF and speed up the search time.
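The proposed extension could look roughly like the following sketch; the 95% cutoff and all names are illustrative assumptions for a not-yet-existing feature, and the dictionary lookup stands in for the real IBF query:

```python
def split_ubiquitous(kmer_to_bins, bins, frac=0.95):
    """Move k-mers present in at least frac of all bins out of the IBF
    into a small lookup set; the rest stay in the regular index."""
    cutoff = frac * bins
    ubiquitous = {km for km, bs in kmer_to_bins.items() if len(bs) >= cutoff}
    regular = {km: bs for km, bs in kmer_to_bins.items() if km not in ubiquitous}
    return ubiquitous, regular

def count_query(kmers, ubiquitous, regular, bins):
    """Accumulate per-bin counts; ubiquitous k-mers hit every bin without
    an IBF lookup, saving work exactly where lookups are least informative."""
    counts = [0] * bins
    for km in kmers:
        if km in ubiquitous:
            counts = [c + 1 for c in counts]   # credit all bins, skip the IBF
        else:
            for b in regular.get(km, ()):      # stand-in for the IBF lookup
                counts[b] += 1
    return counts
```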
While not shown in this paper, the update operation on an IBF was already used in DREAM-Yara (Dadi et al., 2018) and ganon (Piro et al., 2020). Adding data is trivial since we just need to set the corresponding bits in the x-PIBF. To remove data from the x-PIBF, we need to clear and rebuild the bins affected by the update.

Limitations of study
The main limiting factor of our method is its susceptibility to uneven bin sizes as well as a high number of bins. The number of bins b directly affects the run time and memory consumption since the processed sub-bitvectors are of size b. While the increase in memory consumption is negligible, processing several tens of thousands of bins will lead to a moderate slowdown. The size of the IBF is determined by the bin with the highest k-mer content. Hence, having unevenly distributed k-mer content across the bins will force us to increase the size of the IBF to accommodate a reasonable FP rate for the biggest bin, even though the smaller bins could achieve the same FP rate with a much smaller size. However, we are actively working on a version of the IBF that will automatically adapt to inputs with uneven bin sizes. Furthermore, we will enhance the IBF by adding a hierarchical structure, i.e., a multi-level tree structure, which will dramatically reduce the number of bins needed to represent a dataset.

Figure 3. Impact of an error on the number of k-mers
The sequences (A) and (B) represent the same sequence without and with an error at position 6 replacing a T with a G, respectively. The sequence in (A) has 3 minimizers, one of which (caca) is destroyed by the error position. Hence, one could assume that a sequence with one error at this position has a count of 2. However, introducing the error by replacing T with G has the effect that the first window now has a different minimizer not covering the error position (ggca), and hence (B) still has a minimizer count of 3. Thus, (A) and (B) would be wrongly deemed not matching with 1 error.

OPEN ACCESS
iScience 24, 102782, July 23, 2021

In conclusion, we presented a novel, versatile, fast, and memory-efficient data structure for k-mer-based analysis of large sets of sequences using binning directories. Our implementation, Raptor, is ready for secondary-memory use, and its data structures can be efficiently compressed if the used bitvector is sparse. Furthermore, we showed that the concept of (w, k)-minimizers allows us to effectively reduce the set of representative k-mers without sacrificing specificity or sensitivity by applying our probabilistic thresholding. Raptor outperformed the state-of-the-art tools Mantis and COBS in both run time and space consumption. The use of (40, 20)-minimizers reduced the memory footprint of our method from 8 to 0.9 GiB for the RNA-Seq dataset introduced in (Pandey et al., 2018), which is about one order of magnitude less than Mantis (≈19–23 GiB). Using (w, k)-minimizers, the run time was better by factors between 12 and 144 compared to Mantis and COBS, which enables analyzing large sequencing archives in ways that were not possible before. Raptor and binning directories are available in the SeqAn library (Reinert et al., 2017) of efficient data types and algorithms.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Materials availability
This study did not generate new materials.

Data and code availability
The data used for our experiments pertaining to RNA-Seq is the same as used in Mantis (Pandey et al., 2018). It consists of 2652 human sequencing experiments which originate from blood, breast, and brain tissue samples. A list of SRRs and URIs of these experiments is provided by the authors of Mantis via Zenodo at https://doi.org/10.5281/zenodo.1186393.
All scripts and supplemental code used in this paper, including those related to the artificial dataset and DREAM-Yara, are available at https://github.com/seqan/raptor.
Pre-built DREAM-Yara indices for parameters shown in this paper are available at https://ftp.imp.fu-berlin.de/pub/raptor/.

(w, k)-Minimizers
In this work, we introduce the concept of (w, k)-minimizers for computing representative k-mers. In Figure 1, we used the lexicographically smallest k-mer for clarity. In practice, this leads to a skewed distribution of minimizers, which can be corrected by, for example, applying an XOR operation with a random value to each k-mer hash value before taking the numeric minimum (see (Marçais et al., 2017) for a discussion). The figure shows a short example of (a) ungapped 4-mers in a window of size 4, which means we take the lexicographically smallest of the k-mer and its reverse complement as the minimizing k-mer. The second case (b) shows the minimizing 4-mers for a window of size 8, where we take the minimum of all k-mers and their reverse complements in this window. We denote the window span with '-' and place the minimizing k-mer at the respective window position. For the properties and the size of the data we handle, k will usually be in the range of 16–32.
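The scheme above can be sketched in a few lines. This is an illustrative toy, not Raptor's actual implementation: the scrambling constant and the 2-bit encoding are assumptions, and the canonical form is taken as the smaller of a k-mer and its reverse complement, as described for case (a).

```python
# Sketch of (w, k)-minimizer computation (illustrative, not Raptor's code).
# Hash values are XOR-scrambled with an arbitrary constant (assumption) to
# avoid the skew of plain lexicographic minima; the canonical k-mer is the
# smaller of the k-mer and its reverse complement.

SEED = 0x8F3F73B5CF1C9ADE  # arbitrary scrambling constant (assumption)
COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)

def kmer_hash(kmer: str) -> int:
    # encode A, C, G, T as 2 bits each, then scramble with XOR
    code = 0
    for c in kmer:
        code = (code << 2) | "ACGT".index(c)
    return code ^ SEED

def minimizers(seq: str, w: int, k: int) -> set[int]:
    # one minimizer per window of w consecutive bases (w - k + 1 k-mers);
    # consecutive windows often share their minimizer, hence a set is returned
    result = set()
    for start in range(len(seq) - w + 1):
        window = seq[start:start + w]
        result.add(min(kmer_hash(canonical(window[i:i + k]))
                       for i in range(w - k + 1)))
    return result
```

With w = k this degenerates to the set of all canonical k-mer hashes, i.e., plain (k, k)-minimizers.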

Effective text size and ratio
In general, Raptor assumes that we have a collection of strings {T_j} over an alphabet Σ, with a total length n = Σ_j |T_j|. Raptor stores the k-mer content of {T_j} or a representative transformation of it. Raptor uses (w, k)-minimizers for computing the set of representative k-mers.
To capture the repetitiveness of the {T_j}, we define the effective text length n(k) as the number of distinct, representative k-mers in all the T_j. In order to store the set of texts {T_j}, we further assume that we have divided the T_j into b bins B_i (note that a single T_j could itself be divided into several bins without many adaptations). For the strings in a bin B_i, we denote the set of representative k-mers with B_i(k) and define the effective text length n_i(k) as the number of representative k-mers of the strings in B_i, i.e., the cardinality of B_i(k).
Dividing the strings into bins could result in a large or small intersection of their representative k-mer content, depending on the method. To capture this, we define the effective text ratio r(k) as (Σ_i n_i(k)) / n(k). The effective text ratio is a measure of how well we have partitioned our k-mer content into the bins. Ideally, it is 1; in the worst case, it is b. We want to point out that the effective text length n(k) is a crucial measure for the problem of indexing large genomic text collections. For example, indices have been computed for 170 TiB of sequence data; however, this data set is quite repetitive, since its effective text length n(31) is only 6·10^10.
For our artificial data set, we give the effective text ratio r(k) for different k for both 64 and 1,024 bins in Table 8.
One can see that we need a k-mer size of at least 16 to achieve an effective text ratio under 2. For k ≥ 19, the effective text ratio is near 1, which means that most k-mers in the bins are unique. For this reason, we used k = 19 in our experiments.
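The two quantities can be computed directly from their definitions. The following toy computation (bin names and sequences invented; plain k-mer content instead of minimizers for brevity) illustrates n(k), n_i(k), and r(k):

```python
# Toy computation of the effective text length n(k) and the effective
# text ratio r(k) = sum_i n_i(k) / n(k). Bins and sequences are invented.

def rep_kmers(seq: str, k: int) -> set[str]:
    # plain k-mer content here; Raptor would use (w, k)-minimizers instead
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

bins = {
    "bin0": "ACGTACGTTT",
    "bin1": "ACGTACGAAA",
}
k = 4
per_bin = {name: rep_kmers(seq, k) for name, seq in bins.items()}
n_i = {name: len(s) for name, s in per_bin.items()}   # n_i(k) per bin
n_k = len(set().union(*per_bin.values()))             # n(k): distinct k-mers overall
r_k = sum(n_i.values()) / n_k                         # effective text ratio r(k)
# r_k is 1 when the bins share no k-mers and approaches b when all bins
# contain the same k-mer content
```

Here the two bins share the 4-mers of their common prefix, so r(k) = 13/9 ≈ 1.44, i.e., the partition is reasonable but not perfectly disjoint.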

Binning directories
We define a binning directory (BD) for the text collection {T_j}, divided into b bins B_i, as a data structure that returns, for each bin B_i, the counts of the representative k-mers in the query multiset I(k). In this work, a binning directory uses a set membership data structure, namely the x-PIBF, that returns the bin membership as a (compressed) bitvector, which we call the binning bitvector. The BD then combines the binning bitvectors into count vectors. Our tool Raptor uses (probabilistic) thresholding to determine whether a query is in a bin or not.
Implementing a simple version is indeed not difficult. The problem lies in the fact that the effective text size n(k) can be very large, i.e., 10^10 to 10^12. For example, the metagenomics data set used in (Dadi et al., 2018) contains about 2·10^10 distinct 19-mers. A naive implementation that stores all those 19-mers in a hash table containing the binning bitvectors for 1,024 bins would need about 40 TiB (an open-addressing hash table with about 4·10^10 entries, each pointing to a 1,024-bit bitvector). Hence, the challenge is implementing the BD in a more space-efficient way while maintaining a fast run time.
We approach this problem in two ways in this paper. First, to implement a binning directory, we adapted the IBF data structure presented in (Dadi et al., 2018) to work well on secondary memory; we call the result the x-partitioned IBF (x-PIBF). Second, we employ (w, k)-minimizers to significantly reduce the number of representative k-mers while still accurately answering the question of in which bins a query can occur.

Answering a query with Raptor
Answering a query includes the retrieval of the binning bitvectors and the counting of k-mers to determine the bins to be searched. Using an x-PIBF, Raptor has to compute h hash functions, retrieve h sub-bitvectors, and compute their bitwise AND. We can use a standard bitvector of size n that uses n bits, or the compressed bitvector of the SDSL (Gog et al., 2014) that uses approximately m · (2 + log(n/m)) bits, where m is the number of bits set and n the length of the bitvector.
For counting, Raptor has to traverse the binning bitvector of size b and increment counters for each bit set to 1. To speed up this crucial step, we used, for uncompressed bitvectors, the lzcount intrinsic operation, which counts the number of leading zeros in a 64-bit word. This accelerated the bin counting step by a factor of almost 2 compared to checking each bit individually. A further speed-up is possible once the AVX512 SIMD extensions are available on standard computers (already possible for Intel's Skylake processors). These optimizations cannot be directly applied to compressed bitvectors.
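The idea behind the intrinsic-based counting is to jump from set bit to set bit of each machine word instead of testing all b bits. A minimal sketch of this loop (Python has no lzcount intrinsic, so the lowest-set-bit trick `(x & -x).bit_length() - 1` stands in for it; the function name is invented):

```python
# Sketch of bit-jumping counter increments over one 64-bit word of a
# binning bitvector. In C++ this loop would use an lzcount/tzcount
# intrinsic; here (word & -word).bit_length() - 1 yields the index of
# the lowest set bit.

def increment_counters(word: int, offset: int, counters: list[int]) -> None:
    # word: one 64-bit chunk of a binning bitvector
    # offset: bin index of that chunk's first bit
    # counters: one counter per bin
    while word:
        i = (word & -word).bit_length() - 1  # index of the lowest set bit
        counters[offset + i] += 1
        word &= word - 1                     # clear that bit and continue

counters = [0] * 8
increment_counters(0b10100110, 0, counters)  # bits 1, 2, 5, 7 are set
```

For sparse binning bitvectors, the loop runs once per set bit rather than once per bin, which is the source of the speed-up described above.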
Having the counts, we apply a thresholding according to the original k-mer counting lemma (Jokinen and Ukkonen, 1991) or according to a probabilistic model for (w, k)-minimizers.
Lemma 1. For a given k and number of errors e, there are k_p = |p| − k + 1 many k-mers in p, and an approximate occurrence of p in T has to share at least t = k_p − k·e k-mers.
Hence, if the count exceeds the threshold for the bin, we report the pattern to occur in this bin; otherwise, we do not. This approach is depicted in Figure 2. However, using minimizers makes the direct application of Lemma 1 impossible for w > k. We present a solution in the next section.
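Lemma 1 translates directly into a filter predicate. The following sketch is a plain transcription of the lemma (function names invented); it applies only to plain k-mers, i.e., (k, k)-minimizers:

```python
# The k-mer counting lemma (Lemma 1) as a filter predicate.
# Only valid for plain k-mers, i.e., (k, k)-minimizers.

def lemma_threshold(pattern_len: int, k: int, errors: int) -> int:
    k_p = pattern_len - k + 1   # number of k-mers in the pattern
    return k_p - k * errors     # t = k_p - k * e (may become <= 0)

def bin_matches(shared_kmers: int, pattern_len: int, k: int, errors: int) -> bool:
    # require at least the lemma threshold, but always at least one shared k-mer
    return shared_kmers >= max(lemma_threshold(pattern_len, k, errors), 1)
```

For example, a 100 bp read with k = 19 and 2 errors contains 82 k-mers and must share at least 82 − 38 = 44 of them with an approximate occurrence.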

Probabilistic thresholding
Lemma 1 works only for (k, k)-minimizers; it does not hold for general (w, k)-minimizers. The latter is apparent since the number and distribution of (w, k)-minimizers is sequence dependent and hence leads to different thresholds for each read. This problem is exemplified in Figure 3. The examples show that an error does not only directly invalidate the minimizers covering the error position but can also indirectly affect minimizers not covering the error position, resulting in a different count of minimizers.
Taking these indirectly destroyed minimizers into consideration, there are several ad hoc ways to compute the threshold. The first is to adapt Lemma 1 as follows: for a given k, w, and number of errors e, there are w_p = |p| − w + 1 many windows in p, and if we take the multiplicity of the minimizers into account, an approximate occurrence of p in T has to share at least t = w_p − w·e minimizers, i.e., we replace k with w. However, this leads to low thresholds. The threshold in Figure 3 (A) would be negative, i.e., 13 − 8 + 1 − 8 = −2, and thus useless for filtering.
Another way to compute an individual threshold is to repeat the following two steps for each error: 1) compute the minimizer coverage of a query p (counting each minimizer only once); 2) choose one position of maximum coverage and remove the minimizers covering this position. The overall number of removed minimizers is subtracted from the number of minimizers to obtain the threshold t. This works better than the first approach but is time-consuming to compute.
We can show that a simple probabilistic model yields thresholds that are on average much better than the above ad hoc methods and removes the need to compute an individual threshold for each read. For this, we proceed as follows. We denote the size of the query by p, the size of the window in which we compute minimizers by w, and the number of minimizers in a query by m. As pointed out above, an error affects several minimizers depending on its position and the character that is replaced. Each error lowers the threshold of the counting lemma. We want to derive a probabilistic model that computes this threshold for a given p as well as possible with only w, k, and m as input. We aim at a threshold that is as high as possible without missing too many hits (i.e., without introducing too many false negatives).
For this, we use the following definitions:

Index based model for one error
In order to define P(D_n) for one error, we need the distribution of X_i. In practice, assuming a uniform distribution of the minimizers in [0, p − k + 1] yields good results. This can now be used to compute P(D_n). Let j be the position within a sequence where the error is located. The minimizers directly affected by the error at position j are those starting at indices j − i for all i < k; these are exactly the minimizers that include position j. Hence, there are at most k minimizers that include position j, and thereby at most k minimizers directly modified by the error. This can be expressed via a binomial distribution and yields P(D_n) for all n ∈ [0, k]. P(D_n) can now be used to compute the new filtering threshold. Let d be chosen such that at most d many (w, k)-minimizers are affected with probability τ. Then the following lemma holds: Lemma 2. For a given p, w, and k and one error, at most d many k-mers are affected with probability τ. Thus, an approximate occurrence of the query in T has to share at least t = m − d k-mers with probability at least 1 − τ.
This method enables us to use τ to control the false positive and false negative rates of our filter. The higher we choose τ, the more minimizers are assumed to be affected, i.e., the threshold t decreases and the false positive rate increases. The lower we choose τ, the fewer minimizers are assumed to be affected, i.e., the threshold t increases and the false negative rate increases. Furthermore, this model only depends on m, which is known for each read. In order to obtain the threshold, we simply look it up in a precomputed table for the specific parameters, or compute the table once if it has not been computed yet.
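One way to realize the one-error model is sketched below. It is a simplified reading of the text above, not Raptor's exact implementation: we assume the number of directly destroyed minimizers D follows Binomial(k, m / (p − k + 1)), i.e., each of the k slots covering the error position holds a minimizer start with probability m / (p − k + 1), and we pick the smallest d whose cumulative probability reaches τ.

```python
# Sketch of the index-based one-error threshold (simplifying assumption:
# D ~ Binomial(k, m / (p - k + 1)); the published model may differ in detail).

from math import comb

def one_error_threshold(p: int, k: int, m: int, tau: float = 0.99) -> int:
    # p: query length, k: k-mer size, m: number of minimizers in the query
    q = m / (p - k + 1)  # chance a given covering slot holds a minimizer start
    cum = 0.0
    for d in range(k + 1):
        cum += comb(k, d) * q**d * (1 - q) ** (k - d)
        if cum >= tau:           # smallest d with P(D <= d) >= tau
            return max(m - d, 1)  # threshold t = m - d, at least 1
    return 1
```

As described above, raising τ allows for more destroyed minimizers and therefore lowers the resulting count threshold.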

Extension for indirect errors
The previous model can be extended to account for errors affecting minimizers indirectly. The example in Figure 3 shows that an error can affect a minimizer even though the error does not overlap with the minimizer.
We define the random variable C_n that is 1 if the error affects n k-mers indirectly. If we assume, for simplification, that the events of directly and indirectly affecting a minimizer are independent, P(D_n) becomes, for all n ∈ [0, w]:

P(D_n) = Σ_{i=0}^{n} P′(D_{n−i}) · P(C_i),

where P′(D_n) is the distribution of the previous model.
The main challenge with this extension is the computation of P(C_n). We tried to find a tractable formulation of C_n but did not succeed and leave this as an open problem. Therefore, in this work we estimate P(C_n) experimentally by sampling: we generate a query with and without an error and then check the number of minimizers that are indirectly modified by the error. This sampling method is flexible for different parameters and can easily be applied for the relevant range of parameters or be quickly computed on the fly. In our experiments, sampling 10,000 cases led to convergence of P(C_n). The thresholds are computed for τ = 0.99.
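The sampling procedure can be sketched as follows. This is a simplified stand-in (lexicographic minimizers without hashing, invented default parameters), meant only to show the shape of the estimate: draw random sequences, inject one substitution, and tally how many destroyed minimizers do not cover the error position.

```python
# Sketch of the sampling-based estimate of P(C_n): count minimizers that an
# error destroys *indirectly* (i.e., minimizers not covering the error).
# Simplified lexicographic minimizer scheme; parameters are illustrative.

import random
from collections import Counter

def mins(seq: str, w: int, k: int) -> set:
    # minimizers together with their start positions
    out = set()
    for s in range(len(seq) - w + 1):
        out.add(min((seq[i:i + k], i) for i in range(s, s + w - k + 1)))
    return out

def sample_indirect(p=50, w=24, k=16, trials=1000, seed=42) -> dict:
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        seq = "".join(rng.choice("ACGT") for _ in range(p))
        j = rng.randrange(p)  # error position
        sub = rng.choice([c for c in "ACGT" if c != seq[j]])
        mutated = seq[:j] + sub + seq[j + 1:]
        lost = mins(seq, w, k) - mins(mutated, w, k)
        # indirect: destroyed minimizers whose span does not include j
        counts[sum(1 for kmer, i in lost if not (i <= j < i + k))] += 1
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}
```

Running this for the relevant parameter range yields an empirical distribution over n that can be plugged in for P(C_n).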

Extension for multiple errors
We can further extend our model to consider multiple errors. For example, assume that there are two errors, e_1 and e_2, that affect d_1 and d_2 many k-mers, respectively. Thus, the two errors combined affect at most d = d_1 + d_2 many k-mers. Hence, affecting at most d many k-mers can be achieved by all combinations of (d_1, d_2) such that d ≥ d_1 + d_2 and 0 ≤ d_1 ≤ w, 0 ≤ d_2 ≤ w.
Generalizing this to e errors affecting d_1, …, d_e minimizers leads to the following distribution, for all n ∈ [0, w·e]:

P(D_n) = Σ_{d_1 + d_2 + … + d_e = n, 0 ≤ d_1, …, d_e ≤ w} Π_{i=1}^{e} P′(D_{d_i}),

with P′(D_n) being the distribution of the previous model allowing for indirect errors.
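Under the independence assumption, the e-error distribution above is simply the single-error distribution convolved with itself e times, which is how one would compute it in practice (a sketch; function names invented):

```python
# Sketch of the multi-error extension: with independent errors, the total
# number of affected minimizers is distributed as the e-fold convolution
# of the single-error distribution P'(D_n).

def convolve(p1: list[float], p2: list[float]) -> list[float]:
    out = [0.0] * (len(p1) + len(p2) - 1)
    for i, a in enumerate(p1):
        for j, b in enumerate(p2):
            out[i + j] += a * b  # sum over all splits d_1 + d_2 = n
    return out

def multi_error_dist(single: list[float], e: int) -> list[float]:
    # single[n] = P'(D_n) for one error (direct plus indirect effects)
    dist = [1.0]  # zero errors: 0 minimizers affected with probability 1
    for _ in range(e):
        dist = convolve(dist, single)
    return dist
```

Taking the smallest d whose cumulative probability in this distribution reaches τ then gives the multi-error threshold t = m − d, as in the one-error case.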
We also developed more involved models to account for two errors being in close vicinity (not shown); however, the rounded thresholds very seldom change. Hence, we omit them.
In summary, the threshold for the counting lemma can be looked up in a table given the number of errors, the number of minimizers of a query, and the parameters (w, k); hence, the filtering speeds up considerably. The method models the occurrences of (w, k)-minimizers within the windows and how they affect each other (see Figure 3 for an example).
x-partitioned IBF
Finally, we propose our last contribution, the use of an x-partitioned interleaved Bloom filter (x-PIBF) in binning directories, which is an extension of the IBF proposed in (Dadi et al., 2018). An IBF for b bins combines b standard Bloom filters (Bloom, 1970). A Bloom filter is simply a bitvector of size n and a set of h hash functions that map a value, in our case a representative k-mer, to one of the bit positions. A value is present in the Bloom filter if all h positions return a 1. Note that a Bloom filter can give a false positive answer. However, if the Bloom filter size is large enough, the probability of a false positive answer is low. A Bloom filter of size n bits with h different hash functions and m inserted elements has a false positive probability of approximately p_fp ≈ (1 − e^(−h·m/n))^h. For this reason, we have to allocate sufficient space such that p_fp does not become too large. Still, the problem with using a simple Bloom filter is that it does not point us to the binning bitvectors. To alleviate this problem, the IBF uses b Bloom filters (one for each bin) with identical hash functions and interleaves their bitvectors. Put differently, it replaces each bit in the Bloom filter by a (sub-)bitvector of size b, where the i-th bit ''belongs'' to the Bloom filter for bin B_i. The resulting IBF has a size of b · n. When inserting a k-mer from bin B_i into the IBF, we compute all h hash functions, which point to the positions of the blocks where the sub-bitvectors are, and then set the i-th bit from the respective beginnings. Hence, the IBF effectively interleaves b Bloom filters in a way that allows us to easily retrieve the binning bitvectors for the h hash functions. When querying in which bins a k-mer can be found, we retrieve the h sub-bitvectors and apply a logical AND to them, which results in the required binning bitvector indicating the membership of the k-mer in the bins.
The procedure is depicted in Figure 4. Finally, the binning bitvectors are summed up to obtain the count vectors. For this, we allocate t many counters, each with b entries, where t is the number of threads used. These counters are reused as we process the reads in parallel.
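The interleaving scheme described above can be sketched in a few lines. This is a minimal toy, not Raptor's implementation: the bitvector is a plain list, and the h hash functions are invented multiply-shift stand-ins.

```python
# Minimal interleaved Bloom filter sketch: n blocks of b interleaved bits,
# insert sets the bin's bit in h blocks, query ANDs the h sub-bitvectors.
# Toy hash functions (assumption); not Raptor's actual hashing.

class IBF:
    def __init__(self, n: int, bins: int, hashes: int):
        self.n, self.b, self.h = n, bins, hashes
        self.bits = [0] * (n * bins)  # n blocks, each a sub-bitvector of size b

    def _positions(self, kmer_hash: int):
        # h block positions for a k-mer (illustrative hash family)
        return [(kmer_hash * (2 * i + 1) + i) % self.n for i in range(self.h)]

    def insert(self, kmer_hash: int, bin_id: int):
        for pos in self._positions(kmer_hash):
            self.bits[pos * self.b + bin_id] = 1  # i-th bit of the block

    def query(self, kmer_hash: int) -> list[int]:
        # bitwise AND of the h sub-bitvectors -> binning bitvector
        result = [1] * self.b
        for pos in self._positions(kmer_hash):
            block = self.bits[pos * self.b:(pos + 1) * self.b]
            result = [r & x for r, x in zip(result, block)]
        return result
```

Summing the binning bitvectors returned by `query` over all representative k-mers of a read yields the count vector that the thresholding step operates on.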
If the set of stored k-mers is very large, or if we want to achieve a very low FPR of the IBF, it might be too big to keep in main memory. For those cases, we implemented the x-partitioned IBF (x-PIBF), where we partition the set of stored k-mers into x parts as follows: for an x-PIBF of size s bits, we create x parts, each with ⌈s/x⌉ bits. Then we partition the k-mers based on their first q characters with q ≥ log_σ(x), where σ is the alphabet size. We count the q-mer frequencies of all k-mers and assign them as evenly as possible to the x parts (see the table below for an example). The counting step can be omitted, in which case we assume a uniform prefix distribution.
Finally, we have to adapt our hash functions such that all h hash values for a k-mer lie in the same part of the x-PIBF. This can easily be done by storing offsets in a table of size σ^q and adding those to the hash values of an IBF of size ⌈s/x⌉.
If we query a set of k-mers, we load the first part of the x-PIBF into memory and stream over all k-mers, counting the ones relevant for this part and ignoring the others. We then repeat this for all other parts after loading them. In the Results section, we report on the time/memory trade-off.
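The prefix-balancing step can be sketched as a simple greedy assignment (an illustrative reading of the description above; the exact balancing strategy used by Raptor is not specified here): count the q-mer prefixes and repeatedly give the heaviest unassigned prefix to the currently lightest part.

```python
# Sketch of the x-PIBF partitioning step: count q-mer prefixes of all
# k-mers and greedily assign prefixes to x parts, balancing their loads.
# Greedy largest-first balancing is an assumption for illustration.

from collections import Counter

def assign_prefixes(kmers: list[str], q: int, x: int) -> dict[str, int]:
    freq = Counter(kmer[:q] for kmer in kmers)
    loads = [0] * x
    assignment = {}
    # heaviest prefixes first, each to the currently lightest part
    for prefix, count in freq.most_common():
        part = loads.index(min(loads))
        assignment[prefix] = part
        loads[part] += count
    return assignment
```

All k-mers sharing a prefix land in the same part, which is what allows the query to stream over the parts one at a time.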

Compressing bitvectors
Binning directories use a large bitvector containing all the binning bitvectors for all representative k-mers.
In this work, we also allow the use of a compressed bitvector implementation from the SDSL (Gog et al., 2014). While a standard bitvector of size n uses n bits, the compressed bitvector of the SDSL uses approximately m · (2 + log(n/m)) bits, where m is the number of bits set.
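A back-of-the-envelope comparison of the two representations makes the trade-off concrete (the formula is the one stated above; the example numbers are invented):

```python
# Size comparison of a plain bitvector (n bits) versus an SDSL-style
# sparse representation (approximately m * (2 + log2(n / m)) bits for
# m set bits). Example parameters are illustrative.

from math import log2

def plain_bits(n: int) -> int:
    return n

def sparse_bits(n: int, m: int) -> float:
    return m * (2 + log2(n / m))

# with n = 2**30 bits and only m = 2**20 of them set, the sparse form
# needs about 12 * 2**20 bits, roughly 85x less than the plain bitvector
```

The compressed form therefore pays off exactly when the binning bitvectors are sparse, as noted in the conclusion.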