SRPRISM (Single Read Paired Read Indel Substitution Minimizer): an efficient aligner for assemblies with explicit guarantees

Background: Alignment of sequence reads generated by next-generation sequencing is an integral part of most pipelines analyzing next-generation sequencing data. A number of tools designed to quickly align a large volume of sequences are already available. However, most existing tools lack explicit guarantees about their output. They also do not support searching genome assemblies, such as the human genome assembly GRCh38, that include primary and alternate sequences and placement information for alternate sequences to primary sequences in the assembly. Findings: This paper describes SRPRISM (Single Read Paired Read Indel Substitution Minimizer), an alignment tool for aligning reads without splices. SRPRISM has features not available in most tools, such as (i) support for searching genome assemblies with alternate sequences, (ii) partial alignment of reads with a specified region of reads to be included in the alignment, (iii) choice of ranking schemes for alignments, and (iv) explicit criteria for search sensitivity. We compare the performance of SRPRISM to GEM, Kart, STAR, BWA-MEM, Bowtie2, Hobbes, and Yara using benchmark sets for paired and single reads of lengths 100 and 250 bp generated using DWGSIM. SRPRISM found the best results for most benchmark sets with error rate of up to ∼ 2.5% and GEM performed best for higher error rates. SRPRISM was also more sensitive than other tools even when sensitivity was reduced to improve run time performance. Conclusions: We present SRPRISM as a flexible read mapping tool that provides explicit guarantees on results.


Background
Rapid development of DNA sequencing technology resulted in an increasingly large amount of sequence data being generated and submitted to public databases, creating a demand for efficient tools to align nucleotide reads to reference genomes. Several such tools have emerged, and they generally fall in two categories.
One group consists of BLAST-like programs that search for local alignments of reads and aim for high sensitivity. These tools are relatively slow, require computers with a large amount of RAM or clusters of several computers running in parallel, and do not guarantee full sensitivity. Such tools as SHRiMP [1], BFast [2], and nucleotide BLAST itself [3] fall in this category.
The other group consists of fast aligners that can align large volumes of reads. Aligners in this group have an additional advantage in that they do not require a lot of memory to work, which makes using them practical on medium-to highlevel desktop computers. However, such aligners usually make some sacrifices to achieve good run time performance. Sensitivity is most commonly sacrificed, with some aligners having limits on the reference genome size, the ability to align with gaps, the number of alignments reported, or the alignment of paired reads. Examples of such tools are BWA [4], GEM [5], and Bowtie2 [6,7] based on Burrows-Wheeler transform; -RA aligner [8] based on suffix arrays; SNAP [9], SARUMAN [10], and SeqAlto [11] using k-mer indexing, and ZOOM [12] using spaced seeds. Kart [13] uses both Burrows-Wheeler transform and hash indexing. Some tools try to maintain sensitivity and achieve speed improvements using hardware-specific optimizations, such as SOAP3-dp [14], BarraCUDA [15], and CUSHAW [16]. Only a very small number of aligners, such as BatMis [17], RazerS 3 [18], Hobbes [19], Yara (formerly Masai) [20], and mrFAST [21], explicitly attempt to report all good alignments. STAR [22] is a popular splice-aware aligner with options that can be set to find unspliced alignments.
Comparison of read alignment tools has been done for software robustness [23], choice of parameters and algorithm features [24], effect of aligner on downstream processing [25,26], and properties of the reference genome to which reads are aligned [27]. A review of the read alignment problem for various applications and technologies was presented by Reinert et al. [28]. Comparisons of software tools use simulated reads, real-world read sets, or benchmarks developed using tools such as DWGSIM [29], Rabema [30], or Seal [31].
SRPRISM (Single Read Paired Read Indel Substitution Minimizer) is an aligner that aims to achieve low memory footprint, capability of being run on multiple platforms and commodity hardware, and options that provide a good balance between sensitivity and speed. SRPRISM has many capabilities present in most aligners-it can align both single and paired reads, supports alignments that have substitutions as well as gaps, and can process a large volume of reads efficiently. In addition, there are several important features that distinguish SRPRISM from other aligners, as described below. SRPRISM generates output in SAM format [32].
An important property of SRPRISM is that it is possible to list precisely the conditions that guarantee full sensitivity. Specifically, if certain limits on the read length and requested number of errors are satisfied and that read can be aligned to a reference with at most that many errors, then SRPRISM is guaranteed to report best scoring mappings for that read. Moreover, if the number of equivalent best mappings does not exceed a configurable upper bound on the number of mappings to be reported, then all such mappings are guaranteed to be reported.
Another distinguishing feature of SRPRISM is its support for genome assemblies, such as the current human assembly GRCh38, which uses the Genome Reference Consortium (GRC) [33] defined assembly model. This model accommodates sequence representations that can introduce allelic duplication, including alternate loci and patches. Alternate loci are extrachromosomal assembly sequences that provide variant representations for chromosomal regions. Patches are operationally similar to alternate loci and provide a means for the public release of assembly updates without disruption to reference chromosome coordinates. The relationship of alternate loci and patches to the chromosomes is defined by alignments that are included as part of the assembly release. SRPRISM currently does not distinguish between alternate loci and patches and treats them as equivalent substitutions for defined chromosomal regions. SRPRISM does not support patches or alternate loci for other alternate loci. Consequently, all sequences in the genome assembly are divided into two disjoint classes: "primary sequences" and "alternate sequences." The alignment between alternate and primary sequences is used by SRPRISM to adjust the mapping quality scores when reads align to chromosomal regions that have alternate sequence representations. This also finds mappings that partially cover both the alternate locus and the primary sequence at the junctions where the alternate locus joins the primary sequence. To our knowledge, SRPRISM, BWA-MEM [34] (an algorithm in BWA), and iBWA [35] are the only software that have support for assemblies with alternate loci. However, the alignment information between primary and alternate sequences in the reference needed by the tools differs. BWA-MEM requires base by base alignment information (commonly called "traceback"), but SRPRISM and iBWA only require end-points of alignments.
When aligning paired reads using SRPRISM, the desired range of insert sizes and strand configuration of mates can be specified explicitly. Alternatively, SRPRISM has the ability to discover these parameters automatically.
When reads are aligned globally, SRPRISM offers three schemes for ranking alignments: "minimum errors," "bounded errors," and "sum of errors." In the minimum error mode, the maximum number of errors in alignment of either mate of a read is minimized with minimum number of errors in alignments of both mates as the second key. In bounded error mode, all alignments with number of errors at most the given bound are considered equally good. In sum of errors mode, alignments are ranked using the sum of errors seen in both mates of a paired alignment. Sum of errors limits total number of edits for paired reads and is the same as minimum errors for single reads. Minimum errors mode could be more appropriate when aligning to a very close genome where only a negligible number of differences per read are expected. Bounded errors could be useful for applications assessing number of similar copies for a region in an assembly or doing cross-species alignments.
In addition to aligning reads globally, SRPRISM offers a "partial" alignment mode. In the partial mode, SRPRISM looks for best partial read alignments (i.e., maximizing a certain score function) provided a user-designated trusted region of the read, called the "seeding region," is fully aligned within the maximum number of errors allowed by the length of the specified seeding region. This gives an opportunity to align read sets with known defects, like low-quality tail regions.
SRPRISM has some limitations that must be considered when selecting it as an alignment tool. The maximum number of errors searched for by SRPRISM is limited to 15 errors in the aligned portion of the read. Second, SRPRISM can handle read lengths in the range 192 bp, but because of the limit on the number of errors, SRPRISM is best suited for aligning reads of length up to 250 bp, such as reads generated by Illumina sequencing technology, the dominant technology at this time [36].
An emerging alternative way to capture an organism's genetic variation is by modeling genomes as graphs. Methods and tools to align sequences to graph genomes are described, e.g., in [37][38][39][40][41][42][43][44]. Comparison of SRPRISM to these aligners is outside of the scope of this paper.
In the following sections, we briefly describe the design of SRPRISM software. We also describe the testing methodology including details on benchmark sets and software settings used for comparison to GEM, Kart, STAR, BWA-MEM, Bowtie2, Hobbes, and Yara. We did not include iBWA for comparisons because BWA is the core aligner for both iBWA and BWA-MEM. Technical details, including pseudocode and software optimizations, are available in the supplementary material.

Algorithm
SRPRISM creates an index and stores it once per reference assembly. The index can then be used for multiple searches against that reference. We present the structure of the reference index, steps taken by the search procedure when reads are aligned using the index, properties of results reported, and additional information about the major steps including processing of alternate sequences in reference and key optimizations.

Reference index
SRPRISM indexes words on the positive strand of the reference genome and stores them in a database. An SRPRISM database consists of two main parts. The first part contains compressed sequence data encoded at two bits per nucleotide (A, C, G, and T encoded as integers 0, 1, 2, and 3, respectively) along with metadata that has information about ambiguities, individual sequence location in the database, correspondence between alternate and primary sequences, and a frequency map containing the base 2 logarithm of frequency lf(w) for each word w in the database. This part is always loaded in memory, requiring a relatively small amount of space (e.g., ∼800 MB for GRCh38).
The second part of an SRPRISM database contains the locations of occurrences of unambiguous words of length 16 in the genome, sorted in the order of integer values of those words in the 2-bit per base encoding. This represents the bulk of the database. The SRPRISM pattern of access for this part is strictly sequential; as a consequence, only a small part of the database has to be present in memory at any given time during the search. Additional information for the neighboring words is kept for the repetitive words (see supplementary material for details).

SRPRISM search: bird's eye view
SRPRISM aligns both single reads and paired reads. For ease in describing the alignment procedure, we use the term "query" for user input that can be single reads or paired reads and the term "read" for single reads or individual mates of paired reads.
SRPRISM is designed for batch queries. The batch size can be specified explicitly by the user (in which case the user can also directly control which batches to process) or is otherwise inferred from the amount of available memory. The following subsections describe the processing of a single batch of queries. If there is more than one batch, they are all processed in the same way. Fig. 1 gives a high-level overview of the stages involved in the processing of one batch of queries by SRPRISM.
SRPRISM operates by selecting and sorting a subset of 16-bp words from all reads. This sorted subset is matched against the sorted set of words in the database. These initial matches are then extended via an alignment procedure that is greedy in nature but guarantees optimal results up to the number of errors requested by exploring the search space needed for the guarantee. For queries with paired reads, if a paired mapping is found for the query, no unpaired mappings are reported for the reads in that query. Four search modes are supported: (i) Minimum error mode: For alignments of a query with single read or unpaired alignments of reads in a paired query, the rank is the same as the number of errors in the alignment. For a query with paired reads, the rank of a paired mapping has its first key as the maximum of the number of errors in individual read alignments and its second key as the minimum of the number of errors in individual read alignments. We report alignments with minimum rank.
(ii) Bounded error mode: All alignments for each read with up to the user-specified number of errors are reported.
(iii) Sum of errors mode: For single reads, the rank is the same as the rank in the minimum error mode. For queries with paired mappings, the rank is the sum of the number of errors in the individual read alignments. We report alignments with minimum rank.
(iv) Partial mode: The individual read alignments are assigned a rank equal to the number of read bases in the alignment, excluding the unaligned portion of the read, minus the number of errors. The rank of a paired alignment is the sum of ranks of the individual read alignments. We report alignments with maximum rank.
SRPRISM operates as a global aligner in the minimum error, bounded error, and sum of errors modes. In the partial mode, non-global alignments are allowed.

What is reported
Let N be the user-defined maximum number of mappings to be reported per query (default 10, maximum 254). If the number of best-ranked mappings for a query to primary or an alternate exceeds N, then SRPRISM reports N mappings to the primary and each alternate exceeding the limit; all the best mappings are reported for the rest. When more than N mappings are available for a query to primary or an alternate sequence, mappings with a larger number of mismatches are preferred for reporting to those with gaps. For paired mappings, the number of mismatches for each mapping is taken to be the maximum in the individual read alignments in the mapping.
Let be the length of the seeding region SR as described in the next section. SRPRISM is guaranteed not to miss a mapping that has at most K errors in the seeding region, where K = min (K 0 , K 1 , 15), K 0 is the user-specified maximum number of errors for reported alignments (default 5), and K 1 is defined as follows: Each reported mapping is assigned a mapping quality value in the following way. Let R be the number of mappings found for query q that will be reported. All mappings for q in R will have the same best rank and are assigned the same quality value The computation of the number of mappings R includes adjustments needed for alternate loci, which are described in the section "Alignment to alternate sequences." Q = 0 indicates that more mappings of the desired rank can be found in the database but are not being reported because only N mappings were requested.
SRPRISM supports a "heuristic" mode, where seeding is restricted to words with a frequency less than the user-specified threshold. In this mode, search is done exactly as in the sensitive mode. In the heuristic mode, if SRPRISM determines that it can find the complete set of best ranked mappings for a query, this is indicated in SAM format by a flag XA:i:1. Otherwise the flag is set to XA:i:0. This flag is supported for the minimum error and sum of error modes.

Query analysis and seed selection
In this subsection, we describe how a set of seeds S(r) is selected for a given read r. For a set S of words from r, its combined fre-  quency is defined as F(S) = w ∈ S lf(w). The collection S(r) is selected differently depending on the length (r) of r as illustrated in Fig. 2.
Let w be a word within the read r and s be a subsequence of w. We define E s (w) to be a set of all words that can be obtained from w by introducing one error in s (in the case of insertions/deletions, one letter from r is shifted in or out of w either from the left or from the right).
The seeding region SR for a query can be explicitly specified by the user. Otherwise, the entire read is considered to be the seeding region. All words in S(r) originate from an interval SA(r) ⊆ SR of the read r, called the "seeding area" of r. If SR is less than 48 bases, SA(r) is taken to be the whole of SR. Otherwise, SA(r) is defined in item 1 below, where is the length of SR and K is the number of errors.
1. Case 16(K + 1) ≤ : let W p be the set of K + 1 consecutive nonoverlapping words starting at offset p of SR (see Fig. 2a), where 0 ≤ p ≤ − 16(K + 1). Then S(r) is taken to be one of W p with the least combined frequency. In this case, SA(r) is the interval between the start of the first word and the end of the last word in S(r). 2. Case K = 1 and 32 > ≥ 16: let s be the area of overlap of w 1 and w 2 (see Fig. 2b) andw be one of w 1 and w 2 with the least frequency (w 1 = w 2 when = 16). Then Fig. 2c) with the least combined frequency (w 3 = w 4 when = 32). Letw 1 andw 2 be two overlapping words in W and s be their area of overlap. Let alsow be one ofw 1 andw 2 with the least frequency. Then A deterministic tie-breaking procedure is used in cases where the frequency comparisons, as described above, result in a tie (see supplementary material for details).
The selection of seeds ensures that for any potential alignment of a read r to the database, satisfying conditions 1, there is at least one word in S(r) that aligns exactly (see supplementary material for additional details).
It is possible that the same alignment can be found by extending several seeds in S(r). To avoid duplication of mappings, for each read, SRPRISM keeps track of which seeds it has seen already and will not keep the alignments that could have been found by previously seen seeds.

Processing of queries with single reads
SRPRISM search proceeds by scanning sorted lists of 16-bp seeds from the reads and from the database. For any match found, the locations for the matching word are extracted from the database and alignments of corresponding reads to the database sequences at those locations are attempted. Successful alignments are stored in a temporary file, grouped by query, postprocessed, and reported to the output stream.

Processing of queries with paired reads
For queries consisting of paired reads, alignments are first searched for each read independently, treating each read as a query with single read. If both reads in a query produce mappings, the number of mappings found for each read is within the bound for the number of mappings requested, and a paired alignment in the correct insert size range and with the correct strand configuration can be computed from the read alignments; then paired mappings for such queries are produced from the alignments of reads.
For all remaining queries where both reads produce at least one mapping, frequency information is used to decide which of the two reads in the query is less repetitive. The read that is less repetitive is selected as "master" and the other read is designated as "slave." The single-read search is repeated for the master to find all alignments with up to the specified number of errors. For each candidate alignment found, an attempt is made to align the slave in-place within the insert size and strand configuration dictated neighborhood of the alignment location.
Alignments for reads in queries for which no paired alignments are found are reported as single alignments.

Alignment to alternate sequences
The relationship between each alternate sequence and corresponding primary sequences is available as a file in the assembly release. Specification has mapping information for the end points of alternate sequences to the coordinate space of the primary genome. If the precise mapping of an end point for an alternate sequence to its primary sequence is not known, such end points are marked as "fuzzy." SRPRISM extends alternate sequences over the non-fuzzy end points via segments of the primary sequences of configurable length as shown in Fig. 3. The length of the extension depends on read length and insert size and is determined so as to ensure that alignments overlapping the end point of an alternate sequence are correctly extended to the primary assembly. For any alternate sequence s in the database after such exten- Figure 3: Metadata records for an alternate sequence and the corresponding primary sequence. Alternate sequence is extended in one or both directions with segments from the primary sequence of configurable length (shown in blue) to allow for correct identification of paired alignments. Extension is done for end points where the alignment of alternate to the primary is not fuzzy in the information provided by GRC. The dashed red sequence is the region of the primary sequence that is conceptually replaced by the solid red region from the alternate sequence when adjusting mapping quality scores. sion, we use notation s for the original non-extended alternate sequence (s and s are the same if both ends of the alternate sequence are marked as fuzzy).
We say that an alignment a of read r to an alternate locus sequence s is "proper" if the aligned portion of s overlaps with s . A paired-end alignment is proper for s if at least one of the individual read alignments is proper.
The computation of quality values Q for the mappings corresponding to alternate sequences is described by equation (2), but R is defined differently.
For a query q and alternate sequence s corresponding to the primary sequence s pri , let R pri s be the number of mappings of q to the primary portion of the database, not counting the mappings that overlap the region of s pri replaced by s. Let R s be the number of proper mappings of q to the sequence s. We then define R = R s + R pri s in equation (2) for the purpose of computing the quality of a mapping aligning q to s.

Key optimizations
SRPRISM implements a number of optimizations intended to improve the performance of the search.
For every 16-bp word that appears in the SRPRISM database more than hundred times, the database also contains information about its 16-bp neighbor words. This information is used by the aligner to quickly filter out 32-bp words that do not match the read sequences with up to two errors. This greatly reduces the number of initial matches designated for further extension without violating the guarantees (see supplementary material for details).
For each query, SRPRISM tracks the best alignment rank seen, which allows it to reduce the search space for subsequent alignments and to stop the search early when it can prove that all of the best alignments are already found.
In addition to the above, SRPRISM can be instructed not to use as seeds the 16-mers that appear more than a specified number of times in the reference database. This allows some sensitivity to be traded for performance. Our testing shows that limiting the word frequency to 4,096 can greatly improve SRPRISM run time performance while sacrificing very little in sensitivity. The number of queries that have XA:i set to value 0, indicating that the result is not guaranteed, is also typically a very small percentage of the total number of queries (data not shown).

Operation
SRPRISM is implemented in C++ on Linux OS. We recommend providing 4 GB memory to SRPRISM for alignment to reference genomes similar in size to the human genome. Steps needed for running SRPRISM are creating an index for the reference and searching the index with reads (see supplementary material Tables 1-3 for command lines). The reads can be provided as files in fasta or fastq format, or they can be accessed directly from the SRA.

Methods
Benchmark datasets generated, parameter settings used for different software compared, and the method used for evaluating alignments are presented in this section.

Datasets used for comparison
Paired query sets were created using DWGSIM version 0.1.12 (DWGSIM, RRID:SCR 002342) [29]. Reads of length 100 and 250 bp were generated using primary sequences in GRCh38 with insert size of 500 and 600 bp, respectively. Each set contains one hundred million paired queries. Error rates considered for both read lengths were in steps of 0.5% from 0.5% to 4%, excluding 3.5%. This gives a total of 14 different benchmark sets. Table S1 in the supplementary material has command lines for creating benchmark sets.
Benchmark sets with single queries consisted of the first read of the paired benchmark sets generated above.
For comparing the running time, we generated one million queries at the same read lengths, insert size, and error rates.
Two sets of SRPRISM runs were performed: (i) runs with full sensitivity and (ii) runs where seeds were limited to k-mers occurring at most 4,096 times in the reference database. We refer to these runs as "sensitive" and "fast" runs correspondingly. In both cases SRPRISM was instructed to use 4 GB of RAM. Only the sum of the error ranking scheme was used for comparison.
GEM runs were performed using mapping mode fast and sensitive and are referred to as "fast" and "sensitive," respectively. Kart was run with the option to report multiple mappings. We tested BWA-MEM with two sets of runs: (i) with most parameters set to their default values, which matches the common use; and (ii) with parameters set to closely match SRPRISM. These runs are referred to as "default" and "custom," respectively. Bowtie2 runs were done with --very-sensitive mode and requesting up to 10 mappings per query to estimate the best sensitivity. STAR options for minimum and maximum intron length were set to 2 and 1 bp, respectively, to find unspliced alignments.
The database index for Hobbes was created with the recommended qgram length of 11. All programs that have options to For each benchmark set, cell in bold italic has the best result (lowest number) and cell in bold has the second best result among all methods tested. For SRPRISM sensitive mode, numbers in rows labeled "(check count)" give the number of queries where a mapping at the target number of errors is expected to be found but was not. All such cases were found to be due to an error in the software that reported the target number of errors. Every read for which SRPRISM sensitive mode found the target number of errors but SRPRISM fast mode did not had the XA:i flag set to 0 in the alignment of SRPRISM fast mode to indicate that an exhaustive search was not done on that read in the fast mode. specify insert size were given a range of 10-990 bp. Tables S2-S4 in the supplementary material provide the command lines for each aligner that were used for aligning paired reads, single reads, and creating the index.
Methods that gave the best result on at least one benchmark were GEM, BWA-MEM, and SRPRISM. We performed run time performance comparison using both settings for each of these three methods.

Evaluation of results
For each query q in each benchmark set S, we find the target number of errors for q in S as the minimum sum of errors for any valid alignment inferred from any of the methods tested. Every alignment reported is valid for single queries. Additional requirements for a paired alignment inferred from read alignments for a paired query reported by a method to be valid are that the paired alignment be in proper forward-reverse orientation and within the specified insert size. We use the number of queries that did not find the target number of errors as the criterion for comparing the sensitivity of different methods.
For sensitive SRPRISM runs, we further investigated all queries for which (i) there was a valid result reported by any method, (ii) the read in the single query or both reads in the paired query had at most the number of errors that SRPRISM guarantees to find (5 errors for 100 bp and 14 for 250 bp reads), and (iii) SRPRISM did not report a valid result at the target number of errors.
For each read r in each benchmark set S, the benchmark specifies position P in the genome from where r is generated and the number of errors E introduced. The second measure of evaluation defines the position p of an alignment for r in S to be at an "acceptable position" if and only if p differs from P by at most E. The deviation from P by E positions is to account for potentially equally good alignments at the same location in the genome. We use the number of reads that did not find an alignment at an acceptable position as a criterion for comparing the correctness of alignments reported by different methods. This criterion is also independent of the scoring scheme used by each software package.
All runs for run time performance tests were performed on a 2.2-GHz Intel Xeon E5-2660 CPU, with 128 GB of RAM. Each run was performed 3 times, and the final time was taken as a minimum total user and system time over 3 runs.

Quality of results
For each benchmark set and each method tested, Table 1 reports the number of queries for which a valid result at the target number of errors was not found. The sensitive mode of GEM performed well for 100-bp paired and single queries at all error rates except at 0.5%. The sensitive mode of GEM also performed well at high error rates for 250-bp paired and single queries. Both modes of SRPRISM performed well for paired and single 250-bp queries for error rates up to 2%. The sensitive mode of SRPRISM also performed best at 0.5% error rate for 100-bp paired and single queries. BWA-MEM in custom mode narrowly outperformed GEM in sensitive mode at high error rate of 4% for 250-bp single queries. Hobbes and Yara performed well only for single queries at low error rates.
For all queries where SRPRISM sensitive mode did not report the best result, we verified that either at least one read in the query had more errors than what SRPRISM guarantees to find or the valid alignments giving the target number of errors underreported the number of errors. There were 131 paired queries and 71 single queries across all benchmark sets where underreporting of errors led to SRPRISM sensitive mode not finding the best result. Alignments at the target number of errors for these queries were generated by Kart or BWA-MEM. In the case of Kart, it seems to us to be incorrect reporting of flags or alignment information in the SAM output format. For BWA-MEM, all such alignments had an ambiguous letter in the genome that was not counted as an error.
Hobbes reported a large number of alignments. Sometimes the same alignment was reported tens of times. Yara showed very good performance on single queries but did not perform well on paired queries because it did not find paired alignments within the insert size using the alignments of single reads. The sensitivity of Kart, STAR, and Bowtie2 was poor on our benchmark sets.
For each software package and each benchmark set, Table 2 reports the number of reads for which an alignment was not reported at an acceptable position. These results show that SR-PRISM performed best for up to ∼1.5-2% error rate for 100bp single and paired reads and up to ∼2.5% error rate for 250-bp single and paired reads. GEM performed best at higher error rates. GEM run time performance was most uniform across different error rates while run time for other methods tended to increase with respect to the error rate. GEM in fast mode was the fastest of all methods. However, for 100-bp single and paired query sets, GEM in sensitive mode was the slowest.

Conclusions
We designed SRPRISM for reliable alignment of large volumes of sequences to large genomic databases. Its main strengths are guaranteed sensitivity and features that include support for paired alignments, support for up to 15 errors (including gaps) in alignments, configurable number of reported mappings, and support for alternate loci in the reference assembly. It has a relatively low memory footprint, which makes it suitable for running on most modern hardware even when searching very large query sets against human genome-sized databases. It can also be configured for faster performance at the expense of some sensitivity, and the mappings that are not guaranteed are flagged as such.
We compared SRPRISM performance with GEM, Kart, STAR, Bowtie2, BWA-MEM, Hobbes, and Yara. We found that the fast mode of SRPRISM provides a good compromise between running speed and sensitivity and the sensitive mode of SRPRISM has reasonable speed for sets with low error rates. We also found that changing parameters for both GEM and BWA-MEM can improve sensitivity with a relatively modest increase in running time for BWA-MEM but significant increase in running time for GEM. We showed that Hobbes and Yara do not find all expected mappings and Kart, STAR, and Bowtie2 have poor sensitivity.
SRPRISM software in its current form has room for enhancements and optimizations. The features planned for the future versions include support for concurrency, additional scoring schemes for alignments, and improved processing of ambiguities in reference.
The data presented support SRPRISM being an efficient aligner that has a combination of unique features including explicit guarantees for the result set, support for alternate loci, For each benchmark set, cell in bold italic has the best result (lowest number) and cell in bold has the second best result among all methods tested.
global and partial alignments of reads, and equally efficient handling of both gaps and substitutions in alignments.

Availability of Source Code and Requirements
Project name: SRPRISM

Availability of Supporting Data and Materials
A README and a binary for SRPRISM are available at ftp://ftp. ncbi.nlm.nih.gov/pub/agarwala/srprism.
The files needed for generating index and alignments to GRCh38 human genome assembly are available at ftp://ftp.ncbi.nlm.nih .gov/pub/agarwala/srprism/GRCh38. The files needed for generating index and doing a test run using GRCh38 human genome assembly files above are available at ftp: //ftp.ncbi.nlm.nih.gov/pub/agarwala/srprism/testrun. Snapshots of the code are also available in the GigaScience Gi-gaDB repository [45].

Funding
This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.