Accuracy, speed and error tolerance of short DNA sequence aligners

Aligning short DNA sequence reads to the genome is an early step in the processing of many types of genomics data, and impacts on the fidelity of downstream results. In this work, the accuracy, speed and tolerance to errors are evaluated in read of varied length for six commonly used mapping tools; BWA aln, BWA mem, Bowtie2, Soap2, Subread and STAR. The accuracy evaluation using Illumina-like simulated reads showed that accuracy varies by read length, but overall BWA aln was most accurate, followed by BWA mem and Bowtie2. BWA mem was most accurate with Ion Torrent-like read sets. STAR was at least 5 fold faster than Bowtie2 or BWA mem. BWA mem tolerated the highest density of mismatches and indels compared to other mappers. These data provide important accuracy and speed benchmarks for commonly used mapping software.


Introduction
The development and adoption of high throughput sequencing (HTS) in the past 15 years had brought about a new era in genomics. Illumina sequencing in particular has been a major contributor to public DNA sequence databanks and continues to be the platform of choice for large scale genomics projects [1,2](Abecasis et al, 2012; Peplow 2016 . Aligning short reads to the genome is an early data processing step and impacts the quality of downstream findings. There are a range of sequencing applications suited to short or longer reads. For example, relatively short reads of 50 to 100 nt are commonly used to map epigenetic modifications with ChIPseq. However, exome and whole genome and metagenome sequencing commonly utilises pairedend format with read lengths of 100 nt or longer. Reported maximum read lengths are 250 nt for HiSeq2500 Rapid and 300 nt for MiSeq. As paired end mode is available in all these, length of merged pair sequences of up to 600 nt can be obtained Genomic alignment has its challenges, particularly the presence of sequence polymorphisms, genomic repeats, and the large size of some genomes such as human. Accuracy in terms of precision (specificity) and recall (sensitivity) are important in genomics applications to limit false negative and false positive findings. Aligners must therefore be aware of genomic repeats, robust to sequence polymorphisms and efficient enough so that the huge volumes of DNA sequence data can be processed in a reasonable computational time. and human (~3.1 Gbp). The speed and robustness to sequence errors including mismatches, insertion and deletions is also evaluated. These findings will be relevant for many investigators looking to optimise short read mapping procedures.

Aligner speed evaluation
The speed of aligners was assessed by determining the time taken to process read sets of 1

Accuracy of mappers with Illumina like reads of various length
In order to ascertain which aligners are most accurate for read lengths currently generated by Illumina systems, read sets with lengths 50, 100, 200 and 480 bp were generated from Arabidopsis and human templates, mapped to the respective genomes and the number of correct and incorrectly mapped reads was quantified. Overall accuracy was evaluated using the Fmeasure. As the Fmeasure is dependent on the applied mapping quality (mapQ) threshold, we calculated Fmeasure at a range of mapQ values for 50 bp reads ( Figures 1a,b ) and demonstrate that a mapQ threshold of 10 was appropriate for these six mappers. This mapQ threshold is used throughout the paper.

Accuracy of mappers with Ion Torrent like reads of various length
In order to investigate which mappers were best for processing Ion Torrent sequence reads, we performed read simulation using the DWGSIM tool

Aligner speed
In light of the increasing volume of data being produced from high throughput sequencers, alignment speed may be a factor in the selection of a mapper. The throughput at which reads were processed on a standard server allowing up to 30 threads was determined using Illuminalike read sets as above in Arabidopsis and human ( Figure 3 )

Robustness with regards to mismatches
Mismatches are common in all sequencing data and may be the result of sequencing errors or genetic variation. Accurate mapping of mismatch containing reads enables better resolution of genetic differences and reduces the number of unmapped and unused reads. To quantify the robustness with regards to single nucleotide mismatches (SNM), perfectly matching read sets were first generated, followed by incorporation of up to 16% SNMs in reads with length 50 to 500 nt.
After mapping, the Fmeasure was determined and plotted as a function of SNM frequency ( Figure   4 ). BWA mem was the most robust to SNM in reads ≥100 nt, recording F measures >0.9 even with 16% SNMs. Bowtie2 and STAR were the next best, followed by BWA aln then Subread and lastly Soap2. The tolerance of BWA aln and Soap2 to SNM diminished with read length >100 nt, whereas BWA mem was more tolerant to SNM in longer reads. With 50 nt reads, STAR was the most accurate with SNM rich reads. These results demonstrate that BWA mem accuracy is superior to other mappers with mismatch rich reads of 100 nt or longer.

Robustness with regards to indels
Indel errors can be relatively common in sequencing data, depending on instrument type. For instance the Ion Torrent and Roche 454 machines yield indel rates much higher than Illumina systems due to differences in sequencing chemistry [7] (Loman et al, 2012) . Indels also occur randomly by mutation, accumulating at 1/8th the rate of mismatches according to studies of the  Wei et al, 2013) . To assess mapper tolerance to indels, insertions or deletions were incorporated into reads at a frequency up to 16%. The mapping results for indel containing human reads are shown in Figure 5. Results for Arabidopsis were virtually identical (not shown). Overall, the effect of indel incorporation was more deleterious to accuracy as compared to SNM. The effect of insertions and deletions was quite similar, and this was consistent in Arabidopsis and human tests. Soap2 was least tolerant to indels, followed by BWA aln. Bowtie2, Subread and STAR were moderately tolerant to indels, while BWA mem was clearly the most tolerant to indels.            Figure 4. Performance of aligners with reads containing single nucleotide mismatches. The F0.25 measure is used as a quantitative measure of precision and recall.