Genetically improved BarraCUDA

Background BarraCUDA is an open source C program which uses the BWA algorithm in parallel with nVidia CUDA to align short next generation DNA sequences against a reference genome. Recently its source code was optimised using “Genetic Improvement”. Results The genetically improved (GI) code is up to three times faster on short paired end reads from The 1000 Genomes Project and 60% more accurate on a short BioPlanet.com GCAT alignment benchmark. GPGPU BarraCUDA running on a single K80 Tesla GPU can align short paired end nextGen sequences up to ten times faster than bwa on a 12 core server. Conclusions The speed up was such that the GI version was adopted and has been regularly downloaded from SourceForge for more than 12 months.


Why Run Bioinformatics on Gaming Machines
The explosive growth in Biological datasets has coincided with a similar exponential increase in computer processing power (known as Moore's Law [1]).Before 2005 the doubling of integrated circuit complexity every 18 months, went hand-in-hand with doubling of computer processor clock speeds.However in the last ten years clock speeds have increased little.This has not (as yet) limited the exponential growth in Bioinformatics datasets and hence processing demand.Fortunately Moore's Law has continued to apply to the number of transistors per silicon chip.Whilst some of these extra transistors have been used to support more powerful computer instructions, largely they have been and will continue to be used to support parallel computing.In 2005 a typical computer contain one CPU, nowadays quad code (i.e. 4 CPUs) are common place with 6, 8 and 12 cores also being available.This trend will continue.
Modern consumer applications demand high quality and instant response.With user interfaces containing millions of display elements (pixels) and thousands of input sensors, the only practical approach has been parallel processing.Rather than using several CPUs, hardware dedicated to graphical displays typically contains hundreds or even thousands of processing elements.As each each pixel is processed in the same way, the graphics processing units (GPUs) can take short cuts in the hardware.For example, since each of the hundreds of pixel processing programs is doing exactly the same thing, the logic to decode program instructions can be shared.This means the transistors used to decode the program actually drive many streaming processing cores (rather than just one).The research and development of these specialised but highly parallel graphics accelerator cards has been paid for largely by the consumer gaming market.One of the main players in this market is nVidia.They have sold hundreds of millions of their GPUs.(These GPUs are capable of running CUDA, which is nVidia's general purpose framework for programming their GPUs.It is used by BarraCUDA.)About the time of the end of the serial processor clock speed boom, computer scientists and engineers started to treat GPUs as low cost but highly parallel computers and started using their GPUs for general purpose computing (GPGPUs [2]).This trend continues.Indeed GPGPU has been combined with some enormous volunteer user cloud systems.For example, much of the raw computer power used by the SETI@HOME project is actually derived from GPUs within domestic PCs.Another aspect of GPGPU, has been the introduction by both nVidia and Intel of "screen less" GPUs, where the hardware is dedicate to computing applications rather than computer graphics.Indeed today half of the ten fastest computers on the planet are based on GPUs (http://www.top500.org/May 2015).
Bioinformaticians have not been slow in seizing the advantages of GPGPU programming.CUDA versions of several popular applications have been written.However, as with other branches of super computing, it is often not easy to write code to gain the best of parallel machines.For example, often parallel applications are limited not by the processing power available but by the time taken to move data inside the computer to the processing elements.
At the forthcoming GECCO conference [3] we shall present an approach in which a small part of the manually written code has been optimised by a variant of genetic programming [4,5] to give a huge speed up on that part.(The raw graphics kernel can process well over a million DNA sequences a second [3, Fig. 1].)The next section will describe the target system, BarraCUDA [6].Section 3 gives details of the programs and DNA benchmarks.In particular the standard GCAT Bioinformatics DNA sequence alignment benchmarks [7] and short human paired end next generation DNA sequences taken from The 1000 Genomes Project [8].This is followed (Section 4) by the overall performance changes genetic improvement [9,10,11,12,13,14] gives and comparison with bwa.(See particularly Table 3, page 6.) 2 NextGen DNA Sequence Alignment Since the human genome was sequenced in 2000 [15], increasingly powerful nextGeneration sequencing machines have generated vast volumes of short noisy DNA sequences.Initially sequences where only 30 or so bases long.(The sequences are stored as strings of the four letters A, C, G and T. Each character representing one base.)Where the genetic sequence is variable, simple statistics (i.e., 4 30 length of the genome) that suggest 30 or so bases would be sufficient to identify where the sequence lies in the reference genome.Whilst these data are inevitably noisy, the main difficulty with this approach is that (in particular) the Human genome contains many repeated sequences.Thus sometimes 4 30 can only identify the repeated pattern not the location itself.This lead to 1) longer sequences but also 2) sequencing both ends of much longer sequences.The second (paired end) approach requires more sophisticated computer algorithms.Each end is matched against the reference genome as before.When an end lies in a repeated sequence, and so gives multiple match points, the matches the other end gives are consulted.As the approximate length of the DNA sequence is known (or can be inferred), in many cases potential matches for the two ends can be ignored as they are simply too far apart.Paired end analysis is now typical.Whilst Barracuda can deal both with single ended and paired end DNA sequences, we shall only benchmark paired end data.
BarraCUDA uses the Burrows-Wheeler algorithm (BWA) [16].Indeed it source code is derived from a serial implementation of the BWA algorithm, simply called bwa.Barracuda gets its speed by using a GPU to processing hundreds of thousands of short DNA sequences in parallel.Typically finding where each DNA sequences matches the reference human genome is the most time consuming part.With paired end (pe) data, Barracuda "aln" matches each end separately and then Barracuda "sampe" combines them.Thus Barracuda sampe does not need a GPU (although it, unlike bwa sampe, can exploit multiple CPU cores).
With noise free data and where the DNA sequence matches the reference genome exactly, the Burrows-Wheeler algorithm is relatively straight forward.Before hand, offline, the reference genome is encoded into a compressed format so that all the sequences in the reference genome with the same starting subsequences are given the same location in the compressed file.Since there are four possible bases, extending the prefix sequence by one means this location leads to four subsequences prefix strings which are one base longer.However as the reference genome is finite, the branching factor quickly falls from four to one.If a prefix sequence can be followed by exactly one prefix which is one base longer, this means all the sequences with this particular prefix have the same base in the next position.The index file is arranged to enable rapid sequence look up.Barracuda and bwa index files are interchangeable.On look up, an upper and a lower pointer into the index file are kept.They span all possible matches, and so are initially far apart.As each base in the DNA sequence is processed, data are read from the index file and the position of the two pointers are updated.Actually the distance between them is the number of positions in the reference genome which match the DNA sequence processed so far.If the distance becomes one, then there is a unique match.If the two pointers cross this means the sequence does not exist in the reference genome.In good quality data from the 1000 Genomes Project, about 85% of sequences match uniquely.Where sequences do not match, this may be either due to noise in the data or to real mutations in the patient.To cope with non-exact matches, the algorithm must carefully back up its search and start trying out alternatives.This slows things down considerably.
For the approach to be feasible, Barracuda must load the whole of the index file into the GPU's memory.
Thus the GPU must have enough memory to hold it all.For the human reference genome, this means the GPU must have at least four gigabytes of on-board RAM.Also the Burrows-Wheeler algorithm does not allow short cuts.I.e., every base in the sequence must be processed.Thus, even before considering mismatches, Barracuda must make heavy access to the index.Fortunately modern GPUs have high bandwidth to their on-board memory (see last column in Table 1).
Typically the Burrows-Wheeler algorithm scales linearly with the length of the DNA sequences to be looked up.This makes it more suitable for shorter sequences than for longer ones.
Taking The 1000 Genomes Project as an example, [17,Fig. 4] shows some sequence lengths are much more common than others.In Section 4 we report tests on paired end data comprised of 36 bases per end and of 100 bases per end.Both are common in The 1000 Genomes Project.In fact the most popular is 101 bases, which is almost the same as one of the benchmarks provided by BioPlanet's Genome Comparison and Analytic Testing (GCAT) platform [7].
3.2 Barracuda 0.6.2 For comparison, the previous version of BarraCUDA, i.e. 0.6.2, was compiled with default settings (i.e.again including support for multi-threading).
Again it was built with default setting (including support for multi-threading).However a second version was built specifically for the GT 730 which was compiled with -arch 2.1 to support compute level 2.1, cf.column 4 in Table 1.(The default is now compute level 3.5 or higher).

36 base pairs: 1000 Genomes project
The 1000 Genomes Project [8] has made available a vast volume of data via its FTP site.One of its normal (i.e.not color space encoded) paired end data with 36 DNA bases per end was chosen at random (ERR001270) and downloaded in compressed form using wget.ERR001270 consists of two files (one per end of the DNA sequence) each containing 1.1 gigabytes (compressed).ERR001270 contains 14 102 867 36 base DNA sequences.Approximately 5.7% of sequences occur more than once in ERR001270.The files are in "fastq" format and so also contain quality values but these are not used by bwa or by either version of Barracuda.
Initially bwa objected to the sequence names provided by The 1000 Genome Project.However this was readily resolved so that each pair of sequences had its own unique name.

100 base pairs: GCAT Benchmark
BioPlanet.comhosts several sequence alignment and variant calling next generation DNA benchmarks on its GCAT web pages [7].We report results on their 100bp-pe-small-indel alignment benchmark (gcat set 037).This consists of two files (one per end) each of 3 gigabytes (uncompressed), each containing 5 972 625 100 base sequences.(Less than 0.1% of sequences were repeated.)The files are again in "fastq" format but only contain dummy quality values however again these are not used by bwa or by Barracuda.

Parallel operation on multi-core CPUs with bash pipes
Barracuda and bwa have similar operations and command lines.For paired end data, "aln" is run separately for each fastq sequence file (Sections 3.5 and 3.6 above).For every fastq sequence, "aln" produces zero or more possible alignments in the reference genome (Section 3. .sai Figure 2: Processing paired end DNA sequences."aln" is run twice (once per end) and its alignments are piped (red arrows) into "sampe" (sam (pe) paired end)."sampe" also reads the index of the reference human genome and both ends of each DNA sequence in order to give the combined alignment in sam format.In the case of barracuda, the two "aln" process each use a GPU and "sampe" uses multiple host threads.For bwa "aln" uses multiple host threads but "sampe" is single threaded.
files are generally not compatible between different versions).For each pair of sequences, "sampe" takes the alignments from the .saifiles, the sequences themselves and the index file for the reference genome to produce alignments for each end in sam format.
Notice the .saifiles are intermediate and can be deleted after the .samfile has been created.However the .saifiles are large.(E.g.830 mega bytes for Barracuda 0.6.2 on the 1000 Genomes Project example and 340 mega bytes for the GCAT example.)Since the .saifiles are both written to and read sequentially, under Unix using bash, it is not necessary to explicitly store them.Instead "aln" can write them to a unix pipe and "sampe" can read them from those pipes (see Figures 2 and 3).With large memory multi-core CPUs it is quite feasible to run both "aln" processes and the "sampe" process in parallel (see Table 2).
The sam files are plain text and also large, nearly 6 GB for ERR001270 and well over 4 GB for the GCAT benchmark.GCAT uses the compressed binary format bam, even so gcat set 037.bam is almost a gigabyte.gcat set 037.sam was converted to gcat set 037.bam by samtools.The time for this post processing (a few minutes) is not included in Table 3.

Problems and Work Arounds
A single GeForce GT 730 was available.It was mounted in a desktop linux PC with 4 GB of RAM.This was enough to run "aln" but not "sampe".(The .saifiles could be transferred to a much larger linux server to run "sampe".)Typically in Barracuda "sampe" takes little time and on a typical large multi-core server can be run in parallel with the two "aln" processes with little impact on total wall clock time (see Figures 2  and 3).For ease of comparison the data in Table 3 are an estimate for two GT 730's mounted in a large multi-core CPU.They are calculated from the wall clock time of the slowest of the two .saifiles.
Table 3: Mean number of paired end sequences processed per second.In (brackets) speed relative to bwa 0.7.12.± gives standard deviation estimated from five runs.There was almost no variation in accuracy reported by GCAT

Results
bwa and the original and the GI improved version of Barracuda were each run five times on both the fourteen million real world paired end DNA sequences from The 1000 Genomes Project (Section 3.5) and the almost six million paired end DNA sequences provided by GCAT as a benchmark (Section 3.6).bwa was run on 12 core 2.60 GHz CPUs (see Table 2) whilst Barracuda was run on three GPUs, stretching from £50 low end GT 730 to the top of the range K80 Tesla (see Table 1).The results are summarised in Table 3.
Apart from the low end GT 730, Barracuda is typically between two and ten times faster than the current release of bwa on a 12 core CPU (see figures within round brackets in the upper part of Table 3).The lower part of Table 3 presents the Barracuda data in the upper part as ratios between the previous release of Barracuda (0.6.2) with the current genetically improved [3] version (0.7.107).Table 3 shows the newer version is up to three times faster on the real world DNA sequences (36bp) and typically about 10% faster on the longer benchmark strings (100bp).

Discussion
bwa "sampe" does not support multiple host threads.As shorter sequences are expected to give rise to more duplicate matches and so give "sampe" more work to do, this may explain why bwa performs relatively badly on the short 36 bp 1000 Genomes Project data (see row bwa 36bp in Table 3).On the 1000 Genomes Project example (36bp), even a lowly GT 730 running Barracuda 0.7.107 can beat bwa on at 12 core super computer node.On the longer GCAT benchmark (100bp), one GT 730 is about quarter of the speed of the 12 core computer.However both the pair of K20s and the K80 are faster than bwa on both real world and GCAT examples.
On the GCAT benchmark, the new version of Barracuda is more accurate than 0.6.2 and approaches the accuracy of bwa 0.7.12.
The variability of run time on the super computer (Table 3 columns 4, 10 and 14) is typical of use on shared disk systems.In contrast, typically elapse times of CUDA kernels are very consistent (cf.Table 3 column 6).We anticipate slightly higher and but more stable performance could be achieved on the super computer by placing files on local or ram disks.(Perhaps a couple of percent improvement may be obtained by using a ramdisk.)However ignoring the time to transfer data files within the super computer might give unrepresentative results.

Conclusions
The new version of Barracuda, particularly on large real world examples, is a substantial improvement.Details of the genetic improvement process used to both tune its parameters and code will shortly be presented at the GECCO conference [3].Both the new version and the genetic improvement process are freely available.(The genetically improved version of BarraCUDA has been in use via SourceForge since 20 March 2015.The GI code may be down loaded from the author's FTP site from file gp-code/barracuda gp.tar.gz.) Depending upon examples, even a £50 GPU running Barracuda can be faster than bwa on a twelve core 2.60 GHz CPU.With a top end nVidia Tesla GPU, Barracuda can be more than ten times faster than bwa on a 12 core CPU.

Figure 1 :
Figure 1: Exponential growth in peak processing power.Data from nVidia 4) and saves them in a binary .saifile.(.sai

Table 1 :
GPU Hardware.Year each was announced by nVidia in column 2. Price (column 3) is either actual (GT 730) or on line quote (May 2015, which may be lower than original list price).Fourth column is CUDA compute capability level (as can be used with the nvcc compiler's -arch parameter).Each GPU chip contains 2, 13 or 15 identical independent multiprocessors (MP, column 5).Each MP contains 48 or 192 stream processors (total given in column 7) whose clock speed is given in column 8. Onboard memory size and bandwidth are given in the right most two columns.ECC enabled.

Table 2 :
CPUs.The desktop computer houses one GT 730.The servers are part of the Darwin Supercomputer of the University of Cambridge and hold multiple Tesla K20 or K80 GPUs.

Table 2 b
Estimated for two GT 730 GPUs.See Section 3.
8Barracuda version 0.6.2beta"sampe" failed on ERR001270 if run with multiple threads.The only work around was to used -t 1 to prevent use of multiple threads.(Bugfixed before version 0.7.107.)There is a small HTML web page which lists a few issues (some open but several now fixed) and common misunderstandings about Barracuda at http://www.cs.ucl.ac.uk/staff/W.Langdon/barracuda/.