NGSeasy: a next generation sequencing pipeline in Docker containers [version 1; referees: 3 approved with reservations]

Motivation: Bioinformatic pipelines often use large numbers of components, and deploying them incurs a substantial configuration and maintenance burden that remains a significant barrier to reproducible research. Our aim is to define a new paradigm and best practices for developing, distributing and running pipelines encapsulated in Docker containers (lightweight virtualization), with a focus on next generation sequencing (NGS) workflows. This approach provides several advantages, namely: efficiency, portability, versioning and reproducibility. Using the NGSeasy pipeline, a user can quickly deploy any pipeline version in any environment (e.g. operating systems, workstations, clusters, clouds). While this might also be achieved with a virtual machine (VM), VMs lack portability, have substantial overhead (disk, CPU, RAM), and require allocated resources to be provisioned statically; Docker, to a large extent, solves these issues.
Results: We demonstrate best practices for packaging and execution of a multicomponent pipeline for NGS using a set of container building blocks which are versioned, modular and reusable. We present a basic "proof of concept" evaluation of a next generation sequencing pipeline in Docker containers, capable of producing meaningful results that are comparable with public and "best practice" workflows, with little to no impact on standard computing performance.
Availability: Both versioned Dockerfiles and container images for each component are published on GitHub and Docker Hub, respectively. The pipeline and containers can be pulled from Docker Hub and executed on any environment capable of running

Bioinformatic pipelines are frequently composed of large numbers of loosely coupled pieces of software, each tool requiring substantial configuration, maintenance and management of dependencies. Historically, to facilitate packaging and reuse of pipelines, management frameworks such as Galaxy [1], Ruffus [2], and Taverna [3] have been developed. While these workflow management systems work well, portability and deployment complexity limit their usability.
Our primary motivation for developing NGSeasy was to simplify pipeline deployment for academic and clinical labs, minimising the burden of informatic support. To achieve this, we used Docker [4], an emerging container-based virtualization technology. Compared to virtual machines, Docker containers are simply a set of processes running in a multi-tenanted Linux host kernel, so they are very lightweight as there is no underlying machine to emulate. These containers capture the initial investment of effort to build and configure them, greatly facilitating re-use. They can be easily extended to modify or incorporate new components, and shared on private or public (Docker Hub) registries.
Using NGSeasy and Docker, bioinformaticians and, more importantly, researchers with fewer bioinformatic skills can very quickly deploy the pipeline to different environments (e.g. development, testing and production), with the knowledge that the containers should always run consistently. Furthermore, we support multiple versions of the NGSeasy containers on Docker Hub. As each container packages its own dependencies and is versioned, the fidelity of the analysis is preserved in future execution, a requirement for reproducible research and clinical auditing [5].

Dockerising an NGS pipeline
NGSeasy has provided us with the opportunity to start defining and thinking about best practices for building Dockerised modular pipelines. Many of these practices have been adapted in our images. Our compbio/ngseasy-base image forms the foundation layer on which each pipeline container application is built.
We include what we think of as some of the best and most useful NGS "power tools" in the compbio/ngseasy-base image (Table 1). These are all tools that allow the user to manipulate BED/SAM/BAM/VCF files in a variety of ways.
Our feature-rich base image allows pipes and streamlined system calls for manipulating the output of NGS pipelines, namely BED/SAM/BAM/VCF files. We therefore built these tools into a single development environment for NGSeasy. This image is used as the base for all of our compbio/ngseasy-* tools.
A more Docker-esque approach would be to have separate containers for each NGS tool. However, this overlooks the fact that many of these tools are required to interact, e.g. through pipe calls, when used as part of a streamlined pipeline.
Many of the raw NGSeasy images are fairly heavy (2-4GB). As a result, we flattened all images in order to compress multiple Docker layers into one, creating an image with fewer and smaller layers, before committing and pushing to Docker Hub.
With the exception of the content built into the base image, each NGSeasy pipeline component (Table 2) is encapsulated in a separate container. Using separate containers helps to minimize container size, reduce unexpected interactions between components, and maximise the re-usability of containers.

Results and discussion
Overview of the NGSeasy pipeline
A typical NGS pipeline for variant calling and discovery involves the following steps, all of which are implemented in the current version of NGSeasy (1.0-r001):  options to include the Genome Analysis Toolkit (GATK) recommendations for de-duplication (using Picardtools MarkDuplicates), GATK's base quality score recalibration (BQSR) and GATK's realignment around indels [18][19][20].
We also include alternatives to GATK's BQSR and indel realignment tools, specifically BamUtil's recab function (http://genome.sph.umich.edu/wiki/BamUtil:recab) and, for indel realignment, glia (https://github.com/ekg/glia). These options are provided for use in commercial and/or clinical laboratories who do not want to use or pay for a GATK licence.

Operation and implementation
Containerised software is automatically deployed, so we have opted to provide a wide variety of tools, including multiple tools for alignment and variant calling where available.
To keep the NGSeasy pipeline small and portable, input files, indexed reference genomes and generated output should bypass the container's root file system, instead using a host-mounted directory or volume (Figure 1). In certain instances it may be necessary to inspect a running container; this can be done by injecting a new process (e.g. a shell terminal) into the container with the docker exec command, a valuable feature for debugging or monitoring. For resource allocation, Docker uses cgroups to control memory and CPU allocation (hard or soft allocation).
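As a minimal sketch of the points above (not the NGSeasy wrapper code itself), the following Python helpers assemble the command lines for a container that reads and writes through a host-mounted directory, runs under cgroup-enforced memory and CPU limits, and can be inspected with docker exec. The image name compbio/ngseasy-bwa, the container path /data and the default limits are illustrative assumptions; the flags used (-v, --memory, --cpus, --rm, -i, -t) are standard options of the current Docker CLI, and --cpus requires Docker 1.13 or later.

```python
import shlex


def docker_run_argv(image, tool_args, host_dir, container_dir="/data",
                    memory="8g", cpus="4"):
    """Build (but do not execute) a `docker run` command line.

    `container_dir`, `memory` and `cpus` are hypothetical defaults chosen
    for illustration; they are not NGSeasy settings.
    """
    return [
        "docker", "run", "--rm",
        # Host-mounted volume: inputs, reference genomes and outputs
        # bypass the container's root file system.
        "-v", f"{host_dir}:{container_dir}",
        # Hard memory limit, enforced by the kernel through cgroups.
        "--memory", memory,
        # CPU limit, also enforced through cgroups (Docker >= 1.13).
        "--cpus", cpus,
        image, *tool_args,
    ]


def docker_exec_argv(container, command=("/bin/bash",)):
    """Build a `docker exec` command that injects a new process
    (by default an interactive shell) into a running container."""
    return ["docker", "exec", "-i", "-t", container, *command]


if __name__ == "__main__":
    # "compbio/ngseasy-bwa" is an assumed image name used only as an example.
    run = docker_run_argv("compbio/ngseasy-bwa", ["bwa", "mem"],
                          "/media/scratch/ngs_projects")
    print(shlex.join(run))
    print(shlex.join(docker_exec_argv("my_running_container")))
```

Returning the argument list rather than executing it keeps the sketch testable on machines without Docker; the list can be passed to subprocess.run when a Docker daemon is available.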
The container images are only provided for software which is freely available. For software components which require registration (e.g. GATK), or are proprietary (e.g. novoalign), we provide a short Dockerfile to complete the build with the additional components which the user must acquire. We believe this is a pragmatic solution for packaging and publishing pipelines that provide the option to use components with a restricted licence. In this way we provide maximum automated deployment with the minimum burden on the end user.
NGSeasy consists of a set of shell (bash) script wrappers that orchestrate and call all parts of the Dockerised NGS pipeline; the system calls are to docker run -i -t NGSTool instead of /bin/bash NGSTool, for example. Docker is agnostic, however, in that any workflow management software can be used to orchestrate a Docker-based pipeline (e.g. Ruffus [2] or Nextflow [32]).
Our design choice was largely influenced by our desire to provide a lightweight and fairly dependency-free solution that is "easy" to set up and maintain. We did not want the user to be tasked with installing a large number of software dependencies before being able to run NGSeasy. In this way, NGSeasy takes advantage of the fact that any modern computer, running any operating system with Docker (or, for example, boot2docker, https://github.com/boot2docker/boot2docker-cli) installed, will come pre-packaged with all of the basic software needed to run an NGS pipeline.
NGSeasy gives the user several options to call a complete NGS pipeline, going from raw FASTQ files to aligned BAM files, variant calls (VCF) and annotations using a range of software. All options are defined in a simple configuration file that can be made, for example, using any spreadsheet application, and then saved as a tab-delimited text file. With this, the user is able to choose from a wide selection of sequence aligners and variant callers (see Table 2).
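To illustrate how a tab-delimited configuration file of this kind could be read and checked before a run, here is a small Python sketch. The column names (SAMPLE_ID, FASTQ1, FASTQ2, ALIGNER, VARCALLER) and the example rows are hypothetical and do not reproduce the actual NGSeasy configuration format; the aligner and variant caller names are those mentioned in this article.

```python
import csv
import io

# Hypothetical configuration: one pipeline run per row, tab-delimited,
# as it might be exported from a spreadsheet application.
EXAMPLE_CONFIG = (
    "SAMPLE_ID\tFASTQ1\tFASTQ2\tALIGNER\tVARCALLER\n"
    "sample01\tsample01_1.fq.gz\tsample01_2.fq.gz\tbwa\tfreebayes\n"
    "sample01\tsample01_1.fq.gz\tsample01_2.fq.gz\tsnap\tplatypus\n"
)

# Tool names taken from the article text.
KNOWN_ALIGNERS = {"bwa", "stampy", "snap", "novoalign", "bowtie2"}
KNOWN_CALLERS = {"freebayes", "platypus"}


def read_config(handle):
    """Parse a tab-delimited config and reject unknown tool names early."""
    rows = []
    reader = csv.DictReader(handle, delimiter="\t")
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["ALIGNER"] not in KNOWN_ALIGNERS:
            raise ValueError(f"line {line_no}: unknown aligner {row['ALIGNER']!r}")
        if row["VARCALLER"] not in KNOWN_CALLERS:
            raise ValueError(f"line {line_no}: unknown variant caller {row['VARCALLER']!r}")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for run in read_config(io.StringIO(EXAMPLE_CONFIG)):
        print(run["SAMPLE_ID"], run["ALIGNER"], run["VARCALLER"])
```

Validating tool names when the file is read, rather than when a container is launched, surfaces typographical errors in the configuration before any compute time is spent.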
The NGSeasy scripts enforce specific naming conventions and directory structures upon the user, allowing sensible and reproducible organisation of NGS projects and associated data on the user's local machine. This also avoids all of the potential issues with typographical errors that are typical of manual input.
All NGSeasy applications are run as a non-root user within each container. This is hard coded in the NGSeasy ecosystem and provides some security for Docker containers running in shared computing environments.
For useful cutting edge discussion and testing of NGS pipelines, we also refer readers to the Blue Collar Bioinformatics site at http://bcb.io/.

Getting and running NGSeasy
All Dockerfiles used to generate the NGSeasy images are available at https://github.com/KHP-Informatics/ngseasy along with documentation on installing and running NGSeasy. The pre-built containers are available to download from https://registry.hub.docker.com/repos/compbio.
Getting and running NGSeasy is simple and outlined in the code block below. Users should note that deploying the pipeline containers is fairly fast, depending on network speeds; however, downloading the reference genomes and test datasets for the resources folder can take a while. For example, the install time averages about 94 min on machines connected to relatively fast networks (i.e. > 500 Mbit/s).

System requirements
See Table 3 for our recommended system requirements. The hard disk requirements are based on our experience, and result from the fact that the pipeline/tools produce a range of intermediary and temporary files for each sample. The full NGSeasy install includes indexed genomes for hg19 and b37 for all aligners, annotation files from the GATK's resource bundle (ftp://ftp.broadinstitute.org/bundle) [34], and all of the NGSeasy Docker images.
Based on our experience, a basic NGS computing system for a small lab would consist of at least 4TB disk space, 60GB RAM and at least 32 CPU cores. Network speed is a major bottleneck when dealing with NGS-sized data, and groups are encouraged to think about these issues before embarking on multi-sample or population-level studies, where computing requirements can very quickly escalate and transferring NGS data between sites becomes a major rate-limiting step.
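A back-of-the-envelope estimate makes the network constraint concrete. The sketch below converts a data size and a nominal link speed into an idealised transfer time; the 100 GB figure is a hypothetical example, not the size of the NGSeasy resources, and the calculation ignores protocol overhead, disk speed and shared-link contention, so real transfers will be slower.

```python
def transfer_minutes(size_gb, link_mbit_per_s):
    """Idealised transfer time in minutes.

    Uses decimal units: 1 GB = 1e9 bytes, 1 Mbit/s = 1e6 bits per second.
    """
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbit_per_s * 1e6)
    return seconds / 60


if __name__ == "__main__":
    # Hypothetical 100 GB dataset over links of different nominal speeds.
    for speed in (500, 100, 20):
        print(f"{speed} Mbit/s: {transfer_minutes(100, speed):.0f} min")
```

Because the time scales inversely with link speed, a transfer that is a minor wait on a fast institutional network can take many hours on a typical office or home connection.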

Genome comparison and analytic testing
We tested basic NGSeasy functionality (going from raw .fastq to .bam to .vcf) on an Illumina 100bp paired end whole exome (30x coverage) dataset available from GCAT: Genome Comparison and Analytic Testing, an analytical framework for optimizing variant discovery from personal genomes (http://www.bioplanet.com/gcat). For more details about GCAT, please refer to [35].
For this report, a basic/fast "non-GATK" based pipeline was tested.
We skipped FASTQ quality control trimming, re-alignment around indels and BQSR. The selected pipeline first runs FastQC on the raw data, followed by read alignment using all of the selected aligners: stampy, snap, novoalign, and bowtie2. All reads were aligned to the UCSC hg19 reference genome available at http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/.
The alignment stage outputs a duplicate-marked (samblaster), sorted and indexed BAM file (sambamba), annotated with the appropriate read group information (e.g. sample name, platform unit etc). The alignment stage also includes generation of basic alignment statistics using sambamba's flagstat function, and a bed file of aligned regions using the bedtools function bamtobed; these extra steps are reflected in the average run times for NGSeasy's alignment stage (see Table 4). Note that stampy alignment is contingent on aligning reads with bwa first, and hence we chose not to report separate results for bwa.
Variant calling was performed using the haplotype-based variant callers Platypus [31] and FreeBayes [30], and the resulting VCF files were uploaded to the GCAT server for comparison to the Genome in a Bottle (GIB) call set [36]. The GCAT results for the tests listed above are available at the following urls: 1. All aligners + FreeBayes: http://goo.gl/G9tHRK.
A full discussion on GIB performance statistics is beyond the scope of this paper. Briefly, for the 30x whole exome dataset, NGSeasy is achieving GIB sensitivities and specificities of 81.1-85.8% and 99.996-99.998%, respectively. There are obvious gains to be made by further pipeline optimisations, and the planned inclusion of structural variant callers and variant re-calling and filtering options.
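For readers less familiar with these statistics, the sketch below computes sensitivity, specificity and precision from confusion-matrix counts using their standard definitions. The counts are hypothetical and chosen only for illustration; they are not the GCAT results. The example shows why specificity sits very close to 100% in variant calling: true negatives (positions correctly called as non-variant) vastly outnumber every other category.

```python
def sensitivity(tp, fn):
    """Fraction of true variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-variant sites left uncalled: TN / (TN + FP)."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Fraction of calls that are true variants: TP / (TP + FP)."""
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not taken from GCAT).
    tp, fn, fp, tn = 8_500, 1_500, 500, 30_000_000
    print(f"sensitivity = {sensitivity(tp, fn):.4%}")
    print(f"specificity = {specificity(tn, fp):.4%}")
    print(f"precision   = {precision(tp, fp):.4%}")
```

With these made-up counts, changing the number of false positives moves precision noticeably while specificity changes only in its later decimal places, which is why the two measures can rank pipelines differently.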
We are presenting these results solely as a "proof of concept".That is, we have successfully Dockerised a full NGS pipeline, that is capable of producing meaningful results, that are comparable with public and "best practice" workflows.

Run performance
For the testing carried out in this paper, NGSeasy was run on Rosalind, an OpenStack private cloud based at King's College London, using a virtual machine with 256GB RAM and 32 cores.
We have also successfully tested NGSeasy on workstations running a wide variety of environments (OSX, Windows 7, Ubuntu 14.04).
Average representative run times for a full NGSeasy pipeline and its components are presented in Table 4.
The obvious winners for alignment, based purely on speed, are bwa and snap. The two are comparable. The extra run time seen for snap is due to loading/reading of the indexed reference genome. Once this has been done, snap will run at speed, and is the fastest aligner these authors have seen. The reported runtime for stampy is dependent on bwa having been run first.
Note that FastQC and read quality trimming need only be applied once. After this, the pipeline is set up to test for and skip these stages if they have already been run, speeding up subsequent pipeline runs.
Running a full NGS pipeline using Docker containers caused no noticeable reduction in computing performance (run time) when compared to our original native (non-Dockerised) NGS pipeline. The differences are in the milliseconds to seconds range, and largely depend on the underlying system hardware (and data quality). These observations are similar to those reported in [37].
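The test-and-skip behaviour described above for FastQC and read trimming can be sketched as a simple check on whether a stage's output already exists. This is an illustrative pattern under assumed file names, not the logic of the NGSeasy scripts themselves.

```python
import tempfile
from pathlib import Path


def run_stage_once(output, action):
    """Run `action(output)` only if `output` does not exist yet.

    Returns True if the stage ran and False if it was skipped.
    """
    output = Path(output)
    if output.exists():
        return False
    action(output)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical output file name standing in for a FastQC report.
        report = Path(workdir) / "sample01_fastqc.html"

        def fake_fastqc(path):
            # Stand-in for the real tool: just create the expected output.
            path.write_text("report")

        print(run_stage_once(report, fake_fastqc))  # first call runs the stage
        print(run_stage_once(report, fake_fastqc))  # second call skips it
```

A check on output existence is the simplest form of checkpointing; it does not detect outputs left incomplete by an interrupted run, which would need an additional completion marker.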
Strikingly, depending on available compute, read depth and the selected pipeline components, the observed run times indicate that a full clinical NGS pipeline could be run, and achieve actionable results, in less than 2 hours. This has major positive implications for molecular diagnostics and projects like the 100,000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project/). That is, alignment and variant calling are no longer a major bottleneck. More work is needed to speed up and improve library preparation, sequencing machine run times and solutions for variant annotation, prioritisation and clinical reporting.

Use cases
NGSeasy demonstrates the utility of Docker as a means to package software used in modular workflows. We envisage NGSeasy as a method for deploying drop-in analyses, in scenarios where data cannot be shared (either for size or privacy reasons) and an analysis must be carried out in-situ. In such cases, using a pipeline like NGSeasy, it is simple to develop an analysis off site, package it and deploy it on computational facilities where access to the data is provided. Examples of such scenarios include the 100,000 Genomes Project and Illumina BaseSpace [38] Docker 'apps'.
In addition, NGSeasy is being tested across a select group of NHS Labs (under the NHS England Open Source Initiative) for molecular diagnostic and clinical research pipelines. In particular, a version of NGSeasy has been adapted by Viapath at King's College Hospital (publication pending; personal communication from Dr Barnaby Clark and Dr David Brawand, http://www.viapath.co.uk/locations/kings-college-hospital). The advantages are the ease of use and set-up, the built-in version control, and the audit tracking and reproducibility conferred by the use of Docker and the open source community built around GitHub.

NGSeasy future developments
NGSeasy is under continual development. What we demonstrate here is the pre-production release and basic proof of concept evaluation of NGSeasy: a next generation sequencing pipeline in Docker containers. We want to present this to the scientific community at large, especially those working in the bioinformatics domain, and wish to encourage and invite collaboration on NGSeasy and our group's efforts to Dockerise bioinformatic pipelines.
The group is currently working on a GUI for NGSeasy, along with a modular benchmarking suite. In planned extensions, NGSeasy will provide options for consensus calling, trio/family and population based calling pipelines, human leukocyte antigen (HLA) calling, structural variant calling, cancer pipelines, more optimisations, improved logging, and the latest b38 indexed genomes.
In later versions we will publish detailed benchmarking statistics for all aligners and variant callers on whole exome, genome and clinical panels from a range of depths and platforms.
Development work on Docker continues at pace. The present Docker daemon runs as root, and there remain security issues with the notion of providing access to this daemon in a shared user environment, such as a typical cluster. A solution to this exists using Linux kernel user namespaces, but this is presently undergoing review.

Michael Barton
Joint Genome Institute, Walnut Creek, CA, USA
My understanding of this article is that the NGSeasy pipeline aims to simplify the distribution of common tools used in sequencing analysis. A still significant problem in bioinformatics is getting third-party tools installed and working; by using Docker containers as described in this article, the authors will make this process easier. The code is available on GitHub as described and they provide extensive documentation.

Major
One concern is the install instructions in the article. Specifically:
sudo make INTSALLDIR="/media/scratch" all
sudo make intsall
I am wary of using 'sudo' to install. I know that using tools like 'apt-get' requires 'sudo'; however, for most bioinformatics software I prefer to install in my user directory simply to avoid any possible security problems. I took a look at the Makefile and I believe that sudo is not necessarily required to install, only that the INSTALLDIR and TARGET_BIN are owned by the user. Also there is a typo here in 'intsall'.
The project doesn't include any NGS tools related to assembly or transcriptomics. Though not stated specifically, the tools and data described here lead me to believe this project is focused around clinical applications and human genomics. If that is the case, perhaps this should be clarified in the article and the title.
The Docker security issue at the end feels tagged-on. This, I think, is a pressing concern that prevents many people from using Docker on HPC machines as opposed to on-demand computing such as AWS. This is the case at the JGI where I currently work. I would suggest expanding on this point a little more to describe why it is an issue.
I think a short paragraph would be useful to end the article with. This would summarise the points described above and the potential impact of the work.
I think Figure 1 could be expanded upon. It currently assumes a familiarity with Docker that could make it difficult to interpret without a good understanding of containers and volumes.
Minor
The scripts installed from https://github.com/KHP-Informatics/ngseasy/tree/master/bin
agree with your security points once you have the NGSeasy approach set up, but getting there can be a challenge. You do mention this later in "NGSeasy future developments" and that would fit better in the initial installation section.
For reporting of download/install times, please also list install times from more standard connection speeds. A majority of users will not have 500Mb/s or better download. Is it possible to download subsets of the data? It looks like it currently grabs hg19, b37 and hs37d5, tripling the download times and space required. Digging into the code, it wasn't clear how to get other mentioned genomes like hs38DH.

Validation
For the GCAT/Genome in a Bottle validations, I'd suggest reporting precision instead of specificity. Specificity is not especially useful for calling since it's dominated by true negatives. For example, the precision rates show clear differences between FreeBayes and Platypus, and also differences between novoalign and the other aligners. The specificity numbers do not reveal these.
It's hard to judge the results of your validation without comparing to another best-practice pipeline like bwa + GATK HaplotypeCaller. Having these as a baseline next to your comparisons would strengthen the argument that the current implementation does a comparable job to expected best practice.
It would be useful to have bwa-mem alignment results also listed in the GCAT validations. bwa-mem is a widely used aligner, separate from stampy.
Do you have validation of using non-GATK tools (recab and glia) versus GATK tools in terms of the output quality? This would be useful to report. I've had good output success avoiding these steps entirely but would like to see differences between avoiding the steps and using freely available alternatives.

Timings
The timing information is really useful and a great addition to the paper. I'd suggest adding some caveats to the conclusion and tables to make it clearer about the inputs, since the numbers are exome with only 30x coverage. Most standard exomes would be higher coverage and WGS is becoming increasingly standard. Some of the statements like "alignment and variant calling are no longer a major bottleneck" seem overextended from timings on this smaller dataset. Scaling up is not linear and things get harder for WGS projects like the 100k genomes project.

I have read this submission.I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

7.
for packaging NGS code in a docker image?
The abstract and introduction define the domain of application of NGSEasy as "next-generation sequencing (NGS)". However, the manuscript is about methods for variant calling, which is an important, but smaller scope than the full NGS data analysis. For instance, NGSEasy does not include tools for analysis of RNA-Seq. I recommend to revise the abstract and introduction to clearly indicate the scope of the software tool.
A strength of the manuscript is to use the GCAT server to evaluate the pipeline, but the results are not presented in the context of the performance of other pipelines, so the readers have no easy way of knowing if the sensitivity and specificity measures presented on page 6 are competitive. For instance, on Page 6, the manuscript claims: "we have successfully Dockerized a full NGS pipeline that is capable of producing meaningful result, that are comparable with public and "best practice" workflows". However, there is no reference for the workflows the work is compared to and no simple way to establish if the results are comparable, let alone competitive. I strongly recommend to include a comparison directly in the manuscript to help the readers objectively assess performance.
The manuscript would be strengthened by providing a discussion of the limitations of the work. For instance, it is unclear what support is provided for parallelization across nodes, rather than SMP parallelization. (Multi-node parallelization is important when more than one or two samples need to be analyzed.)
I am unable to locate Reference 37 using the citation information: "37. Matzke M, Jurkschat K, Backhaus T, et al.: PrePrints PrePrints. 2014; (1): 1-34.". This reference is used when discussing performance of docker containers and I am unable to determine if this is appropriate. A valid reference for this point is https://peerj.com/articles/1273/.
The reference provided for Nextflow is wrong. The tool should be cited using the web site (http://nextflow.io) or FigShare poster, and the correct authors given credit.
Minor comments:
Page 3, "NGSEasy contains all the basic tools needed for manipulation and quality control.." should be toned down. Using all in a manuscript is inviting contradiction. For instance, I could point out that NGSEasy does not contain SpeedSeq, a recently published set of tools that considerably accelerates variation calling in HTS data. Therefore, I would argue that NGSEasy does not contain all the basic tools that I would like to use. Consider revising as "NGSEasy contains a set of tools sufficient for manipulation and quality control.."
Page 5. The word "all" is used again (left column, 6th paragraph). I doubt that the practice, as described, eliminates all potential issues with typos, since end-users will be writing scripts using tools in the image, and I am not sure how consistent naming conventions can fully eliminate typos when writing scripts.
I have read this submission.I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

Table 4. Average run times: 30× 100bp PE Illumina data.
that use the same data. Be aware that run times will vary depending on depth, quality of data, and compute power (e.g. available RAM and CPU). Both Platypus [31] and FreeBayes [30] are highly parallelisable and run at speed; Platypus was 6x faster than FreeBayes in our test, but less sensitive: the average GIB sensitivity over all aligners was 82.40% for Platypus versus 84.15% for FreeBayes.

variant tool set that discovers short variants from next generation sequencing data.
16. Chiang C: An
34. gatk resource bundle is a collection of standard files for working with human resequencing data with the gatk. 2015.
35. Highnam G, Wang JJ, Kusler D, et al.: An analytical framework for optimizing variant discovery from personal genomes. Nat Commun. 2015; 6: 6275.
36. Zook JM, Chapman B, Wang J, et al.: Integrating
doi:10.5256/f1000research.7650.r10673
Has there been scaling work across non-single machine setups? Our experience is that shared network issues and managing Docker containers can dominate scaling. If the target is single multi-core machines it would be worth specifying this directly.