Abstract
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$259.00 per year
only $21.58 per issue
Rent or buy this article
Prices vary by article type
from$1.95
to$39.95
Prices may be subject to local taxes which are calculated during checkout
Code availability
Minimal example workflows and links to documentation are available under https://github.com/GoekeLab/bioinformatics-workflows.
References
Stephens, Z. D. et al. Big data: astronomical or genomical? PLoS Biol. 13, e1002195 (2015).
Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
Dozmorov, M. G. GitHub statistics as a measure of the impact of open-source bioinformatics software. Front. Bioeng. Biotechnol. 6, 198 (2018).
Nowogrodzki, A. How to support open-source software and stay sane. Nature 571, 133–134 (2019).
Mangul, S. et al. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol. 17, e3000333 (2019).
Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
Tiwari, K. et al. Reproducibility in systems biology modelling. Mol. Syst. Biol. 17, e9982 (2021).
Botvinik-Nezer, R. et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582, 84–88 (2020).
Grüning, B. et al. Practical computational reproducibility in the life sciences. Cell Syst. 6, 631–635 (2018).
van Vliet, M. Seven quick tips for analysis scripts in neuroimaging. PLoS Comput. Biol. 16, e1007358 (2020).
Leipzig, J. A review of bioinformatic pipeline frameworks. Brief. Bioinform. 18, 530–536 (2017).
Gronenschild, E. H. B. M. et al. The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements. PLoS ONE 7, e38234 (2012).
Stodden, V., Seiler, J. & Ma, Z. An empirical analysis of journal policy effectiveness for computational reproducibility. Proc. Natl Acad. Sci. USA 115, 2584–2589 (2018).
Reiter, T. et al. Streamlining data-intensive biology with workflow systems. Gigascience 10, giaa140 (2021).
Perkel, J. M. Workflow systems turn raw data into scientific knowledge. Nature 573, 149–150 (2019).
Love, M. I. et al. Tximeta: reference sequence checksums for provenance identification in RNA-seq. PLoS Comput. Biol. 16, e1007664 (2020).
Simoneau, J. & Scott, M. S. In silico analysis of RNA-seq requires a more complete description of methodology. Nat. Rev. Mol. Cell Biol. 20, 451–452 (2019).
Simoneau, J., Dumontier, S., Gosselin, R. & Scott, M. S. Current RNA-seq methodology reporting limits reproducibility. Brief. Bioinform. 22, 140–145 (2019).
Simoneau, J., Gosselin, R. & Scott, M. S. Factorial study of the RNA-seq computational workflow identifies biases as technical gene signatures. NAR Genom. Bioinform. 2, lqaa043 (2020).
Kim, Y.-M., Poline, J.-B. & Dumas, G. Experimenting with reproducibility: a case study of robustness in bioinformatics. Gigascience 7, giv077 (2018).
Kanwal, S., Khan, F. Z., Lonie, A. & Sinnott, R. O. Investigating reproducibility and tracking provenance—a genomic workflow case study. BMC Bioinformatics 18, 337 (2017).
Goble, C. et al. FAIR Computational Workflows. Data Intell. 2, 108–121 (2020).
Lamprecht, A.-L. et al. Towards FAIR principles for research software. Data Sci. 3, 37–59 (2019).
Abate, P., Di Cosmo, R., Treinen, R. & Zacchiroli, S. A modular package manager architecture. Inf. Softw. Technol. 55, 459–474 (2013).
Decan, A., Mens, T. & Grosjean, P. An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empir. Softw. Eng. 24, 381–416 (2019).
Gruening, B. et al. Recommendations for the packaging and containerizing of bioinformatics software. F1000Res. 7, J-742 (2018).
Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods 15, 475–476 (2018).
Silver, A. Software simplified. Nature 546, 173–174 (2017).
Kurtzer, G. M., Sochat, V. & Bauer, M. W. Singularity: scientific containers for mobility of compute. PLoS ONE 12, e0177459 (2017).
O’Connor, B. D. et al. The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows. F1000Res. 6, 52 (2017).
da Veiga Leprevost, F. et al. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics 33, 2580–2582 (2017).
Beaulieu-Jones, B. K. & Greene, C. S. Reproducibility of computational workflows is automated using continuous analysis. Nat. Biotechnol. 35, 342–346 (2017).
Black, A., MacCannell, D. R., Sibley, T. R. & Bedford, T. Ten recommendations for supporting open pathogen genomic analysis in public health. Nat. Med. 26, 832–841 (2020).
Krumm, N. & Hoffman, N. Practical estimation of cloud storage costs for clinical genomic data. Pract. Lab. Med. 21, e00168 (2020).
Yang, A., Troup, M. & Ho, J. W. K. Scalability and validation of big data bioinformatics software. Comput. Struct. Biotechnol. J. 15, 379–386 (2017).
Krissaane, I. et al. Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud Platform and Amazon Web Services. J. Am. Med. Inform. Assoc. 27, 1425–1430 (2020).
Larsonneur, E. et al. Evaluating workflow management systems: a nioinformatics use case. in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2773–2775 (IEEE, 2018).
Bux, M. & Leser, U. Parallelization in scientific workflow management systems. Preprint at https://arxiv.org/abs/1303.7195 (2013).
Belcastro, L., Marozzo, F. & Talia, D. Programming models and systems for big data analysis. Int. J. Parallel Emergent Distrib. Syst. 34, 632–652 (2019).
Silva, V. et al. Raw data queries during data-intensive parallel workflow execution. Future Gener. Comput. Syst. 75, 402–422 (2017).
Grossman, R. L. Data lakes, clouds, and commons: a review of platforms for analyzing and sharing genomic data. Trends Genet. 35, 223–234 (2019).
Langmead, B. & Nellore, A. Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet. 19, 325 (2018).
Lau, J. W. et al. The Cancer Genomics Cloud: collaborative, reproducible, and democratized—a new paradigm in large-scale computational research. Cancer Res. 77, e3–e6 (2017).
Yakneen, S. et al. Butler enables rapid cloud-based analysis of thousands of human genomes. Nat. Biotechnol. 38, 288–292 (2020).
Perez-Riverol, Y. & Moreno, P. Scalable data analysis in proteomics and metabolomics using BioContainers and workflows engines. Proteomics 20, e1900147 (2020).
Fjukstad, B., Dumeaux, V., Hallett, M. & Bongo, L. A. Reproducible data analysis pipelines for precision medicine. in 2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) 299–306 (IEEE, 2019).
Birger, C. et al. FireCloud, a scalable cloud-based platform for collaborative genome analysis: strategies for reducing and controlling costs. Preprint at bioRxiv https://doi.org/10.1101/209494 (2017).
Han, L., Canon, L., Casanova, H., Robert, Y. & Vivien, F. Checkpointing workflows for fail-stop errors. IEEE Trans. Comput. 67, 1105–1120 (2018).
Jackson, M., Kavoussanakis, K. & Wallace, E. W. J. Using prototyping to choose a bioinformatics workflow management system. PLoS Comput. Biol. 17, e1008622 (2021).
Goecks, J., Nekrutenko, A., Taylor, J. & Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86 (2010).
Fillbrunn, A. et al. KNIME for reproducible cross-domain analysis of life science data. J. Biotechnol. 261, 149–156 (2017).
Berthold, M. R. et al. in Data Analysis, Machine Learning and Applications 319–326 (Springer, 2008).
Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018).
Batut, B. et al. Community-driven data analysis training for biology. Cell Syst. 6, 752–758 (2018).
Jalili, V. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update. Nucleic Acids Res. 48, W395–W402 (2020).
Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
Cordasco, G., D’Auria, M., Negro, A., Scarano, V. & Spagnuolo, C. Toward a domain-specific language for scientific workflow-based applications on multicloud system. Concurr. Comput. e5802 (2020).
Mölder, F. et al. Sustainable data analysis with Snakemake. F1000Res. 10, 33 (2021).
Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).
Bourgey, M. et al. GenPipes: an open-source framework for distributed and scalable genomic analyses. Gigascience 8, giz037 (2019).
Sadedin, S. P., Pope, B. & Oshlack, A. Bpipe: a tool for running and managing bioinformatics pipelines. Bioinformatics 28, 1525–1526 (2012).
Novella, J. A. et al. Container-based bioinformatics with Pachyderm. Bioinformatics 35, 839–846 (2019).
Kieser, S., Brown, J., Zdobnov, E. M., Trajkovski, M. & McCue, L. A. ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data. BMC Bioinformatics 21, 257 (2020).
Hölzer, M. & Marz, M. PoSeiDon: a Nextflow pipeline for the detection of evolutionary recombination events and positive selection. Bioinformatics 37, 1018–1020 (2020).
Zhao, Q. et al. LncPipe: a Nextflow-based pipeline for identification and analysis of long non-coding RNAs from RNA-seq data. J. Genet. Genomics 45, 399–401 (2018).
Cornwell, M. et al. VIPER: Visualization Pipeline for RNA-seq, a Snakemake workflow for efficient and complete RNA-seq analysis. BMC Bioinformatics 19, 135 (2018).
Lampa, S., Dahlö, M., Alvarsson, J. & Spjuth, O. SciPipe: a workflow library for agile development of complex and dynamic bioinformatics pipelines. Gigascience 8, giz044 (2019).
Amstutz, P. et al. Common Workflow Language v1. 0 (2016); https://doi.org/10.6084/m9.figshare.3115156.v2
Crusoe, M. R. et al. Methods included: standardizing computational reuse and portability with the common workflow language. Preprint at https://arxiv.org/abs/2105.07028 (2021).
Voss, K., Van der Auwera, G. & Gentry, J. Full-stack genomics pipelining with GATK4 + WDL + Cromwell. F1000Res 6, 1381 (2017).
Vivian, J. et al. Toil enables reproducible, open source, big biomedical data analyses. Nat. Biotechnol. 35, 314–316 (2017).
Kotliar, M., Kartashov, A. V. & Barski, A. CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language. Gigascience 8, giz084 (2019).
Yang, J. Cloud computing for storing and analyzing petabytes of genomic data. J. Ind. Inf. Integr. 15, 50–57 (2019).
Xu, B., An, L., Thung, F., Khomh, F. & Lo, D. Why reinventing the wheels? An empirical study on library reuse and re-implementation. Empir. Softw. Eng. 25, 755–789 (2020).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Bhardwaj, V. et al. snakePipes: facilitating flexible, scalable and integrative epigenomic analysis. Bioinformatics 35, 4757–4759 (2019).
Ewels, P. A. et al. The nf-core framework for community-curated bioinformatics pipelines. Nat. Biotechnol. 38, 276–278 (2020).
Sicilia, M.-A., García-Barriocanal, E. & Sánchez-Alonso, S. Community curation in open dataset repositories: insights from Zenodo. Procedia Comput. Sci. 106, 54–60 (2017).
Leman, J. K. et al. Better together: elements of successful scientific software development in a distributed collaborative community. PLoS Comput. Biol. 16, e1007507 (2020).
Weber, L. M. et al. Essential guidelines for computational method benchmarking. Genome Biol. 20, 125 (2019).
Marx, V. Bench pressing with genomics benchmarkers. Nat. Methods 17, 255–258 (2020).
Angers-Loustau, A. et al. The challenges of designing a benchmark strategy for bioinformatics pipelines in the identification of antimicrobial resistance determinants using next generation sequencing technologies. F1000Res. 7, J-459 (2018).
Möller, S. et al. Robust cross-platform workflows: how technical and scientific communities collaborate to develop, test and share best practices for data analysis. Data Sci. Eng. 2, 232–244 (2017).
Carey, V. J. et al. Global alliance for genomics and health meets Bioconductor: toward reproducible and agile cancer genomics at Cloud scale. JCO Clin. Cancer Inf. 4, 472–479 (2020).
List, M., Ebert, P. & Albrecht, F. Ten simple rules for developing usable software in computational biology. PLoS Comput. Biol. 13, e1005265 (2017).
Karimzadeh, M. & Hoffman, M. M. Top considerations for creating bioinformatics software documentation. Brief. Bioinform. 19, 693–699 (2018).
Anzt, H. et al. An environment for sustainable research software in Germany and beyond: current state, open challenges, and call for action. F1000Res. 9, 295 (2020).
Mangul, S., Martin, L. S., Eskin, E. & Blekhman, R. Improving the usability and archival stability of bioinformatics software. Genome Biol. 20, 47 (2019).
Siepel, A. Challenges in funding and developing genomic software: roots and remedies. Genome Biol. 20, 147 (2019).
Malone, K. & Wolski, R. Doing data science on the shoulders of giants: the value of open source software for the data science community. Harvard Data Science Review https://hdsr.mitpress.mit.edu/pub/xsrt4zs2/release/4 (31 May 2020).
Acknowledgements
J.G. is supported by funding from the Agency for Science, Technology, and Research (A∗STAR), Singapore, and by the Singapore Ministry of Health’s National Medical Research Council under its Individual Research Grant funding scheme. L.W. was supported by the Singapore International Pre-Graduate Award (SIPGA) from A*STAR and the New Colombo Plan Scholarship from the Australian Department of Foreign Affairs and Trade. We thank B. Grüning for helpful comments and suggestions on this manuscript. We would like to thank R. Patro for contributing a test dataset for the example workflow implementations. We thank M. van den Beek for contributing the Galaxy workflow to the GitHub repository. We thank J. Köster for contributing the Snakemake workflow to the GitHub repository. We thank P. Di Tommaso for contributing the Nextflow workflow to the GitHub repository. We thank S. Lampa for contributing the SciPipe workflow to the GitHub repository. We thank J.H. Gálvez López, P.-O. Quirion, E. Henrion, and M. Bourgey for contributing the GenPipes workflow to the GitHub repository. We thank A. Novak, B. Paten, L. Blauvelt, and L. Koziol for contributing the Toil workflow to the GitHub repository. We thank S. Sadedin for contributing the Bpipe workflow to the GitHub repository. We thank S. Sadedin for contributing the Bpipe workflow to the GitHub repository.
Author information
Authors and Affiliations
Contributions
L.W., A.W., and J.G. planned the manuscript. L.W. and J.G. wrote the first draft. L.W., A.W., and J.G. wrote and revised the final manuscript.
Corresponding author
Ethics declarations
Competing interests
A.W. is an employee of ImmunoScape Pte Ltd. L.W. and J.G. declare no competing interests.
Additional information
Peer review information Nature Methods thanks Johannes Köster, Yasset Perez-Riverol, Anton Nekrutenko, and Paolo Di Tommaso for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Note.
Supplementary Table 1
Overview of workflow managers for bioinformatics.
Rights and permissions
About this article
Cite this article
Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 18, 1161–1168 (2021). https://doi.org/10.1038/s41592-021-01254-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41592-021-01254-9
This article is cited by
-
MAW: the reproducible Metabolome Annotation Workflow for untargeted tandem mass spectrometry
Journal of Cheminformatics (2023)
-
SynBioTools: a one-stop facility for searching and selecting synthetic biology tools
BMC Bioinformatics (2023)
-
Irreproducible results and unsupported conclusions in Ahmad et al. [BMC genomics (2020) 21:656]
BMC Genomics (2023)
-
How is Big Data reshaping preclinical aging research?
Lab Animal (2023)
-
TidyMass an object-oriented reproducible analysis framework for LC–MS data
Nature Communications (2022)