Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers

Abstract

The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.

Fig. 1: Overview of bioinformatics analysis workflows using an example of transcript expression quantification.

Code availability

Minimal example workflows and links to documentation are available under https://github.com/GoekeLab/bioinformatics-workflows.
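The example workflows in that repository are not reproduced here. As a rough illustration of the core idea that workflow managers automate, the following sketch declares steps with named inputs and outputs, runs them in dependency order, and skips any step whose output already exists. It is not taken from the repository or from any particular workflow manager: the `Step` class, the `run_workflow` function, the toy `trim_reads` and `count_reads` actions, and all file names are hypothetical. Real workflow managers typically also compare timestamps or checksums, manage software environments, and dispatch jobs to clusters or the cloud.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from pathlib import Path
from typing import Callable
import tempfile


@dataclass
class Step:
    """One workflow step (hypothetical structure, for illustration only)."""
    name: str
    inputs: list[str]   # file names this step reads
    output: str         # file name this step writes
    action: Callable[[list[Path], Path], None]


def run_workflow(steps: list[Step], workdir: Path) -> list[str]:
    """Run steps in dependency order; return names of steps actually executed."""
    producer = {s.output: s.name for s in steps}
    by_name = {s.name: s for s in steps}
    # A step depends on whichever step produces one of its input files.
    graph = {s.name: {producer[i] for i in s.inputs if i in producer} for s in steps}
    executed = []
    for name in TopologicalSorter(graph).static_order():
        step = by_name[name]
        out = workdir / step.output
        if out.exists():
            # Simplification: an existing output counts as up to date.
            continue
        step.action([workdir / i for i in step.inputs], out)
        executed.append(name)
    return executed


def trim_reads(inputs: list[Path], output: Path) -> None:
    # Toy stand-in for read trimming: drop the last base of every read.
    reads = inputs[0].read_text().split()
    output.write_text("\n".join(r[:-1] for r in reads))


def count_reads(inputs: list[Path], output: Path) -> None:
    # Toy stand-in for quantification: count the reads.
    reads = inputs[0].read_text().split()
    output.write_text(str(len(reads)))


def demo() -> tuple[list[str], list[str], str]:
    # Steps are listed out of order on purpose; the graph determines execution order.
    steps = [
        Step("quantify", ["trimmed.txt"], "counts.txt", count_reads),
        Step("trim", ["reads.txt"], "trimmed.txt", trim_reads),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "reads.txt").write_text("ACGT\nTTGA\nCCAT")
        first = run_workflow(steps, workdir)   # runs both steps
        second = run_workflow(steps, workdir)  # nothing left to do
        result = (workdir / "counts.txt").read_text()
    return first, second, result


if __name__ == "__main__":
    print(demo())
```

Calling `demo()` runs the two toy steps once and then shows that a second invocation has nothing left to execute, which is the re-entrancy behaviour that makes interrupted pipelines cheap to resume.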

Acknowledgements

J.G. is supported by funding from the Agency for Science, Technology, and Research (A*STAR), Singapore, and by the Singapore Ministry of Health’s National Medical Research Council under its Individual Research Grant funding scheme. L.W. was supported by the Singapore International Pre-Graduate Award (SIPGA) from A*STAR and the New Colombo Plan Scholarship from the Australian Department of Foreign Affairs and Trade. We thank B. Grüning for helpful comments and suggestions on this manuscript. We would like to thank R. Patro for contributing a test dataset for the example workflow implementations. We thank M. van den Beek for contributing the Galaxy workflow to the GitHub repository. We thank J. Köster for contributing the Snakemake workflow to the GitHub repository. We thank P. Di Tommaso for contributing the Nextflow workflow to the GitHub repository. We thank S. Lampa for contributing the SciPipe workflow to the GitHub repository. We thank J.H. Gálvez López, P.-O. Quirion, E. Henrion, and M. Bourgey for contributing the GenPipes workflow to the GitHub repository. We thank A. Novak, B. Paten, L. Blauvelt, and L. Koziol for contributing the Toil workflow to the GitHub repository. We thank S. Sadedin for contributing the Bpipe workflow to the GitHub repository.

Author information

Contributions

L.W., A.W., and J.G. planned the manuscript. L.W. and J.G. wrote the first draft. L.W., A.W., and J.G. wrote and revised the final manuscript.

Corresponding author

Correspondence to Jonathan Göke.

Ethics declarations

Competing interests

A.W. is an employee of ImmunoScape Pte Ltd. L.W. and J.G. declare no competing interests.

Additional information

Peer review information Nature Methods thanks Johannes Köster, Yasset Perez-Riverol, Anton Nekrutenko, and Paolo Di Tommaso for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Note.

Supplementary Table 1

Overview of workflow managers for bioinformatics.

About this article

Cite this article

Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 18, 1161–1168 (2021). https://doi.org/10.1038/s41592-021-01254-9

  • DOI: https://doi.org/10.1038/s41592-021-01254-9
