Software Tool Article

CGAT-core: a python framework for building scalable, reproducible computational biology workflows

[version 2; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 16 Jul 2019

This article is included in the Python collection.

Abstract

In the genomics era, computational biologists regularly need to process, analyse and integrate large and complex biomedical datasets. Analysis inevitably involves multiple dependent steps, resulting in complex pipelines or workflows, often with several branches. Large data volumes mean that processing needs to be quick and efficient, and scientific rigour requires that analysis be consistent and fully reproducible. We have developed CGAT-core, a Python package for the rapid construction of complex computational workflows. CGAT-core seamlessly handles parallelisation across high performance computing clusters, integration of Conda environments, full parameterisation, database integration and logging. To illustrate our workflow framework, we present a pipeline for the analysis of RNAseq data using pseudo-alignment.

Keywords

workflow, pipeline, python, genomics

Amendments from Version 1

We thank the reviewers for their comments. We have responded to the reviewers’ comments by modifying our underlying code and making changes to the main manuscript. The differences in the manuscript are minor. They include a paragraph in the discussion stating that we have begun implementing new features to support cloud storage, containerisation and LSF cluster interaction. Our code has been modified to reflect this. We have begun beta testing of our cloud storage code to support Google Cloud, AWS S3 storage and Azure storage. Currently, we have development branches for supporting Docker/Singularity and Kubernetes containerisation.

See the authors' detailed response to the review by Devon P. Ryan
See the authors' detailed response to the review by Alexander Peltzer

Introduction

Genomic technologies have given researchers the ability to produce large amounts of data at relatively low cost. Bioinformatic analyses typically involve passing data through a series of manipulations and transformations, called a pipeline or workflow. The need for tools to manage workflows is well established, with a wide range of options available, from graphical user interfaces such as Galaxy [1] and Taverna [2], aimed at non-programmers, to Snakemake, Nextflow, Toil, Ruffus and others [3–10], developed with computational biologists in mind. These tools differ in their portability, scalability, parameter handling, extensibility, and ease of use. In a recent survey [11], the tool rated highest for ease of pipeline development was Ruffus [12], a Python package that wraps pipeline steps in discrete Python functions, called ‘tasks’. It uses Python decorators to track the dependencies between tasks, ensuring that dependent tasks are completed in the correct order and independent tasks can be run in parallel. If a pipeline is interrupted before completion, or new input files are added, only data sets that are missing or out-of-date are re-run. Ruffus implements a wide range of decorators that allow complex operations on input files, including: conversion of a single input file to a single output file; splitting of a file into multiple files (and vice versa); and conditional merging of multiple input files into a smaller number of outputs. More advanced options include generating combinations or permutations of input files and conditional execution based on input parameters. Use of decorators means that Ruffus pipelines are native Python scripts, rather than the domain specific languages (DSLs) used in many other workflow tools. A key advantage of this is that Python code can be used to link individual steps, as well as in processing tasks.
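The decorator-based dependency tracking described above can be illustrated with a minimal sketch in plain Python. This is not the Ruffus API or its implementation: the `transform_suffix` decorator, the `is_out_of_date` helper and the file names are hypothetical, and the sketch only covers the one-input-to-one-output case with a simple timestamp check.

```python
import os
import tempfile


def is_out_of_date(infile, outfile):
    """Return True if outfile is missing or older than infile."""
    if not os.path.exists(outfile):
        return True
    return os.path.getmtime(outfile) < os.path.getmtime(infile)


def transform_suffix(in_suffix, out_suffix):
    """Hypothetical decorator: map each input file to an output file by
    swapping the suffix, and call the task only for outputs that are
    missing or out of date."""
    def decorator(task):
        def run(infiles):
            executed = []
            for infile in infiles:
                outfile = infile[: -len(in_suffix)] + out_suffix
                if is_out_of_date(infile, outfile):
                    task(infile, outfile)
                    executed.append(outfile)
            return executed
        return run
    return decorator


@transform_suffix(".txt", ".count")
def count_lines(infile, outfile):
    """Task: write the number of lines in infile to outfile."""
    with open(infile) as src, open(outfile, "w") as dst:
        dst.write(str(sum(1 for _ in src)))


def demo():
    """Run the task three times and report how many outputs were rebuilt."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.txt")
        with open(a, "w") as fh:
            fh.write("x\ny\n")
        first = count_lines([a])       # output missing, so the task runs
        second = count_lines([a])      # output up to date, so nothing runs
        b = os.path.join(tmp, "b.txt")
        with open(b, "w") as fh:
            fh.write("z\n")
        third = count_lines([a, b])    # only the newly added input is processed
        return len(first), len(second), len(third)
```

Calling `demo()` shows the behaviour the paragraph describes: work is repeated only for outputs that are missing or stale, so adding a new input file triggers a single additional task run.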

Here, we introduce Computational Genomics Analysis Toolkit (CGAT)-core [13], an open-source Python library that extends the functionality of CGAT-Ruffus by adding cluster interaction, parameterisation, logging, database interaction and Conda environment switching.

Methods

CGAT-core [13] extends the functionality of CGAT-Ruffus by providing a common interface to control distributed resource management systems using the Distributed Resource Management Application API (DRMAA). Currently, we support interaction with Sun Grid Engine, Slurm and PBS-pro/Torque. The execution engine enables tasks to be run locally or on a high-performance computing cluster and supports cluster distribution of both command line scripts (cgatcore.run) and Python functions (cgatcore.cluster). System resources (number of cores to use, amount of RAM to allocate) can be set on a per-pipeline, per-task, or per task-instance basis, even allowing allocation to be based on variables, for example input file size.
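As a rough illustration of allocating resources per task instance from the input file size, the following sketch computes a memory request in plain Python. The function names, the scaling constants and the shape of the returned options are assumptions made for this example; they are not part of the CGAT-core interface.

```python
import os
import tempfile


def memory_for_input(path, base_gb=1.0, gb_per_input_gb=2.0, max_gb=64.0):
    """Scale the memory request with input size, capped at max_gb.

    The constants are illustrative assumptions, not recommended values.
    """
    size_gb = os.path.getsize(path) / 1024 ** 3
    return min(max_gb, base_gb + gb_per_input_gb * size_gb)


def job_options(path, threads=1):
    """Build hypothetical per-job resource options for one input file."""
    return {"threads": threads, "memory": f"{memory_for_input(path):.1f}G"}


def demo():
    """Request resources for a small temporary input file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.fastq")
        with open(path, "wb") as fh:
            fh.write(b"A" * 1024)
        return job_options(path, threads=4)
```

A small input stays at the base allocation, while larger inputs receive proportionally more memory up to the cap, which avoids over-requesting cluster resources for every job.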

Operation

The parameter management component encourages the separation of workflow/tool configuration from implementation to build re-usable workflows. Algorithm parameters are collected in a single human-readable YAML configuration file. Thus, parameters can be set specifically for each dataset, without the need to modify the code, a feature seen in many other workflow management systems. For example, sequencing data can be aligned to a different reference genome by simply changing the path to the genome index in the YAML file. Both pipeline-wide and job-local parameters are automatically substituted into command line statements at execution time.
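The substitution of pipeline-wide and job-local parameters into a command line statement can be sketched as follows. This is an illustration under stated assumptions rather than the CGAT-core implementation: a plain dictionary stands in for the parsed configuration file, the `%(name)s` placeholder style, the parameter names and the index path are chosen for the example, and `build_statement` is a hypothetical helper.

```python
# Pipeline-wide parameters; a plain dict stands in for the parsed config file.
PIPELINE_PARAMS = {
    "kallisto_index": "/path/to/transcriptome.idx",  # hypothetical path
    "kallisto_threads": 4,
}

# Command template with named placeholders (placeholder style is an assumption).
STATEMENT = (
    "kallisto quant -i %(kallisto_index)s -t %(kallisto_threads)s "
    "-o %(outdir)s %(fastq1)s %(fastq2)s"
)


def build_statement(template, pipeline_params, **job_params):
    """Fill the template; job-local values override pipeline-wide ones."""
    values = {**pipeline_params, **job_params}
    return template % values


def demo():
    """Build the statement for one hypothetical paired-end sample."""
    return build_statement(
        STATEMENT,
        PIPELINE_PARAMS,
        outdir="quant/sampleA",
        fastq1="sampleA_1.fastq.gz",
        fastq2="sampleA_2.fastq.gz",
    )
```

Switching to a different index then only requires changing the value of `kallisto_index` in the configuration, while the statement in the workflow code stays untouched.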

To assist with reproducibility, record keeping and error handling, CGAT-core provides multi-level logging during workflow execution, recording full details of runtime parameters, environment configuration and tracking job submissions. Additionally, CGAT-core provides a simple, lightweight interface for interacting with relational databases such as SQLite (cgatcore.database), facilitating loading of analysis results at any step of the workflow, including combining output from parallel steps into single wide- or long-format tables.
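To make the idea of combining output from parallel steps into one long-format table concrete, here is a minimal sketch that uses only the Python standard library `sqlite3` module. It does not use cgatcore.database; the table name, column names and sample values are invented for illustration.

```python
import sqlite3


def load_long_table(connection, table, per_sample_counts):
    """Insert {sample: {transcript: value}} as one long-format table.

    The table name comes from trusted code here, so it is interpolated
    directly; the row values are passed as bound parameters.
    """
    connection.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(sample TEXT, transcript TEXT, est_counts REAL)"
    )
    rows = [
        (sample, transcript, value)
        for sample, counts in per_sample_counts.items()
        for transcript, value in counts.items()
    ]
    connection.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)


def demo():
    """Load two hypothetical per-sample outputs and sum counts per transcript."""
    conn = sqlite3.connect(":memory:")
    outputs = {
        "sampleA": {"tx1": 10.0, "tx2": 5.0},
        "sampleB": {"tx1": 7.0, "tx2": 0.0},
    }
    n_rows = load_long_table(conn, "transcript_counts", outputs)
    totals = conn.execute(
        "SELECT transcript, SUM(est_counts) FROM transcript_counts "
        "GROUP BY transcript ORDER BY transcript"
    ).fetchall()
    conn.close()
    return n_rows, totals
```

Once the per-sample results sit in a single table, downstream steps can query across samples with ordinary SQL instead of re-parsing individual output files.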

CGAT-core can load a different Conda environment for each step of the analysis, enabling the use of tools with conflicting software requirements. Furthermore, providing Conda environment files alongside pipeline scripts ensures that analyses can be fully reproduced.
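One possible way to run a single step inside a named Conda environment is to prefix its command with `conda run -n <env>`. The article does not describe the mechanism CGAT-core uses, so the sketch below is only an assumption about how such switching could look; `wrap_in_conda_env` and the environment name are hypothetical.

```python
import shlex


def wrap_in_conda_env(statement, env_name=None):
    """Prefix a simple command with `conda run` for the given environment.

    Assumes the statement is a single command without shell pipes or
    redirections; with no environment name it is returned unchanged.
    """
    if env_name is None:
        return statement
    return f"conda run -n {shlex.quote(env_name)} {statement}"


# Two steps of one workflow, each pinned to its own (hypothetical) environment.
quantify = wrap_in_conda_env("kallisto version", env_name="kallisto-env")
report = wrap_in_conda_env("python make_report.py")
```

Keeping the environment name next to each step makes it straightforward to ship the matching environment files alongside the pipeline script.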

CGAT-core workflows are Python scripts, and as such are stand-alone command line utilities that do not require the installation of a dedicated service. In order to reproducibly execute our workflows, we provide utility functions for argument parsing, logging and record keeping within scripts (cgatcore.experiment). Workflows are started, inspected and configured through the command line interface. Therefore, workflows become just another tool and can be re-used within other workflows. Furthermore, workflows can leverage the full power of Python, making them completely extensible and flexible.
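The notion of a workflow as a stand-alone command line utility with argument parsing and logging can be sketched with the standard library alone. This example does not use cgatcore.experiment; the command names `make`, `show` and `config`, the default target and the returned messages are assumptions chosen for illustration.

```python
import argparse
import logging


def build_parser():
    """Create the command line interface for a hypothetical workflow script."""
    parser = argparse.ArgumentParser(
        description="Example workflow script (illustrative only)."
    )
    parser.add_argument(
        "command",
        choices=["make", "show", "config"],
        help="run tasks, list pending tasks, or write a default configuration",
    )
    parser.add_argument(
        "target", nargs="?", default="full", help="task to build up to"
    )
    parser.add_argument(
        "--log-level", default="INFO", choices=["DEBUG", "INFO", "WARNING"]
    )
    return parser


def main(argv=None):
    """Parse arguments, record them in the log and dispatch the command."""
    args = build_parser().parse_args(argv)
    logging.basicConfig(
        level=getattr(logging, args.log_level),
        format="%(asctime)s %(levelname)s %(message)s",
    )
    log = logging.getLogger("example_workflow")
    # Recording the invocation supports later reconstruction of the run.
    log.info("command=%s target=%s", args.command, args.target)
    if args.command == "config":
        return "would write a default configuration file"
    if args.command == "show":
        return f"would list tasks needed for target '{args.target}'"
    return f"would run tasks needed for target '{args.target}'"
```

Because `main` accepts an argument list, the same entry point can be called from a shell wrapper or from another workflow, for example `main(["make", "full"])`.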

Implementation

CGAT-core is implemented in Python 3 and installable via Conda and PyPI with minimal dependencies. We have successfully deployed and tested the code on OSX, Red Hat and Ubuntu. We have made CGAT-core and associated repositories open-source under the MIT licence, allowing full and free use for both commercial and non-commercial purposes. Our software is fully documented (https://pypi.org), version controlled and extensively tested using continuous integration (https://travis-ci.org/cgat-developers). We welcome community participation in code development and issue reporting through GitHub.

Use case

To illustrate a simple use case of CGAT-core, we have built an example RNAseq analysis pipeline, which performs read counting using Kallisto [14] and differential expression using DESeq2 [15]. This workflow and Conda environment are contained within our CGAT-showcase repository (https://github.com/cgat-developers/cgat-showcase). The workflow highlights how simple pipelines can be constructed using CGAT-core, demonstrating how the pipeline can be configured using a YAML file, how third-party tools can be executed efficiently across a cluster or on a local machine, and how data can be easily loaded into a database. Furthermore, we and others have been extensively using CGAT-core to build pipelines for computational genomics (https://github.com/cgat-developers/cgat-flow).

Discussion

CGAT-core [13] extends the popular Python workflow engine Ruffus by adding desirable features from a variety of other workflow systems to form an extremely simple, flexible and scalable package. CGAT-core provides seamless high-performance computing cluster interaction and adds Conda environment integration for the first time. In addition, our framework focuses on simplifying the pipeline development and testing process by providing convenience functions for parameterisation, database interaction, logging and pipeline interaction.

The ease of pipeline development enables CGAT-core to bridge the gap between exploratory data analysis and building production workflows. A guiding principle is that it should be as easy (or easier) to complete a series of tasks using a simple pipeline compared to using an interactive prompt, especially once cluster submission is considered. CGAT-core enables the production of analysis pipelines that can easily be run in multiple environments to facilitate sharing of code as part of the publication process. Thus, CGAT-core encourages a best-practice reproducible research approach by making it the path of least resistance. For example, exploratory analysis in Jupyter Notebooks can be converted to a Python script or used directly in the pipeline. Similarly, exploratory data analysis in R, or any other language, can easily be converted to a script that can be run by the pipeline. This lightweight wrapping of quickly prototyped analysis forms a lab book, enabling rapid reproduction of analyses and reuse of code for different data sets.

CGAT-core is under active development by the CGAT-Developers GitHub community. Support for cloud storage interaction, containerisation and LSF cluster interaction is currently being developed.

Data availability

All data underlying the results are available as part of the article and no additional source data are required.

Software availability

Source code available from: https://github.com/cgat-developers/cgat-core.

Archived source code at time of publication: http://doi.org/10.5281/zenodo.3257384 [13].

Licence: MIT License.

how to cite this article
Cribbs AP, Luna-Valero S, George C et al. CGAT-core: a python framework for building scalable, reproducible computational biology workflows [version 2; peer review: 1 approved, 1 approved with reservations] F1000Research 2019, 8:377 (https://doi.org/10.12688/f1000research.18674.2)
Open Peer Review

Version 2
PUBLISHED 16 Jul 2019
Reviewer Report 17 Jul 2019
Alexander Peltzer, Quantitative Biology Center (QBIC), University of Tübingen, Tübingen, Germany 
Approved with Reservations
The authors did come up with solutions for container technologies and have started to implement and/or mention these in the discussion of the article.

However, the aforementioned comments that "A key advantage of this is that Python ...
HOW TO CITE THIS REPORT
Peltzer A. Reviewer Report For: CGAT-core: a python framework for building scalable, reproducible computational biology workflows [version 2; peer review: 1 approved, 1 approved with reservations]. F1000Research 2019, 8:377 (https://doi.org/10.5256/f1000research.21838.r51275)
Reviewer Report 17 Jul 2019
Devon P. Ryan, Max Planck Institute of Immunobiology and Epigenetics (MPI-IE), Freiburg, Germany 
Approved
The authors have ...
HOW TO CITE THIS REPORT
Ryan DP. Reviewer Report For: CGAT-core: a python framework for building scalable, reproducible computational biology workflows [version 2; peer review: 1 approved, 1 approved with reservations]. F1000Research 2019, 8:377 (https://doi.org/10.5256/f1000research.21838.r51274)
Version 1
PUBLISHED 04 Apr 2019
Reviewer Report 24 Apr 2019
Alexander Peltzer, Quantitative Biology Center (QBIC), University of Tübingen, Tübingen, Germany 
Approved with Reservations
Cribbs et al describe CGAT-Core, a python framework for building scalable, reproducible computational biology workflows in their proposed software tool article.

The rationale behind the requirements for developing the software tool is explained properly, although there are ...
HOW TO CITE THIS REPORT
Peltzer A. Reviewer Report For: CGAT-core: a python framework for building scalable, reproducible computational biology workflows [version 2; peer review: 1 approved, 1 approved with reservations]. F1000Research 2019, 8:377 (https://doi.org/10.5256/f1000research.20448.r47003)
  • Author Response 16 Jul 2019
    Adam Cribbs, MRC WIMM Centre for Computational Biology, University of Oxford, Oxford, OX3 9DS, UK
    16 Jul 2019
    Author Response
    We would like to thank you for taking the time to review the manuscript. We are grateful for your perceptive suggestions and we have updated the manuscript, code and documentation ...
Reviewer Report 16 Apr 2019
Devon P. Ryan, Max Planck Institute of Immunobiology and Epigenetics (MPI-IE), Freiburg, Germany 
Approved
Writing and using analysis pipelines has become a bioinformatician's (or more generally a data analyst's) bread and butter. There are a number of frameworks to perform such analyses, of which CGAT-Ruffus is preferred by many. CGAT-core brings some welcome functionality ...
HOW TO CITE THIS REPORT
Ryan DP. Reviewer Report For: CGAT-core: a python framework for building scalable, reproducible computational biology workflows [version 2; peer review: 1 approved, 1 approved with reservations]. F1000Research 2019, 8:377 (https://doi.org/10.5256/f1000research.20448.r46758)
  • Author Response 16 Jul 2019
    Adam Cribbs, MRC WIMM Centre for Computational Biology, University of Oxford, Oxford, OX3 9DS, UK
    16 Jul 2019
    Author Response
    We would like to thank you for taking the time to review the manuscript, code and documentation so thoroughly. We are grateful for your helpful suggestions and we have updated ...
