ABSTRACT
Cancer data is widely available in repositories such as the National Cancer Institute (NCI) Genomic Data Commons (GDC). These datasets could serve as controls or comparisons in compendium analyses with user data, avoiding the expense and time of generating additional datasets. However, these comparisons are useful only if users can process their new data in the same manner, which can be non-trivial. Although the executables themselves are usually available in repositories, the GDC pipelines that describe the entire analysis workflow are currently published as text-based standard operating procedures (SOPs). It is difficult to document a computational workflow at the level of detail and accuracy required to reproduce the results. Version discrepancies and omitted details accumulate as the documentation inevitably lags behind code revisions. We address this problem by converting the SOPs into a downloadable and executable format.
Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. These can be applied to reproducibly process user data and to harmonize datasets across repositories. Using our publicly available graphical workflows, we harmonize raw RNA-Seq datasets from the GDC and the Genotype-Tissue Expression (GTEx) project that were originally processed using different methodologies. This illustrates the importance of uniform processing of control and treatment data for accurate inference of differentially expressed genes. By disseminating the analytical methodology in a reproducible and easily executed form, we greatly increase the utility of the GDC: researchers can uniformly process custom data and datasets across multiple repositories to enhance data interpretation. Our approach of making the analytical process as readily available as the data, together with our open-source executable workflows, can be applied to other data repositories to increase their impact on scientific research.
Competing Interest Statement
LHH and KYY have equity interest in Biodepot LLC, which receives compensation from NCI SBIR contract numbers 75N91020C00009 and 75N91021C00022. The terms of this arrangement have been reviewed and approved by the University of Washington in accordance with its policies governing outside work and financial conflicts of interest in research.
Footnotes
In this revision, we updated the content to reflect the latest data releases from the NCI Genomic Data Commons. We also clarified our contributions by revising the title, abstract, and introduction. In addition, we re-tested our workflows, cleaned up the GitHub repository, added documentation, and retained only the workflows that work.
List of Abbreviations
- AMI: Amazon Machine Image
- API: application programming interface
- AWS: Amazon Web Services
- Bwb: Biodepot-workflow-builder
- CPTAC: Clinical Proteomic Tumor Atlas Consortium
- CRDC: Cancer Research Data Commons
- DCFS: Data Commons Framework Services
- dbGaP: database of Genotypes and Phenotypes
- DNA-Seq: DNA sequencing
- DTT: Data Transfer Tool
- EC2: Elastic Compute Cloud
- GDC: Genomic Data Commons
- IGV: Integrative Genomics Viewer
- miRNA-Seq: microRNA sequencing
- NCI: National Cancer Institute
- NGS: next-generation sequencing
- PON: Panel of Normals
- RNA-Seq: RNA sequencing
- TARGET: Therapeutically Applicable Research to Generate Effective Treatments
- TCGA: The Cancer Genome Atlas
- WGS: whole genome sequencing
- WXS: whole exome sequencing