Software Tool Article

Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 29 Jun 2020

This article is included in the Emerging Diseases and Outbreaks gateway.

This article is included in the Coronavirus collection.

Abstract

The COVID-19 pandemic has led to a rapid accumulation of SARS-CoV-2 genomes, enabling genomic epidemiology on local and global scales. Collections of genomes from resources such as GISAID must be subsampled to enable computationally feasible phylogenetic and other analyses. We present genome-sampler, a software package that supports sampling collections of viral genomes across multiple axes including time of genome isolation, location of genome isolation, and viral diversity. The software is modular in design so that these or future sampling approaches can be applied independently and combined (or replaced with a random sampling approach) to facilitate custom workflows and benchmarking. genome-sampler is written as a QIIME 2 plugin, ensuring that its application is fully reproducible through QIIME 2’s unique retrospective data provenance tracking system. genome-sampler can be installed in a conda environment on macOS or Linux systems. A complete default pipeline is available through a Snakemake workflow, so subsampling can be achieved using a single command. genome-sampler is open source, free for all to use, and available at https://caporasolab.us/genome-sampler. We hope that this will facilitate SARS-CoV-2 research and support evaluation of viral genome sampling approaches for genomic epidemiology.

Keywords

SARS-CoV-2, genome-sampler, QIIME 2, bioinformatics, genomics

Introduction

The intersection of the SARS-CoV-2 outbreak and the genomics revolution has led to the rapid accumulation of viral genomes that are fueling our epidemiological understanding of the global pandemic. However, the rate of genome sequencing is challenging our ability to conduct comprehensive analyses in a timely manner. Local networks of health care professionals, laboratory professionals, and researchers are generating genome sequences at an unprecedented rate and feeding these data into global community resources, such as GISAID [1] and GenBank [2]. Contextualizing locally derived genome sequences with those from global resources (e.g., as recently performed by the Arizona COVID-19 Genomics Union [3]) enables phylogenetic analyses that can provide information about the relative roles of local transmission versus repeated introductions. This can help to evaluate the utility of control measures, such as stay-at-home orders. These sequencing data thus enable a new paradigm in epidemiology, which must be facilitated by computational workflows designed to handle this scale of data.

Contextualization of locally derived genome sequences will generally begin with two collections of sequences: those obtained from a global community resource and those obtained locally. The widely used NextStrain [4] platform refers to these sequence collections in its documentation as the context sequences and the focal sequences, respectively, and we adopt that terminology here.

To enable phylogenetic analysis of full-length SARS-CoV-2 genomes, for example with Bayesian methods or maximum likelihood methods with bootstrap support, subsampling the context sequences is essential for computational feasibility. To avoid introducing post-sequencing sampling biases into our analysis, we subsample the context sequences across three axes: time, space (i.e., geographic dispersion of near neighbors of focal sequences), and viral genome diversity. Sampling across time enables us to reliably infer a molecular clock signal from the data by ensuring that our sample of viral genomes spans as much time as possible and includes the oldest available genomes. Sampling the context sequences to include near neighbors of the focal sequences that come from different geographic regions enables us to avoid erroneously describing groups of focal sequences as monophyletic. Sampling across viral diversity enables us to represent the known diversity of the virus in our analysis. Each of these steps additionally reduces the chance of over-represented genomes dominating the analysis. When data sets are relatively small, this process can be performed manually, but when numbers of context genomes measure in the thousands, tens of thousands, or even hundreds of thousands (which is likely as the pandemic progresses), an automated and reproducible subsampling approach is essential to maximize efficiency and to avoid human error.

Here we present genome-sampler [5], a QIIME 2 plugin that enables other research teams to apply our context sequence subsampling workflow. Our subsampling workflow is compatible with tools such as NextStrain [4], which includes a similar but not identical subsampling process (details provided in the Discussion section). We believe that our workflow can reduce sampling bias in analysis of SARS-CoV-2 genomes, and could be applied for regionally focused analyses, such as ours, or globally focused analyses. QIIME 2 [6] (https://qiime2.org) is a plugin-based bioinformatics software platform developed for microbiome multi-omics analysis. It includes a unique retrospective data provenance tracking system that ensures reproducibility of bioinformatics steps by recording details of all analysis steps: the commands called, the parameters and input arguments provided, and details of the computational environment where the analysis was run, such as versions of underlying software dependencies (see examples at https://view.qiime2.org and in Figure 2 of the QIIME 2 paper [6]). We built this functionality as a QIIME 2 plugin because, given the pace at which SARS-CoV-2 genomics research is currently being carried out, human error in bioinformatics workflows is likely and the detailed record keeping needed to ensure reproducibility may be inadvertently skipped. QIIME 2 ensures that workflow errors can be detected retroactively and that workflows can be reproduced, even if detailed records are not kept while they are being run.

Methods

Implementation

genome-sampler [5] operates on three input files: a fasta file containing the unaligned context sequences, a fasta file containing the unaligned focal sequences, and a tab-separated text file containing metadata for the context sequences. The context sequences and metadata will typically be obtained by the user from a public repository such as GISAID. The focal sequences will typically be sequences that the team has compiled independently, for example from their locale.

Operation

genome-sampler can be installed in a conda environment on macOS or Linux systems, as described in its installation documentation linked from the project website. The complete workflow can be applied in one step using the included Snakemake [7] workflow, or the steps can be applied individually.

Use case

Here we describe the series of steps taken by the genome-sampler [5] workflow (see Figure 1). In each step, any parameter values that can be overridden by the user are bolded. This description is accompanied by an online tutorial, available from the project website, which illustrates a use case focused on a small set of sequences obtained from GISAID. The tutorial is tested with each release of genome-sampler to ensure that all commands remain up to date.

Figure 1. The genome-sampler workflow.

This workflow samples context sequences for downstream phylogenetic analysis. Specific steps are represented by boxes: the QIIME 2 plugin name is bolded, and the action name is in monospace font. Inputs and outputs are represented by folded-page file icons. The surrounding dashed box represents the Snakemake workflow, which automates execution of the contained steps. Given context metadata, context sequences, and focal sequences, the Snakemake workflow will produce a fasta file that is ready for alignment and a summary of the sampling procedure as a QIIME 2 visualization.

The genome-sampler workflow works as follows:

  1. Clean up and filter the context sequences.

    i. Filter out sequences that contain non-IUPAC characters [8], as these characters can be problematic for downstream tools, such as sequence aligners or alignment viewers.

    ii. Remove any gap (“-” or “.”) characters, as this workflow is intended to work on unaligned sequences. (Aligned reference sequences can be provided as input since they will be unaligned in this step.)

    iii. Filter out sequences that are composed of >10% N characters.

  2. Uniformly sample context sequences across time, selecting 7 sequences from each 7-day period between the earliest and latest dates represented in the data set. If there are fewer than 7 sequences in any 7-day period, all sequences from that period are included in the result. These sequences are referred to as the temporally sampled context sequences. The user can optionally supply a start date, in which case any genomes from before that time will be excluded.

  3. Search focal sequences against context sequences to identify their 10 closest matches. This is achieved using vsearch’s usearch_global option [9] at 99.99 percent identity. The resulting collections of best hits are sampled to select 3 geographically distinct context sequences for inclusion in the subsampled context sequence collection. This sampling procedure is weighted so that each geographic region, rather than each genome, has an equal probability of selection. This weighting prevents overrepresented regions from dominating the sample. This step ensures that any monophylies of focal sequences are not artifacts of our sequence sampling approach. These sequences are referred to as the geographically sampled context sequences. (This step is achieved using sequence metadata, and can be parameterized so that it can be applied over any categorical metadata, not just geography.)

  4. Cluster the complete context sequence collection with vsearch’s cluster_fast option at 99.90 percent identity. The resulting cluster centroid sequences represent a divergent collection of the SARS-CoV-2 genomes and are referred to as the diversity sampled context sequences.

  5. Combine the temporally, geographically, and diversity sampled context sequences with the focal sequence collection. The resulting collection of sequences will be deduplicated by sequence identifier, so sequences contained in multiple different subsamples are represented only once in the final sequence collection. This final collection of sequences should be used for downstream analysis.
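
Steps 1, 2, and 5 of the workflow can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions and not the genome-sampler implementation: the function names, the `seed` argument, the case-insensitive handling of sequences, the choice to compute the N fraction on the degapped sequence, and the choice to anchor the 7-day periods at the earliest retained date are all assumptions made for the example.

```python
import random
from collections import defaultdict
from datetime import date

# IUPAC nucleotide codes. Gap characters are stripped before this check,
# which assumes gaps themselves do not cause a sequence to be filtered.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
GAP_CHARACTERS = "-."


def clean_sequence(seq, max_n_fraction=0.10):
    """Step 1: return the degapped sequence, or None if it is filtered out."""
    # Assumption: sequences are compared case-insensitively.
    degapped = seq.upper().translate(str.maketrans("", "", GAP_CHARACTERS))
    if not degapped or not set(degapped) <= IUPAC_NUCLEOTIDES:
        return None
    # Assumption: the N fraction is measured on the degapped sequence.
    if degapped.count("N") / len(degapped) > max_n_fraction:
        return None
    return degapped


def sample_across_time(dates_by_id, per_period=7, period_days=7,
                       start=None, seed=0):
    """Step 2: pick up to `per_period` ids from each `period_days`-day period."""
    if start is not None:
        dates_by_id = {i: d for i, d in dates_by_id.items() if d >= start}
    if not dates_by_id:
        return []
    # Assumption: periods are counted from the earliest retained date.
    origin = min(dates_by_id.values())
    periods = defaultdict(list)
    for seq_id, collected in sorted(dates_by_id.items()):
        periods[(collected - origin).days // period_days].append(seq_id)
    rng = random.Random(seed)
    sampled = []
    for _, ids in sorted(periods.items()):
        # Periods with fewer than `per_period` sequences are kept whole.
        sampled.extend(ids if len(ids) <= per_period
                       else rng.sample(ids, per_period))
    return sampled


def combine(*id_collections):
    """Step 5: merge subsamples, keeping each sequence identifier once."""
    merged = {}
    for ids in id_collections:
        for seq_id in ids:
            merged.setdefault(seq_id, None)
    return list(merged)
```

For example, `combine(temporal_ids, geographic_ids, diversity_ids, focal_ids)` would return each identifier once, in order of first appearance, even when a genome was selected by more than one sampling step.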

Discussion

Resemblance to NextStrain context sequence sampling workflow

The NextStrain workflow also subsamples context sequences for its phylogenetic tree builds using augur (https://github.com/nextstrain/augur) and scripts in their ncov repository (https://github.com/nextstrain/ncov). Their workflow subsamples the context sequences across two axes: time and geography, prioritizing similarity to focal sequences when selecting sequences from different geographic regions. They sample across time by including a specified number of sequences per month for different regions. When determining the closest matches, percent identity is computed based on a multiple sequence alignment of all sequences, which is computed by aligning each sequence against a reference alignment using mafft [10].

Step 2 of our workflow is similar to their time sampling approach, but is independent of other variables such as geography. The workflows diverge more in Step 3, where we begin by identifying near neighbors of all focal sequences using global alignment search with vsearch. We then optionally sample across the geographic source of those sequences such that each geographic region represented in each collection of near neighbors has an equal probability of selection. We follow this with Step 4, where we sample the full genetic diversity of the context sequences by clustering them all against one another and including the resulting cluster centroid sequences in our final sequence collection. As far as we are aware, there is not an analog to our Step 4 in the NextStrain workflow.
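
One way to give each geographic region, rather than each genome, an equal probability of selection is to choose regions first and genomes second. The sketch below illustrates that idea for the near neighbors of a single focal sequence. It is not the genome-sampler implementation: the function name, its parameters, and the two-stage selection strategy are assumptions made for illustration, and the vsearch search that produces the list of near neighbors is assumed to have been run already.

```python
import random
from collections import defaultdict


def sample_neighbors_by_region(hit_ids, region_by_id, n=3, seed=0):
    """Select up to `n` near neighbors from distinct regions.

    hit_ids: identifiers of the closest context sequences for one focal
        sequence (hypothetical input, e.g. parsed from search results).
    region_by_id: mapping of context sequence identifier to region.
    """
    by_region = defaultdict(list)
    for seq_id in hit_ids:
        by_region[region_by_id[seq_id]].append(seq_id)

    rng = random.Random(seed)
    # Each region is equally likely to be drawn, however many hits it has.
    regions = sorted(by_region)
    chosen_regions = rng.sample(regions, min(n, len(regions)))
    # One genome is then drawn uniformly from within each chosen region.
    return [rng.choice(by_region[region]) for region in chosen_regions]
```

With eight hits from one region and one hit from each of two others, a uniform draw over genomes would usually return mostly genomes from the first region, whereas this sketch returns one genome from each of the three regions.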

Our workflow is modular in design to facilitate benchmarking and optimization of this essential context sequence sampling step. Our three sampling steps can be used individually or in any combination, and can be replaced with a random sampling step (the sample-random action) to allow evaluation of the importance of each step. At this stage, we do not claim that our workflow is better than the one used by NextStrain. We hope the similarity of our interfaces (both of which require the same input and output, are accessible through Snakemake, and use the same terminology to describe data) will allow for independent comparison of these and other approaches. In our next stage of work on this project, we plan to evaluate the impact of each subsampling step and their associated parameters on downstream phylogenetic results.

Retrospective data provenance tracking system

The retrospective data provenance tracking system implemented in QIIME 2 differs from other systems such as Snakemake [7] or Galaxy [11], which we view as providing prospective data provenance tracking. For example, when a Snakemake file is used to run a workflow, that workflow is documented for reproducibility by the Snakemake file. However, if a user were to run the underlying commands independently, they would need to keep detailed records of their commands to ensure reproducibility of the analysis. This is not necessary with QIIME 2’s retrospective data provenance tracking system, which records steps regardless of whether the workflow is run using a tool like Snakemake or Galaxy, or whether individual components are run independently. Additionally, QIIME 2’s system assigns universally unique identifiers (UUIDs) to all execution steps, inputs, and outputs, so data can be unambiguously linked to workflow descriptions. QIIME 2 is therefore fully compatible with workflow engines such as Snakemake or Galaxy, but provides additional information which further ensures reproducibility.
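
The idea of retrospective provenance can be illustrated with a toy example in which every result carries a UUID together with the action, parameters, and inputs that produced it, so the history can be reconstructed from the result alone. This is a conceptual sketch only: the `Result` class, the `run` and `history` helpers, and the action names are hypothetical and do not reflect QIIME 2's API or storage format.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Result:
    """A piece of data bundled with a record of how it was produced."""
    data: object
    action: str
    parameters: dict
    inputs: tuple = ()
    identifier: str = field(default_factory=lambda: str(uuid.uuid4()))


def run(action, func, *inputs, **parameters):
    """Apply `func` and record the step, without the user keeping notes."""
    data = func(*(r.data for r in inputs), **parameters)
    return Result(data=data, action=action,
                  parameters=parameters, inputs=inputs)


def history(result):
    """Walk back from a result and list every step that led to it."""
    steps = []
    for parent in result.inputs:
        steps.extend(history(parent))
    steps.append((result.identifier, result.action, result.parameters))
    return steps


# Hypothetical two-step analysis run interactively, outside any workflow file.
raw = Result(data=["ACGT", "AC"], action="import", parameters={})
kept = run("filter-by-length",
           lambda seqs, min_length: [s for s in seqs if len(s) >= min_length],
           raw, min_length=3)
for identifier, action, parameters in history(kept):
    print(identifier, action, parameters)
```

Because the record travels with the result, the same history is available whether the steps were launched by a workflow engine or typed one at a time.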

We present genome-sampler [5], a QIIME 2 plugin that supports subsampling of genomic sequence collections based on time of genome isolation, geography of genome isolation, and genomic diversity, thus facilitating genomic epidemiology based on large numbers of genomes while reducing the possibility of post-sequencing sampling bias impacting conclusions. As the number of available SARS-CoV-2 genomes continues to increase rapidly, approaches such as this will be required to enable phylogenetic and other analyses of genome data.

Data availability

Source data

The context sequences and metadata used in the genome-sampler Use case were obtained from GISAID. Those genomes were sampled from patients in Arizona, USA, and published to GISAID by the Arizona COVID-19 Genomics Union (ACGU). The focal sequences and metadata used in the genome-sampler Use case were sequenced at a later time than the context sequences, also from patients in Arizona. The focal sequences were generated and assembled by the ACGU and are currently being added to GISAID. These context and focal sequences and associated metadata are all available for download for use in learning genome-sampler (see the project website). For analysis purposes, we recommend obtaining sequences from a public repository, such as GISAID or GenBank, as those sequences will be updated (for example to improve genome assemblies) before our tutorial data is updated.

Software availability

genome-sampler source code is available at: https://github.com/caporaso-lab/genome-sampler.

Archived source code at time of publication: https://doi.org/10.5281/zenodo.3891819 [5].

License: BSD 3-Clause "New" or "Revised" License.

Documentation, written using MyST (https://myst-parser.readthedocs.io/en/latest/) and rendered using Jupyter Book (https://jupyterbook.org/), is available at http://caporasolab.us/genome-sampler/. If you need technical support, please post a question to the QIIME 2 Forum at https://forum.qiime2.org. We are very interested in contributions to genome-sampler from the community; please get in touch via the GitHub issue tracker or the QIIME 2 Forum if you’re interested in contributing.

how to cite this article
Bolyen E, Dillon MR, Bokulich NA et al. Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity [version 1; peer review: 1 approved, 1 approved with reservations] F1000Research 2020, 9:657 (https://doi.org/10.12688/f1000research.24751.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Version 1
Reviewer Report 27 Aug 2020
C. Titus Brown, Department of Population Health and Reproduction, University of California, Davis, Davis, CA, USA 
Approved with Reservations
  • This paper describes a software package, genome-sampler, that subsamples collections of SARS-CoV-2 genomes with attention to various metadata attributes. The paper is well motivated and well written.
     
  • My review focuses on the
... Continue reading
HOW TO CITE THIS REPORT
Brown CT. Reviewer Report For: Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2020, 9:657 (https://doi.org/10.5256/f1000research.27305.r67936)
  • Reviewer Response 28 Aug 2020
    C. Titus Brown, Department of Population Health and Reproduction, University of California, Davis, Davis, USA
    28 Aug 2020
    Reviewer Response
    I have managed to run the pipeline successfully, and have the following additional comments --

    ---

    I successfully ran the tutorial, huzzah!

    It would be good to ... Continue reading
  • Author Response 28 Oct 2020
    Greg Caporaso, Northern Arizona University, USA
    28 Oct 2020
    Author Response
    Thank you for the feedback on our manuscript, and for testing the software and offering suggestions. Below we provide a point-by-point reply. Your comments are presented in italics, and we ... Continue reading
Reviewer Report 31 Jul 2020
James Hadfield, Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA 
Approved
This paper presents a software tool to tackle a pressing, but welcome, problem: the number of publicly shared SARS-CoV-2 sequences (c. 75,000 at the time of this review) are too numerous to be analysed or visualised using currently available methods ... Continue reading
HOW TO CITE THIS REPORT
Hadfield J. Reviewer Report For: Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2020, 9:657 (https://doi.org/10.5256/f1000research.27305.r65756)
  • Author Response 28 Oct 2020
    Greg Caporaso, Northern Arizona University, USA
    28 Oct 2020
    Author Response
    Thanks for your thoughtful review of our manuscript! We have reviewed your comments and are submitting a revision that addresses them as detailed here. The reviewer comments are presented in italics, ... Continue reading