Keywords
SARS-CoV-2, genome-sampler, QIIME 2, bioinformatics, genomics
This article is included in the Emerging Diseases and Outbreaks gateway.
This article is included in the Coronavirus collection.
SARS-CoV-2, genome-sampler, QIIME 2, bioinformatics, genomics
The intersection of the SARS-CoV-2 outbreak and the genomics revolution has led to the rapid accumulation of viral genomes that are fueling our epidemiological understanding of the global pandemic. However, the rate of genome sequencing is challenging our ability to conduct comprehensive analyses in a timely manner. Local networks of health care professionals, laboratory professionals, and researchers are rapidly generating genome sequences at an unprecedented rate and feeding these data into global community resources, such as GISAID1 and GenBank2. Contextualizing locally-derived genome sequences with those from global resources (e.g., as recently performed by the Arizona COVID-19 Genomics Union3) enables phylogenetic analyses that can provide information about the relative roles of local transmission versus repeated introductions. This can help to evaluate the utility of control measures, such as stay-at-home orders. These sequencing data thus enable a new paradigm in epidemiology, which must be facilitated by computational workflows designed to handle this scale of data.
Contextualization of locally derived genome sequences will generally begin with two collections of sequences: those obtained from a global community resource and those obtained locally. The widely used NextStrain4 platform refers to these sequence collections in their documentation as the context sequences and the focal sequences, respectively, and we adopt that terminology here.
To enable phylogenetic analysis of full-length SARS-CoV-2 genomes, for example with Bayesian methods or maximum likelihood methods with bootstrap support, subsampling the context sequences is essential for computational feasibility. To avoid introducing post-sequencing sampling biases into our analysis, we subsampled the context sequences across three axes: time, space (i.e., geographic dispersion of near neighbors of focal sequences), and viral genome diversity. Sampling across time enables us to reliably infer a molecular clock signal from the data by ensuring that our sample of viral genomes span as much time as possible and include the oldest available genomes. Sampling the context sequences to include near neighbors of the focal sequences that come from different geographic regions enables us to avoid erroneously describing groups of focal sequences as monophyletic. Sampling across viral diversity enables us to represent the known diversity of the virus in our analysis. Each of these steps additionally reduces the chance of over-represented genomes dominating the analysis. When data sets are relatively small, this process can be performed manually, but when numbers of context genomes measure in the thousands, tens of thousands, or even hundreds of thousands (which may be likely as the pandemic progresses), an automated and reproducible subsampling approach is essential to maximize efficiency and to avoid human error.
Here we present genome-sampler5, a QIIME 2 plugin that enables other research teams to apply our context sequence subsampling workflow. Our subsampling workflow is compatible with tools such as NextStrain4, which includes a similar but not identical subsampling process (details provided in the Discussion section). We believe that our workflow can reduce sampling bias in analysis of SARS-CoV-2 genomes, and could be applied for regionally focused analyses, such as ours, or globally focused analyses. QIIME 26 (https://qiime2.org) is a plugin-based bioinformatics software platform developed for microbiome multi-omics analysis. It includes a unique retrospective data provenance tracking system that ensures reproducibility of bioinformatics steps by recording details of all analysis steps (commands called, parameters and input arguments provided, as well as details of the computational environment where the analysis was run, such as versions of underlying software dependencies; see examples at https://view.qiime2.org and in Figure 2 of the QIIME 2 paper6). We built this functionality as a QIIME 2 plugin because, given the pace at which SARS-CoV-2 genomics research is currently being carried out, human error in bioinformatics workflows is likely and the detailed record keeping needed to ensure reproducibility may be inadvertently skipped. QIIME 2 ensures that workflow errors could be detected retroactively and that workflows can be reproduced, even if detailed records are not kept while they are being run.
genome-sampler5 operates on three input files: a fasta file containing the unaligned context sequences, a fasta file containing the unaligned focal sequences, and a tab-separated text file containing metadata for the context sequences. The context sequences and metadata will typically be obtained by the user from a public repository such as GISAID. The focal sequences will typically be sequences that the team has compiled independently, for example from their locale.
genome-sampler can be installed in a conda environment on macOS or Linux systems, as described in its installation documentation linked from the project website. The complete workflow can be applied in one step using the included Snakemake7 workflow, or the steps can be applied individually.
Here we describe the series of steps taken by the genome-sampler5 workflow (see Figure 1). In each step, any parameter values that can be overridden by the user are bolded. This description is accompanied by an online tutorial, available from the project website, which illustrates a use case focused on a small set of sequences obtained from GISAID. The tutorial is tested with each release of genome-sampler to ensure that all commands remain up to date.
The genome-sampler workflow works as follows:
1. Clean up and filter the context sequences.
i. Filter sequences that contain non-IUPAC characters8 as these characters can be problematic for downstream tools, such as sequence aligners or alignment viewers.
ii. Remove any gap (“-” or “.”) characters, as this workflow is intended to work on unaligned sequences. (Aligned reference sequences can be provided as input since they will be unaligned in this step.)
iii. Filter sequences that are composed of >10% N characters.
2. Uniformly sample context sequences across time, selecting 7 sequences from each 7-day period between the earliest and latest dates represented in the data set. If there are fewer than 7 sequences in any 7-day period, all sequences from that period are included in the result. These sequences are referred to as the temporally sampled context sequences. The user can optionally supply a start date, in which case any genomes from before that time will be excluded.
3. Search focal sequences against context sequences to identify their 10 closest matches. This is achieved using vsearch’s usearch_global option9 at 99.99 percent identity. The resulting collections of best hits are sampled to select 3 geographically distinct context sequences for inclusion in the subsampled context sequence collection. This sampling procedure is weighted such that each geographic region has an equal probability of selection instead of each genome. This weighting prevents overrepresented regions from dominating the sample. This step ensures that any monophylies of target sequences are not artifacts of our sequence sampling approach. These sequences are referred to as the geographically sampled context sequences. (This step is achieved using sequence metadata, and can be parameterized so that this can be applied over any categorical metadata, not just geography.)
4. Cluster the complete context sequence collection with vsearch’s cluster_fast option at 99.90 percent identity. The resulting cluster centroid sequences represent a divergent collection of the SARS-CoV-2 genomes and are referred to as the diversity sampled context sequences.
5. Combine the temporally, geographically, and diversity sampled context sequences with the focal sequence collection. The resulting collection of sequences will be deduplicated by sequence identifier, so sequences contained in multiple different subsamples are represented only once in the final sequence collection. This final collection of sequences should be used for downstream analysis.
The NextStrain workflow also subsamples context sequences for its phylogenetic tree builds using augur (https://github.com/nextstrain/augur) and scripts in their ncov repository (https://github.com/nextstrain/ncov). Their workflow subsamples the context sequences across two axes: time and geography, prioritizing similarity to focal sequences when selecting sequences from different geographic regions. They sample across time by including a specified number of sequences per month for different regions. When determining the closest matches, percent identity is computed based on a multiple sequence alignment of all sequences, which is computed by aligning each sequence against a reference alignment using mafft10.
Step 2 of our workflow is similar to their time sampling approach, but is independent of other variables such as geography. The workflows diverge more in Step 3, where we begin by identifying near neighbors of all focal sequences using global alignment search with vsearch. We then optionally sample across the geographic source of those sequences such that each geographic region represented in each collection of near neighbors has an equal probability of selection. We follow this with Step 4, where we sample the full genetic diversity of the context sequences by clustering them all against one another and including the resulting cluster centroid sequences in our final sequence collection. As far as we are aware, there is not an analog to our Step 4 in the NextStrain workflow.
Our workflow is modular in design to facilitate benchmarking and optimization of this essential context sequence sampling step. Our three sampling steps can be used individually or in any combination, and can be replaced with a random sampling step (the sample-random action) to allow evaluation of the importance of each step. At this stage, we do not claim that our workflow is better than the one used by NextStrain. We hope the similarity of our interfaces (both of which require the same input and output, are accessible through Snakemake, and use the same terminology to describe data) will allow for independent comparison of these and other approaches. In our next stage of work on this project, we plan to evaluate the impact of each subsampling step and their associated parameters on downstream phylogenetic results.
The retrospective data provenance tracking system implemented in QIIME 2 differs from other systems such as Snakemake7 or Galaxy11, which we view as providing prospective data provenance tracking. For example, when a Snakemake file is used to run a workflow, that workflow is documented for reproducibility by the Snakemake file. However, if a user were to run the underlying commands independently, they must keep detailed records of their commands to ensure reproducibility of the analysis. This is not necessary with QIIME 2’s retrospective data provenance tracking system, which records steps regardless of whether the workflow is run using a tool like Snakemake or Galaxy, or whether individual components are run independently. Additionally, QIIME 2’s system assigns universally unique identifiers (UUIDs) to all execution steps, inputs, and outputs, so data can be unambiguously linked to workflow descriptions. QIIME 2 is therefore fully compatible with workflow engines such as Snakemake or Galaxy, but provides additional information which further ensures reproducibility.
We present genome-sampler5, a QIIME 2 plugin that supports subsampling of genomic sequence collections based on time of genome isolation, geography of genome isolation, and genomic diversity, thus facilitating genomic epidemiology based on large numbers of genomes while reducing the possibility of post-sequencing sampling bias impacting conclusions. As the number of available SARS-CoV-2 genomes continues to increase rapidly, approaches such as this will be required to enable phylogenetic and other analyses of genome data.
The context sequences and metadata used in the genome-sampler Use case were obtained from GISAID. Those genomes were sampled from patients in Arizona, USA, and published to GISAID by the Arizona COVID-19 Genomics Union (ACGU). The focal sequences and metadata used in the genome-sampler Use case were sequenced at a later time than the context sequences, also from patients in Arizona. The focal sequences were generated and assembled by the ACGU and are currently being added to GISAID. These context and focal sequences and associated metadata are all available for download for use in learning genome-sampler (see the project website). For analysis purposes, we recommend obtaining sequences from a public repository, such as GISAID or GenBank, as those sequences will be updated (for example to improve genome assemblies) before our tutorial data is updated.
genome-sampler source code available at: at https://github.com/caporaso-lab/genome-sampler.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.38918195.
License: BSD 3-Clause "New" or "Revised" License.
Documentation, written using Myst (https://myst-parser.readthedocs.io/en/latest/) and rendered using Jupyter Book (https://jupyterbook.org/), is available at http://caporasolab.us/genome-sampler/. If you need technical support, please post a question to the QIIME 2 Forum at https://forum.qiime2.org. We are very interested in contributions to genome-sampler from the community - please get in touch via the GitHub issue tracker or the QIIME 2 Forum if you’re interested in contributing.
We are deeply indebted to the patients who provided SARS-CoV-2 samples (via an uncomfortable procedure) while suffering from this new pathogen, to the medical professionals who bravely collected the samples that form the basis for our genomic epidemiology work, and to the laboratory professionals who processed and sequenced genomes from these samples. Furthermore, we would like to acknowledge the GISAID sequence contributors from across the world who have sequenced genomes at an unprecedented rate and made them available for public use.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: bioinformatics and software development
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: I am involved in Nextstrain, which (as the paper mentions) provides a similar subsampling approach for SARS-CoV-2 sequences. (I am a proponent of seeing a diverse suite of tools developed which can be used interchangeably for genomic epidemiology research.)
Reviewer Expertise: Bioinformatics, phylogenetics
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 2 (revision) 28 Oct 20 |
read | |
Version 1 29 Jun 20 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)