Coronavirus discovery by metagenomic sequencing: a tool for pandemic preparedness

Introduction: The SARS-CoV-2 pandemic of 2020 is a prime example of the omnipresent threat of emerging viruses that can infect humans. A protocol for the identification of novel coronaviruses by viral metagenomic sequencing in diagnostic laboratories may contribute to pandemic preparedness.
Aim: The aim of this study is to validate a metagenomic virus discovery protocol as a tool for coronavirus pandemic preparedness.
Methods: The performance of a viral metagenomic protocol for the identification of novel coronaviruses in a clinical setting was tested using clinical samples containing SARS-CoV-2, SARS-CoV, and MERS-CoV. To mimic virus discovery, reads were classified against databases restricted to viruses known before the discovery dates of these coronaviruses.
Results: Classification of NGS reads using Centrifuge and Genome Detective resulted in assignment of the reads to the closest relatives of the emerging coronaviruses. Low nucleotide and amino acid identity (81% and 84%, respectively, for SARS-CoV-2) in combination with up to 98% genome coverage were indicative of a related, novel coronavirus. Capture probes targeting vertebrate viruses, designed in 2015, enhanced both sequencing depth and coverage of the SARS-CoV-2 genome, the latter increasing from 71% to 98%.
Conclusion: The model used for simulation of virus discovery enabled validation of the metagenomic sequencing protocol. The metagenomic protocol with virus probes designed before the pandemic can assist the detection and identification of novel coronaviruses directly in clinical samples.


Introduction
The Severe Acute Respiratory Syndrome Coronavirus type 2 (SARS-CoV-2) pandemic of 2020 demonstrates the devastating effect an emerging virus can have. Although previous pandemics such as the Spanish Flu (1918) and Asian Flu (1957) resulted in a multitude of fatal cases, the SARS-CoV-2 pandemic has had an unprecedented impact on public health, the economy and society as a whole. In 2002 and 2012 respectively, the Severe Acute Respiratory Syndrome (SARS) [1] and Middle East Respiratory Syndrome (MERS) coronaviruses [2] emerged as zoonotic infections causing severe respiratory disease, and continued introductions of MERS-CoV remain a public health threat to this day [3].
Pandemic preparedness comprises strategies and measures to protect human health and lives in anticipation of the worldwide spread of (re-)emerging pathogens. Pandemic preparedness plans [4] focus on measures to contain and control the spread of emerging pathogens. Early detection of the pathogen is the mainstay of initiating infection control measures. Global surveillance as a component of the International Health Regulations (IHR) aims at early detection and monitoring of human cases of zoonotic diseases with pandemic potential [5]. Pandemic surveillance plans commonly focus on specific viruses, such as influenza, and depend on targeted detection of these specific viral threats, which limits the detection of unanticipated and novel viruses. The current SARS-CoV-2 pandemic shows the need for unbiased identification of potential pathogens.
Metagenomic Next-Generation Sequencing (mNGS) enables hypothesis-free sequencing of all nucleic acids in a given sample, including genomes of pathogens. All sequences are amplified, followed by classification of sequences based on a reference database. While research applications are more common, mNGS is being introduced in clinical diagnostic laboratories, as indicated by recently diagnosed cases of encephalitis [6]. Implementation of mNGS in clinical diagnostics requires validation of metagenomic protocols. Metagenomic protocols and pipelines have been successfully used for detection of known pathogens [6,7,8]. However, detection and identification of novel, previously unknown emerging viruses presents a challenge due to the absence of their genome sequences in reference databases.
In this study, we validated the identification of emerging coronaviruses by a viral metagenomic protocol, using clinical samples with SARS-CoV-2, and samples spiked with cultivated isolates SARS-CoV Frankfurt-1 (SARS-CoV) and MERS-CoV EMC/2012 (MERS-CoV). The validation included analysis of the performance of both an in-house and a commercially available data analysis pipeline, Genome Detective [9]. Identification of coronaviruses was tested using modified databases lacking SARS-CoV-2, SARS-CoV, and MERS-CoV, mimicking the situation at the time of virus discovery. Additionally, the efficacy of detection of novel coronaviruses using capture probes targeting vertebrate viruses [10,11] known before the current pandemic was analyzed using a SARS-CoV-2 clinical sample.

Sample selection and preparation
Nasopharyngeal swabs were obtained from two patients who tested positive for SARS-CoV-2 by real-time PCR targeting the SARS-CoV-2 E gene [12], with Cq values of 20 and 30, respectively. These PCRs were performed as part of routine diagnostics at the Clinical Microbiological Laboratory (CML) of the Leiden University Medical Center.

Metagenomic Next-Generation Sequencing (mNGS)
Library preparation and sequencing were performed using a previously validated protocol [15,16]. Briefly, 200 μL of each patient sample was spiked with equine arteritis virus (EAV) and phocid herpesvirus-1 (PhHV-1) prior to nucleic acid (NA) extraction using the MagNA Pure 96 DNA and Viral NA Small Volume extraction kit on the MagNA Pure 96 system (Roche, Basel, Switzerland), resulting in 100 μL of nucleic acid-containing eluate. Of this eluate, 50 μL per sample was used as input for the library prep, utilizing the NEBNext Ultra II Directional RNA Library prep kit for Illumina (New England Biolabs, Ipswich, MA, USA), dual indexed NEBNext Multiplex Oligos for Illumina (1.5 μM), and a protocol optimized for processing RNA and DNA simultaneously in a single tube [15].
Library preps of the samples were processed both with and without enrichment for viruses using sequence capture probes (see below). Subsequent sequencing was performed using a NovaSeq6000 sequencing system (Illumina, San Diego, CA, USA) at GenomeScan BV to obtain approximately 10 million 150 bp reads per sample.

Viral capture probe enrichment
Enrichment of viral sequences from the sample library pools was performed using the SeqCap EZ HyperCap kit according to the manufacturer's instructions (Roche, Basel, Switzerland). This kit uses a vertebrate virus SeqCap EZ probe pool designed to target a set of sequences from vertebrate viruses that were available in 2015 [10], including the following: Coronaviridae (NCBI:txid11118), Coronavirinae (NCBI:txid693995), Alphacoronavirus (NCBI:txid693996), Betacoronavirus (NCBI:txid694002), Gammacoronavirus (NCBI:txid694013), and Deltacoronavirus (NCBI:txid1159901). Amplified DNA libraries from two SARS-CoV-2 samples and one negative control, with a combined mass of 1 μg, were pooled in equal amounts in a single enrichment experiment. Some adaptations were made: human Cot DNA and blocking oligos (Integrated DNA Technologies, Coralville, IA, USA) were added to each enrichment pool to prevent nonspecific binding and binding of human DNA to the probes. Subsequently, hybridization to the probe pool was performed for 40 hours. Next, the HyperCap Bead kit was used for washing the captured DNA, followed by post-capture PCR amplification using the KAPA HiFi HotStart ReadyMix (2×) (Roche, Basel, Switzerland) and Illumina NGS primers (5 μM). The final washing step was performed using AMPure XP beads (Beckman Coulter, Inc., Brea, CA, USA), after which quality and quantity of the enriched libraries were assessed by Qubit analysis (Thermo Fisher, Waltham, MA, USA) and Bioanalyzer (Agilent, Santa Clara, CA, USA).

Sequence read classification: Centrifuge
After quality pre-processing using an in-house QC pipeline, Biopet version 0.9.0 [17], and removal of human reads after mapping them to [...]

Table 1. Classification of SARS-CoV-2, SARS-CoV, and MERS-CoV sequence reads using reference databases created before their emergence, using the metagenomic classifier Centrifuge. Columns: Sample; Untargeted mNGS, or viral enrichment by capture probes.

In-house virus discovery protocol
Pre-processed short reads were de novo assembled into contigs using SPAdes version 3.10.1 [22]. All contigs were analyzed using the NCBI Basic Local Alignment Search Tool (BLAST 2.8.1) [23] against NCBI's nucleotide (nt) database (accessed April 2018). Only viral hits for contigs with a length of ≥500 bp were selected to identify the best shared homology to viruses. A length of 500 bp was chosen to ensure coverage of the built contigs by at least 3 reads, to rule out any possible contamination. To mimic the virus discovery setting for SARS-CoV, MERS-CoV and SARS-CoV-2, only hits dated prior to the date of emergence of the respective virus were considered.
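The hit selection described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `BlastHit` record, its field names (in particular `submitted`, standing in for the date of the database entry) and the function `best_pre_emergence_hits` are hypothetical, and the sketch assumes the BLAST output has already been parsed into such records. Only the 500 bp threshold, the restriction to viral hits, and the date restriction come from the text; picking the lowest E-value per contig follows the Results.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable

MIN_CONTIG_LENGTH = 500  # bp, threshold stated in the text


@dataclass(frozen=True)
class BlastHit:
    """One parsed BLAST hit (hypothetical record layout)."""
    contig_id: str
    contig_length: int  # length of the query contig in bp
    subject: str        # name of the matched database sequence
    evalue: float
    is_viral: bool
    submitted: date     # assumed: date the subject entered the database


def best_pre_emergence_hits(
    hits: Iterable[BlastHit], emergence_date: date
) -> Dict[str, BlastHit]:
    """Return, per contig, the viral hit with the lowest E-value among
    hits dated before the emergence of the virus under study."""
    best: Dict[str, BlastHit] = {}
    for hit in hits:
        if not hit.is_viral or hit.contig_length < MIN_CONTIG_LENGTH:
            continue  # non-viral hit or contig too short
        if hit.submitted >= emergence_date:
            continue  # would not have been available at discovery
        current = best.get(hit.contig_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.contig_id] = hit
    return best
```

Applied to a contig that matches both a post-emergence genome and an older relative, the function reports only the older relative, which is the situation a laboratory would have faced at the time of discovery.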

Genome Detective: commercial classification and discovery tool
After removal of human reads, FASTQ files generated for SARS-CoV-2 samples (with and without viral enrichment) were uploaded for classification and de novo assembly by the commercial web-based tool Genome Detective v1.120 (www.genomedetective.com, accessed 2020-05-11) [9], using a reference database (generated 2019-09-21). In brief, after removal of low-quality reads and trimming by Trimmomatic [24], candidate viral reads were identified using the protein-based alignment method DIAMOND [25] in combination with the Swissprot UniRef90 protein database, followed by de novo assembly using metaSPAdes [26]. Blastx and Blastn [23] were used to search for candidate reference sequences using the NCBI RefSeq virus database (accessed 2019-09-21). Consensus sequences were produced by joining de novo contigs using Advanced Genome Aligner [27].

Classification of SARS-CoV-2, SARS-CoV, and MERS-CoV using databases created before the emergence of these viruses
To mimic the classification conditions present in the setting of virus discovery, viral metagenomic reference genome databases created before the emergence of SARS-CoV-2, SARS-CoV and MERS-CoV were used for the classification of sequence reads (December 2019 for the two SARS-CoV-2 positive samples, November 2002 for the SARS-CoV positive samples and June 2012 for the MERS-CoV positive samples). Classification results of viral reads are shown in Fig. 1.
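A date-restricted reference set of this kind can be sketched in Python as follows. This is an illustration under assumptions, not the procedure used in the study: the `ReferenceGenome` record and the function `pre_emergence_database` are hypothetical, and because the text gives only month and year, the sketch assumes the first day of each month as the cut-off.

```python
from datetime import date
from typing import Iterable, List, NamedTuple


class ReferenceGenome(NamedTuple):
    """Minimal reference record (hypothetical layout)."""
    accession: str
    taxon: str
    released: date  # assumed: date the genome became publicly available


# Month and year are taken from the text; the day is an assumption.
CUTOFFS = {
    "SARS-CoV-2": date(2019, 12, 1),
    "SARS-CoV": date(2002, 11, 1),
    "MERS-CoV": date(2012, 6, 1),
}


def pre_emergence_database(
    genomes: Iterable[ReferenceGenome], virus: str
) -> List[ReferenceGenome]:
    """Keep only reference genomes released before the cut-off of `virus`."""
    cutoff = CUTOFFS[virus]
    return [genome for genome in genomes if genome.released < cutoff]
```

The filtered list would then be used to build the classifier index, so that reads from the emerging virus can only be assigned to relatives that were known at the time.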

Virus discovery: de novo assembly
Results of de novo assembly of all samples for contigs longer than 500 bp are shown in Table 2. BLASTn was used to search for hits with sequence homology. Only viral hits with the lowest E-value of all matches identified that were submitted before the publication of SARS-CoV-2 genomes were considered. BLASTn search results of the contigs with Coronaviridae hits are listed in Table 2, including the length of the [...] Supplementary Fig. 1 and 2, respectively.

Table 2. Classification of SARS-CoV-2, SARS-CoV, and MERS-CoV de novo assembled contigs using BLAST.

Virus discovery of SARS-CoV-2 by Genome Detective
Genome Detective results for the identification of SARS-CoV-2 sequences using a database created before the emergence of SARS-CoV-2 are shown in Fig. 2. SARS-CoV-2 sequences were identified as SARS-CoV, with nucleotide and amino acid identities of 80-81% and 83-85%, respectively, in combination with up to 98% genome coverage, which is indicative of a novel virus.
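The reasoning in this paragraph (low identity to the closest known relative combined with high genome coverage points to a related but novel virus) can be written as a simple decision rule. The Python sketch below is illustrative only: the function name and both default thresholds are assumptions chosen for the example, not values defined or validated in this study.

```python
def looks_like_novel_relative(
    nt_identity_pct: float,
    genome_coverage_pct: float,
    max_identity_pct: float = 90.0,  # assumed illustrative threshold
    min_coverage_pct: float = 70.0,  # assumed illustrative threshold
) -> bool:
    """Flag a finding as a candidate novel virus related to the reference.

    High coverage shows that the reads span most of the reference genome,
    while low identity shows that the sample sequence differs substantially
    from that reference.
    """
    return (
        nt_identity_pct < max_identity_pct
        and genome_coverage_pct >= min_coverage_pct
    )
```

With the values reported here for SARS-CoV-2 against SARS-CoV (about 81% nucleotide identity at up to 98% coverage), the rule flags a candidate novel virus, whereas a near-identical match or a match covering only a small fraction of the genome is not flagged.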

Virus discovery using capture probes
The efficacy of a metagenomic sequencing protocol using capture probes targeting vertebrate virus sequences, designed before the emergence of SARS-CoV-2, was studied in the context of virus discovery. We analyzed metagenomic data from the two SARS-CoV-2 positive samples prepared both with and without viral enrichment. The total number of contigs and the number of contigs matching genomes of viruses from Coronaviridae are shown in Table 2 and Fig. 2. For the clinical sample with higher SARS-CoV-2 load (Cq 20), genome coverage was comparable (98% vs. 97% genome coverage), and for the sample with lower load (Cq 30), genome coverage was markedly higher (74% vs. 91% genome coverage) when the metagenomic protocol with viral capture probes was used.
Reads mapping to the SARS-CoV-2 reference genome were used to visualize the effect of using capture probes, as depicted in Fig. 3, where the SARS-CoV-2 genome is almost completely covered. The two largest contigs built by SPAdes that had a hit with the lowest E-value when BLASTed against genomes from Coronaviridae were 4,866 bp and 5,811 bp in length for the two SARS-CoV-2 samples enriched using probes.
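Genome coverage as used above is the fraction of reference positions covered by at least one mapped read. A minimal Python sketch of that calculation is given below; the function `genome_coverage` is hypothetical, and it assumes alignments are supplied as 0-based, half-open `(start, end)` intervals on the reference, which is a convention chosen for the example rather than the output format of any tool named in the text.

```python
from typing import Iterable, Tuple


def genome_coverage(
    intervals: Iterable[Tuple[int, int]], genome_length: int
) -> float:
    """Fraction of the reference covered by at least one alignment.

    Intervals are 0-based and half-open: (start, end) covers positions
    start .. end - 1. Overlapping alignments are counted only once.
    """
    covered = 0
    covered_until = 0  # end of the region already counted
    for start, end in sorted(intervals):
        start = max(start, covered_until)  # skip positions already counted
        end = min(end, genome_length)      # clip to the reference length
        if end > start:
            covered += end - start
            covered_until = end
    return covered / genome_length
```

For example, three alignments at (0, 100), (50, 150) and (300, 400) on a 1,000-nt reference cover 250 distinct positions, giving a coverage of 0.25.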

Discussion
In this study, we evaluated the performance of a metagenomic sequencing protocol for the identification of emerging viruses using clinical samples in combination with a simulated reference database. High and low loads of SARS-CoV-2, SARS-CoV, and MERS-CoV in clinical samples could be detected as 'novel' viruses, using only reference sequences created before these viruses emerged. Sequence reads were assigned to the closest relatives of these viruses available at that time and assembled with heterologous sequences into 'novel' consensus genomes. Low identity of these consensus genomes to the genomes of their closest relatives indicated a novel virus. Additionally, probes targeting sequences of vertebrate viruses, available prior to the coronavirus pandemic of 2020, succeeded in the capture of nearly the full genome of SARS-CoV-2. It must be noted that the validation was performed using emerging viruses with nucleotide identity of over 76% to their closest known relatives, and conclusions cannot be extended to novel viruses which are less closely related. Nucleotide (and amino acid) identities reported in the literature for novel human pathogenic viruses vary, for example 50% for older viruses like SARS-CoV [1], 80% for MERS-CoV [14], 88% for parts of the Human Metapneumovirus [28] and up to 97.2% for parts of SARS-CoV-2 [29].
Several reports have shown an increase of 100-10,000 fold in sensitivity for detection of known viruses when using capture probes [10,30], and here we report the potential of using capture probes in the detection of novel viruses. Sequence variation was addressed in the probe design by retaining mutant or variant sequences if sequences diverged by more than 90% [10]. Lipkin and colleagues describe the capture of conserved regions of a rodent hepacivirus isolate with 75% identity using VirCapSeq-VERT, and suggest that even 40% identity may suffice for detection rather than whole genome sequencing [10]. The capture probes used in this study targeted sequences of several isolates of alpha-, beta-, gamma-, and deltacoronaviruses. In this study the whole genome of SARS-CoV-2, with 76-100% overall nucleotide identity to the probe targets, was detected using these probes.
Metagenomic sequencing is increasingly being used in diagnostic laboratories as a hypothesis-free approach for suspected infectious diseases in undiagnosed cases. Metagenomic sequencing in diagnostic laboratories has resulted in the detection of pathogens present in the reference database but either not tested for by routine methods due to rare or unknown associations with a specific disease, or for which routine testing failed (e.g., due to primer mismatches). Additionally, mNGS enables the detection of novel pathogens not (yet) present in the databases. Common bioinformatic classifiers are usually not designed for discovery purposes, so additional algorithms, including a separate validation to assess their performance in a discovery setting, are needed. Reports on specific bioinformatic discovery tools typically describe the algorithm and an in silico analysis. Here we present validation studies on the performance of virus discovery tools using clinical samples.
Implementation of virus discovery protocols in diagnostic laboratories may contribute to increased vigilance for emerging viruses and may therefore aid in surveillance and pandemic preparedness.

Declaration of Competing Interest
The authors report no declarations of interest.