Metagenomic Next-Generation Sequencing of Nasopharyngeal Specimens Collected from Confirmed and Suspect COVID-19 Patients

SARS-CoV-2 has presented a rapidly accelerating global public health crisis. The ability to detect and analyze viral RNA from minimally invasive patient specimens is critical to the public health response. Metagenomic next-generation sequencing (mNGS) offers an opportunity to detect SARS-CoV-2 from nasopharyngeal (NP) swabs. This approach also provides information on the composition of the respiratory microbiome and its relationship to coinfections or the presence of other organisms that may impact SARS-CoV-2 disease progression and prognosis. Here, using direct Oxford Nanopore long-read third-generation metatranscriptomic and metagenomic sequencing of NP swab specimens from 50 patients under investigation for COVID-19, we detected SARS-CoV-2 sequences by applying the CosmosID bioinformatics platform. Further, we characterized coinfections and detected a decrease in the diversity of the microbiomes in these patients. Statistically significant shifts in the microbiome were identified among COVID-19-positive and -negative patients, in the latter of whom a higher abundance of Propionibacteriaceae and a reduction in the abundance of Corynebacterium accolens were observed. Our study also corroborates the growing evidence that increased SARS-CoV-2 RNA detection from NP swabs is associated with the early stages of disease rather than with severity of disease. This work illustrates the utility of mNGS for the detection and analysis of SARS-CoV-2 from NP swabs without viral target enrichment or amplification and for the analysis of the respiratory microbiome.

agenomic sequencing of NP swab specimens from 50 patients under investigation for COVID-19, we detected SARS-CoV-2 sequences by applying the CosmosID bioinformatics platform. Further, we characterized coinfections and detected a decrease in the diversity of the microbiomes in these patients. Statistically significant shifts in the microbiome were identified among COVID-19-positive and -negative patients, in the latter of whom a higher abundance of Propionibacteriaceae and a reduction in the abundance of Corynebacterium accolens were observed. Our study also corroborates the growing evidence that increased SARS-CoV-2 RNA detection from NP swabs is associated with the early stages of disease rather than with severity of disease. This work illustrates the utility of mNGS for the detection and analysis of SARS-CoV-2 from NP swabs without viral target enrichment or amplification and for the analysis of the respiratory microbiome.
KEYWORDS COVID-19, nasopharyngeal, SARS-CoV-2, metagenomic next-generation sequencing, metagenomics S ince the emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in December 2019, cases of coronavirus disease 2019 (COVID- 19) have rapidly increased around the world. Research in this area has aggressively expanded, but to date, there have been few metagenomic next-generation sequencing (mNGS) studies of samples from COVID-19 patients. The first study identified SARS-CoV-2 via mNGS of RNA extracted from bronchoalveolar lavage (BAL) samples collected from two patients using an Illumina MiSeq platform (1). Within 6 days, the group was able to identify a novel coronavirus and report a complete genome, demonstrating the utility of mNGS in the early stages of novel pathogen discovery. Other approaches have included random primer metagenomic sequencing (sequence-independent single primer amplification [SISPA]) or metagenomic sequencing with spiked primer enrichment (MSSPE) to identify SARS-CoV-2 using mNGS methods with a limited sample size (2,3). Most studies have applied an amplicon-based approach to detect and sequence SARS-CoV-2 directly from specimens (1,2,(4)(5)(6)(7).
Perhaps one of the greatest advantages of mNGS is the ability to obtain a snapshot of the patient's microbiome at a given sampling site to detect coinfections and determine other organisms that may impact patient outcomes. Understanding coinfection is important as it may lead to exacerbation of COVID-19, as was observed with Middle East respiratory syndrome (MERS), and may provide insight into the management of these patients (8). There have been limited reports of coinfection in COVID-19 patients, and thus far, the results have varied, perhaps due to limitations of the different methods used to detect coinfection, poor recovery or detection by standard-of-care methods due to broad-spectrum empirical coverage, lack of testing to understand coinfections, and the diverse geographic regions in which studies were conducted (9)(10)(11). Studies from China, the United States, and Europe have found various levels of coinfection within regions as well. These studies report wide ranges of coinfection, from as low as 2% to as high as 80% (12)(13)(14)(15)(16)(17)(18). While real-time reverse transcription-PCR (RT-PCR) detection of respiratory pathogens was the most common method used to detect coinfection, there is still a great deal of heterogeneity in the detection methods between these studies. The use of mNGS would not suffer from the same limitations as previous methods and may potentially reveal coinfections that may be missed by the targeted detection methods.
In addition to evidence of true pathogens, there is growing evidence that the microbiome of the respiratory tract can have an impact on the health of patients, but the majority of this evidence focuses on interactions among the bacteria (19). There is growing evidence that it may be possible to predict which patients with respiratory tract infections are more likely to experience more serious disease by analyzing the microbiome (19,20). There is also evidence that the bacterial burden and composition of the lung microbiome may impact the likelihood that mechanically ventilated critically ill patients will develop acute respiratory distress syndrome (ARDS) samples (P ϭ 0.372) or between RT-PCRϩ/sequencing-samples and RT-PCR-/sequencing-samples (P ϭ 0.064) at the family, genus, or species level. There was a significant difference between the RT-PCRϩ/sequencingϩ samples and RT-PCR-/sequencingsamples (P ϭ 0.007) at the species level. In addition, we also observed a difference at the species level when we compared patients' samples grouped by severity index (P ϭ 0.022) (Fig. 2f). As was the case in the previous PCoA analysis, there was no significant difference at the genus and family levels.
The severity index was defined on a scale of 1 to 4, as follows: 4, not admitted; 3, admitted; 2, intensive care unit; and 1, required ventilator. b Days from onset is defined as the number of days from the initial onset of symptoms until the time of specimen collection. c Putative pathogens causing coinfections are defined as known pathogens, with bacteria having above 50% relative abundance and not considered microbiota. d Suspected COVID-19 specimens that were found to be positive for SARS-CoV-2 by RT-PCR. e Suspected COVID-19 specimens that were found to negative for SARS-CoV-2 by RT-PCR. f Ct, cycle threshold; SD, standard deviation; -, unknown information; No ID, samples that were not identified by CosmosID; NA, not applicable.
Finally, we visualized the microbial community composition at the species level by comparing the overall relative abundances of species detected in SARS-CoV-2-positive and SARS-CoV-2-negative samples. The family Propionibacteriaceae revealed the greatest difference in abundance between these two groups, with Propionibacteriaceae proportionately more abundant in SARS-CoV-2 patients by ca. 30%, representing the most abundant organism group detected in these samples (P ϭ 0.028) (Fig. 3a). Additionally, there was a significant decrease in the incidence of Corynebacterium accolens in COVID-19-positive patients (P ϭ 0.025) (Fig. 3a). With respect to individual samples, we observed that, in many cases, a single organism comprised the vast majority of sequencing reads detected among the SARS-CoV-2-positive patients (Fig. 3b).

DISCUSSION
In this study, we demonstrated that it is possible to detect SARS-CoV-2 in NP swab specimens by direct long-read metatranscriptomic sequencing. Due to the relatively low cycle threshold (Ct) values (i.e., high viral burdens) observed in these samples, it was possible to obtain sufficient sequencing reads without amplification or enrichment of viral targets during preparation of the cDNA library. These steps are often necessary to obtain sufficient sequencing depth or to overcome background reads, particularly in low-titer samples (24)(25)(26). However, eliminating the amplification or enrichment step prior to sequencing introduces less bias in sequences that are ultimately generated by this method. Further, forgoing PCR amplification preserves the long sequencing reads that maximize the utility of this platform. Our study also corroborates growing evidence that the viral load in the nasopharynx is highest early in the disease course and wanes as disease progresses (27,28).
From our metagenomic analysis, we were able to observe a reduction in the microbial diversity of SARS-CoV-2-positive patients, which may be driven in part by several samples overwhelmingly dominated by a single species. Interestingly, singlespecies predominance was observed in samples throughout the range of Ct values. Using PCoA analysis, we also detected a significant difference in the microbial communities when grouped by severity index. However, it should be noted that there were larger numbers of samples in the less severe groups.
A recent study of critically ill patients found that an increased abundance of Enterobacterales in respiratory samples increased the risk of developing ARDS (21). This observation suggests that gut-colonizing bacteria may have a deleterious impact on outcomes when present in the respiratory tract. However, in our study, we did not observe increased gut-colonizing organisms. Propionibacteriaceae were more abundant in our SARS-CoV-2-positive patients. However, Propionibacterium and Cutibacterium (formerly Propionibacterium) species may represent sampling contaminants since they are a component of normal skin microbiota (29)(30)(31). Previous reports have associated Propionibacteriaceae with the respiratory tract in cystic fibrosis patients, but thus far, no linkage with disease severity has been established (29,32). We also observed a significant reduction in the relative abundance of Corynebacterium accolens in our COVID-19 patients. This organism is considered a commensal organism, and there is evidence that it has a negative association with colonization by Streptococcus pneumoniae (30,31,33). Further studies would be required to conclude the role of these associations in patients with COVID-19.
At the time of sampling, we did observe several clinically relevant organisms that may be potential causes of coinfection. Haemophilus influenzae (n ϭ 2) and Moraxella catarrhalis (n ϭ 1) were identified in COVID-19-positive samples with high relative abundances. These organisms frequently colonize the respiratory tract preceding infection, have been associated with exacerbations of chronic bronchitis and pneumonia (34,35), and have previously been detected in COVID-19 patients (11,36). Interestingly, there is also evidence that viral infections of the respiratory tract can increase the ability of H. influenzae to establish infection (34,37,38). We identified the viral pathogen human metapneumovirus (hMPV) in a COVID-19-positive sample. Although typically mNGS To Detect SARS-CoV-2 ® associated with the common cold, this virus has also been associated with exacerbations of pulmonary disease, particularly in the very young and the elderly (34,39). In addition, we identified human alphaherpesvirus 1 (HSV1) in one patient, which likely represents reactivation in the setting of COVID-19. In our COVID-19-positive samples, we observed that the SARS-CoV-2 reads were overwhelmingly abundant and that this virus frequently was the only virus identified in the metatranscriptomic sequencing results.
There are limitations to our study. First is the inability to draw larger conclusions regarding the composition of the microbiome and associations with COVID-19, including its impact on the severity of disease due to the relatively low number of samples, particularly from the more severe categories of the severity index. Having a randomized sample selection approach was beneficial to demonstrate that it is possible to detect SARS-CoV-2 from samples at various Ct values seen in our patient population. However, this ultimately led to relatively few patients with severe disease being included in our study. More broadly, a larger sample size would be necessary for future studies in order draw more definitive conclusions about microbial diversity in COVID-19 patients. In order to assess the utility of direct long-read metatranscriptomic sequencing as a potential sentinel method for assessing the continued sensitivity of RT-PCR assays, future studies should also include greater numbers of samples from suspected COVID-19 patients that are RT-PCR negative. Our study did not detect SARS-CoV-2 in any of the 10 samples from patients suspected of having COVID-19 that were negative by RT-PCR despite both our study groups having no difference in mean number of days from symptom onset, which suggests that this observation was not due to the reduced clinical sensitivity reported for NP swabs from later stages of disease. Single-nucleotide polymorphism analysis to study genomic diversity for the majority of our samples was not pursued due to insufficient depth or incomplete viral genome coverage.
In summary, this study serves as a proof of concept that it is possible to detect SARS-CoV-2 and characterize simultaneous coinfections and the respiratory microbiome of patients under investigation for COVID-19 using direct long-read metatranscriptomic and metagenomic sequencing.

MATERIALS AND METHODS
Ethics statement. This study was approved by the Johns Hopkins University institutional review board, with a waiver of informed consent.
Sample collection and severity index assignment. In this study, 50 randomly selected nasopharyngeal (NP) swab specimens were obtained from 40 COVID-19 patients and 10 patients suspected of having COVID-19 but negative by a diagnostic laboratory-developed test (LDT), RT-PCR (U.S. Food and Drug Administration Emergency Use Authorization pending), on gene targets E and S (Table 2) (40). Remnant NP swabs were collected for the study at the completion of standard-of-care testing prior to disposal. We assigned a four-point severity index value to the samples based on the following criteria: 4, the patient was not admitted to the hospital; 3, the patient was admitted to the hospital but not the intensive care unit (ICU); 2, the patient was admitted to the ICU; and 1, the patient was placed on a ventilator.
Nucleic acid extraction. Automated total nucleic acid extraction was performed using the NucliSENS easyMag (bioMérieux, Marcy-l'Étoile, France) software, version 2.1.0.1. The input volume was 500 l, and the elution volume was 50 l. Extraction was performed by following the manufacturer's protocol. Extracts were stored at -80°C prior to mNGS sequencing. mNGS GridION sequencing. Long-read metagenomic sequencing was performed using the Nanopore GridION X5 (Oxford, England) sequencing instrument. Each Nanopore sequencing library was prepared using total nucleic acid extract from NP swabs. Two different libraries were generated for each specimen using two different kits. A direct cDNA sequencing kit (SQK-DCS 109; Oxford Nanopore Technologies) was used for sequencing poly(A) ϩ RNA full-length transcripts to allow for sequencing SARS-CoV-2 in a nonbiased way. In addition, we used a PCR barcoding kit (SQK-PBK004; Oxford Nanopore Technologies) for capturing DNA targets for untargeted metagenomic analysis and to complement the metatranscriptomic analysis. For cDNA libraries, complementary cDNA strand synthesis, as well as strand switching, was performed using kit-supplied reagents and oligonucleotides. This was followed by RNA strand degradation and synthesis of complementary strands and then ligation of sequencing adaptors. Multiplexing was performed using the Native barcoding expansion kit (EXP-NBD104).
For PCR barcoding kit libraries, briefly, total nucleic acid was fragmented using the Covaris g-TUBE. Repair of sheared ends was performed using the NEBNext end repair/dA-tailing, followed by ligation of adaptors containing primer binding sites for sample amplification as well as barcoding. Primers used, in addition to incorporating barcodes, added 5= tags to allow for sample attachment to one-dimensional rapid sequencing adaptors.
Specimens were sequenced using R9.4.1 flow cells (FLO-MIN106). Five samples were multiplexed per flow cell, except for libraries for samples 1 to 5, which were prepared with the Direct cDNA sequencing kit and sequenced one sample per flow cell. Samples 1 to 10 have only metatranscriptomic data, as there mNGS To Detect SARS-CoV-2 ® was insufficient remaining specimens for subsequent metagenomic sequencing. MinKNOW software was used to collect and base call the sequencing data.
CosmosID analysis and bioinformatics pipeline. Raw nanopore fastq files that passed filter based on nanopore default parameters were concatenated into one file per sample, with all reads in order of sequencing start time. Concatenated, unassembled sequencing reads were directly analyzed by the CosmosID bioinformatics platform (CosmosID Inc., Rockville, MD) as described elsewhere (41)(42)(43)(44) for multi-kingdom microbiome analysis and quantification of organism relative abundance. Briefly, the system utilizes curated genome databases and a high-performance data-mining algorithm that rapidly disambiguates hundreds of millions of metagenomic sequence reads into the discrete microorganisms engendering particular sequences. Four variables are generated for each organism detected: unique match frequency, unique match percentage, total match percentage, and relative abundance, as previously defined (45). The CosmosID % Total Matches statistic was used as an approximation of the percent coverage of the sample against SARS-CoV-2 strain Hu-1.
To determine the detection, depth, and coverage of SARS-CoV-2 strain Hu-1 at different times, multiple cumulative read files were generated per sample. Depending on the number of reads in a sample, subsamples were created using the first 1,000, 2,000, 4,000, and 8,000 reads, doubling until a maximum of 256,000 reads was reached. These files were used for both CosmosID and Minimap analysis. The number of the first read identified was the first read mapped to SARS-CoV-2 strain Hu-1 via the CosmosID Metagenomics software. Approximations of the amounts of reads belonging to various kingdoms were determined by the number of reads in each sample that mapped to the CosmosID database for each kingdom. Relative abundance stacked bars were generated from the family-, genus-, and species-level relative abundance matrices from CosmosID taxonomic analysis. Bars are separated based on the comparative cohort and were visualized using the R package ggplot2 (49).
Putative pathogens causing coinfections are defined as organisms of high abundance having Ͼ50% relative abundance compared to that in the normal microbiota (i.e., organisms expected to be found in the respiratory tracts of healthy individuals). Orthogonal validation of coinfection calls were completed by running raw nanopore output files against reference genomes of interest using minimap-2, with default parameters, in the same way as was used in the mapping to SARS-CoV-2 strain Hu-1. Mapping statistics and coverage plots were generated from Minimap's sam/bam files using Qualimap's bamqc function.
Statistical analysis. Alpha diversity boxplots were calculated from the family-, genus-, and specieslevel abundance score matrices employing the CosmosID taxonomic analysis. Chao, Simpson, and Shannon alpha diversity metrics were calculated in R using the R package vegan. Wilcoxon rank sum tests were performed between positive and negative COVID-19 groups using the R package ggsignif. Boxplots with overlaid significance in P value format were generated using the R package ggplot2 (49)(50)(51).
Beta diversity principal-coordinate analyses (PCoA) were calculated from the family-, genus-, and species-level relative abundance matrices from CosmosID taxonomic analysis. Bray-Curtis and Jaccard diversities were calculated in R using the R package vegan with the function vegdist, and PCoA tables were generated using vegan's function pcoa. Permutational multivariate analysis of variance (PERMANOVA) tests for each distance matrix were generated using vegan's function adonis2. Group pairwise PERMANOVAs were generated using the pairwise Adonis function pairwise.adonis2. Plots were visualized using the R package ggpubr (50,52,53).

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. ACKNOWLEDGMENTS B.F. and M.D. are employees of CosmosID, and R.C. is the founder and chair of CosmosID. CosmosID provided bioinformatics support for the mNGS analysis. P.J.S. receives grants and personal fees from Accelerate Diagnostics, OpGen Inc., and BD Diagnostics, grants from bioMérieux, Inc., Affinity Biosensors, and Hardy Diagnostics, and personal fees from Roche Diagnostics, GeneCapture, and Shionogi Inc, outside the submitted work.
James Brayer and Kim Fitzgerald from Oxford Nanopore helped expedite the reagents required for the study. Monolina Binny (Oxford Nanopore), Michael Micorescu (Oxford Nanopore), and Winston Timp (Johns Hopkins University, Whiting School of Engineering) provided feedback on sequencing methods. We thank the Medical Microbiology Laboratory at the Johns Hopkins Hospital for assistance with specimen collection and specimen archiving for research studies.