Non-SARS-CoV-2 genome sequences identified in clinical samples from COVID-19 infected patients: Evidence for co-infections

Background In December 2019, an ongoing outbreak of pneumonia caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2/ 2019-nCoV) infection was initially reported in Wuhan, Hubei Province, China. Early in 2020, the World Health Organization (WHO) announced a new name for the 2019-nCoV-caused disease: coronavirus disease 2019 (COVID-19) and declared COVID-19 to be a Public Health Emergency of International Concern (PHEIC). Cellular co-infection is a critical determinant of viral fitness and infection outcomes and plays a crucial role in shaping the host immune response to infections. Methods In this study, 68 public next-generation sequencing data from SARS-CoV-2 infected patients were retrieved from the NCBI Sequence Read Archive database using SRA-Toolkit. Data screening was performed using an alignment-free method based on k-mer mapping and extension, fastv. Taxonomic classification was performed using Kraken 2 on all reads containing one or more virus sequences other than SARS-CoV-2. Results SARS-CoV-2 was identified in all except three patients. Influenza type A (H7N9) virus, human immunodeficiency virus, rhabdovirus, human metapneumovirus, Human adenovirus, Human herpesvirus 1, coronavirus NL63, parvovirus, simian virus 40, and hepatitis virus genomes sequences were detected in SARS-CoV-2 infected patients. Besides, a very diverse group of bacterial populations were observed in the samples.


INTRODUCTION
In December 2019, the first cases of coronavirus disease 2019  were possibly due to a zoonotic transmission in China. It was tied to a large seafood market that also traded in live wild animals (Tay et al., 2020). The causative virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is capable of human-to-human transmission and rapidly spread to other regions of China and other countries (World Health Organization, 2020). It is now a global pandemic and is a considerable concern for public health. So far, more than 16,101,367 confirmed cases were diagnosed in nearly 213 countries and territories around the world and two international conveyances, causing globally over 645,000 deaths (Worldometer, 2020, July 25).
Coronaviruses are known to cause severe diseases in humans and animals. Of these, four human coronaviruses (229E, NL63, OC43, and HKU1) typically only infect the upper respiratory tract and cause relatively minor symptoms (Fehr & Perlman, 2015). However, there are three coronaviruses (severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome coronavirus (MERS-CoV), and SARS-CoV-2) that can replicate in the lower respiratory tract and cause pneumonia, which can be fatal. With 79% genome sequence similarity, SARS-CoV is the closest relative to SARS-CoV-2 among human coronaviruses (Gorbalenya et al., 2020). However, of all known coronavirus sequences, SARS-CoV-2 is most similar to bat coronavirus RaTG13, with a similarity of 98% (Zhou et al., 2020).
SARS-CoV-2 pathophysiology closely parallels that of SARS-CoV infection, with active inflammatory responses strongly implicated in the resulting airway damage (Wong et al., 2004). Hence the extent of the disease in patients is attributed not only to the viral infection but also to the host's response (Tay et al., 2020).
Underlying co-infections in primary infectious disease are an important variable that needs to be considered but is often undetected. Remarkable developments in next-generation sequencing have recently made metagenomics, an unbiased shotgun method of analysis, a widely used tool in just about every field of biology, including diagnosis of infectious diseases (Kuroda et al., 2012;Lecuit & Eloit, 2014). Metagenomics is powerful because it can diagnose unsuspected microbial agents (Wilson et al., 2014). It directly analyzes samples in their entirety, eliminating the need for prior knowledge to obtain comprehensive information. In this capacity, metagenomics exceeds traditional diagnostic limitations. With the microbial genomes in hand, we can now explore the possibility of using metagenomic and metatranscriptomic next-generation sequencing (mNGS) directly as a screening method for detection of other microbes in clinical samples (Plyusnin et al., 2020).
A simple approach would be to first map sequencing reads from the sample to the reference microbial genome. However, the accuracy of such an alignment-based method is relatively low compared to an alignment-free approach. In the alignment-based method, genome sequences from closely related viruses can lead to false-positive results (Chen et al., 2020b). Moreover, the virus-specific reads obtained may not be adequate for unambiguous detection (degraded RNA), incompletely target-enriched sequence library by multiple-PCR (Lundberg et al., 2013) or hybrid capture (Duncavage et al., 2011), which can lead to false-negative results.
Fastv is an alignment-free, ultra-fast tool for detecting the microbial sequences in sequence data (Chen et al., 2020b). It can identify target microorganisms using unique k-mers. It detects SARS and other coronaviruses from sequencing data and efficiently distinguishing SARS from MERS.
In this study, public next-generation sequencing data from SARS-CoV-2 infected patients were analyzed by fastv using the pre-computed unique k-mer resources (Chen et al., 2020b). Taxonomic classification was performed using Kraken 2 on all reads containing more than one microbial sequence (Wood, Lu & Langmead, 2019). The present study's findings have confirmed the actual existence of genome sequences of other microbes in SARS-CoV-2 infected patients.

SRA database mining
Next-generation sequencing technologies have enabled large-scale genomic surveillance of SARS-CoV-2 as thousands of isolates are being sequenced worldwide and deposited in public data repositories. The sequence data were downloaded as .sra files using the prefetch tool (https://github.com/ncbi/sra-tools/tree/master/tools/prefetch), then extracted to .FASTQ files using the NCBI fastq-dump tool (https://github.com/ncbi/sra-tools/tree/ master/tools/fastq-dump). Data sets for analysis were chosen through a keyword search of the SRA descriptions for "COVID19" and downloaded between 27 January 2020 and 16 May 2020. Sequence data from negative COVID-19 patients, experimental studies, and controlled access were excluded (Table S1).
Sequence data pre-processing and screening using fastv Fastv, along with the pre-computed unique k-mer resources, was used as previously described (Chen et al., 2020b). Initially, fastv performed adapter trimming, quality pruning, base correction, and other pre-processing to ensure the accuracy of k-mer analysis on fastq input files. The fastv tool identifies a target virus from sequencing data and detects any microbial sequences for which a unique k-mer data is provided. A unique SARS-CoV-2 k-mer set and the SARS-CoV-2 reference genome were used as input files (downloaded from https://github.com/OpenGene/fastv/tree/master/data) with the k-mer collections for viral and microbial genomes (downloaded from http://opengene.org/ microbial.kc.fasta.gz). The k-mer scanning results were visualized in a figure on a single HTML page by fastv.

Sequence data analysis using Kraken 2
The results of fastv were validated with Kraken 2. Kraken 2 is the latest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and rapid classification (https://github.com/DerrickWood/kraken2) (Wood, Lu & Langmead, 2019). Kraken constructs an index of all k-mers found in the reference genomes and assigns each k-mer to the least common ancestor (LCA) of all species that have that k-mer. Then Kraken matches the k-mers contained in the reads to this index and eventually assigns the reads to the taxon with the most fitting k-mers by following the path from the root of the tree (Wood, Lu & Langmead, 2019). Through lowering memory use by 85%, Kraken 2 improves upon Kraken 1, enabling higher numbers of genomic reference data to be used while retaining high accuracy and fivefold speed. A standard database containing RefSeq complete bacterial, archaeal, and viral genomes, along with the human genome and a collection of known vectors (UniVec_Core), was downloaded (https://ccb.jhu.edu/software/kraken2/downloads.shtml). The database was then constructed using 32 threads with the default parameters on an AWS EC2 h1.8xlarge storage optimized instance with 16 dual-core hyperthreaded 2.30 GHz CPUs and 132 gigabytes (GB) of RAM.

SARS-CoV-2 identification
Sequence data analysis was performed on public Illumina HiSeq/MiSeq libraries from the NCBI SRA database (Bioproject PRJNA605983) sequenced from bronchoalveolar lavage fluid from five patients (WIV02, WIV04, WIV05, WIV06, and WIV07) with pneumonia at the early COVID-19 outbreak in Wuhan, China. Nine libraries were downloaded as .sra files using the prefetch tool; then, the fastq files were extracted using the NCBI fastq-dump tool. Prior to the k-mer analysis, sequencing adapters and low-quality bases were removed by fastv. After scanning the fastq data, fastv reported the k-mer coverage for each microbial genome with valid hits (Fig. 1).
SARS-CoV-2 was detected in all of the 68-sequence data except SRR11772662, SRR11772663, and SRR11772664 samples were negative for SARS-CoV-2. These samples belong to one study (Bioproject PRJNA631042), where the research group used different sequencing technologies on the same sample to find a cost-effective and highly scalable method for SARS-CoV-2 sequencing. Because sequence technologies vary in reading depth and coverage thresholds, fastv could not detect SARS-CoV-2 in sequenced samples with lower coverage metrics.
In addition, human immunodeficiency virus, human herpesvirus 1, human T-lymphotropic virus 1, hepatitis virus, and simian virus 40 were found in WIV07 patient sequence data with coverage below 21%. The low coverage and non-human specific pathogen hits have been overlooked (Table S2).
A very diverse group of bacterial populations were observed in the COVID-19 infected patients. Enterobacter hormaechei was identified in WIV02, WIV04, WIV05, WIV06, and WIV07 patients. Enterobacter hormaechei is an important emerging pathogen and can cause nosocomial infections, and often have resistance to multiple clinically relevant antibiotics (Monahan et al., 2019). Acinetobacter baumannii sequence was detected in WIV05 and WIV07 patients. A. baumannii is one of the most successful pathogens associated with hospital-acquired infections worldwide (Lee et al., 2017). Enterococcus faecalis sequence was identified in WIV04, WIV05, and WIV07 patients (Table S2).

Kraken taxonomic classification
Taxonomic classification was performed using Kraken 2 on all reads containing more than one virus sequences. The genetic data for constructing the databases were retrieved from the NCBI RefSeq library. A very diverse group of viral, bacterial, and archaeal populations was observed in the samples. A taxonomic classification that was obtained from WIV04, WIV06, and WIV07 patients revealed a dominance of Bacteria (7%, 5%, and 59%, respectively), followed by Viruses (0.3%, 0.3%, and 0.3%, respectively) while Archaea was lower than 0.02% in all patients. Among viral communities, influenza type A (8%), rather than SARS-CoV-2 (2%), was found to be dominant in the WIV07 patient, which is consistent with fastv result (Fig. 2). SARS-CoV-2, rhabdovirus, and influenza type A dominated the sequence data from WIV06 patient (Fig. 3) and WIV04 patient (Fig. 4).
Acinetobacter baumannii and Enterobacter species genomic sequences were detected in SRR11092059 (Fig. 5), SRR11092060 (Fig. 6), SRR11092061, SRR11092062 (Fig. 7), and SRR11092063 sequence data. By using Kraken2, a tool for read taxonomy, many bacterial species were identified that most of them present in the normal flora and rarely cause problems. However, compared to the fastv result, the Pasteurella multocida sequence was detected in SRR11092062 (Fig. 7) and SRR11092063 sequence data. The genome sequence of Staphylococcus aureus was detected in SRR11092059 data (Fig. 5).
The most abundant bacterial phylum was Bacteroidetes, followed by Firmicutes phylum in SRR10971381 sequence data (Fig. 8).

DISCUSSION
The importance of detecting co-infections is becoming more recognized (Griffiths et al., 2011;Li & Zhou, 2013), but it remains challenging to get such information. The source of clinical samples and the sequencing technology can be inferior in co-infection detection (Birdsell et al., 2018).
A better understanding of co-infection's prevalence is required, partly because co-infecting pathogens can interact with each other directly or indirectly via the host's Figure 2 The Krona pie chart of the viral taxonomy of the WIV07 patient. Viruses identified from kraken 2 analysis of SRR11092059 data were visualized using krona tool. Among viral communities, influenza type A (8%), SARS-CoV-2 (2%) and rhabdovirus (0.6%) dominated the sequence data from WIV07 patient.
Full-size  DOI: 10.7717/peerj.10246/ fig-2 resources or immune system (Cox, 2001;Griffiths et al., 2011). These interactions within co-infected hosts can alter the transmission, clinical progression, and control of multiple infectious diseases as compared to single pathogen species infection (Chiodini, 2001;Griffiths et al., 2011;Palacios et al., 2009). Recent studies indicated that co-infection's adverse effects are more common than those with no-effects or positive impacts on human health (Pullan & Brooker, 2008). The underdiagnosis of co-infections is attributed to a lack of clinical suspicion, common symptoms, and the fact that conventional methods have little capacity to detect co-infections in the absence of a priori knowledge.
Co-infection can potentially affect the performance of laboratory testing for coronavirus disease 2019. A previous study (Lai, Wang & Hsueh, 2020) showed that reverse transcription polymerase chain reaction (RT-PCR) could not detect SARS-CoV-2 in two patients co-infected with influenza A virus. Some researchers have suggested that inadequate viral specimens, the use of improperly validated assay, timing and methods of collecting specimens, the presence of mutations at the primer binding site, and co-infection with other viruses might be responsible (Arevalo-Rodriguez et al., 2020;Kucirka et al., 2020;Lai, Wang & Hsueh, 2020;Li et al., 2020). Based on a limited number of observational studies, it was found that the false-negative rate for SARS-CoV-2 RT-PCR was 20% (Lai, Wang & Hsueh, 2020;Li et al., 2020). Given these considerations, COVID-19 might be underdiagnosed, especially during the influenza season, since typical clinical symptoms of COVID-19, including fever, cough, and dyspnea, resemble those of influenza (Chen et al., 2020a;Wu et al., 2020).
Therefore, exploring new diagnostic approaches is essential to advance understanding of co-infection contribution to disease manifestations and treatment responses (Birdsell et al., 2018).
Remarkable developments in next-generation sequencing have recently made metagenomics, an unbiased shotgun method of analysis, a widely used tool in just about every field of biology, including diagnosis of infectious diseases (Kuroda et al., 2012;Lecuit & Eloit, 2014). In this study, using k-mer based tools (fastv and Kraken 2), genome sequences of various microorganisms, including the SARS-CoV-2 virus, were detected. The accuracy of such an alignment-free method is relatively high compared to an alignment-based approach. Fastv is an alignment-free, ultra-fast tool for detecting the microbial sequences in sequence data using unique k-mers (Chen et al., 2020b). To validate the results obtained from fastv analysis, Kraken 2 was used. Kraken 2 is the latest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and rapid classification (Wood, Lu & Langmead, 2019).
Kraken 2 is fast and requires less memory than Kraken 1. However, the standard database construction requires approximately 100 gigabytes (GB) of disk space and random-access memory (RAM) of more than 30 GB. For that purpose, an AWS EC2 h1.8xlarge storage optimized instance with 16 dual-core hyperthreaded 2.30 GHz CPUs and 132 GB of RAM was used.
Multidrug-resistant bacteria associated with hospital-acquired infections worldwide, such as E. hormaechei , S. aureus, P. multocida, and A. baumannii, were detected in COVID-19 infected patients (Lee et al., 2017;Monahan et al., 2019). Most of the infections caused by the previously mentioned bacteria occur in critically ill and or immunocompromised patients in the intensive care unit (ICU) setting (Fournier, Richet & Weinstein, 2006). Understanding the nature and consequences of co-infection is essential for accurate estimates of infectious disease burden. In particular, more systematic data on infectious diseases would also help measure the extent of co-infection on human health. Increased knowledge of the risk factors, the conditions in which co-infecting pathogens interact, and the mechanisms behind these interactions, particularly in experimental studies, will also help develop and evaluate infectious disease management programs.

CONCLUSIONS
In this study, screening of 68 public next-generation sequencing data from SARS-CoV-2 infected patients was performed using fastv and Kraken 2. Multiple viruses, including the SARS-CoV-2 virus, genome sequences were detected. Further large-sample studies are warranted to investigate the prevalence of COVID-19 co-infection, the impact of co-infection on the host immune system of COVID-19 patients, and their role in disease progression. Figure 8 Sankey diagrams of Kraken 2 report results obtained from SRR10971381. The width of the flow is proportional to the number of reads. The number above each node is the number of k-mer hits. A rank code, indicating domain (D), kingdom (K), phylum (P), family (F), genus (G), or species (S) was used.