Metagenomic Surveillance of Viral Gastroenteritis in a Public Health Setting

ABSTRACT Norovirus is the primary cause of viral gastroenteritis (GE). To investigate norovirus epidemiology, there is a need for whole-genome sequencing and reference sets consisting of complete genomes. To investigate the potential of shotgun metagenomic sequencing on the Illumina platform for whole-genome sequencing, 71 reverse transcriptase quantitative PCR (RT-qPCR) norovirus positive-feces (threshold cycle [CT], <30) samples from norovirus surveillance within The Netherlands were subjected to metagenomic sequencing. Data were analyzed through an in-house next-generation sequencing (NGS) analysis workflow. Additionally, we assessed the potential of metagenomic sequencing for the surveillance of off-target viruses that are of importance for public health, e.g., sapovirus, rotavirus A, enterovirus, parechovirus, aichivirus, adenovirus, and bocaparvovirus. A total of 60 complete and 10 partial norovirus genomes were generated, representing 7 genogroup I capsid genotypes and 12 genogroup II capsid genotypes. In addition to the norovirus genomes, the metagenomic approach yielded partial or complete genomes of other viruses for 39% of samples from children and 6.7% of samples from adults, including adenovirus 41 (N = 1); aichivirus 1 (N = 1); coxsackievirus A2 (N = 2), A4 (N = 2), A5 (N = 1), and A16 (N = 1); bocaparvovirus 1 (N = 1) and 3 (N = 1); human parechovirus 1 (N = 2) and 3 (N = 1); Rotavirus A (N = 1); and a sapovirus GI.7 (N = 1). The sapovirus GI.7 was initially not detected through RT-qPCR and warranted an update of the primer and probe set. Metagenomic sequencing on the Illumina platform robustly determines complete norovirus genomes and may be used to broaden gastroenteritis surveillance by capturing off-target enteric viruses. IMPORTANCE Viral gastroenteritis results in significant morbidity and mortality in vulnerable individuals and is primarily caused by norovirus. To investigate norovirus epidemiology, there is a need for whole-genome sequencing and reference sets consisting of full genomes. Using surveillance samples sent to the Dutch National Institute for Public Health and the Environment (RIVM), we compared metagenomics against conventional techniques, such as RT-qPCR and Sanger-sequencing, with norovirus as the target pathogen. We determined that metagenomics is a robust method to generate complete norovirus genomes, in parallel to many off-target pathogenic enteric virus genomes, thereby broadening our surveillance efforts. Moreover, we detected a sapovirus that was not detected by our validated gastroenteritis RT-qPCR panel, which exemplifies the strength of metagenomics. Our study shows that metagenomics can be used for public health gastroenteritis surveillance, the generation of reference-sets for molecular epidemiology, and how it compares to current surveillance strategies.

assess its potential to provide off-target molecular characterization of GE pathogens that are relevant from a public health perspective. For this study we selected norovirus RT-qPCR-positive samples that were sent to the Dutch norovirus surveillance program by regional laboratories for referral in outbreak situations.

RESULTS
Complete norovirus genomes were robustly generated using metagenomics. To investigate the potential of shotgun metagenomic sequencing on the Illumina platform, 71 RT-qPCR norovirus-positive feces (threshold cycle [C T ], ,30) samples from norovirus surveillance within The Netherlands were subjected to metagenomics. Of the 71 samples, 66 (93%) yielded norovirus genomic sequences, while off-target viruses were identified in 13 (18%) of the samples ( Fig. 1; Table 1). Of the 66 samples resulting in norovirus sequences, 58 (82%) yielded a full norovirus genome, and eight (11%) yielded a partial norovirus genome. Two full and two partial minority norovirus strains were additionally generated from double-infection samples, resulting in a total of 60 full and 10 partial norovirus sequences ( Table 1). The five (7.9%) remaining samples were possibly negative due to either a low viral input, the presence of inhibitors, or the presence of another highly abundant taxon. Samples R02-07 and R02-09 had high viral loads (C T , ,25) but contained a highly abundant adenovirus (AdV) (C T , 7.1; Table 1) and a 64-kb Pseudomonas phage with a 723Â depth of coverage, respectively.
A comparison of RT-qPCR values versus genome completeness at a sequencing depth cutoff of 3 showed that above a C T value of 27.1, no norovirus contigs were obtained (Fig. 2). For the mixed-strain samples, block-like coverage peaks near the ORF1/2 junction were observed as a result of homologous regions between these strains, and these regions had to be resolved manually (Fig. 3).  Metagenomic sequencing of norovirus has higher sensitivity than Sanger sequencing. For 60 out of 71 samples, a Sanger norovirus sequence could be retrieved, compared to 66 for metagenomic sequences. To investigate the concordance and resolution of NGS versus Sanger sequencing, sequences generated by both methods were compared using a pairwise single nucleotide polymorphism (SNP) distance method with a pairwise deletion option that removes ambiguous positions. In total, 50 high-quality Sanger sequences were compared to NGS sequences. SNP differences were identified for six samples: R03-06, R03-07, R03-12, R03-13, and R04-07 had 1 SNP difference, while R03-09 had 2 SNP differences over a 969nucleotide (nt) region (see Fig. S5 in the supplemental material). Although ambiguous positions were ignored by this method, NGS could clearly resolve ambiguous nucleotides that arose due to overlapping fluorescent peaks in the Sanger trace files.
Phylogenetic analysis of norovirus strains. To investigate the genetic diversity of the obtained norovirus sequences, we inferred maximum likelihood trees of the ORF1 and ORF2 sequences for norovirus GII (Fig. 4) and GI (Fig. 5). We included references from the latest nomenclature update described in Chhabra et al. (11). For samples that were linked in location and time (sampled within 1 week of each other), we investigated if there were differences in single nucleotide polymorphisms (SNPs). We identified six linked sample pairs, that had up to two SNP distances between them over the entire combined ORF1 to ORF3 coding domain sequence (CDS) ( Table 1  norovirus strain (bottom) with lower DoC, denoted "major" and "minor," respectively. The characteristic blocklike high coverage peaks near the ORF1/2 junction of the minority strain are due to homology between the strains in these regions. The aligned reads of the majority strain adversely affect the consensus sequence of the minority strain, necessitating manual correction. A full overview of all norovirus GI, GII, and off-target viruses with public health relevance is shown in Fig. S1 to S4. The coverage is shown as a 1 1 log 10 transformation with a maximum 999Â depth of coverage.     Table 1). The scale bar represents nucleotide substitutions per site, and selected bootstrap values greater than 70 are shown. For samples containing multiple strains, the lowest depth of coverage strain is denoted with "minor" in the tip label (e.g., "R01-04-minor"). All GII.4 and GII.P4 strains were Sydney and New Orleans variants, respectively.
variants C (GenBank accession number AB983218) and D (accession number LC037415) are contemporary strains that recently caused large outbreaks in Asia (22,23). All GII.17 strains in this study belonged to variant D, with the exception of R02-06, which belonged to variant C (Fig. 4). The GII. 17[P25] strain (R01-04) belonged to GII. 17 variant D (Fig. S6B), and it had a single amino acid change from the reference strain, which suggests a recent recombination event. Additionally, its polymerase region clustered with GII.P25 (Fig. S6A). This strain represents the second complete GII.P25 ORF1 sequence alongside MG495083.
Children were frequently coinfected with another GE virus. From the samples that tested positive for norovirus by RT-qPCR, 18% yielded off-target GE viruses, resulting in 13 full and two partial genomes (Table 1; Fig. 2 and 5). Children were more often coinfected with another GE virus (39%) than adults (6.7%; Fig. 1). Aichivirus 1 (AiV-1), RVA, and AdV41 were detected in individuals of 60, 66, and 71 years old, respectively, while all other off-target GE viruses were detected in samples from individuals younger than 5 years (Fig. 1). All detected EV were coxsackieviruses (CV). For several of the GE viruses, the number of sequences generated in this study provided a marked increase in the number of sequences uploaded to NCBI between 2017 and 2022 (Table S1).
To investigate the sensitivity of metagenomic sequencing for the off-target GE viruses, (RT-)qPCRs were performed for RVA, AdV, sapovirus (SaV), EV, and human parechovirus (HPeV) on 71 samples. For AiV and human bocaparvovirus (HBoV), (RT-)qPCR assays were not available ( Table 1). The sample size was limited, but overall, no sequences were obtained with a C T value of .28 except for two HPeV genomes, while RT-qPCR detected several offtarget viruses with a C T of .28, which were not detected with metagenomics, metagenomics allowed for the reliable detection and genotyping of off-target viruses in a single assay.
Metagenomics identified a primer/probe mismatch for SaV GI.7. An SaV GI.7 was initially identified by metagenomics but not by RT-qPCR. Based on the obtained SaV GI.7 sequence, the SaV RT-qPCR was updated ( Table 1). The forward primer remained unchanged (59-GAYCWGGCYCTCGCCACCT-39), the probe was truncated from 59-TGY ACCACCTATRAACCAVGG-39 to 59-TGYACCACCTATRAACCA-39, and the reverse primer was truncated at the 39 end and extended on the 59 end: 59-GCCCTCCATYTCAAAC ACTAWTTTG-39 to 59-CATTGCCCTCCATYTCAAACACTA-39. This updated primer set allowed for the retrospective detection of SaV with a C T of 13.

DISCUSSION
As previously reported by others and shown in this study (24), Illumina-based metagenomic sequencing is a robust method for generating norovirus sequences. We were also able to determine norovirus genomes from minority strains in double-infection samples in which Sanger sequencing could not differentiate the minor from the major strain.
Although this study did not primarily investigate outbreaks, we identified several pairs of samples that were collected from the same location and time. These pairs had a pairwise SNP distance of #2 across the combined ORF1 to ORF3 CDS. However, to identify norovirus clusters, further investigation of outbreaks with well-characterized transmission chains is necessary to determine an appropriate SNP cutoff.
When comparing genome completeness against qPCR C T values, a proxy for the viral load, an overall lower genome completeness at higher C T values was observed, as reported by others (25). The genome ends had a lower coverage depth, but the reads were evenly distributed across the rest of the genome (26)(27)(28)(29). For three samples with low C T value, no norovirus sequences could be obtained, and we hypothesize that this was caused by other abundant taxa outcompeting norovirus nucleic acids for sequencing space or the presence of inhibitors negatively affecting amplification efficiency (30). This limitation highlights one of the challenges of using agnostic, metagenomics-based virus typing.
Moreover, an RVA coinfection, which causes severe GE (2), was identified in an elderly individual, and all 11 segments were genotyped. Importantly, this study was performed in The Netherlands, where RVA will be included in the National vaccination program in 2024. Metagenomic GE surveillance can help assess the effect of vaccination on the epidemiology of RVA in The Netherlands. The other off-target viruses that were detected were an AiV1, which is linked to predominantly subclinical GE (44,45), AdV41, a known cause of GE (2), an HBoV1, and an HBoV3, the former of which is a respiratory virus, while the latter is a GE-causing virus (17,19,46). Lastly, an SaV GI.7 was identified in an infant, which belongs to the same viral family as norovirus with similar clinical and epidemiological characteristics (47,48). These off-target findings provide a reference set to assess silent transmission and to facilitate genotype to phenotype studies to determine if genetic or genotype-specific factors contribute to severe pathogenesis. However, it should be noted that studies have shown that 7 to 15% of norovirus infections remains asymptomatic, meaning that some of these off-target viruses could be the causative agent of the reported gastroenteritis (49).
Interestingly, the SaV GI.7-positive sample that initially tested negative with RT-qPCR was identified by metagenomics. Evaluation of the primers and probe showed that these were not compatible with this genotype. Therefore, they were updated, and now they are similar to the frequently used primer and probe sets of Oka et al. (50). Since Okada et al. first detected SaV GI.7 in 2006, only a few GI.7 sequences have been submitted to NCBI, and the first complete genome was sequenced in 2018 (51,52). The SaV GI.7 complete genome generated in this study most closely resembles this complete genome (GenBank accession number AB522390) with only 89% nucleotide identity, while SaV GI is shown to have low intragenotype diversity (51,53). A recent large-scale review of SaV prevalence (48) did not report the GI.7 genotype in surveillance programs. Therefore, our results raise questions about where this genotype circulated without being detected in surveillance programs.
Even though this study had a limited sample size, it generated a marked addition of contemporary and publicly available full genomes. Compared to full-genome sequences uploaded to NCBI over a 4-year period, the genomes generated in this study represent up to a 33% increase in contemporary sequences for some of these off-target viruses (Table S1). This can help contextualize molecular epidemiology and outbreak tracing.
While Illumina metagenomic NGS is a more costly technique than RT-qPCR or Sanger sequencing, it has broad surveillance potential. Conventional methods require continuous updates of primers and probe sets to account for antigenic drift and novel genotypes. Without metagenomic-based surveillance, the SaV GI.7 in this study would not have been identified, and similar results are reported by others (54). Primer mismatches can also occur for viruses of chronically infected patients, where prolonged replication can result in mutations in the otherwise conserved domains targeted by Sanger sequencing assays.
In-depth analysis of NGS data to identify GE viruses in a public health setting is challenging and time-consuming. Negative controls and detection thresholds have to be optimized to account for lab-specific contamination levels to mitigate false-positives. Likewise, a mixed infection with shared homologous regions requires careful manual curation. Here, it helps to have standardized pipelines, with parameter settings tailored to the local laboratory methods and visualization of genomic regions for manual in-depth inspection.
In conclusion, this study shows the potential of NGS-based norovirus surveillance for generating a full-genome reference set for a broad range of public health-relevant pathogens and for high-resolution outbreak detection. By describing the caveats and strengths of metagenomic-based surveillance, and by sharing these complete genome sequences, we hope to help other researchers contextualize their outbreak investigations and improve molecular epidemiology.

MATERIALS AND METHODS
Sample selection. A total of 71 norovirus RT-qPCR positive fecal samples obtained between 2015 and 2019 with a C T of ,30 for at least one norovirus genogroup were selected from GE outbreak-related samples sent to the Dutch National Institute for Public Health and the Environment (RIVM) norovirus surveillance program by medical microbiology laboratories for referral. The samples originated predominantly from young children (,5 years old, N = 23) and elderly (.65 years old N = 38). For one sample, the patient age was not reported. C T values ranged from 17.0 to 33.2 (N = 10) and 11.7 to 29.3 (N = 63) for GI and GII, respectively. Feces samples were stored at 4°C until processing, which is the standard procedure at the RIVM for norovirus-positive feces.
qPCR and Sanger sequencing. Feces sample was suspended in 1,000 mL modified Eagle medium (MEM) medium containing gentamicin (Thermo Fisher, Bleiswijk, The Netherlands) to give a 10 to 20% vol/vol suspension and centrifuged at 16,100 relative centrifugal force (RCF) for 1 min at room temperature. Total nucleic acid (TNA) was extracted using the MagNA Pure 96 system (Roche, Almere, The Netherlands) and eluted in 50 mL elution buffer. All samples were subjected to a qPCR panel targeting norovirus GI, norovirus GII, RVA, sapovirus (SaV), and adenovirus (AdV). Norovirus Sanger sequences were generated as described previously (3,15). Sanger sequences were analyzed using BioNumerics (AppliedMaths, Sint-Martens-Latem, Belgium) and genotyped via the NoroNet norovirus typing-tool (16). Additional viruses detected by NGS in this study, which are of relevance for public health, were confirmed by RT-qPCR: EV and human parechovirus (HPeV) via RT-qPCR as described by Benschop et al. (55,56) and SaV via RT-qPCR with modified primers (as described above).
Virus enrichment, NGS pretreatment, and NGS. Feces was suspended in 1,000 mL Eagle MEM containing gentamicin (Thermo Fisher, Bleiswijk, The Netherlands) to give a 10 to 20% vol/vol suspension and centrifuged at 16,100 RCF for 1 min at room temperature. The supernatant was filtered using Costar Spin-X 0.45-mm CA membrane centrifuge tube filters (Corning, Amsterdam, The Netherlands). Afterward, 200 mL filtrate was supplemented with 25 mL 25 mM MgCl 2 and treated with 1.25 mL 200 U/mL OmniCleave endonuclease (EpiCentre, Leusden, The Netherlands) for 1 h at 37°C.
Total nucleic acid (TNA) was extracted using the Magna Pure 96 system (Roche, Almere, The Netherlands) and eluted in 50 mL elution buffer. First-strand cDNA synthesis was performed using SuperScript III (Invitrogen, Bleiswijk, The Netherlands) using 11 mL TNA eluate as input. The second cDNA strand was synthesized using the NEBNext mRNA second-strand synthesis module (New England Biolabs, Leusden, The Netherlands) using 20 mL as input, as per the manufacturer's instructions, resulting in 160 mL output. Next, the sample was purified using DNA Clean & Concentrator-5 capped columns (Zymo Research, Leiden, The Netherlands), as per the manufacturer's instructions, resulting in 15 mL double-stranded DNA (dsDNA) per sample.
Library preparation was performed with the Nextera XT DNA library prep kit (Illumina, Eindhoven, The Netherlands). Samples were processed in four independent NGS runs, denoted with the prefixes R01 to R04. Prefixes R02, R03, and R04 were 150-nucleotide paired-end sequenced on an Illumina NextSeq instrument per the manufacturer's specifications, and samples starting with R01 were 300-nucleotide paired-end sequenced on an Illumina MiSeq instrument by BaseClear B.V. (Leiden, The Netherlands). On average, 4.7 million reads (postfiltering) were generated per sample.
Data analysis. All NGS data were analyzed using an in-house NGS analysis workflow called Jovian (v1.01, https://github.com/DennisSchmitz/Jovian) using default settings. Briefly, this workflow removes human and poor-quality reads (i.e., nucleotides with a Phred score of ,20 are trimmed, and reads with a length of ,50 nt are discarded), assembles reads into contigs and annotates contigs of $250 nt via megaBLAST against the NCBI NT database, determines the lowest common ancestor (LCA) of the BLAST results, and finally, performs genotyping of several clinically important virus families and genera via their respective typing tools (16).
Norovirus sequences were manually curated and subsequently scaffolded and assembled based on same-genotype public reference sequences or by extending the 59 and 39 ends using soft-clipped overhangs. This manually assembled draft genome was then assessed and corrected by performing Minimap2 (57) alignment and LoFreq (58) SNP-calling, with manual curation of the genome ends, to produce a final sequence. Homologous regions of mixed-strain samples were manually corrected and curated. Any reference nucleotide with ,3 reads coverage was masked with an N in the final genome sequence. Considering a minimum Phred score of 20 and minimum depth of coverage of 3, a sequencing error in the consensus requires $2/3 simultaneous errors which, by binomial distribution, has a 0.03% chance. This is a lower bound since the depth of coverage generally was $1 order of magnitude higher than 3 (Table 1).
Virus genomes were considered complete when their entire CDS was characterized with #100 N's, and partial when it contained .100 N's or could not be scaffolded to a full CDS sequence. If fewer than 50 total reads aligned against same-taxon scaffolds with a length $250 nt, they were considered negative. Norovirus scaffolds with a different ORF1 and ORF2 genotype than their majority strain were considered minority strains.
Data availability. All filtered, General Data Protection Regulation-compliant, FASTQ files and full genomes were submitted to NCBI SRA under study accession number PRJEB54724 (accession numbers OP162334 to OP162342, OP205527 to OP205585 and OP255971 to OP255995). Sequences that contained too many N's could not be uploaded to NCBI but are available on request. All complete and partial norovirus genomes were also uploaded to NoroNet (https://www.rivm.nl/en/noronet).

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, PDF file, 0.5 MB.