A next generation sequencing-based method to study the intra-host genetic diversity of norovirus in patients with acute and chronic infection

Immunocompromised individuals with chronic norovirus (NoV) infection and elderly patients are hypothesized to be reservoirs where NoV might accumulate mutations and evolve into pandemic strains. Next generation sequencing (NGS) methods can monitor the intra-host diversity of NoV and its evolution but low abundance of viral RNA results in sub-optimal efficiency. In this study, we: 1) established a next generation sequencing-based method for NoV using bacterial rRNA depletion as a viral RNA enrichment strategy, and 2) measured the intra-host genetic diversity of NoV in specimens of patients with acute NoV infection (n = 4) and in longitudinal specimens of an immunocompromised patient with chronic NoV infection (n = 2). A single Illumina MiSeq dataset resulted in near full-length genome sequences for 5 out of 6 multiplexed samples. Experimental depletion of bacterial rRNA in stool RNA provided up to 1.9 % of NoV reads. The intra-host viral population in patients with acute NoV infection was homogenous and no single nucleotide variants (SNVs) were detected. In contrast, the NoV population from the immunocompromised patient was highly diverse and accumulated SNVs over time (51 SNVs in the first sample and 122 SNVs in the second sample collected 4 months later). The percentages of SNVs causing non-synonymous mutations were 27.5 % and 20.5 % for the first and second samples, respectively. The majority of non-synonymous mutations occurred, in increasing order of frequency, in p22, the major capsid (VP1) and minor capsid (VP2) genes. The results provide data useful for the selection and improvement of NoV RNA enrichment strategies for NGS. Whole genome analysis using next generation sequencing confirmed that the within-host population of NoV in an immunocompromised individual with chronic NoV infection was more diverse compared to that in individuals with acute infection. We also observed an accumulation of non-synonymous mutations at the minor capsid gene that has not been reported in previous studies and might have a role in NoV adaptation.


Background
Norovirus (NoV) is recognized as a leading cause of epidemic and sporadic gastroenteritis around the world [1]. The viral RNA genome is about 7500 nt long and contains three ORFs. ORF1 encodes for a polyprotein that is cleaved into 6 non-structural proteins [2]. ORF2 and ORF3 encode the major (VP1) and minor (VP2) capsid proteins, respectively. NoVs are classified, based on VP1 amino acid sequences, into seven genogroups (GI-GVII) that are further divided into genotypes. To date, 41 NoV genotypes have been reported of which at least 29 have been found in humans [3], however, genotype GII.4 alone is responsible for over 60 % of all NoV outbreaks worldwide [4]. New genetic clusters of GII.4, commonly referred to as GII.4 variants, arise every 2 to 4 years and spread rapidly often causing global pandemics [4]. Novel GII.4 variants evolve by antigenic drift and display changes in VP1 epitopes that can avert immune responses mounted against previous variants [5][6][7]. Homologous recombination is another mechanism responsible for the genetic diversity among NoV GII.4 and it commonly occurs at the ORF1/ORF2 junction allowing the virus to exchange structural and nonstructural genes between different GII.4 variants or even different genotypes [8,9]. Norovirus acute gastroenteritis is a self-limited illness typically lasting 2 to 3 days while viral shedding can range from 13 to 56 days [10]. In immunocompromised patients, however, NoV shedding is usually prolonged [11][12][13][14][15] and cases of chronic NoV infection with shedding over 1-2 years have been reported [13,14,16]. Due to their long shedding periods and weak immune responses it has been hypothesized that immunocompromised patients with chronic NoV infection might be reservoirs where new GII.4 variants emerge [12,17]. Indeed, ORF2 sequence analysis in these individuals has shown that the virus can accumulate mutations in VP1 and develop an intra-host NoV population with large genetic diversity [12,16,[18][19][20][21][22][23]. However, it is still unclear how other regions of the viral genome evolve in immunocompromised hosts. More recently, the elderly and malnourished host have also been proposed as NoV reservoirs where NoV might accumulate mutations [17] but no studies have yet been performed in humans to confirm this hypothesis.
The gold standard method to study intra-host NoV populations involves cloning viral RT-PCR amplicons into plasmids followed by Sanger sequencing [16,19,20,24], a labour intensive method that requires a relatively large number of clones to be processed in order to obtain an accurate assessment of the viral diversity. An alternative approach uses next generation sequencing (NGS) technologies, which process millions of fragments of nucleic acid in a single experiment. NGS is highly cost-effective in terms of cost per base, however, since NoV RNA represents a very small fraction of all stool RNA, as little as 0.01 % [25], the cost per viral base can be considerable. Fortunately, the efficiency can be improved by viral RNA enrichment, and strategies previously used for NoV enrichment include polyA tail selection [26], RT-PCR amplification using NoV-specific primers [21,23,27], VIDISCA [28,29] and virus purification [22]. Depletion of rRNA is another possible enrichment strategy [30]. For most cells, rRNA is the most abundant species of RNA and its removal with commercial kits can substantially increase the prevalence of non-rRNA [31,32]. The method poses an advantage over others in that it maintains the original representation of the viral population present in the sample and is less likely to be affected by NoV RNA fragmentation.
In this study we establish a next generation sequencingbased method for NoV using bacterial rRNA depletion as an enrichment strategy for NoV RNA then use the resulting data to examine the intra-host viral population in samples from patients involved in NoV outbreaks (mostly elderly) and in longitudinal samples collected from an immunocompromised host with chronic NoV infection.

Mapping of NoV sequencing reads
The samples included in this study (n = 6) are described in Table 1. NoV sequences represented 0.01 % to 1.88 % of all quality-filtered reads and 0.04 % to 8.54 % of the non-rRNA reads. The non-NoV reads were further characterized and a description is provided under Additional file 1. The average coverage of the final consensus sequences ranged between 11X and 1,603X (Table 2 and Fig. 1). Five out of six samples yielded near full-length NoV sequences and called 99.91 to 100 % of the reference genome. OU3, the sample with the lowest percentage of NoV reads and lowest coverage, failed to yield a complete NoV genome sequence. The percentage of NoV sequences per sample showed a strong correlation with the Ct values of the RT-qPCR performed on RNA stool extracts (Spearman's rho correlation coefficient: 0.886, P = 0.019, two-tailed). An equally strong correlation was observed between the percentage of NoV sequences and viral titer per ng of RNA (Additional file 2). These results suggest that the poor yield of NoV sequences from sample OU3 was due to a low abundance of viral RNA rather than failure of the enrichment method.
In order to validate our NGS method, the NoV consensus sequences of samples OU1 and OU3 obtained with the MiSeq platform were compared to those obtained using Sanger sequencing. There was a large concordance between both methods. NoV OU1, which achieved high coverage with MiSeq, showed a concordance of almost 100 % between the sequences from both methods except for one nucleotide. The base mismatch was located near the 3'end of the genome and was confirmed as an error in Sanger sequencing after review of the Sanger chromatogram and the coverage with MiSeq at that position (21X). Furthermore, the 5' and 3' ends of the genome could be extended by 9 and 24 nucleotides, respectively, with MiSeq. These nucleotides were not covered with Sanger because of their close proximity to the primers used for PCR amplification and sequencing. The consensus sequence with lowest coverage, NoV OU3, showed three mismatches over a total of 5,081 nt that could be sequenced using Sanger and MiSeq. The mismatches were identified at 1732 nt, 2167 nt and 2239 nt (positions are given relative to KU311160) and had, respectively, coverages of 1X, 2X and 2X and Phred base quality scores of 16, 39 and 39. All three mismatches occurred at wobble bases resulting in synonymous mutations and the last two were located in the p22 gene. Since these positions had very low coverage, it is possible that the mismatches were created by sequencing errors with MiSeq, however, the high base quality associated with at least 2 of these three nucleotides and their relative position in the genome suggests that they could also represent true SNVs originally present in the sample.

Detection of single nucleotide polymorphisms
Single nucleotide variants (SNV) were identified to measure the intra-host genetic population of NoV. In order to reduce false positives, we only called a SNV if it was observed at least 5 times and represented a minimum of 2 % of all observations. Using these criteria we did not detect any SNVs in all acute infection samples (OU1, OU2, OU3 and OU4). OU3 and OU4 have relatively low coverage (Table 2) limiting the ability to detect SNVs, but OU1 and OU2 have an average coverage above 250X and still no SNVs were found. In contrast, the first sample from the immunocompromised subject with chronic NoV infection, SP1, had 51 SNVs while SP2, the second sample collected from the same individual 4 months later, had 122 SNVs, indicating that the genetic diversity of NoV in this patient was higher and also increased over time (Fig. 2). The increased number of SNVs in SP2 vs. SP1 was confirmed even after controlling for differences in coverage (Additional file 3). The higher number of SNVs in the SP2 sample was not due to more sensitive detection, since SP2 had a lower average coverage than SP1, OU1, and OU2 ( Table 2).
The percentages of non-synonymous single nucleotide variants (nsSNVs) for SP1 and SP2 were 27.5 % and 20.5 %, respectively. There were differences in the distribution of nsSNV across genes between SP1 and SP2. In SP1, most nsSNVs occurred in p22 (4/14), VP2 (4/14)   (3/14) whereas in SP2, the majority of nsSNVs occurred in VP2 (12/25) and VP1 (9/25) ( Table 3). The enrichment of non-synonymous mutations in VP2, the P2 domain of VP1 and p22 was statistically significant (P < 0.05) after controlling for gene/domain size (Table 3) and is consistent with an increased rate of amino acid divergence in these proteins. The amino acid residues affected by nsSNVs (summarized in Fig. 3 . Of the 11 mutations occurring at the P2 domain of VP1, two (affecting amino acid residues 294 and 340) were located at epitopes reported to be targeted by antibodies that block the binding of NoV to human blood group antigens (surrogates of neutralizing antibodies) [6]. A comparison between the consensus sequences of SP1 and SP2 revealed differences in 9 positions, of which 2 were synonymous (203 nt and 3851 nt), 6 were non synonymous (2188 nt, 6095 nt, 6193 nt, 6281 nt, 7141 nt and 7167 nt; the corresponding amino acid changes are shown in Fig. 3) and 1 was located at the 3'UTR (7509 nt). Among the 9 differences between the consensus sequences of SP1 and SP2, 6 were SNVs at SP1 only (203 nt, 2188 nt, 3851 nt, 6281 nt, 7141 nt and 7167 nt), 1 was a SNV in SP2 only (6095 nt), 1 was a SNV in SP1 and SP2 (6193 nt) and 1 was not a SNV in SP1 nor SP2 (7509 nt). There were a total of 10 SNVs occurring at the same nucleotide positions of SP1 and SP2 (474 nt, 2303 nt, 4323 nt, 5580 nt, 5706 nt, 6193 nt, 6222 nt, 6803 nt, 7172 nt and 7252 nt; shown with red color in Fig. 2).

Discussion
In this study we established a method to analyze the intra-host genetic diversity of NoV in samples from patients with acute and chronic NoV infection. Our next generation sequencing-based method coupled to bacterial rRNA depletion as a NoV RNA enrichment strategy produced between 0.01 % and 1.9 % of NoV reads. Characterization of the non-NoV reads confirmed that samples still contained 64 to 95 % of bacterial rRNA reads after depletion which differs with a previous study reporting 1 to 5 % of bacterial rRNA reads after using the same depletion method with stool RNA [32]. The reasons for the lower rRNA depletion efficiency observed with our samples could not be identified.
Since the method we used for depletion works with a library of probes that must first hybridize the bacterial rRNA for subsequent removal, it is possible that the hybridization step was affected by sequence incompatibility. Also, rRNA depletion was carried out with the maximum amount of input RNA recommended by the manufacturer, which could have affected the efficiency of the method. Based on our data we estimate that if all 16S and 23S bacterial rRNA could be removed from stool RNA, NoV sequences would have represented between 0.04 % and 8.5 % of all NGS reads.
For all but one sample we were able to obtain sufficient sequence data to retrieve near full-length NoV genome sequences. Moreover, even the non-enriched sample (OU4) provided enough data for de novo assembly of a NoV genome, AlbertaEI404/2012/CA, which to our knowledge, is the first near full-length NoV GI.7 genome reported. Strains of genotype GI.7 were observed with increased prevalence in Alberta, Canada, between July 2012 and June 2013 [33] and also among children in Pakistan between April 2006 and March 2008 [34].
Our analysis included four cases of acute NoV infection. Two samples from these acute cases, OU1 and OU2, yielded sufficient coverage to identify SNVs at frequencies ≥ 2 %. However, no NoV SNVs were detected in these two samples, indicating that the viral population in typical acute infections is homogeneous. In contrast, we identified numerous SNVs in an immunocompromised Fig. 2 Distribution of NoV SNV frequencies across the viral genome. Samples SP1 and SP2 were collected four months apart from an immunocompromised bone marrow transplant patient with chronic NoV infection. SNV calling was performed using Freebayes. Only SNVs with frequencies ≥ 2 % and ≥ 5X coverage are reported. Positions with coverage < 10X were excluded from the analysis. SNVs shared in common between SP1 and SP2 are shown in red patient with chronic NoV infection. Interestingly, there was also an increase of SNV over time (51 SNVs with ≥ 2 % frequency in the first sample and 122 SNVs with ≥ 2 % frequency in the second sample collected 4 months after), revealing intra-host NoV evolution. As of the date of writing this manuscript there were two published studies comparing the intra-host populations of NoV between immunocompetent and immunocompromised patients by using NGS data. Both reported similar observations [21,22]. Vega et al. found no SNVs throughout the viral genome (ORF1, ORF2 and ORF3) in an immunocompetent subject with acute NoV illness but identified multiple SNVs in three different immunocompromised bone marrow transplant patients (15, 67 and 235 SNVs with ≥ 10 % frequency). In an analysis of just the ORF2 and partial ORF3 regions, Bull et al. detected multiple SNVs in immunocompetent individuals with acute NoV infection (5 to 8 SNVs with ≥ 2 % frequency). However, an immunocompromised individual with chronic NoV infection displayed considerably higher NoV diversity that also increased over time (48,    and 34 to 48 % in immunocompromised subjects) whereas we observed a rate of 21 and 28 % in the immunocompromised patient. Our rates are similar to those reported in another study with bone marrow transplant patients receiving immunosuppressive drugs (236 non synonymous mutations out of a total of 1082 mutations across 13 samples, equivalent to an overall rate of 21.8 %) [23]. It could be argued that Bull et al. achieved higher coverage and therefore higher resolution of SNVs with PCR, however, at least two of our samples (one from an outbreak patient and another from the immunocompromised patient) had average coverage levels greater than 1300X, well above 950X, the average coverage reported by Bull et al. We can only speculate that the differences between our study and that of Bull et al. could be due to the processes of quality filtering and trimming of reads as well as the parameters used for SNV calling, which probably were more stringent in our study.
We observed an enrichment of non-synonymous SNVs in the P2 domain of VP1 which matches the expectation that the most exposed and possibly the most antigenic part of the virus bears the highest pressure to diverge. In fact, the majority of studies analyzing NoV intra-host genetic diversity have been restricted to identify potential changes at VP1 [12,14,21,24,35,36]. By analyzing near-full length NoV genomes, we also observed an enrichment of non-synonymous mutations in VP2, which suggests that during chronic infection this gene can also be under pressure to diverge. Although our observations are based on samples from a single immunocompromised patient, we believe this is plausible because it agrees with previous analysis of sequence alignments of multiple GII.4 strains showing high evolutionary rates for VP2 [27,37]. The role of VP2 in the viral life cycle is still unknown. Only a few copies of VP2 molecules are present in the final virion (the precise number is yet unclear). VP2 appears to bind the interior surface of the capsid and has been shown to enhance the expression and half-life of VP1 [38]. Since VP2 is rich in basic residues and therefore is positively charged, it has also been suggested that it interacts with the negatively charged genomic RNA, and possibly plays a role during the encapsidation process [38]. The VP2 protein of murine noroviruses regulates the maturation of antigen presenting cells and is an important determinant of NoV protective immunity [39]. It is interesting to consider that human NoV VP2 might be an important epitope of host immune responses.
By analyzing whole NoV genome sequences using traditional cloning, Chan et al., found accumulation of mutations at VP1 and VP2 (38 and 15 nt resulting in 9 and 5 amino acid changes, respectively) over a period of 4 months in an immunocompromised patient with agammaglobulinemia and thymoma [19]. Conversely, Vega et al. reported that, in comparison to NoV strains in circulation, NoV mutations in immunocompromised bone marrow transplant patients occur mostly at random positions except for gene p22, which presents a significantly large proportion of mutations and positively selected sites [22]. We speculate that the difference in results observed in these studies and our study might be due to temporal changes in the strength and specificity of host immune responses during chronic infection. This will be dependent on the type of immunosuppressed host, the type of exogenous immunosuppression administered, the time of infection during the patient's clinical course and the time of the patient's immune reconstitution. Regarding this point, we observed that the distribution of non synonymous mutations across genes changed with time which suggests that NoV might face changes in its evolutionary trajectory during chronic infection as reported for other viruses such as HCV and HIV [40][41][42].
Besides whole genome sequencing, NGS methods are powerful tools for studying viral population dynamics within a host because of the high sequencing depth that can be achieved. In addition, samples can be prepared relatively quickly without the need of cloning into vectors and multiple isolates can be sequenced in parallel (in our case, six samples were analyzed in one MiSeq 2x121bp run). The number of samples that can be sequenced depends on the throughput of the sequencing technology and the abundance of the viral RNA in the sample. Different forms of enrichment can enhance the later. We demonstrated that bacterial rRNA depletion is a beneficial treatment that can be further optimized. Hybrid capture is a promising new enrichment strategy that remains to be examined in the future [43].

Conclusions
This study provides data useful for the selection and improvement of NoV RNA enrichment strategies for NGS and is also the first study to look at the intra-host diversity of NoV by analyzing near full length-genomes from acute cases of NoV infection and in longitudinally collected samples from an immunocompromised patient with chronic NoV infection. We identified a larger viral diversity in the immunocompromised patient, which increased over time and we also observed that genes VP2, p22 and VP1 accumulated the majority of nonsynonymous mutations during chronic infection. Further studies are needed to observe if the accumulation of mutations at VP2 is consistent across immunocompromised patients with chronic infection and unveil the role of VP2 and p22 proteins in relationship to host immune responses. We are currently planning a larger follow-up study to assess the intra-host genetic diversity of NoV in solid organ transplant recipients.

Patient samples
All stool specimens were collected in Alberta between 2012 and 2014 and stored at -80°C until analysis. A total of six stool samples previously genotyped within our routine program of NoV surveillance in Alberta were included in this study [33,44]. The patient's description and NoV genotype associated to each sample are listed in Table 1. Samples OU1, OU2, OU3 and OU4 were collected from outbreak patients with acute NoV infection. The near full-length norovirus genome for sample OU1 was determined in a previous study using Sanger sequencing [9] and was included for comparison purposes. Samples SP1 and SP2 were collected four months apart from a pediatric bone marrow transplant patient who first tested positive for NoV 6 months before sample SP1 was collected.

RNA extraction
Samples OU1, OU2, OU3, SP1 and SP2 were processed as follows: 50 to 75 mg of stool were mixed with 20 μL of proteinase K and 200 μL of lysis buffer from Maga-zorb®RNA mini-prep kit (Promega, Madison, WI). Nucleic acids were extracted with 1 mL of Trizol® (Life Technologies, Carlsbad, CA) and according to the manufacturer's instructions. The yield of RNA per extraction was estimated using NanoDrop 1000 and the process was repeated to obtain at least 50ug of RNA per sample. The extracts were then treated with DNase (Promega, Madison, WI) and purified by phenol chloroform extraction and ethanol precipitation. RNA extracts were eluted through OneStep™ PCR Inhibitor Removal columns (Zymo Research, Irvine, CA). Bacterial rRNA was depleted using Ribo-Zero® bacterial kit (Epicentre, Madison, WI) according to the manufacturer's instructions using 5 μg of RNA per sample and purifying by ethanol precipitation as the final step. Depletion of bacterial rRNA was performed once per sample. For comparison purposes, sample OU4 was processed using our routine enteric virus nucleic acid extraction method previously described [33,44] using Magazorb®RNA mini-prep kit. The nucleic extract was treated with DNase and purified by phenol chloroform extraction followed by ethanol precipitation. Sample OU4 was not depleted of bacterial rRNA. The presence of norovirus RNA was confirmed in all extracts by RT-qPCR as previously described [45].

Illumina library preparation and sequencing
Sample libraries were prepared from 1 μg of RNA using the TruSeq RNA sample preparation kit v2 (Illumina, San Diego, CA) following the manufacturer's instructions with a fragmentation time of 1 min during the "elute-fragmentprime" step and unique indexed adapters for each sample.
cDNA libraries were quantified with Qubit and the average fragment size was estimated using the Agilent 2100 Bioanalyzer. A control library of phage X714 was also included in each sample. All sample libraries (n = 6) were sequenced once on a single Illumina MiSeq run to produce paired end reads of 250 bp each, resulting in reads of 121 bp each after removing adapters.

Identification, characterization and removal of rRNA reads
Raw sequence reads were quality-trimmed and filtered with Prinseq-lite, version 0.20.4 [46] using the following criteria: the first nucleotide at each end (5' and 3') was trimmed and the following nucleotides were also trimmed stepwise if their base quality was below 20. Sequences with an average base quality below 20 or with more than 90 % of Ns were also removed. A description of the reads that were filtered is provided in Additional file 2.
Ribosomal RNA reads were identified and filtered out with SortmeRNA [47] using the 23S/28S large subunit (LSU) and 16S/18S small subunit (SSU) rRNA SILVA 119 databases and the 5S and 5.8S rRNA Rfam databases for all three domains of life (Eukarya, Bacteria and Archaea). All reads failing to pass filters (i.e. non-rRNA reads) were maintained as paired-ends reads and used in downstream analyses the command used for analysis is described in Additional file 2).
A subset of 150,000 single-end reads with length ≥ 80 nt and identified as bacterial 23S rRNA, the most predominant type of rRNA found in all samples, was uploaded in the SILVAngs data analysis service (https://www.arb-silva.de/ngs/) for identification of operational taxonomic units (OTUs).
Since no full-length or near full-length genome NoV GI.7 sequences were available in GenBank, additional steps were followed for OU4. OU4 reads were assembled de novo with Velvet version 1.2.10 using several hash lengths with read category set to 'short paired' and including the consensus sequence (partial ORF1, complete ORF2 and ORF3) as a long read (the command used for this analysis is described in Additional file 2). Norovirus contigs were identified among all Velvet assemblies with BLAST [49] using JN899243.1, a genotype GI.9 strain, as query sequence. All NoV contigs and the partial consensus sequence were aligned using MEGA 6.0 [50] to obtain the final genome assembly.
The ends of each NoV genome were extended beyond the reference sequences by using the first and last 15 nt as query for matches among fastq sequences using the Unix "grep" command. Matching reads were aligned to the consensus sequence and any extra nucleotides (5'or 3' overhangs) were incorporated into the consensus sequence. The process was repeated until no more extra nucleotides were found at either end of the consensus sequence.
Sample reads were mapped with Bowtie 2 to their respective NoV extended consensus assembly. The final alignments (SAM files) were used to calculate the coverage per genome position using BEDTools, version 2.14.3-1 [51].

NoV sequencing using Sanger's method
The NoV strain from sample OU3 was sequenced using Sanger's method to compare results against those obtained with Illumina MiSeq. Nine pairs of primers (described in Additional file 2) were designed to retrieve overlapping PCR amplicons between~600 to 1100 bp long, spanning altogether all NoV ORFs. The RT and PCR reactions were performed as previously described [9]. PCR products were obtained for six out of nine pair of primers and were sequenced in both directions. The assembly of the sequences produced two nonoverlapping contigs: a 3,448 bp sequence containing a partial ORF1 (incomplete at the 5' and 3' ends) and a 1,906 bp sequence spanning ORF2 and ORF3 (incomplete at the 5' and 3' ends, respectively).

Characterization of non-rRNA, non-NoV sequences
The sequences failing to align with Bowtie 2 to the final NoV consensus sequence were analyzed with BLAST to further characterize the major components of stool RNA. All reads from a single end were queried against the non-redundant nt database using megablast (standalone version with databases downloaded on July 24, 2015; see parameters of BLAST analysis in Additional file 2). Results were analyzed with SPSS after removing duplicates, i.e. if a read had more than one BLAST hit, then only the hit with the lowest e-value was included in the analysis. Bacterial hits belonging to the normal human gut were identified using as reference the microoganisms reported in previous studies [52,53].

Analysis of single nucleotide variants
Single nucleotide variants (SNVs) were called with Free-Bayes [54] and visually inspected in Tablet version 1.14.04.10 [55]. The following criteria was used for SNV calling with FreeBayes: -K (report all alleles passing filters), -haplotype_length =1 (call haplotypes as 1 nt long), -m or mapping quality =10 (chance that the read truly originated elsewhere of 1 in 10), -q or base quality =20 (chance of a wrong base call of 1 in 100), -F or alternate fraction =0.02 (call SNVs with frequencies ≥ 2 %), -min-coverage =10 (call SNVs for positions with coverage ≥10X) and -C or min-alternate-count = 5 (call SNVs with coverage ≥ 5X) (the command used for the analysis is described in Additional file 2). The choice to set up the analysis to detect variants with frequencies ≥ 2 % was made based on a study reporting that sequencing errors with the MiSeq platform can produce false variants that are undistinguishable from true low frequencies variants at ≤ 1 % with a 1000X average coverage [56]. We also set up the analysis to call SNVs with a coverage of ≥ 5X based on: 1) a study that eliminated virtually all false positives by calling variants if counted ≥ 10 times independently [57] and 2) the lower coverage per genome position achieved with our samples compared to Van den Hoecke et al. [57].

Additional files
Additional file 1: Major components of stool RNA after bacterial rRNA depletion. This file provides a description of the rRNA and non-NoV-non-rRNA reads from each sample. (DOCX 1108 kb) Additional file 2: Various supporting data. This file contains: 1) the data used to estimate the correlation between percentage of NoVreads and Ct values, 2) the commands used for NGS analysis, 3) a summary of the sequences filtered out by Prinseq-lite due to quality control and 4) a description of the primers designed for sequencing sample OU3 via Sanger's method. (DOCX 86 kb) Additional file 3: NoV SNV calling results. This file lists all synonymous and non-synonymous NoV SNVs detected by FreeBayes in samples SP1, SP2, and a subset of sequences from SP1 that produced an average NoV coverage similar to thatobserved in SP2. (XLSX 38 kb)