Tracing the fate of wastewater viruses reveals catchment-scale virome diversity and connectivity

The discharge of wastewater-derived viruses in aquatic environments impacts catchment-scale virome compo- sition. To explore this, we used viromic analysis of RNA and DNA virus-like particles to holistically track virus communities entering and leaving wastewater treatment plants and the connecting river catchment system and estuary. We reconstructed > 40 000 partial viral genomes into 10 149 species-level groups, dominated by dsDNA and ( + )ssRNA bacteriophages ( Caudoviricetes and Leviviricetes ) and a small number of genomes that could pose a risk to human health. We found substantial viral diversity and geographically distinct virus communities associated with different wastewater treatment plants. River and estuarine water bodies harboured more diverse viral communities in downstream locations, influenced by tidal movement and proximity to wastewater treatment plants. Shellfish and beach sand were enriched in viral communities when compared with the surrounding water, acting as entrapment matrices for virus particles. Extensive phylogenetic analyses of environmental-derived and reference sequences showed the presence of human-associated sapovirus GII in all sample types, multiple rotavirus A strains in wastewater and a diverse set of picorna-like viruses associated with shellfish. We conclude that wastewater-derived viral genetic material is commonly deposited in the environment and can be traced throughout the freshwater-marine continuum of the river catchment, where it is influenced by local geography, weather events and tidal effects. Our data illustrate the utility of viromic analyses for wastewater- and environment-based ecology and epidemiology, and we present a conceptual model for the circulation of all types of viruses in a freshwater catchment.


Introduction
Viruses are the most abundant biological entities in terrestrial and aquatic biomes, but their origin, distribution and potential to spread disease via watercourses is poorly understood (Roux et al., 2020). Previous research has demonstrated that wastewater contains a plethora of viruses, including human-pathogenic and zoonotic viruses, and that wastewater treatment processes do not remove human viruses with sufficient efficacy (Da Silva et al., 2007;Farkas et al., 2018b;Fong et al., 2010;Girones et al., 2010;Gomes et al., 2019;Gulino et al., 2020;Hellmér et al., 2014;Kitajima et al., 2014;Prado et al., 2019;Qiu et al., 2015;Sidhu et al., 2017). Viral abundance, behaviour, infectivity and fate remain poorly understood because of knowledge gaps in the ecology and connectivity of viromes across human populations and the freshwater-marine continuum.
The current gold standard method for investigating enteric viruses in the environment is q(RT)-PCR, a technique that provides reliable quantitative information on the presence of the genomic material of target viruses, but requires prior knowledge on the identity of the virus and its genome sequence (Farkas et al., 2020(Farkas et al., , 2017b. As qPCR-based assays only detect a fragment of the genome, the question of virus integrity, and hence infectivity, remains open. Infectivity assays can offer a solution, but even where available, require specialised cultivation systems and are not likely to become generally applicable for routine monitoring of public health risks (DiCaprio, 2017). As a more comprehensive and now potentially feasible alternative, we applied shotgun viromics, i.e. next-generation sequencing of the entire aquatic virome, to reconstruct full virus genomes from the environment and objectively scrutinise the ecological and health implications of virus diversity and geographical distribution, with minimal bias.
Virome analyses are transforming our understanding of viral diversity and function in the biosphere (Emerson et al., 2018;Gregory et al., 2020;Roux et al., 2016) and provide unprecedented opportunity to understand the connectivity and fate of human-derived viruses at the catchment scale. Here, we present the first integrated analysis of the full virome of a river catchment system and estuary including water, sediment, wastewater treatment plants, beaches and shellfish production areas ( Fig. 1 and Supplementary Table 1). We assembled over 40,000 partial or near-complete genomes (UViGs Uncultivated Virus Genomes, (Roux et al., 2019)) of ssRNA, dsRNA, ssDNA and dsDNA viruses, clustered into 10 149 species-level groupings (vOTUs, viral Operational Taxonomic Units). Our detailed bioinformatic analysis of the RNA and DNA viromes provides an assessment of viral diversity in the wastewater-impacted Conwy river catchment located North Wales (UK) (Farkas et al., 2018aPerkins et al., 2014;Robins et al., 2019), encompassing information on the dynamics of viral deposition along the river system leading to viral enrichment at the estuary, including shellfish destined for human consumption and a recreational bathing beach. Viral genome reconstruction revealed general patterns of viral enrichment, dilution and degradation, and the implications for human health.

Sample collection
This work builds on our pilot study of a single wastewater treatment plant, and a downstream water and sediment sampling site, in which we optimised methods and showed that we could reconstruct RNA virus genomes from environmental samples (Adriaenssens et al., 2018). We collected and processed four different types of samples for this study: wastewater (influent and effluent), surface water (river and estuarine), sediment and shellfish in June 2017 from the Conwy river catchment area located in North Wales (UK) (Fig. 1, Supplementary Table 1). Wastewater influent was collected from the four major treatment plants in the catchment and the corresponding effluent from three (the Ganol plant effluent pipe exits directly into the open sea and therefore was not sampled separately); one litre per sample. Surface water was collected in four biological replicates of 50 L, resulting in two replicates per library type (RNA and DNA-based). At two locations, one in the river and one at a major recreational beach, sediment samples were taken in 4 biological replicates of at least 50 g, two replicates per library type. At low tide, samples were scooped from an accessible part of the river bank into sterile bags using a sterilised spade. Finally, mussels (Mytilus edulis) were collected from the two main commercial shellfishery locations in the estuary and divided into eight pseudoreplicates, as described below. Samples were collected at low tides, enabling the collection of shellfish samples from shore without a boat.

Sample processing
The wastewater (1 L) and surface water (50 L) samples were concentrated using a two-step protocol involving tangential flow ultrafiltration (TFUF) and beef extract elution as described in detail previously (Farkas et al., 2018c). Briefly, sample volumes were reduced to 50 mL by TFUF on a KrosFlo® Research IIi Tangential Flow Filtration System (Spectrum Labs, Phoenix, AZ, USA, Cat. no. SYR-U20-01N) using a 100 kDa cut-off mPES MiniKros® hollow fibre filter (Spectrum Labs). Virus particles were eluted using beef extract and NaNO 3 to a final concentration of 3% and 2 M (pH 5.5), respectively. After the solution's pH was adjusted to 7.5, PEG 6000 was added to a final concentration of 15% with 2% NaCl, incubated overnight at 4 • C and after centrifugation (30 min, 10 000 x g, 4 • C) the pellet was resuspended in 10-15 ml PBS (pH 7.4). These suspensions were kept at -80 • C until nucleic acid extraction. The sediment samples were processed using beef extract elution and PEG 6000 precipitation as above and described previously (Farkas et al., 2017a).
For the mussel samples, approximately 200 mussels were collected from each location and stored on ice. Each mussel was dissected and the digestive tissue extracted and minced with a scalpel. The tissue was Fig. 1. Viral abundance along the wastewater impacted Conwy river catchment and coastal zone. Left: Schematic of the Conwy river catchment with sampling sites designated by colour-coded dots (redwastewater, bluesurface water, greensediment, orangeshellfish). The section of the river within the tidal limit is designated in green. Map of Great Britain by Free Vector Maps. Right: Boxplot representation of the number of species (vOTUs) detected in each sample per ml or g of sample extracted, composed of RNA and DNA libraries and biological replicates, species numbers for single libraries in blue dots. pooled per location and then divided into four pseudoreplicates. Two replicates per location were mixed with SM buffer (0.1 M NaCl, 50 mM Tris/HCl-pH 7.4, 10 mM MgSO 4 ) and two with PBS at 25 g of digestive tissue to 20 mL of buffer. The samples were then shaken for 30 minutes (150 rpm) at room temperature to dissociate viral particles from the digestive tissue after which they were stored at -80 • C. Due to a limit on the number of shellfish collected, we used both PBS and SM buffer as we could not perform a pre-optimisation of the protocol. No systematic differences between the two methods were detected.

RNA extraction
Wastewater, surface water and sediment concentrates were processed as follows. The concentrates were diluted in an equal volume 0.5 M NaCl to improve dissociation of viral particles before filtration. After centrifugation (5 min, 3200 x g) the supernatant was filtered through a 0.2 µm sterile syringe filter (Millipore). The filtrate was further concentrated using Vivaspin 20 spin filters (100 kDa) and centrifugation at 3200 x g. Once the volume was below 1 mL, 5 mL Tris buffer (5 mM TrisHCl, 5 mM MgSO 4 , 75 mM NaCl, pH 7.5) was added and the volume reduced (two times) to reduce the NaCl content of the virus suspension. Centrifugation times ranged between 150 minutes and 20 hours to reduce the volume below 500 µL for the next step. A DNase treatment with 10 U Turbo DNase (Invitrogen) was performed to remove extraviral DNA (incubation at 37 • C for 30 min, inactivation at 75 • C for 10 min). Mussel samples were highly viscous and required separate processing, as we were unable to filter or concentrate with the Vivaspin filters. Instead, 2 × 1 mL aliquots per replicate were mixed with 0.1 mm glass beads (MoBio) and lysed in a PowerLyser (MoBio) shaker (2 × 30 seconds at 3400 rpm). Debris was removed by centrifugation (5 min at 3200 x g) and the supernatant was stored at -20 • C for next-day processing.
For all sample types, the viral capsids were lysed using a combination of proteinase K (50 µg for clear samples, 100 µg for turbid samples), EDTA (0.5 M final concentration) and SDS (0.5% final concentration), and incubation for one hour at 56 • C. Next, the RNA was extracted by TRIzol extraction derived from Kroger and colleagues (Kroger et al., 2012). In short, 500 µL of sample was mixed with 1 mL of TRIzol reagent and 200 µl of molecular-grade chloroform in Phasemaker TM tubes (Invitrogen), shaken vigorously and centrifuged for 15 minutes at 13 000 x g. The aqueous phase was removed and transferred to a new tube. The phase separation was repeated for samples that remained turbid. The nucleic acid was recovered by isopropanol precipitation and resuspended in 50 µL of sterile, RNase-free water. Viral DNA was removed with an additional DNase step, adding 4 U Turbo DNase, 5 µL TD buffer, and incubating for 40 minutes at 37 • C followed by inactivation of the DNase at 75 • C for 10 minutes. The DNase was removed by a second isopropanol precipitation as above, the RNA resuspended in 50 µL of RNase-free water and stored at -80 • C until sequencing. Alongside all samples, a positive extraction control comprising of Salmonella cells (Salmonella enterica subsp enterica serovar Typhimurium strain D23580, RefSeq acc NC_016854) and the process-control virus mengovirus (~ 10 5 particles/ml) was extracted, as was a negative Tris buffer control.

DNA extraction
Wastewater, surface water and sediment samples were processed similarly as for RNA extraction with a few amendments to the extraction process. The samples were diluted in 10 ml NaCl (0.5 M). For surface water and sediment samples one replicate (designated a) was treated with chloroform (1 mL) to lyse the cellular fraction (15 min incubation with gentle shaking) and the cellular debris removed by centrifugation (5 min, 3200 x g). The second replicate (b) was filtered through a 0.45 µm sterile syringe filter (Millipore). For the wastewater samples which consisted of only one replicate, the sample was split in two, half treated with chloroform and half filtered, and then merged. All samples were then concentrated and desalted as described above (using Vivaspin 20 spin filters (100 kDa) and centrifugation at 3200 x g, with centrifugation time between 100 min and 20h). All sample concentrates (approx. 500 µL each) were treated with 10 U of Turbo DNase (Invitrogen) and 10 µg of RNase A (Thermo Fisher Scientific) supplemented with Turbo DNase buffer for 30 min at 37 • C and inactivation at 65 • C for 10 min. The separation of filtering and chloroform treatment before proteinase and nuclease treatment theoretically allows for the recovery of both giant viruses (larger than the filter pore size) and viruses with lipid membranes, while simultaneously reducing the background cellular DNA. No differences between treatments were found for potentially pathogenic viruses, however, the systematic comparison of methods for all viruses is beyond the scope of this study and we encourage others to use our data for further exploration.
Mussel digestive tissue was processed exactly as during RNA extraction (mixed with 0.1 mm glass beads and lysed in PowerLyzer) and no nuclease treatment was performed.
From this point, all samples were extracted in the same manner. Capsids were lysed by adding proteinase K (50 µg/mL final concentration), EDTA (20 mM final concentration) and SDS (0.5% final concentration), followed by incubation at 56 • C for one hour and extracted using phenol/chloroform/isoamylalcohol (25:24:1) in PhaseLock tubes (VWR). The aqueous phase was transferred to a new tube and the process was repeated at least once (twice for turbid samples), followed by one round of chloroform phase separation. Finally, samples were further cleaned and concentrated with ethanol precipitation (2.5 x volume 100% ethanol; 1/10 volume 3 M NaAc pH 5; incubation at -20 • C for 30 minutes; precipitation 30 min at 15 000 x g, 4 • C), washed with 70% ethanol and air-dried in a laminar flow cabinet.
In tandem with the whole process, control samples were extracted, starting with the dilution in NaCl. We used a negative control consisting of Tris buffer and a positive control consisting of 500 µl stationary culture Escherichia coli MG1655 cells (RefSeq acc NC_000913), 2.2 × 10 8 pfu of Escherichia phage T5 (RefSeq acc NC_005859) and 1.3 × 10 5 pfu of Escherichia phage vB_EcoP_phi24B (GenBank acc HM208303).

Sequencing
Sequence library preparation and sequencing was performed by the Centre for Genomics Research (CGR) NBAF facilities at the University of Liverpool, UK. RNA libraries were prepared as in the pilot study (Adriaenssens et al., 2018) using the NEBNext Ultra directional RNA library preparation kit of Illumina with dual indexes. During library preparation, the number of PCR cycles was increased to 30 to account for the low amounts of input RNA (< 1 ng). Dual-indexed DNA libraries were generated using the NEBNext Ultra II DNA Library Prep kit according to the manufacturer's instructions. Libraries were pooled and sequenced on six lanes of the Illumina HiSeq 4000 generating paired-end 2 × 150 bp reads, three lanes for the RNA libraries in July 2017 and three lanes for the DNA libraries in March 2018.
The DNA sequencing run failed for libraries DNA_SW2a, DNA_TyCa/ b, DNA_SBa/b and unfortunately, we were unable to reconstruct the libraries as the samples had been mistakenly stored at 4 • C and the DNA had degraded. Furthermore, the read lengths obtained for the mussel DNA libraries were much lower as for all other libraries, as the DNA had been excessively sheared during the extraction procedure.

In silico processing
Reads went through an initial round of quality control at CGR to remove Illumina adapters (Cutadapt version 1.2.1, -O 3) and were trimmed with Sickle (version 1.2) removing all reads below an average quality of 20 and shorter than 20 bp (Joshi and Fass, 2011;Martin, 2011). The resulting fastq files were received as raw read files from the CGR and deposited into SRA under BioProject PRJNA509142, accession numbers SRR8299359 to SRR8299398.
The paired-end read files were further trimmed and filtered to increase quality using the prinseq-lite suite (Schmieder and Edwards, 2011) and the read pairs meeting the following criteria were retained: minimum length 35 bases, GC-content between 5 and 95%, maximum 1 N, trimmed until the average read quality was 30. For all exactly duplicated reads only one copy was retained. The reads for the control libraries were merged per library type (RNA & DNA) and used as a bowtie2 mapping reference (Langmead and Salzberg, 2012). Each of the sample libraries was then mapped against its control and only the unmapped reads were retained. These reads were then assembled per sample using SPAdes version 13.9 using the k-mers lengths 21,33,55,77, 95,107,121 (Nurk et al., 2013), with the exception of the mussel DNA libraries containing the shorter reads where the k-mers 21,33,55,77 were used. The control libraries were assembled using the same parameters and compared to the sample contigs using BLASTn (BLAST+ suite), and sample contigs that showed significant similarity (e value < 0.001) were removed from each of the sample contig datasets (Camacho et al., 2009).
From these contigs, an Anvi'o contig database was created according to the instructions of the metagenomics workflow (Eren et al., 2015). To be included in the database, contigs needed to meet the following criteria: RNA library assemblies (i) contig length min 1000 nucleotides (nt); (ii) amino acid similarity with any known virus; (iii) recruit no reads from control libraries; DNA library assemblies (i) contig length min 10,000 nt, (ii) identified by VirSorter as viral in categories 1 or 2 (Roux et al., 2015), (iii) recruit no reads from control libraries. VirSorter was run on all DNA contig sets using the microbiome decontamination mode on the iVirus Cyverse infrastructure (Bolduc et al., 2017). The contig dataset comprising 40 000 UViGs was merged and clustered at an approximation of the viral species level (95% average nucleotide identity over min 80% of contig length), according to the species definition for bacteriophages implemented by the International Committee on Taxonomy of Viruses (ICTV) and conventionally used in virome studies (Adriaenssens and Brister, 2017;Emerson et al., 2018;Gregory et al., 2016;Roux et al., 2019Roux et al., , 2016. We performed a final refinement by removing all contigs < 10 000 nt assembled from RNA libraries that showed amino acid similarity with dsDNA viruses, based on diamond BLASTx comparison (Buchfink et al., 2015) with the nr database downloaded from the NCBI in January 2018. The final database contained 10 149 UViGs (Uncultivated Viral Genomes, (Roux et al., 2019)) that each represent a viral species-level population. Taxonomic information was added to the contigs database in Anvi'o using Kaiju with the built-in viral database (Menzel et al., 2016). All individual assemblies were also compared with the nr and viral RefSeq protein databases (version Jan 2018) separately in case the length thresholds for contig database creation excluded certain virus types.
To compare the incidence and abundance of UViGs in the different samples, for each library the reads were mapped to the contigs database using kallisto (Bray et al., 2016). The abundances of contigs within and between samples were assessed by transforming the values into Transcript Per Million values (TPMs) where each contig (UViG) was considered a transcript using the program tximport in R (Soneson et al., 2016). The resulting 10 149 by 58 matrix was visualised with Phantasus (Zenkova et al., 2018). The pseudobam alignment files generated by kallisto were then transformed into Anvi'o profiles according to the metagenomics workflow instructions and investigated using the anvi-interactive interface (Eren et al., 2015). Numbers of species detected per library, sample or sample type were calculated as the number of UViGs having a TPM value of minimum 10. Venn diagrams were produced on the online webserver http://bioinformatics.psb.ugen t.be/webtools/Venn/ hosted by the VIB-UGent Center for Plant Systems Biology.
The taxonomic classifications by Kaiju as part of the Anvi'o platform left over 5000 UViGs unclassified. We then used diamond BLASTx against the viral RefSeq protein database (version 200, May 2020) and Megan 6 Community Edition to assign all UViGs to their most reliable taxonomic rank using the Megan 6 "long read" lowest common ancestor algorithm at the default settings (Buchfink et al., 2015;Huson and Weber, 2013). The taxonomic bin information was added to the Phantasus heatmaps by matching the UViG names and exported to R Studio using the tidyverse packages to create graphs in ggplot2 (Wickham, 2016;Wickham et al., 2019).
To generate phylogenetic trees of taxonomic groups of interest, we used the Megan 6 taxonomic bins. All UViGs assigned to a bin were annotated with Prokka (Seemann, 2014) using the -kingdom Viruses setting and the predicted CDSs were manually curated in UGene (Okonechnikov et al., 2012) to adjust for the presence of polyproteins and missing start or stop codons from incomplete genomes. Per RNA virus taxonomic group, the RNA-dependent RNA polymerase (RdRP) amino acid sequences were extracted and aligned together with RdRP sequences from reference databases using MAFFT with maximum 5 iterations (Katoh and Standley, 2013). The resulting alignments were trimmed with TrimAl (Capella-Gutiérrez et al., 2009) using the -gappyout setting, followed by manual inspection in the UGene alignment viewer. Sequences missing the conserved structural motifs present in RdRPs (Venkataraman et al., 2018) were removed, as were sequences missing more than 50% of the trimmed sites. Trees were computed using the IQ-Tree suite (Nguyen et al., 2015) including calculation of the best substitution model with ModelFinder (Kalyaanamoorthy et al., 2017), calculation of the approximate likelihood ratio test (1000 repetitions) (Anisimova et al., 2011) and ultrafast bootstrap approximation with UFBOOT2 (1000 repetitions) (Hoang et al., 2018). The resulting trees were analysed and annotated in iTOL (Letunic and Bork, 2007). For the picorna-calici tree, the alignments generated by Shi and colleagues were additionally used as references (Shi et al., 2018(Shi et al., , 2016. Rotavirus segment genotyping was performed on the RotaC 2.0 webserver of the Rega Institute (KU Leuven, Belgium) (Maes et al., 2009).

Viral species richness is environment-specific and geographically distinct
We generated a final species-level contig database containing 10 149 vOTUs from 44 897 assembled contigs, each represented by the longest viral genome (UViG). We used the number of vOTUs per sample, normalised per volume (ml) or wet weight (g) of input material, as an approximation of species richness (Fig. 1). Normalised viral OTU ("species") richness was highest in the shellfish digestive tissue and beach sediment samples, intermediate in wastewater influent and effluent and lowest in the surface water samples. The surface river water samples showed a trend of increasing viral richness moving downstream, as further inputs of wastewater from treatment plants and other anthropogenic sources occurred, reaching a plateau around the location of SW3 after which richness remained high (Fig. 1).
We further investigated the differential patterns of abundance of each UViG in each library by mapping reads of all samples against the vOTU database and visualised the data with Anvi'o ( Fig. 2, Supplementary Fig. 1), to identify 13 categories of viral species abundance and composition patterns (Table 1). The wastewater samples contained the most diverse set of UViGs in absolute numbers, however, each wastewater treatment plant yielded a geographically distinct viral community. The river water samples contained a lower absolute richness of viruses than the wastewater, except for sample SW5 (and to a lesser extent, SW3) which showed a high degree of overlap in UViGs from category 1 with the wastewater influent sample from the Tal-y-Bont treatment plant (RNA_TI) (Fig. 2). Many of the same UViGs in this category (1) were also detected in the mussel (shellfish) samples and in the sediment samples. Comparing this pattern with the geographical origin of the samples (Fig. 1), revealed that the river water upstream and distant from wastewater effluent locations contained fewer detectable virus species, while the locations immediately downstream of an effluent pipe (SW3) or in the tidal estuary (SW5, mussels, beach sediment) were enriched for UViGs. The high virus richness in the tidal estuary (SW5) can be explained by the mixing of river and marine waters during tidal flow (Robins et al., 2019). The SW5 wastewater treatment Combined Sewage Outflow (CSO) periodically discharges untreated sewage directly upstream of SW5, representing a sewage input that largely avoids the dilution effect of estuary water and is consistent with the higher detection of faecal indicator bacteria previously at this location (Perkins et al., 2014). The viral species count per sample ( Fig. 1) also demonstrated that wastewater effluent samples (mean 1287) generally had a lower species tally than influent (mean 2079), indicating that wastewater treatment reduced viral species richness, but the large variance and limited number of samples (n = 4 per group) did not allow for meaningful tests of statistical significance.
Examination of UViGs grouped per environment type for shared viral species [cut-off for detection 10 TPM (transcripts per million, see methods)] confirmed our observation that absolute richness was highest in wastewater samples (2692 unique vOTUs; Fig. 3a). River/estuarine water (82 unique UViGs), mussels (137) and sediment (100) all contained an order of magnitude fewer unique vOTUs. The majority of the vOTUs present in mussels and sediment were shared with wastewater; out of 4692 vOTUs detected in mussels, 3917 were also detected in wastewater (83%), and for sediment this was 1464 out of 1944 (75%) (Fig. 3a). Even though most of these vOTUs were likely bacteriophages, the high connectivity of these environments is a cause for concern, indicating potential sources of contamination that pose a risk to human health, and is investigated in more detail below. The categories of vOTUs in wastewater encompassed all virus types detected in this study (Table 1), whereas those specific to mussel shellfish and sediment comprised primarily RNA viruses. In the Materials and Methods Sequencing section, we describe technical difficulties during sequencing library construction that may have resulted in the underestimation, or even failure to detect, a group of mussel and sediment-specific DNA Fig. 2. Differential patterns of abundance of each viral genome (UViG) along the wastewater impacted Conwy river and coastal zone. Anvi'omean coverage per contig (split). Each row is a sequencing library, coloured by its sample type (green = sediment; orange = mussels; blue = river/estuary water; red = wastewater). Each column (leaf in top dendrogram) is a contig or a split of a contig (in cases where contigs were larger than 11 kb). The height of the bar in each row is the log mean coverage across the contig or contig split length. The contigs are clustered (top dendrogram) according to their sequence composition and differential coverage using Euclidean distance and Ward linkage. Based on this clustering, we identified 13 categories of UViGs, indicated by shades of grey in the dendrogram and numbered at the bottom of the plot. The bottom row represents the taxonomy assigned by Kaiju (using its viral database) to the predicted genes in each contig. Contigs without assigned taxonomy are depicted in grey, dsDNA bacteriophages in shades of blue, other dsDNA viruses in shades of green, ssDNA viruses in shades of yellow, RNA (ds, (+)ss, (-) ss) in shades of purple/red. The right hand panels show the library type (RNA = grey; DNA = black), the number of single nucleotide variants (SNVs) found after read mapping (0 -20 640), the total number of reads mapped to contigs (0 -13 757 048) and the total number of raw sequencing reads (before QC and contamination screen; 0 -140 000 000).

Table 1
Categories of UViGs observed in the dataset, binned using a combination of sequence composition and read mapping pattern.

viruses.
In order to assign each UViG to a viral family and higher taxa, we used a combination of Diamond BLASTx against the viral RefSeq protein database (version 200, May 2020) and taxonomic binning using a lowest common ancestor approach with Megan6 (Buchfink et al., 2015;Huson and Weber, 2013). To reduce the number of different taxa displayed in Figure 3b, we assigned the UViGs at class or phylum level recently defined by the International Committee on Taxonomy of Viruses (ICTV) (Gorbalenya et al., 2020; Koonin et al., 2020) and where this was not unambiguously possible at the Realm level, with the remainder designated as either "Viruses" (similarities to viruses belonging to multiple Realms) or "Unknown" (no similarity with any virus in the RefSeq database). Of all contigs in our set, 98% had at least one BLAST hit with the virus database (9935/10 149) and 88% (8904/10 149) were assigned to at least the viral Realm level.
The taxonomic composition of each library (Fig. 3b), normalised to 1 million reads per sample mapped to the UViGs, showed large differences in relative abundances of virus groups between both library types and samples. Scanning these data confirms the observation from Figure 2, that some of the RNA libraries were contaminated with DNA viruses. In these cases (all RNA river water libraries and RNA_TI, RNA_BI, RNA_-TyCa/b, RNA_CP1/2/CS1), the relative abundance of dsDNA viruses, mainly tailed phages of the class Caudoviricetes, eclipsed the detected RNA virus signatures. The remainder of the RNA libraries recruited the most reads against several groups of RNA viruses, such as the phage class Leviviricetes (formerly family Leviviridae), phylum Lenarviricota and unknown RNA virus UViGs (Realm Riboviria) from a previously published study on the RNA virosphere of invertebrates (Shi et al., 2016). The majority of the DNA libraries were dominated by dsDNA bacteriophages associated with the class Caudoviricetes and its constituent families. Exceptions were libraries DNA_TI and DNA_LE, which were dominated by a small number of UViGs with ambiguous taxon assignments (i.e. classified as "Viruses" or "Unknown"). The read recruitment to the UViGs and their taxonomic binning clearly showed discrepancies between some of the replicates, most notably the RNA libraries of the shellfish digestive tissue samples. These differences are in line with a recent study by Pérez-Cataluña and colleagues who investigated library preparations for viromes of wastewater and showed that further standardisation of methods is necessary for quantitative viromics (Pérez-Cataluña et al., 2021).
In view of these discrepancies in read mapping patterns between replicates, we investigated the taxonomic bins per environment type as an indication for richness, not relative abundance ( Supplementary  Fig. 2). Overall, the most common RNA virus classification was the "UViG RNA virus" bin grouped within the realm Riboviria comprising a diverse set of metagenome-assembled RNA viruses [dsRNA, (+)ssRNA, (-)ssRNA from invertebrates (Shi et al., 2016)], which contained the most UViGs from mussels, sediment and wastewater samples. The most abundant DNA virus bin was the class Caudoviricetes which groups all tailed phages of the order Caudovirales and its constituent families (Myoviridae, Siphoviridae, Podoviridae, Ackermannviridae, Autographiviridae, Drexlerviridae, Herelleviridae) including crAss-like phages, and unidentified dsDNA viruses (probably tailed bacteriophages), which were particularly rich in wastewater, river water, and to a lesser extent in mussels. Wastewater was also host to a diverse group of (+)ssRNA phages of the class Leviviricetes (~700 UViGs), with a smaller number of these viruses observed in mussels and sediment. About 6% of the total reads could not be assigned to a known group, not even at the Realm level, and were categorized as unknown viruses. While these unknown viruses represented only 6% of the total reads, they made up about a third (3 502/10 149) of the vOTUs.

Circulating human pathogens: Sapovirus, coxsackievirus and rotavirus
To investigate the potential environmental and public health impact of the UViGs, we focused on the near-complete genomes shared between wastewater and the other environments (Fig. 3a) and the taxonomic groups that contain known pathogens (human/animal). We identified 29 vOTUs of potential public health concern, further representing 73 UViGs from six families (Table 2). Interestingly, we were unable to unambiguously identify any potentially pathogenic dsDNA UViGs. The ability to reconstruct a complete papillomavirus genome in our pilot study from a subset of these sites sampled at an earlier date (Adriaenssens et al., 2018) suggests that there were in fact no predicted-pathogenic DNA viruses circulating (above the limit of detection) in the Conwy catchment at the time of sampling (June 2017). We speculate that the most likely reason for the absence of potentially pathogenic DNA viruses from the dataset is because their presence was below the limit of detection, with the discovery of the papillomavirus in the pilot study potentially due to a large shedding event. It is also possible that the diversity of dsDNA bacteriophages drives up the limit Fig. 3. Commonality and taxonomic composition of viral genomes (UViG) in samples types from the wastewater impacted Conwy river and coastal zone. a, Venn diagram representation of the number of UViGs shared between different environment types (min 10 TPM for detection). b, Relative abundances of the UViGs at the virus class level per sequencing library (colourcoded per sample type as in figure 2, green = sediment; orange = mussels; blue = river/estuary water; red = wastewater) normalized per library as transcipts (=contig) per million (TPM). dsDNA viruses in shades of dark purple and red; ssDNA viruses in shades of pink and yellow; RNA viruses in shades of green, purple and blue; unknown viruses in shades of grey.
of detection for other DNA viruses by skewing the data towards the most abundant genomes. Given the current pandemic, we also did a search for similarity of the UViG dataset with members of the Coronaviridae family, but no coronavirus signatures were identified in our dataset.
With respect to the family Astroviridae, we recovered one UViG related to bat-infecting astroviruses in mussel (Mytilus edulis) tissue from the Deganwy shellfishery, and one UViG in wastewater sample GI highly similar to Astrovirus MLB1 (FJ222451) which was sequenced from the stool of a child with acute diarrhoea (Finkbeiner et al., 2008) (Supplementary Fig. 3a). For the family Caliciviridae, we were not able to identify any UViGs representing noroviruses, the leading cause of viral gastro-intestinal illness in the UK and indeed worldwide (Ahmed et al., 2014;FSA, 2017;Kirk et al., 2015) in contrast to our pilot study performed in autumn, where we assembled a norovirus GI.2 genome (Adriaenssens et al., 2018). We did, however, find two near-complete sapovirus UViGs and six shorter contigs grouped with the near-complete genomes (Fig. 3, Supplementary Fig. 3b), most closely related to sapoviruses of genotype GII.2 and GII.5 that were collected from children with acute gastroenteritis in Nashville (US) (Diez-Valcarce et al., 2018). This finding suggests that at the time of sampling for the dataset reported here (June 2017), sapoviruses replaced noroviruses (commonly associated with winter illness) as the main cause of gastro-intestinal disease. This theory is supported by our previous RT-qPCR detection study showing that sapovirus concentrations spiked between March and June in wastewater collected at the four WWTPs in the Conwy area (Farkas et al., 2018a). However, this is difficult to formally prove as many norovirus-sapovirus cases are undiagnosed clinically, and the seasonality of norovirus and sapovirus is not consistent in all clinical settings in the UK (Brown et al., 2016;Inns et al., 2019).  *This assignment was based on low similarity scores. **These UViGs were partial genomes, not near-complete genomes. a The number of UViGs clustered at 95% ANI (cd-hit-est) represented by one UViG in the dataset. b Category as defined in Fig. 2 and Table 1.
We identified two potentially pathogenic UViGs in the Picornaviridae family (Table 2) among a host of distantly related picorna-like viruses (Fig. 4). The two potentially pathogenic picornavirus UViGs, which were represented by only partial genome sequences ( Supplementary Fig. 3c), could be identified as coxsackieviruses of the species Enterovirus C, most closely related to human coxsackieviruses A19 and A22 reportedly involved in meningitis, gastroenteritis and herpangina (Tapparel et al., 2013;Zell et al., 2017). Detailed phylogenetic analysis of all calici-and picorna-like RNA-dependent RNA polymerase (RdRP) sequences (Fig. 4) showed that the majority of UViGs found in this study fell within a very diverse, ill-resolved clade (low branch support) comprised of environmental sequences nested within the order Picornavirales (bottom half of circle, Fig. 4). Based on the RdRP sequences, only three UViGs in the picorna-calici group were designated potential human pathogens (Fig. 4, black arrows), the two sapovirus UViGs and one of the coxsackievirus UViGs. Only the sapovirus UViG LI_NODE_9 was detected in all sample types, posing a potential risk for human health as it was detected in the mussel beds of the commercial shellfishery, sediment on the tourist beach and estuarine water ( Supplementary Fig. 4). PCR-based detection of sapoviruses in older studies show that among cases of gastro-intestinal disease, sapoviruses accounted for only 4% of cases (vs 36% for noroviruses) (Amar et al., 2007), however, the primers used in that study (SR80 (Noel et al., 1997), JV33 (Vinjé et al., 2000)) did not match the two sapovirus genomes reconstructed in this study (data not shown). The detection of this complete genome sequence from two different wastewater treatment plants is another indication that sapoviruses are more common in the UK than previously reported, similar to its incidence reported in other countries (Mann and Liebert, 2019;Pang et al., 2019;Varela et al., 2018).
While the phylogenetic analysis does not provide enough evidence for the presence of plant-pathogenic picorna-like viruses in the Conwy river catchment, there is a set of UViGs present that is mollusc-specific (coloured orange in Figure 4). It is therefore likely that we have sequenced and reconstructed a set of mussel/shellfish-associated or -infecting viruses.
Within the non-redundant, species-level clustered dataset, 18 UViGs grouped into three categories according to read recruitment pattern, and were assigned to the species Rotavirus A in the family Reoviridae, representing a further 41 contigs. Analysis of reoviruses is confounded by their segmented nature, i.e. members of the genus Rotavirus contain 11 segments of dsRNA, and the size of the smaller segments is below our 1000 nt contig length threshold. We therefore analysed all contigs larger than 500 nt for the presence of rotavirus signatures and assigned genotypes to each segment recovered (Fig. 5a). The most common rotavirus A (RVA) genome constellation recovered was G2-P[4]-I2-R2-C2-M2-A2-N2-T2-E2, with additional genotypes R1 for the RNAdependent RNA polymerase (RdRP) segment, C1 for the segment encoding VP2, P[1] and P [14] for the outer capsid-encoding segment, A3 and A11 for NSP1 and T6 for NSP2. In many of the wastewater samples, we assembled multiple contigs of the same segment indicating the presence of several co-circulating population lineages of rotavirus A in the population. Phylogenetic analysis of the outer capsid proteins (VP4) confirmed the genotype clustering, and comparison with isolated rotavirus VP4 sequences points towards a human origin for the P[4] and P[14] genotypes (found in samples BE, LI, LE and GI) and a potential bovine zoonotic origin for the P[1] genotype segment (Fig. 5b). The RVA genome segments recovered here are markedly different to those recovered from wastewater influent from Llanrwst (LE-LI) in our pilot study 10 months prior (Adriaenssens et al., 2018), for which the dominant genotypes of RVA were G8/G10-P[1]/P[14]/P[41], and a diverse set of rotavirus C segments were also present. We can conclude that rotavirus shedding into wastewater within the population varied both spatially and temporally, but more data are required to investigate any possible seasonal patterns. From the distribution of the rotavirus fragments in shellfish, beach sediment and estuarine water (Table 2), we speculate that rotaviruses could pose a potential risk for human health in relation to shellfish consumption or recreational activities and bathing within the immediate coastal zone. However, rotaviruses mainly affect  (Nguyen et al., 2015) and visualized with ITOL (Letunic and Bork, 2019). The multiple alignment consisted of 622 sequences and 695 amino acid sites, aligned using MAFFT and trimmed with Trimal (Capella-Gutiérrez et al., 2009;Katoh and Standley, 2013). The best fit model was LG+F+R10 as determined with ModelFinder (Kalyaanamoorthy et al., 2017). Branch support was calculated using the Shimodaira Hasegawaapproximate Likelihood Ratio Test (SH-aLRT) and the UFBoot (ultrafast bootstrap) algorithm on 1000 replications with nodes below 80% (SH-aLRT) and 95% (UFBoot) indicated in grey (Anisomova and Gascuel, 2006;Hoang et al., 2018). The three inner colour strips from inside to outside indicate respectively: viral host or metagenome the RdRP was extracted from, predicted clade, human-associated genera (only reference genomes from human pathogenic viruses coloured). The four outside colours strips indicate detection in shellfish samples (orange), beach/river sediment samples (green), river/estuarine water samples (blue) and wastewater samples (red), with other virome-derived UViGs in light grey and reference virus sequences in middle grey. The black arrows indicate the UViGs found in this study that are likely human pathogens. infants and children under the age of five (Hamborsky et al., 2015), who are less likely to engage with such activities which may be the reason for the lack of reported illnesses.
We identified a number of small contigs related to ssDNA circular circoviruses and parvoviruses, that were originally recovered from environmental or host-associated metagenomes (Dayaram et al., 2015;Phan et al., 2015;Zawar-Reza et al., 2014). Of these, four circovirus-associated vOTUs, representing 18 contigs, showed significant sequence similarity to previously described UViGs from animal or wastewater metagenomes. One parvovirus contig, assigned to the genus Ambidensovirus, was related to a bat metagenome sequence. However, for these types of ssDNA virus UViGs, any causative links with disease syndromes would be very tenuous.

A conceptual model for virus circulation in a freshwater catchment area
The data presented in this study support the following conceptual model of virus circulation in the river system (Fig. 6). Upstream, in the more pristine regions of the river with low human and livestock inputs, viral species richness is low and the water virome is dominated by dsDNA tailed bacteriophages (caudoviruses) and a few algal viruses of Fig. 5. Rotavirus A (RVA) in the virome datasets. a, The 11 segments of the reference genome of RVA ranked by size in black. RVA segments recovered per sample below showing the predicted genotype of the segment and the percentage of nucleotide identity with a representative of that genotype as calculated by the RotaC 2.0 tool. b, Maximum likelihood phylogenetic tree of the VP4 amino acid sequences of selected representatives of all RVA genotypes build with IQ-TREE (Nguyen et al., 2015) and visualized with ITOL (Letunic and Bork, 2019). The multiple alignment consisted of 253 sequences and 774 amino acid sites, aligned using MAFFT and trimmed with Trimal (Capella- Gutiérrez et al., 2009;Katoh and Standley, 2013). The best fit model was FLU+F+R8 as determined with ModelFinder (Kalyaanamoorthy et al., 2017). Branch support was calculated using the UFBoot (ultrafast bootstrap) algorithm on 1000 bootstraps and is indicated with branch colours in shades of grey, with support values higher than 95% in black (Hoang et al., 2018). Colour strip 1 indicates the genotype clustering, using RVC isolates as outgroup. Colour strip 2 shows the host of the isolates with arrows indicating the virome-derived sequences. Fig. 6. Model for the circulation of viruses in a river catchment and coastal zone system with wastewater discharge. Viruses specific to river water are depicted in blue, wastewater in red, beach sediment in green and shellfish in orange. E.M. Adriaenssens et al. the family Phycodnaviridae. At certain points along the river, wastewater effluent from large treatment plants and smaller scale septic tank discharges enter the water. This effluent is much less rich in viruses than untreated wastewater (influent) but can still contain over 1000 different viral species per litre. The entire spectrum of viral diversity detected in this study is represented in effluent (treated wastewater), with DNA and RNA bacteriophages (predicted to infect members of the human gut microbiome) the most commonly detected groups (caudoviruses, leviviruses). Nucleo-cytoplasmic large DNA virus (NCLDVs; phycodnaviruses, mimiviruses, iridoviruses) and common plant-derived viruses present in food and excreted by the human digestive tract (mainly tobamoviruses such as pepper mild mottle virus (Rosario et al., 2009;Zhang et al., 2006)) and groups of enteric viruses such as sapovirus, rotavirus and astrovirus within a wider collection of unclassified RNA viruses are also well represented. These communities are both spatially and temporally distinct. Upon entering the river, the pathogenic virus groups fall below the limit of detection by virome sequencing, which can be attributed primarily to dilution by the river water. However, close to an effluent site and at the estuary that is under tidal control, the number of viral species detected in water samples is much higher. Beach sediment and filter-feeding shellfish (in this case mussels, Mytilus edulis) then act as entrapment matrices enriching the viral content from the surrounding water (Maalouf et al., 2010;Whitman et al., 2014). In the majority of cases, the UViGs that were assembled from wastewater recruited fewer reads from beach sediment, mussel tissue or estuary water libraries, and the read mapping over the genome length was often patchy, leading us to hypothesize that these genomes, and by extension the virions, are likely to be substantially degraded. At the same time, we observed sediment-and mussel-specific viral communities represented by full genomes, mainly picorna-like RNA viruses and unclassified UViGs from invertebrates (Shi et al., 2016), thus excluding technical bias as the explanantion for our failure to detect intact pathogenic virus genomes in sediment and shellfish. In the scenario that we propose, shellfish and sediment become enriched in viruses that are recruited from the environment by filter feeding and adsorption, respectively. Those viruses that do not undergo active replication in the newly occupied niche (human, animal and plant pathogens in particular) are degraded over time or diluted below the limit of detection, while viruses that infect the shellfish, the shellfish microbiome, diatoms or sediment-associated bacteria are maintained, enabling detection of their full genome sequences. In this scenario, the risk of illness due to consumption of shellfish, contact with sediment (beach sand) or swimming, would depend on the time interval between uptake/adsorption of pathogenic viruses in the matrix and ingestion by a human subject. To critically evaluate this, further experimental data on the infectivity/survival kinetics for each viral species are required, as this is likely to vary markedly between viral groups. However, this would be a Herculean endeavour, given the diversity of viruses detected here, the difficulty in propagation and the absence of routine infectivity assays. The conceptual model is supported by the results of our previous year-long q (RT)-PCR study on a subset of enteric viruses, which showed that they were still detected at high titres in wastewater post-treatment, followed by lower titres in river water, shellfish and sediment, and ultimately undergoing capsid degradation in environmental matrices (Farkas et al., 2018a). Our conceptual model of viral circulation is also consistent with theorectical simulations of viral discharge from wastewater treatment plants into the coastal zone (Robins et al., 2019). Importantly, these models have indicated that tidal movement allows viruses in estuarine water to come into contact with shellfisheries and beaches on numerous occasions over a period of days to weeks depending on the lunar tidal cycle.

Conclusion
Viruses and their genetic material are commonly discharged in the environment, but their risk to human health is driven by community outbreaks leading to viral shedding into the wastewater, leading to temporal and spatial variations in the specific genotypes detected. In the environment, these viruses are then subject to cycles of dilution, enrichment and virion degradation influenced by local geography, weather events and tidal effects. Our analyses show that viromics is a useful tool to assess viral diversity in the aquatic environment in order to explore new and emerging human and animal health threats.

Availability of data and material
The datasets generated and analysed in this study are available from the Sequence Read Archive (SRA) under BioProject PRJNA509142, accession numbers SRR8299359 to SRR8299398.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.