Using noninvasive metagenomics to characterize viral communities from wildlife

Abstract Microbial communities play an important role in organismal and ecosystem health. While high‐throughput metabarcoding has revolutionized the study of bacterial communities, generating comparable viral communities has proven elusive, particularly in wildlife samples where the diversity of viruses and limited quantities of viral nucleic acid present distinctive challenges. Metagenomic sequencing is a promising solution for studying viral communities, but the lack of standardized methods currently precludes comparisons across host taxa or localities. Here, we developed an untargeted shotgun metagenomic sequencing protocol to generate comparable viral communities from noninvasively collected faecal and oropharyngeal swabs. Using samples from common vampire bats (Desmodus rotundus), a key species for virus transmission to humans and domestic animals, we tested how different storage media, nucleic acid extraction procedures and enrichment steps affect viral community detection. Based on finding viral contamination in foetal bovine serum, we recommend storing swabs in RNAlater or another nonbiological medium. We recommend extracting nucleic acid directly from swabs rather than from supernatant or pelleted material, which had undetectable levels of viral RNA. Results from a low‐input RNA library preparation protocol suggest that ribosomal RNA depletion and light DNase treatment reduce host and bacterial nucleic acid, and improve virus detection. Finally, applying our approach to twelve pooled samples from seven localities in Peru, we showed that detected viral communities saturated at the attained sequencing depth, allowing unbiased comparisons of viral community composition. Future studies using the methods outlined here will elucidate the determinants of viral communities across host species, environments and time.

Despite the importance of studying microbial communities in the environment and within hosts, classical methods of microbe discovery are not easily applied at the community level. For example, characterization by isolation and culturing is unsuitable for members of the microbial community that is difficult to grow in culture (Fancello, Raoult, & Desnues, 2012). Serological tests of antibody presence are targeted towards specific taxa and can be difficult to interpret due to antibody cross-reactivity and inconsistent cut-off thresholds for positivity (Gilbert et al., 2013). Molecular detection of nucleic acids by targeted PCR remains an important technique for sequencing specific genomic regions, but these approaches cannot identify all taxa present and are inappropriate for discovering new, highly divergent taxa as designing primers or probes requires prior knowledge of nucleotide sequences (Fancello et al., 2012;Temmam, Davoust, Berenger, Raoult, & Desnues, 2014). In contrast, unbiased deep sequencing has the potential to capture a snapshot of microbial communities in a large number of samples without prior expectations about what taxa will be detected.
Deep sequencing has illuminated the structure and function of microbial communities across time and space in ways that would not have been possible using traditional methods. In the field of ecology, theories developed at macro-organismal level have been tested in microbial communities, such as the cycling of predator and prey populations (Rodriguez-Brito et al., 2010) and the existence of elevational diversity gradients (Fierer et al., 2011). Deep sequencing has also demonstrated that both bacterial and viral communities differ across abiotic environments (Dinsdale et al., 2008) in such diverse systems as soil bacteria (Fierer et al., 2012) and marine viruses (Hurwitz, Westveld, Brum, & Sullivan, 2014). In the context of human and animal health, deep sequencing can identify candidate pathogens in unexplained disease (Briese et al., 2009;Cox-Foster et al., 2007;Honkavuori et al., 2008;Palacios et al., 2008) and potential hosts and vectors of emerging pathogens (Masembe et al., 2012;Veikkolainen, Vesterinen, Lilley, & Pulliainen, 2014;Volokhov et al., 2017). Studies of host-associated microbial communities have revealed that microbes vary across body habitats, space and time (Blekhman et al., 2015;Costello et al., 2009), and that a community-level perspective of hostassociated microbes is critical for understanding health and disease (Lecuit & Eloit, 2013;Vayssier-Taussat et al., 2014;Virgin, 2014).
Sequencing host-associated bacterial communities in wildlife has revealed that communities vary over time (Bobbie, Mykytczuk, & Schulte-Hostedde, 2017), that social interactions are key determinants of community composition (Grieneisen, Livermore, Alberts, Tung, & Archie, 2017;Tung, Barreiro, Burns, & Grenier, 2015) and that dietary changes due to habitat degradation can alter bacterial communities (Amato et al., 2013). While host-associated viral communities in wildlife remain relatively unexplored, the divergent responses of host-associated bacteria and viruses to experimental diet modification (Howe et al., 2015) and the biological differences between the two types of microbes suggest that viral communities in wildlife might exhibit different patterns to those observed in bacteria.
Deep sequencing studies of microbial communities typically employ either metagenomics, which is the random sequencing of genomic fragments of an entire sample, or metabarcoding, which is a sequence-specific PCR-based approach (Creer et al., 2016). Studies of bacterial communities frequently use 16S ribosomal RNA (rRNA) metabarcoding to examine highly multiplexed samples. However, viral communities lack a similarly conserved marker across or even within viral families (Mokili, Rohwer, & Dutilh, 2012;Rohwer & Edwards, 2002) and are therefore more commonly characterized using metagenomics. Although this approach is currently less cost-and time-efficient than metabarcoding for large numbers of samples, it can assign taxa at higher resolution (depending on factors such as read length, genomic region and reference database) and avoids PCR biases (Jovel et al., 2016). Shotgun metagenomics also allows the simultaneous characterization of different microbial communities (e.g., bacterial and viral) (Chandler, Liu, & Bennett, 2015;Schneeberger et al., 2016) as well as host population structure and diet (Srivathsan, Ang, Vogler, & Meier, 2016). Furthermore, metagenomics can detect viruses at or below the sensitivity of taxon-specific PCR and qPCR (Greninger et al., 2010;Li et al., 2015;Yang et al., 2011), implying that broader taxonomic coverage does not necessarily trade off with sensitivity.
Targeted approaches also likely underestimate or bias measures of viral diversity, potentially impacting downstream comparative analyses. The ability of metagenomics to sensitively detect taxa that are not specifically targeted and/or were previously undescribed has the potential to overturn prior understandings of viral community diversity and distribution based on serology and PCR.
Despite the great promise of metagenomics for studying viral communities, challenges inherent to sequencing viral genomes and technical uncertainties need to be addressed to maximize comparability. Viral communities include single-and double-stranded viruses with both DNA and RNA genomes, ranging in size from 1,259,197 bp (Megavirus chilensis;Arslan, Legendre, Seltzer, Abergel, & Claverie, 2011) to 1,700 bp (Hepatitis delta virus; Taylor, 2006); larger viral genomes that have a higher probability of being sequenced may be over-represented in the inferred community (Fancello et al., 2012). The RNA virus component of viral communities is highly sensitive to degradation due to temperature and storage conditions, raising questions about how samples should be preserved and transported. Indeed, different storage media alter viral detection in PCR-based studies (Forster, Harkin, Graham, & McCullough, 2008;Osborne et al., 2011) and it is reasonable to assume the same in metagenomic studies.
Two popular methods for preserving viruses from field or clinical samples are viral transport media (VTM), an aqueous solution that typically contains protective proteins, antibiotics, and buffers to control the pH (Johnson, 1990) and RNAlater, a commercial reagent that penetrates tissues and stabilizes RNA (Ambion). VTM has historically been used to preserve samples when viruses are to be detected by PCR or cultured in vitro (e.g., Jensen & Johnson, 1994;Druce, Garcia, Tran, Papadakis, & Birch, 2012  another popular medium for storing microbial samples collected in the field (Bányai et al., 2017;Drexler et al., 2011;Frick et al., 2017;Gomez et al., 2015), as it preserves RNA without requiring immediate freezing. However, its high salt content, while not problematic for solid tissue samples, creates challenges for nucleic acid extraction from the kinds of noninvasive swab samples that are typical of ecological field studies (e.g., blood, urine, faeces and saliva). While viruses are often extracted from an aliquot of supernatant (Baker et al., 2013;Tse et al., 2012;Wu et al., 2012), extraction from the swab itself may be desirable for samples stored in RNAlater (Vo & Jedlicka, 2014). These extraction procedures need to be tested and optimized for more widespread use in noninvasive viral metagenomics.
Another challenge for viral metagenomics is that since genomes are sequenced at random, larger host and bacterial genomes are preferentially detected relative to smaller viral genomes (Nakamura et al., 2009;Yang et al., 2011). For this reason, samples are often enriched for viruses using methods including nuclease treatment, filtration of host/bacterial particles, density gradient centrifugation and removal of rRNA (Hall et al., 2014;Kleiner, Hooper, & Duerkop, 2015;Kohl et al., 2015). DNase treatment is a well-established and effective method of enrichment (Allander, Emerson, Engle, Purcell, & Bukh, 2001), while filtration and centrifugation are sometimes used but can bias the inferred viral community composition (Kleiner et al., 2015;Thurber, Haynes, Breitbart, Wegley, & Rohwer, 2009) and are impractical for ecological studies given the large numbers of samples typically processed and interest in generating community data rather than focusing on a particular pathogen. Depletion of host rRNA is unlikely to bias the viral community (He et al., 2010;Matranga et al., 2014), but may affect the distribution of coverage across the viral genome (Li et al., 2016). Identifying a combination of laboratory methods that maximize the proportion of viral reads while minimizing bias would allow greater multiplexing, enabling metagenomic studies of viral communities on an ecological or evolutionary scale.
Here, we describe a field-laboratory-bioinformatic pipeline to characterize viral communities in noninvasively collected faecal and oropharyngeal swabs from common vampire bats (Desmodus rotundus) in Peru. To optimize our protocol for comparative viral metagenomics, in RNAlater were stored overnight at 4°C before being transferred to dry ice (ca. −80°C), while those in VTM were immediately placed on dry ice. Both were permanently stored in −70°C freezers.

| RNA extraction
Unless otherwise noted, nucleic acid extractions were performed on a Kingfisher Flex 96 automated extraction machine (Thermo) with the BioSprint One-For-All Vet Kit (Qiagen) using a modified version of manufacturer's protocol for purifying viral nucleic acids from swabs (details in Supporting Information Appendix S1).

| Bioinformatic analysis of viral communities
We created a bioinformatic pipeline for virus discovery and viral community analyses in shotgun metagenomic data from vampire bat samples (Supporting Information Appendix S2: Figure S1). Briefly, the pipeline filtered out low-quality reads and duplicates, and then filtered out non-viral reads including those matching the vampire bat genome (Zepeda Mendoza et al., 2018; NCBI BioProject Accession PRJNA414273), the PhiX Illumina sequencing control, ribosomal RNA and other reads with high matches to prokaryote/eukaryote sequences. Remaining reads were assembled into contigs, and then, both raw reads and assembled contigs were assigned to viral taxa by comparison with the NCBI VIRAL REFSEQ database. Viral reads and contigs were converted into lists of viral taxa at different taxonomic levels using MEGAN Community Edition (Huson et al., 2016) with the default parameters of the lowest common ancestor (LCA) assignment algorithm, except that minimum score and minimum support per cent were set to zero to include all hits passing the filters of the bioinformatic pipeline (maximum e-value of 0.001 for each Diamond blast step). For read-level analysis, we did not consider species-level assignments to be trustworthy as reads were only 150 bp long and could match equally well to numerous species within a genus. We included genera that are not yet assigned to families and species that are not yet assigned to genera.
Taxa lists were filtered for vertebrate-infecting viruses using a list of vertebrate-infecting viral families and genera (Supporting Information Table S1) that was compiled from the 2017 ICTV Taxonomy (Adams et al., 2017). Viral family and genus richness were calculated using the R package VEGAN (Oksanen et al., 2017; R Core Team, 2017).

| Pilot study 1: Are samples stored in viral transport media appropriate for viral metagenomic analysis?
Total nucleic acid from two aliquots of FBS was extracted, library prepared and sequenced using a shotgun metagenomic approach (Supporting Information Appendix S3) to evaluate the presence of bovine viruses and to determine whether another storage medium, such as RNAlater, would be more appropriate. The resulting reads were processed through the bioinformatic pipeline (Supporting Information Appendix S2).

| Pilot study 2: What extraction method for swabs stored in RNAlater maximizes nucleic acid?
This experiment used swabs that were inoculated with known concentrations of viral particles to identify the extraction method that maximized viral nucleic acid from swabs stored in RNAlater and to assess efficiency and repeatability of the extraction protocol. Swabs were designed to mimic samples collected from the field, with the caveat that they did not include host material (e.g., faeces and saliva), bacteria, parasites or the community of viruses expected to be present in field-collected samples. These other components of samples could impact extraction and PCR efficiency, for example, by acting as a carrier to enhance RNA extraction or through the presence of compounds that can act as extraction or PCR inhibitors. However, rather than inoculate field-collected swabs, in which differences between sample types or between pathogen communities could introduce uncontrolled variation, we opted for "clean" mock swabs that would allow us to evaluate differences in viral detection between extraction methods. Our second test aimed to approximate the minimal detectable viral concentration by qPCR using this method, to assess repeatability and to estimate extraction efficiency using the cotton-tipped wooden base swabs and rayon-tipped aluminium base swabs used to collect samples in the field. RNA was extracted from swabs, converted into cDNA and quantified by qPCR as described above. Three extraction replicates were performed for each concentration from 10 6 to 10 3 copies/ml for cotton-tipped wooden base swabs, and three extraction replicates were performed for aluminium base swabs at 10 5 copies/ml.

| Pilot study 3: Is rRNA depletion a useful enrichment method for characterizing viral communities?
We tested the effect of rRNA depletion on the number of viral reads and taxa detected. Swabs from 40 faecal and 10 saliva samples were extracted individually, quantified and pooled as described in Supporting Information Appendix S1. Five pools were created using nucleic acid extracts from the same sample type from 10 individuals across one to two sites in the same locality (between 0.14 and 74.1 km apart) within each department of Peru (Table 1; Figure 1).  Appendix S1). The sample was split in half after pooling; one half was subjected to "light" treatment of 2U DNase and incubated at 37°C for 5 min (as above), and the other half was subjected to "harsh" treatment of 10U DNase and incubated at 37C for 15 min.
Both halves were then cleaned up using a 1.8× ratio of Agencourt RNAClean XP beads. Following this step, pools were library prepared and sequenced according to the final protocol (Supporting Information Appendix S1) and processed through the bioinformatic pipeline (Supporting Information Appendix S2).

| Subsampling analysis of viral community saturation using the optimized sequencing protocol
We conducted a subsampling analysis to test whether observed variation in the number of raw sequencing reads ( and 100% of the total reads and was repeated five times per pool. Viruses from subsampled data sets were classified using the bioinformatic pipeline without the assembly step (Supporting Information Appendix S2).
A generalized linear mixed model (

RNA from intact swabs (Pilot study 2)
For swabs stored in RNAlater, extracting directly from the swab itself yielded viral nucleic acid that was measurable by qPCR, while supernatant and pellet did not (data not shown). The limit of detection occurred with swabs that were initially inoculated with 220 viral copies; at this level, virus was inconsistently detectable by qPCR ( All pool IDs reflect the locality (AAC, Ayacucho-Apurímac-Cusco; AMA, Amazonas; CAJ, Cajamarca; HUA, Huánuco; LMA, Lima; LR, Loreto) and sample type (F, faeces; SV, saliva). Some IDs also reflect elevation (H, high; L, low) to differentiate localities with multiple pools. NR and R correspond to ribosomal treatment, either non-enriched or enriched, and one sample (CAJ_H_F) has associated numbers (1 and 2) referring to two batches that received different treatments during viral enrichment. Pools processed using the final protocol are shown in bold. b Colony codes correspond to department within Peru. Colony locations and pool midpoints are shown in Figure 1. c Enrichment tests are abbreviated as rRNA (ribosomal RNA depletion) and DNase (light or harsh DNase treatment).
because these swabs were smaller, and it was difficult to determine whether the virus had absorbed into the rayon. However, the one aluminium-base swab with measurable virus was comparable in final copy number to the wooden-base swabs ( Table 3). The qPCR replicates were generally consistent aside from samples on the edge of detectability, but Ct and copy number varied between extraction replicates of swabs containing the same initial quantity of virus. For the swabs inoculated with 2,200 copies (aluminium and woodenbase), there were on average 1,230 copies present following RNA extraction, yielding an extraction efficiency of about 56% (there were 1,578 copies and 72% efficiency when excluding an outlier wooden-base swab replicate that had 0.94 qPCR copies).

| Viral enrichment is improved by rRNA depletion (Pilot study 3)
The sequenced faecal and saliva samples that were split and trialled for rRNA depletion yielded a total of 172,371,088 reads which were fairly evenly distributed across samples (Table 1)

| Viral enrichment is improved by light DNase treatment (Pilot study 4)
The faecal sample that was split and trialled for light/harsh DNase treatment yielded 17,933,769 reads that were evenly distributed across the two pools (Table 1). Although the number of viral reads was comparable between the two pools, light DNase treatment increased the taxonomic richness of viruses detected, both for all viruses and vertebrate-infecting viruses (Figure 4).
The proportion of low complexity/PCR duplicate reads was also slightly higher in the harsh DNase treatment (1,974,128 reads) compared to the light treatment (1,620,909 reads) (Supporting Information Figure S2). suggesting that the harsh DNase treatment could have created a less diverse pool of nucleic acid prior to reamplification. Viral families that were absent in the harsh treatment included those with single-stranded DNA genomes (Circoviridae), as well as single-stranded RNA genomes (Flaviviridae), suggesting that DNase treatment may also degrade RNA viruses. However, RNA viruses were not always affected negatively by DNase treatment, as the single-stranded RNA family Paramyxoviridae was present in the harsh treatment but not the light treatment. Paramyxoviridae was only represented by two reads in the harsh treatment so it could be a rare virus that was missing from the light treatment due to chance, but the effects of DNase on different viral genome types appear complex and may require more study to resolve.
Although only two pools were compared, and they contained similar numbers of viral reads, a greater diversity of viral families was detected following the light DNase treatment.

| Summary of samples sequenced using the optimized metagenomic protocol
Pooled samples processed according to the final protocol had similar numbers of raw reads, but the proportion of viral reads varied widely across samples (Table 1). Saliva samples consistently contained fewer viral reads than faecal samples. The proportion of reads filtered out during different stages of bioinformatic processing was fairly similar across samples (Supporting Information Figure S2), and we detected sequences matching to vertebrates, arthropods, bacteria and archaea in addition to the viral sequences that were the focus of our study (Supporting Information Figure S3).

| Subsampling validates viral community saturation using the optimized protocol
The number of viral reads increased consistently with the number of raw reads, as would be expected with unbiased sequencing, though Indicates only two of the three qPCR replicates were measurable (one replicate was below the limit of detection). When all three qPCR replicates were below the limit of detection, this is indicated with no C t . All other average Ct and average qPCR quantities are calculated based on three qPCR replicates. b Virus concentration and initial swab quantities are calculated based on qPCR measurements of undiluted virus, which was then diluted to obtain the concentrations used in this experiment.  (Table 4). The exception was vertebrate-infecting viral families detected in saliva; however, detections did plateau at the viral genus level (Supporting Information Table S4; Figure S4). Aside from vertebrate-infecting viral families in saliva, viral richness plateaued at around 80% of the total reads ( Figure 6; Supporting Information Figure S4). Converting this to number of raw reads indicated that, on average, new detections began to level off at 8,358,626 reads (range: 6,929,294-12,097,084).

| DISCUSSION
We developed a field-laboratory-bioinformatic protocol for characterizing viral communities, incorporating the following findings from pilot studies to maximize viral detections: 1. Swab samples should be stored in RNAlater rather than VTM containing FBS.

2.
Nucleic acids should be extracted directly from swabs, rather than from supernatant or pellet. prising, as it is a common cell culture contaminant that has previously been found in high quantities in FBS (Allander et al., 2001;Gagnieur et al., 2014). Consistent with the South American origin of the FBS used in our analyses, BVDV-3, or HoBi-like viruses, was initially reported in FBS from South America and is likely endemic to livestock in Brazil (Bauermann & Ridpath, 2015). The consistent presence and proportion of BVDV as well as Retroviridae and several bacteriophage families ( which are also common in bats (Drexler et al., 2011;Li et al., 2010), including neotropical species (Wray et al., 2016 Ghatak, Muthukumaran, & Nachimuthu, 2013). We tested only one virus in this experiment, which may limit our ability to extrapolate the estimated limit of detection or extraction efficiency to other viruses with different characteristics or to field-collected samples that include host cells and other material. In addition, the quantity of viral RNA extracted from swabs did not appear highly repeatable between extraction replicates. However, our results indicated that extracting directly from the swab improved viral detection relative to other components of the sample.
Our study tested a variety of laboratory methods for enhancing unbiased detection of viruses. The rRNA depletion results suggested that removing host rRNA increased both the number of viral reads and number of viral taxa detected without biasing the viral community, as has been observed in previous studies (He et al., 2010;Matranga et al., 2014). The only case in which there were more reads in the non-enriched samples was the family Retroviridae; however, retroviruses integrate into the host genome and are likely to behave differently than other viral taxa with respect to enrichment.
Although the Ribo-Zero kit is described as being for human/mouse/ rat and should be tested before use on other sample types, it has been used effectively on samples from taxa as distantly related as mosquitos (Weedall, Irving, Hughes, & Wondji, 2015), and we also found it to be effective for enriching samples taken from bats.
Although we were only able to analyse one split sample, the light DNase treatment results suggested an increase in the number of viral taxa detected compared to the harsher treatment. DNase is a well-established method to reduce the number of host and bacterial reads relative to virus (Allander et al., 2001). Our light treatment was intended to knock down rather than remove all DNA, also potentially allowing for better detection of bacteria and parasites compared with an intensive enrichment, although we did not test this explicitly.
Although this step could have caused bias towards RNA viruses, DNA virus reads occurred in all samples, as has been found in other viral metagenomic studies using an RNA-based approach (Hall et al., 2014;Kohl et al., 2015;Wu et al., 2016), including those with a DNase treatment step (Baker et al., 2013;Hall et al., 2014 It is worth noting the caveats of analysing noninvasively collected samples. First, although contamination has not been well characterized in viral metagenomic studies, it is a known problem in bacterial community studies. Samples with low microbial biomass are particularly sensitive to contamination with other microbes, for example from DNA extraction kits (Salter et al., 2014) or ultrapure water (Laurence, Hatzis, & Brash, 2014). Our protocol minimized this risk by pooling samples following extraction to increase the amount of target nucleic acid relative to potential reagent-derived contaminants in downstream steps. Second, noninvasive samples will only detect viruses that are actively shed in urine and faeces, thus may miss latent viruses that are sporadically shed, but might be detectable by sequencing organs from sacrificed animals (Amman et al., 2012). Third, our protocol is not able to discriminate between viruses actively infecting hosts and transient viruses acquired from diet or the environment. Although some sources of bias are unavoidable, and it is likely that not all viral taxa in a given sample will be identified, the same is true of all studies in community ecology where exhaustive sampling is not possible (Gotelli & Colwell, 2001;Hughes, Hellmann, Ricketts, & Bohannan, 2001), and we showed statistically that viral communities in our samples were adequately sampled ( Figure 6). Our approach yielded sufficient depth to confidently characterize viral communities at the viral family or genus level, while identification of species or strains might be achieved by further increasing read depths to generate longer contigs that could be more precisely assigned ( Figure 5).
In summary, our pipeline simultaneously generated comparable viral communities from large numbers of noninvasively collected wildlife samples. A standardized approach to viral metagenomics opens many potential avenues of research in disease and community ecology. For example, viral community data collected across multiple individuals, populations and species allow the investigation of ecological processes shaping host-associated viral community structure (Anthony et al., 2015;Olival et al., 2017). Taxonomic and functional patterns of bacterial diversity across host species are influenced by diet and phylogeny (Ley et al., 2008;Muegge et al., 2011;Zepeda Mendoza et al., 2018), but drivers of host-associated viral communities may be different. In humans, host-associated viral communities are stable over time within individuals, but highly variable between individuals (Minot et al., 2011;Reyes et al., 2010). These observations suggest the potential to use viral communities as a host or environmental "fingerprint" to evaluate interactions between multiple hosts, or between hosts and environments, as has been proposed in humans and primates (Fierer et al., 2010;Franzosa et al., 2015;Stumpf et al., 2016).
Finally, although it was not the focus of our study, we also detected reads from vertebrates, protozoa and bacteria (Supporting Information Figure S3), suggesting that with appropriate bioinformatic modifications, shotgun metagenomic data generated using our protocol could simultaneously shed light on host genetics, diet, other non-viral pathogens and commensal microbes. As metagenomics becomes an ever more popular and powerful tool for viral ecology, use of standardized methods such as those developed here will be crucial for comparative insights from diverse host species and environments. Thanks also to Julio Benavides for advice on the subsampling analyses.