Simultaneous rapid sequencing of multiple RNA virus genomes

Highlights • A method for deriving multiple RNA virus genomic sequences is described.• Twenty RNA virus genomes were routinely sequenced simultaneously.• The sequencing protocol is rapid and accurate.• Bovine viral diarrhea virus and bovine coronaviruses genomes were sequenced.• Sequence coverage of >99% was obtained for 35 of 40 viruses sequenced.


Introduction
Next generation sequencing technologies are used to screen environmental and clinical samples for the presence of viral pathogens. This has resulted in discovery of new, previously unknown viruses that are often uncultivable (Victoria et al., 2008;Li et al., 2009;Phan et al., 2012;Reuter et al., 2012). These technologies have allowed screening of samples collected from domestic animals (Shan et al., 2011;Phan et al., 2011;Reuter et al., 2012) and from animals in the wild to determine if they are carriers of potentially zoonotic viral pathogens (Donaldson et al., 2010;Li et al., 2010;Phan et al., 2011;Wu et al., 2012). Additionally, deep sequence analysis was applied to screening of commercially available vaccines for the presence of contaminants and variants .
Methods to sequence complete viral genomes have been developed. These include methodologies based on PCR amplification of viral sequences, both in fragments (Rao et al., 2013) or fulllength genome amplification (Christenbury et al., 2010). These have primarily focused on specific, closely related viruses. The Sequence-Independent, Single Primer Amplification (SISPA) procedure was first used to amplify DNA populations through the use of random priming and ligation of adaptors for PCR amplification (Reyes and Kim, 1991). This was modified for amplification of viral sequences from serum to include a step where DNase I was used to first degrade host DNA (Allander et al., 2001). This protocol has since been used in the identification of a novel pestivirus (Kirkland et al., 2007), bocaviruses (Cheng et al., 2010) and identification of viruses associated with cytopathic effect in cell cultures of unknown origin (Abed and Boivin, 2009).
Many virological laboratories, especially those associated with diagnostic laboratories, have freezers that contain large numbers of virus isolates collected over varying time spans. Often these isolates have clinical histories associated with them that detail where and when the isolations were made and descriptions of the disease state of the host. There is a wealth of information in these isolates, but up till now, it has been time consuming and expensive to sequence these viral genomes, often requiring sets of strain-specific  primers for PCR amplification and sequencing. The protocol outlined in this study provides the means to rapidly sequence archival collections of viruses to study the evolution of specific viruses over time. This information is important in determining the relevance of virus strains used in currently marketed vaccines.

Viruses and RNA extraction
Viral isolates passaged in cell culture and stored frozen at −80 • C were used in this study. The two virus groups sequenced were 21 bovine viral diarrhea virus (BVDV) isolates and 19 bovine coronavirus (BoCV) isolates. After isolation/propagation of the viruses on appropriate host cells, the medium was frozen/thawed twice, cell debris removed by centrifugation, and aliquots frozen at −80 • C. Routinely, the libraries were constructed from groups of 10 viruses. The virus stock was thawed at room temperature and 180 l was mixed with 20 l of 10× DNase I buffer (100 mM Tris-HCl (pH 7.6), 25 mM MgCl 2 , 5 mM CaCl 2 ). A cocktail of nucleases was added to degrade host nucleic acids as previously described (Victoria et al., 2008). The cocktail consisted of (per virus sample) 5 units DNase I (Life Technologies, Grand Island, NY, USA), 10 units Turbo-DNase (Life Technologies), 5 units Base-line Zero DNase (Epicenter, Madison, WI, USA), 25 units benzonase (EMD Millipore, Billerica, MA, USA), 2 g RNase A/5 units RNase T1 (Thermo Scientific, Waltham, MA, USA), and 4 units RNase One (Promega, Madison, WI, USA). The virus stock/nuclease mixture was incubated at 37 • C for at least 90 min. The viral RNA was purified using the MinElute Virus spin filter kit (Qiagen, Valencia, CA, USA) according to manufacturer's specifications. The RNA was eluted in 22 l of ultra-pure, RNasefree H 2 O (final yield ∼20 l) and immediately placed on ice.

Random-primed cDNA synthesis and sequence analysis
Twenty microliters of RNA were mixed with 100 pmol of a 28mer primer consisting of 20 nucleotides of a known sequence followed by 8 random nucleotides as previously described . Table 1 shows the sequence of the 20 primers used in this study. These primers were developed so that the 20 base known sequence was used for PCR amplification of the library as well as served as a barcode for identifying each viral library following pooling and sequencing. The RNA/primer mix was heated at 75 • C for 5 min and immediately placed on ice. First strand synthesis was done in a reaction mix consisting of 0.5 mM dNTPs, 1× first strand buffer, 2 units RNasin (Promega) and 200 units SuperScript II reverse transcriptase (Life Technologies). The first strand reactions were incubated at 42 • C for 1 h, heated at 95 • C for 2 min and immediately quenched on ice. The second strand reaction was carried out using Sequenase 2.0 (Affymetrix, Santa Clara, CA, USA) and the nucleotides and primers still present following the first strand reaction (Yozwiak et al., 2012). An equal volume of 2× Sequenase buffer with 4 units of Sequenase per reaction was added on ice. The reaction tubes were placed in a thermocycler that was prechilled to 4 • C and were ramped from 4 • C to 37 • C at 0.1 • C/s. The reaction was continued at 37 • C for 10 min. The tubes were heated to 95 • C for 2 min and placed immediately on ice. A second aliquot of Sequenase was added (4 units/reaction) and the ramping and incubation repeated with the exception that the 37 • C reaction was done for 30 min. The double stranded cDNA was purified using Minelute PCR spin columns (Qiagen) and eluted with 13 l of ultra-pure H 2 O. The cDNA was amplified by PCR using primers consisting of the first 20 known bases of the 28mer used in the cDNA synthesis . The PCR reactions consisted of 10 l of double-stranded cDNA, 1× Pfx50 buffer, 400 M dNTPs, 100 pM of appropriate 20mer primer and 5 units of Pfx50 polymerase (Life Technologies). The reactions were cycled at 95 • C for 2 min and 35 cycles of 95 • C for 30 s, 55 • C for 30 s and 72 • C for 1 min. The number of cycles was empirically determined so as not to over amplify the cDNA, where 35 cycles worked well. The DNA was again purified and eluted in 10 l ultra-pure H 2 O, and 1 l subjected to Agilent Bioanalyzer analysis (Agilent Technologies, Santa Clara, CA, USA). The percentage of the DNA that was 300-400 bp in length was determined for each reaction as was the total DNA concentration in each sample using PicoGreen (Life Technologies). All ten reactions were pooled so that the 300-400 bp cDNAs from each were present in equimolar amounts. The pooled DNA was placed in an end-repair reaction (Ion Plus Fragment Library kit, Life Technologies) followed by ligation of Ion Torrent sequencing adaptors. The DNA was size fractionated using Agencourt Ampure Beads (Beckman Coulter, Indianapolis, IN, USA) using 1.6 volumes of the Ampure beads and was done according to manufacturer specifications. This step removed DNA of 100 bp and less which eliminated adaptor dimers and unligated adaptors. At this point, two 10-genome libraries, for a total of 20 genomes, were pooled for each sequencing run. The 300-400 bp fraction was isolated from the DNA pool using the Pippin Prep (Sage Science, Beverly, MA, USA). The 300-400 bp DNA was subjected to sequence analysis with the Ion Torrent PGM and standard chemistries (Life Technologies) using the 316 sequencing chip.

Demultiplexing and virus genome assembly
The raw data files that contained the sequence data from each sequencing run of the PGM were demultiplexed using the 20 base barcode sequence. The raw sequencing reads obtained from the Ion Torrent sequencer were exported as SFF files. Barcode sequences unique to this study were used to populate a custom configuration file that was used in conjunction with the Roche "sfffile" tool to separate the reads into individual barcode-specific SFF files. All the sequences in each barcode-specific SFF file was extracted to a FASTQ file by applying the "sff extract" utility (http://bioinf. comav.upv.es/sff extract/index.html) and specifying the "-clip" option to remove the sequencing tags, barcode sequences, and low quality sequences. The "fastx trimmer" program (part of the Fastx ToolKit; http://hannonlab.cshl.edu/fastx toolkit/index.html) was used to remove the random primer (8-mer) by trimming an additional eight nucleotides from the 5 -end of each read. As a final post-trimming step, a custom script was used to remove all reads that were less than 40 nucleotides in length. The resulting sequence files were used to assemble full length genomic sequences using SeqManNGen and were edited with the SeqMan software of the Lasergene 10 package (DNAstar, Inc., Madison, WI, USA) using related viral sequences obtained from GenBank as the assembly reference. Assembled genomic sequences were further edited using Aligner (Codoncode, Inc., Centerville, MA, USA). The sequences from each library were submitted to the NCBI Sequence Read Archive with the BioProject number PRJNA237789, BioSamples SAMN02630193, and SAMN02639534 through SAMN02639572.

Sequencing
The Ion Torrent raw sequence data was trimmed of sequencing adaptors, barcodes, the 8 base random primer sequence, low quality sequences and sequences less than 40 bases. After preassembly processing of the reads, an average of 82% of initial reads remained for downstream analysis. The BVDV and BoCV sequencing runs produced 954,798 and 2,268,798 individual sequences, respectively. The percentage of viral sequences in each barcoded library varied but was as high as 87% ( Table 2). The ratio of virus:host sequences was highly dependent on the titer of virus in the stock sample. The number of sequences for each barcoded sample in the library varied, even following pooling at concentrations that should have resulted in equimolar amounts of each sample in the library.
To test the accuracy of the Ion Torrent sequencing, a virus previously sequenced by PCR amplification was included to compare sequences generated by this method and by sequencing PCR amplicons. This virus, a BVDV 1b strain isolated from alpaca (GenBank accession JX297520.1; Table 2, library 3, barcode 10), was assembled from Ion Torrent data and was found to have only 1 base difference from the sequence determined earlier (data not shown). This base was ambiguous in the amplicon sequenced by Sanger chemistry (equal peak height of A and G) but was clearly a G by Ion Torrent sequencing.
The sequences from two libraries were analyzed to determine source of any non-viral sequences. The remaining non-viral sequences were assembled into contigs and were used in BLAST searches of GenBank. This revealed that the non-viral sequences were derived from host nucleic acids, both DNA and RNA. The RNA sequences were primarily ribosomal RNAs but some were from more abundant mRNA transcripts.

Assembly of genome sequences
Assembly of genomic sequences was done using templateassisted assembly. BVDV and BoCV genomic sequences used as templates for assembly were obtained from GenBank (accession numbers JN380086.1 and EF424617.1, respectively). All of the individual virus library sequences resulted in assembled sequences that spanned at least 94% of the genome from which they were derived (Table 2).
In libraries 1 and 2, BVDV genomic sequences were determined for 20 virus isolates. The percentage of viral sequences for each barcoded library ranged from 34.8 to 87.4%. BVDV assemblies contained no internal gaps but the extreme 5 and 3 termini were not present, with the exception of 3 sequences of four viruses ( Table 2). The number of nucleotides missing from the termini varied but ranged from 1 to 98 bases. One virus, library 1, barcode 9, had only 658 viral sequence reads but 94.4% of the genome was assembled. This genomic assembly had 8 internal gaps, ranging from 10 to 219 nucleotides. The other 19 viruses had no assembly gaps and genomic coverages of 99.1-99.9%, depending on variable coverage of the genome termini. The low number of sequences for library 1, barcode 9 was believed to be a result of a pipetting error during the quantitation or pooling of the 10 genomic libraries.
The results of sequencing of 19 BoCV isolates are listed in Table 2, under libraries 3 and 4. The coronavirus genomic RNA is 2.5 times longer than that of BVDV, roughly 31,000 nucleotides versus 12,300. Thus, it required more sequences overall to assemble the complete genome to the same coverage depth. Additionally, because BoCV typically grows to a lower titer, there were lower ratios of virus:host sequences, ranging from 2.7 to 37.8% of the total sequences. A major difference in genome assembly between BVDV and BoCV was that, with only a few exceptions, both the 5 and 3 terminal sequences were determined for each BoCV isolate. Also, due to the longer length of the BoCV genome, more assembly gaps were observed. However, in all cases, assembled sequences spanned more than 97% of the viral genome. More assembly gaps were observed in individual virus libraries having lower total sequences. The BVDV isolate sequenced in library 3, barcode 10 gave results similar to that observed with BVDV isolates sequenced in libraries 1 and 2.

Discussion
The advent of new DNA sequencing technologies has led to the development of improved sequencing protocols for detection and characterization of viruses in environmental and diagnostic samples. Many of these protocols use deep sequencing techniques to detect and identify novel, often uncultivable, viruses. This sequencing protocol was designed to sequence viruses rapidly that were isolated from clinical specimens and archived. Often, a clinical report and history is available for isolates making it possible to begin association of genotype with specific phenotypes and time of isolation. Particularly valuable is that this protocol confers the ability to sequence archival collections of specific viruses rapidly in order to study the evolution of the pathogen over time and determine the genotype of the viruses currently in circulation. This information is useful in determining relevance of vaccine strains based on sequences of currently circulating viruses.
The viruses used for sequence analysis were all isolated and passaged in cell culture. Thus, the abundance of host nucleic acids posed a problem in obtaining sufficient numbers of viral sequences to assemble the viral genomes. It was necessary to pre-digest virus samples with high levels of nucleases before purification of viral RNA. The viral RNA was protected within the intact virions while the host RNA and DNA were degraded. The RNA purification procedure included a protease digestion step aiding in the removal of the nucleases before the viral RNA was released from the virion. The nuclease digestion step aided in reaching as much as 87% of all sequences in a barcoded library being of viral origin. Again, the titer of the virus in the original sample was also important in the number of viral sequences obtained. Viruses that grow routinely to low titers require less than 20 libraries in the sequencing run to ensure sufficient reads in each barcoded library to assemble the genomes.
An important aspect of this procedure was sequence independence, requiring no previous knowledge of sequence or pathogens present in a sample. The use of 20 base barcode sequence with 8 base random nucleotides (Victoria et al., 2008) served both to prime cDNA synthesis and to differentiate each library following sequencing. The 20 base known sequence was also used for PCR amplification of the individual libraries.
The Ion Torrent platform was chosen because it allowed variation in the depth of sequencing based on the size of the chip that was used in the sequencing run. For the most part, the 316 chip worked well, routinely giving sufficient sequencing depth to assemble the genomes of 20 viruses in each sequencing run. The Ion Torrent is known to have indels in the sequence data, especially associated with homopolymeric runs of bases (Bragg et al., 2013). Use of template assisted assembly of the genomes greatly assisted in producing accurate genome assemblies. Limited numbers of de novo assemblies were done and low numbers of errors were found that were attributed to indels. The accuracy of sequences generated by this method was demonstrated by the comparing sequences to those previously generated by Sanger PCR amplicon sequencing. Only 1 nucleotide difference was detected in sequences generated using the two different methods demonstrating the accuracy of the sequencing using the protocol described in this paper. The genomic sequences of the BVDV isolates were readily assembled from the sequence data, partially owing to the relatively small size of the genomic RNA. However, the extreme 5 and 3 terminal nucleotides were difficult to obtain, most likely due to the high degree of secondary structure known to be present in these sequences. This may have made priming of first strand cDNA synthesis near the termini more difficult. With the exception of one virus, all BVDV genomes were assembled with no internal gaps in the sequence. The BVDV genome that was not fully assembled had only 658 total viral sequences, yet this was sufficient to assemble greater than 94% of the genome. The low number of sequences was most like due to a pipetting error during the pooling of the 10 individual genomic libraries. This illustrates the importance of accurate quantitation and pipetting of the individual libraries in making the DNA pool used in the template preparation for sequencing.
The BoCV assemblies tended to have more assembly gaps, due in part to the longer genomic RNA of these viruses. Conversely, the genome termini of the BoCV were determined more often then the BVDV, the reasons for which are unclear. The assemblies containing the most gaps were those that had lower numbers of total sequences. Still, the sequences spanned greater than 97% of the total genome of all BoCV isolates examined in this study. These gaps were easily filled by either resequencing the library or by PCR amplicon sequencing of selected regions. Routinely, there was sufficient double-stranded DNA to be included in a second pooled library if additional sequence data was necessary.
In addition to sequencing archived viruses, this protocol can readily be applied to diagnostic applications. With modifications of the sample preparation, most clinical samples submitted to diagnostic laboratories can be used for analysis. One difference in a diagnostic application is that complete genome assembly would not be necessary but detection of specific viral sequences would be indicative of the presence of the virus. This would require, at least initially, de novo assembly and BLAST analysis to determine if viral sequences were present. Preliminary studies showed that viruses can be successfully detected, and in some cases, near full-length genomic sequences assembled, from serum and fecal samples (data not shown). Additionally, with minor modification of the initial double-stranded cDNA synthesis step, both dsRNA and DNA viruses can be sequenced.

Conclusions
A deep sequencing method for rapid determination of genomic sequences of multiple viruses simultaneously is described. The practical application of this protocol was demonstrated by the generation of full-length or near full length genomic sequences of archived pestivirus and coronavirus isolates. With minor modifications, this protocol could be used to sequence multiple dsRNA and DNA viruses. This procedure is sequence independent, requiring no previous knowledge of sequence or pathogens present in a sample. Additionally, this shows promise for the rapid identification of multiple viral pathogens in diagnostic settings and comparison of viruses currently in circulation with those in vaccines.