Background & Summary

Viruses influence the structure, function, ecology, and evolution of microbial communities. They represent the richest reservoir of nucleic acid diversity1, and the great abundance of viral particles in the environment1 reflects the expression of these sequences in host cells2. However, understanding the structure of virus-host interaction networks in the wild still poses a major challenge, as pair-wise interactions between specific viruses and their hosts cannot be predicted without isolate-based studies. Thus, though viruses have been predicted to play a major role in maintaining the extensive fine-scale genomic diversity of environmental microbes, it has not been possible to systematically evaluate the mechanistic or genomic foundations of these interactions, nor their ecological and evolutionary consequences.

Here, we present the annotated viral genomes of the Nahant Collection, a large-scale virus-host model system of cultivated and genome-sequenced bacterial and viral isolates, built on the extensively characterized environmental marine Vibrionaceae model system. By capturing large numbers of closely related host and virus strains, the Nahant Collection allows evaluation of the impact of ecologically relevant fine scale diversity on the interactions between bacteria and lytic viruses. This collection of 251 virus genomes and their associated pairs is a resource for interrogating the determinants of host range and the molecular bases of specific virus-host interactions within one of the most richly contextualized environmental microbial model systems37 and time series studies8 available.

Viruses and hosts were isolated from samples collected at three time points within a 93-consecutive-day study of littoral marine microbial communities, the Nahant Time Series8. All viruses were isolated on hosts collected on the same day, and hosts were nearly exclusively Vibrio. The Vibrio are a well-suited host group for the evaluation of the role of ecology and evolution in structuring virus-host interactions – they are ubiquitous in marine systems, ecologically diverse, and are among the most thoroughly characterized model systems for the study of bacterial populations in the wild36,46,10. The viruses were recovered using approaches designed to yield representation of diverse viruses, including: isolation from concentrates by plating in agar overlays to allow for more representative recovery of both fast- and slow-growing viruses; use of 2-week incubation times to allow for appearance of plaques by slow plaque-formers; and inclusion of additives to media to improve plaque visualization (glycerol11) and mimic environmental substrates (chitin) that might be necessary to induce expression of host receptors.

To standardize assemblies of purified viral isolate genomes we used an approach informed by predicted differences in packaging strategies among viruses, described in greater detail in the methods. This approach suggests that viruses of the Nahant Collection include members with diverse packaged genome types, including cohesive end overhangs, inverted terminal repeats, headful-packaging type terminal redundancy, and Mu-like host ends. To evaluate whether any of the viruses were prophages derived from the host of isolation, rather than environmentally-derived isolates, sequence-based searches between virus and isolation host genomes were performed; only one case of prophage purification was identified among the virus strains with sequenced host genomes (Supplementary Table 1).

The collection includes highly diverse dsDNA tailed and non-tailed viruses, including the smallest (10,046 bp) and largest (348,911 bp) described Vibrio virus genomes (median 45,072 bp). Using the Virfam12 Caudovirales classifier, we find that the tailed viruses include representatives of all Virfam Types and Clusters previously identified as associated with Proteobacteria, including: Type 1 Clusters 3, 5, 6, 7, 8, 9 (Siphoviridae and Myoviridae); Type 2 (Myoviridae); and Type 3 (Podoviridae), as well as 26 viruses not associated with any previously identified Virfam Types or Clusters. Analyzing portal protein phylogeny revealed groups of closely related viruses as well as extensive collection-wide portal protein diversity (Fig. 1). The non-tailed viruses discovered in the collection are a proposed novel family, the Autolykviridae, and are discussed in greater detail elsewhere13. The overall collection ranges from 37% to 58% GC content, with a median of 43%. Viral genomes in the collection are also notable for their carriage of tRNAs, present in 53 viruses, and the presence of putative CRISPR features, present in 32 viruses (Table 1 (available online only)).

Figure 1: Overview of the diversity of Nahant Collection tailed viruses, organized by portal protein phylogeny.

Virfam12 classifier annotation of the Nahant Collection Caudovirales viruses reveals a diverse collection of myo-, sipho-, and podoviruses (indicated by color of leaf label) representing all Types and Clusters (indicated in first attribute ring, and see Discussion for cluster identifiers) previously known to infect Proteobacteria, as well as many genomes unassignable to previously described groups. All 262 Caudovirales genome sequences are presented, including 28 replicate sub-lineage genomes. Genome %GC is provided in the second attribute ring and genome size is indicated by bars. The portal protein tree is unrooted and based on trimmed alignments; red circles indicate aLRT-supports ≥0.9. Associated data are provided in Table 1, portal protein sequences are provided in Supplementary Table 3, and an interactive tree is available at http://itol.embl.de/tree/181897146181191519509155#.

Table 1 Characteristics of Nahant Collection virus genomes

The viruses and hosts of the Nahant Collection, the largest available dataset of sequenced co-occurring cultivated virus-host pairs, are embedded within the rich contextualization of the 93-consecutive-day Nahant Time Series study8. The integration of ecological context, sequence information, and cultivation-based study available for this model system makes the Nahant Collection a unique and robust foundation for the study of the role of viruses in the ecology and evolution of their bacterial hosts.

Methods

Environmental sampling

All viruses and their hosts were isolated from water samples collected at three time points within a larger 3-month study8 of coastal marine microbial communities at Canoe Cove, Nahant, MA, USA in 2010 (42° 25’ 10.6”N, 70° 54’ 24.2”W): August 10 (ordinal day 222, water temperature 13.8 °C), September 18 (261, 16.3 °C), and October 13 (286, 14.2 °C). Bacteria were collected using a size-fractionation approach3,4 designed to partition co-occurring strains on the basis of differential associations in the water column. Here, as in previous studies of the Vibrio3,4, bacteria were isolated by dilution series plating of material resuspended from 63 μm, 5 μm, 1 μm, and 0.2 μm size-fractions onto vibrio-selective media (MTCBS) for colony growth and serial purification. We purified 3,456 bacterial isolates, comprising 1,152 strains from each of 3 days, evenly distributed over the size-fractions. Samples collected for isolation of viruses were 0.2 μm-filtered to remove bacteria, flocculated by addition of iron chloride, and the flocs collected on 0.2 μm filters and re-dissolved in oxalate for storage at 4˚C (ref. 14). Using this approach, viable viruses in 1,000-fold concentrated seawater could be preserved for later isolation on bacterial isolates derived from the same time and place.

Agar overlay direct plating of concentrates for isolation of viruses

To isolate viruses on co-occurring hosts we used a quantitative agar-overlay approach that allowed for equal representation of both slow- and fast-growing viruses, as follows. Viral concentrates equivalent to 15 ml of seawater (15 μl iron-oxalate concentrate) were mixed with host cultures to form agar-overlays within which discrete plaques could form and from which viruses could be isolated15. In total 1,334 purified bacterial strains were exposed, comprising >400 strains per isolation day and representing all isolation size-fractions; of these, 295 showed plaques. Agar overlays were performed using 150 μl of host overnight culture, 2 ml of molten top agar (52 °C, 0.4% agar, 5% glycerol, in 2216 marine broth [MB]), and bottom agar containing glycerol and chitin (1% agar, 5% glycerol, 125 ml L⁻¹ of chitin supplement [40 g L⁻¹ coarsely ground chitin, autoclaved, 0.2 μm filtered] in 2216 MB). Glycerol was added to increase the visibility of plaques11, chitin was added to increase the probability of recovery of viruses dependent on chitin-induced receptors, and low density top agar was used to increase the probability of plaque formation by larger viruses16. Agar overlays were wrapped with plastic to reduce desiccation and held at room temperature for 14-16 days. Virus plaques were harvested at the end of the incubation period and archived by filtration of plaque eluates, as described previously (ref. 15). Half of each eluate was stored at 4˚C, and half was preserved with glycerol (to a concentration of 25% glycerol) for storage at −20˚C.

Purification of viral strains

To build a diverse and representative collection of virus-host pairs, at least one randomly selected virus was purified from each bacterial strain for which plaques appeared in the agar overlay plating of environmental concentrate. To achieve this, we serially purified viruses recovered from archived material, prepared small-scale lysates to boost viral titer, and then generated high titer stocks by confluent lysis in agar overlays. Purification resulted in genome sequencing of 283 viral strains (from 251 independent plaques) from 246 hosts, described below. Viral and host strain naming conventions are described in Table 2, using examples of virus 1.008.O_10N.286.54.E5 and host 10N.286.54.E5.

Table 2 Strain identifier nomenclature, using example virus 1.008.O_10N.286.54.E5 and host 10N.286.54.E5.

Genome sequencing

Viral genomes were prepared from lysates of the host of isolation, as follows. Lysates were concentrated on centrifugal filtration devices (Ultracel 30 K, Amicon Ultra, Millipore, UFC903024), washed with 1:100 2216MB, and concentrates treated with nucleases to digest unencapsidated nucleic acids (18 ml sample brought to 500 μl and amended with DNase I, RNase A, heated for 65 min at 37˚C). Nuclease-treated samples were extracted by addition of 0.1 final volume of SDS mix (0.25 M EDTA; 0.5 M Tris-HCl, pH 9.0; 2.5% sodium dodecyl sulfate), 30 min incubation at 65˚C, addition of 0.125 volumes 8 M potassium acetate, 60 min incubation on ice, addition of 0.5 volumes of phenol-chloroform, and recovery of nucleic acids from aqueous phase by isopropanol and ethanol precipitation. Illumina sequencing libraries of each extract were prepared as follows. Sample DNA (5 μg in 100 μl) was sheared by sonication (6 cycles of 5 min each at an interval of 30 sec on/off on the ‘Low Intensity’ setting of the Diagenode Bioruptor) to enrich for fragment sizes of ~300 bp. Sequencing constructs were prepared by end repair of sheared DNA, 0.72x/0.21x dSPRI size selection to enrich for ~300 bp sized fragments, ligation of Illumina adapters and unique pairs of forward and reverse barcodes for each sample, SPRI bead clean-up, nick translation, and final SPRI bead clean-up17. Constructs were enriched by PCR using paired-end (PE) primers following qPCR-based normalization of template concentrations. Enrichment PCRs were prepared in eight replicate 25 μl volumes, with the recipe: 1 μl Illumina construct template, 5 μl 5x Phusion polymerase buffer, 0.5 μl 10 mM dNTPs, 0.25 μl 40 μM IGA-PCR-PE-F primer, 0.25 μl 40 μM IGA-PCR-PE-R primer, 0.25 μl Phusion polymerase, 17.75 μl PCR-grade water.
PCR thermocycling conditions were as follows: initial denaturation at 98 °C for 20 sec; batch dependent number of cycles of 98 °C for 15 sec, 60 °C for 20 sec, 72 °C for 20 sec; final extension at 72 °C for 5 min; hold at 10 °C. For each sample 8 replicate enrichment PCR reactions were pooled and purified by 0.8x SPRI bead clean-up. Each sample was then checked by Bioanalyzer (2100 expert High Sensitivity DNA Assay) to confirm the presence of a unimodal distribution of fragments with a peak between 350-500 bp. Sequencing of viral genomes was distributed over 4 paired-end sequencing runs as follows: 1 lane on the Illumina HiSeq2000 (18 viral genomes; 100+100 nt paired-end reads; average of 5.1 million reads per genome), 3 lanes on the Illumina MiSeq (92-96 genomes per lane; 150+150 nt paired-end reads; average of 54 K, 208 K, 210 K reads per genome for each lane). Raw paired-end Illumina reads were imported and demultiplexed using CLC Genomics Workbench v.6.5.1 (https://www.qiagenbioinformatics.com/). Sequencing and assembly of genomes of bacterial hosts is described elsewhere13.
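The per-reaction enrichment PCR recipe given above sums to 25 μl. As a minimal sketch of how those volumes scale to the eight replicate reactions pooled per sample, the following may be useful; the function name and the optional overage margin are illustrative assumptions and are not part of the published protocol.

```python
# Per-reaction volumes (μl) as listed in the enrichment PCR recipe.
RECIPE_UL = {
    "Illumina construct template": 1.0,
    "5x Phusion polymerase buffer": 5.0,
    "10 mM dNTPs": 0.5,
    "IGA-PCR-PE-F primer": 0.25,
    "IGA-PCR-PE-R primer": 0.25,
    "Phusion polymerase": 0.25,
    "PCR-grade water": 17.75,
}


def scale_recipe(recipe, n_reactions, overage=0.0):
    """Return total volumes for n_reactions.

    overage is a hypothetical pipetting margin (e.g. 0.1 for 10% extra);
    it is not specified in the protocol and defaults to none.
    """
    factor = n_reactions * (1.0 + overage)
    return {name: vol * factor for name, vol in recipe.items()}


if __name__ == "__main__":
    print(f"per-reaction volume: {sum(RECIPE_UL.values())} ul")
    for name, vol in scale_recipe(RECIPE_UL, 8).items():
        print(f"{name}: {vol} ul")
```

Keeping the recipe as a single mapping makes it easy to confirm that the components add up to the stated reaction volume before scaling.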

Genome assembly and curation

Differences in packaging strategies among viruses yield distinctive and characteristic distributions of packaged physical genomes in progeny virions18. Common examples of such strategies include production of virions with genomes that have: variable termini comprised of host DNA (Mu-like viruses); 5′ or 3′ single strand terminal overhangs (cos-viruses); or different start sequences and terminal redundancies ranging from 10s to 10,000s of bases (pac-viruses). To inform final curation of genome sequences, we first performed initial assemblies to group similar genomes and allow identification of the packaging-associated large subunit terminase gene (TerL) where possible. We then evaluated read mapping profiles within groups, considering terminase-predicted packaging strategy, to define final genome start sites. We next used an iterative approach, as described below, to standardize genome assemblies with conserved gene orders and genomic start positions for related viruses, and to place genomic termini at the contig ends.

Initial assembly and viral genome clustering

Initial assembly and clustering of viral genomes identified groups of related viruses (Supplementary Table 2), but also highlighted the need for systematic measures to standardize genome curation. Initial assembly and clustering were performed as follows: viral genomes were assembled using the de novo assembly tool in CLC Genomics Workbench v.6.5.1 with default parameters following trimming of reads (default parameters except: quality score=0.01, ambiguous nucleotides=0). Open reading frames (ORFs) were identified using Prodigal19 with default parameters, and reciprocal best BLAST hits with ≥75% coverage of the longer sequence and e-value of ≤10^-5 were clustered using OrthoMCL20. Viral genomes were clustered into genome groups on the basis of shared protein clusters using the FT algorithm of the ClustnSee21 plug-in in Cytoscape22. Preliminary curation of individual groups to assess synteny between closely related and replicate viruses (see Technical Validation) using LAST23 indicated that assemblies began at different locations, suggesting that virus genome characteristics were confounding consistency in contig start and end sites.
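The reciprocal-best-hit criterion described above can be sketched as follows. This is an illustration only: the clustering itself was performed with OrthoMCL, and the hit tuple layout, the function names, and the interpretation of coverage as aligned length divided by the length of the longer sequence are assumptions made here.

```python
def passes_thresholds(hit, lengths, min_cov=0.75, max_evalue=1e-5):
    """Apply the coverage (of the longer sequence) and e-value cutoffs."""
    query, subject, aln_len, evalue, _bits = hit
    longer = max(lengths[query], lengths[subject])
    return aln_len / longer >= min_cov and evalue <= max_evalue


def best_hits(hits, lengths):
    """Map each query to its highest-bit-score subject among passing hits."""
    best = {}
    for hit in hits:
        query, subject, _aln, _ev, bits = hit
        if query == subject or not passes_thresholds(hit, lengths):
            continue
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {q: s for q, (s, _b) in best.items()}


def reciprocal_best_hits(hits, lengths):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    best = best_hits(hits, lengths)
    pairs = {tuple(sorted((q, s))) for q, s in best.items() if best.get(s) == q}
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical protein lengths and hits: (query, subject, aln_len, evalue, bitscore)
    lengths = {"A1": 100, "B1": 110, "B2": 300}
    hits = [
        ("A1", "B1", 95, 1e-30, 180.0),
        ("B1", "A1", 95, 1e-30, 180.0),
        ("A1", "B2", 90, 1e-20, 150.0),  # fails: covers only 30% of B2
        ("B2", "A1", 90, 1e-20, 150.0),
    ]
    print(reciprocal_best_hits(hits, lengths))
```

Measuring coverage against the longer sequence prevents a short protein from being paired with a much longer one on the strength of a partial match.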

Final assembly and curation

To systematically address the inconsistency in contigs produced by assemblies of closely related viruses, assemblies were repeated and curated based on read mapping patterns and terminase similarities, as described below.

Viral genomes were re-assembled using the command clc_assembler from CLC Assembly Cell (version 4.4.2, https://www.qiagenbioinformatics.com/), using default assembly parameters and an insert size setting of 100 to 300 bp; 154 out of 285 assemblies resulted in one contiguous sequence (contig). For virus assemblies producing more than one contig, the highest coverage contig was extracted and considered the target viral genome contig; lower coverage contigs were considered contamination from host genome or prophages.
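Selection of the highest coverage contig from a multi-contig assembly can be sketched as below. The sketch assumes per-base depth records in a three-column, tab-separated layout (contig, position, depth) such as that written by `samtools depth`, and a separate table of contig lengths; how coverage was actually ranked at this step is not specified in the text, and the function names are hypothetical.

```python
from collections import defaultdict


def mean_coverage(depth_lines, contig_lengths):
    """Mean depth per contig from 'contig<TAB>pos<TAB>depth' lines.

    Positions missing from the input are counted as zero depth, which is
    why contig lengths are supplied separately.
    """
    totals = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, _pos, depth = line.split("\t")
        totals[contig] += int(depth)
    return {name: totals[name] / length for name, length in contig_lengths.items()}


def pick_target_contig(depth_lines, contig_lengths):
    """Return the name of the contig with the highest mean coverage."""
    means = mean_coverage(depth_lines, contig_lengths)
    return max(means, key=means.get)


if __name__ == "__main__":
    # Hypothetical toy input: one well-covered contig and one low-coverage contig.
    lines = ["c1\t1\t100", "c1\t2\t120", "c2\t1\t5", "c2\t2\t7", "c2\t3\t6"]
    print(pick_target_contig(lines, {"c1": 2, "c2": 4}))
```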

Viral genome open reading frames (ORFs) were identified using Prodigal version 2.6.1 with the -p meta flag to identify small ORFs24, and virus terminase protein sequences were identified by UBLAST25 search with an e-value cutoff of <0.001 against a database of terminases from public viral genomes with previously described or predicted physical genomic termini18,25. Terminase identity was initially assessed via UBLAST as described above, and then verified via OrthoMCL clustering of terminase ORFs with the same dataset to gauge the fidelity of the BLAST results. To evaluate read coverage patterns in relation to terminases, original reads were mapped back to the contigs using the clc_mapper command with default parameters and per-base coverage was determined using SAMtools26 and BEDtools27. Consistent with previous findings that different terminases are associated with distinct genome packaging strategies18, and thus genome termini, exploratory evaluation of read coverage patterns showed that viruses with close identity to different known phage terminases also generally showed different read coverage patterns (Fig. 2a–d).

Figure 2: Examples of read recruitment by contig assemblies before and after adjustment for virus genomes with differing read mapping patterns.

Coverage mapping onto contigs of viruses with: Headful-like read mapping before (a) and after contig adjustment (e); Terminal repeat-like read mapping before (b) and after (f) contig adjustment; Single-stranded cos-end-like read mapping before (c) and after (g) contig adjustment; Mu-like read mapping before (d) and after (h) contig adjustment. Note that, as indicated in the methods, though read mapping patterns were evaluated for each virus, final adjustment strategy for each virus (Supplementary Table 2) was not determined solely based on read mapping pattern and the majority of virus contigs were defined as starting one open reading frame upstream of the large subunit of the terminase (TerL) regardless of the read mapping pattern.

To standardize final gene order presentations, start and stop positions for each genome were defined manually, considering three criteria: 1) terminase identity, as identified by UBLAST and OrthoMCL clustering; 2) read coverage; and 3) comparison of contigs between viruses within the genome groups identified in the initial assembly. All members of each group were assigned to a common inferred genome packaging strategy category (see Supplementary Table 2 for details and exceptions) on the basis of overall patterns within the group, and rearranged using the approaches described below. Coverage patterns were determined by visual inspection and by a series of custom R scripts. Where possible, finalized virus genomes were quality checked by comparing the synteny of related phage genomes using command line LAST. A total of 283 virus genomes were assembled, including 251 unique viruses, 31 sub-lineages purified in parallel to the primary unique isolate, and 1 technical replicate (Table 1 (available online only)). We note that though we were guided by group-level coverage patterns, our primary aim was standardization rather than inference of true genome topology, which must be defined by individual genome read coverage patterns and complemented by laboratory studies.

Re-arrangement based on specific ORF

The majority of viral genomes (168/283) in the collection were standardized by circularizing the de novo assembled contigs and re-linearizing them at the start of the ORF upstream of the terminase. As a whole, these viruses showed coverage patterns consistent with a headful, or ‘pac’ site, based genome packaging strategy, wherein up to 110% genome-length monomers are sequentially cleaved from a multigenome-length concatemer, beginning from a conserved ‘pac’ site. Terminase best BLAST matches were dominated by similarity to viruses with headful-like strategies (Sf6, 97 best hits; 933W, 12; and T4, 5), though best hits to short direct terminal repeat (T7, 8) and 5′-cohesive ends (P2, 26) viruses were also identified, along with cases of no similarity to reference virus terminases (20). Read coverage patterns among these viruses were dominated by either a pattern of gradual decreases/shifts (112), consistent with a ‘headful’ or packaging site (‘pac’)-based genome packaging strategy, or even coverage (46), though other patterns of coverage including short peaks (cos pattern, 1; short internal peak, 7) and multiple coverage peaks (2) were also observed. Examination of read coverage following TerL-based rearrangement of contigs (Fig. 2e) often showed coverage maxima localized near the start, consistent with previous observations that headful-packaging viruses commonly have a pac site in or near the small subunit of the terminase gene18, which generally lies upstream of the TerL. Viruses curated using this approach included all the viruses from 7 of the preliminary groups (1, 4, 6, 9, 10, 13, 16), the majority of viruses from group 3, and a single virus from group 7.
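The circularize-and-re-linearize step amounts to rotating the contig so that it begins at a chosen coordinate. A minimal sketch follows, assuming 0-based coordinates, ORFs listed in order along the forward strand, and no duplicated sequence at the contig ends; a contig with the terminase on the reverse strand would first need to be reverse-complemented. The function names are illustrative, not taken from the authors' scripts.

```python
def relinearize(sequence, new_start):
    """Treat sequence as circular and return it re-linearized at new_start (0-based)."""
    if not 0 <= new_start < len(sequence):
        raise ValueError("new_start must fall within the contig")
    return sequence[new_start:] + sequence[:new_start]


def orf_upstream_of(orfs, terl_index):
    """Return the ORF preceding the terminase on a circular contig.

    orfs: list of (start, end) tuples sorted by start.
    terl_index: position of the large terminase subunit (TerL) in that list.
    The modulo wraps around when TerL is the first ORF in the contig.
    """
    return orfs[(terl_index - 1) % len(orfs)]


if __name__ == "__main__":
    contig = "ABCDEFGH"  # placeholder letters standing in for nucleotides
    orfs = [(0, 2), (3, 5), (6, 8)]
    start, _end = orf_upstream_of(orfs, terl_index=0)
    print(relinearize(contig, start))
```

Because rotation preserves every base, the original assembly can always be recovered by rotating back by the same offset.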

Re-arrangement based on peaks or valleys in coverage

The second most commonly applied strategy for standardization (66/283) was circularization of contigs followed by re-linearization by cutting in the middle of a short region of either aberrantly low (36), or high (30), coverage (Fig. 2f,g). As a whole, these viruses showed patterns consistent with the presence of either direct terminal repeats (DTRs) or single-stranded cohesive (‘cos’) ends associated with their genome termini. Genomes with DTRs may yield sharply defined regions of elevated coverage. ‘cos’ genomes may yield regions of either high or low coverage, depending on whether they are 3′ or 5′ overhangs, due to low frequency ligation of ends during library preparation, as well as T4 DNA polymerase 3′ to 5′ exonuclease activity (degradation of 3′ overhangs) and 5′ to 3′ polymerase activity (endfill of 5′ overhangs) of unligated ends. Terminase best BLAST matches were dominated by similarity to viruses with cohesive ends (lambda ‘cos’, 22; HK97 ‘cos’-3′, 7; P2 ‘cos’-5′, 3) and DTRs (N4, 8; T7, 13), though best hits to headful viruses (933W, 4; P22, 1), were also identified, along with cases of no similarity to reference virus terminases (8). Read coverage patterns among these viruses were dominated by either distinctive ‘cos’ (32) or short internal peak (22) patterns, though other patterns of coverage including shifts in coverage (8), multiple coverage peaks (2), even coverage (1), or no pattern (1) were also observed. This approach included all virus genomes in preliminary groups 8, 14, and 18; the majority of viruses in groups 5, 7, 11, and 12; and a minority of viruses in groups 3 and 17.
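Locating a short region of aberrantly high or low coverage, and the midpoint at which to cut, can be sketched as below. In this work coverage patterns were determined by visual inspection and custom R scripts, so the following is only an illustration: the median-based thresholds and the function name are assumptions, and the sketch does not handle a region that wraps around the contig ends.

```python
from statistics import median


def aberrant_region_midpoint(depth, high_factor=1.5, low_factor=0.5):
    """Return the 0-based midpoint of the longest run of aberrant depth, or None.

    A position is 'aberrant' when its depth is above high_factor * median
    or below low_factor * median; both factors are illustrative defaults.
    """
    med = median(depth)
    best_start, best_len = None, 0
    run_start = None
    # A sentinel equal to the median is never aberrant, so it closes any trailing run.
    for i, d in enumerate(list(depth) + [med]):
        aberrant = d > high_factor * med or d < low_factor * med
        if aberrant and run_start is None:
            run_start = i
        elif not aberrant and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_start is None:
        return None
    return best_start + best_len // 2


if __name__ == "__main__":
    peak = [100] * 10 + [300] * 4 + [100] * 10     # DTR-like short peak
    valley = [100] * 5 + [10] * 3 + [100] * 5      # cos-like short valley
    print(aberrant_region_midpoint(peak), aberrant_region_midpoint(valley))
```

The returned index could then serve as the new start coordinate when the circularized contig is re-linearized.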

Scaffolded assembly against reference

If viruses did not follow the patterns described above, but closely related members of the same group (identified as sharing 100% of translated proteins identified via reciprocal UBLAST) did follow a particular pattern, viruses were assembled using closely related strains as a scaffold; this approach was used for genomes in group 5 (3).

Maintenance of original de novo assembly

Singleton viruses with no similar members within the dataset were treated based on closest terminase identity and read coverage pattern, but if no distinct pattern was observed the original assemblies were maintained. This approach was used for 16/283 viruses, including viruses in groups 2 (8), 3 (1), 7 (5), 11 (1), and 12 (1).

Removal of terminal unconserved sequences

A subset of viruses (9/283) were found by BLAST comparison to possess Mu-like terminases, suggesting that they also used a Mu-like replicative transposition headful mechanism that incorporates host DNA upstream and downstream of the site of insertion. Read coverage patterns of initial assemblies of Mu-like viruses exhibited sharp drops in coverage at the termini followed by regions of low coverage (Fig. 2d); these regions of low coverage, representing small pieces of the host genome, were removed in the adjusted assemblies (Fig. 2h). However, closer evaluation of these assemblies revealed several cases of truncated sequences, and final Mu-like virus assemblies were performed in CLC Genomics Workbench 8.5.1 as follows. Sequences were trimmed using the NGS Core Tools Trim Sequences tool with trims based on quality scores (limit 0.0001), number of allowable ambiguous nucleotides (max 0), and discard of reads <50 bases. All remaining read pairs and orphans were assembled using the De Novo Assembly tool with a word size of 64 and otherwise default options. The largest contig was extracted from the assembly for each virus and all genomes were aligned using the Geneious 6.0.6 Map to Reference tool to standardize orientation. Genome termini were defined based on the beginning and end of conserved regions at the left and right genome ends, respectively. This yielded 9 independently isolated genomes that were all 100% nucleotide identical and with a length of 31,617 bases, with the exception of virus 1.159.O, which contained a single SNP that was present in both the new and previous assembly versions. Open reading frames for these genomes were called with Prodigal 2.6.3 using the -p meta flag and otherwise default parameters.
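The first-pass removal of low-coverage terminal regions described above can be illustrated with the sketch below. It is not the procedure used for the final Mu-like assemblies, whose termini were defined from conserved regions across aligned genomes; the median-based cutoff and the function name are assumptions introduced here.

```python
from statistics import median


def trim_low_coverage_ends(sequence, depth, min_fraction=0.25):
    """Drop terminal bases whose depth is below min_fraction * median depth.

    Only the two ends are trimmed; interior low-coverage positions are kept.
    min_fraction is an illustrative default, not a published parameter.
    """
    if len(sequence) != len(depth):
        raise ValueError("sequence and depth must have the same length")
    cutoff = min_fraction * median(depth)
    left = 0
    while left < len(depth) and depth[left] < cutoff:
        left += 1
    right = len(depth)
    while right > left and depth[right - 1] < cutoff:
        right -= 1
    return sequence[left:right]


if __name__ == "__main__":
    # Hypothetical contig with two low-coverage bases at each end.
    seq = "NNACGTACGTNN"
    cov = [2, 3, 100, 100, 100, 100, 100, 100, 100, 100, 4, 1]
    print(trim_low_coverage_ends(seq, cov))
```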

Iterative assembly

A subset of the viruses (21/283), described elsewhere as a new family13, had distinctively short (~10 kb) genomes and did not contain predicted terminases. BLAST comparison of ORFs from these genomes showed similarity to the protein-primed DNA polymerase of viruses of the Tectiviridae, which have linear genomes and inverted terminal repeats (ITR), and thus these viruses were also evaluated for ITRs. Final assemblies for this group were performed iteratively, as follows. Following initial assembly, second and third assembly iterations were performed using an increased word size of 64, and the largest contig from the previous assembly was included as one of the “reads” for the successive round of assembly. The longest contigs from each of the three assemblies were then compared and the longest contig that also exhibited ITRs was used; when none of the contigs contained ITRs, the longest assembled contig was taken as the final assembly.
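The rule for choosing among the three assemblies (the longest contig with ITRs, otherwise the longest contig) can be sketched as follows. How ITRs were detected is not described in the text, so exact matching of the contig start against the reverse complement of its end, the minimum ITR length, and the function names are all assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def itr_length(contig, max_len=200):
    """Length of the longest exact inverted terminal repeat, up to max_len.

    An ITR of length k means the first k bases equal the reverse complement
    of the last k bases. Exact matching is a simplifying assumption.
    """
    limit = min(max_len, len(contig) // 2)
    best = 0
    for k in range(1, limit + 1):
        if contig[:k] == reverse_complement(contig[-k:]):
            best = k
        else:
            break  # a mismatch at length k rules out all longer exact ITRs
    return best


def pick_final_contig(contigs, min_itr=10):
    """Prefer the longest contig with an ITR of at least min_itr bases; else the longest."""
    with_itr = [c for c in contigs if itr_length(c) >= min_itr]
    return max(with_itr or contigs, key=len)


if __name__ == "__main__":
    toy = "ACGGT" + "TTTTTTTT" + "ACCGT"  # hypothetical contig with a 5-base ITR
    print(itr_length(toy))
```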

Annotation

Viral genomes were annotated using multiple approaches and tools, as described below. Genomes are available through Genbank (Data Citation 1).

Virfam classification of viral Types and Clusters and morphotypes

Viral proteins were annotated using the Virfam12 classifier, which identifies multiple genes of the head-neck-tail modules of viral genomes and assigns viral genomes to morphotypes within Types and Clusters on the basis of previous characterization of diverse tailed viruses. Annotation was performed individually per genome by submission to the Virfam webserver (http://biodev.cea.fr/virfam/). Output reports for all Virfam annotation runs are available through figshare (Data Citation 2).

Genome content annotation

Phage proteomes were compared to KEGG28, COG29, eggNOG30, Pfam31, ACLAME32, CAMERA Viral Proteins (CVP)33 and the OM-RGC collection of sequences34 via UBLAST24. Annotations were determined as the best hit (maximum bit score) to a non-hypothetical protein from EggNOG, KEGG, COG, ACLAME or Pfam (minimum alignment of 75%, minimum percent identity of 35%). Best hits to remaining databases as well as CVP and OM-RGC are reported as notes within the final annotations. Annotations were combined with annotations identified using InterProScan version 5.17-56.0 using the iprlookup, goterms, and pathways options. InterProScan is a program from EMBL-EBI that uses the InterPro database for annotations. The InterPro database contains by default 13 databases, which are listed here: https://github.com/ebi-pf-team/interproscan/wiki/HowToRun#included-analyses. For these annotations, two optional databases were included: TMHMM for predicted transmembrane proteins and SignalP for predicted signal peptide cleavage sites. tRNA sequences were identified using tRNAscan-SE version 1.23 (ref. 35) using the general tRNA model (-G). CRISPR-like elements were identified using CRT36.
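The best-hit rule for assigning an annotation can be sketched as below. The record layout, the function name, and the reading of "minimum alignment of 75%" as the aligned fraction of the query are assumptions; the thresholds and the list of primary databases come from the text above.

```python
PRIMARY_DATABASES = {"EggNOG", "KEGG", "COG", "ACLAME", "Pfam"}


def choose_annotation(hits, min_aln_fraction=0.75, min_identity=35.0):
    """Pick the max-bit-score hit to a non-hypothetical protein, or None.

    hits: iterable of dicts with keys 'database', 'description',
    'aln_fraction', 'identity', and 'bitscore' (a hypothetical layout).
    """
    eligible = [
        h for h in hits
        if h["database"] in PRIMARY_DATABASES
        and "hypothetical" not in h["description"].lower()
        and h["aln_fraction"] >= min_aln_fraction
        and h["identity"] >= min_identity
    ]
    return max(eligible, key=lambda h: h["bitscore"], default=None)


if __name__ == "__main__":
    # Hypothetical hits for a single ORF.
    hits = [
        {"database": "KEGG", "description": "hypothetical protein",
         "aln_fraction": 0.95, "identity": 80.0, "bitscore": 400.0},
        {"database": "Pfam", "description": "Terminase large subunit",
         "aln_fraction": 0.90, "identity": 42.0, "bitscore": 310.0},
        {"database": "COG", "description": "DNA polymerase",
         "aln_fraction": 0.50, "identity": 60.0, "bitscore": 350.0},
    ]
    best = choose_annotation(hits)
    print(best["description"] if best else "no annotation")
```

Filtering before ranking means that a high-scoring hit to a hypothetical protein does not mask a lower-scoring but informative one.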

Portal protein phylogeny

Portal proteins were identified directly using the Virfam classifier as described above, which provides a portal prediction, as well as by using HMM- and blastp-based searches of all Nahant Collection virus proteins, as follows. The portal protein for the representative virus of each Virfam cluster was downloaded through the Virfam page (http://biodev.cea.fr/virfam/), and an HMM was generated by performing 3 iterations of Jackhmmer37 (https://www.ebi.ac.uk/Tools/hmmer/search/jackhmmer). Searches of the Nahant Collection virus proteins with this collection of HMMs using the hmmsearch38 tool (hmmer version 3.1b2) identified putative portal proteins in 241 genomes; these 241, together with 6 portal proteins identified directly through the Virfam web page, were used to search the Nahant Collection with blastp39, identifying putative portal proteins in 262 of the 263 Caudovirales (e-value <0.0001). In the 2 cases where the predicted portal proteins differed across the two methods (Supplementary Table 3), HHpred as implemented in the MPI Bioinformatics Toolkit40 (https://toolkit.tuebingen.mpg.de/#/tools/hhpred) was used to evaluate both predictions and the protein with the longer sequence similarity to a portal protein was selected. The portal protein in virus 1.031.O could only be predicted using the Virfam approach and this protein was included, though searches based on structural similarity using HHpred and Phyre2 (ref. 41) did not indicate similarity to known portal proteins. Sequence alignment, trimming, and tree-building were performed using the eggnog41 workflow in the ETE3 (ref. 42) version 3.0.0b36 tree building tool; the tree was visualized using iTOL43, and the figure was prepared using Adobe Illustrator.

Data Records

All virus and host-associated sequences and annotations associated with this work have been deposited to the Nahant Collection NCBI BioProject (Data Citation 1); specific accession numbers for each strain are provided in Supplementary Table 1. Viral genome annotation reports generated by the Virfam tool have been deposited with figshare (Data Citation 2).

Technical Validation

Given the known abundance of prophages in bacterial genomes, we evaluated whether any viruses in the collection represented induced prophages from the host of isolation. Using megaBLAST in Geneious 6.1.8 we searched all virus genomes against all sequenced hosts and identified only a single case of a high query cover and high identity match. The virus 1.202.O (32,014 bp) shared a 30,051 bp 100% identity match with its host 10N.222.45.E8; this match region occurred within a larger host contig of 120,557 bp, suggesting that the failure to achieve a full match with the remaining 1,963 bp region of the virus genome contig was not due to incomplete assembly of the associated host region. Full genomes for the host of isolation were not available for 29 viruses and thus prophage derivation could not be assessed for these strains; information on host sequence availability is provided in Supplementary Table 1.

This dataset contained sets of virus pairs and triplets that served as biological replicates for assembly optimization. Such sets derive from instances of independent purification of viral sub-lineages from a parent plaque due to the occurrence of variable plaque morphology. Though they exhibited sites of polymorphism at the nucleotide level, ranging from 0 to 4 SNPs, and indels of up to 201 bp (Table 3), members of these sets consistently showed identical gene content and are expected to have the same genomic structure and gene order. Methods for rearrangement to maintain synteny were developed around such sets and were verified via alignment of similar/duplicate genomes before and after rearrangement (Supplementary Figures 1–4).

Table 3 Replicate virus genome comparisons.

Additional information

How to cite this article: Kauffman, K. M. et al. Viruses of the Nahant Collection, characterization of 251 marine Vibrionaceae viruses. Sci. Data 5:180114 doi: 10.1038/sdata.2018.114 (2018).
