The Baltic Sea Virome: Diversity and Transcriptional Activity of DNA and RNA Viruses

Inferred virus-host relationships, community structures of ubiquitous ecologically relevant groups, and identification of transcriptionally active populations have been achieved with our Baltic Sea study. Further, these data, highlighting the transcriptional activity of viruses, represent one of the more powerful uses of omics concerning ecosystem health. The use of omics-related data to assess ecosystem health holds great promise for rapid and relatively inexpensive determination of perturbations and risk, explicitly with regard to viral assemblages, as no single marker gene is suitable for widespread taxonomic coverage.

we utilize this data set as well as additional virus size-class samples and metatranscriptomic data to examine the viral consortia. The samples represent a salinity gradient of 0 to 34.35 practical salinity units (PSU) and include a freshwater Arctic lake, Lake Torneträsk (Abisko, Sweden; site GS667) ( Fig. 1 and Table S1 in the supplemental material). Sample sites span the following subbasins within the Baltic Sea area: Bothnian Bay (BB), Bothnian Sea (BS), Baltic Proper (BP), and Baltic West (BW [referring to sites off the Swedish west coast within Kattegat and Skagerrak Bays connecting to the Arctic North Sea). In total, approximately 23 million and 8.6 million predicted peptides from the metagenomic and metatranscriptomic data sets, respectively, were analyzed in this study (Table S1B and C). To compare the virus size-class samples to those from previous Global Ocean Sampling (GOS) expeditions, orthology and similarity analyses were used on all predicted protein sequences, including those from samples collected from (i) the Baltic Sea, (ii) the Indian Ocean (9), (iii) Chesapeake Bay, and (iv) the California Current. As expected, the majority (51.3%) of sequences are similar to those of samples from Chesapeake Bay, also a brackish system with eutrophication as a major perturbation, and from the California Current, which were similar to the BW samples, while 10.5% were unique to the Baltic Sea area (see Fig. S2 in the supplemental material).
Virome diversity. Using metagenomic data, we identified the taxonomic diversity of major natural dsDNA viral groups; not surprisingly, taxonomic distributions differed between the viral sequences obtained from cell-sized filters (200 to 0.1 m) and those from the virus-sized fraction (0.1 m to 50 kDa), possibly indicative of viruses associated with hosts and ambient viruses that were detected between infection cycles, respectively (Fig. S1). The largest group of viruses consisted of members of the Caudovirales (tailed phages), which include the families Myoviridae (with contractile tails), Siphoviridae (with noncontractile tails), and Podoviridae (with short tails). These bacteriophages infect a wide range of microbial hosts, many of which are dominant phyla in marine environments, including Cyanobacteria, Proteobacteria, and Bacteroidetes. Siphoviridae viruses were not detected at appreciable levels in the open reading frame (ORF)-based metagenomic analysis. Interestingly, sequences phylogenetically related to N4-like viruses were present within the virus-size and larger fractions for sample GS678 within the Landsort Deep. The species represented was Enterobacteria phage N4; however, the database did not contain more recent N4-like isolate genomes (e.g., Roseophage genomes) that might be more relevant for marine environments.
The ambient viral pool (0.1 m to 50 kDa) showed a relative increase in the numbers of podoviruses (~46.5% of Caudovirales) compared to the cell-sized fraction (~11.0% of Caudovirales), largely consisting of those infective for Pelagibacter sp., which is among the most abundant bacterial taxa in the Baltic Sea samples; the data are suggestive of an elevated level of host biomass providing a persistent presence of phage regardless of the position within environmental gradients. Podoviruses also represented the dominant cyanophage family in two sites, GS679 and GS695. These sites contrast with regard to salinity and displayed differing cyanobacterial host populations within the Synechococcus genus, as determined using phylogenetic placement against a reference tree of concatenated conserved genes (32,33). On the basis of these data, site GS679 was dominated by brackish strains related to the outgroup in Synechococcus phylogenetic studies, Synechococcus sp. strain WH5701 (marine subcluster 5.2 [34,35]), and site GS695 by strains related to Synechococcus CC9311 (marine subcluster 5.1B [34,36]) ( Fig. S3; data for all nodes and the associated taxonomy levels achieved are given in Table S2). GS695 and GS679 are the most similar to one another with regard to phage function potential (Fig. S4), which may stem from the presence of cyanophage core genes within these libraries; e.g., viral photosystem genes are well represented for these sites (Fig. S3).
Taxonomic binning of assembled contigs of Ͼ5 kb was also used to assess diversity within the virus size class (0.1 m to 50 kDa; coassembly of all samples). The highest percentage of assembled sequences was classified within the Siphoviridae family (35.5%; Fig. S5), unlike what was seen in the read-based analysis, where this taxonomic family represented one of the less populous bacteriophage groups within the Caudovirales (4.4%). This could potentially be due to the low diversity within this group contributing to greater assembly success with adequate depth of closely related sequences. The assembly consisted largely of sequences similar to those of nonmarine phages, e.g., Salmonella, Staphylococcus, and Streptococcus phages, and of the marine siphovirus Cyanophage PSS2 (37). The next most abundant family in this analysis was Podoviridae, and the sequences of the members of that family generally related to those of the Pelagibacter phage podovirus (HTVC010P [38]) and Roseobacter phage (39,40).
Virus-host inference. Viruses are thought to play a crucial role in marine aquatic ecosystems by altering microbial community diversity structure, host metabolic status, and cycling of nutrients via host lysis (41)(42)(43)(44)(45)(46)(47). We examined viral sequence signatures, i.e., annotated sequences with similarity to known bacteriophage, to address the niche differentiation that may constrain viruses as a result of the use of abiotic (oceanographic contextual data) or biotic (putative host) sources. Sequence data provide a more robust and sensitive assessment of viral identity than the more traditional tools, such as epifluorescence microscopy, which does not account for putative host type and underestimates populations based on size and nucleic acid biases (48). The virus/ bacterium ratio has been used in natural and laboratory systems as a metric of the relationship between viral and bacterial populations as well as of infection potential (49)(50)(51)(52)(53). In natural communities, however, these phenomena become convoluted due to other prevailing hypotheses. For example, the "dilution effect" hypothesis states that as the number of host species (richness) increases the infection potential decreases; however, there can also be the opposite effect, or "rescue effect," where high species diversity increases infection potential due in part to more hosts being competent reservoirs for the viruses present (54)(55)(56)(57)(58). Using bacteriophage sequence signatures, we sought to evaluate the potential influence of viruses on bacterial community diversity by linking the infection potential (virus-bacterium ratio [VBR] per host richness, calculated as the number of the most abundant nodes that account for 50% of the reads assigned to core genes in each sample [node50]) to bacterial diversity (standardized effect size mean pairwise distance [sesMPD]) (59). In other words, the rationale for infection potential (VBR/node50) is based on two factors: (i) the number of viruses relative to the number of bacteria and (ii) how many hosts-either susceptible or nonsusceptible-are present. For many samples, these estimates were negatively correlated, indicating a possible link between virus infection potential and bacterial diversity ( Fig. 2 and Table S3). Increases in VBR/node50 values and decreases in bacterial sesMPD values suggest that virus infection is impacting host diversity. All sites with data indicating a high potential of infection events (VBR/node50) displayed correspondingly low bacterial phylogenetic evenness, suggesting that viruses could be driving mortality of the abundant host lineages present at the time of sampling, reminiscent of the "kill the winner" hypothesis (60). Correlations between sesMPD and VBR/node50 values were negative across all sites (Spearman ϭ Ϫ0.62, P ϭ 0.004); however, the variability in the data (R 2 ϭ 0.442) is likely due to virus-bacterium interactions accounting for only a portion of the variation in global diversity. Samples with sesMPD values lower than the VBR/node50 values were generally from sites with higher salinity, but no other clear correlations were evident with respect to the environmental data.
Periodic surface blooms of nitrogen-fixing cyanobacteria (mainly Nodularia spumigena, Aphanizomenon sp., and Dolichospermum sp.) are common in the central and south Baltic Sea (mainly in Baltic Proper) during summer (61)(62)(63). In addition to negative effects from harmful toxin production, such blooms in the Baltic Sea are strongly linked to development of widespread benthic hypoxia, which is a major ecosystem health hazard (64). Viral infection of these blooms is therefore of increased interest as this would facilitate biomass recycling in the pelagic zone through the microbial food web via the viral shunt. Phage capable of infecting cyanobacteria were identified and correlated to putative shifts in host community structure using metagenomics and transcriptional activity estimates to assess the microbe-virus interactivity of this prevalent and important group. While host and viral populations were detected throughout the Baltic Sea using metagenomics, viral activity (transcript abundance) was primarily detected within the Baltic Proper (BP) (Fig. 3A). To evaluate viral activity, metatranscriptomic libraries targeting mRNA were constructed using rRNA subtraction and total mRNA amplification, thereby avoiding bias toward transcripts with poly(A) tails. Cyanophage have dsDNA genomes and undergo an mRNA intermediate step during infection; therefore, this method provides an important insight into the populations which are transcriptionally active at the time of sampling. The limited spatial activity of cyanophages (samples GS679, GS680, and GS684) could reflect favorable environmental conditions during sampling of these sites or could indicate an abundant standing stock of cyanophages which are not active until advantageous abiotic or biotic conditions develop. Indeed, samples GS679 and GS680 were taken at a site where a decaying Nodularia bloom was observed at the time of sampling. Prochlorococcus phage is likely not present in the Baltic Sea at appreciable levels due to the lack of host species seen in previous amplicon studies of the Baltic Sea (19,20,65,66). Our previous study was able to taxonomically assign sequences to Prochlorococcus spp.; however, these were largely phylogenetically nonreliable and most are likely more similar to those of other unsequenced Baltic Sea picocyanobacteria within the Synechococcus and Cyanobium lineages. Further, on the basis of observations during sampling of a Nodularia bloom demise, we hypothesize that these sequences belong to Nodularia phage, previously identified within the Baltic Sea using the major capsid protein gene as a molecular marker (15), or to another picocyanobacterium phage. Although not within the scope of this study, population-level growth rate measurements could be an important parameter for further understanding virus-host dynamics. Previous reports indicate that fast-growing heterotrophic bacteria occur in lower numbers as a strategy to escape viral predation and are negatively correlated with viral abundance (67). In contrast, slow-growing groups, such as cyanobacteria, are positively correlated with viral abundance. Here, we leveraged metatranscriptomic data to assess interactions using cyanophage transcriptional activity compared to host transcriptional activity and diversity. In general, host number and virus transcript abundance were positively correlated ( Fig. 3B, P ϭ 0.04, Spearman ϭ 0.59). Further, viral transcript abundance, a proxy for infection activity, and host diversity were negatively correlated in most sites ( Fig. 3C, P ϭ 0.005, Spearman ϭ Ϫ0.79), suggesting a role for viral activity in driving the decline of cyanobacterial host diversity, as shown using (VBR/node50)/sesMPD analyses of whole bacterial/phage communities. These data regarding parasitic pressures on prevalent and important primary producers aid in understanding cyanobacterial bloom dynamics and reoccurrence of microorganisms capable of episodic pulses of toxin production.
The widely distributed picoeukaryote Ostreococcus and its viruses were detected at many sites, with a negative association (high virus sequence abundances relative to host sequence abundances within the data set) at sites GS680 (BP), GS683, and GS684 (Belt Sea, grouped with BW in this study; Fig. 4A). In general, Ostreococcus virus OsV5 was well represented throughout the data set using fragment recruitment analysis and was present at a level of 8ϫ coverage in samples originating from sites GS683 and GS684 and at 18ϫ across all 21 sites (Fig. 4B; for a color scheme of read mappings arranged according to site and salinity, see Fig. S7). Ostreococcus virus transcriptional activity was also highest at sites GS683 and GS684 relative to other sites (17.8 and 38.9% of Ostreococcus virus transcripts, respectively) and to the transcriptional activity of their hosts (6.0 and 14.0% of Ostreococcus transcripts, respectively). The negative correlation of host and virus transcript sequences points to a possible infection event that was captured during sampling at this site. Recently, unique Ostreococcus operational taxonomic units (OTUs) were identified within the Baltic Sea, primarily arising in Bothian Bay and Akorna Basin samples, representing unique clades (clades D and C, respectively) (65). Clade D contains the marine species Ostreococcus tauri, and our data would suggest that this is the putative host, since viruses OsV5 and O. tauri virus 1 and 2 are the dominant taxa based on homology and their location is within the areas with the higher salinity levels.
Global virome from transcriptome sequencing (RNA-seq) methods. Examining viral populations using measures of activity provided insight into putatively active virus-host interactions. This is particularly useful for DNA viruses that are transcribed into an RNA intermediate. However, viruses with RNA genomes may not go through an mRNA intermediate that can be explicitly detected using this method; therefore, we focused our analyses of RNA viruses on distribution and potential associations with hosts rather than transcriptional activity. Throughout the Baltic Sea, viral transcripts were detected for some of the major dsDNA virus-host systems identified using metagenomics and led to the identification of sequences for RNA and single-stranded DNA (ssDNA) not seen previously using traditional metagenomic techniques, thus providing a new view of the global Baltic Sea virome. Of note, ssDNA and RNA viruses have much smaller genomes (midpoint among genomes of ca. 6 and 17 kb, respectively) than dsDNA viruses (midpoint ca. 600 kb); therefore, lower relative sequence abundances may be due to genome size rather than ecological significance. Overall, numbers of viral sequences from transcriptomes ranged from approximately~140,000 to 1.9 million reads per site and size class (8.6 million total viral sequences; Table S1C), for an overall total of 3.2% viral sequences per total sequences recovered. In most sequence libraries, dsDNA viruses were the dominant genome type, comprising primarily Caudovirales (bacteriophage) and Phycodnaviridae (large eukaryotic phytoplankton virus) groups (Fig. S6). The higher relative abundance of ssDNA and ssRNA viral sequences detected using the 3.0-and 0.8-m-pore-size filters suggests that their putative hosts are of larger cell size and are likely microeukaryotic in origin (Fig. 5). Recent reports from cultivation studies (68)(69)(70)(71) and cultivation-independent studies (72,73) suggest that microeukaryotic hosts may be the primary microbial prey for marine RNA viruses. Conversely, dsDNA viral sequences were more prevalent within the 0.1-to-0.8-m size class and most likely originated from either bacteriophages or picoeukaryotic phytoplankton viruses that were retained due to increased particle size (Fig. 5 and Fig. S6A).
Microvirus and Circovirus strains have ssDNA genomes and have been discovered in diverse marine environments and hosts (74)(75)(76), although they are typically selected through the bias associated with sequence amplification strategies, particularly using phi29 polymerase and whole-genome amplification as used during sequence library preparation (77,78). Here, RNAseq provided an unbiased assessment of ssDNA virus transcript activity. While most sites had low (0 to 9%) abundance values, one sample from a site at a freshwater lake, Lake Torneträsk (GS667), showed a substantial increase in recovered microvirus sequences (Fig. 5 and Fig. S6), representing a diverse group of bacteriophage, primarily associated with the 3.0-to-0.8-m size class, and accounted for 43.7% of all ssDNA sequences recovered. Bacteriophage have been identified within larger size classes (9) and are proposed to be associated with particle-associated bacteria (32); therefore, our data suggest that the Microviridae-related sequences obtained here are likely associated with bacteria that were retained on the 3.0-and 0.8-m-pore-size filters. The members of the Parvoviridae represent an additional ssDNA group with wide distribution in the Baltic Sea, with particular enrichment in the northern Baltic Sea (BB and BS) and in the 74-m-deep sample within the BP (GS678, Landsort Deep) (Fig. 5). The members of this family infect both invertebrate and vertebrate hosts, including fish, tetrapods, and insects. In these samples, almost all sequences were binned into Porcine parvovirus, which causes reproductive failure in swine populations. Industrial pig farms are numerous along coastlines of the southern Baltic Sea, which functions as a drainage area for runoff, including possible associated microbes, viruses, and nutrients (79).
Both dsRNA and ssRNA viruses were present at various sequence abundances within all of the samples. Of note, these sequences may originate from individual genomes rather than from transcripts. Generally, the greatest taxonomic representation was that of dsRNA-like sequences and marine algal viruses, using phylogenetic placement on an RNA-dependent RNA polymerase (RdRP) reference tree ( Fig. S6C; green). Additionally, sequence similarity revealed a high proportion of Retroviridae-like sequences within the RNA viral groups; however, they comprised only 0.6% of total viral transcripts. The members of this family consist of diverse viruses, including certain fish viruses. Additionally, Picornavirales species were detected within the marine BW and brackish BP waters, whereas members of the Mononegavirales and Ourmiavirus groups were de- tected further northward. These groups had not been detected previously, in any appreciable amount, using sequence-based similarity approaches in GOS omics-derived data.
Viruses as predictors of ecosystem health. One of the most powerful applications of DNA and RNA sequencing for Baltic Sea studies is related to potential assessments of ecosystem health. The use of omics-related data to develop indices of ecosystem health holds great promise. Once suitable baselines are established, such methodology could be employed for rapid and relatively inexpensive determination of perturbations. Within our data set, a relatively high portion of transcripts of fish viruses were detected within the range between 0.01 and 4.4% of total transcripts (the range of all pathogen averages detected was 0.05 to 1.7%). These viral sequences were similar to those of members of diverse families and host ranges, such as Nodoviridae and Retroviridae (RNA) and Iridoviridae (DNA) (Table S1C), which have the potential to cause devastation to local fisheries.
Human viral pathogens, many capable of causing gastroenteritis, specifically, Picobirnavirus and Norovirus, were also detected using sequence-based similarity (Fig. 6). In addition, Giardia lamblia virus and hepatitis E virus, causing gastrointestinal pain and liver disease, respectively, were identified particularly within BP and the BW. All of these pathogens are known to be commonly transmitted via contaminated drinking water. Human coronavirus and rhinovirus A, capable of causing respiratory illness and the common cold, were also found among sites within BP and BW. Human influences are also detected, i.e., the swine viruses from agriculture practices potentially polluting the northern sites and concentrating the pollution in the Baltic Proper. These data provide a first glimpse at the use of meta-omics for the study of human impacted regions, and more statistically robust phylogenetic and epidemiology data will be required to truly understand the role that these pathogens play in relation to the natural biodiversity and human threats.

DISCUSSION
Coupled metagenomic (potential) and metatranscriptomic (transcriptional activity) analyses of dominant dsDNA viral populations enabled a unique look at the distribution of active viruses in the wider Baltic Sea area. Cyanophage sequences are found at appreciable levels in most surface and photic zone samples; however, transcriptional activity was primarily limited to central Baltic Proper. This approach also allowed us to identify sites of increased activity of major environmentally relevant taxa, Cyanobacteria and the picoalga Ostreococcus, as well as of viruses, suggesting a way to facilitate future detailed studies of gene transfer and interplay between these groups. Using sequencebased metrics of infection potential (VBR/node50) and host richness (sesMPD), we were able to circumvent issues associated with traditional microscopy counts, e.g., the inability to discern taxonomic affiliation and loss of viruses with small genomes due to diminished fluorescence. However, our approach is dependent on robust database support for taxonomic assignments and therefore could potentially underestimate viral abundance. Notably, samples analyzed in this study were dominated by marine bacteriophage that are well represented in current databases, e.g., Pelagibacter phage and cyanophage; however, we do recognize that this is a possible limitation of the analysis. Furthermore, increased time-series efforts will shed new light on the dynamic virus-host systems that shape the Baltic Proper through their yearly reoccurring cyanobacterial bloom and bust cycles.
In addition to the dsDNA virome, our methodology facilitated recovery of other more enigmatic RNA virus populations that have remained elusive due in part to size, cultivability, and achieved sequence depth limitations. RNA viruses were detected within the Baltic Sea metatranscriptomic sequence libraries at an appreciable level. These data could pave the way for new biomarkers which are necessary for establishment of fundamental baselines concerning marine RNA virus diversity and distribution. Such approaches will ultimately provide a greater understanding of the human influence on the natural biodiversity of the Baltic Sea ecosystem. Similarly, this study provided an unbiased (compared to previous reports) assessment of transcriptionally active ssDNA viruses present within the Baltic Sea samples and highlighted the relative overabundance within certain sites, such as Lake Torneträsk (GS667) and the Landsort Deep (GS678).
Viruses are known to modulate community structure through infection cycles, which leads to various chemical fluxes within the system resulting in transformed biogeochemical cycles. Sequence data provide evidence that can be extrapolated to gain a greater understanding of the viral ecological niche. Using the VBR, normalized by the number of abundant nodes (taxa) on a bacterial core gene phylogenetic tree, as a metric of infection potential, we found that this potential was most often negatively correlated with the Baltic Sea bacterial diversity. This supports the prediction that greater host diversity results from a negative-density-dependent selective environment. Consequently, more-abundant and more-efficient host populations may have an increased susceptibility to viral attack. Therefore, this scenario may lead to a direct relationship between host diversity and risk of infection. The dilution effect hypothesis is well supported in human infection disease models, where greater host diversity leads to a reduced risk of infection (55,80). Other evidence suggests that biodiversity loss in a system may increase disease or, conversely, that high biodiversity may serve as a pathogen source (81). These data support the idea that maintenance of biodiversity is crucial for promoting ecosystem health, in particular, within the Baltic Sea, where the influx of anthropogenic viral sources may lead to the loss of species and ultimately to unbalanced feedback. Together, the increases in exogenous microbial and viral populations may lead to increased rates of disease and evolution that surpass key ecological thresholds, pushing the ecosystem to its tipping point. This study showed the potential for improved use of omics-derived data as part of risk assessment and management of marine ecosystems, including the Baltic Sea.

MATERIALS AND METHODS
Sample acquisition. Samples were obtained during a 2009 expedition of Sorcerer II detailed by Dupont et al. (19). Briefly, 200 liters of seawater was collected, prefiltered (using a 200-m-pore-size Nytex net), and serially filtered using 3.0-, 0.8-, and 0.1-m-pore-size impact filters. Virus-sized samples were obtained through concentration of the water produced from 0.1 m-pore-size filters via tangential flow filtration (Pellicon Maxi Cassette; Millipore) (50 kDa). Filters were stored in DNA extraction buffer with RNAlater (Life Technologies, Inc.). Viral concentrates (VCs) had 20% molecular grade glycerol added. All samples were frozen onboard and transferred on dry ice.
DNA isolation and preparation of sequence libraries. For detailed methods describing DNA extraction from GOS-derived filters, see reference 92. For details of the preparation of viral DNA from VCs for sequencing using 454 GS FLX titanium sequencing platform, see the methods described by William-son et al. (9). Briefly, environmental genomic DNA was sheared using a Covaris instrument and size selected to 500 to 800 bp. Linkers were ligated to DNA fragments, and amplification was performed using 15 cycles in triplicate. Sequencing adaptors were ligated to the amplified products and purified using AMPure beads prior to sequencing using a 454 GS FLX titanium sequencing platform. Viral metagenomic sequences were trimmed for both the linker and adaptor. Artificial replicates were screened for and removed using the approach described by Gomez-Alvarez et al. (93). To screen for cellular contamination, viral particles were subjected to DNase treatment (3ϫ) and RNase treatment (1ϫ) followed by PCR using universal primers for 16S ribosomal DNA (rDNA). Once sequencing was complete, metagenomic data were screened to identify genome equivalents using hidden Markov model (HMM) searches of conserved genes (most of which were single-copy genes; see reference 32 for more information on bacterial genome equivalents).
Metagenomic viral library assembly. Sequences from virus-size samples (0.1 m to 50 kDa) derived from 454 metagenomic sequence libraries were assembled together using a Newbler assembler (Roche). Prior to assembly, sequences of low complexity were identified and removed using DUST (82). The sequences were first assembled using Newbler with default parameters, including the minimum identity (-mi) set at 86 and the -rip option, which outputs each read into only one contig. Following the initial assembly, downsampling or bioinformatics normalization was performed as previously described by Allen et al. (83). Briefly, areas of the genome with high coverage were randomly reduced to within 2 standard deviations (SDs) of the average contig coverage. These methods were implemented to increase assembly metrics, i.e., fewer contigs, greater length, and greater N50 scores.
Metagenomic sequence annotation. Sequencing reads from the filters were processed as previously described. Virus-size fractions were first processed through FragGeneScan (84) for open reading frame (ORF) calling using the 454-10 train file, given an approximate error rate of 1%. Taxonomic profiles were generated using the Automated Phylogenetic Inference System (APIS), which generates phylogenetic trees from top BLASTp hits (E value, Յ1eϪ9) from PhyloDB. Subsequently, the closest relative becomes the taxonomic lineage for the query (environmental) sequence annotation. The use of APIS for metagenomic sequences has been previously described (9,85). Additional functional information was derived from HMM, KEGG, and GOS cluster searches. Photosystem genes were identified using BLAST (E value, Յ1eϪ3) against a boutique database consisting of cyanophage reference sequences from PhyloDB for psaA, psaB, psaC, psaD, psaE, psaE, psaF, psaJ, psaK, psbA, and psbD.
PhyloDB Total mRNA transcriptome libraries were made with a ScriptSeq v 2 RNA-Seq kit (Illumina Inc., San Diego, CA) using subtractive hybridization of rRNAs with antisense rRNA probes (94) recovered via PCR using 0.1-, 0.8-, and 3.0-M-pore-size filters and a mixture of DNA obtained from the thirteen Baltic sampling stations comprising this study. A 250-ng volume of total community RNA was used for subtractive hybridization. Multiple rounds of subtractive hybridization of rRNAs were used to obtain at least 30 ng of rRNA-depleted total RNA. The quality of the rRNA-depleted total RNA was analyzed on a 2100 Bioanalyzer with Agilent RNA 6000 Pico kits (Agilent Technologies, Santa Clara, CA). A 5-ng volume of rRNA-depleted total RNA was used as an input for a SciptSeq v 2 RNA-Seq kit (Illumina Inc., San Diego, CA) following the protocol of the manufacturer. A total of 10 7 copies of ArrayControl RNA spikes (Life Technologies, Inc., Carlsbad, CA) were added to each sample prior to ScriptSeq amplification. AMPure XP beads (Beckman Coulter, Inc.) were used for cDNA and final library purification. Library quality was analyzed on a 2100 Bioanalyzer with Agilent high-sensitivity DNA kits (Agilent Technologies, Santa Clara, CA). The resulting libraries were subjected to paired-end sequencing via the use of an Illumina HiSeq system.
For poly(A) mRNA transcriptomes, 0.8 g of total community RNA and 2 ϫ 10 9 copies of ArrayControl RNA spikes (Life Technologies, Inc., Carlsbad, CA) were amplified using TruSeq RNA library preparation kit v 2 (Illumina), following the protocol of the manufacturer with minor adjustments. Specifically, the fragmentation time was modified according to RNA quality. Library quality was analyzed on a 2100 Bioanalyzer with Agilent high-sensitivity DNA kits (Agilent Technologies, Santa Clara, CA). The mean size of the libraries was around 400 bp. The resulting libraries were subjected to paired-end sequencing via the use of an Illumina HiSeq system. Transcriptome assembly. Reads were trimmed for quality and filtered to remove primers, adaptors, and rRNA sequences using Ribopicker v.0.4.3 (86). De novo assembly of Illumina HiSeq reads into contigs was accomplished with multiple rounds of assembly using CLC Assembly Cell (CLC bio) clc_novo_assemble. Samples were individually assembled, and ORFs were predicted on contigs using FragGeneScan (84). Read counts for each ORF per each sample were obtained by mapping reads to predicted ORFs using CLC Assembly Cell (CLC bio) clc_ref_assemble_long.
Transcriptome annotation. ORFs were annotated de novo for function via KEGG, KO, KOG, Pfam, and TIGRfam assignments. Taxonomic classification was assigned to each ORF using the reference data set PhyloDB (see "PhyloDB" above). Sequences from both the poly(A)-and random hexamer-derived libraries were pooled at the annotation step for site and size class. Further, pathogen annotation was conducted using literature searches of the best species/strain hit for evidence of infectivity of fish, shrimp, human, plant, or insect species.
Virus-to-bacterium ratio and host diversity. Annotated metagenomic sequences were taxonomically binned into groups of viruses that infect bacteria based on similarity to known bacteriophage and bacteria using APIS. node50 is a measure of taxonomic richness derived from the distribution of core genes in a sample assigned to nodes in a reference tree and is calculated as the number of the most abundant nodes that account for 50% of the reads assigned to core genes in each sample. The mean pairwise distance (MPD) was calculated from a phylogenetic placement 16S rDNA tree (see reference 87 for methods), where MPD is the tree distance between all pairs of hits within a sample. To account for sample size, a normalized MPD (the standardized effect size MPD [sesMPD]) metric was used for determinations of bacterial diversity (88)(89)(90). The sesMPD value represents the number of standard deviations (SDs) by which a value for a sample deviates from the average for samples of the same size taken from the entire fraction. The calculation for determining sesMPD is as follows: sesMPD ϭ [(observed MPD) Ϫ (mean randomized MPD)]/(SD of randomized MPD). One limitation of the MPD metric is that some samples have many more hits within the bacterial tree and, thus, more pairs; therefore, MPD can be skewed by sample size. The sesMPD is used to attempt to normalize these numbers by comparing the MPD of a sample with the average MPD from randomized subsampling (100ϫ for these data) of the same size taken from all the samples mixed together. All correlation coefficients were calculated using cor.test (R; http://www.R-project.org) using the Spearman method and one-sided P values. For cyanobacteria/cyanophage comparisons (Fig. 3), bootstrap analysis (1000 iterations) was performed to confirm that the positive and negative correlations were within the 90% confidence interval.
Accession number(s). The metatranscriptomic data were deposited at NCBI GenBank SRA under BioProject accession no. PRJNA320636. The metagenomic libraries were deposited in iMicrobe under project code CAM_P_0001109.

ACKNOWLEDGMENTS
Support is acknowledged from Science for Life Laboratory, the Knut and Alice Wallenberg Foundation, the National Genomics Infrastructure (funded by the Swedish Research Council), and Uppsala Multidisciplinary Center for Advanced Computational Science for assistance with massively parallel sequencing and access to the UPPMAX computational infrastructure. Support by BILS (Bioinformatics Infrastructure for Life Sciences) is also gratefully acknowledged. The Askö Laboratory (Askö, Stockholm Archipelago, Sweden) is acknowledged for permission to use their field and laboratory facilities. We thank Mark D. Adams for providing feedback on data interpretation and the manuscript and the Sorcerer II staff and crew for sample collection.
This work was supported by the Baltic Sea 2020 and the O. Engkvist Byggmästare Foundations and Science for Life Laboratory (Solna, Sweden), the Beyster Family Fund of the San Diego Foundation, and Life Technologies Foundation. The research was supported in part by awards from the Gordon and Betty Moore Foundation (GBMF3828) and National Science Foundation (NSF-OCE-1136477) to A.E.A. and Gordon and Betty Moore Foundation grant GBMF3297 (L.Z.A.).