DNA Viral Diversity, Abundance, and Functional Potential Vary across Grassland Soils with a Range of Historical Moisture Regimes

ABSTRACT Soil viruses are abundant, but the influence of the environment and climate on soil viruses remains poorly understood. Here, we addressed this gap by comparing the diversity, abundance, lifestyle, and metabolic potential of DNA viruses in three grassland soils with historical differences in average annual precipitation, low in eastern Washington (WA), high in Iowa (IA), and intermediate in Kansas (KS). Bioinformatics analyses were applied to identify a total of 2,631 viral contigs, including 14 complete viral genomes from three deep metagenomes (1 terabase [Tb] each) that were sequenced from bulk soil DNA. An additional three replicate metagenomes (∼0.5 Tb each) were obtained from each location for statistical comparisons. Identified viruses were primarily bacteriophages targeting dominant bacterial taxa. Both viral and host diversity were higher in soil with lower precipitation. Viral abundance was also significantly higher in the arid WA location than in IA and KS. More lysogenic markers and fewer clustered regularly interspaced short palindromic repeats (CRISPR) spacer hits were found in WA, reflecting more lysogeny in historically drier soil. More putative auxiliary metabolic genes (AMGs) were also detected in WA than in the historically wetter locations. The AMGs occurring in 18 pathways could potentially contribute to carbon metabolism and energy acquisition in their hosts. Structural equation modeling (SEM) suggested that historical precipitation influenced viral life cycle and selection of AMGs. The observed and predicted relationships between soil viruses and various biotic and abiotic variables have value for predicting viral responses to environmental change.

bioinformatics have provided more details about the types of DNA viruses that inhabit different soil ecosystems (3)(4)(5), including thawing permafrost (3,4), dry Antarctic soil (6), forest soil (7), and a range of soils from a global ecosystem comparison (5). These metagenomic analyses have suggested that the majority of soil DNA viruses are bacteriophages (3)(4)(5), although giant viruses that infect eukaryotes have been detected in forest soil (7) and permafrost (8). Additionally, soil viruses identified from metagenomes have been shown to have auxiliary metabolic genes (AMGs) that potentially encode an array of metabolic functions outside essential host infection and viral replication (3,4). The high prevalence, host interactions, and intriguing auxiliary metabolic functions of DNA viruses highlight their potential ecological importance in soil.
Climate change is currently influencing the soil environment, with unknown consequences on soil viruses. For example, changes in precipitation patterns result in soil moisture shifts that influence soil biotic and abiotic properties (9). Field and laboratory studies have shown that changes in soil moisture impact the microbial composition and functional potential of the soil microbiome (10,11). However, it is currently unknown how changes in precipitation shape the soil virosphere and influence viral types, abundances, activities, and lytic/lysogenic lifecycles (12,13). During the lytic life cycle, bacteriophages replicate inside the host and are released over short intervals via lethal disruption of host cells (14). In contrast, during the lysogenic life cycle, a prophage does not directly result in virion production or release. Instead, the prophage is replicated along with the host cell genome during binary fission (14). Information about viral life strategies is key to predicting how soil viruses will regulate host dynamics in response to shifts in precipitation due to climate change (15,16). By microscopy, abundance of virus-like particles in soil has been positively correlated with soil water content (17). This could be a result of increased bacterial numbers and enhanced diffusion/advection under high soil moisture conditions, which could increase the chances of virus-host encounters as well as prophage induction to a lytic life cycle. In contrast, bacteria extracted from dry Antarctic soils have been shown to contain high proportions of temperate viruses (18). These findings lead to a hypothesis that lysogeny is higher in dry soils than in wetter soils. This hypothesis is particularly important to test due to the roles that viruses play in the regulation of host abundances and other, yet unexplored, auxiliary metabolic functions.
Here, we aimed to contribute new information to this complex problem by directly comparing viral diversity and abundance, prevalence of lysogeny, and metabolic potential in three grassland soils across the continental United States with historical differences in annual precipitation. One extreme was an arid grassland soil from eastern Washington (WA), an area with hot dry summers and 180 mm average annual precipitation (19). The other extreme was a grassland soil from Iowa (IA), with large amounts of annual precipitation (934 mm) and tile drainage to prevent soil moisture saturation (20,21). We also collected soil from Kansas (KS), a state with shifting precipitation due to climate change (9), having trends toward drier summers in the southwest and wetter summers in the northeast. The KS location was Konza native prairie with 835 mm annual rainfall (19). We used a variety of bioinformatics approaches to determine differences in viral lifestyles, abundances, and functional potentials from bulk soil metagenomes that were obtained across the locations. The results of this study indicate major differences in soil viruses along the precipitation gradient and provide context for predicting how future changes in climate will influence the soil virosphere.

RESULTS
Soil samples were collected in triplicate in the fall of 2017 across the three grassland locations having differences in historical annual precipitation (IA . KS . WA). For each location, one deeply sequenced composite metagenome (19) (.1 terabase [Tb]) was obtained to detect viral contigs using a bioinformatics workflow (Fig. 1). In addition, each of the three field soil samples were individually sequenced (;0.5 Tb each) to provide three replicate metagenomes for a statistical comparison of the impact of historical annual precipitation on viral types, relative abundances, diversity, and AMGs. Therefore, there were four metagenomes per site (one large and three smaller) for a total of 12 metagenomes. The soil samples were also analyzed for soil biotic and abiotic properties, including soil moisture content at the time of sampling, chemical composition and microbial community biomass, composition, and diversity ( Fig. S1; Fig. 2). Although KS had a historically lower annual precipitation than IA, at the time of sampling, KS had the highest average soil moisture content (30.32%), followed by IA (20.22%) and WA (5.03%). Unlike the other two sites, the IA soil was tile drained to prevent the soil from being saturated, which might be one explanation for its relatively lower soil moisture content at the time of sampling. Alternatively, the relatively lower soil moisture content in IA compared to KS at the time of sampling could be due to timing of recent precipitation events. The WA soil also had a much higher pH (8.67) compared to the other two soils (IA, pH 6.81; KS, pH 6.25). There were also significant differences in organic matter content (IA . WA, P , 0.01; IA . KS, P , 0.05) and iron concentrations (IA . WA, P , 0.01) across some of the locations (Fig. S1).
A 16S rRNA gene-based analysis revealed that the grassland microbial communities were dominated by Actinobacteria and Proteobacteria. The microbial community composition was more similar in IA and KS compared to WA (Fig. S2). The microbial biomass was estimated by the DNA concentration extracted from the soil, with the highest estimates in IA compared to KS (P , 0.05) and WA (P , 0.01) (Fig. 2a). Microbial diversity, based on the number of 16S rRNA gene clusters (at 97% identity) FIG 1 Overview of bioinformatic workflow. The bioinformatic workflow comprises four modules. (a) Viral mining. Contigs with length greater than 2,500 bp (bp) were selected and screened for viral sequences by an integrated approach that combines a probabilistic approach (1) together with database searching (2) against IMG-VR (63) and (3) VirSorter (66) (NCBI), and machine learning (3). Four different criteria were then used to determine highly confident viral assignments. Only those that corroborated at least three out of the four criteria were included as confident viral sequences for further analyses. (b) Viral clustering. An integrated viral clustering network was constructed based on a protein sharing matrix by vContact (v2.0.9.10) (22). The remaining viral contigs were then linked to the clusters according to their tetranucleotide frequencies (pyani v0.2.9) (92). (c) Host assignments. The primary method (M1) used to assign the hosts of viral contigs was via matching CRISPR spacers in the nonviral contigs with the classified taxon. The additional host assignment methods (M2 to M4) were based on sequence similarity to either nonviral sequences (M2) or reference viruses (M3 based on local alignment; M4 based on protein sharing and tetranucleotide frequency). (d) Auxiliary metabolic gene (AMG) classification. Potential AMGs were first annotated by searching different functional gene databases (CAZY [84], EggNOG [61], and FOAM [85]). Potential AMGs were then classified into five categories based on their gene arrangements, and only those that contained viral genes both upstream and downstream and had a confirmed 3D structure reconstruction by Phyre2 (87) were considered AMG candidates. enumerated in the metagenomes, exhibited the opposite trend, with the lowest diversity in IA and highest in WA (IA versus KS, P , 0.05; IA versus WA, P , 0.05; Fig. 2b).
Viral abundance, diversity, and potential activity across the precipitation gradient. A total of 2,631 viral contigs (.2.5 Kb with an average length of 11 Kb) were identified from the three grassland soils, and 14 of these were complete and high-quality viral genomes (Table S1 and Fig. S3). The total number of viral contigs was highest in WA (1,577 contigs), followed by KS (756 contigs), and lowest in IA (298 contigs). This trend still held when mapping reads from three additional replicate metagenomes per site to the viral contigs and normalizing per gram of soil (IA versus KS, P , 0.05; IA versus WA, P , 0.05; KS versus WA, P , 0.05; Fig. 2d).
The number of viral clusters per sample (richness) was used to represent viral diversity and compared across the grasslands with a historical precipitation gradient. Only 63% of the viral contigs (1,666 out of 2,631 viral contigs) could be grouped into clusters. In total, 214 viral clusters were obtained. We first analyzed the contigs using vConTACT (22), which relates different contigs to each other based on a protein sharing matrix which clustered 58% of the viral contigs into 143 clusters (nodes in circular shape, Fig. 3a). Another 5% of the viral contigs (139 viral contigs; nodes in diamond shape, Fig. 3a) were added to the vConTACT clusters as extended edges based on correlation of their tetranucleotide frequency Z scores. An interactive network of viral clustering (Fig. S4) is available for a closer exploration of these clusters that can be filtered by the complexity of each cluster (number of edges). The majority of the clusters (144) were novel and did not cluster with known viruses in reference databases, and the rest were clustered with NCBI reference genomes representing viruses spanning four Percentage of CRISPR spacers extracted from each soil metagenome that exactly matched viral contigs identified from the same site. (d) Estimates of viral abundances in the three soils. The sum of the average read coverage of the viral contigs identified in each soil (identity, .95%; coverage, .80%) was used to represent the viral abundance at each site. (e) Viral diversity based on clustering of detected viral contigs using a protein sharing matrix and tetranucleotide frequency (details in Materials and Methods). (f) Abundance of viral lysogenic markers. The total read coverage (identity, .95%; coverage, .80%) of viral genes encoding integrases and excisionases was used to assess the prevalence of lysogeny. All of the values shown in panels a, b, d, e, and f were first normalized to gram of soil and then log transformed. Statistical tests were performed via pairwise comparisons among the three grasslands (n = 3). Significant differences were determined using t tests in the R package ("rstatix"). In each boxplot (panels a, b, c, d, e, and f), the top and bottom of each box represent the 25th and 75th percentiles, respectively, and the center line indicates the median. orders, Caudovirales, Kalamavirales, Petitvirales, and Tubulavirales. These results highlight that soil viruses are highly diverse and largely uncharacterized. Similar to differences in viral relative abundance and in microbial diversity across locations, the number of viral clusters was highest in WA (compared to IA and KS, P , 0.05; Fig. 2e) and lowest in IA (compared to KS and WA, P , 0.05; Fig. 2e). A total of 45 viral clusters were detected across all of the grassland soils, referred to as "common clusters." Of these, 37 contained viral contigs with genome coverages of .50% in the replicate metagenomes (Fig. 3b). The estimated abundances of the common clusters accounted for an average of 67% of the IA virosphere, 55% of the KS virosphere, and 50% of the WA , and Iowa (IA, orange) grassland metagenomic sequences were clustered together with NCBI reference viruses (gray). Viral contigs that primarily clustered based on their protein sharing matrices are shown as closed circles, and contigs that were added to the vConTACT network based on their Z score correlations by tetranucleotide frequencies are shown as diamonds. (b) The heatmap illustrates abundance estimates for each viral cluster detected in the three replicate metagenomes from each site. The estimated abundances were log transformed, and warmer colors represent higher abundances. The abundance profile was grouped according to the similarity of the viral cluster composition. Viral clusters that were common and detected in all soil locations are labeled "common" (red line). The clusters that were found in any two of the three soil locations are labeled "shared" (brown line). Clusters that were unique to each site are labeled "site-specific" (blue line). (c) The percentage of common and shared viral clusters that differed according to differences in historical precipitation. Viral clusters with significantly differential abundances across sites were colored in red (P , 0.05), with the remainder in gray. (d) Percentage of contigs with CRISPR spacer hits in common clusters (red) versus site-specific clusters (in blue). In each boxplot, the top and bottom of each box represent the 25th and 75th percentiles, respectively, and the center line shows the median. virosphere (Fig. 3b). An additional 55 clusters were shared between two of the three sites, and the majority of these clusters (23) were also detected in the replicate metagenomes ("shared clusters"; genome coverage, .50%; Fig. 3b). The remaining 114 clusters were unique to only one soil location ("site-specific clusters"). The majority of the site-specific clusters (103) were also detected in the replicate metagenomes (genome coverage, .50%; Fig. 3b). The remaining clusters that were not detected in the replicate metagenomes could represent contigs that were in low abundance.
Differential abundance of the common and shared viral clusters that were detected in the replicate genomes was calculated and compared across samples. Many of these viral clusters significantly responded to differences in historical precipitation (24% of common clusters and 32% of shared clusters; Fig. 3c). Common clusters contained significantly more viral contigs that showed signs of host interaction, as was evident by their significantly higher frequency of CRISPR spacer hits than site-specific clusters (P , 0.01; Fig. 3d).
Potential virus-host interactions across the precipitation gradient. Potential virus-host interactions at the time of sampling were determined by sequence homology searches for exact matches of CRISPR spacers against the viral contigs (24). The CRISPR spacers were bioinformatically identified from the contigs representing cellular genomes. This method has been previously demonstrated for predicting potential interactions between viruses and prokaryotes (5,25). A total of 3,708, 2,791, and 746 CRISPR spacers were matched to the viral contigs detected from IA, KS, and WA, respectively (average of absolute counts from three replicates for each soil). As prokaryotes keep an inventory of viral genome fragments from previous and current viral infections as arrays of CRISPR spacers (26), we used the percentage of spacers that were exact matches to the viral contigs identified at the time of sampling to inform the relative frequency of more recent virus-host interactions. These differences were significant across the precipitation gradient (IA versus WA, IA versus KS, and KS versus WA; P , 0.05; Fig. 2c), with a higher percentage in the historically wetter IA soil (15%), followed by KS (6%) and WA (2%). In addition, we assessed the coverage of lysogenic markers, such as genes encoding integrases and excisionases, that have previously been used as proxies for the prevalence of lysogeny (27). The relative abundances of these markers for lysogeny were higher in the historically drier WA soil (Fig. 2f). These findings suggest that temperate phages were more abundant in historically drier soil.
Four nested methods were applied to assign putative hosts to the viral contigs (M1 to M4, Fig. 1c). One percent of the nonviral contigs that are representative of possible viral hosts contained CRISPR arrays, yielding 12,892 spacers, with 8,481 of them being unique. These were used for CRISPR matching (3, 4) (M1, Fig. 1c) to assign 43% of the viral contigs to their hosts. Subsequently, to increase the number of host assignments, a second method was applied to assess the sequence similarities of viral contigs to nonviral sequences (M2, Fig. 1c). Using this approach, hosts for an additional 8% of the viral contigs were assigned. The third method (M3, Fig. 1c) assigned an additional 1% of the viral contigs to host taxa according to their sequence similarities to NCBI reference viral genomes with known hosts. Finally, the fourth method (M4, Fig. 1c) assigned potential hosts to another 10% of the viral contigs by clustering them with known reference viruses.
Virus-host pairings revealed that the detected soil viruses potentially targeted a broad range of host phyla across kingdoms, including archaea, bacteria, and fungi ( Fig. 4). Although the viral communities recovered from the three grasslands differed in relative abundance and overall composition, the majority of the detected viral contigs were consistently assigned to Acidobacteria, Actinobacteria, and Proteobacteria as the hosts, which also represent the dominant grassland soil microorganisms (Fig. S2). These dominant bacterial phyla were also the primary hosts of the common viral clusters (Fig. 4). For the shared viral clusters, the host composition of KS and IA grasslands were more similar than those for WA (Fig. 4). In contrast, the site-specific clusters had different predicted hosts, thus illustrating unique assemblages of soil viruses and hosts across sites with differences in historical precipitation (Fig. 4). Interestingly, some of the host assignments that we found have not previously been reported for soil viruses in thawing permafrost (3,4) and the Earth's virome study (5) (e.g., Candidatus groups NC10 [28] and Rokubacteria [29]).
Within VC_110, one of the most abundant common clusters, there were a total of 71 viral contigs, and 6 of these contigs were potential viral generalists (5). Each of the 6 viral contigs had more than 10 spacer hits and could be assigned with a broad host range spanning across several phyla, including Actinobacteria, Proteobacteria, Acidobacteria, Chloroflexi, Planctomycetes, and Verrucomicrobia (Table S1).
Viral auxiliary metabolic genes (AMGs) differed across the precipitation gradient. We used a three-step approach to classify auxiliary metabolic genes (AMGs) from the detected viral contigs (details in Fig. 1d) and to estimate their abundances across the three locations (Fig. 5). A total of 94 putative AMGs were found that were represented in 18 metabolic pathways, including energy acquisition and carbon ). Alluvial plots illustrate virus-host pairing using the integrated approach described in Fig. 1c (host assignment). The plots are grouped by the grassland soil location (IA, orange; KS, blue; WA, purple). The number of predicted virus-host pairs are shown on the y axis. Within each soil location, virus-host pairs were plotted for viral clusters that were common to all three locations (common), shared among two of three locations (shared), or only found in one location (site-specific). For each plot, the left stratum represents host assignment, colored by phylum, and the right stratum shows viral clusters separated by horizontal white lines. The flow of the pairing is colored by the host lineage assigned. The height of the colored strata demonstrates the relative dominance of each host phylum that the identified soil viruses were predicted to target.
Precipitation Impacts on Soil Viruses ® metabolism (Table S2). The most confident AMG assignments (category 1, Fig. 1d) had viral hallmark genes on the same contig as the AMG. There were 65 AMGs that fulfilled category 1 criteria; 14 of these had viral hallmark genes both upstream and downstream of the AMG. The other 29 AMGs fulfilled category 2 criteria (Fig. 1d) without viral hallmark genes but with viral nonhallmark genes both upstream and downstream (the genomic context of category 1 and 2 AMGs is provided in Table S2). The viral contigs with AMGs are visualized in Fig. S5. Identified AMGs were dereplicated based on annotated functional categories, and AMG richness was used to represent the range of unique auxiliary metabolic potentials of the viral community of each site. The historically driest site (WA) had the highest AMG richness, with 28 that were involved in 17 pathways, followed by KS with 15 AMGs involved in 7 pathways and IA with 8 AMGs involved in 7 pathways (Fig. 5).
An unexpected finding was the discovery of bacteriophages carrying putative AMGs with homology to parts of the oxidative respiratory chain. These AMGs were found on viral contigs from the KS grassland; one contig contained a cytochrome c oxidase, and one contig contained three subunits for NADH-quinone oxidoreductase ( Fig. 5 and Fig. S5). The putative AMGs annotated as three subunits for NADH-quinone oxidoreductase on one viral contig were classified as category 2 AMGs with viral genes both upstream and downstream after manual inspection (details in Materials and Methods, Table S2, and Fig. S5). AMGs encoding NADH-quinone oxidoreductase subunits were all located on the same contig; genes for subunits L and J were located next to each other, and the gene for subunit M was located 16 Kb downstream. Each of the three AMGs had at least one viral nonhallmark gene upstream and downstream, ). The heatmap illustrates abundance estimates for each putative AMG in three replicate metagenomes (IA, orange; KS, blue; WA, purple). The estimated abundances represent log transformed read coverages of the viral contigs detected with the AMGs, and warmer colors represent higher abundances. The abundance profile is clustered by sample according to the similarity of AMG composition. AMGs are further grouped by KEGG Orthology ("function," details in Table S2). AMGs with a black star in the "function" cells are also detected in the complete viral genomes (Fig. S3).
suggesting that they were carried on a virus (Table S2 and Fig. S5). To our knowledge, this is the first report of finding AMGs involved in oxidative respiration in soil viruses and could be analogous to the discovery of AMGs involved in photosynthetic production of ATP via proteorhodopsin in marine viruses (30). Examples of other intriguing auxiliary metabolic potentials in viral contigs included (i) an acyl carrier protein (category 2 AMG), a key enzyme for fatty acid biosynthesis in all domains of life, (ii) genes involved in the production of UDP-glucose (category 2 AMG), a substrate for cellulose biosynthesis, (iii) an alkaline phosphatase (category 2 AMG), a key enzyme involved in phosphate cycling, and (iv) genes regulating bacterial sporulation (Fig. 5).
Another set of putative AMGs were found to be involved in metabolizing a diverse range of carbon compounds with various complexities, including xylan, cellulose, chitin/chitosan, catechol, pectate, fructose, and mannose (Fig. 5). For example, we detected three AMGs predicted to be involved in xylan degradation-acetyl xylan esterase (category 1 AMG with viral hallmark genes both upstream and downstream), b-1,3-xylanase (category 1 AMG with viral hallmark genes both upstream and downstream), and b-1,4-xylanase (category 2 AMG) that converts xylan to xylose. Genes encoding viral chitinases were also identified in metagenomes from all 3 grassland soils. In contrast, catechol 1,2-dioxygenase (category 1 AMG with a viral hallmark gene downstream), an enzyme involved in the breakdown of aromatic compounds, was only found in the WA grassland soil. Five putative AMGs involved in mannose metabolism were also identified on 16 viral contigs; 15 were category 1 AMGs and 1 was a category 2 AMG. Among these, the mannan endo-1,4-betamannosidase AMG was previously confirmed to be active in cleaving b-1,4-linked mannose, a plant-derived polymer, in a soil virosphere from thawing permafrost (3). In addition, 6 enzymes involved in folate biosynthesis were detected as putative AMGs on 17 viral contigs (13 category 1 and 4 category 2). Three of the AMGs for folate biosynthesis (6-pyruvoyltetrahydropterin/6-carboxytetrahydropterin synthase, 7-carboxy-7-deazaguanine synthase, GTP cyclohydrolase) were located on the same viral contigs in WA and KS metagenomes.
We also found putative AMGs on completely closed viral genomes (Fig. S3), which enables connections of viral AMGs to the other viral genes. All 14 of the complete viral genomes were characterized as novel bacteriophages with no qualified matches to NCBI viral genomes (coverage, .50%; identity, .50%). Two of the complete viral genomes were detected from KS and 12 from WA. The AMGs annotated on these complete viral genomes included those involved in metabolism of xylan, cellulose, mannose, chitin/chitosan, and pectate. As these 14 soil viruses were assigned to the dominant bacterial taxa as putative hosts (i.e., Actinobacteria, Acidobacteria, and Proteobacteria), the identified auxiliary metabolic functions carried by the soil viruses could potentially impact the turnover of organic matter in soil by influencing the host communities.
Structural equation modeling suggested causal relationships between biotic and abiotic data. We applied structural equation modeling (SEM) to a constructed model based on the above-described findings to determine potential dependencies between environmental factors, soil viruses, and microbial communities across the historical precipitation gradient (31,32). The resulting SEM provided a good fit as evaluated by a chi-squared (x 2 ) test (x 2 = 7.16, degree of freedom, or df, = 10, P = 0.71) and other commonly used indices (goodness of fit, or GFI, = 0.83, comparative fit index, or CFI, = 0.99, standardized root mean square residual, or SRMR, = 0.04) (33). Thus, the SEM provides a basis to inform possible causal relationships among the data for future evaluation (Fig. 6a). We recognize that this model could be strengthened with additional data, so here it is presented as a foundational framework for future validation.

DISCUSSION
Here, we aimed to determine how differences in historical precipitation influence viral ecology. We compared viral abundance, diversity, lysogenic markers, and AMGs in bulk soil metagenomes from three geographically distinct grassland/prairie soil locations in IA, KS, and WA. We acknowledge that the different soils varied in many other ways, in addition to differences in precipitation. For example, there were differences in soil chemistry between the locations that could confound interpretation of our results. Also, KS and IA were much more similar in many respects, including historical precipitation, compared to the arid soil in WA. We hypothesized that historically drier soil environments would select for temperate viruses, where the hosts are less abundant and less active than in historically wetter soil environments. This hypothesis was supported by the SEMbased analysis that revealed a strong influence of historical precipitation and associated biotic and abiotic factors on relative abundance and activity of soil viruses, with a potential impact on soil organic matter content. Interestingly, in the historically drier WA soil, where microbial biomass was lower, a more diverse and abundant DNA viral community was detected (Fig. 2). This apparent dichotomy could be explained by a preference for temperate viruses in the virosphere of historically dry soils. Indeed, we found several indications of lysogeny in the WA grassland, including significantly fewer virus-host interactions via the CRISPR-Cas system and more viral integrases and excisionases. A preference for lysogeny, therefore, may allow soil viruses to persist under dry conditions.
Because soil DNA viruses are primarily bacteriophages that infect dominant bacterial taxa such as Acidobacteria, Proteobacteria, and Verrucomicrobia as observed in this study, bacteriophage predation pressures may contribute to substantial bacterial cell death under certain conditions (3)(4)(5). For instance, cell lysis may be enhanced when soil conditions become favorable for the increase of microbial biomass (2) and/or for the transition from lysogenic to lytic viral life cycles. Our data suggested that temperate phages were more prevalent in the historically dry WA soil, as has been observed in other dry soils (18). Thus, we predict that the switch of temperate prophages to lytic life cycles, for example following a wetting event, could have profound impacts on bacterial host dynamics and organic matter cycling in traditionally dry soils. virus-host interactions (orange boxes) and the virosphere (blue boxes). The estimated abundances of viral integrases and excisionases were used to infer a lysogenic lifestyle ("lysogeny" in blue box). The total read coverages of the viral contigs with differential abundances (data in Fig. 3c) were used to represent the viral abundances in each location ("Vabd" in blue box). The DNA yield per gram of soil was used to estimate the microbial biomass ("Mbio" in orange box). The percentage of CRISPR spacers that were exact matches to the detected viral contigs was used to represent the degree of virus-host interactions ("VHI" in orange box, data shown in Fig. 2c). Organic matter ("OM" in brown box) represents the percentage of soil organic matter. The number of auxiliary metabolic categories detected from each grassland is noted as "AMG" (blue box). Blue and red arrows represent positive and negative pathways, respectively. Arrow width is proportional to the strength of the relationship, and numbers on the arrows are the path coefficients and the P value. The direction of arrows represents the direct impact of one variable on another supported by SEM. Parameters evaluating the model fitness were Chi-square (x 2 ) = 7.16, df = 10, P = 0.71, goodness of fit, or GFI, = 0.83, comparative fit index, or CFI, = 0.99, and standardized root mean square residual, or SRMR, = 0.04. (b) An illustration based on the SEM model to summarize viral responses to either wet or dry soil conditions and their associated impacts on the soil microbial community. In wet soil where the environment is more homogenous and microbes are higher in biomass and lower in diversity, viruses are more active and frequently interact with hosts, resulting in more host lysis. The carbon released due to host lysis may contribute to the organic matter pool in soil. In dry soil where the soil habitat is more disconnected, microbial diversity is higher and so is the associated virosphere. Instead of immediately lysing the hosts, temperate viruses carrying AMGs are selected. The hosts (lysogens) may in turn benefit from the auxiliary metabolic functions carried by the viruses, both to cope with the dry environment and when moisture becomes available.
Precipitation Impacts on Soil Viruses ® We also found higher viral diversity in soil with lower historical precipitation. A possible explanation for this finding is that as soil dries, it experiences reduced hydrologic connectivity that can promote niche differentiation and lead to higher microbial and viral diversity within soil microenvironments (15). Other factors that could also be contributing toward the differences between the grassland soils include differences in pH. For example, WA had a much higher pH than the other two soils. Because pH is a strong driver of soil microbial community composition (35) and the dominant bacterial taxa in the studied grasslands were targeted by viruses, pH may also indirectly influence viral diversity. However, IA and KS had similar soil characteristics, including pH and viral host composition, but still exhibited significant differences in their viral communities that were aligned with differences in historical precipitation. Other work across the Stordalen permafrost thaw gradient has suggested that viral community composition was correlated with bacterial community composition, pH, and soil moisture content (3). Together with our data, these results highlight a dynamic viral response to host variation and environmental variables (36).
Our findings also indicate that despite differences in environmental variables and soil biogeochemistry across the sites, grassland soils harbor some common viral groups. The majority of the viral clusters were site specific and not shared between the three soil locations. However, 21% of the viral clusters were found across all three soils (Fig. 3). These common clusters accounted for more than half of the estimated viral load of each grassland soil. The common clusters were most abundant in IA, the site with the highest historical precipitation. In addition to the high abundance, these common clusters were enriched for viral contigs that actively interact with the hosts via the CRISPR-Cas system. Some of the common clusters contained viral contigs that represented putative generalists that recently interacted with more than one bacterial host. We propose that representative viruses in common viral clusters are well adapted to diverse ecological contexts in grassland soils, and importantly, are able to partner with many species that are prevalent in soil microbiomes. In contrast, viral representatives in site-specific clusters may reflect the impact of historical differences in precipitation and/or soil biogeochemical factors on viral type selection and diversity. Further studies of the patterns of common and site-specific viral clusters can fuel our knowledge of viral ecology in different soil matrices and how biotic/abiotic factors play a role in structuring viral communities.
Similarly, some of the common clusters contained potential viral generalists that infected more than one host. The concept of "viral generalists" with expanded host ranges has long been suspected (37), and recent studies have reported viruses with hosts across orders (38) or even phyla (5,38,39). Examples include a global virome survey that found several viral clusters associated with hosts from different phyla (5). In addition, four phages isolated from Lake Michigan were shown to infect different bacterial phyla (39). A broad host range facilitates the ability of the viruses to cope with stresses encountered in soil, including dispersal limitation in heterogeneous environments (40)(41)(42)(43). We hypothesize that viruses with an expanded host range could be a beneficial strategy for more successful infections in soil environments. We acknowledge that the bioinformatic assignment of virus-host linkages only suggests possible virus-host pairings with chances of introducing false-positive assignments. More validation is therefore needed to determine the host range of soil viruses.
Soil bacteriophages with predominantly lysogenic lifecycles also have the potential to provide metabolic and evolutionary advantages (44,45) to their hosts. Our SEM analysis supported this hypothesis and linked incidence of lysogeny to AMG richness. Furthermore, AMG richness was also connected to host interactions via CRISPR-Cas immunity, which indirectly influences both microbial and viral abundance (Fig. 6a). Finally, the historically driest soil had the highest number of AMGs, suggesting a selection for viruses with a wider range of auxiliary metabolic potentials that could be beneficial to hosts coping with environmental stresses. These results provide new perspectives on potential impacts of AMGs carried by temperate viruses on the soil microbial community that remain to be further tested with paired expression data.
We discovered putative AMGs with a range of metabolic functions on the viral contigs and genomes, including energy acquisition via oxidative respiration, fatty acid biosynthesis, phosphorus cycling, precursor processes for central metabolism (e.g., folate biosynthesis), host sporulation (e.g., whiB), and carbon cycling (Fig. 5). Some of these AMGs have previously been detected on viral contigs in other studies (4,46,47), but several were novel. For example, we found AMGs that could potentially encode three subunits of NADH oxidoreductase (all on one contig) and cytochrome c oxidase that are involved in the oxidative respiratory chain that generates ATP. We propose that this finding is analogous to the finding of ATP production via proteorhodopsin in marine viruses (48), although this remains to be experimentally validated. As viruses need to use hosts' ATP for DNA packaging and capsid maturation (49), the oxidative phosphorylation-related AMGs could enhance host energy production that potentially benefits viral replication during infection. We were also able to detect AMGs on some of our high-quality, complete viral genomes, including a set of putative carbon metabolic genes (Fig. S3), e.g., for metabolizing xylan, cellulose, mannose, chitin/chitosan, and pectate, the common components of plant and fungal cell walls (50,51). The AMG candidates that we detected here further contribute new knowledge of how soil viruses could potentially contribute to the metabolic functions of soil microbiomes.
In summary, this study provides new knowledge of soil viruses across three disparate grassland locations with historical differences in annual precipitation. SEM-based analysis inferred possible causal relationships between precipitation and the soil viral and microbial communities. Based on the SEM, we present a conceptual model of the soil viral response to either wet or dry soil conditions (Fig. 6b). In drier soil temperate viruses are more prevalent, perhaps because their microbial hosts are less abundant and less active than in wetter soils. In contrast, in wet soil where the host microbes are more active and abundant, viruses would also be more active with more frequent viralhost interactions (as observed via CRISPR-Cas), potentially resulting in more host lysis. The higher turnover of microbial biomass due to viral lysis could also be linked to higher organic matter content in wet soil than in dry soil. Furthermore, the higher prevalence of AMGs in dry soil may mirror the "batten down the hatches" strategy proposed for marine viruses, where the infecting viruses turn the hosts into lysogens, increasing host fitness, and therefore their own, via metabolic complementation under stress conditions (34). Although future work is needed to experimentally validate this conceptual model, it should serve as a valuable framework for hypothesis testing.

MATERIALS AND METHODS
Soil sampling and DNA extraction. Three surface soil cores (top 5 to 15 cm) were collected from each of three randomly selected field block locations (;10 m apart) from grassland soils in Iowa (41°559140N, 93°4 49590 W), Kansas (39°N 069120 N, 96°369500W), and Washington (46°159040N, 119°439430W) in the fall of 2017. WA and KS represented an unmanaged, native grassland and native prairie, respectively. Iowa was a reconstructed grassland/prairie and was tile drained. The individual soil cores were shipped in sterile foil packets on blue ice to the Pacific Northwest National Laboratory (PNNL) for processing. Upon receipt at PNNL, the three soil cores per field block location were immediately combined, sieved (4 mm), aliquoted into 50-ml Falcon tubes, flash-frozen, and stored at 280°C until processing. This resulted in a total of three field soil replicates per grassland soil location. The field soil replicate samples were sent to Kuo Lab testing (Pasco, WA) for soil chemical analysis (E1), following standard protocols (51).
Total community DNA was extracted from multiple 0.25-g samples for each of the three field replicates to achieve sufficiently high DNA yield for sequencing. Extraction was performed using a PowerSoil DNA extraction kit (Qiagen, Germantown, MD) following the manufacturer's instructions with the addition of a 20 min, 55°C incubation prior to bead beating. The eluted DNA was quantified with the Qubit BR DNA quantification kit (Invitrogen, Waltham, MA). DNA yield from the soils was used for biomass estimates. The total amount of soil used for generating the respective metagenomes was used to normalize the data per gram of source soil.
Metagenome sequencing and assembly. Metagenome libraries were prepared with the TruSeq PCR free kit (Illumina, San Diego, CA) using 1 mg DNA as starting material. For obtaining metagenomes with deep coverage of soil virosphere, the DNA extracted from the field replicates were combined to obtain enough DNA for sequencing, resulting in one large metagenome for each location (IA, KS, and WA). Each of the large metagenomes per grassland location was sequenced using seven lanes of an Precipitation Impacts on Soil Viruses ® November/December 2021 Volume 12 Issue 6 e02595-21 mbio.asm.org 13 Illumina HiSeq X (Fulgent, California; noted as "large metagenome"), resulting in each having a size of ;1.1 Tb with 7.3 to 7.7 billion paired-end reads (19). Assembly of the large metagenomes was performed using the metaHipMer pipeline (52) on the NERSC Cori platform (https://docs.nersc.gov/systems/ cori/). The details of sequence processing and assembly parameters were described previously (19). Sequence information of the metagenomes with accompanying summary statistics is in Table S3. In addition, the three field replicates per grassland location were sequenced separately (Illumina HiSeq X; Genewiz, New Jersey), yielding an average of 0.5 Tb sequence data per replicate (Table S3; noted as "replicate metagenome"). Each of the replicate metagenomes was used to calculate the depth of coverage of the contigs assembled from the respective large metagenome for statistical analyses. All of the reads of the replicate metagenomes were trimmed to remove the Illumina adaptors and to retain high-quality reads (score of .30 and length of .36 bases), as recommended by Trimmomatic (v0.33) (53).
To avoid double counting, the quality-filtered forward reads were used to align to the contigs that were assembled from the large metagenomes using BamM (v1.7.3, bamm make, https://github.com/ Ecogenomics/BamM), and the reads that mapped were filtered using stringent cutoffs, i.e., percent identity higher than 0.95 and percent alignment greater than 0.80 (BamM v1.7.3, bamm filter). After filtering, the abundances of the assembled contigs in the large metagenomes were estimated based on the average base coverage mapped by the reads of the replicate metagenomes (SAMtools v1.9, SAMtools depth, http://www.htslib.org/doc/). The read coverage for each detected viral contig was then checked by the breadth of genome coverage using InStrain (v1.3.9) (23). Abundance estimates for the viral contigs with quality-filtered paired reads covering .50% of the contig length were included for the following analyses to minimize false-positive detection of contigs as previously described (23). The abundance estimates of the viral contigs with genome coverages of less than 50% in the replicate metagenomes were transformed to zero for the following analyses. Finally, the read coverage was normalized to the number of reads per gram of source soil that were used for DNA extraction to enable cross-site comparisons.
Bioinformatic identification of viral contigs. Metagenome-assembled contigs with lengths longer than 2.5 Kb were included in our pipeline for bioinformatic assignment of viruses. Here, we also acknowledge that inclusion of contigs that are shorter than 10 Kb has been discouraged (54) due to a likelihood of introducing false positives. Because of our stringent downstream analyses to confirm viral contigs, we retained a more permissive length cutoff of 2.5 Kb. For confident viral assignment, we integrated both database-dependent and -independent assignment approaches and specified four criteria using the stringent cutoffs suggested previously (3, 54-56) (Fig. 1a). The first approach is a probabilistic method. Contigs passing screening using this probabilistic method were noted as passing criterion 1 ("1. Probabilistic method" in Fig. 1a). Genes were predicted and translated using Prodigal (57) and searched against several custom and existing viral and microbial genome databases using hmmsearch (Hmmer v3.1b2 [58], E value cutoff of 1.0e 205 ). The five databases searched included three viral databases: (i) a custom database curated previously (55) that comprises 25,218 viral protein hidden Markov models (HMMs) built upon viral protein-coding genes from NCBI viral genomes (55), (ii) a custom database comprising 1,147 curated viral protein families (Pfam [59], Table S4), and (iii) the existing Nucleo-Cytoplasmic Viruses Orthologous Groups (NCVOG) protein set (60). In addition, the EggNOG bacterial and archaeal databases (61) were screened to discriminate viral proteins from those known to be bacterial or archaeal. The bit scores obtained from the five databases were ranked, and taxon annotations were assigned to the highest bit scores, with a minimum bit score of 50 as the threshold (62). Contigs with more "viral-like" than "bacterial or archaeal-like" genes were considered likely viral candidates and satisfied criterion 1.
A second comprehensive contig-based search was also used to screen for soil viral contigs, thus satisfying our criteria 2 and 3 ("2 & 3. Database searching" in Fig. 1a). The contigs were searched against two virus-specific databases: (i) IMG/VR (released on 1 July 2018), the largest publicly available database that includes both cultured and uncultured viral sequences (63), and (ii) the curated RefseqVirus genomes database (NCBI RefseqVirus [64] v69, January 2014). The length-filtered assemblies (.2,500 bp) were compared to the viral sequences in IMG/VR using a basic local alignment search tool, BLASTN (65). Contigs having E value cutoffs equal to or lower than 1.0e 205 fulfilled our criterion (Fig. 1a) and were retained and checked by the other criteria. Additionally, for contig-based searching, we used the existing tool, VirSorter (66), which searches against the RefseqVirus genomes and evaluates the searches using a strategic scoring matrix. If the contigs passed screening by VirSorter (categories 1 to 3) (66), they satisfied our criterion 3 (Fig. 1a). We recognize that category 3 VirSorter contigs, characterized as "possible" viral contigs, are more likely to be false positives than categories 1 and 2. However, inclusion of category 3 contigs also provides room for detection of smaller lysogenic phages and novel viruses that cannot be annotated using the reference databases that are implanted in VirSorter (66). The contigs satisfying criterion 3 were thus passed through three additional screening criteria (criteria 1, 2, and 4) to constrain the searches. The third approach differentiated the viral nucleotide (8-mer) frequency patterns from nonviral ones by a machine learning approach using VirFinder (67). The contigs with P values less than 0.05 and scores greater than 0.9 passed criterion 4 (Fig. 1a). Contigs that satisfied at least three of the four criteria were assigned as identified viral contigs that were used for the following analyses.
Detection of lysogenic markers. To assess viral lysogeny, reads for lysogenic markers, e.g., genes encoding viral integrase and excisionase (69,70), were recruited from the three replicate metagenomes from each grassland location (IA, KS, and WA). Nucleotide sequences from a total of 20,712 unique viral integrases and 329 unique excisionases were collected from NCBI Virus and used as a reference database for lysogenic markers (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/, accessed on 16 December2020; sequence d-replication using VSEARCH v2.13.4, derep_fulllength). The same read mapping strategies described above were applied here as well. In brief, the quality-filtered forward reads were aligned to the curated lysogenic marker reference database using BamM (v1.7.3, bamm make, https://github.com/ Ecogenomics/BamM), and the mapped reads with percent identity higher than 0.95 and percent alignment greater than 0.80 were retained (BamM v1.7.3, bamm filter). The relative abundances of the known lysogenic markers in each of the three replicate soils per site were estimated based on the average base coverage mapped by the reads of the replicate metagenomes (SAMtools v1.9, SAMtools depth, http:// www.htslib.org/doc/). Viral clustering. The soil viral diversity was inferred by grouping highly confident viral contigs into clusters. vConTACT (v2.0.9.10) (71) was applied using the default parameters to first compute the protein similarities using Diamond (72), to generate the protein clusters using the Markov cluster algorithm (MCL) and to cluster the viral contigs according to the protein sharing matrix using ClusterONE (73) (for details refer to the published protocol at https://dx.doi.org/10.17504/protocols.io.x5xfq7n). The clusters generated by vConTACT (71) were visualized in a network with viral contigs as the nodes with the strength of node connectivity as the edges. Because of incomplete assemblies from soil metagenomes, contigs that were related and with multiple viral features without a significant number of shared proteins were unable to be clustered (71). Therefore, to further cluster more viral sequences from the assembled soil metagenomes, tetranucleotide frequency analysis (67,(74)(75)(76) was also implemented. The tetranucleotide frequency-based clustering method has been previously tested to group nonoverlapping but related viral contigs (74). To carefully benchmark the sensitivity and suitability of this method, Pearson correlations were calculated between Z score distributions for every tetranucleotide in a subset of viral contigs screened from grassland soils and in deposited NCBI viral genomes (64) (a total of 77 archaeal viruses, 1,675 bacteriophages, and 443 fungal viruses) after removing the dereplicated strings or substring using VSEARCH (77). When the same contig was assigned to multiple clusters, only the pairings with the highest correlation coefficients (greater than 0.6) were retained. The network in Fig. S6 shows that the tetranucleotide frequency-based method can cluster the related viral contigs according to the taxonomy annotations. We then applied this complementary method to the viral contigs that were initially unclustered by vConTACT (22). The additional viral contigs that were able to be clustered using tetranucleotide frequency analysis were merged into the vConTACT network as extended edges.
Host assignment. We used a nested approach to assign host taxonomy to the screened viral contigs (host assignment module, Fig. 1c). The viral contigs were first searched for spacer hits from microbial clustered regularly interspaced short palindromic repeat (CRISPR) regions, a hallmark of virus-host interactions (M1, Fig. 1c). Spacers were identified from the assembled contigs of the three soil metagenomes using MinCED (https://github.com/ctSkennerton/minced) and searched against the screened viral contigs using "blastn-short task" with previously described parameters (25,80). The taxonomy of nonviral contigs containing CRISPR regions was predicted using the Contig Annotation Tool (CAT, https://github .com/dutilh/CAT, default cutoffs). Putative host phylogeny was assigned based on the taxonomy annotation of the nonviral contigs carrying spacers that matched the screened viral contigs. The percentage of the detected CRISPR spacers that were exact matches to the identified viral contigs was used to estimate the relative frequency of virus-host interactions across the sites at the time of sampling.
As CRISPR arrays are not universally found in all prokaryotes, especially many soil bacteria (81), and assembling repetitive regions from metagenomes can be problematic (82), more complementary methods were needed to assign hosts. Thus, viral contigs with no matching CRISPR spacers were assigned to a potential host using a sequence homology screening approach based on assumptions that viruses will share genes with their hosts (83) and with relevant reference viruses. The viral contigs were searched against nonviral contigs from the same soil metagenomes that had taxonomy assignments (M2, Fig. 1c) and against NCBI viral genomes with known hosts (BLASTN [65]; thresholds of E value, 10 23 ; bit score, 50) (M3, Fig. 1c). Unassigned viral contigs inherited host assignments from clustered reference phages with known hosts (M4, Fig. 1c). Due to the largely uncharacterized viral and host diversity in soil, the rest of viral contigs remained unassigned.
AMG classification. Potential auxiliary metabolic genes (AMGs) were first annotated with the following three databases: the KEGG orthology database by EggNOG (61), the carbohydrate-active enzyme database (84) (CAZy, dbCAN HMMdb release 7.0) with stringent searching parameters, E value of ,1e 215 , and coverage of .0.35, and the functional ontology assignments for metagenomes database (FOAM [85], hmmscan, -cut_tc) (AMG classification module, step 1, Fig. 1d). Potential AMGs were then determined by checking gene positions on the viral contigs with or without viral hallmark genes and classified into five categories (step 2, Fig. 1d). Category 1 AMGs were present on contigs with "viral-like" genes (as annotated by the viral database searching from module 1) both up-and downstream and contained at least one viral hallmark gene, such as structural genes ("capsid," "portal protein," "tail," "coat"), terminases, or integrases. Category 2 AMGs were those with virus-like genes both up-and downstream but no viral hallmark genes. Category 3 AMGs were those with viral hallmark genes and virus-like genes only upstream or downstream. Category 4 AMGs were those with virus-like genes only upstream or downstream but no viral hallmark genes. Lastly, category 5 AMGs were those located at the edge of the viral contigs. We only selected categories 1 and 2 as AMG candidates for the following steps.
After search ranking and gene arrangement checking, the third checkpoint was to predict the 3D structure of the translated AMG and search against the Protein Data Bank (PDB [86]) through Phyre2 (87) (step 3, Fig. 1d). We only included viral metabolic genes in categories 1 and 2 that had highly confident Phyre2 predictions (100%) as putative AMGs. These stringent criteria were used to reduce the potential of host genome contamination and to identify AMGs with high confidence.
As chitosanases and chitinases are known to share structural similarities with viral lysozymes (88,89), we further screened the putative chitosanase and chitinase AMGs against a profile of lysozyme HMMs to assess their sequence similarities. The profile of lysozyme HMMs includes five pfams (PF13702, PF00959, PF04965, PF18013, and PF00062) and a lysozyme HMM built upon the newly self-curated lysozyme sequences that were deposited at NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Protein, accessed on 16 November 2020). The putative chitosanase and chitinase AMGs were searched against the lysozyme HMMs using hmmsearch with default options (Hmmer v3.1b2 [58]). The putative chitosanase and chitinase AMGs without lysozyme HMM hits were retained.
Microbial diversity estimation. The bacterial and archaeal 16S rRNA gene reads were first recruited from the replicate metagenomes of each grassland location (IA, KS, and WA) by mapping to the wellcurated 16S RefSeq database at NCBI (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/) via BamM (v1.7.3, bamm make). The read mappings were filtered by percent identity higher than 0.95 and percent alignment greater than 0.80 (BamM v1.7.3, bamm filter). The mapped 16S rRNA gene RefSeq sequences were then clustered at 97% identity by CD-HIT (v4.8.1) (90) to inform the microbial composition and diversity of the site.
Structural equation modeling. We applied structural equation modeling (SEM) (31,32) to test the strengths of the following hypotheses: (i) historical precipitation has an impact on soil viral life cycles and auxiliary metabolic potentials; (ii) Changes in viral life cycles and abundances impact the soil microbial community; and (iii) Viral and microbial dynamics influence soil organic matter concentration. The code availability of the applied SEM can be found in the section describing "Data and Code Availability." The quantitative data for soil viruses that were coded into the model included estimated relative viral abundance, estimated relative abundance of potentially lysogenic viruses, and richness of putative AMGs. The total read coverage of the viral contigs with differential relative abundances (data in Fig. 3c) was used to represent the viral abundance at each site. The estimated abundances of viral integrases and excisionases (details in the section of "Detection of Lysogenic Markers") were used to quantify viral lysogeny. The number of auxiliary metabolic categories detected from each grassland was used to estimate AMG richness. Microbial biomass was estimated by DNA yield per gram of soil (data shown in Fig. 2a). The degree of virus-host interactions was inferred by the percentage of CRISPR spacers that were exact matches to the detected viral contigs (data shown in Fig. 2c). The environmental data included historical precipitation for each site and organic matter content (%).
A Chi-square (x 2 ) test was performed, and multiple statistical parameters were calculated (i.e., goodness of fit, or GFI, comparative fit index, or CFI, standardized root mean square residual, or SRMR) to evaluate the model fit (cutoffs: x 2 P . 0.05, GFI . 0.8, CFI . 0.9, SRMR , 0.08). After selecting the model with the best fit to the data, R package semPaths (91) was used to create the SEM diagram with the result of path analysis in SEM with statistics. The connection and strength of the resolved paths in the SEM model were evaluated by path coefficients representing the change of dependent variable with a unit change in the explanatory variable and P values demonstrating the significance of the paths (P , 0.05 was considered significant in this study). The direct impact of one variable on another that was supported by the SEM was plotted as a directional arrow.
Data and code availability. Raw reads and the assembled data of the large metagenomes (TmG.1.0) were packaged and previously published (19). Replicate metagenomes (WA-TmG.2.0, KS-TmG.2.0, IA-TmG.2.0) along with the 14 complete viral genomes assembled from the large metagenomes (Iso-VIG14.1.0) have been deposited in the PNNL Project DataHub repository with DOIs (https:// data.pnl.gov/group/7/nodes/dataset/13708) and are available to download. The download link for each data set is included in Table S3. The detailed metadata are included in each data package. The R codes used in this study, including the details of graphical and statistical packages and the structural equation model, viral contigs screened from the metagenome assemblies and lysozyme HMMs, can be found at https://github.com/Ruonan0101/GrasslandVirome_PNNL.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.

ACKNOWLEDGMENTS
This work was supported by the Department of Energy (DOE) Office of Biological and Environmental Research (BER) and is a contribution of the scientific focus area "phenotypic response of the soil microbiome to environmental perturbations." PNNL is operated for the DOE by Battelle Memorial Institute under contract DE-AC05-76RL01830.
We also thank Rob Egan and Kathy Yelick from the Joint Genome Institute and NERSC, respectively, at the Lawrence Berkeley National Laboratory, and their funding through the DOE-supported ExaBiome Project for supplying computational resources for the metagenome assemblies.
J We declare no conflicts of interest.