Complementary Metagenomic Approaches Improve Reconstruction of Microbial Diversity in a Forest Soil

Microbial ecologists have historically used cultivation-based approaches as well as amplicon sequencing and shotgun metagenomics to characterize microbial diversity in soil. However, challenges persist in the study of microbial diversity, including the recalcitrance of the majority of microorganisms to laboratory cultivation and limited sequence assembly from highly complex samples. The uncultivated majority thus remains a reservoir of untapped genetic diversity. To address some of the challenges associated with bulk metagenomics as well as low throughput of single-cell genomics, we applied flow cytometry-enabled mini-metagenomics to capture expanded microbial diversity from forest soil and compare it to soil bulk metagenomics. Our resulting data from this pooled-cell sorting approach combined with bulk metagenomics revealed increased phylogenetic diversity through novel soil taxa and rare biosphere members. In-depth analysis of genomes within the highly represented Bacteroidetes phylum provided insights into conserved and clade-specific patterns of carbon metabolism.

Soil is considered to be among the most biologically diverse ecosystem types, and yet much of its microbial diversity remains poorly characterized (see, e.g., references 1 and 2). Each gram of soil is estimated to harbor 1,000 to 1,000,000 different bacterial species (see, e.g., references 3 to 7). Investigating soil microorganisms in situ is challenging due to the heterogeneous nature of the soil environment (see, e.g., references 8 to 10). As a result, terrestrial habitats remain immense reservoirs of untapped genetic and metabolic diversity (7,11) encoded within microbial communities that drive important ecosystem-level processes, including nitrogen cycling and carbon dioxide flux (12)(13)(14). Soils are regarded as critical for global health, as they contain 3,000 Pg of carbon and have the potential to act as either a carbon source or a carbon sink, which is important to consider under conditions of climatic shift (15,16). It is therefore essential to characterize soil microbial diversity to better understand ecosystem function and resilience in the face of rapid environmental change.
Historically, microbial diversity has been studied using laboratory cultivation techniques (17,18) with only a minute fraction of estimated bacterial diversity being successfully cultivated. Substantial efforts are being made to develop innovative cultivation techniques, including the ichip and droplet-based sorting coupled with laboratory cultivation (17,19). These approaches have contributed to expansion of diversity within novel families. However, cultivation-independent investigations may further our understanding of microbial diversity by facilitating description of novel higher taxonomic ranks. Thus, challenges associated with direct study of soil microorganisms have yielded a large knowledge gap regarding terrestrial microbial diversity. Due to limitations associated with cultivation, relatively few isolate genomes are available as references for soil microbes (20). From the publicly available Integrated Microbial Genomics (IMG/M) database (21), we were able to curate a collection of 3,024 isolate genomes, single amplified genomes (SAGs), and metagenome assembled genomes (MAGs) from previous soil studies. However, with soil estimated to contain 1,000 to 1,000,000 species per gram (9), these references represent only a small percentage of soil microbes.
In addition to culture-based approaches, amplicon studies have greatly contributed to our knowledge of microbial community structure (1,22). However, amplicon sequencing primers that target the small-subunit (SSU) rRNA gene may not adequately amplify some organisms due to primer biases through mismatches (22). Additionally, estimates of organismal abundance may be confounded by variation in gene copy number (23). Phylogenetically divergent taxa may be overlooked using PCR-based approaches, thereby hampering our ability to describe an expanded diversity of organisms (22). High-throughput sequencing technologies combined with novel metagenome binning algorithms (24,25) enable genome-resolved metagenomics studies and have greatly expanded the availability of reference genomes from uncultured taxa by circumventing challenges associated with cultivation- and amplicon-based studies (11,26,27). The more recent applications of directly sequencing DNA from soil microbial communities allow one to obtain a broader perspective on the taxonomic and functional potential of soil microorganisms. However, metagenomics in highly diverse environments may capture only the most abundant and therefore best-assembling representatives from the total community (28)(29)(30), and population heterogeneity can hamper the efficiency of assembly, even of abundant microorganisms (31).
Population microheterogeneity of closely related strains within microbial communities makes the separation of individual strains challenging (32). Soils are typically dominated by a small set of highly abundant taxa (12), and the rare biosphere may therefore be overlooked in metagenomic studies despite playing an important role in soil biogeochemical processes (33). Lastly, bulk metagenomics can also include extracellular DNA from dead microorganisms, which may be abundant in the environment. The presence of this exogenous DNA has the potential to inflate estimates of diversity and genomic potential (34)(35)(36) and to further reduce our ability to assemble sequences from rare taxa. Decoupling intracellular and exogenous DNA during sequencing may provide a more accurate estimate of microbial diversity (36).
Challenges associated with bulk metagenomics may be mitigated by reducing community complexity. The most extreme example involves the application of fluorescence-activated cell sorting (FACS) for separating communities into single cells for single-cell genomics, which provides genomic information with strain-level resolution (37)(38)(39). However, the resulting SAG assemblies are often highly fragmented and incomplete, and the overall process is prone to biases and contamination. In order to circumvent some of the challenges associated with bulk metagenomics and single-cell genomics, we applied a pooled-cell sorting approach coupled to shotgun sequencing, termed mini-metagenomics, to forest soils collected from the Barre Woods soil warming experiment at the Harvard Forest Long-Term Ecological Research (LTER) site. This mini-metagenomic approach separates a researcher-defined number of cells from the larger community, which then undergo lysis and whole-genome multiple-displacement amplification (MDA), followed by sequencing.
Prior to the application of cell sorting to Harvard Forest soil in this study and in that by Schulz et al. (40), mini-metagenomics analysis of microorganisms had been used only in aqueous environments, including hot springs, hospital sink biofilms, and activated sludge (40)(41)(42)(43)(44). Mini-metagenomics has higher throughput than single-cell genomics, providing the opportunity to capture more diversity than is possible with single-cell sequencing. Mini-metagenomics may enable investigation of different components of the soil community in comparison to bulk metagenomics, including cells that can be dissociated from particles, and cells with susceptibility to the single-cell lysis step. The use of two overlapping metagenomic methods may allow us to capture a broader taxonomic diversity than the use of only one approach on its own. Additionally, cell sorting using FACS requires cells to be intact in order to be sorted, thereby minimizing challenges introduced by extracellular DNA in bulk soil samples. Using mini-metagenomics to reduce the number of cells relative to bulk metagenomics may decrease the number of genomes collapsed into a single MAG (41). Hence, we evaluated this method as a tool to complement bulk metagenomics in uncovering the "microbial dark matter" in soil.
Here, we combined mini-metagenomics and bulk metagenomics as complementary approaches for capturing a more holistic perspective of microbial community diversity. We discovered additional diversity of uncultivated microorganisms in a forest soil microbial community and thus contribute to the known diversity of both major soil clades and understudied taxonomic groups, which can be used as reference sequences in future studies. Additionally, we provide an example of how the mini-metagenomics and bulk metagenomic approaches can be used in complement to investigate potential metabolism and ecological roles of microorganisms. Separation of intact cells from soil via FACS enabled mini-metagenomic sequencing, while bulk metagenomics provided total community context for benchmarking. Our approach generated 200 sorted-MAGs and 29 bulk metagenome MAGs of medium quality, expanding the known phylogenetic diversity (PD) of soil clades. Our data suggest that the sorted-MAGs represent some of the diversity of previously unsequenced organisms that are challenging to access using bulk approaches, offering insights into the functional potential of soil dark matter.

RESULTS AND DISCUSSION
Improved assembly and binning from mini-metagenomes. Our application of mini-metagenomics combines microbial cell sorting and metagenome sequencing in order to divide a complex soil community into many smaller, less complex subsets. We performed FACS on pools of cells from four soil samples collected from the Barre Woods experimental warming plots at the Harvard Forest Long-Term Ecological Research (LTER) site. From each of the four samples we sequenced 90 replicate pools of 100 cells for a total of 359 mini-metagenomes (one mini-metagenome failed quality control standards). In conjunction with mini-metagenomic sequencing, we performed bulk metagenomics on these four soils, generating totals of 1.2 Gbp and 1.3 Gbp, respectively (Fig. 1; see also Table S1 in the supplemental material). [...] contamination, and ≤10% strain heterogeneity. We considered MAGs with less than 50% completeness to represent "low quality" based on MIMAG standards (46) and excluded them from additional analyses (Fig. 2; see also Fig. S1). Overall, quality filtering removed lower-quality sorted-MAGs on the basis of completeness, whereas bulk MAGs were removed due to higher degrees of contamination and strain-level heterogeneity. Assessment of MAG quality using CheckM showed average percent completeness of 81.5% in medium-quality bulk metagenome MAGs (n = 29), which was higher than the 61.9% seen with the medium-quality sorted-MAGs (n = 200; P = 3.29 × 10^-7) (Fig. 2; see also Fig. S1). Assessed for marker gene contamination, bulk metagenome MAGs had an average estimated contamination of 1.92%, higher than the average of 0.98% in the sorted-MAGs (P = 0.01117) (Fig. 2; see also Fig. S1). Analysis of strain-level heterogeneity across medium-quality MAGs and sorted-MAGs revealed a lower degree of multiple strain contamination in sorted-MAGs than in bulk MAGs as assessed by CheckM (45).
The average level of strain heterogeneity for the bulk MAGs was 1.16%, compared to 0.04% in the sorted-MAGs (P = 3.89 × 10^-6; Table S2). This decrease in strain heterogeneity seen using mini-metagenomics indicates that sorted-MAGs collapse fewer strains into a single MAG.
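The quality tiers described above can be expressed as a simple filter over CheckM-style estimates. The following is a minimal sketch rather than the pipeline used in this study: the record fields and bin names are hypothetical, and the 10% contamination cutoff is an assumption borrowed from the MIMAG medium-quality definition, since only the 50% completeness floor and the ≤10% strain heterogeneity cutoff are stated in the text.

```python
from dataclasses import dataclass


@dataclass
class BinStats:
    name: str
    completeness: float          # percent, CheckM-style estimate
    contamination: float         # percent
    strain_heterogeneity: float  # percent


# 50% completeness and 10% strain heterogeneity come from the text.
# The 10% contamination cutoff is an ASSUMPTION (MIMAG medium quality).
MIN_COMPLETENESS = 50.0
MAX_CONTAMINATION = 10.0
MAX_STRAIN_HETEROGENEITY = 10.0


def is_medium_quality(b: BinStats) -> bool:
    """Return True if a genome bin passes all three thresholds."""
    return (
        b.completeness >= MIN_COMPLETENESS
        and b.contamination <= MAX_CONTAMINATION
        and b.strain_heterogeneity <= MAX_STRAIN_HETEROGENEITY
    )


def filter_bins(bins):
    """Keep only bins that qualify as at least medium quality."""
    return [b for b in bins if is_medium_quality(b)]


# Hypothetical bins for illustration only (not data from the study).
example_bins = [
    BinStats("sorted_bin_a", 61.9, 0.98, 0.0),
    BinStats("sorted_bin_b", 42.0, 0.50, 0.0),  # fails completeness
    BinStats("bulk_bin_c", 81.5, 14.0, 25.0),   # fails contamination and heterogeneity
]
kept = [b.name for b in filter_bins(example_bins)]
print(kept)
```

Keeping the thresholds as named constants makes it easy to rerun the filter under a stricter definition, for example a high-quality tier.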
As one measure to compare mini-metagenomics and bulk metagenomics methods, we assessed GC content and found averages of 49.2% GC and 60.5% GC in sorted-MAGs and MAGs, respectively (Fig. S1; see also Table S2). Variation in GC content can be attributed to known biases in the single-cell workflow such as susceptibility of cells to sorting and lysis (37,47), as well as amplification bias introduced during MDA (48). The cell isolation method used in mini-metagenomics reduces inflation of community diversity as a result of exogenous DNA. Additionally, the difference in DNA extraction procedures between mini-metagenomics and bulk metagenomics represents an opportunity to capture an expanded diversity of microorganisms, as each approach may access a different component of the community. Taking the data together, mini-metagenomics and bulk metagenomics generated a large number of quality MAGs that can be used as complementary data sets in genome-resolved studies to investigate broad microbial diversity.
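For readers who wish to reproduce a GC comparison of this kind, the genome-level GC percentage can be computed by pooling base counts across all contigs of a bin. This is an illustrative sketch under the assumption that ambiguous bases (such as N) are excluded from the denominator; the function name and example contigs are hypothetical.

```python
def genome_gc_percent(contigs):
    """Length-weighted GC percentage over all contigs of one genome bin.

    Ambiguous bases (anything other than A, C, G, T) are ignored,
    which is an assumption of this sketch.
    """
    gc = 0
    acgt = 0
    for seq in contigs:
        seq = seq.upper()
        g_and_c = seq.count("G") + seq.count("C")
        a_and_t = seq.count("A") + seq.count("T")
        gc += g_and_c
        acgt += g_and_c + a_and_t
    if acgt == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / acgt


# Hypothetical contigs for illustration only.
example_contigs = ["ATGC", "GGCC", "ATNN"]
gc_value = genome_gc_percent(example_contigs)
print(gc_value)
```

Pooling counts before dividing weights each contig by its length, so short contigs do not distort the bin-level estimate.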
Expansion of phylogenetic diversity. As one aim of our study was to provide reference genomes that represent soil microbiome diversity, we evaluated the contribution of both sorted-MAGs and bulk MAGs to phylogenetic diversity in the context of previously published genomes of soil bacteria and archaea. We inferred the phylogenetic relationships using concatenated marker genes from the 200 sorted-MAGs, the 29 bulk MAGs, and 3,024 soil microbe reference genomes from the IMG/M (Fig. 3A) (21). For this analysis, we clustered sequences at 95% average nucleotide identity (ANI) to estimate distinct species-level lineages, resulting in 170 sorted-MAGs, 25 bulk MAGs, and 2,341 reference MAGs and isolate genomes from IMG/M (Fig. 3A; see also Fig. S2). This small decrease in the number of MAGs as a result of clustering indicates very little redundancy between previous MAGs and available reference sequences. Sorted and bulk MAGs from this study contributed genome diversity across numerous soil clades, including Alphaproteobacteria (16 sorted-MAGs, 2 bulk MAGs), Acidobacteria (11 sorted-MAGs, 14 bulk MAGs), and Planctomycetes (2 sorted-MAGs, 1 bulk MAG). Sorted and bulk MAGs also contributed diversity to less-abundant soil taxa, including TM6 (6 sorted-MAGs, 1 bulk MAG) and Betaproteobacteria (3 sorted-MAGs, 1 bulk MAG).
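Clustering genomes at a 95% ANI threshold can be illustrated with single-linkage grouping over precomputed pairwise ANI values. This sketch is an assumption about one reasonable way to do it and does not reproduce the tool or linkage rule used in this study; the genome names and ANI values are hypothetical.

```python
def cluster_by_ani(genomes, ani_pairs, threshold=95.0):
    """Single-linkage clustering of genomes from pairwise ANI values.

    genomes   -- iterable of genome identifiers
    ani_pairs -- dict mapping (genome_a, genome_b) to percent ANI
    threshold -- pairs at or above this ANI are placed in one cluster
    """
    parent = {g: g for g in genomes}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), ani in ani_pairs.items():
        if ani >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical genomes and ANI values for illustration only.
example_genomes = ["mag1", "mag2", "mag3", "ref1"]
example_ani = {
    ("mag1", "mag2"): 97.2,
    ("mag2", "mag3"): 88.0,
    ("mag3", "ref1"): 95.0,
}
ani_clusters = cluster_by_ani(example_genomes, example_ani)
print(ani_clusters)
```

The number of resulting clusters is the count of species-level lineages under this definition; one representative per cluster would then be carried into the tree.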
Comparison of MAGs recovered through mini-metagenome and bulk metagenomics revealed a broad diversity of soil bacteria, as well as a few archaeal taxa, and demonstrated the complementarity of these approaches for biological discovery. The sorted-MAGs expanded the known diversity of the taxa which were previously found to be abundant and ubiquitous across soil types (49), as well as of the taxa considered part of the rare biosphere that may still be widespread but remain at relatively low abundances in microbial communities (33). The more abundant taxa represented by the sorted-MAGs include Bacteroidetes (n = 48) and Verrucomicrobia (n = 8), while the taxa with typically lower abundances in soils included Thaumarchaeota (n = 4), Omnitrophica (n = 3), Ignavibacteria (n = 2), Melainabacteria (n = 1), and Firestonebacteria (n = 1) (1,49). Interestingly, numerous sorted-MAGs belonged to lineages typically comprised of pathogens and endosymbionts such as the Chlamydiae (n = 31) and Gammaproteobacteria, specifically within the order Legionellales (n = 30), as well as TM6 (n = 7) (50-53) (Fig. 3A; see also Fig. S3). Genomes in the phylum Chlamydiae and in the order Legionellales within the class Gammaproteobacteria are considered entirely intracellular (54,55). The phyla identified by sorted-MAGs represented abundant taxa found in previous soil community studies (1,49,56) in addition to the rare biosphere, demonstrating the utility of mini-metagenomics for expanding diversity beyond abundant soil taxa (Fig. 3; see also Fig. S3).
Figure legend (continued). Gray represents the total branch length contributed by soil reference sequences from the IMG database. Orange bars represent total branch length from sorted-MAGs, and blue represents branch length from bulk MAGs. The percentage of increase in phylogenetic diversity from this study is shown next to each bar.
As for the bulk MAGs, some of these belonged to rare taxa not recovered through mini-metagenomics, including WPS-2 (n = 3), Euryarchaeota (n = 1), and Saccharibacteria (n = 1). We assessed phylogenetic diversity (PD), the total amount of branch length contributed by sequences of interest within a phylogenetic tree, in the sorted-MAGs to determine the contribution of this single study to the known range of microbial diversity. Calculation of phylogenetic diversity revealed a 7.2% increase in total branch length contributed by the sorted-MAGs in relation to the soil reference sequences from IMG/M (Fig. 3B). Mini-metagenomes expanded the range of available evidence not only of phylogenetic diversity within clades of known soil bacteria and archaea but also of candidate phyla and low-abundance taxa typically found in forest soils. More specifically, the sorted-MAGs increased the branch lengths of well-studied bacterial groups, including Bacteroidetes (33.6%) and Alphaproteobacteria (19.4%), along with those of groups notoriously recalcitrant to laboratory cultivation, such as TM6 (62.1%), Verrucomicrobia (41.3%), and Acidobacteria (28.0%) (42,57). Most notable was the PD increase in the Chlamydiae (72.5%), a taxonomic group which is typically overlooked in soil metagenomic studies due to their low abundance and likely dependence on eukaryotic host cells (58). We hypothesize that the application of mild detergent and syringe filtration during sample processing may have lysed the microbial eukaryotes that serve as hosts for bacterial endosymbionts, making these bacteria more accessible for FACS. A similar phenomenon was suggested for the detection of 16 novel giant viruses from these same samples (40), as these viruses are most often associated with eukaryotic host cells (59). The hypothesized liberation of these intracellular bacteria makes mini-metagenomic sequencing a useful tool for investigating the diversity and evolution of the intracellular life strategy (55,60).
The sorted-MAGs demonstrated the potential for mini-metagenomics to increase our knowledge of diversity beyond what can be achieved using MAGs from bulk metagenome studies alone. The bulk MAGs contributed to the phylogenetic diversity of many of the same clades of soil bacteria as the sorted-MAGs, including Acidobacteria (10.5%), TM6 (6.8%), and Alphaproteobacteria (2.6%). Even in clades where more bulk-derived genomes were added than sorted-MAGs, such as in Acidobacteria, the sorted-MAGs were phylogenetically more diverse. These calculated increases in phylogenetic diversity with the addition of MAGs from this study are limited with regard to scope, as not all available reference sequences are publicly accessible in the IMG/M database. However, this database is updated monthly with newly uploaded sequences from GenBank (21).
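The percent increases in phylogenetic diversity reported in this section follow from comparing total branch length with and without the new genomes. The sketch below illustrates that calculation on a toy rooted tree; the dictionary-based tree representation, the convention of summing branches on tip-to-root paths, and all names and branch lengths are assumptions made for illustration and are not taken from the study.

```python
def phylogenetic_diversity(tree, tips):
    """Sum the lengths of all distinct branches on paths from tips to the root.

    tree -- dict mapping each non-root node to (parent, branch_length)
    tips -- iterable of tip names to include
    """
    counted = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Walk toward the root, stopping at a branch that was already counted.
        while node in tree and node not in counted:
            parent, length = tree[node]
            counted.add(node)
            total += length
            node = parent
    return total


def percent_pd_increase(tree, reference_tips, new_tips):
    """Percent gain in branch length when new tips join the reference set."""
    base = phylogenetic_diversity(tree, reference_tips)
    combined = phylogenetic_diversity(tree, list(reference_tips) + list(new_tips))
    return 100.0 * (combined - base) / base


# Toy tree for illustration only: root -> n1 -> (ref_a, ref_b); root -> n2 -> (ref_c, new_mag)
toy_tree = {
    "n1": ("root", 0.2),
    "n2": ("root", 0.4),
    "ref_a": ("n1", 0.3),
    "ref_b": ("n1", 0.1),
    "ref_c": ("n2", 0.2),
    "new_mag": ("n2", 0.2),
}
reference = ["ref_a", "ref_b", "ref_c"]
base_pd = phylogenetic_diversity(toy_tree, reference)
pd_increase = percent_pd_increase(toy_tree, reference, ["new_mag"])
print(round(base_pd, 3), round(pd_increase, 1))
```

Because shared internal branches are counted once, a new genome that is a close relative of an existing reference adds little branch length, whereas a deeply branching genome adds much more.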
Complementarity of mini-metagenomics and bulk metagenome sequencing. Mini-metagenomics has not been widely applied in soils to date and will serve as a valuable tool for expanding our knowledge of soil biodiversity. In this study, we applied both bulk metagenomics and mini-metagenomics to compare analyses of complex community samples as well as to identify the advantages and disadvantages of each. This approach is capable of generating higher-quality MAGs than bulk metagenomics due to the reduction of strain-level microheterogeneity when selected pools of cells are sequenced (32). Although they are lower in estimated genome completeness than bulk MAGs, sorted-MAGs from soil also demonstrate a lower degree of strain heterogeneity, indicating that fewer genomic fragments from multiple organisms have been collapsed into a single genome bin (45) (Table S2). The sorted-MAG reduced genome completeness is, at least in part, a likely result of uneven whole-genome amplification (WGA), as has been extensively reported in single-cell genomic studies (47). The larger number of sorted-MAGs presents opportunities for improved resolution for taxonomic classification and for genome-informed investigations of microbial metabolism and linking the potential metabolism to processes at the ecosystem level. Taxonomic classification of organisms using high-quality MAGs has become a critical approach for expanding knowledge of microbial diversity, given that we currently lack information for the majority of uncultivated organisms (61). Finally, although not applied in this study, FACS-based sample processing may be modified to achieve cell and/or population separation that is more highly targeted (62), thereby further expanding the utility of mini-metagenomics to detect microbial dark matter.
Although the mini-metagenomics approach produced a greater number of medium-quality genome bins than bulk metagenomics, this approach is not without challenges. In comparison to bulk metagenomics, the requirements associated with mini-metagenomics may be prohibitive, as it involves equipment and expertise that may not be easily accessible. In addition to logistical obstacles, methodological challenges, including cell isolation and GC-based genome amplification skew, likely introduce bias during sample processing. The formation of extracellular polysaccharides is a strategy widely used by microorganisms to protect against changes in the environment, as well as for exchange of nutrients and materials (63). These matrices may support the maintenance of stable microbial consortia and cellular adhesion to soil particles (63). These larger aggregate structures are subject to exclusion in sample preparation steps, including filtration, prior to FACS. Methodological challenges such as these may be reflected in our data, where organisms which are typically abundant in forest soils, such as Actinobacteria, Chloroflexi, and Firmicutes (49), were present in low numbers using mini-metagenomics compared to traditional bulk metagenomics (Fig. 3; see also Fig. S3). Though these taxa might have been missed due to the aforementioned biases, it is also possible that sequences from these organisms were not binned or were placed in a lower-quality bin based on our filtering threshold. For example, bacteria in the phylum Spirochaetes were represented by 47 distinct sorted-MAGs; however, none of these passed quality filtering standards and all were therefore excluded (Fig. 3; see also Fig. S3). An alternative DNA amplification method, termed WGA-X, has been developed which improves cell lysis and amplification of high-GC-content organisms over MDA (48). With this improved method of DNA amplification, more extensively representative mini-metagenomic sampling might be possible.
Bulk metagenomics presents fewer opportunities to introduce bias and may more accurately capture the total soil community than the mini-metagenomic approach. Using bulk metagenomics, DNA from the total soil sample is extracted and sequenced, which circumvents cell and particle size selection introduced via FACS. Thus, bulk metagenomics remains an invaluable tool for understanding the diversity of microbial communities, particularly that of the dominant soil microorganisms. Sorted-MAGs, however, provided additional genomic data covering broader phylogenetic diversity compared to the bulk MAGs, further enhancing biological discovery. The scientific question of interest should guide the selection of one approach over the other. We support the use of both approaches in complement to one another in order to capture the broadest scope of soil microbial diversity.
Representation of sorted-MAGs and MAGs across terrestrial soil metagenomes. To assess the representation of our newly generated soil reference genomes across other terrestrial ecosystems, we searched for protein coding sequences from our collection of sorted-MAGs and MAGs across publicly available soil metagenomes from 80 terrestrial metagenome studies. For this analysis, we dereplicated the 200 sorted-MAGs and 29 bulk MAGs from this study by clustering at 95% average nucleotide identity without reference sequences, resulting in 173 sorted-MAGs and 28 bulk MAGs as cluster representatives (Fig. 4; see also Fig. S2). We assessed these sorted-MAGs and bulk MAGs in the context of broader terrestrial community studies by comparing them against 2,210 metagenomes from the 80 terrestrial studies using LAST (64) (Fig. 4; see also Table S3). We defined highly represented sorted-MAGs and MAGs as those with at least 200 protein coding sequences with hits to metagenome samples at ≥95% amino acid identity (AAI) over a 70% alignment length (65,66).
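The "highly represented" criterion can be sketched as a filter over protein-level hits followed by a per-genome count. The tuple layout below is hypothetical and is not the LAST output format, and interpreting "70% alignment length" as the aligned fraction of the query protein is an assumption of this sketch.

```python
from collections import defaultdict


def highly_represented(hits, min_identity=95.0, min_aligned_fraction=0.70,
                       min_proteins=200):
    """Return genomes with enough distinct proteins matching metagenomes.

    hits -- iterable of (genome_id, protein_id, percent_identity, aligned_fraction)
            tuples; this layout is hypothetical, not a real aligner format.
    """
    matched = defaultdict(set)
    for genome_id, protein_id, identity, aligned_fraction in hits:
        if identity >= min_identity and aligned_fraction >= min_aligned_fraction:
            # A set ensures each protein is counted once per genome,
            # even if it hits several metagenomes.
            matched[genome_id].add(protein_id)
    return sorted(g for g, proteins in matched.items()
                  if len(proteins) >= min_proteins)


# Synthetic hits for illustration only.
example_hits = (
    [("mag_a", f"a{i}", 96.0, 0.80) for i in range(200)]    # 200 passing proteins
    + [("mag_b", f"b{i}", 90.0, 0.90) for i in range(250)]  # identity too low
    + [("mag_c", f"c{i}", 99.0, 0.75) for i in range(199)]  # one protein short
    + [("mag_c", "c0", 99.0, 0.75)]                         # duplicate hit, not counted twice
)
represented = highly_represented(example_hits)
print(represented)
```

Counting distinct proteins rather than raw hits keeps a single conserved gene that matches many metagenomes from inflating a genome's apparent representation.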
Some of our sorted-MAGs and MAGs detected in previous metagenomic soil investigations were members of the phylum Acidobacteria (10 sorted-MAGs and 15 MAGs) (Fig. 4; see also Table S3). Five bulk MAGs in the phylum Proteobacteria were detected in metagenomes from forest, agricultural, arctic, grassland, and vadose zone soils, whereas two bulk MAGs in candidate division WPS-2 were detected in metagenomes from Harvard Forest and other forest soil metagenomes, as well as arctic and surface soils. Interestingly, one MAG in the Planctomycetes was detected only in metagenome sequences from the Harvard Forest, indicating that this may represent a unique MAG which had not been found in previous terrestrial metagenome studies.
The phylum Acidobacteria was the most abundant phylum represented in the bulk MAGs (77%) and unbinned metagenome data (32%), compared to the sorted-MAGs (8.5%) (Fig. S3). In contrast, the phylum Bacteroidetes was highly represented by the sorted-MAGs (55.5%), compared to the bulk metagenome MAGs (0.1%) and unbinned metagenome data (3.8%) (Fig. S3). The sorted-MAGs in the phylum Bacteroidetes increased the phylogenetic diversity of this group by 33.6% (Fig. 3) and appeared to be novel as they had a relatively low number of matches to protein coding sequences from publicly available soil metagenomes, with only 6 of 67 Bacteroidetes MAGs having similarity of at least 200 coding sequences with published soil metagenomes (Fig. 4). This presumed novelty could also contribute to computational challenges associated with sequence assembly, as only the most abundant taxa are overrepresented in public databases (29). And yet many of these sorted and bulk MAGs were not represented in previous Harvard Forest metagenomes (Fig. 3). Taking the data together, the low level of representation of our Bacteroidetes sorted-MAGs across previously published metagenome samples illustrates the expanded biodiversity gained through the use of mini-metagenomes, demonstrating the utility of this approach for accessing the rare taxa within phylogenetically diverse samples.
Figure legend (continued). [...] Table S3. The figure was rendered using iTOL (96).
Biological insights into carbon metabolism in soil Bacteroidetes. Bacteroidetes spp. make up ~10% of the total microbial community in soils (1), and yet most of our knowledge about members of this phylum stems from sequenced isolates from vertebrate guts and aquatic habitats (67)(68)(69). Bacteria in the phylum Bacteroidetes are known to be important degraders of polysaccharides; however, little is known about the role of this abundant group in soils. Given the relatively small body of work on soil Bacteroidetes and the substantial contribution of 67 putatively novel sorted-MAGs from this study to phylogenetic diversity estimates (Fig. 3; see also Fig. 5), we further explored these sorted-MAGs from Bacteroidetes to gain insight into their physiological potential and assess functional similarities to previously known Bacteroidetes.
The genome sizes of the sorted-MAGs ranged from 1.6 to 5 Mb (Table S4), which is within the range of previously reported Bacteroidetes genome sizes, from 0.9 Mb (Cardinium endosymbiont) (70) to 9.1 Mb (Chitinophaga pinensis) (71). The finding of smaller genome sizes of the sorted-MAGs was likely due to genome completeness estimates, which ranged from 50% to 80.5% based on analysis of CheckM marker genes (Fig. 5; see also Fig. S4) (45). These sorted-MAGs were distributed across three distinct families, including Cytophagaceae, Chitinophagaceae, and Sphingobacteriaceae, as well as a clade of unclassified sorted-MAGs (Fig. 5). Bacteroidetes are known to have a large set of genes that encode enzymes for carbohydrate degradation (69), including a broad array of glycoside hydrolases that are phylogenetically conserved (72). The distribution of CAZy gene families across these Bacteroidetes taxa exhibited clade-specific abundance patterns of glycoside hydrolases, glycosyl transferases, and carbohydrate binding modules (Fig. 5; see also Table S4) (73).
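Clade-level comparisons of this kind rest on the share of each genome's genes assigned to each CAZy class. A minimal sketch of that tabulation is shown below, assuming annotations arrive as CAZy family labels such as "GH5" or "CBM50"; the function name, the example annotations, and the gene total are hypothetical.

```python
import re
from collections import Counter

# The six CAZy classes: glycoside hydrolases, glycosyl transferases,
# polysaccharide lyases, carbohydrate esterases, auxiliary activities,
# and carbohydrate binding modules.
CAZY_CLASSES = ("GH", "GT", "PL", "CE", "AA", "CBM")
_FAMILY_RE = re.compile(r"^(GH|GT|PL|CE|AA|CBM)\d+")


def cazy_class_percentages(family_annotations, total_genes):
    """Percent of a genome's genes assigned to each CAZy class.

    family_annotations -- one CAZy family label per annotated gene
    total_genes        -- number of protein coding genes in the genome
    """
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    counts = Counter()
    for family in family_annotations:
        match = _FAMILY_RE.match(family)
        if match:  # labels that are not CAZy families are skipped
            counts[match.group(1)] += 1
    return {c: 100.0 * counts[c] / total_genes for c in CAZY_CLASSES}


# Hypothetical annotations for one genome (illustration only).
example_families = ["GH5", "GH18", "GH43", "GT2", "GT4", "CBM50"]
cazy_pct = cazy_class_percentages(example_families, total_genes=200)
print(cazy_pct)
```

Normalizing by the total gene count rather than reporting raw counts allows genomes of different sizes and completeness to be compared on the same scale.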
Sorted-MAGs within the Cytophagaceae family appeared to be specialized for polymeric carbon degradation, namely, degradation of cellulose, as they encode proteins in glycoside hydrolase family 5 which exhibit endocellulase activity (74,75). In contrast, members of the Chitinophagaceae and Sphingobacteriaceae families appeared to be generalists in carbon utilization. More specifically, the Chitinophagaceae sorted-MAGs harbored the potential to use cellulose, hemicellulose, and chitin. Seventeen of the 27 sorted-MAGs in the Chitinophagaceae family contained at least one chitinase in glycoside hydrolase family 18 or 19 (76) along with cellulases in glycoside hydrolase families 5, 8, and 9 and glycoside hydrolases in family 43 that may degrade hemicellulose and pectin (77) (Fig. 5; see also Fig. S5). In support of this conjecture, the sequenced genome of Chitinophaga pinensis (a member of the Chitinophagaceae family) contains genes to degrade leaf matter and fungal structures, suggesting its ability to degrade both cellulose and chitin (78). Twenty sorted-MAGs belonged to the family Sphingobacteriaceae and typically harbored the potential to degrade cellulose, xylan, and chitin, with GH families 2, 3, 5, 13, 18, and 20 being the most abundant across sorted-MAGs in this group. Interestingly, one sorted-MAG (Q3300020668_2) had the highest number of glycoside hydrolase genes within the Sphingobacteriaceae (125 annotated glycoside hydrolases), representing a diverse array of carbohydrate degradation capabilities and potential metabolic flexibility. This is consistent with previous investigations describing the family Sphingobacteriaceae as capable of degrading diverse polysaccharides (79).
Putatively novel Bacteroidetes sorted-MAGs stemming from experimental warming plots at the Harvard Forest Long-Term Ecological Research site spanned three different taxonomic families and harbored an extensive diversity of enzyme families, including those involved in hydrolysis of polymeric chitin, cellulose, and hemicellulose substrates. The genomic potential to utilize these labile carbon compounds is consistent with previous metagenomic investigations in soils of warmed plots (16,80). Interestingly, the number of identified carbohydrate active enzyme genes increased with genome size for each of the six CAZy categories (Fig. 5; see also Fig. S4), illustrating that these organisms accumulated the capacity to degrade various carbohydrates, thereby expanding their niche for carbohydrate utilization in soil. And yet 17 sorted-MAGs belonged to an unclassified clade of Bacteroidetes spp. which were depleted in glycoside hydrolases and carbohydrate binding modules but retained a high number of glycosyl transferases (Fig. 5; see also Fig. S5), suggesting a limited role for these organisms in substrate decomposition. Rather, the relatively higher abundance of glycosyl transferase genes involved in the formation of glycosidic bonds may indicate that these organisms are responsible for synthesis of higher-molecular-weight compounds and may depend on living in close association with other organisms.
To further support the role of the Bacteroidetes in polymeric carbon degradation in soils, we investigated specific carbohydrate degradation using the KEGG database (81,82) and predicted the completeness of metabolic pathways using KEGG-Decoder (83). The majority of sorted-MAGs in Sphingobacteriaceae and Chitinophagaceae have nearly complete pathways coding for alpha-amylase, beta-glucosidase, chitinase, and diacetylchitobiose deacetylase activity, further supporting the idea of a role of these organisms as generalists in polysaccharide degradation (Fig. S6). Additionally, seven of the sorted-MAGs within Sphingobacteriaceae contain nearly complete pathways for pullulanase. Consistent with analysis of carbohydrate degradation potential using the CAZy database (Fig. S5), 22 of the sorted-MAGs were found to contain only one complete pathway or no complete pathways for polymeric carbohydrate degradation (Fig. S6). This limited potential for carbohydrate utilization does not correlate with decreased genome completeness (Fig. S4). Rather, we hypothesize that these sorted-MAGs have an alternative survival strategy in the soil environment similar to those exhibited by other Bacteroidetes, including "Candidatus Amoebophilus asiaticus" (84), Cardinium sp. (85), "Candidatus Sulcia muelleri" (86), and Blattabacterium sp. (87), which are known symbionts (Fig. 5).
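Pathway completeness of the kind reported here can be thought of as the fraction of required steps for which a genome encodes at least one suitable gene. The sketch below illustrates that idea only; it is not the KEGG-Decoder implementation, and the step definitions and ortholog identifiers are placeholders rather than real KEGG entries.

```python
def pathway_completeness(steps, genome_orthologs):
    """Fraction of pathway steps covered by a genome's annotated orthologs.

    steps            -- list of sets; each set holds the alternative orthologs
                        that can carry out one step of the pathway
    genome_orthologs -- set of ortholog identifiers annotated in the genome
    """
    if not steps:
        raise ValueError("pathway must define at least one step")
    # A step is satisfied when any one of its alternatives is present.
    satisfied = sum(1 for alternatives in steps if alternatives & genome_orthologs)
    return satisfied / len(steps)


# Placeholder pathway and annotations (hypothetical identifiers).
example_pathway = [
    {"KO_1"},
    {"KO_2", "KO_3"},  # either ortholog satisfies this step
    {"KO_4"},
    {"KO_5"},
]
example_genome = {"KO_1", "KO_3", "KO_5"}
completeness = pathway_completeness(example_pathway, example_genome)
print(completeness)
```

Because the sorted-MAGs are themselves incomplete, a missing step under this measure may reflect a gap in the assembly rather than a true absence, which is why the text compares pathway results against genome completeness.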
Similarly to known symbionts, the estimated GC contents of unclassified sorted-MAGs in this study were low relative to those of other Bacteroidetes sequences, with an average of 39.97% GC (88). These unclassified Bacteroidetes demonstrate limited ability for carbon utilization and reduced central carbon metabolism and chemotaxis (Fig. S6) while retaining genome sizes of 2.4 Mb on average, which are comparable to those of Bacteroidetes previously identified as host-associated species (Fig. 5; see also Fig. S5 and S6). Symbionts may undergo the process of reduction in genome size when in contact with the host organism, resulting in a linear relationship between the number of protein coding genes contained and the size of the genome (87)(88)(89). The abundance of unclassified Bacteroidetes within this study may represent further evidence of the liberation of symbionts from host cells and vacuoles prior to FACS. Alternatively, the relatively low carbohydrate degradation potential of sorted-MAGs within the unclassified clade may be indicative of an opportunistic life strategy (74).

FIG 5 Legend (Continued)
Chitinophagaceae, and Sphingobacteriaceae, and a clade of unclassified sorted-MAGs. Column B shows genome sizes, with the darkest color representing the largest genome of 9.1 megabases and the lightest representing a genome size of 0.6 megabases. Column C shows genome completeness based on CheckM marker genes, ranging from 50% to 80.5%, as a color gradient. Reference sequences represent isolates with complete genomes. Column D presents genome GC content as a color gradient that ranges from 21.13% to 61.24%. In columns E to G, percentages of genes annotated as glycoside hydrolases (column E), glycosyl transferases (column F), and carbohydrate binding modules (column G) are illustrated as bar charts with vertical lines denoting 0% and 50% of annotated genes. Bacteroidetes with known symbiotic relationships are indicated with an asterisk. The collapsed clade contains Sulcia muelleri, a known symbiont of sap-feeding insects, and Blattabacterium sp., a known symbiont of the cockroach Blattella germanica.
Conclusions. This application of mini-metagenomics and bulk metagenomics has demonstrated the utility of these complementary techniques for biological discovery within the complex soil ecosystem. Using mini-metagenomics to reduce the number of cells prior to sequencing, we uncovered bacterial and archaeal soil diversity that could not be accessed using bulk metagenomics alone. Mini-metagenomics is a powerful tool for the discovery of rare biosphere organisms and potential endosymbionts, revealing biodiversity in dominant soil groups as well as in low-abundance taxa. Taken together, the mini-metagenomics and bulk metagenomics approaches allow us to probe deeper into microbial diversity and function within heterogeneous environments beyond soil.

MATERIALS AND METHODS
Sample collection and incubation. Soils were collected on the 24th of May 2017 from the Barre Woods long-term experimental warming plots located at the Harvard Forest Long Term Ecological Research (LTER) site in Petersham, MA, USA. This site consists of two 30-by-30-m plots: one which has remained at ambient soil temperature and one that has been artificially warmed since 2002 using heating cables buried at 10-cm depth (90). Soil respiration, nitrogen mineralization, and vegetation cover and growth as well as soil and litter chemistry have been measured over the course of the long-term experiment. The canopy overstory is dominated by paper birch and black birch (Betula papyrifera and B. lenta, respectively), red maple (Acer rubrum), black oak and red oak (Quercus velutina and Q. rubra, respectively), and American beech (Fagus grandifolia) (56).
Two intact soil cores were taken from subplots within the larger 30-by-30-m experimental plots, including a subplot within heated plot 2 and a subplot within control plot 12. The subplots included in this study were chosen at random. The collected soil cores were separated into organic (approximately top 5 cm of soil core) and mineral (lower 5 cm of soil core) horizons by visual inspection and were sieved with a 2-mm-pore-size mesh, resulting in a total of 4 individual soil samples.
Both treatments (heated and control) and soil horizons (organic and mineral) were represented by these four soil samples. Approximately 5 g of soil was immediately frozen in a dry ice/ethanol bath for DNA extraction and was then transported to the University of Massachusetts Amherst for storage at -80°C. Approximately 15 g of soil was transferred to a 50-ml Falcon tube for transportation on ice to the Joint Genome Institute (JGI) in Walnut Creek, CA, USA. Samples were further processed as described previously by Schulz et al. (40). The study was limited to four soil samples in order to maintain the cost-effectiveness and overall efficiency of the techniques applied.
Sample preparation and cell sorting. Cells were separated from four incubated soils (heated organic, heated mineral, control organic, and control mineral samples) for FACS through the addition of 0.02% Tween 20 followed by vortex mixing performed for 5 min. Samples were centrifuged for 5 min at 500 × g to pellet large soil particles. Following centrifugation, the supernatant was filtered through a 5-µm-pore-size syringe filter to remove the remaining soil particulates. Samples were diluted 1:100 in phosphate-buffered saline (PBS) and stained with SYBR green (Thermo Fisher Scientific, Waltham, MA, USA). For each of the four soil samples, 90 pools of 100 SYBR-positive (SYBR+) cells were sorted into microwell plates using a BD Influx cell sorter (BD Biosciences, San Jose, CA, USA) to perform FACS. Sorted pools underwent cell lysis and whole-genome amplification using a Qiagen RepliG single-cell kit for multiple-displacement amplification (MDA) (Qiagen, Hilden, Germany). A total of 360 libraries were generated for sequencing with a Nextera XT v2 kit (Illumina, San Diego, CA, USA) with 9 rounds of PCR amplification.
Mini-metagenomes. Following library preparation, the 360 mini-metagenome libraries were sequenced on an Illumina NextSeq platform (Illumina, San Diego, CA, USA) at the DOE Joint Genome Institute (JGI; Walnut Creek, CA, USA). Pools of 90 libraries were processed in four sequencing runs with 2 × 150-bp read lengths. Raw Illumina reads were quality filtered to remove contamination and low-quality reads using BBTools (v37.38) (91), resulting in 359 mini-metagenomes for downstream analysis, as one mini-metagenome did not pass quality filtering standards. Read normalization was performed using BBNorm (91), and error correction was conducted using Tadpole (91). Assembly of filtered, normalized Illumina reads was completed using SPAdes (v3.10.1) (92) with the following options: --phred-offset 33 -t 16 -m 115 --sc -k 25,55,95. All contig ends were trimmed of 200 bp, and contigs were discarded if the length was <2 kb or the level of read coverage was less than 2 using BBMap (91) with the following options: nodisk ambig, filterbycoverage.sh: mincov.
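The contig post-processing rule described above (trim 200 bp from each end, then discard contigs shorter than 2 kb or with coverage below 2) can be illustrated with a minimal Python sketch. This is not the BBMap-based procedure used in the study. The `Contig` class and `trim_and_filter` function are hypothetical names, and applying the 2-kb cutoff to the trimmed length (rather than the original length) is an assumption, since the text does not specify the order.

```python
from dataclasses import dataclass
from typing import Optional

TRIM_BP = 200          # bases removed from each contig end
MIN_LENGTH_BP = 2000   # contigs shorter than 2 kb are discarded
MIN_COVERAGE = 2.0     # contigs with read coverage below 2 are discarded


@dataclass
class Contig:
    """Hypothetical container for one assembled contig."""
    name: str
    sequence: str
    coverage: float  # mean read coverage, e.g. from a read-mapping step


def trim_and_filter(contig: Contig) -> Optional[Contig]:
    """Trim both ends, then apply the length and coverage cutoffs.

    Assumption: the 2-kb cutoff is applied to the trimmed length.
    Returns None when the contig is discarded.
    """
    trimmed = contig.sequence[TRIM_BP:len(contig.sequence) - TRIM_BP]
    if len(trimmed) < MIN_LENGTH_BP or contig.coverage < MIN_COVERAGE:
        return None
    return Contig(contig.name, trimmed, contig.coverage)
```

Under these assumptions, a 2,400-bp contig with coverage 5 is kept as a 2,000-bp sequence, whereas a 5,000-bp contig with coverage 1.9 is dropped.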
Bulk metagenomes. Total DNA was extracted from ~0.25 g of soil using a DNeasy PowerSoil DNA extraction kit (Qiagen, Hilden, Germany). Extracted DNA was assessed using a Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) and Qubit (Thermo Fisher Scientific, Waltham, MA, USA). Unamplified TruSeq libraries were prepared for 4 DNA samples prior to sequencing on an Illumina HiSeq-2000 platform (Illumina, San Diego, CA, USA) at the DOE JGI. Raw Illumina reads were trimmed, quality filtered, and corrected using bfc (version r181) with the following options: -1 -s 10g -k 21 -t 10. Following quality filtering, reads were assembled using SPAdes (v3.11.1) (92) with the following options: -m 2000 --only-assembler -k 33,55,77,99,127 --meta -t 32. The entire filtered read set was mapped to the final assembly, and coverage information was generated using BBMap (v37.62) (91) with default parameters except ambiguous=random. The version of the processing pipeline used was jgi_mga_meta_rqc.py,
Genome binning and quality assessment. Assembled contigs from the bulk and mini-metagenomes were binned into MAGs and sorted-MAGs based on tetranucleotide frequency using MetaBat2 (93). Sorted-MAGs were generated for mini-metagenomes without the use of contig coverage patterns because of MDA bias. Genome bins were assessed for estimated completeness and estimated contamination using the marker genes included in CheckM (45). Bulk metagenome MAGs and sorted-MAGs were filtered to ≥50% completeness, ≤10% contamination, and ≤10% strain heterogeneity to retain medium-quality sorted-MAGs and bulk metagenome MAGs for downstream analysis (46). Following quality filtering, 200 medium-quality sorted-MAGs and 29 medium-quality bulk metagenome MAGs were used for phylogenomic analysis, metagenomic recruitment, and investigation of metabolic potential.
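The medium-quality cutoffs stated above (completeness of at least 50%, contamination of at most 10%, and strain heterogeneity of at most 10%) amount to a simple filter over per-bin quality estimates. The following Python sketch shows that filter. The `BinQuality` class and its field names are hypothetical and do not reproduce the CheckM output format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class BinQuality:
    """Hypothetical per-bin quality summary (all values in percent)."""
    name: str
    completeness: float
    contamination: float
    strain_heterogeneity: float


def is_medium_quality(
    b: BinQuality,
    min_completeness: float = 50.0,
    max_contamination: float = 10.0,
    max_strain_heterogeneity: float = 10.0,
) -> bool:
    """Return True if the bin passes all three cutoffs given in the text."""
    return (
        b.completeness >= min_completeness
        and b.contamination <= max_contamination
        and b.strain_heterogeneity <= max_strain_heterogeneity
    )


def filter_bins(bins: Iterable[BinQuality]) -> List[BinQuality]:
    """Keep only the bins that qualify as medium quality."""
    return [b for b in bins if is_medium_quality(b)]
```

The thresholds are exposed as keyword arguments so that stricter cutoffs can be tested without changing the function body.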
Phylogenetic tree construction and phylogenetic diversity. A concatenated marker gene phylogenetic tree was constructed for 200 medium-quality sorted-MAGs, 29 bulk MAGs, and 3,024 reference genomes from soil bacteria and archaea available in the IMG/M database. A set of 56 universal single-copy marker proteins (41,92) was identified with hmmsearch (v3.1b2) (94) and specific hidden Markov models (HMMs) for each of the markers. For every marker protein, alignments were built with MAFFT (v7.294b) (95) and subsequently trimmed with BMGE using BLOSUM30 (96). MAGs and reference sequences were clustered at 95% average nucleotide identity with FastANI v1.0 (97), resulting in 170 sorted-MAGs, 25 bulk MAGs, and 2,341 reference sequences with distinct taxonomic classifications. Single-protein alignments were then concatenated, and a phylogenetic tree was inferred with FastTree2 using the options -spr 4 -mlacc 2 -slownni -lg (98) and was visualized using iTol (99).
The contribution of sorted-MAGs and bulk MAGs to phylogenetic diversity was determined by calculating the sum of the total branch lengths of the contributed genomes relative to the reference genomes (100). Total branch length was calculated for a phylogenetic tree containing only 2,341 bacterial and archaeal reference sequences from IMG/M (21). We then calculated the additional total branch lengths contributed by sorted-MAGs and bulk MAGs. The percentage of increase in total branch length was determined for the complete phylogenetic tree, as well as for clades that included sorted-MAGs.
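The percentage increase in total branch length can be sketched in Python as follows. This is a simplified illustration rather than the procedure used in the study: it assumes a single rooted tree that already contains both reference and added genomes, and it measures the reference-only length by counting every branch that leads to at least one reference leaf. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    """Hypothetical rooted-tree node; `length` is the branch leading to it."""
    name: str = ""
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def _spanned(node: Node, keep: Set[str]) -> Tuple[bool, float]:
    """Return (subtree contains a kept leaf, summed length of branches used)."""
    if not node.children:
        kept = node.name in keep
        return kept, node.length if kept else 0.0
    found, total = False, 0.0
    for child in node.children:
        child_found, child_total = _spanned(child, keep)
        if child_found:
            found = True
            total += child_total
    if found:
        total += node.length
    return found, total


def total_branch_length(root: Node, keep: Set[str]) -> float:
    """Sum of branch lengths on the rooted subtree spanning the kept leaves."""
    return _spanned(root, keep)[1]


def percent_gain(root: Node, reference: Set[str], added: Set[str]) -> float:
    """Percentage increase in total branch length contributed by `added`."""
    base = total_branch_length(root, reference)
    if base == 0.0:
        raise ValueError("reference set spans no branch length")
    combined = total_branch_length(root, reference | added)
    return 100.0 * (combined - base) / base
```

The same function can be applied to a single clade by passing that clade's root node, which corresponds to the per-clade percentages mentioned above.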
Taxonomy was assigned to sorted-MAGs, bulk MAGs, and metagenome reads by searching sequences against the NCBI-NR database using DIAMOND (101). BLAST results were imported into MEGAN6 (102) for taxonomic assignment. The relative abundance of each phylum was computed and visualized in R using ggplot2 (103).
Protein recruitment. Sorted-MAGs (n = 200) and bulk MAGs (n = 29) were dereplicated by clustering based on 95% average nucleotide identity. Protein coding sequences from the resulting 199 representative sorted-MAGs and MAGs were compared against coding sequences predicted from 2,210 soil metagenome samples from 80 terrestrial metagenome studies stored in the IMG/M database using LAST (64) (Fig. 4; see also Table S3 in the supplemental material). Individual sorted-MAGs and MAGs were designated a match to metagenome samples if the following criteria were met: a minimum of 200 coding DNA sequences (CDS) with hits at ≥95% amino acid identity over 70% alignment lengths to CDS of an individual metagenome. The rationale for choosing the minimum 200 hit count was to ensure that the evidence included more than merely housekeeping genes, which may be more highly conserved. The 95% amino acid identity cutoff was chosen based on a study reported previously by Luo et al. (65), who asserted that organisms grouped at the "species" level typically show >85% AAI among themselves. Since our data set included divergent sublineages, the more conservative threshold of 95% amino acid identity was adopted. The average percentage of CDS with a metagenome hit was calculated for each mini-metagenome (Fig. 4; see also Table S4), and the results were plotted as a multibar chart in iTol (99).
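The match criteria above can be expressed as a short Python sketch. It is illustrative only and does not parse LAST output. The `Hit` class is hypothetical, the 70% alignment criterion is interpreted here as an aligned fraction of at least 0.70 of the query CDS, and each CDS is counted once per metagenome. All three points are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Hit:
    """Hypothetical summary of one CDS-to-metagenome protein alignment."""
    query_cds: str
    percent_identity: float   # amino acid identity, in percent
    aligned_fraction: float   # fraction of the query CDS aligned (0 to 1)


def is_recruited(
    hits: Iterable[Hit],
    min_identity: float = 95.0,
    min_aligned_fraction: float = 0.70,
    min_cds: int = 200,
) -> bool:
    """Decide whether one genome matches one metagenome.

    A genome is a match when at least `min_cds` distinct CDS have a hit
    that passes both the identity and the aligned-fraction cutoffs.
    """
    passing = {
        h.query_cds
        for h in hits
        if h.percent_identity >= min_identity
        and h.aligned_fraction >= min_aligned_fraction
    }
    return len(passing) >= min_cds
```

Collecting passing CDS in a set prevents a single gene with several high-identity hits from inflating the count toward the 200-CDS minimum.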
Bacteroidetes phylogeny and metabolic predictions. A maximum likelihood tree for Bacteroidetes was constructed using IQTree (104) for the 67 sorted-MAGs and soil Bacteroidetes references from IMG/M. The tree was rooted with Pedosphaera parvula in the phylum Verrucomicrobia. Family-level taxonomic classification, genome size, and genome completeness based on CheckM marker gene assessment (45) were visualized using iTol (99). Functional annotation for sorted-MAGs was assigned using the Carbohydrate Active Enzyme (CAZy) database (73) implemented in dbCAN2 (105). The percentage of total annotated genes assigned to each gene family was calculated and is displayed in a multibar chart in iTol (99).
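The per-genome percentages can be computed as in the following Python sketch, which groups family labels such as GH13 or CBM50 into the six CAZy classes and divides by the total number of annotated genes. It assumes that family labels begin with the class prefix followed by a number. The function name and input format are hypothetical and do not mirror dbCAN2 output files.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# The six CAZy classes: glycoside hydrolases, glycosyl transferases,
# polysaccharide lyases, carbohydrate esterases, auxiliary activities,
# and carbohydrate binding modules.
CAZY_CLASSES = ("GH", "GT", "PL", "CE", "AA", "CBM")
_FAMILY = re.compile(r"^(GH|GT|PL|CE|AA|CBM)\d+")


def cazy_class_percentages(
    families: Iterable[str], total_annotated_genes: int
) -> Dict[str, float]:
    """Percentage of annotated genes assigned to each CAZy class.

    `families` holds one family label per CAZy-annotated gene; labels
    that do not match a known class prefix are ignored.
    """
    if total_annotated_genes <= 0:
        raise ValueError("total_annotated_genes must be positive")
    counts: Counter = Counter()
    for label in families:
        match = _FAMILY.match(label)
        if match:
            counts[match.group(1)] += 1
    return {c: 100.0 * counts[c] / total_annotated_genes for c in CAZY_CLASSES}
```

Using total annotated genes as the denominator keeps the values comparable across genomes of different sizes, which matters when genome sizes range as widely as they do in this data set.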
Additional metabolic annotations were assigned to the 67 Bacteroidetes sorted-MAGs using the GhostKOALA server (82). Following annotation of protein coding genes, assigned KEGG Orthology identifiers (KOs) were used to estimate the completeness of selected pathways using KEGG-Decoder, and a heat map was generated using "static" visualization mode to depict the completeness of each pathway (83).
Data availability. The bacterial and archaeal MAG data sets generated and analyzed in this study were deposited at NCBI GenBank under BioProject accession number PRJNA608716 and at https://bitbucket.org/lvalteio/forest_soil_mags_and_sortedmags/src, together with sequence alignments and phylogenetic trees generated in this study. Metagenomes and their corresponding metadata are available at IMG/M (https://img.jgi.doe.gov/m) under the taxon OIDs (identification numbers) indicated in Table S1.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.
The contents of this article are solely our responsibility and do not necessarily represent the official views of the USDA or NIFA.
We thank the Division of Computational Systems Biology for providing and maintaining the Life Science Compute Cluster (LiSC) at the University of Vienna. Additionally, we thank Alexander Truchon for assistance with sample collection and contributions to initial data analysis.