Large-scale metagenome assembly reveals novel animal-associated microbial genomes, biosynthetic gene clusters, and other genetic diversity

Large-scale metagenome assemblies of human microbiomes have produced a vast catalogue of previously unseen microbial genomes; however, comparatively few microbial genomes derive from other vertebrates. Here, we generated 5596 metagenome-assembled genomes (MAGs) from the gut metagenomes of 180 predominantly wild animal species representing 5 classes, in addition to 14 existing animal gut metagenome datasets. The MAGs comprised 1522 species-level genome bins (SGBs), most of which were novel at the species, genus, or family levels, and the majority were enriched in host versus environment metagenomes. Many traits distinguished SGBs enriched in host or environmental biomes, including the number of antimicrobial resistance genes. We identified 1986 diverse biosynthetic gene clusters; only 23 clustered with any MIBiG database references. Gene-based assembly revealed tremendous gene diversity, much of it host- or environment-specific. Our MAG and gene datasets greatly expand the microbial genome repertoire and provide a broad view of microbial adaptations to the vertebrate gut.

Importance

Microbiome studies on a select few mammalian species (e.g., humans, mice, and cattle) have revealed a great deal of novel genomic diversity in the gut microbiome. However, little is known of the microbial diversity in the gut of other vertebrates. We studied the gut microbiome of a large set of mostly wild animal species consisting of mammals, birds, reptiles, amphibians, and fish. Unfortunately, we found that existing reference databases commonly used for metagenomic analyses failed to capture the microbiome diversity among vertebrates. To increase database representation, we applied advanced metagenome assembly methods to our animal gut data and to many public gut metagenome datasets that had not been used to obtain microbial genomes. Our resulting genome and gene cluster collections comprised a great deal of novel taxonomic and genomic diversity, which we extensively characterized.
Our findings substantially expand what is known of microbial genomic diversity in the vertebrate gut.


The vertebrate gut microbiome comprises a vast amount of genetic diversity, yet even for the most well-studied species such as humans, the proportion of microbial species lacking a reference genome was recently estimated to be 40-50% [1]. Uncovering this "microbial dark matter" is essential to understanding the roles of individual microbes, their intra- and inter-species diversity within and across host populations, and how microbes interact with each other and the host to mediate host physiology in myriad ways [2]. On a more applied level, characterizing novel gut microbial diversity aids in bioprospecting of novel bioactive natural products, catalytic and carbohydrate-binding enzymes, probiotics, etc., along with aiding in the discovery and tracking of novel pathogens and antimicrobial resistance (AMR) [3].

Recent advances in culturomic approaches have generated thousands of novel microbial genomes [4-6], but the throughput is currently far outpaced by metagenome assembly approaches [7]. However, such large-scale metagenome assembly-based approaches have not been as extensively applied to most non-human vertebrates. The low fraction of metagenome reads classified in some recent studies of the rhinoceros, chicken, cod, and cow gut/rumen microbiome suggests that databases lack much of the genomic diversity in less-studied vertebrates [8][9][10][11]. Indeed, the limited number of studies incorporating metagenome assembly hints at the extensive amount of as-of-yet novel microbial diversity across the >66,000 vertebrate species on our planet.

Here, we developed an extensive metagenome assembly pipeline and applied it to a multi-species dataset of microbiome diversity across vertebrate species comprising 5 classes: Mammalia, Aves, Reptilia, Amphibia, and Actinopterygii, with >80% of samples obtained from wild individuals [12], combined with data from 14 published animal gut metagenomes. Moreover, we applied a recently developed gene-based metagenome assembly pipeline to the entire dataset in order to obtain gene-level diversity for rarer taxa that would otherwise be missed by genome-based assembly [13,14]. Our assembly approaches yielded a great deal of novel genetic diversity, which we found to be largely enriched in animals versus the environment, and to some degree, enriched in particular animal clades.

Sample collection

Sample collection was as described in Youngblut and colleagues [12]. Table S1A shows

Animal gut metagenomes from a highly diverse collection of animals

We generated animal gut metagenomes from a breadth of vertebrate diversity spanning five classes: Mammalia, Aves, Reptilia, Amphibia, and Actinopterygii (the "multi-species" dataset; Figure 1). In total, 289 samples passed our read quality control, with 3.4e6 ± 5e6 s.d. paired-end reads per sample, resulting in a mean estimated coverage of 0.54 ± 0.14 s.d. (Figure S1). A total of 180 animal species were represented, with up to 6 individuals per species (mean of 1.6). Most individuals were wild (81%).

Our read-quality control pipeline included stringent filtering of host reads; some samples contained high amounts of reads mapping to vertebrate genomes (up to 74%; 6 ± 17% s.d.; Figure 1). Gut content samples contained a significantly higher amount of host reads (13.5 ± 21.6% s.d.) versus feces metagenomes (4.7 ± 12.7% s.d.; Wilcoxon test, P < 1.8e-7; Table S1A). We mapped all remaining reads to a custom comprehensive

We expanded our MAG dataset by applying our assembly pipeline to 14 publicly available animal gut metagenome datasets in which no MAGs have been generated by de novo metagenome assembly (Table S1B). Our metagenome selection included 554 samples from members of Mammalia (dogs, cats, woodrats, pigs, whales, rhinoceroses, pangolins, and non-human primates), Aves (geese, kakapos, and chickens), and Actinopterygii (cod). We applied our assembly pipeline to each individual dataset and generated a total of 5301 non-redundant MAGs (Figure S3; Supplemental Results). The substantially higher number of MAGs from these 14 datasets versus our single multi-species dataset is likely due to the larger number of samples and the high sequencing depth for many of those samples (e.g., we used 2 billion paired-end reads from the dog gut microbiome dataset [25]).

We combined all MAGs and de-replicated at 99.9 and 95% average nucleotide

We also assessed the novelty of our SGBs relative to UHGG, a comprehensive human gut genome database, and found that only 31% of our SGBs had ≥95% ANI to any of the 4644 UHGG representatives, and this overlap only increased to 34% at a 90% ANI cutoff.
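The overlap figures above amount to counting, for each ANI cutoff, the SGBs whose best hit against the reference set meets that cutoff. The following is a minimal sketch of that bookkeeping, assuming the best-hit ANI per SGB has already been computed by some ANI tool. The function name, the SGB identifiers, and the ANI values are all hypothetical and are not taken from the study.

```python
def fraction_with_match(best_ani_by_sgb, cutoff):
    """Return the fraction of SGBs whose best ANI (%) to any reference is >= cutoff."""
    if not best_ani_by_sgb:
        raise ValueError("no SGBs supplied")
    matched = sum(1 for ani in best_ani_by_sgb.values() if ani >= cutoff)
    return matched / len(best_ani_by_sgb)


# Hypothetical best-hit ANI values (percent); illustrative only, not study data.
toy_best_ani = {
    "SGB_a": 98.7,
    "SGB_b": 96.1,
    "SGB_c": 91.4,
    "SGB_d": 82.0,
    "SGB_e": 78.5,
}

frac_95 = fraction_with_match(toy_best_ani, 95.0)  # species-level cutoff
frac_90 = fraction_with_match(toy_best_ani, 90.0)  # relaxed cutoff
print(f"overlap at 95% ANI: {frac_95:.0%}; at 90% ANI: {frac_90:.0%}")
```

A small change in overlap between the two cutoffs, as reported above, indicates that few SGBs sit just below the species-level threshold.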

Our SGB collection mostly consisted of MAGs assembled from a few species in the multi-study dataset, suggesting that the SGBs may not be representative of taxa found in other, more distantly related vertebrates. To assess the level of representation, we determined the prevalence of all SGBs across all multi-species metagenomes (Figure S5). The host species with the highest number of observed SGBs did tend to be those comprising the multi-study dataset (e.g., pigs and primates); however, SGBs were frequently observed across the host phylogeny (41 ± 61 s.d. SGBs per host), indicating that the SGB collection was generally representative of the vertebrate gut microbiome.
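The prevalence summary above can be illustrated with a short sketch that takes per-host SGB detection calls as given and derives both the number of SGBs per host (mean ± s.d.) and the number of hosts in which each SGB was seen. How detection is called (for example, a coverage or abundance threshold) is not specified here, so the input is assumed. The host labels, SGB labels, and function names are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def sgbs_per_host(detected):
    """Map each host to the number of distinct SGBs detected in it."""
    return {host: len(sgbs) for host, sgbs in detected.items()}


def host_prevalence(detected):
    """Map each SGB to the number of hosts in which it was detected."""
    counts = Counter()
    for sgbs in detected.values():
        counts.update(set(sgbs))  # count each SGB at most once per host
    return dict(counts)


# Hypothetical detection calls: host -> set of SGB identifiers (not study data).
toy_detected = {
    "host_1": {"SGB_a", "SGB_b", "SGB_c"},
    "host_2": {"SGB_a"},
    "host_3": {"SGB_a", "SGB_b"},
}

per_host = sgbs_per_host(toy_detected)
prevalence = host_prevalence(toy_detected)
per_host_mean = mean(per_host.values())
per_host_sd = stdev(per_host.values())  # sample standard deviation
print(f"{per_host_mean:.1f} ± {per_host_sd:.1f} s.d. SGBs per host")
```

A standard deviation larger than the mean, as in the reported 41 ± 61, signals a strongly right-skewed distribution in which a few hosts carry most of the observed SGBs.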

Integrating the 1522 SGBs into our custom GTDB Kraken2 database significantly increased the percent reads mapped (paired t-test, P < 0.005; Figure S6). The percent increase varied from <1 to 62.8% (mean of 5.3 ± 6.7 s.d.) among animal species but did not appear biased to just pigs, dogs, or other vertebrate species in the multi-study datasets that we incorporated (Figure S7), which corresponds with our analysis of SGB prevalence across vertebrate hosts (Figure S5).

We investigated the traits of the host- and environment-enriched SGBs and found many predicted phenotypes to be more prevalent in one or the other group (Figure 3C; Table S2C). A total of 67 traits were predicted based on genomic content of certain Pfam domains [26]. Almost all env-SGBs were aerobes (93%), which may aid in transmission between the environment and host biomes. In contrast, 87% of host-SGBs were anaerobes. Furthermore, all env-SGBs could generate catalase and were bile susceptible, while both phenotypes were sparse in host-SGBs (Figure 3C).
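The paired t-test reported above for percent reads mapped compares each sample with itself before and after the SGBs are added to the classification database. The sketch below shows only how the paired t statistic and its degrees of freedom are formed from per-sample differences. It omits the P value, which requires the t distribution (for example via a statistics package), and the function name and input values are hypothetical rather than the study's data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for a paired t-test on matched samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    n = len(before)
    if n < 2:
        raise ValueError("need at least two pairs")
    diffs = [a - b for b, a in zip(before, after)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1


# Hypothetical percent reads mapped per sample (illustrative values only).
mapped_before = [10.0, 20.0, 30.0, 40.0]  # reference database alone
mapped_after = [12.0, 25.0, 31.0, 46.0]   # reference database plus SGBs

t_stat, dof = paired_t_statistic(mapped_before, mapped_after)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

Pairing matters here because baseline mapping rates vary widely among host species. Testing the per-sample differences removes that between-sample variance from the comparison.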

Carbohydrate metabolism also differed, with most host-SGBs predicted to consume various tri-, di-, and mono-saccharides. In contrast, env-SGBs were enriched in phenotypes associated with motility, nitrogen metabolism, and breakdown of heterogeneous substrates (e.g., cellobiose metabolism).

We also compared SGB enrichment in mammals versus non-mammals in our "multi-species" metagenome dataset and found 361 SGBs (24%) to be significantly

In contrast to our assessment of phenotypes distinct to host- or env-SGBs, we did not observe such a distinction of phenotypes among SGBs enriched in Mammalia or non-mammal gut metagenomes (Figure S8). Certain phenotypes such as anaerobic growth and lactose consumption were more prevalent among mammal species, but they were not found to be significantly enriched relative to the null model.

Table S3A for all DESeq2 results.

MAGs reveal novel secondary metabolite diversity

We identified 1986 biosynthetic gene clusters (BGCs) among all 1522 SGBs. A total of 28 different products were predicted, with the most abundant being non-

Large-scale gene-based metagenome assembly reveals novel diversity

We applied gene-based assembly methods to our combined metagenome

Biome enrichment of gene clusters from specific phyla

We mapped reads from our host-environment metagenome dataset to each cluster and used DESeq2 to identify those significantly enriched (adj. P < 1e-5) in each

Figure 6B). Overall, these results suggest that both biomes select for these same microbial functions, but the microbes involved often differ at coarse taxonomic scales.

We also assessed gene cluster enrichment in Mammalia versus non-Mammalia

(Figures 2 & S4). Moreover, we found little overlap (31%) between our MAG collection and the extensive human microbiome genome catalogue comprising the UHGG, which underscores its taxonomic novelty. We also showed substantial SGB

(Figures 2, 3, & S8). This is consistent with the hypothesis that mixed-mode transmission, especially between environmental sources and hosts, is more

(Figures 2 & 4).

By combining our AMR marker screen with our SGB biome enrichment analysis, we were able to characterize how AMR is associated with varying degrees of symbiosis (Figure S9), which is important for understanding AMR reservoirs [27,43]. Our findings indicate that the AMR reservoir may be greater for free-living and facultatively symbiotic taxa relative to microbes with stronger host associations (Figure S9). Indeed, some of the most abundant AMR markers were associated with metal resistance (e.g., ruvB, tupC, and arsT), which may reflect a lifestyle in which the microbe is exposed to

This pattern largely remained true when we compared enrichment between the Mammalia and non-mammals, suggesting that taxonomic differences prevail over functional differences with regard to host specificity, at least over broad-scale vertebrate evolutionary distances. While comparing function to taxonomy is challenging due to differing levels of resolution, we do not believe that our findings are simply due to using functional groupings that are coarser than taxonomy, given that i) we assessed multiple functional groupings (COG, KEGG, and CAZy), which all showed similar patterns, even though they differ in functional resolution, and ii) we assessed taxonomy at the very coarse phylum level but still found stark taxonomic differences across biomes.

gained insight into how taxonomy and function differ along the free-living to obligate symbiosis lifestyle spectrum. We must note that our metagenome assembly dataset is biased toward certain animal clades, which likely impacts these findings. As metagenome assembly becomes more commonplace for studying the vertebrate gut microbiome, bias toward certain vertebrates (e.g., humans) will decrease, and thus allow for a more comprehensive reassessment of our findings.