Genomic characterization of the uncultured Bacteroidales family S24-7 inhabiting the guts of homeothermic animals

Our view of host-associated microbiota remains incomplete due to the presence of as yet uncultured constituents. The Bacteroidales family S24-7 is a prominent example of one of these groups. Marker gene surveys indicate that members of this family are highly localized to the gastrointestinal tracts of homeothermic animals and are increasingly being recognized as a numerically predominant member of the gut microbiota; however, little is known about the nature of their interactions with the host. Here, we provide the first whole genome exploration of this family, for which we propose the name “Candidatus Homeothermaceae,” using 30 population genomes extracted from fecal samples of four different animal hosts: human, mouse, koala, and guinea pig. We infer the core metabolism of “Ca. Homeothermaceae” to be that of fermentative or nanaerobic bacteria, resembling that of related Bacteroidales families. In addition, we describe three trophic guilds within the family, plant glycan (hemicellulose and pectin), host glycan, and α-glucan, each broadly defined by increased abundance of enzymes involved in the degradation of particular carbohydrates. “Ca. Homeothermaceae” representatives constitute a substantial component of the murine gut microbiota, as well as being present within the human gut, and this study provides important first insights into the nature of their residency. The presence of trophic guilds within the family indicates the potential for niche partitioning and specific roles for each guild in gut health and dysbiosis.


Background
The host microbiome has been firmly established as critical to host physiology. Evidence now supports the microbiome as influential in diverse processes ranging from infection susceptibility [1] to behavior [2]. Unique anatomical sites are occupied by microbiota of distinct composition [3], supporting alternative functions being carried out at each site. Of clear significance is the gut microbiome, as metabolic capacity is a product of the capabilities encoded within both the host and the microbiome. The typical vertebrate gut microbiome is dominated by the Firmicutes and Bacteroidetes, and the divergent nature of gut-associated genera in comparison to other phylum members not associated with this environment indicates host selection and evolution occurring over a long period [4]. The relationship is also dynamic, evidenced by shifts in the composition of the gut microbiota encountered with perturbations to a person's diet, as well as with many acute and chronic, noncommunicable diseases, such as inflammatory bowel diseases, or with their treatment (reviewed in [5,6]). Despite our advances in describing these fluctuations, many members of the communities have yet to be cultured and characterized. As such, it remains difficult to ascribe their contributions to gut and systemic function, and thereby, host health and well-being.
One such uncharacterized inhabitant of the gastrointestinal tract is a novel branch of the "Bacteroides group" first recognized in 2002 [7]. This branch was subsequently classified by Greengenes [8] and Silva [9] as an uncultured family of the order Bacteroidales, named after one of the earliest environmental clones belonging to the lineage, S24-7 (acc. AJ400263, [7]). Multiple studies have since reported the altered abundance of S24-7 family members in association with different environmental conditions, e.g., S24-7 is more abundant in diabetes-sensitive mice fed a highfat diet, in particular when chow is supplemented with gluco-oligosaccharides [10] and following treatmentinduced remission of colitis in mice [11]. Members of the S24-7 family are also differentiated by their degree of IgAlabeling [12,13] suggesting at least some members of the group are targeted by the innate immune system. While these observations are currently limited to murine-based studies, they do suggest that S24-7 is involved in hostmicrobe interactions that impact on gut function and health.
To further our understanding of the S24-7 family, we obtained 30 population genomes from four different hosts (human, mouse, koala, and guinea pig) and performed a comparative genomics analysis. The recovered genomes define a family that is most closely related to, but distinct from, the genera Barnesiella and Coprobacter. Analysis of 16S rRNA gene databases suggests a strong habitat preference for the homeothermic gut. Metabolically, we infer S24-7 are fermentative or nanaerobic [14] species, consistent with their environmental niche. We describe three trophic guilds within the family focusing on α-glucan, host glycan, or plant glycan-based carbohydrates suggesting the capacity for niche partitioning and/or divergent spatial organization of its members.

Results
"Candidatus Homeothermaceae" (S24-7) members are found almost exclusively in the guts of homeothermic animals The Silva database (release 119, [9]) contains over 3000 sequences designated as being within the S24-7 family. Notably, 98 % of these sequences originate from homeothermic animals, 99 % of which are associated with the gastrointestinal system (Fig. 1a). There is a single sequence obtained from the intestine of the anchovy Coilia mystus. We therefore propose the name "Candidatus Homeothermaceae" in reference to the homeothermic preference of the family. Similar, although not as extreme, trends are observed in other Bacteroidales families, both in terms of homeothermic hosts and gastrointestinal preference (Fig. 1a). The association of Bacteroidales with feces is well documented and has led to the establishment of host-specific detection of fecal contamination based on this order [15,16]. Of the 57 unique animal species identified as hosting "Ca. Homeothermaceae," the majority (96 %) are herbivores or omnivores, with many of the omnivores likely to consume a mostly herbivorous diet (e.g., chimpanzee, gorilla). The only two carnivorous hosts within the database are the dhole (Cuon alpinus), one of only four canids classified as hypercarnivores [17], and the sea lion. The substantial dominance of plant-based diets amongst "Ca. Homeothermaceae" hosts potentially reflects the metabolic capacity of the family.
"Ca. Homeothermaceae" population genomes We obtained 30 near complete "Ca. Homeothermaceae" draft genomes from fecal metagenomic datasets: 10 from human (genomes H1 to H10), 14 from mouse (Mus musculus, order: Rodentia, family: Muridae; genomes M1 to M14), four from guinea pig (Cavia porcellus, order: Rodentia, family: Caviidae; genomes GP1 to GP4), and two from koala (Phascolarctos cereus, order: Diprotodontia, family: Phascolarctidae; K1 and K10). Average genome size was 2.69 Mb, with a notable outlier from the koala gut of 4.46 Mb (Additional file 1: Table S1). Abundance of each population bin within their respective metagenomic dataset varied from 0.4 to 14.8 % indicating that members of this family represent large fractions of the gut community in some animal hosts (Additional file 1: Table S1). We selected the most complete genome with the lowest inferred contamination, M4 (99.4 % complete; 0.4 % contamination), as a representative of the "Ca. Homeothermaceae" family, for which we propose the name "Candidatus Homeothermus arabinoxylanisolvens." Identification of protein orthologs between each of the genomes revealed a core of 503 proteins present in at least 28 of the 30 assembled "Ca. Homeothermaceae" genomes with an average of 14 % unique genes (minimum 6 % (H9), maximum 28 % (K1), Additional file 2: Table S2). Unique genes were distributed throughout each genome, including the large genome obtained from koala, with only a small number clustered in apparent genomic islands (Additional file 3: Figure S1).
Genome-based phylogenetic classification of the "Ca.

Homeothermaceae"
The assembled "Ca. Homeothermaceae" population genomes were phylogenetically placed within the Bacteriodales order by generating a genome tree based on 120 single-copy marker genes within 300 Bacteroidales reference genomes obtained from NCBI (Fig. 1b). The closest characterized relatives of "Ca. Homeothermaceae" are members of the bacterial genera Barnesiella and Coprobacter. These genera are currently classified as members of the family Porphyromonadaceae but according to our genome-based inference, we propose that they be reclassified as a separate family, Barnesiellaceae fam. nov. (Fig. 1b and Additional file 4: Figure S2). Calculation of average nucleotide sequence identity (ANI) between "Ca. Homeothermaceae" genomes supports the dataset representing 27 different species, with only three examples of members of the same species having been sampled on two independent occasions (ANI >95 %, [18]). Two of these genome pairs were recovered from different hosts: M5 and H6, originating from mouse and human, and H1 and K10, originating from human and koala (Additional file 5: Figure S3). The third pair, H8 and H9, shares a human origin. Thus, "Ca. Homeothermaceae" species are not restricted to specific hosts. Above the species level, there are also five clades within the dataset sharing increased average amino acid identity that may represent distinct genera ( Fig. 1b and Additional file 5: Figure S3). Nine population genomes fall outside of both inferred species and genera indicating further diversity exists within the family.

Shared features of "Ca. Homeothermaceae" genomes
Annotation of each genome supports the family as being composed of primary fermenters capable of producing acetate, propionate, and succinate (Fig. 2). The "Ca. Homeothermaceae" cell envelope is that of a diderm (Gram-negative), as demonstrated by the absence of typical monoderm protein domains and the presence of the majority of typical diderm domains, consistent with the Bacteroidetes phylum (Additional file 6: Table S3, [19]). A number of the genomes encode putative capsular polysaccharide synthesis loci defined by the presence of the regulatory homolog UpxY in association with a number of glycosyltransferase genes, as seen in Bacteroides (Additional file 7: Figure S4, [20]). Predicted fimbrial . a Habitat source of 16S rRNA gene sequences described in the Silva database [9] belonging to members of the Bacteroidales order. Written percentages indicate the proportion of homeothermic animal samples which were derived from the gastrointestinal tract (GIT) for families dominated by homeothermic samples. Note that families in this panel have been reclassified according to the genome tree (panel b). b A maximum-likelihood tree of "Ca. Homeothermaceae" population genomes based on alignment of 120 concatenated marker genes. Bootstrap support derived from 100 replicates. Genome identifiers indicate host from which genome was recovered: human (H), mouse (M), guinea pig (GP), or koala (K). The metabolic focus, as determined by carbohydrate active gene annotations, is indicated by colored triangles following each population genome genes are present in 80 % of the genomes and carry the FimA domain (pfam06321), and/or the associated fimbrillin C domain (pfam15495), or, more commonly, the Mfa-like-1 (pfam13149) or Mfa2 (pfam08842) domain indicating "Ca. Homeothermaceae," may produce fimbriae resembling that of the periodontal pathogen Porphyromonas gingivalis where they have been shown to bind both host proteins and those of other bacteria, as well as promote inflammatory responses (reviewed in [21]). All "Ca. Homeothermaceae" contain at least one putative antibiotic efflux pump, and 60 % also encode a class A β-lactamase (Additional file 7: Figure S4), a profile that is less extensive than that of their close relatives Bacteroides (reviewed in [22]).
All genomes encode the capacity for the production of vitamins B 1 (thiamine), B 2 (riboflavin), B 3 (niacin), B 5 (pantothenate), B 7 (biotin), and B 9 (folate), a range which is consistent with other Bacteroidetes [23]. No homolog of PdxH, the final enzyme required for active vitamin B 6 production, was identified in any of the genomes, despite the remainder of the pathway being present. The lack of PdxH has also been noted in some Bacteroides species [23]. The complete vitamin B 12 (cobalamin) production pathway is absent from all "Ca. Homeothermaceae," Fig. 2 Inferred key metabolism of "Ca. Homeothermaceae." Model constructed using KEGG orthology annotations in addition to manual curation. Superscript numbers denote the number of population genomes encoding a given characteristic however, a subset encode a partial pathway originating from adenosyl cobyrinic acid. Vitamin B 12 transporters were identified within 28 of the 30 genomes (Fig. 2). Therefore, members of this family are predicted to rely on neighboring populations for the production of B 12 , an important cofactor, as is seen in other Bacteroidetes [24].
In addition to fermentation capacity, "Ca. Homeothermaceae" also encode elements of an electron transport chain indicating possible alternative modes of energy production. Complex I is found in the majority of "Ca. Homeothermaceae" genomes (25 of 30, Additional file 7: Figure S4) and comprises 11 of the 14 canonical subunits [25], lacking NuoEFG, the NADH dehydrogenase module. Such complexes are found in multiple phyla; however, their redox function is often unclear [26]. Of the remaining five genomes, four (GP3, GP4, M10, and M14) have loci adjacent to contig boundaries suggesting incomplete assembly as the reason for the missing elements. The remaining genome, H5, entirely lacks all components of complex I. While this genome has a relatively low level of completeness (85 %), other genomes with similar completion levels contain complex I, suggesting true absence in H5. The closest relative of this population, M8, contains a complete (11/14 subunit) complex, indicating this would be a recent loss from H5 (Fig. 1b). An F-type ATP synthase was identified in 26 "Ca. Homeothermaceae" genomes and is integrated within the genomic locus of complex I. Consequently, several genomes with incomplete complex I also harbor incomplete ATP synthases, including H5 in which all subunits are absent.
In addition to complex I, complex II (fumarate reductase/succinate dehydrogenase) is present in 28 genomes and is composed of three subunits: a flavoprotein, an iron-sulfur protein, and a single transmembrane protein, indicating a type B structure [27]. Both genomes with incomplete complex II gene sets, GP3 (missing two genes) and GP4 (missing all genes), were also missing elements of complex I, consistent with their lower completeness and increased fragmentation (Additional file 7: Figure S4 and Additional file 1: Table S1). The flavoprotein catalytic subunit of the complex resembles that of other related anaerobic Bacteroidales species suggesting fumarate as the terminal electron acceptor in an anaerobic respiratory chain, as is described in Bacteroides (Additional file 8: Figure S5, [28]). However, also like Bacteroides and most other members of the Bacteroidales order, the majority of "Ca. Homeothermaceae" genomes contain a vertically inherited aerobic reductase operon, cydAB (Additional file 9: Figure S6), a highaffinity bd-type oxidase induced under low oxygen conditions permitting growth in nanomolar concentrations of oxygen [29,30]. Thus, "Ca. Homeothermaceae" are likely nanaerobes, able to inhabit both anoxic and marginally oxic environments [14]. Four genomes lack the cydAB operon (H4, H9, M9, and GP4), one of which is 97 % complete, suggesting that not all "Ca. Homeothermaceae" have this capacity. We were unable to confirm the operons' absence through genomic context due to a lack of synteny in the surrounding regions despite close relatives in some instances (H8-H9, M9-GP2). If truly absent, as with complex I in H5, this would indicate relatively recent independent loss of this operon from multiple "Ca. Homeothermaceae" lineages. The variable presence of respiratory complexes in "Ca. Homeothermaceae" suggests a level of energetic flexibility within the family and potentially relatively recent purging of non-essential respiratory elements in the gastrointestinal environment.
In addition to the typical electron transport chain elements, the Na + translocating NADH:ubiquinone oxidoreductase Nqr complex was identified in 27 "Ca. Homeothermaceae" genomes (Additional file 7: Figure S4). Nqr permits the use of a Na + gradient for energy production and suggests NADH may be oxidized by this complex in "Ca. Homeothermaceae" rather than by complex I, which is missing the NADH dehydrogenase module. All six Nqr subunits are present within the 27 genomes, which include H5. Three genomes, H2, H9, and M11, are missing all six components. At least one H + /Na + antiporter is found within all "Ca. Homeothermaceae" genomes supporting the use of this system.
To support a prediction of "Ca. Homeothermaceae" as nanaerobes, we looked for proteins within the genomes that would provide oxidative stress protection. Superoxide dismutase (O 2 − to O 2 or H 2 O 2 ) was identified in 24 of the genomes, and eight of these also encode a catalase (H 2 O 2 to H 2 O + O 2 ) protein (Additional file 7: Figure S4). All eight catalase-positive genomes were obtained from mice. Peroxide reduction may also be achieved via several peroxiredoxins that are present within the "Ca. Homeothermaceae" genomes. Firstly, the alkyl hydroperoxide reductase, AhpC, and associated disulfide reductase, AhpF, were identified in nine genomes. A further six contained AhpC only, representing a separate cluster within generated gene trees (Additional file 7: Figure S4 and Additional file 10: Figure S7). There are also between one and four copies of rubrerythrin in all "Ca. Homeothermaceae" genomes. Finally, 25 genomes encode the thiol peroxidase bacterioferritin comigratory protein. Protein stability and reduced state regeneration during oxidative stress is supported by the presence of between two and six TRX family (group I) thioredoxins within each genome and a single copy of the thioredoxin reductase TrxB in all but three genomes: GP1 and M13 lack TrxB, while M12 contains two copies. In addition, 26 genomes contain a non-heme ferritin protein permitting the storage of excess iron. Overall, "Ca. Homeothermaceae" members appear well equipped to deal with oxidative stress, supporting potential microaerobic growth.
The "Ca. Homeothermaceae" secretome Proteins secreted by gut-inhabiting bacteria can influence interactions with both the host and other microbes. Approximately 15 % of "Ca. Homeothermaceae" proteins carry the general secretory pathway signal peptide, as predicted by SignalP [31], which is at the lower end of the range predicted for Gram-negative species (13 to 42 %) [32]. Within this secretome,~15 % of the proteins were annotated with carbohydrate-based activity and thus potentially provide nutrients for "Ca. Homeothermaceae" or neighboring populations. Several immune-related peptidases are also secreted: 20 of the 30 population genomes contain a metalloprotease belonging to peptidase family M6 (immune inhibitor A family) (Additional file 7: Figure S4). Members of this family have been demonstrated to degrade antimicrobial peptides [33] and components of the extracellular matrix [34] and may therefore play a role in invasiveness or persistence within the host (reviewed in [35]). In addition, 11 genomes contain an IgA degrading peptidase (peptidase family M64) that may assist with immune evasion by these populations [36].
Working in concert with the general secretory pathway, a type IX secretion system (T9SS) was also identified in the majority of the "Ca. Homeothermaceae" genomes. All 10 components of the system (PorK, PorL, PorM, PorN, PorP, PorT, PorU, PorV, PorW, and Sov) were present in 22 genomes (other genomes contained incomplete gene sets), in addition to the regulatory twocomponent sensor system, PorX (response regulator) and PorY (histidine kinase), responsible for the coregulation of a subset of T9SS genes (Additional file 7: Figure S4, [37]). No other secretion system was identified within "Ca. Homeothermaceae". Within the T9SS, PorU acts as a peptidase for proteins containing a conserved C-terminal domain (TIGR04183) that dictates the use of the system and is cleaved during translocation [38,39]. We identified 161 proteins containing this domain within the "Ca. Homeothermaceae" genomes, 75 % of which also carried a general secretory pathway signal peptide, supporting their movement to the periplasm and subsequent secretion by the T9SS. The majority of proteins within this group are annotated as hypothetical (60 %); however, there is a homolog of a characterized immune-related peptidase, streptopain (SpeB). SpeB, encoded by Streptococcus pyogenes, contains the peptidase C10 domain and is capable of degrading multiple components of the immune system (reviewed in [40]). Streptopain homologs are present in 26 "Ca. Homeothermaceae" genomes.
Potential metabolic guilds within "Ca. Homeothermaceae" Carbohydrate-active enzymes constitute~6 % of "Ca. Homeothermaceae" coding sequences, a level similar to that of other Bacteroidales families (Additional file 11: Table S4). GH13 is the most abundant glycoside hydrolase family and largely comprises ɑ-amylases, suggesting starch is a key resource of the family. In support of this, GH13 was found to be significantly (P = 1.52E-08) enriched in "Ca. Homeothermaceae" in comparison to related Bacteroidales families, as was the starch binding module CBM26 (P = 1.37E-06, Additional file 12: Table S5). The next most abundant glycoside hydrolase family is GH43 and is dominated by xylosidase and arabinosidase enzymes capable of degrading xylan and arabinan, respectively. However, this ability is not ubiquitous amongst "Ca. Homeothermaceae" as 12 genomes contain no genes in this category (Fig. 3). Differential abundance of such enzymes indicates potential niche partitioning, and comparative analysis across the population genomes suggests three trophic guilds with differential capacity for carbohydrate degradation: α-glucans, complex plant cell wall glycans (hemicellulose and pectin), or host-derived glycans (Figs. 1b and 3).
The α-glucan and plant glycan guilds constitute the majority of the "Ca. Homeothermaceae" genomes, comprising 13 and 12 genomes respectively, with the host glycan group composed of five members. We used a combination of indicator species identification and pairwise differential abundance analysis to confirm the enriched enzymes within each group, retaining those enzymes identified by both methods as defining the guild (Additional files 13 and 14: Tables S6 and S7). The αglucan guild is the most highly selective group, with only two significantly enriched enzyme categories, both starch related (Additional file 13: Table S6). The plant guild is equipped for the degradation of arabinan, xylan, and pectin, all plant cell wall constituents. Finally, the host glycan guild is enriched in β-hexosaminidases, capable of cleaving glucosamine and galactosamine residues, α-fucosidases, capable of cleaving fucose residues and comprises the only genomes to contain the sialic acid cleaving sialidase, supporting a capacity for host glycan degradation (other enzymes carrying the GH33 domain do not display homology to known sialidase enzymes, Additional file 15: Figure S8). Integration of trophic guild membership with phylogeny reveals the presence of some clades with a shared substrate focus, while others are mixed (Fig. 1b). Each guild is also mixed in terms of host distribution, with no foci found in only one host. There are, however, dominant guilds within both guinea pig and human samples: 70 % of human origin genomes are α-glucan focused and 75 % of guinea pig origin genomes are plant focused. This may reflect diet preference of these hosts; however, more genomes are required to provide support for this theory. We also predict that these guilds may occupy distinct spatial niches within the gut; the host glycan group primarily associating with the mucus layer, as has been demonstrated for the known mucin degrader Akkermansia muciniphila [41], and the plant and starch groups associating primarily with the digesta.
To provide contextual data, we generated carbohydrateactive enzyme (CAZy) profiles for a selection of microbial isolates from the gut and compared their glycoside hydrolase distribution with that of the "Ca. Homeothermaceae" population genomes (Additional file 15: Figure S8). The trophic guild division between "Ca. Homeothermaceae" genomes remained largely intact with this extended analysis; however, two genomes from the α-glucan guild, M6 and M11, became associated with members of an alternate guild, M6 with host glycan and M11 with plant glycan. Both these populations are positioned on the boundary between their two alternative guilds when broader analysis was conducted based on Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology annotation (Fig. 4a, b). In support of the predicted metabolic guilds, A. muciniphila, a known mucin degrader [42], was found to associate with the host glycan guild (Additional file 15: Figure S8). The plant guild associates with a single species, Prevotella copri, which has been demonstrated to increase in abundance following a controlled diet containing barley kernel-based bread [43]. Functional metagenomic analysis of fecal samples following this controlled diet revealed an increase in genes necessary for the degradation of polysaccharides within the bread, including xylan-degrading enzymes which were found to be enriched within the plant guild (Additional file 13: Table S6). Finally, the α-glucan guild associates with the acetogen Blautia hydrogenotrophica, which is known to harbor few glycoside hydrolases and consequently has been proposed to inhabit a syntrophic niche within the gut [28]. Therefore, the α-glucan guild may actually represent populations utilizing the metabolism of their neighbors and providing an alternative function in return.

Polysaccharide utilization loci
Carbohydrate degrading enzymes may be clustered together in polysaccharide utilization loci (PUL), first described in Bacteroides thetaiotaomicron [20] and now known to be typical of Bacteroidetes [44]. Such loci are defined by the presence of orthologs of the starch utilization system (Sus) components responsible for starch and maltooligosaccharide binding (SusD) and transfer to the periplasmic space (SusC) [45,46]. Multiple susCD-like gene pairs are found within all "Ca.  Table S6) Homeothermaceae" genomes, and most also contain the gene pair in association with carbohydrate-active enzymes (Additional file 16: Table S8). In addition, 10 % of "Ca. Homeothermaceae" PULs are located in close proximity to a hybrid two-component system protein, which have been demonstrated to regulate PUL expression (Additional file 16: Table S8) [47,48]. While homologs of extracytoplasmic function σfactors, also linked to PUL expression [49], were identified in many "Ca. Homeothermaceae" genomes, they were not located near PULs. The average number of both susCD pairs and pairs with associated carbohydrate enzymes is highest in the plant guild, suggesting these as requiring the most varied carbohydrate degradative machinery (Additional file 16: Table S8). Both susC and susD are present in large numbers throughout the Bacteroidetes with variable sequence identity existing both within and between species [20,50,51]. While initially described as a starch binding system, the association of susCD pairs with enzymes targeting a variety of substrates supports the broader use of this system, and the differential regulation of distinct susCD pairs in response to dietary changes has been demonstrated in B. thetaiotaomicron [52]. We constructed gene trees based on both genes (data not shown) to determine whether there was greater variability within a particular "Ca. Homeothermaceae" guild, which could reflect an ability to bind and transport a wider variety of substrates. However, average phylogenetic diversity scores [53] for all three guilds were similar for both susC (α-glucan:1.6, host glycan: 1.7, plant:1.5) and susD (α-glucan:1.8, host glycan:1.8, plant:1.7) suggesting that equivalent sequence diversity exists within each group and therefore potentially a similar, although not necessarily overlapping, diversity of substrates available to this system. Fig. 4 Metabolic guilds cluster together based on broad KEGG and COG orthology annotation. a PCA plot generated using KEGG annotation of "Ca. Homeothermaceae" population genomes. b PCA plot generated using COG annotation of "Ca. Homeothermaceae" population genomes. c COGs enriched in pairwise comparisons for each trophic guild (Additional file 17: Table S9) Broader functional comparative analysis of "Ca.

Homeothermaceae"
To determine whether the metabolic guilds extend to broader genome content, we annotated each genome using both KEGG and COG orthology detection. Ordination plots generated from both annotation systems clustered the metabolic guilds discretely, although with low levels of separation (Fig. 4a, b). Only a single KEGG orthology group was found to be significantly enriched; K12373 (hexosaminidase), which was increased within the host glycan guild, consistent with the previous CAZy-based observations. COG annotation yielded several enriched protein families within each guild (Fig. 4c and Additional file 17: Table S9), all of which are associated with carbohydrate-active enzymes, indicating these as the key differentiating characteristic of each group. In addition to functions noted previously, the host glycan guild is also enriched for arylsulfatase-related enzymes (COG3119), which plays a role in the degradation of host glycans and is therefore consistent with the guild focus.
Using COG-based annotations, we then extended the analysis to compare "Ca. Homeothermaceae" to other Bacteroidales families. "Ca. Homeothermaceae," Bacteroidaceae, and Prevotellaceae separated clearly based on COG annotations, while Porphyromonadaceae members were intermingled with other families, potentially reflecting the phenotypic diversity of this family (Additional file 18: Figure S9, [54]). The abundance of over 450 individual COGs was found to be significantly different within "Ca. Homeothermaceae" compared to the other families;~75 % of which were decreased (Additional file 19: Table S10). Out of those with increased abundance, several are of interest. Multiple urease-associated COGs were significantly enriched, and the urease gene cluster was subsequently identified in twelve genomes, suggesting a role in nitrogen recycling and a source of ammonia (Additional file 7: Figure S4). Gene trees produced for each of the urease subunits confirm that the presence of urease is unusual within the Bacteroidales and suggest lateral transfer of the gene cluster to the common ancestor of "Ca. Homeothermaceae" from an Alistipes-or Odoribacter-like ancestor and subsequent loss from a subset of the "Ca. Homeothermaceae" genomes (Additional file 20: Figure S10, data not shown). Putative formyl-CoA transferases (COG1804) were also enriched in "Ca. Homeothermaceae," revealing the presence of the oxalate degrading gene pair formyl-CoA transferase and oxalyl-CoA decarboxylase in 19 of the 30 genomes (Additional file 7: Figure S4). Oxalate is likely transported into the cell via a permease located adjacent to oxalyl-CoA decarboxylase in all 19 oxalate degraders (and which is not found in "Ca. Homeothermaceae" lacking the oxalate degrading gene pair) as suggested by genomic analysis of other oxalate degrading species [55]. Oxalate degradation appears to be linked to metabolic guilds, that is, 10 of 12 plant, and four of five host glycan-focused "Ca. Homeothermaceae" have oxalate degrading genes, whereas only five of the 13 α-glucanfocused guild have them. As with urease, gene trees confirm that this function is rare within the Bacteroidales (Additional file 21: Figure S11, data not shown).

Relative abundance and prevalence within the sampled mammalian hosts
A large portion of "Ca. Homeothermaceae" sequences within 16S rRNA databases originate from mice suggesting high prevalence of the family in the murine host. To determine whether this reflects true prevalence or database bias, we searched for evidence of "Ca. Homeothermaceae" in available gut metagenome datasets from both mice and humans. We found "Ca. Homeothermaceae," as represented by the analyzed population genomes, present in~50 % of mice samples, with an average relative abundance of 6 % of the gut community as estimated by read mapping (from~100 metagenomes, Additional file 22: Table S11). The prevalence in humans was lower,~20 %, with an average abundance of~2 % of the community (from~300 metagenomes). "Ca. Homeothermaceae" is therefore more common in mice than humans but is nonetheless likely to be present in a sizable fraction of the human population.
The most prevalent "Ca. Homeothermaceae" species was H8/H9 in 20 % of human datasets, and M6 in 35 % of murine datasets (Additional file 22: Table S11), both members of the α-glucan guild. We were interested to see if this prevalence was consistent across different dietary backgrounds as a potential source of support for our proposed guild structure. To do this, we subdivided the analyzed datasets according to available diet-related metadata. We found mice fed a high-fat diet carried a narrower range of "Ca. Homeothermaceae" populations, and these were present at a lower abundance than those fed a standard chow diet (Additional file 22: Table S11). However, no particular guild showed dominance, and M6 remained the most prevalent population overall. Within the public human datasets sampled, very few non-α-glucan guild representatives were detected (Additional file 22: Table S11) in agreement with the dominance of this guild within the "Ca. Homeothermaceae" genomes recovered from humans and suggestive of the human diet as most supportive of members of the α-glucan guild. We found a higher prevalence of "Ca. Homeothermaceae" in obese individuals (23 %) than in lean (10 %), suggesting that a higher energy diet may better support the family, although there was no dietary composition information available for this dataset. We also identified a substantial increase in the prevalence of "Ca. Homeothermaceae" within the Hadza huntergatherer population; 70 % of individuals were found to carry at least one population. Only two species were identified within this group of individuals: the typically prevalent H8/H9 and species H4, also a member of the α-glucan guild. The Hadza diet is heavily plant based, composed of tubers, meat, honey, foliage, and berries, with tubers being particularly important due to their constant availability [56]. Tubers consumed by the Hadza contain a large portion of indigestible fibers that are expectorated after chewing. As such, tubers provide a source of moisture, simple sugars, starch, and soluble fiber [56]. The increased prevalence of "Ca. Homeothermaceae" within the Hadza suggests this diet is particularly amenable to the maintenance of the identified species, although does not result in any increase in their overall abundance.

Discussion
The Bacteroidales family S24-7 is encountered frequently in culture-independent studies and is gaining recognition due to both its prevalence, particularly in murine-based datasets, and its fluctuating abundance in cross-sectional and intervention type studies. Increased abundance has been described in mice fed a low-fat diet and, in association with increased exercise [57], in diabetes-sensitive mice fed a high-fat diet [10] and following remission of colitis in a mouse model [11]. S24-7 is also the dominant family during hibernation of arctic ground squirrels [58]. The consequence of these fluctuations in the abundance of S24-7 is currently unknown, as they remain uncultured and no genomic studies have been undertaken. Here, we recover a set of S24-7 population genomes from metagenomic samples, enabling inference of their core metabolism, and propose the name "Ca. Homeothermaceae" reflecting an ecological distribution limited to the guts of homeothermic animals (Fig. 1a).
Carbohydrate composition and availability is known to be a primary driver of microbial community structure in gut ecosystems [48,[59][60][61]. We identified three trophic guilds within the "Ca. Homeothermaceae" based on their encoded carbohydrate-active enzymes (Fig. 3) that suggest the family has the capacity to occupy multiple niches within the gut, which may include spatial partitioning. Similar metabolic differentiation is observed in other gutinhabiting genera such as Bacteroides [48,50,62] and Bifidobacterium [61] where different species encode alternative carbohydrate utilization machinery. Some gut inhabitants may be able to occupy multiple carbohydratebased trophic niches: B. thetaiotaomicron displays a preference for diet-derived polysaccharides, such as xylan-, pectin-, and arabinose-based compounds, over host glycans; however, when such polysaccharides are scarce, host glycan degradation activity is upregulated [52]. "Ca. Homeothermaceae" guild representatives may also utilize this strategy; all analyzed population genomes encode starch-degrading ɑ-amylases and as such, while they may have a preference for their specialist carbohydrate source, starch can serve as a foundation carbohydrate for all family members.
We identified two characteristics within a subset of "Ca. Homeothermaceae" populations that were relatively unusual for members of the Bacteriodales: the capacity for oxalate degradation and the production of urease (Additional files 20 and 21: Figure S10 and S11). Plants are the primary source of dietary oxalate, which in excess contributes to the formation of renal stones by complexing calcium (reviewed in [63]). In agreement with this dietary source, the majority of "Ca. Homeothermaceae" populations within the plant-focused guild encode the necessary components for oxalate degradation. Oxalate degrading potential is also encoded within four of the five host glycan guild population genomes while only 40 % of α-glucan guild members contain the necessary genes (Additional file 7: Figure S4). Oxalobacter formigenes is the best characterized oxalate degrading gut bacterium, and colonization is associated with the decreased incidence of calcium oxalate stone formation [64]. While O. formigenes is dependent on the presence of oxalate and uses it as a sole carbon source [65], other oxalate degraders, such as lactic acid bacteria, require the presence of an additional carbon source [55,66]. "Ca. Homeothermaceae" oxalate degraders are likely in the latter category due to sporadic distribution of the trait across the family. This distribution appears to be the result of lineage-specific loss rather than multiple independent lateral acquisitions (Additional file 20: Figure S10). Urease releases ammonia from urea, which can then be incorporated into microbial amino acids [67] and contributes to nitrogen level stability of the host particularly when protein consumption is low [68]. Urease activity can therefore be advantageous for both host and microbe. However, urease-positive microbes can be detrimental in combination with elevated circulating ammonia levels associated with liver disease [69]. Urease is also a recognized virulence factor in both bacterial and fungal infection (reviewed in [70]). The abundance of urease genes within the gut microbiota in humans differs with age and geography and is potentially reflective of diet [71]. We identified urease positive "Ca. Homeothermaceae" populations in all four hosts (Additional file 7: Figure S4), and presence did not correlate with a particular trophic guild making it difficult to predict a role for urease within the group; however, a metabolic role for the enzyme is supported by the presence of glutamine synthetase in the majority of genomes (Fig. 2).
A key question relating to newly characterized members of the microbiota is whether they are friend or foe. "Ca. Homeothermaceae" representatives have been shown to be IgA coated [12]. IgA production is induced by both commensal and pathogenic intestinal inhabitants, and both are believed to be able to induce the production of highly specific IgA leading to microbial cell coating [72]. "Ca. Homeothermaceae" are found in both IgA+ and IgA− fractions of fecal microbiota [12,13], however, on the whole, are not highly coated with IgA in contrast to families known to be inflammatory in murine models of colitis, such as Prevotellaceae [12]. The presence within some "Ca. Homeothermaceae" genomes of an IgA protease (Additional file 7: Figure S4) may contribute to their identification within both IgA+ and IgA− community fractions, although the significance of IgA proteases in vivo remains unclear (reviewed in [73]). The majority of "Ca. Homeothermaceae" genomes also contain a homolog of SpeB (Additional file 7: Figure S4), a peptidase capable of degrading multiple immunologically relevant proteins (reviewed in [40]). SpeB homologs are found in other gut bacteria including Bacteroides fragilis and B. thetaiotaomicron [74,75] and in the periodontal pathogen Prevotella intermedia where the homolog interpain A is involved in the inhibition of the immune response via complement degradation [76]. The presence of multiple potential immune evasion peptidases within members of "Ca. Homeothermaceae" does not preclude a typically commensal relationship with the host; however, they may provide the capacity for opportunistic infection under the appropriate conditions [75].

Conclusions
Overall, this study provides the first genomic insights into the uncultured gut-inhabiting "Ca. Homeothermaceae" family through the comparative analysis of 30 draft genomes obtained from metagenomic datasets. We describe varied carbohydrate utilization mechanisms existing within the family in agreement with other genera occupying the same environmental niche. As a group that is particularly prevalent within a key experimental environment, the mouse gut (and also present in the human gut and potentially relevant to human health), further reports confirming the roles of "Ca. Homeothermaceae" in vivo are likely to appear in the future.
Description of "Ca. Homeothermaceae" The description is the same as for the genus "Ca. Homeothermus" with the following additions; -aceae ending to denote a family. Additional fermentation substrates include pectin, rhamnose, and trehalose. Three trophic guilds are proposed within the family based on the relative abundance of carbohydrate-active enzymes with different substrates: α-glucans, complex plant cell wall components, and host glycans. Type genus: "Ca. Homeothermus." Order: Bacteroidales. The description is the same as that given by Krieg et al. [54] with the following amendment. The genera, Barnesiella, Butyricimonas, Coprobacter, Dysgonomonas, Odoribacter, Paludibacter, Parabacteroides, Proteiniphilum, "Ca. Sanguibacteroides," and Tannerella have been removed as they do not form a monophyletic group with the type genus, Porphyromonas.
Description of Barnesiellaceae fam. nov.
Description of Dysgonamonadaceae fam. nov.
Description of Paludibacteraceae fam. nov.
Includes the genus Paludibacter (type genus). Description is the same as for the genus Paludibacter given by Ueki et al. [82]: -aceae ending to denote a family. Type genus: Paludibacter. Order: Bacteroidales.
Description of Tannerellaceae fam. nov.
Includes the genera Parabacteroides and Tannerella (type genus). Description is drawn from that of Tannerella given by Sakamoto et al. [83] and Parabacteroides given by Sakamoto et al. [84]: -aceae ending to denote a family. Cells are Gram-negative, non-motile, and obligately anaerobic. Type genus: Tannerella. Order: Bacteroidales.

Sample collection and sequencing
Fecal samples were obtained from six 12-week-old female C57BL/6 mice housed in accordance with the University of Newcastle Animal Care and Ethics Committee; reference number A-2013-303. DNA was extracted from feces using the PowerSoil DNA Isolation Kit (MO BIO Laboratories, CA, USA) according to the manufacturer's instructions. Library preparation was performed using the Nextera DNA Library Preparation Kit (Illumina, CA, USA). Libraries were sequenced at the Diamantina Institute, The University of Queensland, using the Illumina HiSeq 2500 platform generating~9 Gbp of 100 bp paired-end reads per sample.
Koala fecal samples originated from a 12-year-old male as previously described [85]. Public metagenomic datasets generated from fecal samples of both human and guinea pig were downloaded from the NCBI sequence read archive (SRA); human samples were obtained from multiple projects, specifically runs (ERR209459, ERR209707, ERR525737, ERR688517, ERR688528, ERR710427, ERR710429, SRR413598, SRR413599); guinea pig samples were obtained from a single project (BioProject: SRP012966).

Sequence assembly and population genome recovery
Reads from mouse fecal samples were adapter trimmed and merged using SeqPrep (https://github.com/jstjohn/ SeqPrep) then remaining pairs were quality trimmed using Nesoni v0.128 (https://github.com/Victorian-Bioinformatics-Consortium/nesoni) with a minimum Phred quality threshold of 20. Assembly of pooled reads was performed using CLC Genomics Workbench v7.0.4 (QIAGEN, Aarhus, Denmark) using a word size of 30 and bubble size of 1500. Scaffolding was performed during assembly, and reads were mapped back to contigs with default settings. Minimum contig length was 300. Mapping of reads to final assemblies was performed using BWA v0.7.10 [86] with default settings.
SRA data from guinea pig and human datasets was quality and adapter trimmed using Trimmomatic v0.3.2 [87] with default settings plus a head crop of 10 and minimum read length of 30. Trimmed reads were merged using BBMerge (https://sourceforge.net/projects/bbmap/). Assembly was performed either per individual run (human) or pooled (guinea pig) using CLC Genomics Workbench v8.5.1 (QIAGEN, Aarhus, Denmark) either using default settings (human) or using a word size of 30 and a bubble size of 1000 (guinea pig). Minimum contig length was 500. Scaffolding was performed during assembly, and reads were mapped back to contigs with default settings. Gap filling was performed on assemblies from human datasets using Abyss-sealer [88] with default settings. Mapping of reads to final assemblies was performed using BamM v1.5.1 (http://ecogenomics.github.io/BamM/) with default settings.

Phylogenetic resolution
A maximum-likelihood tree of "Ca. Homeothermaceae" population genomes in the context of the order Bacteroidales was generated based on an alignment of 120 concatenated single-copy, bacterial-specific, marker genes implemented within an in-house pipeline. Marker genes from 300 Bacteroidales and 37 Sphingobacteriales (outgroup) genomes were obtained from publicly available genomes within the NCBI database and aligned using Mafft v7.221 [92]. A maximum-likelihood tree was inferred using FastTree v2.1.7 [93] under the WAG + GAMMA model based on alignment positions containing a residue within ≥90 % of sequences (36,713 positions). Bootstrap support was derived from 100 replicates.
Individual gene trees were generated using FastTree v2.1.7 [93] implemented within Mingle v0.0.15 (https:// github.com/Ecogenomics/mingle) utilizing the BLAST [94] workflow, identifying homologs within the IMG database [95]. Default Mingle settings used for all genes except for oxalyl-CoA decarboxylase where percent identity was increased to 40 % due to an alignment length of <100 amino acids with default settings. Bootstrap support was derived from 100 replicates, also implemented within Mingle. Phylogenetic diversity scores for susC and susD gene trees were calculated using the picante R package [96]. All trees were visually inspected using ARB v6.0.2 [97].
Carbohydrate-active enzymes (CAZy) were identified using hmmscan from HMMER v3.1b2 [101] against the dbCAN database v4 [107]. Percentage abundance of each CAZy category was assessed against the total number of genes with CAZy annotation within each genome. Signal peptide sequences within CAZy annotated genes were predicted using SignalP v4.1 [31] and LipoP v1.0 [108] with a margin cutoff of 4 for LipoP. Prediction of membrane positioning was based on the presence of a transmembrane domain (SignalP) or a type II signal peptide (LipoP).

Differential abundance comparisons
Differential abundance of annotated features (COG, KO, CAZy) was analyzed using DESeq2 R package [109] based on count data. Heatmaps were generated using pheatmap [110] following variance stabilizing normalization of data by DESeq2. Indicator annotations were identified using the labdsv R package [111]. PCA plots were created using the vegan R package [112].