The Mouse Gastrointestinal Bacteria Catalogue enables translation between the mouse and human gut microbiotas via functional mapping

Summary Human health and disease have increasingly been shown to be impacted by the gut microbiota, and mouse models are essential for investigating these effects. However, the compositions of human and mouse gut microbiotas are distinct, limiting translation of microbiota research between these hosts. To address this, we constructed the Mouse Gastrointestinal Bacteria Catalogue (MGBC), a repository of 26,640 high-quality mouse microbiota-derived bacterial genomes. This catalog enables species-level analyses for mapping functions of interest and identifying functionally equivalent taxa between the microbiotas of humans and mice. We have complemented this with a publicly deposited collection of 223 bacterial isolates, including 62 previously uncultured species, to facilitate experimental investigation of individual commensal bacteria functions in vitro and in vivo. Together, these resources provide the ability to identify and test functionally equivalent members of the host-specific gut microbiotas of humans and mice and support the informed use of mouse models in human microbiota research.

In brief Beresford-Jones et al. find that while less than 3% of bacterial species are shared between human and mouse gut microbiotas, they can identify the closest functionally related species between these host-specific microbiotas using the bacterial genome catalog they developed. This will facilitate translation of microbiota-related research between humans and mice.

INTRODUCTION
The mammalian gastrointestinal tract hosts trillions of bacteria, known as the gut microbiota, that actively impact the health of the host. Variations in this bacterial ecosystem are associated with susceptibility to and outcomes of many human diseases (Armour et al., 2019), from adverse nutritional states (Chen et al., 2021;Sonnenburg and B€ ackhed, 2016) and autoimmunity (Li et al., 2018) to neurological pathologies (Cryan et al., 2020) and infection (Libertucci and Young, 2019). In order to characterize these microbial associations for therapeutic benefit, it is necessary to establish causal relationships between microbial factors and phenotypes (Neville et al., 2018). To this end, mouse models are essential tools in microbiota research, allowing controlled experimental studies in a physiologically and genetically tractable system.
A recognized limitation in using mice to study the human gut microbiota is that few bacterial species are shared between the gastrointestinal tracts of humans and mice (Chung et al., 2012;Xiao et al., 2015). In addition, many mouse-derived spe-cies remain unidentified or uncharacterized, hindering translation of microbiota research between hosts. One approach that has been used to improve the utility of mice for human microbiota research is the colonization of mice with human microbiotas (Park and Im, 2020). However, while it has been reported that the microbiotas of humans and mice are functionally comparable (Liu et al., 2020;Xiao et al., 2015), mice treated with a human microbiota exhibit compromised immune maturation and performance compared with mice harboring a mouse-derived microbiota (Chung et al., 2012;Lundberg et al., 2020), resulting in differences in susceptibility to infectious (Chung et al., 2012) and inflammatory (Surana and Kasper, 2017) diseases. This may be due in part to an inability of some human microbes to colonize well in the mouse gut (Aluthge et al., 2020;Lundberg et al., 2020). It is, therefore, likely that studying commensal bacteria in their endogenous hosts is the most physiologically valid approach. This requires the ability to identify functionally equivalent species between host microbiotas to perform mechanistic experimental studies in mice and translate microbiota findings between humans and mice. comprehensive and complete genomes (Meziti et al., 2021;Shaiber and Eren, 2019), and both culture-dependent and culture-independent approaches have been implemented to characterize the human Forster et al., 2019;Lagier et al., 2016;Nayfach et al., 2019;Pasolli et al., 2019;Poyet et al., 2019;Zou et al., 2019) and mouse (Lagkouvardos et al., 2019Lesker et al., 2020;Liu et al., 2020) gut microbiotas. Cultured isolates provide highly complete reference genomes and are essential for experimental validation of correlative findings (Neville et al., 2018;Surana and Kasper, 2017). Owing to current challenges in culturing the complete diversity of the gut microbiota, culture-independent approaches, most commonly metagenome-assembled genome (MAG) reconstruction from shotgun metagenome samples, have been leveraged to generate genomes for uncultured species and improve coverage in microbiome analyses (Forster et al., 2019). Gene catalogs have also been generated for improving metagenome analyses Qin et al., 2010;Xiao et al., 2015;Zhu et al., 2021); however, as these catalogs do not link genes to their genomes of origin, they cannot be used for genome-resolved functional analyses (Almeida et al., 2021).
Here, we complement high-throughput culturing of mouse gut bacteria with large-scale MAG synthesis to produce a comprehensive isolate and genome repository, the Mouse Gastrointestinal Bacteria Catalogue (MGBC). The MGBC includes 26,640 non-redundant, high-quality bacterial genomes representing 1,094 species and a publicly deposited collection of 223 cultured isolates for 132 species, including 62 species with no previously cultured representative. Through species-resolved functional mapping of the gut microbiotas of humans and mice, this resource enables the identification of the closest functionally related bacterial species between hosts and provides access to the taxonomic locations of shared functions, allowing the application and investigation of microbiota discoveries between hosts.

RESULTS
Expanding the cultured diversity of the mouse gut microbiota Bacterial culturing and isolate biobanking are essential for research and biotechnology applications by enabling the validation of in silico findings and investigation of underlying biological mechanisms. Nearly 10,000 isolate genomes exist for humanderived commensal bacteria (Almeida et al., 2021;Poyet et al., 2019), but fewer than 400 have been published for the mouse. To begin to address this disparity, we performed highthroughput culturing of feces from 30 conventionally housed, specific-pathogen-free (SPF) laboratory mice from the Wellcome Sanger Institute. We cultured and whole-genome sequenced 288 bacterial strains, of which 276 genomes passed our stringent quality control criteria to be included in our mouse culture collection (MCC; Figure 1A; Table S1). These isolates represent 132 species across 67 genera, 25 families, and 6 phyla and include 62 taxa with no previously cultured murine representative (i.e., they share <95% average nucleotide identity (ANI) with the closest related genome). The majority of these additional cultured bacterial species belong to the phylum Firmicutes_A (54/62), with the remaining isolates from the Bacteroidota (4/62), Firmicutes (3/62), and Desulfobacterota (1/62) phyla (Figure 1B). These previously uncultured isolates increase the cultured microbial diversity of the mouse gut microbiota by 32.5% at the species level and provide the first cultured representatives for 10 genera, as well as the first mouse-derived representative of the phylum Desulfobacterota. In order to establish the prevalence of these species, we compiled 2,398 publicly available and sequenced 48 new mouse gut metagenomes, representing mice from 63 institutes across 17 countries (Table S2). Using this global compilation, we found that 20 of our additional species are present in more than 80% of samples, while 39 were present in over 50% ( Figures 1B and 1C), indicating that our novel cultured isolates represent highly prevalent species of the mouse gut microbiota. To make these bacteria easily accessible to the scientific community, the 223 isolates that were successfully cryopreserved, are currently available upon request from the Wellcome Sanger Institute and are additionally being deposited with the German Collection of Microorganisms and Cell Cultures (DSMZ).
In addition, we compiled 319 publicly available mouse-derived bacterial isolate genomes from the NCBI, including genomes from the mouse intestinal Bacterial Collection (miBC, n = 53)  and the mouse Gut Microbial Biobank (mGMB, n = 120) (Liu et al., 2020), among other studies (Table S1). Following genome quality filtering, 288 public genomes were combined with our MCC to yield 564 cultured isolate genomes, representing 253 species that we included in our MGBC genome repository.

Construction of the MGBC
Multiple studies have resulted in over 100,000 high-quality MAGs for species of the human gut microbiota (Almeida et al., 2021), but fewer than 8,000 high-quality MAGs have been generated for the mouse (Lesker et al., 2020). To facilitate translation between the human and mouse gut microbiotas, we sought to construct a comprehensive, genome-resolved catalog for mouse gut bacterial species that would allow species-level functional comparisons with the gut bacteria of human hosts. As highly complete genomes are required for accurate functional analyses and identification of functionally equivalent species, we first compared binning methods for MAG generation using metrics of genome quality and completeness (Parks et al., 2015). Compared with using MetaBAT2, MaxBin2, or CONCOCT alone, the MetaWRAP pipeline (Uritskiy et al., 2018) yielded bins with the highest quality scores ( Figure S1A) and generated the highest quality MAGs for the most species (Figures S1B and S1C). Considering the most dominantly binned species of the mouse gut microbiota, MetaWRAP consistently yielded the highest quality MAGs ( Figure S1D) Figure S1E). Based on these results, we employed MetaWRAP to generate MAGs for the MGBC.
To produce a globally representative genome collection, we assembled and binned publicly available mouse gut metagenomes from the European Nucleotide Archive (ENA) in addition to the 48 new fecal samples that were sequenced from mice at the Wellcome Sanger Institute and the Babraham Institute. Of these metagenomes, 2,286 samples yielded R1 high-quality MAGs per sample; these samples represented 71 studies and 13 strains of mice from 58 institutes across 17 countries and 4 continents. In total, these efforts yielded 64,490 medium-plus quality MAGs (Table S3), of which 35,361 were defined as high-quality in line with previous studies (Almeida et al., 2021;Pasolli et al., 2019). To remove virtually identical genomes from our collection, we dereplicated our isolate genomes and MAGs using a Mash distance (Ondov et al., 2016) of 0.001 (equivalent to 99.9% ANI) to yield 26,640 high-quality, non-redundant bacterial genomes for the MGBC (Figure 2A; Table S3). Of all the species cultured in the MCC (n = 132), 74.2% are represented by at least one high-quality MAG. Isolate genomes were significantly more complete than   where a genome could not be assigned at species level, lowest taxonomic rank is indicated. Labels are colored by phylum, and the outer ring indicates genomes with no previously cultured representative. Tree distances were calculated from an alignment of 120 core genes using the BLOSUM45 amino acid similarity matrix. (B) Abundance and prevalence profiles for the 62 previously uncultured species of the MCC based on 2,446 mouse gut metagenomes. Each datapoint represents the percentage of reads assigned to a species for a single sample. Prevalence is calculated as the percentage of samples with species abundance R0.01%. Colors represent phyla. (C) Scatterplot of mean abundance against prevalence for all 132 species of the MCC. Color represents prior cultured status.

A B C
Phylum composition (%) Figure 2. Genomes of the Mouse Gastrointestinal Bacteria Catalogue (A) Maximum likelihood tree of representative genomes for the 1,094 species of the MGBC. Color range indicates whether a species cluster is represented by MAGs only (light red), isolates only (light green), or both (light blue). For each species, the innermost color ring represents phylum, the second ring indicates species that could not be assigned at a species level by GTDB-Tk (dark blue), the third ring denotes cultured status of each species (blue), and the outer ring indicates the 62 species that have been uniquely cultured in the MCC (brown). The circumferential bar plot (green) illustrates the number of high-quality genomes representing each species in the MGBC. Tree distances were calculated from an alignment of 120 core genes using the BLOSUM45 amino acid similarity matrix.  (Zhu et al., 2021). The MGBC represents 1,094 species of the mouse gastrointestinal tract across 16 phyla ( Figure 2A). Only 23.1% of these species have cultured representatives, of which 24.5% are contributed uniquely by our MCC. Only 23.5% of species could be assigned to a species-level taxonomy by the Genome Taxonomy Database Toolkit (GTDB-Tk) (Chaumeil et al., 2019), with 61.3% unassigned at the species level, 14.0% at the genus level, and 1.2% at the family level. Species of the Cyanobacteria, Spirochaetota, Firmicutes_B, Elusimicrobiota, and Firmicutes_C were represented entirely by taxa unassigned at the species level ( Figure 2B). In contrast to the MCC, the MGBC is highly representative of the average mouse microbiome at the phylum-level ( Figure 2C). These findings underpin our use of both isolates and MAGs to produce a comprehensive high-quality genome catalog of the mouse gut microbiota to provide a framework for species-resolved functional comparisons.
Benchmarking the MGBC against current mouse microbiota resources To maximize the reliability and accuracy of our analyses, we curated only our high-quality genomes into the MGBC ( Figures  3A and 3B). To benchmark our resource against currently available mouse microbiota resources, we compared the genomes of the MGBC with the high-quality genomes of the integrated mouse gut metagenome catalog (iMGMC) (Lesker et al., 2020). The 8,509 high-quality genomes of the iMGMC represent 805 species, of which 670 (83.2%) are shared with the MGBC (Figure 3C). Comparing the representative genomes for these shared species, 71.8% of species representatives from the MGBC had a higher quality score than their equivalent in the iMGMC ( Figure 3D), while 14.6% were equal in quality. The MGBC achieves significantly higher levels of metagenome read classification of independent samples, classifying 90.5% of reads on average, 8.4% more than with the iMGMC, and 56.9% more than with the medium-plus quality co-abundance gene groups (CAGs) of the MGCv1  ( Figure 3E). The MGBC represents an additional 407 species compared with the iMGMC, expanding the known taxonomic diversity of the mouse gut microbiota by 77 genera (30.6% increase), 25 families (31.8% increase), 15 orders (34.1% increase), 3 classes (15% increase), and 2 phyla (12.5% increase). Therefore, this resource, provides improved coverage and quality of representation of the mouse gut microbiota compared with the previous mouse microbiota resources, better facilitating species-resolved functional analyses and improving overall characterization of the mouse gut microbiota.
Previous studies have performed taxonomic profiling of the mouse gut microbiota using alternative methods and/or smaller sample sizes (Lesker et al., 2020;Xiao et al., 2015). Therefore, we applied the MGBC to generate species-resolved taxonomic analyses for the compilation of 2,446 global mouse gut metagenome samples used above. The abundance and prevalence profiles of mouse gut bacterial species are available in Table S4. In line with previous reports (Lesker et al., 2020), species of the phylum Firmicutes_A represented 15 of the top 20 most prevalent species in the mouse microbiota ( Figure S2A), while CAG-485 sp002362485 (recently proposed as ''Sangeribacter muris'' [Forster et al., 2021]) was the most prevalent species. As variation in the mouse gut microbiota is recognized as a confounding factor of mouse studies (Baker, 2016;Forster et al., 2021;Stappenbeck and Virgin, 2016), we assessed the effect of host genetic and environmental factors on compositional variation in the mouse gut microbiota ( Figure S2B; Table S5). The study itself was the most impactful factor, explaining 40.1% of the variance in microbiota composition ( Figure S2B), followed by the institute in which the study was performed (38.0%). We next assessed the microbiota of C57BL/6 ''control'' mice from different institutes ( Figure S2C). These institutes exhibited large differences in abundance of key microbial species, including known phenotypically important bacterial species such as Bacteroides thetaiotaomicron and CAG-485 sp002362485. We additionally considered the differences between the microbiotas of laboratory and wild mice. Species prevalence between laboratory and wild mouse cohorts correlated strongly ( Figure S2D) although species were found to be differentially abundant (Figure S2E). Taken together, these enhanced analyses using the MGBC indicate that while institutional housing environments represent a substantial source of variation and may in part explain the irreproducibility observed between mouse studies, the microbiota of laboratory mice is more similar to that of wild mice than previously suggested (Rosshart et al., 2019(Rosshart et al., , 2017. Taxonomic versus functional relationships between human and mouse gut bacteria Gene cluster analyses have estimated that taxonomic overlap between the human and mouse gut microbiotas is below 4% . Using a whole-genome approach, we compared the MGBC with the high-quality genomes of the Unified Human Gastrointestinal Genome (UHGG) collection (Almeida et al., 2021) to reveal that only 2.58% (103/3997) of species are shared between the human and mouse gut microbiotas ( Figure S3A). Of these shared species, 93.2% could be assigned a species rank by GTDB-Tk, and 55.3% have cultured representatives within the MGBC (Table S6).
While very few species are shared, it has previously been estimated using gene-level functional analyses that up to 95% of gut microbial functions are shared between humans and mice . To ascertain the degree of functional overlap between the human and mouse microbiotas, we created genome-linked protein catalogs from all high-quality, non-redundant genomes of the MGBC and UHGG. Using these catalogs, we assembled pangenomes for each species cluster of the MGBC and UHGG and annotated predicted protein-coding sequences using InterProScan (Jones et al., 2014) and eggNOG (Huerta-Cepas et al., 2019, 2017. In line with previous estimates, 84.5% of KEGG orthology (KO) groups and 82.1% of InterPro (IPR) protein families are shared between the gut ecosystems of the two hosts ( Figure S3B). However, due to the currently incomplete functional reference databases (Thomas and Segata, 2019), only 53.2% of proteins could be assigned a KO, and notable differences were observed in the annotation efficiency between phyla ( Figure S3C) as well as between host organisms ( Figure S3D). Taken together, these analyses confirm that while functionality is largely conserved, the taxonomic locations of these shared functions likely differ between mice and humans. However, previous resources have not been sufficient to map these taxonomic locations between hosts.
To begin to resolve these limitations for translating microbiota findings between humans and mice, we next quantified functional distances between species pangenomes using a combination of functional annotation schemes (KEGG [Kanehisa MAGs were defined as high-quality (blue) (R90% completeness, %5% contamination, metrics of genome fragmentation). Quality estimates were generated using CheckM. (B) Phylum-level distribution of high-quality and medium-plus MAGs. (C) Upset plot illustrating the intersections of species between the contributing isolate collections and MAGs of the MGBC (blue). The iMGMC has been included for comparison (gray). (D) Comparison of representative genome quality for shared species between the MGBC and iMGMC. Genome quality score: QS = Completeness À 5 3 Contamination. Color represents phylum. (E) Read classification rates of 64 independent mouse gut metagenome samples using different custom Kraken2 databases. Box plot color indicates the origins of the genomes used to build each database. Only genomes meeting high-quality criteria were used to build databases, except where indicated (purple). miBC, n = 43; mGMB, n = 100; public (combination of all mouse gut-derived isolates from NCBI), n = 288; MCC, n = 276; MCC+public, n = 564; MGCv1, n = 239; iMGMC, n = 8,509; MGBC, n = 26,640; mq iMGMC, n = 18,306; mq MGBC, n = 65,907; NCBI (standard database), n = 97,603; human (representative genomes of the UHGG), n = 3,006. Significance was determined for selected comparisons using paired t tests, ****p < 0.0001. et al., 2017], MetaCyc [Caspi et al., 2016], InterPro, CAZy [Lombard et al., 2014], and GO [Ashburner et al., 2000]) and examined the functional similarities between species in the context of their taxonomic relationships. Interpangenome functional and taxonomic relationships were significantly conserved at a broad phylogenetic scale ( Figure 4A). For each human bacterial species, we compared the identity of the closest mouse taxonomic species with the closest mouse functional species and found that these were only the same taxon in 47.0% of cases. We next stratified these analyses by the shared taxonomic rank of the closest mouse taxonomic species ( Figure 4B). Where the closest taxonomic mouse species was assigned to the same species as the human species (i.e., shared species), the closest functional mouse species was the same taxon in 99% of cases. The only exception was human Phocaeicola dorei (formerly Bacteroides dorei), a shared species in the mouse gut microbiota, which was functionally closer to mouse Phocaeicola vulgatus (formerly Bacteroides vulgatus) than mouse Phocaeicola dorei. The functional pathways that underly these differences include bacterial cell wall biosynthesis, pyocyanin biosynthesis, and vitamin B12 biosynthesis pathways (Table S10).
Where the closest taxonomic mouse species belonged to the same genus (e.g., human Bifidobacterium infantis and mouse B. globosum), the closest functional mouse species was only the same taxon in 57.5% of cases, dropping to 37.2% and 31.7% when the closest taxonomic mouse species was shared at the family-and order-level, respectively ( Figure 4B). Functional distances increase as a function of taxonomic distance ( Figure 4C), suggesting that the closest functional species at higher taxonomic ranks are not likely to be functionally identical but rather may represent the most likely species for recapitulating associated functions or phenotypes of interest. Divergent taxonomy-function relationships are not confined to any particular taxonomic clade but are present in every phylum of the gut microbiota ( Figure 4D). These findings indicate that the closest taxonomic neighbor, at every taxonomic rank, is not necessarily the closest related species functionally and, therefore, might not be the best candidate for investigating microbiota functions between hosts.
Taxonomic locations of drug metabolism genes in the microbiotas of mice and humans Global functional comparisons between species of the human and mouse gut microbiotas may serve as a starting point to translate microbial functions between hosts; however, it is likely more useful to identify functionally equivalent species between hosts at the level of an individual function or gene product. One biologically important example of this is drug metabolism by the gut microbiota, where the drug metabolizing capacity of a species can have implications for drug therapy (Collins and Patterson, 2020;Maini Rekdal et al., 2019). As mice are common models for preclinical pharmaceutical research, we leveraged the MGBC to examine the conservation status and taxonomic location of 34 experimentally validated drug metabolizing genes from the human gut microbiota (Haiser et al., 2013;Maini Rekdal et al., 2019;Ridlon et al., 2013;Zimmermann et al., 2019aZimmermann et al., , 2019b (Table S7). For 27 genes (79.4%), the most dominant species in the human gut microbiota was shared in the mouse, and the same gene product (R95% sequence identity) was found in both human-and mouse-derived species pangenomes ( Figures  5A and 5B). However, in 37% (10/27) of these cases, the most dominant species that encoded these genes in the murine host differed from the most dominant species in humans-for drug metabolizing genes described in human Phocaeicola dorei (formerly Bacteroides dorei), the most dominant gene-encoding species in the mouse microbiota is Phocaeicola vulgatus (formerly Bacteroides vulgatus) ( Figure 5B). Notably, our global species-resolved functional analyses had identified mousederived Phocaeicola vulgatus as the closest functional species to human-derived Phocaeicola dorei, instead of the more closely, taxonomically related mouse-derived Phocaeicola dorei.
Where the human-derived species was not shared with the mouse gut microbiota (7/34), an equivalent gene product could not be located in the high-quality genomes of the MGBC (Figure 5C), even at 50% sequence identity. While these analyses indicate that homologous gene products are unlikely to be encoded by dominant species of the mouse gut microbiota, it is possible that homologous genes might be encoded by subdominant species that have been sequenced with insufficient coverage to generate high-quality bins. Therefore, we searched the medium-quality genomes of our collection for these genes. Although no genes with R95% sequence identity were identified, we found hits for four drug metabolizing genes with R50% identity (Table S7). While experiments using cultured isolates are necessary to validate these in silico predictions, the MGBC provides the species-resolved taxonomic locations of functions of interest that enable further functional and phenotypic studies.
Validating butyrate synthesis by different human and mouse microbes Butyrate-producing species are also functionally important members of the human gut microbiota and are associated with clinical outcomes of diseases as diverse as inflammatory bowel disease (Parada Venegas et al., 2019) and depression (Caspani et al., 2019). In addition, butyrate is involved in regulating host metabolism (Donohoe et al., 2011), sleep (Szentirmai et al., 2019), and healthy cognitive functioning (Silva et al., 2020) and is important for induction of peripheral T regulatory cells (Atarashi et al., 2011;Furusawa et al., 2013). Faecalibacterium prausnitzii, a canonical butyrate-producing species, has been associated with improved outcomes for inflammatory bowel disease (Carlsson et al., 2013;Sokol et al., 2008), but it does not colonize well in the mouse gut (Aluthge et al., 2020;Lundberg et al., 2020), limiting the utility of mouse models to study these clinical associations. To demonstrate how the MGBC can be leveraged to potentially overcome such challenges, we identified the equivalent butyrogenic species of the mouse gut microbiota.
Butyrate is synthesized from dietary fiber or from amino acids such as glutamate and lysine, culminating in the conversion of butyryl-CoA to butyrate via either butyrate CoA-transferase (BCOAT; direct pathway) or a phosphotransferase pathway (PTB/BUK; indirect pathway) (Louis and Flint, 2017) (Figure S4A). To identify the butyrate-producing species of the mouse gut microbiota and compare the taxonomic locations of this function with the human gut ecosystem, we ranked species according to the number of genomes encoding terminal pathway genes between hosts ( Figures 6A and 6B (right) relationships between all species of human and mouse gut microbiota. Each data point represents a single species cluster, and point color denotes phylum. Functional analyses use Jaccard distances between pangenomic functional profiles of each species. Taxonomic distances represent phylogenetic branch lengths between species calculated from alignment of 120 core genes. Distance matrices used for ordination were compared using the Mantel test (r = 0.7416, p = 0.001).
(B) Taxonomy-function relationships between human-and mouse-derived bacterial species, stratified by shared taxonomic level. Bars indicate distribution of shared taxonomic rank between closest taxonomically related species. Colored bars and bar statistics indicate number and percentage, respectively, of paired species at each rank where the closest functionally related species is the same taxon as the closest taxonomic relative.
(C) Scatterplot comparing taxonomic distance with functional distance for each human-derived species and the closest taxonomically related mouse-derived species. Color indicates the shared taxonomic rank between these species. (D) Inverted maximum likelihood tree of the 4,100 species of the human and mouse gut microbiotas. External branches represent phylogenetic relationships between representative genome of each species. Internal connections illustrate closest functionally related species between hosts. Connections are only shown when the closest taxonomically and functionally related taxa differ. Clade color represents phylum of each species, and the inside color bar denotes the host.
Color of internal connections indicates shared taxonomic rank of the closest functionally related species.
from the human gut ( Figure 6A, top), while members of the Coprococcus genus were among the top 20 PTB/BUK-encoding species ( Figure 6A, bottom). Importantly, all the most dominant butyrate-producing species from both host organisms were host-specific. While no human butyrate producers were previously uncharacterized as determined by GTDB, in the mouse, 17 of the top 20 BCOAT-encoding species (85%) and 11 of the top 20 PTB/BUK-encoding species (55%) could not be assigned to a species-level taxonomy ( Figure 6B). As 94.2% of predicted butyrate-producing species belong to the Firmicutes_A phylum, we considered the taxonomic locations within this phylum of the butyrate terminal pathway genes between hosts ( Figure 6C). Both pathways are largely conserved within the same taxonomic clades between hosts ( Figure 6C); however, the most dominant species in each host for each pathway do not represent the closest species phylogenetically, suggesting that there are likely to be host-specific niche factors affecting species dominance.
Cultured isolates are essential for validating in silico functional predictions from genomic data. Utilizing our MCC, we identified isolates that were predicted to produce butyrate via each terminal pathway according to genomic functional annotations. We selected three isolates for each pathway that ranked in the top 30 most abundant predicted butyrate-producing species (Figure S4B) based on the species abundance profiles generated from 2,446 mouse gut metagenomic samples (Table S2). We then tested the capacity of these isolates to synthesize butyrate in broth monoculture. Isolates for known butyrate-producing species from the human gut microbiota, Coprococcus eutactus and Eubacterium rectale (Sorbara et al., 2020), were included as positive controls, and an isolate from the MCC that lacked predicted butyrate terminal pathway genes was included as a negative control. All predicted isolates produced butyrate (Figure 6D; Table S9), although Kineothrix sp000403275 grew poorly in broth culture (OD < 0.1) and, therefore, exhibited only a lowlevel increase in butyrate compared with the negative control isolate Lachnospiraceae_NOV MGBC000113. Together, these findings demonstrate that the MGBC and MCC resources we have generated enable identification and experimental investigation of functionally equivalent gut bacterial species between hosts.

DISCUSSION
In this study, we generated the MGBC as a resource for translating microbiota findings between the host-specific microbiotas of humans and mice by providing access to the taxonomic locations of functions of interest in the gut microbiotas of both hosts. We demonstrated the utility of our resource with butyrate synthesis and drug metabolism as applied examples; however, these represent just two of many biologically and medically important functions of the gut microbiota. For example, bile acid metabolism is largely divergent between human and mouse gut microbiotas and has been shown to play a key role in susceptibility to enteric infection (Buffie et al., 2015). The ability to further explore this metabolic pathway in mouse models could yield important mechanistic insights for targeted therapeutics. The MGBC is therefore accompanied by a toolkit (https://github. com/BenBeresfordJones/MGBC-Toolkit) to allow users to query any function of interest. It provides users the ability to look for functions via annotation ID (KEGG, InterPro, COG, eggNOG, and GO) and the ability to screen for gene products at the sequence level, facilitating identification of nonannotated genes and predicted functions of interest. Our bacterial catalog also enables improved correlative analyses of mouse gut metagenomes through increased coverage of the mouse gut microbiome and expands the study of causation in the mouse gut microbiota through increased availability of cultured isolates. Our toolkit and analyses improve our understanding of the taxonomy-function relationships of bacterial species from the microbiotas of different hosts and reveal that the closest taxonomically related species is not necessarily the most functionally equivalent.
Understanding the functional potential encoded within microbial genomes is key to understanding the implications of taxonomically divergent microbiotas between host organisms (Thomas and Segata, 2019). Our results, as well as those of other studies (Almeida et al., 2021), indicate that over 40% of predicted proteins have no representation in functional reference databases. These functional unknowns limit the resolution of interhost functional analyses and remain one of the main limiting factors for the microbiome field at large (Thomas and Segata, 2019). In order to partly mitigate these limitations and using

OPEN ACCESS
Resource drug metabolism genes as examples, we used taxonomically contextualized protein catalogs to perform sequence-level interhost functional comparisons for nonannotated functions. However, this approach does not enable prediction of functional capacity from genomic data. Future studies combining genomelevel data with experimental functional characterization, enabled by large-scale genome and isolate catalogs (Sorbara et al., 2020), will therefore be required to uncover this unexplored functionality. Notably, nearly 77% of species represented in the MGBC lack a cultured representative, hindering experimental validation of associated phenotypes and functions. Genomic insights can be used to facilitate targeted culturing techniques and improve the representation of cultured species (Browne et al., 2016;Cross et al., 2019). Future studies that increase cultured diversity will also be highly valuable to advancing the therapeutic potential of the microbiota field. Highly complete genomes with low levels of contamination are required for accurate functional analyses, necessitating the use of stringent quality control thresholds and validated binning methods when curating genome catalogs. In addition, a consistent level of genome quality is required to combine both isolate genomes and MAGs in analyses (Almeida et al., 2021). However, as lower quality genomes, often representing minor or rare species of the gut microbiota (Pedron et al., 2019), are excluded from these analyses, there is a loss of covered taxonomic and functional diversity, which is then inaccessible for comparison. While minor species of the gut microbiota may not be sequenced with sufficient coverage to facilitate complete genome binning, future microbiome studies implementing long-read or deeper sequencing and improved binning methods will be useful for expanding the known taxonomic and functional diversity of the mouse gut microbiota. In addition, standard tools for estimating genome completeness exhibit bias against certain taxonomic clades and represent technical challenges in the compilation of fully comprehensive genome collections. For example, genomes of the large clade of uncultured bacteria, candidate phyla radiation (CPR), cannot be assigned a high-quality genome status using current single copy number gene methods (Brown et al., 2015). Studies aiming to address these technical limitations, such as the recently published Genome UNClutterer (GUNC) tool (Orakov et al., 2021), will drive further improvements to the comprehensiveness and quality of genome catalogs.
Bacteria represent the most dominant members of the mouse gut ecosystem, contributing up to 90% of the genetic material in metagenomic samples according to our analyses; however, other domains including fungi, microeukaryotes, and viruses remain understudied in the context of host-specific microbiotas. It is likely that these biomes will contribute additional functional and phenotypic complexities to modeling human disease in mice, and further study of these domains may yield additional avenues for clinical intervention. Targeted microbiota therapies, live bacterial therapeutics, rationally designed drugs, and microbial biomarkers of disease are just some of the clinical developments that microbiota research aims to deliver to improve outcomes of human health and disease. Mouse models will continue to play a central role in this research, and therefore, the isolate, genome, and protein catalogs presented in our study represent important developments in understanding and tackling the obstacles posed by host-specific microbiota and provide a starting point toward efficient and informed translation of gut microbiota research between humans and mice.

STAR+METHOD
Detailed methods are provided in the online version of this paper and include the following:

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources, reagents and software should be directed to and will be fulfilled by the lead contact, Virginia A. Pedicord (vap33@cam.ac.uk).

Materials availability
Isolates of additional culturable species generated in this study are being deposited at the Leibniz Institute DSMZ-German Collection of Microorganism and Cell Cultures (DSMZ), and accession numbers are available under https://github.com/BenBeresfordJones/ MGBC. Isolates generated in this study that have not been deposited at DSMZ will be made available without restriction upon request. The MGBC-Toolkit is available at https://github.com/BenBeresfordJones/MGBC-Toolkit. All code produced by this project has been additionally deposited in Zenodo, and DOIs are provided in the STAR methods. d Any additional information required to reanalyse the data reported in this paper is available from the lead contact upon request.

EXPERIMENTAL MODEL AND SUBJECT DETAILS
Faecal samples from mice Mice were maintained under specific pathogen-free conditions at a Home Office-approved facility in accordance with the United Kingdom Animals (Scientific Procedures) Act of 1986. Faecal samples from 30 mice aged between 5 to 8 weeks were used to generate the Mouse Culture Collection. We used both male and female representatives from 10 different mouse colonies at the Wellcome Sanger Institute. Colony genotypes are detailed in Table S8. For shotgun metagenome sequencing, faecal samples were obtained from 48 8-week-old male C57BL/6N mice.

Bacterial culture
Fresh faecal samples were collected from mice into sterile 1.5mL Eppendorf tubes using aseptic technique. Sample processing and culturing were performed under anaerobic conditions (80% nitrogen, 10% carbon dioxide, 10% hydrogen) in a Don Whitley A95 anaerobic workstation. Faeces were homogenised in sterile, pre-reduced PBS (100mg/mL) and a 10-fold 1:10 dilution series performed. 200mL of each dilution was plated onto pre-reduced 140mm agar plates and incubated at 37 C. A range of agars were employed to maximise culturing yields and diversity (Table S8). After 2 days, individual colonies were picked and re-streaked onto fresh plates. This was then repeated until purity was achieved. Colonies were identified using 16S rRNA gene sequencing. Single bacterial colonies were scraped into 2mL screw cap tubes containing glass beads (acid-washed 425-600mm) and 500mL sterile PBS, and then physically lysed by 30 seconds shaking at speed 6.0 using a FastPrep Instrument (MPBio). After centrifugation at 14,000rpm for 5 minutes, 1mL of supernatant was taken to carry out a 16S PCR using the standard 7F and 1510R bacterial primers (Browne et al., 2016) and GoTaq Hot Start reagents (Promega). PCR products were sequenced by an external supplier (Eurofins Genomics) and mothur (Schloss et al., 2009) used to align the resulting sequences and create OTUs representing clusters of R97.8% sequence identity. For each OTU, a single sequence was taxonomically classified using NCBI BLAST (Johnson et al., 2008) and a single isolate selected for further culturing and whole-genome sequencing. 10mL of BHI or YCFA broth was inoculated for each new isolate identified and left to grow overnight. 500mL of the overnight culture was mixed with 500mL of 50% glycerol in a cryotube (performed in quadruplicate) and these were frozen at À80 C. The remaining overnight culture was centrifuged at 4,000rpm, and the cell pellet then washed with 10mL sterile PBS. For isolates that did not grow in broth, 2mL sterile PBS was pipetted onto mono-inoculated agar plates and colonies were dissolved using a bacterial scraper. Plate supernatants were then used in place of overnight cultures. Genomic DNA was extracted from the washed pellet using the 'MasterPure Complete DNA and RNA Purification Kit' (Lucigen). Genomic DNA was kept at 4 C until sequencing. While 276 strains were cultured to purity and sequenced, only 223 were subsequently recovered successfully from cryopreservation for banking at the Wellcome Sanger Institute and DSMZ.

METHOD DETAILS
Curation of public mouse gut-derived isolate genomes To build a comprehensive genome collection, we curated 319 publicly available mouse-derived isolate genomes from the ENA, including the genomes of previously published mouse gut isolate collections (Lagkouvardos et al., 2019Liu et al., 2020). As genome assemblies for the isolates of the miBC  had not been made available, we assembled these from raw reads according to our standard genome processing pipeline. Only genomes that passed our quality control thresholds (n=288; ''Quality control of genome assemblies'') were included in the analyses for this study. The metadata for these public mouse isolate genomes are included in Table S1.

Curation of public human gut-derived bacterial genomes
To perform taxonomic and functional analyses between mouse and human gut bacterial species, we curated 204,939 non-redundant human gut microbial genomes from the Unified Human Gastrointestinal Genome (UHGG) catalogue (Almeida et al., 2021). We applied the same quality control and taxonomy assignment pipelines to the UHGG genomes as with the MGBC. In total, 100,456 non-redundant, high-quality human gut microbial genomes, representing 3,006 species, were curated for comparison with the MGBC. Metadata for these genomes are provided in Table S3.

Mouse gut metagenome cohort curation
To create a comprehensive mouse gut-derived metagenome catalogue for MAG synthesis and quantification of species global abundance and prevalence profiles, we utilised the Advanced Search functionality of the ENA (Harrison et al., 2021) to identify all WGS raw read samples with an NCBI Taxonomy metadata value of ''mouse gut metagenome'' (taxid: 410661; last accessed February 2021). In total, 8,418 samples were identified which were then manually assessed according to our exclusion criteria. We additionally performed a review of the available literature to identify further studies that our ENA search might have overlooked due to constraints with manual metadata entry. Samples were technically excluded if they were unpublished, i.e., where no publication listed the study accession number, or if they represented 16S rRNA amplicon sequencing datasets. Samples were biologically excluded if mice had been exposed to antibiotics, had an active gastrointestinal infection, or had received faecal microbiota transplantation derived from a non-mouse host. Furthermore, we additionally excluded samples that would disrupt the validity of species prevalence and abundance analyses, e.g., following enrichment for viral particles, or deriving from an ex-germfree mouse reconstituted with a simplified microbiota. Details of included and excluded studies are provided in Table S2. Of the included metagenomes, 64 were kept aside as independent samples for read classification analyses. Metadata for these samples are included in Table S2.
These publicly available samples were supplemented with 48 newly sequenced faecal samples from 8-week-old male C57BL/6N mice. In total, 2,913 metagenome samples were used to generate MAGs; data for these samples are supplied in Table S2. 2,446 samples yielded R1 MAG and were used to generate abundance and prevalence profiles of mouse gut microbial species. The metadata for these samples are provided in Table S2.
Whole-genome sequencing and assembly Bacterial genomic DNA was sequenced using the Illumina Hi-Seq Ten platform at the Wellcome Sanger Institute with library fragment sizes of 200-300 bp, a read length of 150 bp and a target read depth of 100x. Annotated assemblies were produced using a previously described pipeline (Page et al., 2016). Briefly, multiple assemblies were generated from sequence reads using Velvet v1.2 (Zerbino and Birney, 2008) and VelvetOptimiser v2.2.5. An assembly improvement step was applied to the assembly with the best N50 (Page et al., 2016), and contigs were scaffolded using SSPACE v2.1.1 (Boetzer et al., 2011) and sequence gaps filled using GapFiller (Boetzer and Pirovano, 2012). Automated annotation was performed using Prokka v1.14.5 (Seemann, 2014).
Faecal sample collection and shotgun metagenomic sequencing Faecal samples were collected directly from C57BL/6N mice using aseptic techniques, and immediately stored at À80 C until DNA extraction. DNA was extracted using the 'FastDNA Spin Kit for Soil' (MPBio) according to manufacturer's instructions and stored at À20 C until metagenomic sequencing. DNA samples were quantified using a Qubit 4 Fluorometer (Thermo Fisher), and samples with R100 ng DNA material proceeded to paired-end (23150 bp) shotgun metagenomic sequencing on the HiSeq 4000 platform.
Workflow for metagenome-assembled genome (MAG) synthesis MAGs were generated using a custom in-house pipeline that leveraged MetaWRAP v1.2.3 (Uritskiy et al., 2018) for single sample assembly, binning and bin refinement. First, metagenomes were quality controlled using KneadData v0.7.3 with default settings. Host reads were removed from samples using the GRCm39 reference genome and Bowtie2 v2.3.5 . In addition, reads were aligned to the phi-X174 genome and removed. MetaSPAdes v3.10.1 (Nurk et al., 2017) was used for the assembly of paired-end samples. In line with other reports, runtime with MetaSPAdes was excessively long (>48 hours) for some samples (Kim et al., 2020). In these cases, and for samples with only unpaired reads, MEGAHIT v1.1.1-2-g02102e1 (Li et al., 2016) was used for assembly. To generate genome bins, we utilised the 'binning' module from MetaWRAP to run MetaBAT2 v2.9.1 (Kang et  Quality control of genome assemblies Both isolate genomes and MAGs were subjected to stringent genome quality criteria to be included in our analyses. Completeness, contamination, and genome fragmentation were estimated using CheckM v1.1.2 (Parks et al., 2015). Genome assemblies with R90% completeness, <5% contamination, maximum contig count %500 (Browne et al., 2016), maximum genome size %8 Mb (Zou et al., 2019), N50 R10,000 kb (Nayfach et al., 2019) and mean contig length R5 kb (Zou et al., 2019), were defined as high-quality genomes, in line with guidelines and previous studies (Bowers et al., 2017;Parks et al., 2017;Pasolli et al., 2019). Any isolate genome that did not meet these thresholds was excluded from analysis. For MAGs, we additionally defined medium-plus quality genomes as those assemblies with R50% completeness, <5% contamination and a quality score R50, where QS = Completeness À ð5 3 ContaminationÞ . This definition exceeds the medium-quality thresholds as defined by MIMAGs (Bowers et al., 2017). Only highquality genomes were included in our analyses, unless otherwise indicated. Data for MAG yields from included samples are provided in Table S2.
Additional quality metrics were calculated to complement the contamination scores generated with CheckM. For the non-redundant, high-quality genomes of the MGBC genome chimerism was quantified using GUNC v1.0.4 (Orakov et al., 2021) and strain heterogeneity was assessed using CMseq v1.0.3 . Quality data for these genomes are provided in Table S3.

Taxonomic classification of genomes and species clustering
To remove redundancy from our collection of high-quality genomes, we used dRep v2.5.4 (Olm et al., 2017) to remove conspecific genomes that shared R99.9% ANI (options: -pa 0.999 -SkipSecondary). We taxonomically classified our genomes using the 'clas-sify_wf' workflow from GTDB-Tk v1.3-r95 (Chaumeil et al., 2019). To generate species clusters for genomes that could not be assigned to a species-level taxonomy using GTDB-Tk, we clustered these genomes at R95% ANI using a two-step genomic distance analysis implemented by dRep (options: -comp 50 -con 5 -pa 0.9 -sa 0.95 -nc 0.6). Previously calculated quality data from CheckM were supplied for each genome with the -genomeInfo flag to reduce computation time. Genomes that shared R95% ANI at 0.6 alignment fraction were considered the same species (Jain et al., 2018;Nielsen et al., 2014;Varghese et al., 2015).
To determine genome representatives for each species cluster, we ranked each genome according to a modified quality score mQS = Completeness À ð5 3 ContaminationÞ + logðN50Þ and used the highest scoring genome from each species as the representative.

Determining prior cultured status of isolates
We inferred the cultured status of our isolates using a two-step approach. First, we compared our isolate genomes to the publicly available mouse-derived isolate genomes of the MGBC. Isolates were considered to be the same species if (1) they were designated as the same species by GTDB-Tk, or (2) they shared R95% ANI across an alignment fraction of R0.6, in the case of a non-specieslevel classification. Next, we searched our isolates that were not represented in the public mouse gut isolate collections against NCBI RefSeq release 205 (O'Leary et al., 2016) using Mash v2.2.2 (Ondov et al., 2016), after which the most similar RefSeq genome to each isolate was then compared using FastANI v1.3 (Jain et al., 2018). As RefSeq excludes metagenome-derived genomes, an isolate was designated as ''previously uncultured'' if it shared <95% ANI with the closest related genome.

Benchmarking binner performance for production of high-quality MAGs
We compared high-quality bins generated from a subset of 2,303 publicly available mouse gut metagenomes by MetaBAT2, MaxBin 2.0 and CONCOCT, as well as the consolidated bins generated by MetaWRAP refinement. Binner performance was compared across four metrics: i. CheckM estimates of genome quality (completeness, contamination, quality score) across all high-quality bins ii. Taxonomic coverage of high-quality bins (i.e., the number of species represented) iii. Average quality score of high-quality bins on a per species basis iv. Core genome conservation by high-quality bins Akkermansia muciniphila (Am), Bifidobacterium globosum (Bg), Ligilactobacillus murinus (Lm), and Lactobacillus johnsonii (Lj) were selected for comparing the conservation of the core genome between binners as these species ranked among the top 10 most commonly binned species across all four binning tools, and each have R50 isolate genomes with which to build a robust baseline core genome. In addition, these species represent three phyla, reducing the potential for any taxonomic bias. Isolate genomes were compiled from RefSeq and the MGBC, and high-quality isolate genomes that were designated as the correct species by GTDB-Tk were annotated using Prokka v1.14.5 with default settings. For each species, Panaroo v1.2.4 (Tonkin-Hill et al., 2020) was used to build an 'isolate-only' core genome and an 'isolate+bins' core genome for each binner using the following options: Àclean-mode strict -core_threshold 0.99. The number of isolate genomes used for each species was as follows: Am, 136; Bg, 62; Lm, 58; Lj, 54. For each species, a standardised number of subsampled bins was used to build the 'isolate+bins' core genome for each binner: Am, 90; Bg, 35; Lm, 150; Lj, 60. To quantify core genome conservation, 100 iterations of bin subsampling and core genome analysis Taxonomic comparison of the mouse and human gut microbiotas For taxonomic comparison, species were considered shared between the UHGG and the MGBC if they were annotated as the same species by GTDB-Tk, or, if they could not be assigned at a species-level, the representative genomes shared R95% ANI across a minimum alignment fraction of 0.6.
To assess global functional overlap between host gut microbiotas, we calculated the number of InterPro (IPR) protein families and KEGG orthology (KO) groups that were shared in total between human and mouse pangenomes.
Taxonomic and functional distance analyses Taxonomic distances were calculated from the human-mouse representative genome phylogeny (''Construction of phylogenetic trees''). Phylogenetic branch lengths between the representative genomes of each human and mouse species were calculated using the 'cophenetic.phylo' function from ape v5.5 (Paradis and Schliep, 2019).
To maximise the resolution of functional comparisons, functional annotations from multiple schemes -including InterPro, KEGG (Kanehisa et al., 2017), MetaCyc (Caspi et al., 2016), CAZy (Lombard et al., 2014), Reactome (Jassal et al., 2020), and Gene Ontology (Ashburner et al., 2000) -were considered together (all functions) in the context of each species pangenome. To facilitate functional distance analyses, pangenome-feature matrices were constructed where each functional annotation was scored according to the fraction of genomes per pangenome that were annotated with that feature. Inter-pangenomic functional distances were then calculated using the Jaccard Index.
A global comparison of the interspecies taxonomic and functional relationships was performed using a Mantel test. The taxonomic and functional distance matrices were ordinated using the 'cmdscale' function, and the two most dominant principal coordinates were visualised using the R packages ggpubr v0.4.0 and ggplot2 v3.3.5.

Functional comparison of Phocaeicola species
To identify the functional pathways that underpin the divergence between human and mouse Phocaeicola dorei (PdH, PdM) and the relative similarity of human P. dorei and mouse P. vulgatus (PvM), InterProScan v5.39-77.0 and Genome Properties v2.0.1 (Richardson et al., 2019) were run on all genomes for these species (PdH,n=1,954;PdM,n=15;PvM,n=177). The tabular outputs were concatenated and the proportion of 'complete pathway'-encoding genomes for each species was calculated for each property. The proportion of property-encoding genomes was then compared between each species pair using two-proportion z-tests with Yates' continuity correction. The data from these analyses are provided in Table S10.

Drug metabolism analyses
The protein sequences of genes validated for drug metabolism by the human gut microbiota from four independent studies (Table S7) were accessed via UniProt (The UniProt Consortium, 2021). Each gene was queried against a UHGG-MGBC combined protein catalogue (pre-clustered at 100% sequence identity) using BLAST+ v2.7.1 (Camacho et al., 2009). The 'hm_blast' module of the MGBC-Toolkit (https://github.com/BenBeresfordJones/MGBC-Toolkit) was utilised for these analyses. Hits with R95% sequence identity were considered functionally equivalent, and the genomes of origin for each hit were identified. For bt_4096, additional hits with sequence identity down to 50% were considered.

Butyrate synthesis analyses
The 'feature_search' module of the MGBC-Toolkit was applied to search bacterial pangenomes for the IPR family identifiers of the terminal pathways of butyrate synthesis: IPR023990 (BCOAT), IPR011245 (BUK), IPR014079 (PTB). Only species encoding both BUK and PTB were considered as butyrate producers using the PTB/BUK terminal pathway. The fraction of genomes per pangenome that were predicted to encode terminal pathway genes was calculated and based on these data a threshold of 70% was used to define butyrate-producing species.
Validating butyrate terminal pathway predictions using isolate cultures From the functional annotations of their genomes, isolates from the Mouse Culture Collection were identified that encoded either the BCOAT pathway, the PTB/BUK pathway or neither. For each pathway, three isolates that ranked in the top 30 most abundant predicted butyrate-producing species according to the global mouse gut metagenome catalogue curated above (''Metagenome classification and analysis'') were selected for culturing. Isolates for known butyrate-producing species of the human gut microbiota (Agathobacter rectalis, Coprococcus eutactus_A) were included as positive controls, and an isolate from the MCC that lacked predicted butyrate terminal pathway genes (Lachnospiraceae_NOV MGBC000113) was included as a negative control.
Under anaerobic conditions and using pre-reduced reagents, isolates were streaked onto YCFA (Duncan et al., 2002) agar and single colonies picked into 10 mL YCFA broth. Broth cultures were incubated at 37 C for 48 hours. Culture turbidity and pH were measured to confirm growth, and 16S rRNA gene sequencing was performed to check for contamination. Bacterial cultures were centrifuged at 3,600rpm for 5 minutes to pellet cells. The supernatant was pipetted off and sterile-filtered before being immediately stored at -80 C until analysis. These experiments were performed in triplicate.

Short chain fatty acid quantitation by GC-MS
Bacterial media samples (100 mL) were extracted using 400 mL methanol containing acetate-d3, propionate-d5, butyrate-d7 and valerate-d9 as internal standards (Cambridge Isotope Laboratories). After vortexing, samples were incubated at -80 C for >1 h to promote protein precipitation, then centrifuged for 20 min at 20,000g at 4 C. 100 mL of the resulting supernatant was added to 100 mL of 100 mM borate buffer (pH 10). Subsequently, 400 mL of 100 mM pentafluorobenzyl bromide (Thermo Scientific) diluted in acetonitrile (Fisher) and 400 ml of cyclohexane (Acros Organics) were added and reaction vials were sealed. Samples were derivatised by heating to 65 C for 1 h with agitation, then cooled to room temperature and centrifuged at 2,000g for 2 min to promote phase separation. 100 mL of the cyclohexane (upper) phase was transferred to a fresh autosampler vial and diluted 1:100 with cyclohexane prior to analysis. Gas chromatography-mass spectrometry (GC-MS) was performed using an Agilent 7890A gas chromatograph and Agilent 5975C MS detector operating in negative chemical ionisation mode. A 1mL splitless injection was made onto a VF-1701ms column (30 m 3 0.25 mm, 0.25 mm; Agilent Technologies). Helium (1.2 mL/min) was the carrier gas and methane (2 mL/min) was used as the chemical ionisation reagent gas. For SCFA quantification, the peak areas of acetate (m/z 59) and propionate (m/z 73) were normalized to acetate-d3 (m/z 62) and propionate-d5 (m/z 78) internal standards respectively; the C4 compounds butyrate and isobutyrate (m/z 87) were normalized to butyrate-d7 (m/z 94) and the C5 compounds 2-methylbutyrate, valerate and isovalerate (m/z 101) were normalized to valerate-d9 (m/z 110). Calibrator and quality control (QC) samples were prepared in borate buffer and derivatised using the same procedure covering the range 0.05-125 mM. All data analyses were performed with Agilent MassHunter quantitative analysis software (version 10.1, Agilent Technologies) and QC samples were confirmed to be within ±15% accuracy. Butyrate production data are provided in Table S9.

QUANTIFICATION AND STATISTICAL ANALYSIS
All statistical analyses were performed using R v4.0.2. Statistical significance was verified using tests as reported in the text, STAR methods and figure legends. Tests for correlation utilised Pearson coefficients unless otherwise stated. A p-or q-value %0.05 was considered statistically significant. Experimental 'n' is reported throughout the results and methods, as well as in figure legends. Summary statistics, including measures of center and dispersion, are reported in the results section where appropriate. Graphs were generated using the ggpubr and ggplot2 packages in R v4.0.2.