Enterosignatures define common bacterial guilds in the human gut microbiome

The human gut microbiome composition is generally in a stable dynamic equilibrium, but it can deteriorate into dysbiotic states detrimental to host health. To disentangle the inherent complexity and capture the ecological spectrum of microbiome variability, we used 5,230 gut metagenomes to characterize signatures of bacteria commonly co-occurring, termed enterosignatures (ESs). We find five generalizable ESs dominated by either Bacteroides, Firmicutes, Prevotella, Bifidobacterium, or Escherichia. This model confirms key ecological characteristics known from previous enterotype concepts, while enabling the detection of gradual shifts in community structures. Temporal analysis implies that the Bacteroides-associated ES is "core" in the resilience of westernized gut microbiomes, while combinations with other ESs often complement the functional spectrum. The model reliably detects atypical gut microbiomes correlated with adverse host health conditions and/or the presence of pathobionts. ESs provide an interpretable and generic model that enables an intuitive characterization of gut microbiome composition in health and disease.

In brief Frioux et al. introduce enterosignatures as microbial guilds whose assemblies are accurate descriptors of the human gut microbiome composition. Enterosignature composition relates to changes and perturbations in the microbiome over a lifetime. The model generalizes to diverse human populations and can be used to detect anomalies in the gut microbiome.

INTRODUCTION
The gastrointestinal tract (GIT) microbiome typically consists of 200-300 different species, which, at the strain level, are mostly unique to their human host. 1 The taxonomic composition of this ecosystem is generally in a stable equilibrium shaped by external and internal processes, such as diet, immune system, and keystone species enabling ecosystem function. 2,3 Due to its complexity, decomposition of the gut microbiome into a few, universally applicable characteristics is a key goal for enabling accessible medical gut microbiome research. Taxonomically informed decompositions such as summarizing the microbiome at the phylum level have been used in the past to determine the ratios of dominant phyla (Bacteroidota:Firmicutes), which correlate with some physiological host characteristics 4 ( Figure 1A).
However, taxonomic decompositions do not consider intrinsic ecosystem information, such as bacteria preferentially co-occurring or growing under similar conditions. Such ecologically informed decompositions of the gut microbiome can be obtained using ordinations 5 or by clustering genus-level composi-tions into enterotypes (ETs). 6 Two popular clustering algorithms used to calculate ETs are partitioning around medoids (PAMs) or Dirichlet multinomial mixture models (DMMs). PAM typically identifies three ETs enriched in key taxa Bacteroides (ET-B), Prevotella (ET-P), and Ruminococcus (Firmicutes-enriched, ET-F), 6 whereas the DMM typically identifies four ETs, recovering a second Bacteroides-dominated cluster (ET-Bact2). 7 Additionally, some studies concluded that either no ETs or only two ETs (Bacteroides and Prevotella dominated) exist in gut microbiomes. 8,9 These clustering-based approaches enforce discrete divisions, 9 implicitly assuming that each community in a sample derives from one type ( Figure 1A). For example, the physiologically relevant ratio of Bacteroidota:Firmicutes 4 only reflects the presence of either a Firmicutes or Bacteroides/Prevotella ET and lacks the resolution to capture fine-grained co-existence of bacterial guilds. Instead, defining compositions of bacterial guilds encapsulates both ecologically informed bacterial assemblages and the proportional representation of these within single samples. Such assemblages can be calculated using a different form of multivariate analysis algorithm, where the latent variables are continuous and not discrete, such as latent Dirichlet allocation (LDA) or non-negative matrix factorization (NMF), which have been applied to microbial ecosystems ranging from the gut to lake metagenomes. [10][11][12][13][14] NMF has the added advantage that pre-calculated signatures of bacterial assemblages can be reapplied to even a single metagenome, removing the need for large cohort sizes as required in ordinations (e.g., PCA) or clustering-based approaches (e.g., ET) capturing microbiome variation.
Using NMF, 15 we propose an ecologically informed decomposition of human gut microbiomes into five microbial signatures or ''enterosignatures'' (ESs). Cross-validation suggests that these five ESs were able to ubiquitously describe variation in gut metagenomes from western as well as non-western (NW) fecal metagenomes of all ages. The model is provided as a community resource at https://enterosignatures.quadram.ac.uk/. Combinations of signatures were inherent to most gut microbiomes; commonly, two or three ESs best described fecal metagenomes. We demonstrated, using metabolic modeling, the increased functional potential of ES combinations and that Bacteroides-ES (ES-Bact) has a central role in establishing and maintaining core gut functionality. Our ES model also enabled the detection of atypical fecal microbiomes associated with infant birth mode and antibiotic use in adults, thereby allowing the detection of potentially dysbiotic gut microbiomes.

RESULTS
The gut microbiome composition is accurately described by combinations of enterosignatures To investigate if the human fecal microbiome could be generalized by NMF signatures, we used a diverse set of 5,230 fecal metagenomes from the study of Hildebrand et al. 1  Article data model. This dataset is referred to as gut microbiome reference (GMR) and represents human populations from 13 countries (three continents) and different age groups (2,169 samples from infants < 3 years old; 1,943 from adults R 16 years old, and 1,059 > 60 years old classified as elders). To identify recurrent signatures, NMF was applied to the relative abundances of genera ( Figure S1A). The algorithm identified signatures that represent guilds of co-occurring bacterial genera that we termed enterosignatures (ESs) ( Figure 1A) and describing variance in the original genus-level composition space.
With increasing numbers of ESs, the NMF model reconstructed increasingly more variance of genus-level abundances (from 37.9% at k = 2 to 79.9% at k = 10). To mitigate the risk of overfitting, we used a 333 bi-cross validation to identify generalized ESs that represented a balance between explained microbial variance in GMR and applicability to new datasets. This approach identified five ESs (Figures S1A-S1D; STAR Methods) and explained 64% of, reaching 0.8 cosine similarity to, the original genus abundances.
Each ES is a weighted combination of several genera: some ESs are strongly dominated by one or two genera, whereas others are combinations of multiple genera with weaker associations ( Figures 1B, 2B, and S1E; see Table S1 in Data S1). The most prominent ESs were found at lower signature numbers (k < 5, Figure S1G); specifically, we identified in order of emergence: (1) ES-Bact (mostly characterized by the genera Bacteroides and Phocaeicola), (2) ES-Bifi (Bifidobacterium and Streptococcus), (3) ES-Prev (Prevotella), (4) ES-Esch (Escherichia and Citrobacter), (5) ES-Firm (genera in the phylum Firmicutes) (Figure 1B). At higher cluster numbers, additional ESs emerged: bacterial guilds dominated by Enterobacteriaceae or Staphylococcus were found at k = 6 and k = 7, respectively, and a second Bacteroidota signature, mostly represented by the genus Phocaeicola at k = 8 ( Figure S1H). Although the latter ESs represent bacterial guilds observed frequently and could therefore be relevant, our cross-validation results indicated these to be specific to our GMR dataset.
The NMF approach models coexisting ESs that represent together a particular fecal microbiome ( Figure S1A; see Table S1 in Data S1), rather than assigning each sample to a single cluster type. Nonetheless, we recovered three signatures among the five ESs likely representing the community types first described as adult 6 ETs, indicating the strong biological signal exhibited by these guilds: ES-Bact, ES-Firm, and ES-Prev. The difference between assigning each fecal sample to an ET or representing a combination of ESs seems relevant, as the majority of fecal samples required two (47%) or three (30%) ESs to capture community variation ( Figure 1C). Relative ES abundance varied strongly among individual samples ( Figure S1F), demonstrating an additional level of ecological complexity captured in the ES model. As expected, adult ESs (ES-Bact, ES-Prev, and ES-Firm) were the dominant, or ''primary ES'' in most adult microbiomes but usually co-occurring with other ESs (85.4% of all adult samples, Figure 1C; see Table S2 in Data S1). In infant samples, this was different; ES-Bifi and ES-Esch were more often observed as the sole ES (>0.9 relative abundance in 9.4% and 17.4% of the samples, respectively). The presence of ES-Esch is of interest as it could reflect a pathological state in preterm babies where Escherichia coli is associated with necrotizing enterocolitis 16 ; 79% of preterm samples were solely represented by ES-Esch.
Despite combinations of ESs frequently being observed in fecal samples, single genera were mostly associated with a single ES ( Figure S1F; see Table S2 in Data S1). The median association strength between genera and their most strongly associated ES ranged from 90% (ES-Firm genera) to 100% (ES-Esch genera), indicating strong ''loyalty'' of taxa to their respective ES. This strong association is indicative of constituent bacteria in each ES forming predictable guilds. This suggests metabolic dependencies or mutual benefits among ES bacterial members, which could have co-evolved over time.
Enterosignatures are generic representations of microbial communities with strong host-phenotype associations To compare ESs with classical ET models, we defined three ES abstraction levels: the relative abundance of each ES in a sample, the presence/absence of one or more ESs per sample, and the primary ES only that describes the strongest ES signature per sample-the latter most closely resembling classical ET concepts. Because both ET models-PAM (N = 3 clusters) and DMM (N = 4 clusters)-were only defined for adult fecal metagenomes, we restricted our comparisons to adult samples in the GMR dataset. We evaluated how well the original information in the genus matrix was represented by each approach (see Table S6 in Data S1). ET-like sample assignments via either DMM, PAM, or primary ES explained similar levels of variance in the original genus abundances (r = 0.27, r = 0.41, and r = 0.41, respectively), whereas discrete ES assignments and ES relative abundance captured up to twice the variance (r = 0.61 and r = 0.73, respectively) ( Figure 2A), despite being restricted to a similar number of components.
To investigate the generalizability of our 5-ES model to a cohort from a similar background, we applied the pretrained model to the Metacardis body mass index spectrum (BMIS) cohort (n = 888 fecal metagenomes of European origin). 17 For comparison, ESs were also de novo calculated, as were PAM and DMM ETs. This showed a similar assignment of samples to either ET or primary ES ( Figure S2A). Comparison between the two ES models showed that only a little additional information was captured in an ES model de novo trained on the BMIS cohort (median model fit 0.91 compared with 0.86, Figure 2B). The pre-trained 5-ES model was therefore retained in all following analyses. Following our GMR-based analysis, we calculated correlations between each type of ES and ET assignments and the original genus-level BMIS data and, as with our previous observations, found that more variance was captured by ES than ET models (Figure 2A).
The gut microbiome is correlated to different host physiological processes; thus, we hypothesized that ES and their relative abundance are correlated to different host-derived metadata variables that can explain their functional roles. In both BMIS and GMR datasets, ES explained more variation in host metadata, although most of the significantly correlated variables in these different models were similar ( Figure 2C; see Table S6 in Data S1). In contrast to host physiological metadata, microbiome-intrinsic variables (gene richness and fecal microbial load) correlated more strongly with DMM (18.4%) and PAM (14.8%) ETs than ES (12%) and genus (8%) abundances   Figure 2B). It is worth noting that both ES and PAM can explain more metadata variance than the original genus abundance data ( Figure 2C). Similar effects have been observed before, where data reduction models can smooth observed predictions and thus lead to better predictions. 18 The increased variance in both community composition and physiological metadata that is explained by ESs demonstrates their usefulness for interpreting gut microbial datasets; the level of abstraction simplifies complex microbiomes to a few factors, thus enabling impactful health data interpretation.
For example, ES-Bact and ES-Prev were, respectively, positively and negatively correlated to antibiotic intake, in both adult and infant metagenomes ( Figure 2D). It is noteworthy that genus Prevotella or Prevotella ET prevalence typically increases in rural communities and/or NW countries, linked to a diet rich in agrarian products 1,9 ; our data suggest that this could alternatively be associated with reduced antibiotic intake in these communities. ES-Firm appeared to associate well with markers of host health, based on the anti-correlation with BMI, intake of various medications, fat mass, inflammation markers, blood sugar levels, and positive correlation with healthy eating index and host age. Furthermore, ES-Firm was the only ES positively correlated with fecal microbial cell counts, indicative of a thriving ecosystem and congruent with previous Firmicutes ET observations 19 ( Figure S2C). In the GMR meta-cohort, all correlations were corrected for the study effect, but the metadata for adults was sparse; the few significant associations found largely agreed with the BMIS observations. However, in infants, some notable associations with ESs were detected: ES-Bact was correlated with vaginal birth, in line with previous findings 20 ; there were negative correlations between ES-Esch and age and birth weight, in line with our other observation that this ES is potentially pathogenic.

Enterosignatures represent diverse and complementary metabolic landscapes
The functional potential present in the bacterial guilds represented in each ES may be vastly different, offering specific services to their host and the gut ecosystems. To investigate this, we reconstructed the bacterial genomes of key taxa in the much larger GMR dataset 1,21 (see STAR Methods; Figure S3A). We first analyzed metabolism-related annotations of the genomes (Figures 3A-3D and S3; see Table S3 in Data S1) and found that dominant metabolic processes varied among the five ESs. For example, ES-Bifi genomes were enriched in amino acid, carbohydrate, lipid, and secondary metabolic functions, whereas ES-Esch was enriched in energy-related functions ( Figures S3F-S3J). To understand possible nutrient fluxes, we concentrated our analysis on annotations of carbohydrateactive enzymes (CAZymes) ( Figures S3A-S3D and S3K) and found that overall ES-Bact genomes were enriched in CAZymes, particularly those associated with degradation of plant-and animal-based carbohydrates. It could suggest a key role of ES-Bact in the primary degradation of host-ingested nutrients. ES-Bifi genomes were enriched in CAZymes that targeted alpha-glucans, which was consistent with previous analyses on the CAZyme content of Bifidobacterium. 22 By contrast, ES-Esch had a larger number of CAZymes that targeted the degradation of bacterial cell wall carbohydrates ( Figure S3K), indicating a possible role in predation on other microbes. Next, we investigated if ESs are redundant or rather complementary in their functions, hypothesizing that the first could lead to competitive exclusion and the latter to higher coincidences of the respective ES. For this, gut communities were in silico simulated using genome-scale metabolic networks (GSMNs) of our metagenomic reconstructed genomes (MGS). 1,21 On average, per-species metabolic profiles were similar to the metabolic profile of their associated ES (permutational multivariate analysis of variance, perMANOVA F = 5.81, p < 0.001, Figure 3B), indicating that the metabolic roles exhibited by each ES might be conserved. Although this could be due to the taxonomic relatedness of bacteria within each ES, it seems to be partly decoupled from taxonomy. For example, the predicted metabolic profiles of Proteobacteria associated with ES-Bifi are more similar to other ES-Bifi bacteria (perMANOVA F values ranging from 11.34 to 16.51), than to Proteobacteria associated with ES-Esch (F value 23.62); this was also observed, to a lesser extent, for Bacteroidota (see Table S3 in Data S1). We interpret this as each ES having a specific metabolic niche that is collectively enabled by its members.
To understand whether these metabolic niches overlapped or complemented each other, we again simulated in silico gut communities but with all five ESs present. Metabolic potential is still clustered by ES association to its constituent species, but the metabolic potential became more similar between ES ( Figure 3C; see Table S3 in Data S1). The complementarity between ESs was generally the highest for ES-Bact and, when in combination with other ES, enabled the greatest increase in metabolic diversity ( Figure 3D); this may explain why ES-Bact was frequently observed in combination with other ESs (Figures 1 and S4A-S4D; see Table S2 in Data S1). By contrast, ES-Esch, which was most commonly found as a single-dominant ES ( Figure 1C), benefitted the least from the presence of other ESs in our simulations ( Figure 3D). As all five ESs were rarely observed together in a single sample ( Figure 1C), we also simulated the metabolic gain with stepwise increases in the number of ESs (from 1 to 5) per microbiome. Metabolic potential increased with increasing numbers of ESs but became saturated at three ESs with little further gain thereafter ( Figure S3C). This observation, in conjunction with increased competition for metabolic niches with more ES, could explain why combinations of more than three ESs were rarely observed ( Figure 1C).
Overall, although species within an ES exhibit a similar metabolism, the five ESs have diverse and often complementary metabolic potentials. Metabolic interactions between ESs strengthen the core metabolic functions shared by all ESs, permitting functional redundancy in metagenomes comprised of multiple ESs, likely diversifying potential responses of an ecosystem and therefore increasing resilience.
Enterosignatures capture intra-individual short-and long-term changes in the fecal microbiome To better understand the establishment, ecological roles, and interactions between ESs in the human host, we investigated temporal ES dynamics in the GMR cohort: (1) cross-sectionally from infants to elders and (2) longitudinally using time series of n = 1,239 individuals. Consistent with our understanding of the developing human microbiome, 23 ES-Esch dominated preterm and early infant samples (p < 2.2eÀ16, Figures 4A and S4L) and was typically superseded by ES-Bifi in infants % 1 year of age. From 1 year and older, and likely coinciding with weaning, adult ESs (ES-Firm, ES-Bact, and ES-Prev) were observed more frequently. It is noteworthy that ES-Bact often not only occurred in samples from infants younger than 6 months old but also decreased in prevalence in elder hosts.
The number of ESs present in metagenomes correlated strongly with genus-and species-level diversities (Spearman r = 0.556 and r = 0.535, respectively, p < 2.2eÀ16). As with bacterial diversity, 24 ES diversity also increased with age (r = 0.374, p < 2.2eÀ16): the frequency of samples colonized by a single ES decreased steadily from infants to adults to elders (26%, 12%, and 4% of the samples, respectively) and the community became more complex; two ESs was the most common state across ages (48%, 45%, and 61%, respectively, Figure 4A). In adults, particularly, we commonly observed three ESs (39%). Samples from adults were usually dominated by ES-Firm, ES-Bact, and ES-Prev (termed adult ES), although often in combination with the remaining two ESs (44% of cases). The most interconnected ES was ES-Firm, rarely occurring alone in a sample but most frequently being the dominant ES when in combination with other ESs (1.2% and 45% of adult samples, respectively). Faeces dominated by ES-Firm alone are likely unusual gut ecosystems, as 5 of the 24 ES-Firm monodominated adult samples were collected following a strong antibiotic intervention (see following section). This suggests that ES-Firm has a strong ecological and/ or functional complementarity with other ESs, particularly ES-Bact (present in 80% of all ES-Firm combinations), and is likely limited in forming a stable community when occurring on its own.
To further elucidate interactions among ESs, we explored short-term intra-individual ES transitions using longitudinal time series samples from 1,239 individuals, usually covering several weeks or months (see Table S4 in Data S1). Tracking changes in primary ESs can inform on temporal attractors, i.e., ES states that the system tends to coalesce into with time. ES-Bact was an attractor in samples from all ages ( Figure 4C) but especially in those from infants. In later life, a steady state between ES-Bact and ES-Firm dominance seemed to establish. This pattern might be explained by the persistence of ESs in longitudinal samples: although ES-Bact and ES-Prev as primary ll OPEN ACCESS

Article
ESs have high persistence in infants and adults, ES-Firm is only persistent as the most dominant ES in elders (estimates based on survival statistics, Figure 4B; see Table S4 in Data S1). Apart from ES-Bifi, ES persistence increased with age; particularly persistence of ES-Esch increased drastically in samples from elders. One hypothesis for this observation is that onset of immunosenescence 25 could lead to a reduced gut ecosystem control by the host.
Antibiotic treatments can have devastating effects on the gut microbiome 26 ; we hypothesized that during recovery, typical ecosystem succession strategies can be discovered. Our GMR data suggest that ES-Bact is important in the recovery of disturbed gut ecosystems because its temporal attractor status was even stronger in antibiotic-treated individuals ( Figures S4A-S4H). Following antibiotic treatments, ES-Bact relative abundance increased by 15% and 17% in samples from infants (excluding preterms) and adults (n = 1,811 and n = 1,904, respectively, p < 1e-15, Figure S4I), also observed in the BMIS cohort ( Figure 2D). This is at the expense of ES-Esch and ES-Bifi in samples from infants, which decreased by 9% and 7%, respectively ( Figures S4K and S4L), and ES-Firm and ES-Prev in samples from adults, which decreased by 5% and 9%, respectively, after antibiotic treatments (Figures S4J and S4M).
Collectively, these results demonstrate that both short-and long-term changes in the gut microbiome can be effectively captured using ESs at an easily interpretable level.
Atypical enterosignature composition as a marker of microbial perturbations Not all gut microbiomes are equal, but we hypothesized that some compositions could deviate more strongly from the expected compositions the model was trained on, such as dysbiotic gut microbiomes. Mathematically, the 5-ES model was able to accurately describe the vast majority of GMR fecal microbiomes, measured by the cosine similarity between the 5-ES model and genus abundances (mean 0.80 ± 0.22, median 0.89, Figure 2B; see Table S1 in Data S1) that we refer to as ES model fit. However, among the 5,230 samples evaluated, there were 394 with an ES model fit < 0.40 (see STAR Methods), which we termed ''ES-atypical'' samples. Indeed, these samples had a lower evenness (Kruskal-Wallis: chi 2 = 32.593, p = 1.14eÀ08) and an over-representation of potential pathobionts,  Table S5 in Data S1). Low evenness is indicative of monodominance (a single taxon at >60% relative abundance 26 ), as would be expected in microbiomes overgrown by an invading pathogen, such as observed in Salmonella, C. difficile, or E. coli intestinal infections. 30 In 21.8% of ES-atypical samples, a species was monodominant, although this was only observed in 3.3% of ''ES-typical'' samples with good ES model fit. ES-atypical samples include both low-and high-diversity metagenomes indicative that a second-order correlation might better describe the relation of ES model fit to alpha diversity ( Figures S2D and S5A); thus, ES model fit appears to be complementary rather than redundant to alpha diversity measures. We found several potentially detrimental host factors that could be associated with lowered ES model fit scores in their fecal microbiomes. This was observed in metagenomes from infants born preterm (Kruskal-Wallis: chi 2 = 205.53, p < 2.2eÀ16), infants born via C-section (excluding preterm infants, chi 2 = 16.29, p = 5.4eÀ5), and infants with a formula-supplemented diet (excluding preterm and C-section born, chi 2 = 3.95, p = 0.047) ( Figures 5B  and 5C), in line with previous reports of perturbed gut micro-biomes in such cases. 23 Maturation of the fecal microbiome translates into a gut microbial composition that fits better to known compositions, as infant age correlated positively to model fit during the first year of life (r = 0.3, p < 2.2eÀ16, Figure 5B).
In antibiotic-treated individuals, we would expect to observe partially dysbiotic gut microbiomes; indeed, lower ES model fits were observed 1-5 days after treatment ( Figure 5D). In the time series data for these individuals, we found that the range of ES model fits was greater in individuals treated with antibiotics, being on average both higher and lower during the time series, than in reference samples ( Figure 5E). We interpret this as the microbiome first becoming ES-atypical due to antibiotic exposure and then ES-typical in the recovery phase due to reduced species diversity and only the gut core composition being present. We further investigated antibiotic treatments in two sub-cohorts, one specifically treated with antibiotics to provoke dysbiosis 31 and a case study that followed one individual for 3 years with two separate antibiotic treatments. 26 In both datasets, microbiomes were strongly atypical after antibiotic treatment (mean ES model fit values after the first treatment 0.21, 0.03, respectively, Figures S5B and S5F). ES composition remained skewed even after recovery, with ES-Bact usually dominating microbiomes for >180 days ( Figure S5C). This was a general   Figure 5F). These observations are consistent with correlations observed on the BMIS dataset ( Figure 2D), with ES-Bact becoming a strong temporal attractor in antibiotic-treated individuals (Figures S4E and  S4F). Consistent with our previous findings, ES-Bact seems to have a core role in the western gut microbiome-recurrent and stable after perturbation-whereas ES-Firm and ES-Prev act as accessory ESs, and their relative abundances are likely to decrease in disrupted ecosystems.
Enterosignatures are generalizable to non-western cohorts To further test the generalizability of our 5-ES model to samples diverse in age and geography, we evaluated 1,152 fecal metage-nomes from 12 non-western (NW) countries that were not represented in the original GMR cohort, representing infants, children, and adults (n = 119, 278, and 740, respectively) 32,33 ( Figure 6A). The NW samples were distinct in their genus-level composition from GMR samples, as expected 33 ( Figure 6B). Applying the GMR-trained 5-ES model showed upon initial inspection an average ES model fit significantly lower in NW (0.64) compared with GMR (0.8) metagenomes (Kruskal-Wallis p < 1e-15). Investigating ES model fit within countries represented in the GMR and NW cohorts, we discovered that samples from China, 32 Madagascar, 34 and Guinea-Bissau 32,35 had substantially reduced values. In only these cohorts, faeces were collected in ethanol with sometimes incomplete cold-chains, a collection method known to bias community compositions. 36 Furthermore, 47.3% of the individuals from the Guinea-Bissau cohort were diagnosed with helminth infections, 35 leading to microbiomes that could be dysbiotic and expected to have lowered ES model fits. When excluding these samples from further analysis, the average ES model fit for the remaining NW metagenomes-not collected in ethanol-was similar to GMR ES model fit (mean 0.78, 0.8, respectively, p = 0.073) ( Figure 6C), demonstrating the generalizability of the 5-ES model to human samples independent of age or country.
ES-Prev was the most dominant ES in the reduced NW dataset (present in 81.3% of samples, excluding China, Madagascar, and Guinea-Bissau samples), being primary ES in 66.4% of the NW samples, having a prominent role reminiscent of ES-Bact's prominence in the GMR cohort ( Figure S6A). ES-Prev established in NW infants much earlier than in GMR infants ( Figures 4A, 6D, and S6B), appearing dominant as early as from 6 months of age in NW samples. In line with this, ES-Prev dominated samples often harbor ES-Firm as the second most abundant ES (49.5%) but rarely ES-Bact (6.1%). Of note is that in both GMR, BMIS, and NW cohorts, ES-Bact and ES-Prev are strongly anticorrelated (r = À0.63, À0.77, p < 2.2eÀ16 and r = À0.11, p = 0.01, respectively, Table S6 in Data S1), indicating competitive exclusion between these two bacterial guilds, plausible due to their strongly redundant functional profiles that could indicate similar ecological roles (Figure 4). ES-Prev might colonize a similar ecological niche to ES-Bact but reflect a NW lifestyle: for example, the Kazakhstan samples in the GMR cohort, arguably NW, were dominated by ES-Prev ( Figures 6D and 6E).

Limitations of the study
We specifically chose an algorithm capable of decomposing complex microbial communities into assemblies of underlying microbial guilds that explained most of the gut microbial variance. Two classes of algorithms for this were previously evaluated on microbiome data: the linear algebra approach of NMF and probabilistic approaches such as generalization of DMM into LDA models. These works challenged existing methodologies 37 and demonstrated that assemblages of microbes could be captured from metagenomes, 10,13 even recovering bacterial assemblages reminiscent of ETs using LDA. 12 Here, we used NMF because the correct scaling of observed proportions to read counts in the multinomial noise term is not obvious for metagenome data, and the simpler model parametrization of NMF potentially reduces overfitting. Based on our 5-ES model, we propose a unified description of the human gut microbiome that has stronger associations with metadata and explains more variance of the original composition, than ET models. ESs rely on genus-level compositions, a taxonomic level chosen as a trade-off between the generalizability of the conserved patterns, the relevance of the taxa for functional inference, and the interpretability and number of recovered signatures.
The 5-ES model was trained on the GMR dataset consisting mostly of western individuals (74.2% western countries, 25.8% Israel, Russia, and Kazakhstan), potentially limiting generalizability of five ESs. We undertook two validation cohort studies relying on the BMIS (European samples) and NW (no-western) cohorts to evaluate generalizability: on both datasets, the GMR-trained 5-ES model generalized well. However, it is conceivable that we are missing ESs associated with specific cohorts or conditions and therefore not fully representing the spectrum of human gut ESs. Therefore, we provide the mathematical tools to identify gut metagenomes not fitting to the 5-ES model, using ES model fit scores. The pre-trained model and scripts for either applying or de novo calculating ES are available at https://enterosignatures. quadram.ac.uk. We explicitly want to acknowledge that additional bacterial guilds might be an important part of the healthy or diseased gut ecosystem, which can be cataloged and curated on our provided website as a community effort.
An accessible perspective on the gut microbial ecosystem The proposed ESs augment our understanding of the gut microbial ecology, as ES profiles provide a granular classification of globally observed gut bacterial guilds. In fecal samples from most adults and many infants, multiple ESs were coexisting; therefore, using a single community type to describe a GIT sample is probably incompletely capturing the underlying gut ecosystem. Despite ESs being highly persistent in individuals, the relative ES composition rapidly changed following perturbations, indicating the model being sensitive to external and internal forces. However, the often-observed switches in dominant ES are not necessarily indicative of ecological niche changes when the overall mixture of ESs remains stable. Rather, we propose to monitor the dynamic interplay between ES combinations in future applications. For example, ES-Prev and ES-Bact are strongly co-excluding signatures, despite-or because of-highly similar ecological niches: they share metabolic functions, occur at a similar frequency in fecal samples of healthy adults, either ES often co-occurs with ES-Firm, but can also dominate a microbiome solely.
Of note was that these two co-excluding signatures dominate either NW (ES-Prev) or western (ES-Bact) gut microbiomes. We do not know why this is exactly, but features associated with a western lifestyle, such as ultra-processed foods, hygiene, and physical activity levels, are likely contributing to this. 9 However, we note that after antibiotic interventions, ES-Bact was often increasing in prevalence, at the expense of both ES-Firm and ES-Prev. Therefore, antibiotic usage, more widespread in western societies, 38 could be a contributing factor to the puzzling ES-Bact dominance in western samples.
Both ES-Bact and ES-Prev appear to provide a core functionality to the healthy gut microbiome. Although the healthy microbiome is, in both western and NW datasets, likely established by ES-Bifi, our results highlight that the latter is being replaced early in life, by either ES-Bact (western) or ES-Prev (NW). The establishment of these adult microbiomes in infants is likely supported by vaginal birth as we found a correlation between ES-Bact and vaginal birth mode (reported also in Song et al. 20 ) for western samples. ES-Firm is most likely successional, 26 but in contrast to ES-Prev or ES-Bact, almost never occurring on its own. Therefore, we hypothesize that ES-Firm provides complementary functionality and relies on biotic interactions provided by either ES-Bact or ES-Prev. Although ES-Bifi, ES-Bact, ES-Prev, and ES-Firm are part of a healthy human gut microbiome, ES-Esch is often dominant in samples from preterm infants and adults undergoing drastic interventions, frequently being a single-dominant ES. It seems plausible to classify ES-Esch as an unhealthy, potentially pathogenic ES.
Based on these temporal observations, we propose the following successional model of the human gut microbiome: the ll OPEN ACCESS Article ES-Esch bacterial guild is a temporary colonizer in mostly dysbiotic infant microbiomes, succeeded by ES-Bifi and then by ES-Bact, or ES-Prev, at weaning. ES-Bact/ES-Prev forms the stable core of the healthy human microbiome, and with time, ES-Firm can complement ecosystem functionality. In the long term, ES-Prev replaces ES-Bact even in western samples, by superseding its ecological niche. The relative abundances of both ES-Firm and ES-Prev are more likely to decrease in disrupted ecosystems and the recurrence and stability of ES-Bact after perturbation is noteworthy; Bacteroides has already been identified as a dominant profile associated with disease, 39 Bacteroides/Prevotella were described as tenacious (highly persistent) taxa, 1 and the low diversity, low cell count associated Bact2 ET was prevalent in hosts with diseases 17,40 ; ES-Bact could be the remaining core of a disrupted ecosystem in the western gut microbiome.
A unified approach to defining atypical fecal microbiomes requiring further investigation Defining typical compositions of the healthy human gut microbiome can help to classify and quantify healthy states; conversely, it can also be used to identify microbiomes deviating from typical compositions observed in healthy hosts, thereby defining microbial dysbiosis. Detecting dysbiosis usually requires a suitably large reference dataset to allow for outlier detection in ordinations 41 or using abundances of preselected bacterial species 42 to train machine learning models. 43 Here, we demonstrated how the 5-ES model can be used to detect ''atypical'' compositions in fecal samples, outside of the expected ES signature spectrum. This classification of potentially dysbiotic samples relies on the ''absence of evidence for normal states'' and is thus fundamentally different from models that rely on the ''presence of dysbiotic evidence.'' Because the 5-ES model generalizes well to diverse gut metagenomes and is applicable to single samples, the detection of atypical samples allows for further investigation of a wide spectrum of potentially detrimental host states.
Fecal metagenomes deviating from the 5-ES model are not necessarily dysbiotic, but rather atypical with respect to the training dataset. However, ES atypical metagenomes were often associated with detrimental conditions in the host: ES model fit decreased after antibiotic interventions, classical examples of dysbiotic microbiomes, 31 and ES atypical microbiomes were often dominated by potential pathobiont species. In addition to ES composition correlating with health-associated metadata, we demonstrated how ESs can identify detrimental changes in the gut microbiome composition and provide hypotheses about their significance. In gut microbiomes from the general population, ESs could therefore be a valuable tool to detect and assess pathological conditions, discern the effect of treatments, and monitor disease progression with greater sensitivity and interpretability by researchers and medical professionals alike.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:    . These cohorts were selected because they gather mostly healthy individuals while accounting for some physio-pathological conditions that are known to be associated to microbiome perturbations such as preterm birth and antibiotic treatment. Sequences were processed with the MATAFILER pipeline as described in Hildebrand et al. 1 and Bahram et al. 93 The Metacardis Body Mass Index Spectrum (BMIS) raw metagenomes were retrieved from Vieira-Silva et al. 17 (ENA accession number PRJEB37249) and processed with MATAFILER using the same settings. It is a cross-sectional cohort consisting of 888 samples. Metadata was retrieved from Vieira-Silva et al. 17 and Forslund et al. 94 We additionally gathered cohorts of individuals with a non-western (NW) lifestyle. The Indian cohort was obtained from Vishnu Prasoodanan et al. 33 74 and reads mapped onto assemblies using Bowtie2 v2.3.4.1 with parameters ''-end-to-end '' 76 , genes predicted with Prodigal v2.6.1 with parameters ''-p meta'' 75 and a gene catalogue clustered at 95% nt identity using MMseqs2. 80 MAGs (metagenomic assembled genomes) were binned using Metabat2 v2.15 72 and combined in MATAFILER to MGS (metagenomic species), relying on canopy clustering. 96 Matrix operations were carried out using rtk. 71 The full MATAFILER pipeline is available at https://github.com/hildebra/MATAF3 . MGS phylogenies were calculated de novo based on the amino acid (AA) sequences of at most 40 marker genes 97 present in each MGS and subsequently aligned using MAFFT v v7.464 82 with default options; the multiple sequence alignment (MSA) was then trimmed and translated from nucleotide to aminoacid sequences using TrimAl 83 (options ''-keepheader -ignorestopcodon -gt 0.1 -cons 60''), from these IQ-TREE v1.6.3.a 84 with parameters ''-m GTR+F+I+G4 -B 1000'' was used to reconstruct a phylogeny, that was visualized with iTOL. 85 Non-negative matrix factorisation Genus-level relative abundance matrices were normalized by the sum of abundances per sample after removal of unassigned organisms at domain level. NMF was performed with Scikit-Learn v0.24.1 67 using the multiplicative update solver, Kullback-Leibler divergence as a beta-loss function, random initialisation, and a maximal number of iterations of 2000. Nine-fold bicross validation was performed as described in Owen and Perry, 98 Kanagal and Sindhwani, 99 and Eng et al. 100 in order to determine the number of signatures and the regularization ratio ( Figure S1B). Briefly, for each fold, 1/9 of the observations was withheld for validation. The basis matrix W and mixture coefficient matrix H of the validation set were calculated using the matrices obtained for the training set. 100 repetitions of this process were performed in total, each time after shuffling the matrix. All runs were performed for each number of clusters k, ranging from 2 to 10. The quality of decomposition for the validation was calculated as previously described 101,102 with the explained variance: The higher the EV value, the more accurately the decomposition represents the input compositional matrix X. The quality of the decomposition was further estimated using a cosine similarity between the reconstructed matrix and the original abundance matrix, or a sample microbial composition profile and its prediction, referred to as model fit score in the manuscript. For a contiguous flattened (ravel function from the Python numpy package 103 ) matrix (whole dataset) or a vector (sample) x and its estimation b x: The cosine similarity ranges from 0 to 1, higher values indicating greater similarity between x and b x. The optimal number of signatures was chosen by assessing the gain of cosine similarity (median) when increasing the number of signatures: when the gain no longer increases, the optimal number of signatures is reached.
L1 regularisation parameter a applied to both W and H matrices was estimated by running the same procedure of bicross-validation with the a coefficients ranging from 0 to 100. The regularisation parameter was selected based on the method from Eng et al. 100 Briefly, we defined a threshold equal to the mean explained variance (MEV) value at a = 0 minus the standard deviation of the mean.
We selected the a parameter value leading to the model whose associated MEV was greater than the threshold. We selected for ES computation an alpha value 1.0, and regularization was performed on both H and W matrices.
In order to reapply the GMR ESs to the Metacardis BMIS and non-western datasets, we reused the W matrix obtained from GMR together with the BMIS normalized relative abundance matrix as inputs to the Scikit-Learn 'non_negative_factorization' function, with the 'update_H' parameter set to 'False' and 'init' to 'custom'. Running the computation of H only consists in solving a non-negative least square problem. Prior to computation, we homogenized the taxa contents of the BMIS relative abundance matrix and W matrix by adding zero-filled rows for the genera that were absent in W or in the abundance matrix respectively.

Enterosignature analyses
The enterosignature results were calculated using the whole dataset and the parameters (number of ESs, regularization parameters) obtained with bicross-validation. The matrices W and H resulting from the NMF algorithm represent the weight of genera in enterosignatures and the presence of enterosignatures in samples, respectively ( Figure 1A). Normalizing W by its columns or its rows informs on the general composition of ESs in genera, and on the association strength of genera to each ES, respectively. 101 By normalizing the H matrix column-wise, we obtained the relative abundance of ES in each sample. We define as primary ES the ES of a sample with the highest relative abundance.
We used the normalized H matrix to determine the number of ESs needed to best describe a sample. To that purpose, we defined an arbitrary cut-off of 90% accumulated relative abundance to describe all major ES components of a sample. Thus, our operational definition posed that the set of ESs describing a sample are ESs ordered by decreasing relative abundance, whose cumulated relative abundance is greater or equal to 0.9. This is as well the approach we used to discretely assign the composition of ESs in samples. Additionally, we describe as a ''monodominant ES'' an ES whose relative abundance is greater than 0.9, i.e., samples that are best described by a single ES according to the above definition.
The diversity of ESs in samples is also estimated, using the Shannon diversity index. The Enterosignature Shannon Diversity (ESSD) is calculated using the diversity function (Shannon index) of the vegan R package v2.5-7 91 on the normalized H matrix.
We calculated the persistence of ESs in individuals sampled longitudinally using survival analysis. We distinguished the survival of the primary ES over time from the survival of the presence of an ES. For the former case, survival analysis per signature was performed after assigning to patients in longitudinal studies the most representative signature for each of their samples. For the latter case, we considered as present all ESs with a relative abundance greater than 1% and analyzed their time of disappearance in longitudinal samples. Persistence of ES and primary ES were calculated using the Survival R package v3.2-7. 104 Statistical tests for the comparison of survival predictions between groups are log rank Chi^2 tests from the Survival and Ezfun (v0.1.3) R packages. 105 We determined outlier samples regarding their cosine similarity to the original genus-level abundances. Such atypical samples exhibited a cosine model fit score lower than 0.4; the threshold was selected by looking at the distribution of cosine values and the interquartile range (IQR): 25 th percentile -(1.5 * IQR).

Functional analysis and metabolic network reconstruction
Consensus genomes of metagenomic species (MGS) and references genomes 106 corresponding to the GMR dataset were retrieved according to the data in Hildebrand et al. 1 Proteomes were functionally-annotated using EggNOG-mapper v2.1.6 89 based on eggNOG orthology data. 107 Sequence searches were performed using Diamond v2.0.14. 90 CAZymes and Kegg Orthology (KO) annotations were retrieved from EggNOG-mapper annotation files. Higher-level categories associated to KO were downloaded from the KEGG database. 108 For each KO found in a genome, we incremented the number of the corresponding higher level metabolic categories and we ultimately compared the metabolic functions of genomes grouped by their ES assignments. CAZymes 109 were associated to their respective substrates according to Cantarel et al. 110 and The CAZypedia Consortium 111 and genomes were compared after grouping by ES. In both CAZyme and KEGG annotations, values by genome were normalized by the number of genes in the genome.
Genome-scale metabolic networks (GSMNs) were reconstructed based on the annotated proteomes using mpwt v0.7.0 69 with Pathway Tools v25.5. 68 Analyses of metabolic pathways in the networks relied on Metacyc v25.1 database. 112 We retrieved the genera accounting for more than 3% of an enterosignature's activity and considered them as representatives of their respective enterosignature. This threshold was chosen because it allowed us to uniquely assign genera to an ES. Then, by grouping GSMNs assigned to genus-level representatives, a group of representative GSMNs was associated to each enterosignature. Of 1,737 total species, 55, 81, 60, 149 and 43 MGS were stably associated to ES-Bact, ES-Firm, ES-Prev, ES-Bifi and ES-Esch, respectively.
The metabolic potential and metabolic complementarity between enterosignatures were determined with Metage2Metabo v1.5.0. 69 The software simulates the qualitative individual and collective metabolic potentials of GSMNs provided a set of nutrients. A set of nutrients consisting of molecules described in a western diet 113 was selected. Three experiments were tested. For the first one, metabolic potential of each ES in isolation was calculated, running Metage2Metabo on GSMNs associated to each ES. For the second experiment, we computed the metabolic potential of pairwise combinations of all ES as well as all observed discrete combinations of ES in the samples. Finally, for the third experiment, all representative GSMNs for all ES were combined in the Metage2Metabo run. For each run of Metage2Metabo, the individual metabolic potential (molecules predicted to be producible by metabolic networks considered individually), the collective metabolic potential (molecules predicted to be producible by the GMR data the effect introduced by the cohort was accounted for by including 'Study' as a blocking factor. The p-values were adjusted for multiple testing according to the Benjamini-Hochberg method. All data with an adjusted p-value < 0.1 have been visualized using the ggplot2 package.