Directional Gaussian Mixture Models of the Gut Microbiome Elucidate Microbial Spatial Structure

ABSTRACT The gut microbiome is spatially heterogeneous, with environmental niches contributing to the distribution and composition of microbial populations. A recently developed mapping technology, MaPS-seq, aims to characterize the spatial organization of the gut microbiome by providing data about local microbial populations. However, information about the global arrangement of these populations is lost by MaPS-seq. To address this, we propose a class of Gaussian mixture models (GMM) with spatial dependencies between mixture components in order to computationally recover the relative spatial arrangement of microbial communities. We demonstrate on synthetic data that our spatial models can identify global spatial dynamics, accurately cluster data, and improve parameter inference over a naive GMM. We applied our model to three MaPS-seq data sets taken from various regions of the mouse intestine. On cecal and distal colon data sets, we find our model accurately recapitulates known spatial behaviors of the gut microbiome, including compositional differences between mucus and lumen-associated populations. Our model also seems to capture the role of a pH gradient on microbial populations in the mouse ileum and proposes new behaviors as well. IMPORTANCE The spatial arrangement of the microbes in the gut microbiome is a defining characteristic of its behavior. Various experimental studies have attempted to provide glimpses into the mechanisms that contribute to microbial arrangements. However, many of these descriptions are qualitative. We developed a computational method that takes microbial spatial data and learns many of the experimentally validated spatial factors. We can then use our model to propose previously unknown spatial behaviors. Our results demonstrate that the gut microbiome, while exceptionally large, has predictable spatial patterns that can be used to help us understand its role in health and disease.


Reviewer comments:
Reviewer #1 (Comments for the Author): In their paper Pasarkar et al., 2021 design a Bayesian network-derived Gaussian mixed model, in order to establish generative classifiers and model for the topographical biogeography of gut microbiome using MaPS-seq data.
The paper is well written and establishes a theoretical modelling structure that shows some relative similarity to microbiome biogeography. I have some minor comments: 1. Please make sure to release the github repo prior to sharing submitting your article to any journal. The current link does not re-direct to a public repo, as most probably the author (github user: amepas) has kept the repo as private. If not at least provide coding structure in the form of a txt file as supplemental information. In the revision, please ensure the link is functional.
2. Perhaps I have not understood or missed this, but why did you use only 20 initializations on simulated vs. 200 on real data? Did you reach the threshold of the log-likelihood being <10^-4? Should you not use similar initializations, does the difference reflect a lack of complexity in the simulated data and therefore preclude its need? Please explain.
3. Although you have run a comparison of the model yet somehow the projected data seems to show strong divergence from observed data. What is missing for me is formal hypothesis testing using a statistical test to evaluate Kullback-Leibler divergence of projected and observed spatial compositional data. Although you have used the AIC method to evaluate your model, a formal test is missing to compare projected and observed outcomes in real-world model deployment. There are major discrepancies in the data presented not 'some'. Not only Actinobacteria, but also Erysipelotrichaceae, Lachnospiraceae, Ruminococcaceae and in fact also Lactobacillaceae show deviation from observed data. Please explain? Please provide metrics showing how much the projected and observed data diverge, perhaps by using Pearson's residuals or another metric of multivariate comparison.
Reviewer #2 (Comments for the Author): Pasarkar and colleagues present a method to retrieve the spatial structure of the gut microbiome based on Metagenomic Plot Sampling by sequencing (MaPS-seq) data. This technique allows to infer the microbial composition using barcoded 16S rRNA amplification of specific locations in the gastrointestinal (GI) tract, however, the specific location of each sample is lost. The authors propose a directional Gaussian Mixture Model (GMM) that has a spatial structure that mimics the locations from which samples were taken. The method is illustrated on simulated and real-world data, from which the latter stems from the original publication in which MaPS-seq was proposed (Sheth et al., 2019).
The proposed method is relevant, understandable and could be an interesting tool for future gut microbiome research.
I have some questions, mostly minor and concerning the methodology, and a few remarks: 1. How do you see your method being incorporated in future gut microbiome research? Can you maybe give one or two concrete applications for which the method might be beneficial? 2. As the fitting of the GMM is a stochastic process, how stable is the choice of one-and two-dimensional models based on the AIC? Also, the number of evaluated topologies seems rather small, especially for the two-dimensional models (from 2x2 to 2x4). Why not go higher?
3. How do you choose the covariance matrix of each mixture? Are these fully independent from each other, or are some of the coefficients shared? 4. Why do you focus on 95% of most abundant taxa over all three datasets instead per dataset? 5. How were the reads preprocessed? At which taxonomic level did you analyze them? If I am not mistaken this is the OTU level, but results for Fig. 6  10. A suggestion (especially as you have quite some space left): as the Materials & Methods section comes in the end, the paper might be more accessible if you repeat some of the essential elements from it in the Results section. Some things (e.g. the simulation study) only became understandable after reading the paper as a whole. Another thing I would add is that each mixture represents a composition, from which you use the average to inspect changes in composition between different (estimated) locations in the GI.
11. The link to the repository with code (and data?) does not seem to work.
There is a typographical error and a few inconsistencies: -l22: seem → seems -Names of datasets are sometimes written with and sometimes without capital letter. Please be consistent.
-Please introduce the meaning of all symbols, such as Q, Q0, μ 0, and zb and upon first mentioning, such as ∆ and ∆⊥.
1. Please make sure to release the github repo prior to sharing submitting your article to any journal. The current link does not re-direct to a public repo, as most probably the author (github user: amepas) has kept the repo as private. If not at least provide coding structure in the form of a txt file as supplemental information. In the revision, please ensure the link is functional.
The github repository has been made publicly available.
2. Perhaps I have not understood or missed this, but why did you use only 20 initializations on simulated vs. 200 on real data? Did you reach the threshold of the log-likelihood being <10^-4? Should you not use similar initializations, does the difference reflect a lack of complexity in the simulated data and therefore preclude its need? Please explain.
We have modified the methodology to use 200 initializations on both the simulated and real data (lines 290-291). The corresponding results have been updated.
We have further clarified the terminating threshold used (line 294): an initialization will terminate once it reaches the convergence threshold of a relative increase in the log-likelihood being less than 10^-4. If this does not occur within 500 iterations, the initialization is restarted.

Although you have run a comparison of the model yet somehow the projected data seems
to show strong divergence from observed data. What is missing for me is formal hypothesis testing using a statistical test to evaluate Kullback-Leibler divergence of projected and observed spatial compositional data. Although you have used the AIC method to evaluate your model, a formal test is missing to compare projected and observed outcomes in realworld model deployment. There are major discrepancies in the data presented not 'some'. Not only Actinobacteria, but also Erysipelotrichaceae, Lachnospiraceae, Ruminococcaceae and in fact also Lactobacillaceae show deviation from observed data. Please explain? Please provide metrics showing how much the projected and observed data diverge, perhaps by using Pearson's residuals or another metric of multivariate comparison.
The noticeable discrepancies are not surprising due to the overall complexity of the system. However, there are some general trends that are shared between projected and observed data, specifically on Lactobacillaceae.
To your point, we have performed two hypothesis tests of the divergence (lines 152-161). First, we determined if the proposed latent clusters still accurately describe the data. We compared the likelihood of MaPS-seq droplets in their assigned latent cluster against the highest likelihood of all other clusters (line 356). Note, that this is a conservative test, as it compares against an optimal other assignment, rather than a random one. The likelihood was calculated using the proposed compositions and the same covariance matrices learned by the model. If the proposed compositions are extremely far from the observed compositions, then we would expect that many MaPS-seq droplets would likely have a higher likelihood for other clusters. We observe a pvalue that is only barely higher than the traditional threshold for statistical significance.
We also test using the KL divergence metric. On each region of the ileum, we have calculated the KL divergence between projected and observed compositions, and an associated p-value from a permutation test (line 159). The permutations are on the ordering of OTUs in the observed data. This allows us to test if the OTUs are contributing similarly to the observed and projected compositions.
In short, we find that the projected compositions still maintain cluster assignments reasonably well and that OTU compositions are similar between projected and observed data as per the KL divergence.

Reviewer #2 (Comments for the Author):
I have some questions, mostly minor and concerning the methodology, and a few remarks:

How do you see your method being incorporated in future gut microbiome research?
Can you maybe give one or two concrete applications for which the method might be beneficial?
We have added one such application at line 218 to the discussion.

2.
As the fitting of the GMM is a stochastic process, how stable is the choice of one-and two-dimensional models based on the AIC? Also, the number of evaluated topologies seems rather small, especially for the two-dimensional models (from 2x2 to 2x4). Why not go higher?
Because each mixture component would have a full covariance matrix, each mixture would have over 800 associated parameters. We felt that evaluating larger topologies would lead to severe overfitting due to the presence of far more parameters than data points.

Why do you focus on 95% of most abundant taxa over all three datasets instead per dataset?
This is a conservative evaluation of performance, as it does not take advantage of low-abundance datasetspecific taxa that would allow for distinction between regions in a dataset without computational sophistication. We wanted to be confident that our model was capturing global spatial dynamics in its choice of model dimension instead of just modelling low-abundant taxa.

5.
How were the reads preprocessed? At which taxonomic level did you analyze them? If I am not mistaken this is the OTU level, but results for Fig. 6 are displayed on the phylum level. Did you find similar patterns at the OTU level?
Reads were preprocessed in Sheth et. al, 2019. The data provided was only the relative abundances of the taxa. Reads were analyzed at the genus level. In Figure 6, we aggregate the abundances of various genuses into a single abundance of their respective phylum (the gray bars). We also show the aggregated abundances of specific families within the Firmicutes phyla because there is some knowledge in science literature of their expected behavior along pH gradients.

Are the reads available on a public repository?
The reads from Sheth et. al are available at this github repository: http://github.com/ravisheth/mapseq

l241: What is the Inverse-Wishart distribution?
The Inverse-Wishart distribution is a conjugate prior of the multivariate normal distribution when sampling covariance matrices. It is a standard prior used in contemporary Bayesian inference. We now refer to the relevant textbook in the manuscript text (reference 15).
8. Did you use other software packages to implement the method? If so, please reference them.