MiMultiCat: A Unified Cloud Platform for the Analysis of Microbiome Data with Multi-Categorical Responses

Kim, Jihun; Jang, Hyojung; Koh, Hyunwook

doi:10.3390/bioengineering11010060

Open AccessArticle

MiMultiCat: A Unified Cloud Platform for the Analysis of Microbiome Data with Multi-Categorical Responses

by

Jihun Kim

,

Hyojung Jang

and

Hyunwook Koh

^*

Department of Applied Mathematics and Statistics, The State University of New York (SUNY), Incheon 21985, Republic of Korea

^*

Author to whom correspondence should be addressed.

Bioengineering 2024, 11(1), 60; https://doi.org/10.3390/bioengineering11010060

Submission received: 2 December 2023 / Revised: 21 December 2023 / Accepted: 31 December 2023 / Published: 8 January 2024

Download

Browse Figures

Versions Notes

Abstract

:

The field of the human microbiome is rapidly growing due to the recent advances in high-throughput sequencing technologies. Meanwhile, there have also been many new analytic pipelines, methods and/or tools developed for microbiome data preprocessing and analytics. They are usually focused on microbiome data with continuous (e.g., body mass index) or binary responses (e.g., diseased vs. healthy), yet multi-categorical responses that have more than two categories are also common in reality. In this paper, we introduce a new unified cloud platform, named MiMultiCat, for the analysis of microbiome data with multi-categorical responses. The two main distinguishing features of MiMultiCat are as follows: First, MiMultiCat streamlines a long sequence of microbiome data preprocessing and analytic procedures on user-friendly web interfaces; as such, it is easy to use for many people in various disciplines (e.g., biology, medicine, public health). Second, MiMultiCat performs both association testing and prediction modeling extensively. For association testing, MiMultiCat handles both ecological (e.g., alpha and beta diversity) and taxonomical (e.g., phylum, class, order, family, genus, species) contexts through covariate-adjusted or unadjusted analysis. For prediction modeling, MiMultiCat employs the random forest and gradient boosting algorithms that are well suited to microbiome data while providing nice visual interpretations. We demonstrate its use through the reanalysis of gut microbiome data on obesity with body mass index categories. MiMultiCat is freely available on our web server.

Keywords:

microbiome data analysis; cloud computing; human microbiome; multi-categorical response; microbiome association testing; microbiome prediction modeling

1. Introduction

The field of the human microbiome is rapidly growing due to the recent advances in high-throughput sequencing technologies (e.g., 16S rRNA amplicon sequencing [1,2] and shotgun metagenomics [3]). Investigators currently seek to discover novel biomarkers that are crucial to human health or disease (e.g., cancer [4,5,6], diabetes [7,8], obesity [9,10], intestinal disease [11,12], oral disease [13], nasal disease [14]) through the human microbiome.

There have also been many new analytic pipelines, methods and/or tools developed for microbiome data preprocessing and analytics [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]. They are usually focused on the microbiome data with continuous (e.g., body mass index) or binary responses (e.g., diseased vs. healthy). However, in reality, multi-categorical responses that have more than two categories are also common. Multi-categorical responses in human microbiome studies are, for example, (i) stages I, II, III and IV for cancer/diabetes, (ii) underweight, normal, overweight and obese for obesity, (iii) extremely preterm, very preterm and moderate to late preterm for preterm birth and so forth. Nevertheless, currently, there is no well-planned routine software for the analysis of microbiome data with multi-categorical responses.

In this paper, we introduce a new unified cloud platform, named MiMultiCat, for the analysis of microbiome data with multi-categorical responses. The two main distinguishing features of MiMultiCat that we stress are as follows: First, MiMultiCat streamlines a long sequence of microbiome data preprocessing and analytic procedures extensively on user-friendly web interfaces. Microbiome data are highly complex, requiring a broad spectrum of expertise to deal with them. It is also time-consuming and laborious to write computer codes on command-line interfaces. Yet, MiMultiCat resolves all such hassles based on ‘easy-to-use’ and ‘step-by-step’ web environments and, thus, can benefit many people in various disciplines (e.g., biology, medicine, public health). Second, MiMultiCat performs both association testing and prediction modeling extensively. For association testing, MiMultiCat handles both ecological (e.g., alpha and beta diversity) and taxonomical (e.g., phylum, class, order, family, genus, species) contexts through covariate-adjusted or unadjusted analysis. Investigators seek not only to comprehend the microbiome as a microbial community or ecosystem but also seek to dissect it into upper- and lower-level microbial taxa. Covariate-adjusted analytics are also necessary to prevent spurious discoveries especially for observational studies, in which many potential confounding factors (e.g., age, sex) can be involved. Here, we emphasize that MiMultiCat is designed to satisfy such analytic demands extensively. For prediction modeling, MiMultiCat employs the random forest [32] and gradient boosting [33,34] algorithms that are well suited to microbiome data to account for possibly highly non-linear patterns of relationships. They are also easy to understand and interpret while providing nice visual representations to rank microbial taxa in importance and illustrate the delicate patterns of the relationships to human health or disease responses.

We organize the rest of the paper as follows: In the following Materials and Methods section, we describe the underlying statistical methods, web server architecture and example data for MiMultiCat. Then, in the Results section, we describe all the modules of MiMultiCat along with its applications to gut microbiome data on obesity with body mass index (BMI) categories [35]. Finally, in the Discussion section, we summarize and finish with concluding remarks.

2. Materials and Methods

2.1. Association Testing

Alpha diversity and taxonomic analysis (Table 1): For a nominal response variable, MiMultiCat employs the multinomial logistic regression model for covariate-adjusted analysis and (i) the parametric analysis of variance (ANOVA) F-test for global testing and Tukey’s honestly significant difference test for pairwise comparisons [36], (ii) the non-parametric Kruskal–Wallis test [37] for global testing along with Dunn’s test [38] for pairwise comparisons and (iii) multinomial logistic regression for unadjusted analysis. For an ordinal response variable, MiMultiCat employs the proportional odds regression model [39] for both covariate-adjusted and unadjusted analysis.

Beta diversity analysis (Table 2): For a nominal response variable, MiMultiCat employs the microbiome kernel association test for multi-categorical outcomes (MiRKAT-MC) [40] for covariate adjusted analysis and non-parametric multivariate analysis of variance (PERMANOVA) [41,42] and MiRKAT-MC [40] for unadjusted analysis. For an ordinal response variable, MiMultiCat employs MiRKAT-MC [40] for both covariate-adjusted and unadjusted analyses.

2.2. Prediction Modeling

For prediction modeling, MiMultiCat employs the random forest [32] and gradient boosting [33,34] algorithms. Microbiome data are highly complex with high dimensionality, sparsity, compositionality and phylogenetic relationships; as such, the patterns of the relationship between microbial taxa and health or disease responses can be highly discrete or irregular. Thus, tree ensemble algorithms, such as the random forest [32] and gradient boosting [33,34] algorithms, can be well suited while robustly accounting for non-linear patterns and decorrelating compositional/phylogenetic relatedness [31].

For the random forest algorithm [32], MiMultiCat trains it to tune the number of randomly selected taxa for each tree in the ensemble through cross-validation based on the loss of Gini impurity. MiMultiCat reports two main outputs for the random forest [32] using (i) a variable importance plot to rank microbial taxa in classification ability and (ii) a partial dependence plot to display the relationship patterns between microbial taxa and muti-categorical responses.

For the gradient boosting algorithm [33,34], MiMultiCat trains it slowly using a small learning rate with or without

L_{2}

regularization [34]. MiMultiCat tunes the number of trees (i.e., the number of iterations) and the number of partitions for each tree in the ensemble (also known as the interaction depth) simultaneously through cross-validation based on the loss of cross-entropy. For gradient boosting [33,34], a popular software package, XGBoost [34], was employed for fast C++ computations. MiMultiCat reports two main outputs using (i) a Shapley additive explanation (SHAP) plot to rank microbial taxa in classification ability and (ii) a partial dependence plot to display the relationship patterns between microbial taxa and multi-categorical responses.

2.3. Web Server Architecture

We wrote MiMultiCat using R language and constructed its app interfaces using R shiny 1.7.5. We deployed it onto the web (http://mimulticat.micloud.kr, access date: 26 December 2023) using ShinyProxy 2.6.1 (https://www.shinyproxy.io, access date: 1 November 2023) and Apache2 (https://httpd.apache.org, access date: 1 November 2023). Our web server runs on the operating system Ubuntu 20.04 (https://ubuntu.com, access date: 1 November 2023) and the computing device ThinkCentre Neo 50S Gen3 (Lenovo, Quarry Bay, Hong Kong) that comes with the specifications of an Intel Core i9-12900 (16-core) processor (Intel, Santa Clara, CA, USA) and 64 GB DDR4 memory (Samsung, Seoul, Republic of Korea). MiMultiCat can accommodate up to ten concurrent web server users. When the server is busy, users can alternatively run it using their local computer through our GitHub repository (https://github.com/jkim209/mimulticatgit, access date: 26 December 2023).

2.4. Data Availability

We describe each module in the following Results section using public gut microbiome data on obesity (see Application notes) published in [35], where BMI was used as a measure of obesity. We categorized it into three levels, normal (18.5 ≤ BMI < 25), overweight (25 ≤ BMI < 30) and obese (30 ≤ BMI). The 16S raw sequence data are publicly available from the European Bioinformatics Institute (EMBL-EBI) database with access numbers ERP006339 and ERP006342 [35]. We processed them using QIIME 1.7.0 [15,16] and FastTree [43] based on the GreenGenes 12.10 database (https://greengenes.secondgenome.com, access date: 1 November 2023) to construct the feature table, taxonomic table and phylogenetic tree. The final processed microbiome data together with the meta/sample data are also available as example data in the Data Input module so that our users can easily comprehend suitable data formats.

3. Results

3.1. Data Processing

The Data Processing module is composed of three sub-modules, Data Input, Quality Control and Data Transformation as in [29,30,31].

First, in the Data Input module, users can upload microbiome data using a unified format, called phyloseq [25], or using separate files. Then, in the Quality Control module, they can perform quality controls with respect to (i) the kingdom of interest (default: Bacteria); (ii) minimum library size (i.e., total read count) to remove the individuals that have low library sizes (default: 3000); (iii) minimum mean proportion (i.e., relative abundance) to remove the features (i.e., operational taxonomic units (OTUs) or amplicon sequence variants (ASVs)) that have low mean proportions (default: 0.002%); and (iv) errors in taxonomic names to be removed from the taxonomic table. MiMultiCat (i) shows the sample size and the numbers of features, phyla, classes, orders, families, genera and species and (ii) creates interactive histograms and box plots for the library sizes across individuals and the mean proportions across features. Then, finally, in the Data Transformation module, MiMultiCat computes ecological indices (i.e., nine alpha diversity indices: Observed, Shannon [44], Simpson [45], Inverse Simpson [45], Fisher [46], Chao1 [47], abundance-based coverage estimator (ACE) [48], incidence-based coverage estimator (ICE) [49] and phylogenetic diversity (PD) [50] and five beta-diversity indices: Jaccard dissimilarity [51], Bray–Curtis dissimilarity [52], Unweighted UniFrac distance [53], Generalized UniFrac distance [54], Weighted UniFrac distance [55]). MiMultiCat also normalizes taxonomic abundances using the widely used methods of centered-log ratio (CLR) [56], rarefaction [57], proportion and arcsine root transformation.

Application notes: First, we uploaded the example gut microbiome data on obesity through the Data Input module. Second, we applied the default quality control settings, and then rescued 968 individuals for 484 features (7 phyla, 15 classes, 20 orders, 29 families, 44 genera, 29 species) through the Quality Control module. Finally, we computed ecological indices and normalized taxonomic abundances through the Data Transformation module.

3.2. Data Analysis: Association

The purpose of the Data Analysis: Association module is to conduct microbiome association testing with respect to ecological (e.g., alpha and beta diversity) and taxonomical (e.g., phylum, class, order, family, genus, species) contexts; as such, it is composed of three sub-modules, Alpha Diversity, Beta Diversity and Taxonomic Analysis. For this, users need to select a multi-categorical response variable and set up its variable type as nominal or ordinal. Then, they can reorder/rename the categories of the selected response variable. This is to set up the reference category, to change the orders of the categories and/or to change the names of the categories to be better displayed on the output plots. Then, users need to select covariates (e.g., age, sex) for covariate-adjusted analysis or not for unadjusted analysis. Then, they are supposed to select a statistical method from the list of available statistical methods as in Table 1 and Table 2.

For taxonomic analysis, users additionally need to select a data format such as CLR [56] (default), proportion, arcsine root or rarefied counts [57] and taxonomic ranks to be surveyed as ‘from phylum to genus (16S)’ or ‘from phylum to species (metagenomics)’. MiMultiCat applies the Benjamini–Hochberg procedures [58] to control for false discovery rate (FDR) per taxonomic rank.

Application notes: We selected the categorical BMI variable as the response variable and set up its variable type as ordinal. Then, we changed the names of the categories to Normal (level 1), Over (level 2) and Obese (level 3). Then, we selected age and sex as covariates to be adjusted. Then, we selected the proportional odds model [39] for alpha diversity analysis and taxonomic analysis and MiRKAT-MC [40] for beta diversity analysis. For taxonomic analysis, we selected CLR [56] for the data format and ‘from phylum to genus (16S)’ for the taxonomic ranks to be surveyed. Then, we observed no significant association between alpha diversity and the BMI level with respect to any alpha diversity indices at the significance level of 0.05 (Figure 1). However, we observed significant disparities in beta diversity with respect to the Jaccard dissimilarity [51], Bray–Curtis dissimilarity [52], Unweighted UniFrac distance [53], Generalized UniFrac distance [54] and Weighted UniFrac distance [55], across the levels of BMI at the significance level of 0.05 (Figure 2). Finally, we found three phyla (Fimicutes, Tenericutes, Actinobacteria), five classes (Bacilli, Mollicutes, RF3, Clostridia, Actinobacteria), five orders (ML615J-28, RF39, Clostridiales, Bifidobacteriales), four families (Streptococcaceae, Veillonellaceae, Lachnospiraceae, Christensenellaceae) and three genera (Streptococcus, Oscillospira, Blautia) to be significantly associated microbial taxa at a false discovery rate of 0.05 (Figure 3).

3.3. Data Analysis: Prediction

The purpose of the Data Analysis: Prediction module is to conduct microbiome prediction modeling using the random forest [32] and gradient boosting [33,34] algorithms. For this, users first need to select a data format such as CLR [56] (default), proportion, arcsine root or rarefied counts [57]. Then, they need to select a multi-categorical response variable and can reorder/rename the categories of the selected response variable.

For the random forest algorithm [32], users need to select (i) 5-fold or 10-fold cross-validation, (ii) the number of trees to be aggregated in the ensemble (default: 1000), (iii) the maximum number of taxa to be displayed in the variable importance and partial dependence plots (default: 20) and (iv) taxonomic ranks to be surveyed ‘from phylum to genus (16S)’ or ‘from phylum to species (metagenomics)’. Note that we set up the number of trees to be aggregated in the ensemble as 1000 for quick check-ups, but the random forest algorithm [32] has no overfitting issue with an increase in the number of trees. Instead, an increase in the number of trees is only favorable in making the error rate stable with a sufficient convergence [32]. Thus, we recommend increasing the number of trees as much as possible (e.g., 10,000), but this comes with the cost of slow computation.

For the gradient boosting [33,34], users need to select (i) 5-fold or 10-fold cross-validation, (ii) the maximum number of trees in the boosting process (default is 1000), (iii) the learning rate (default: 0.005), (iv) the number of taxa to be displayed in the variable importance and partial dependence plots (default: 20) and (v) taxonomic ranks to be surveyed ‘from phylum to genus (16S)’ or ‘from phylum to species (metagenomics)’. Note that we set up the maximum number of trees in the boosting process as 1000 and the learning rate as 0.005. First, the gradient boosting [33,34] may have an overfitting issue with an increase in the number of trees, but as we described in Materials and Methods: 2.2 Prediction modelling, MiMultiCat tunes the number of trees through cross-validation; as such, we can avoid overfitting. We wrote it as the maximum number of trees, not the number of trees; as such, it is only about the capacity of candidate numbers of trees to be tuned. If it is small, the gradient boosting [33,34] can stop early, leading to underfitting. Thus, we recommend increasing the maximum number of trees as much as possible (e.g., 10,000), but this comes with the cost of slow computation. Second, the learning rate is the rate of newly fitted trees reflected in the update [33,34]. If it is small, the tree grows slowly; as such, we can fine-tune it, making the error rate stable with sufficient convergence [33,34]. Thus, we recommend a small learning rate (e.g., 0.001) but, again, this comes with the cost of slow computation.

Note that if any of the above training settings for the random forest [32] or gradient boosting [34] algorithms change, the fitted model will change, leading to different results. This is an issue, but it is not unique to the random forest [32] or gradient boosting [34] algorithms. Indeed, there is no machine learning algorithm that provides closed-form results; as such, the results can vary across different runs. However, since we set up a seed number, MiMultiCat provides the same results under the same training settings; as such, the same result can be reproduced under the same training settings.

Application notes: We selected CLR [56] as the data format and the categorical BMI variable as the response variable. Then, we changed the names of the categories to Normal (level 1), Over (level 2) and Obese (level 3).

For the random forest algorithm [32], we selected (i) 5-fold cross-validation, (ii) the number of trees to be aggregated in the ensemble as 10,000, (iii) the maximum number of taxa to be displayed as 20 and (iv) taxonomic ranks to be surveyed as ‘from phylum to genus (16S)’. Here, we selected the maximum possible number of trees (i.e., 10,000) since we intended to gladly endure some slow computation to obtain better results. Then, we found, at the genus level, G25: Lachnospira; G15: Oscillospira; G14: SMB53; G20: Roseburia; G22: Dorea; G28: Akkermansia; G30: Bilophila; G11: Acidaminococcus; G38: Butyricimonas; G1: Streptococcus; G27: Collinsella; G26: Bifidobacterium; G44: Sutterella; G31: Prevotella; G33: Paraprevotella; G41: Haemophilus; G23: Anaerostipes; G4: Eubacterium; G21: Ruminococcus; and G10: Phascolarctobacterium as the top 20 most important microbial taxa in predicting levels of BMI (Figure 4). We can also see their relative abundances are larger for lower levels of BMI (Figure 5). Thus, these genera might be beneficial microbes in preventing obesity. This may indicate in a clinical sense that we can lower obesity levels by increasing their relative abundances using dietary supplements or therapeutics.

For the gradient boosting algorithm [33,34], we selected (i) 5-fold cross-validation, (ii) the maximum number of trees in the boosting process as 10,000, (iii) the learning rate as 0.001, (iv) the number of taxa to be displayed as 20 and (v) taxonomic ranks to be surveyed as ‘from phylum to genus (16S)’. Here, we selected the maximum training capacities (i.e., the maximum number of trees: 10,000; the learning rate: 0.001) since we intended to gladly endure some slow computation to obtain better results. Then, we found, at the genus level, G20: Roseburia; G14: SMB53; G25: Lachnospira; G15: Oscillospira; G22: Dorea; G28: Akkermansia; G33: Paraprevotella; G11: Acidaminococcus; G26: Bifidobacterium; G1: Streptococcus; G36: Parabacteroides; G13: Sarcina; G27: Collinsella; G21: Ruminococcus; G19: Blautia; G35: Alistipes; G10: Phascolarctobacterium; G4: Eubacterium; G38: Butyricimonas; and G44: Sutterella as the top 20 most important microbial taxa in predicting levels of BMI (Figure 6). We can also see their relative abundances are larger for lower levels of BMI (Figure 7). Thus, these genera might be beneficial microbes in preventing obesity. Again, in a clinical sense, we might be able to lower obesity levels by increasing their relative abundances using dietary supplements or therapeutics.

4. Discussion

In this paper, we introduced a cloud platform, MiMultiCat, for the analysis of microbiome data with multi-categorical responses. We summarize the two main features of MiMultiCat as follows: (i) it is user-friendly, streamlining a long sequence of microbiome data preprocessing and analytic procedures on a step-by-step web environment, and (ii) it is comprehensive, performing both association testing and prediction modeling extensively. We also demonstrated the use of MiMultiCat through the reanalysis of gut microbiome data on obesity with BMI categories [35].

The field of the human microbiome is rapidly emerging, and many people from various disciplines (e.g., biology, medicine, public health) conduct human microbiome studies to discover novel microbial biomarkers that are important in human health or disease. However, there are many different preprocessing and analytic procedures that are involved in the analysis of microbiome data; hence, it is demanding, time-consuming and laborious. Yet, we stress again that MiMultiCat is easy to use and, thus, can benefit many human microbiome researchers as a well-planned routine software for the analysis of microbiome data with multi-categorical responses.

However, we note that association or prediction does not necessarily imply causation. To discover causal microbial biomarkers, investigators need to study further the key underlying mechanisms, such as immunologic or metabolic regulations and digestive processes, of the human microbiome. In addition, scientific knowledge, experimental design, bioengineering technology, bioinformatic or statistical methodology and so forth all together can aid in human microbiome research. Yet, we could not achieve all such goals in this research.

MiMultiCat is freely available on our web server (http://mimulticat.micloud.kr). When the server is busy, users can alternatively run it using their local computer through our GitHub repository (https://github.com/jkim209/mimulticatgit). We monitor our web server and GitHub repository periodically. We are committed to maintaining MiMultiCat at the highest quality. If you have any question or problem using MiMultiCat, you can report it on our GitHub page (https://github.com/jkim209/mimulticatgit/issues) or email the maintainer, Jihun Kim ([email protected]).

Author Contributions

J.K. wrote the programs, developed the web server and GitHub repository, performed data analysis and contributed to visualization and design. H.J. wrote the programs and contributed to the web server development, visualization and design. H.K. conceptualized and initiated the study, curated the data, contributed to the methodological aspects and overall architecture and wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (2021R1C1C1013861).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The 16S raw sequence data are publicly available from the European Bioinformatics Institute (EMBL-EBI) database with access numbers ERP006339 and ERP006342 [35]. We processed them using QIIME 1.7.0 [15,16] and FastTree [47] based on the GreenGenes 12.10 database (https://greengenes.secondgenome.com, access date: 1 November 2023) to construct the feature table, taxonomic table and phylogenetic tree. The final processed microbiome data together with the meta/sample data are also available as example data in the Data Input module so that our users can easily comprehend suitable data formats.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Woese, C.R.; Fox, G.E. Phylogenetic Structure of the Prokaryotic Domain: The Primary Kingdoms. Proc. Natl. Acad. Sci. USA 1977, 74, 5088–5090. [Google Scholar] [CrossRef] [PubMed]
Woese, C.R.; Kandler, O.; Wheelis, M.L. Towards a Natural System of Organisms: Proposal for the Domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. USA 1990, 87, 4576–4579. [Google Scholar] [CrossRef] [PubMed]
Thomas, T.; Gilbert, J.; Meyer, F. Metagenomics—A Guide from Sampling to Data Analysis. Microb. Inform. Exp. 2012, 2, 3. [Google Scholar] [CrossRef] [PubMed]
Matson, V.; Fessler, J.; Bao, R.; Chongsuwat, T.; Zha, Y.; Alegre, M.L.; Luke, J.J.; Gajewski, T.F. The Commensal Microbiome Is Associated with Anti–PD-1 Efficacy in Metastatic Melanoma Patients. Science 2018, 359, 104–108. [Google Scholar] [CrossRef] [PubMed]
Gopalakrishnan, V.; Spencer, C.N.; Nezi, L.; Reuben, A.; Andrews, M.C.; Karpinets, T.V.; Prieto, P.A.; Vicente, D.; Hoffman, K.; Wei, S.C.; et al. Gut Microbiome Modulates Response to Anti–PD-1 Immunotherapy in Melanoma Patients. Science 2018, 359, 97–103. [Google Scholar] [CrossRef]
Limeta, A.; Ji, B.; Levin, M.; Gatto, F.; Nielsen, J. Meta-Analysis of the Gut Microbiota in Predicting Response to Cancer Immunotherapy in Metastatic Melanoma. JCI Insight 2020, 5, e140940. [Google Scholar] [CrossRef] [PubMed]
Zhang, X.S.; Li, J.; Krautkramer, K.A.; Badri, M.; Battaglia, T.; Borbet, T.C.; Koh, H.; Ng, S.; Sibley, R.A.; Li, Y. Antibiotic-Induced Acceleration of Type 1 Diabetes Alters Maturation of Innate Intestinal Immunity. eLife 2018, 7, e37816. [Google Scholar] [CrossRef] [PubMed]
Livanos, A.E.; Greiner, T.U.; Vangay, P.; Pathmasiri, W.; Stewart, D.; McRitchie, S.; Li, H.; Chung, J.; Sohn, J.; Kim, S. Antibiotic-Mediated Gut Microbiome Perturbation Accelerates Development of Type 1 Diabetes in Mice. Nat. Microbiol. 2016, 1, 16140. [Google Scholar] [CrossRef]
Turnbaugh, P.J.; Ley, R.E.; Mahowald, M.A.; Magrini, V.; Mardis, E.R.; Gordon, J.I. An Obesity-Associated Gut Microbiome with Increased Capacity for Energy Harvest. Nature 2006, 444, 1027–1031. [Google Scholar] [CrossRef]
Ridaura, V.K.; Faith, J.J.; Rey, F.E.; Cheng, J.; Duncan, A.E.; Kau, A.L.; Griffin, N.W.; Lombard, V.; Henrissat, B.; Bain, J.R.; et al. Gut Microbiota from Twins Discordant for Obesity Modulate Metabolism in Mice. Science 2013, 341, 1241214. [Google Scholar] [CrossRef]
De Palma, G.; Lynch, M.D.J.; Lu, J.; Dang, V.T.; Deng, Y.; Jury, J.; Umeh, G.; Miranda, P.M.; Pigrau Pastor, M.; Sidani, S.; et al. Transplantation of Fecal Microbiota from Patients with Irritable Bowel Syndrome Alters Gut Function and Behavior in Recipient Mice. Sci. Transl. Med. 2017, 9, eaaf6397. [Google Scholar] [CrossRef] [PubMed]
Johnsen, P.H.; Hilpüsch, F.; Cavanagh, J.P.; Leikanger, I.S.; Kolstad, C.; Valle, P.C.; Goll, R. Faecal Microbiota Transplantation versus Placebo for Moderate-to-Severe Irritable Bowel Syndrome: A Double-Blind, Randomised, Placebo-Controlled, Parallel-Group, Single-Centre Trial. Lancet Gastroenterol. Hepatol. 2018, 3, 17–24. [Google Scholar] [CrossRef] [PubMed]
Park, B.; Koh, H.; Patatanian, M.; Reyes-Caballero, H.; Zhao, N.; Meinert, J.; Holbrook, J.T.; Leinbach, L.I.; Biswal, S. The Mediating Roles of the Oral Microbiome in Saliva and Subgingival Sites between E-Cigarette Smoking and Gingival Inflammation. BMC Microbiol. 2023, 23, 35. [Google Scholar] [CrossRef] [PubMed]
Zhao, N.; Khamash, D.F.; Koh, H.; Voskertchian, A.; Egbert, E.; Mongodin, E.F.; White, J.R.; Hittle, L.; Colantuoni, E.; Milstone, A.M. Low Diversity in Nasal Microbiome Associated with Staphylococcus Aureus Colonization and Bloodstream Infections in Hospitalized Neonates. Open Forum Infect Dis 2021, 8, ofab475. [Google Scholar] [CrossRef] [PubMed]
Caporaso, J.G.; Kuczynski, J.; Stombaugh, J.; Bittinger, K.; Bushman, F.D.; Costello, E.K.; Fierer, N.; Peña, A.G.; Goodrich, J.K.; Gordon, J.I. QIIME Allows Analysis of High-Throughput Community Sequencing Data. Nat. Methods 2010, 7, 335–336. [Google Scholar] [CrossRef] [PubMed]
Bolyen, E.; Rideout, J.R.; Dillon, M.R.; Bokulich, N.A.; Abnet, C.C.; Al-Ghalith, G.A.; Alexander, H.; Alm, E.J.; Arumugam, M.; Asnicar, F.; et al. Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using QIIME 2. Nat. Biotechnol. 2019, 37, 852–857. [Google Scholar] [CrossRef] [PubMed]
Zhao, N.; Chen, J.; Carroll, I.M.; Ringel-Kulka, T.; Epstein, M.P.; Zhou, H.; Zhou, J.J.; Ringel, Y.; Li, H.; Wu, M.C. Testing in Microbiome-Profiling Studies with MiRKAT, the Microbiome Regression-Based Kernel Association Test. Am. J. Hum. Genet. 2015, 96, 797–807. [Google Scholar] [CrossRef] [PubMed]
Koh, H.; Blaser, M.J.; Li, H. A Powerful Microbiome-Based Association Test and a Microbial Taxa Discovery Framework for Comprehensive Association Mapping. Microbiome 2017, 5, 45. [Google Scholar] [CrossRef]
Koh, H. An Adaptive Microbiome α-Diversity-Based Association Analysis Method. Sci Rep. 2018, 8, 1–12. [Google Scholar] [CrossRef]
Koh, H.; Zhao, N. A Powerful Microbial Group Association Test Based on the Higher Criticism Analysis for Sparse Microbial Association Signals. Microbiome 2020, 8, 63. [Google Scholar] [CrossRef]
Mandal, S.; Van Treuren, W.; White, R.A.; Eggesbø, M.; Knight, R.; Peddada, S.D. Analysis of Composition of Microbiomes: A Novel Method for Studying Microbial Composition. Microb. Ecol. Health Dis. 2015, 26, 27663. [Google Scholar] [CrossRef]
Dhariwal, A.; Chong, J.; Habib, S.; King, I.L.; Agellon, L.B.; Xia, J. MicrobiomeAnalyst: A Web-Based Tool for Comprehensive Statistical, Visual and Meta-Analysis of Microbiome Data. Nucleic Acids Res. 2017, 45, W180–W188. [Google Scholar] [CrossRef] [PubMed]
Arndt, D.; Xia, J.; Liu, Y.; Zhou, Y.; Guo, A.C.; Cruz, J.A.; Sinelnikov, I.; Budwill, K.; Nesbø, C.L.; Wishart, D.S. METAGENassist: A Comprehensive Web Server for Comparative Metagenomics. Nucleic Acids Res. 2012, 40, W88–W95. [Google Scholar] [CrossRef] [PubMed]
Yoon, S.H.; Ha, S.M.; Kwon, S.; Lim, J.; Kim, Y.; Seo, H.; Chun, J. Introducing EzBioCloud: A Taxonomically United Database of 16S rRNA Gene Sequences and Whole-Genome Assemblies. Int. J. Syst. Evol. Microbiol. 2017, 67, 1613–1617. [Google Scholar] [CrossRef] [PubMed]
McMurdie, P.J.; Holmes, S. Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE 2013, 8, e61217. [Google Scholar] [CrossRef] [PubMed]
Weber, N.; Liou, D.; Dommer, J.; MacMenamin, P.; Quiñones, M.; Misner, I.; Oler, A.J.; Wan, J.; Kim, L.; McCarthy, M.C.; et al. Nephele: A Cloud Platform for Simplified, Standardized and Reproducible Microbiome Data Analysis. Bioinformatics 2018, 34, 1411–1413. [Google Scholar] [CrossRef] [PubMed]
Gonzalez, A.; Navas-Molina, J.A.; Kosciolek, T.; McDonald, D.; Vázquez-Baeza, Y.; Ackermann, G.; DeReus, J.; Janssen, S.; Swafford, A.D.; Orchanian, S.B.; et al. Qiita: Rapid, Web-Enabled Microbiome Meta-Analysis. Nat. Methods 2018, 15, 796–798. [Google Scholar] [CrossRef] [PubMed]
Mitchell, K.; Ronas, J.; Dao, C.; Freise, A.C.; Mangul, S.; Shapiro, C.; Moberg-Parker, J. PUMAA: A Platform for Accessible Microbiome Analysis in the Undergraduate Classroom. Front. Microbiol. 2020, 11, 584699. [Google Scholar] [CrossRef] [PubMed]
Gu, W.; Moon, J.; Chisina, C.; Kang, B.; Park, T.; Koh, H. MiCloud: A Unified Web Platform for Comprehensive Microbiome Data Analysis. PLoS ONE 2022, 17, e0272354. [Google Scholar] [CrossRef]
Jang, H.; Park, S.; Koh, H. Comprehensive Microbiome Causal Mediation Analysis Using MiMed on User-Friendly Web Interfaces. Biol. Methods Protoc. 2023, 8, bpad023. [Google Scholar] [CrossRef]
Kim, J.; Koh, H. MiTree: A Unified Web Cloud Analytic Platform for User-Friendly and Interpretable Microbiome Data Mining Using Tree-Based Methods. Microorganisms 2023, 11, 2816. [Google Scholar] [CrossRef] [PubMed]
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: San Francisco, CA, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
Goodrich, J.K.; Waters, J.L.; Poole, A.C.; Sutter, J.L.; Koren, O.; Blekhman, R.; Beaumont, M.; Van Treuren, W.; Knight, R.; Bell, J.T. Human Genetics Shape the Gut Microbiome. Cell 2014, 159, 789–799. [Google Scholar] [CrossRef] [PubMed]
Tukey, J.W. Comparing Individual Means in the Analysis of Variance. Biometrics 1949, 5, 99–114. [Google Scholar] [CrossRef] [PubMed]
Kruskal, W.H.; Wallis, W.A. Use of Ranks in One-Criterion Variance Analysis. J. Am. Stat. Assoc. 1952, 47, 583–621. [Google Scholar] [CrossRef]
Dunn, O.J. Multiple Comparisons among Means. J. Am. Stat. Assoc. 1961, 56, 52–64. [Google Scholar] [CrossRef]
McCullagh, P. Regression Models for Ordinal Data. J. R. Stat. Soc. Ser. B Methodol. 1980, 42, 109–127. [Google Scholar] [CrossRef]
Jiang, Z.; He, M.; Chen, J.; Zhao, N.; Zhan, X. MiRKAT-MC: A Distance-Based Microbiome Kernel Association Test with Multi-Categorical Outcomes. Front. Genet. 2022, 13, 841764. [Google Scholar] [CrossRef]
Anderson, M.J. A New Method for Non-parametric Multivariate Analysis of Variance. Austral Ecol. 2001, 26, 32–46. [Google Scholar] [CrossRef]
McArdle, B.H.; Anderson, M.J. Fitting Multivariate Models to Community Data: A Comment on Distance-Based Redundancy Analysis. Ecology 2001, 82, 290–297. [Google Scholar] [CrossRef]
Price, M.N.; Dehal, P.S.; Arkin, A.P. FastTree: Computing Large Minimum Evolution Trees with Profiles Instead of a Distance Matrix. Mol. Biol. Evol. 2009, 26, 1641–1650. [Google Scholar] [CrossRef] [PubMed]
Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
Simpson, E.H. Measurement of Diversity. Nature 1949, 163, 688. [Google Scholar] [CrossRef]
Fisher, R.A.; Corbet, A.S.; Williams, C.B. The Relation between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population. J. Anim. Ecol. 1943, 12, 42–58. [Google Scholar] [CrossRef]
Chao, A. Nonparametric Estimation of the Number of Classes in a Population. Scand. J. Stat. 1984, 11, 265–270. [Google Scholar]
Chao, A.; Lee, S.-M. Estimating the Number of Classes via Sample Coverage. J. Am. Stat. Assoc. 1992, 87, 210–217. [Google Scholar] [CrossRef]
Lee, S.M.; Chao, A. Estimating Population Size via Sample Coverage for Closed Capture-Recapture Models. Biometrics 1994, 50, 88–97. [Google Scholar] [CrossRef]
Faith, D.P. Conservation Evaluation and Phylogenetic Diversity. Biol. Conserv. 1992, 61, 1–10. [Google Scholar] [CrossRef]
Jaccard, P. The Distribution of the Flora in the Alpine zone. New Phytol 1912, 11, 37–50. [Google Scholar] [CrossRef]
Bray, J.R.; Curtis, J.T. An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecol. Monogr. 1957, 27, 326–349. [Google Scholar] [CrossRef]
Lozupone, C.; Knight, R. UniFrac: A New Phylogenetic Method for Comparing Microbial Communities. Appl. Environ. Microbiol. 2005, 71, 8228–8235. [Google Scholar] [CrossRef] [PubMed]
Chen, J.; Bittinger, K.; Charlson, E.S.; Hoffmann, C.; Lewis, J.; Wu, G.D.; Collman, R.G.; Bushman, F.D.; Li, H. Associating Microbiome Composition with Environmental Covariates Using Generalized UniFrac Distances. Bioinformatics 2012, 28, 2106–2113. [Google Scholar] [CrossRef] [PubMed]
Lozupone, C.A.; Hamady, M.; Kelley, S.T.; Knight, R. Quantitative and Qualitative β Diversity Measures Lead to Different Insights into Factors That Structure Microbial Communities. Appl. Environ. Microbiol. 2007, 73, 1576–1585. [Google Scholar] [CrossRef] [PubMed]
Aitchison, J. The Statistical Analysis of Compositional Data. J. R. Stat. Soc. Ser. B Methodol. 1982, 44, 139–160. [Google Scholar] [CrossRef]
Sanders, H.L. Marine Benthic Diversity: A Comparative Study. Am. Nat. 1968, 102, 243–282. [Google Scholar] [CrossRef]
Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 1995, 57, 289–300. [Google Scholar] [CrossRef]

Figure 1. The results from alpha diversity analysis on the association between each alpha diversity index and the levels of BMI adjusted for age and sex based on the proportional odds model.

Figure 2. The results from beta diversity analysis on the association between each beta diversity index and the levels of BMI adjusted for age and sex based on MiRKAT-MC.

Figure 3. The results from taxonomic analysis on the association between each microbial taxon and the levels of BMI adjusted for age and sex based on the proportional odds model. Q-value represents FDR-adjusted p-value.

Figure 4. The variance importance at the genus level from random forest algorithm. G25: Lachnospira; G15: Oscillospira; G14: SMB53; G20: Roseburia; G22: Dorea; G28: Akkermansia; G30: Bilophila; G11: Acidaminococcus; G38: Butyricimonas; G1: Streptococcus; G27: Collinsella; G26: Bifidobacterium; G44: Sutterella; G31: Prevotella; G33: Paraprevotella; G41: Haemophilus; G23: Anaerostipes; G4: Eubacterium; G21: Ruminococcus; G10: Phascolarctobacterium.

Figure 5. The partial dependence plot at the genus level from random forest algorithm. G25: Lachnospira; G15: Oscillospira; G14: SMB53; G20: Roseburia; G22: Dorea; G28: Akkermansia; G30: Bilophila; G11: Acidaminococcus; G38: Butyricimonas; G1: Streptococcus; G27: Collinsella; G26: Bifidobacterium; G44: Sutterella; G31: Prevotella; G33: Paraprevotella; G41: Haemophilus; G23: Anaerostipes; G4: Eubacterium; G21: Ruminococcus; G10: Phascolarctobacterium.

Figure 6. The SHAP variable importance plot at the genus level from gradient boosting algorithm. G20: Roseburia; G14: SMB53; G25: Lachnospira; G15: Oscillospira; G22: Dorea; G28: Akkermansia; G33: Paraprevotella; G11: Acidaminococcus; G26: Bifidobacterium; G1: Streptococcus; G36: Parabacteroides; G13: Sarcina; G27: Collinsella; G21: Ruminococcus; G19: Blautia; G35: Alistipes; G10: Phascolarctobacterium; G4: Eubacterium; G38: Butyricimonas; G44: Sutterella.

Figure 7. The partial dependence plot at the genus level from gradient boosting algorithm. G20: Roseburia; G14: SMB53; G25: Lachnospira; G15: Oscillospira; G22: Dorea; G28: Akkermansia; G33: Paraprevotella; G11: Acidaminococcus; G26: Bifidobacterium; G1: Streptococcus; G36: Parabacteroides; G13: Sarcina; G27: Collinsella; G21: Ruminococcus; G19: Blautia; G35: Alistipes; G10: Phascolarctobacterium; G4: Eubacterium; G38: Butyricimonas; G44: Sutterella.

Table 1. The statistical methods available for alpha diversity analysis and taxonomic analysis.

	Nominal Variable	Ordinal Variable
Covariate-adjusted analysis	Multinomial logistic regression	Proportional odds model
Unadjusted analysis	ANOVA F-test and Tukey’s test, Kruskal–Wallis and Dunn’s test, Multinomial logistic regression	Proportional odds model

Table 2. The statistical methods available for beta diversity analysis.

	Nominal Variable	Ordinal Variable
Covariate-adjusted analysis	MiRKAT-MC	MiRKAT-MC
Unadjusted analysis	PERMANOVA MiRKAT-MC	MiRKAT-MC

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kim, J.; Jang, H.; Koh, H. MiMultiCat: A Unified Cloud Platform for the Analysis of Microbiome Data with Multi-Categorical Responses. Bioengineering 2024, 11, 60. https://doi.org/10.3390/bioengineering11010060

AMA Style

Kim J, Jang H, Koh H. MiMultiCat: A Unified Cloud Platform for the Analysis of Microbiome Data with Multi-Categorical Responses. Bioengineering. 2024; 11(1):60. https://doi.org/10.3390/bioengineering11010060

Chicago/Turabian Style

Kim, Jihun, Hyojung Jang, and Hyunwook Koh. 2024. "MiMultiCat: A Unified Cloud Platform for the Analysis of Microbiome Data with Multi-Categorical Responses" Bioengineering 11, no. 1: 60. https://doi.org/10.3390/bioengineering11010060

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

MiMultiCat: A Unified Cloud Platform for the Analysis of Microbiome Data with Multi-Categorical Responses

Abstract

1. Introduction

2. Materials and Methods

2.1. Association Testing

2.2. Prediction Modeling

2.3. Web Server Architecture

2.4. Data Availability

3. Results

3.1. Data Processing

3.2. Data Analysis: Association

3.3. Data Analysis: Prediction

4. Discussion

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI