Robust and automatic definition of microbiome states

Analysis of microbiome dynamics would allow elucidation of patterns within microbial community evolution under a variety of biologically or economically important circumstances; however, this is currently hampered in part by the lack of rigorous, formal, yet generally-applicable approaches to discerning distinct configurations of complex microbial populations. Clustering approaches to define microbiome “community state-types” at a population-scale are widely used, though not yet standardized. Similarly, distinct variations within a state-type are well documented, but there is no rigorous approach to discriminating these more subtle variations in community structure. Finally, intra-individual variations with even fewer differences will likely be found in, for example, longitudinal data, and will correlate with important features such as sickness versus health. We propose an automated, generic, objective, domain-independent, and internally-validating procedure to define statistically distinct microbiome states within datasets containing any degree of phylotypic diversity. Robustness of state identification is objectively established by a combination of diverse techniques for stable cluster verification. To demonstrate the efficacy of our approach in detecting discreet states even in datasets containing highly similar bacterial communities, and to demonstrate the broad applicability of our method, we reuse eight distinct longitudinal microbiome datasets from a variety of ecological niches and species. We also demonstrate our algorithm’s flexibility by providing it distinct taxa subsets as clustering input, demonstrating that it operates on filtered or unfiltered data, and at a range of different taxonomic levels. The final output is a set of robustly defined states which can then be used as general biomarkers for a wide variety of downstream purposes such as association with disease, monitoring response to intervention, or identifying optimally performant populations.


INTRODUCTION
approach to the distinct problem of defining microbiome states that exhibit short-term transitions within 1. Choosing the k number of clusters (from 2 to 10) with the highest average Silhouette width (SI) 162 among all combinations of pairs of beta diversity measures, with the score being above the SI 163 threshold (0.25), and 164 2. Checking if that k value also passes the Prediction Strength (PS) threshold (0.80) for robustness, or 165 3. Confirming if those k clusters are stable according to the Jaccard threshold (0.75) from a bootstrap-166 ping process 167 We apply those criteria to the output of the 90 combinations (2 algorithms x 5 distances x 9 k values) 168 per dataset to discard those that are not robust and/or not reproducible. 169 The final output of our algorithm is a phyloseq object with a new variable defining the cluster identifier 170 into which each sample has been grouped, a file with <sampleID, clusterID> pairs, and a set of clustering 171 assessment graphs (described below) that can be used to judge robustness.

172
The following sections explain each step in detail, and the relevant factors considered at each point in 173 the automated procedure.

174
Clustering approaches 175 We selected two widely-used algorithms as representative of the two most common clustering approaches: 176 a) PAM (Kaufman and Rousseeuw, 1990), as a partitioning approach, and b) Hclust (Kaufman and 177 Rousseeuw, 1990), as an agglomerative hierarchical approach. 178 We selected PAM (Partitioning Around Medoids) (Kaufman and Rousseeuw, 1990) because it is an 179 improved approach to the well-known non-deterministic k-medoids (Kaufman and Rousseeuw, 1987). 180 This improvement is achieved in two ways. First, PAM selects k medoids among the input instances in a 181 greedy phase, rather than a random selection of the k-medoids (Reynolds et al., 2006). The greedy phase 182 takes each new medoid to minimize the objective function, that is, the sum of distances between each 183 instance and its medoid. Therefore, PAM is a deterministic algorithm and does not need to be run multiple 184 times, as opposed to the k-medoids approach. Second, it spreads out the remaining instances among the 185 defined medoids according to the minimum distance-to-medoid criterion, using the values defined in the 186 distance matrix. PAM then selects each possible pair of instances <medoid, not-medoid>, and evaluates 187 whether the swapping between different clusters results in a smaller value for the objective function. This 188 final step is repeated until the set of medoids do not change. 189 Hclust, a hierarchical clustering algorithm, adopts a bottom-up approach (Kaufman and Rousseeuw, 190 1990). Hclust begins with an independent cluster per instance. The two nearest clusters are then grouped, 191 in an iterative way, until the algorithm arrives at a single cluster representing the root of an inverted tree 192 structure. We chose average (i.e. UPGMA) distance between cluster members as the linkage criteria used 193 to compute the distance between two clusters, rather than single or complete linkage which takes only 194 one cluster element into account when computing the distance (Kaufman and Rousseeuw, 1990). 195 Beta diversity metrics 196 Koren et al. (2013)'s study was used as reference, because they evaluated beta diversity metrics' influence 197 on the clustering of microbiome samples, which mirrors our goals. Their approach recommended 198 comparing 2 or more distance measures in the clustering process, thus we chose 5 distinct distance 199 measures from the list available in the R vegan package (version 2.3-1) (Oksanen et al., 2015). This 200 was based on suggestions from the work of Koren et al. (2013) and a comprehensive study ranking all 201 available beta diversity measures with abundance data (Barwell et al., 2015), that compared multiple 202 quantitative and qualitative properties. 203 First, we selected well-extended Jensen-Shanon Distance (JSD), rootJSD and Bray-Curtis, as these clustering assessment was computed with the silhouette function in the cluster R package; the robustness 260 evaluation is computed with the prediction.strength function (Tibshirani and Walther, 2005) and the 261 corresponding bootstrapping scored by Jaccard similarity with the clusterboot function (Hennig, 2007, 262 2008), both in fpc R package (v2.1-11). 263 In summary, a total of 18,090 different clustering processes are executed to decide the best microbiome 264 state definitions within a given dataset; 9 (k = 2 : 10) potential cluster-numbers x 5 distance measures 265 (JSD, rJSD, Bray-Curtis, Morisita-Horn and Kulczynski) x 201 assessment scores (1 SI + 100 PS + 100 266 Jaccard) x 2 clustering algorithms (PAM and Hclust).

267
Selected microbiome datasets 268 As stated in the introduction, our primary motivation in creating this algorithm was its application to 269 longitudinal datasets, where we anticipate that the between-sample differences to be significantly lower 270 than would be seen in, for example, population-based sampling studies. As such, these datasets represent Citation section. Clearly, however, there is nothing precluding the use of this algorithm to inform any 276 other microbiome study format.

277
The human gut microbiome dataset from David et al. (2014)  The chick gut microbiome dataset was generated by Ballou et al. (2016). It analyses the response

289
The lake microbiome dataset Dam et al. (2016) come from the freshwater Lake Mendota in USA, 290 with 91 samples over 11 years and 17462 taxa, where approximately 5% of the taxa constitute 80-99% of 291 the total microbiome.

292
The  vaginal microbiome dataset consists of 937 samples and 330 OTUs, cor-293 responding to 32 women, with samples collected twice per week for 16 weeks. Dataset details and 294 availability are provided in the Data Citation section. In this case, the original data counts are already 295 pre-processed, and normalized to a sum of 100 per sample, as relative abundances. This contrasts with the 296 previous datasets, where a normalization procedure was applied by us before entering the data into our 297 algorithm.

298
The left and right palm and tongue microbiome datasets are longitudinal datsets from the Human 299 Microbiome Project, analysed in Caporaso et al. (2011)     In the chick gut microbiome, the two clusters identified by our method largely correspond to imma-348 ture/young chicks and mature/adult chicks, reinforcing the conclusion of the original manuscript, which 349 showed chick age to be the primary differential factor between samples, as it is described in the subsection 350 titled "Exemplar application -state sequence diagram".

351
In the vaginal microbiome dataset, our algorithm determined that the strongest supported grouping 352 consisted of 4 states (clusters), rather than the 5 clusters identified in the original manuscript (Gajer 353 et al., 2012). These four states were clearly assignable to the clusters identified in their manuscript as I,

354
II, III, and IV-B (see Supplementary Table S1). Members of their cluster labeled IV-A are distributed 355 among the other states identified by our algorithm, specifically in cluster I (our state 4) and cluster III 356 (our state 1), and several with cluster IV-B (our state 3). We note that, based on our algorithm, Prediction

Manuscript to be reviewed
Strength decreases dramatically when the number of clusters increases from k=4 to k=5. According to the original authors, only cluster IV-B is associated with a notable disease state (bacterial vaginosis), 359 while the remaining clusters correspond to a (nominally) healthy vaginal microbiome with no phenotypic 360 distinction between the subjects in those clusters . Thus there is no biological evidence 361 arising from this study contradicting our assignment of only four robust states in this dataset, versus the 362 five clusters described in the original study. The results of the vaginal microbiome are explored in greater 363 detail in the subsection titled "Relationship to other definitions of microbiome" below.

364
Without any published data providing an expert classification of the preterm infant gut microbiome 365 dataset from La Rosa et al. (2014), our robust clustering methodology determined that the optimal number 366 of states that could be robustly partitioned was k=4 . When we analyse the microbial composition of these 367 4 states, we found that state 2 contains ∼50% of the samples, state 1 contains ∼25%, state 3 ∼20% and  to the algorithm (number of taxa available in Table 2). This was done to determine if the identification of 387 state-clusters was primarily reliant on the most dominant taxa, or if a significant "signal" was emerging 388 from the remaining taxa. Second, where the source data allows, we aggregate taxa at a higher resolution 389 (at Genus taxonomic level) and re-execute the analysis of all, dominant and non-dominant taxa at the 390 genus (see columns "Genus" in Table 1). This was done because, in many studies, OTU tables are 391 aggregated at the Genus level. As such, we wished to demonstrate that the algorithm can perform on data 392 organized in this way.  these six clusters align with the increasing age of the chicks (see Figure 6 from (García-Jiménez et al.,     Manuscript to be reviewed community features between large segments of the population. Other studies, looking at other body 431 cavities, have coined phrases such as "state type" or "community groups" to describe a clustering of 432 microbiome community structures, at a given level of granularity. The selected level of granularity, 433 however, is somewhat arbitrarily decided, and indeed, several authors recognize this by indicating that 434 they observe what appear to be sub-groups within their defined groupings.

435
To demonstrate how our concept of "state" fits into this milieu of varying descriptors, we reanalyse their analysis, five "community groups". Each group was associated to its probability of the incidence of 438 bacterial vaginosis, and also the proportion of individuals with different ethnicities within that group. 439 We applied our algorithm, independently, to members of each of the more populated community 440 groups from that study: I, III and IV. This analysis yielded 2, 3 and 2 additional, more fine-grained 441 states (respectively) as the highest-scoring states within these community groups (see Fig. 4). The data 442 potentially support an even larger number of more fine-grained states, given that higher cluster numbers are 443 also supported by robust SI and Jaccard scores. These finer-grained states may relate to some detectable 444 biological reality -for example, the two states identified within in community group IV segregated almost 445 all women of the black ethnicity (excluding three) and many of the Hispanic group, into the same state. A 446 researcher could thus test the various sets of statistically-supported states emerging from our algorithm to 447 determine which, if any, correlate to their biological feature(s) of interest, then use these state definitions 448 as biomarkers for their study. This also shows that the same data can be interrogated, iteratively and 449 without manual adaptation of the algorithm, to reveal statistically supported novel substructures that could 450 act as higher-granularity biomarkers applicable within that subset of individuals.  to Ananke. This then highlights our distinct goal, which is to discover comparable microbiome states 503 between different individuals, and even if they appear in non-sequential order. An additional distinction is 504 that our approach attempts and evaluates a wide spectrum of different parameter-values in its search for 505 the optimum clusters, while Ananke has coded defaults and/or user-defined parameters (which may be 506 incorrectly assigned by the user). This makes our approach both more objective, as well as amenable to 507 automated environments, such as that used in the MDPbiome (García-Jiménez et al., 2018) study, without 508 modifying the default behavior of the algorithm. 509 TIME (Baksi et al., 2018) is a microbiome time series visualization tool, supporting biologists' attempt 510 to gain insights from microbiome taking time into account. Among its available analyses, TIME allows 511 the user to cluster taxa or time points; however, it is not able to group microbiome samples in a global 512 way, over multiple subjects, as our approach does. Moreover, TIME is specific to longitudinal analyses, 513 where our algorithm is applicable more broadly.

514
NetShift (Kuntal et al., 2019) explores differences between exactly two pre-defined states (e.g., health 515 vs disease), based on differences in their microbe interactions. It takes as input two association networks, 516 and compares their global and local graph properties (density, average path length, degree, etc.). A key 517 difference, therefore, is that we take an objective, data-centric approach to computing microbiome states, 518 which is independent of any external biological observation; moreover, we allow multiple states, as well 519 as a temporal sequence of states, compared to the two pre-defined static states that are input to NetShift. 520 MDSINE (Bucci et al., 2016) analyses microbiome time series data in terms of interaction networks, 521 suggesting if/how yes-no perturbations affect the microbial composition. MDSINE does not define groups 522 of microbiome samples, but rather taxa sub-communities. The key distinction, therefore, is the goal of 523 MDSINE to predict taxa concentration, which is distinct from the definition of a statistically-defined 524 "state" that can behave as a biomarker. Further, MDSINE requires qPCR data as input, in addition to the 525 OTU table that is unique input to our algorithm. Manuscript to be reviewed vaginal microbiome datasets. Other exemplars are available in our Zenodo deposit (see "Code and data 539 availability" section).

540
These diagrams are consistent with the observations made by the original authors of these datasets, 541 showing that, for example, that the chick gut microbiome follows a linear path from adolescence to 542 maturity, and that there are stable 'ground states' in the vaginal microbiome that differ from individual-543 to-individual, and that persist even after brief perturbations . Thus, our algorithm 544 identifies microbiome states consistent with previously published studies, but does so automatically, 545 objectively, and with rigorous statistical support.

547
There are a variety of choices that must be considered when clustering metagenomics data, and in many 548 existing approaches, these decisions may be made subjectively, and/or are not subsequently validated.

549
For example, testing whether clustered data is significant and robust, or selecting clusters based on 550 non-statistical information such as phenotype, where the data interpretation and manipulation is driven by 551 the presumed outcome. After reviewing and analysing many studies that followed distinct approaches 552 to clustering gut microbiome datasets, Costea et al. (2018) concludes that no procedure is obviously 553 preferred over any other, but rather, that the selection of an approach will depend on the experimental 554 conditions and/or the features of the resulting data in each particular case. Manuscript to be reviewed of metagenomics analyses, providing the most statistically-supportable outcome, regardless of the input data, and including null outcomes.

569
Our approach requires certain assumptions. In particular, we presume that even data that varies 570 continuously can be modeled in a discrete manner -that the continual flux in community makeup can 571 nevertheless be resolved to particular states that will appear as statistically significant groupings. While 572 there is disagreement in the microbiome community about whether this assumption mirrors any reality in 573 the microbial population dynamics, our discretization approach is, we believe, a plausible simplification 574 of microbiome dynamics. In fact, many real processes (including many biological ones) are not discrete, 575 although they are simplified as discrete to allow their modeling to be studied with computational and 576 mathematical models (Faust and Raes, 2012;Faust et al., 2015), as we do here by defining microbiome 577 states within a given dataset.

578
The clustering outcomes on the discretized data are then tested and validated using a variety of metrics.

579
PS seems to be the most restrictive score that can be applied to discriminate between a viable partition of 580 microbiome samples, and other non-robust possibilities where the scores fall below its threshold of 0.8 581 (see central columns of Fig. 1). Regarding beta diversity measures, JSD and Morisita-Horn are the metrics 582 that usually resulted in the highest SI values, and this held true even for the dominant/non-dominant 583 filtered data subset, and the aggregated taxa subset.

584
With respect to the utility of the identified states, the approach attempts to maximize the number of 585 statistically supportable clusters, with the aim that the resulting state definitions will have a sufficient 586 granularity to be useful as biomarkers. In this regard, we attempt to avoid singletons or very small clusters 587 (with size < 5), and note that, in this study, Hclust tends to often generate just 2 clusters according 588 to the limit established by the robustness PS score (see central columns of Supplementary Figure S1).

589
The final bootstrapping step to assess cluster stability, measured in Jaccard similarity terms, determines 590 that PAM partitions represent valid and stable clusters, with almost all k values. This contrasts with 591 Hclust, where many k values with different distance measures do not reach the minimum of 0.75 of 592 mean Jaccard similarity. As such, we suggest that Hclust is perhaps not suitable for this task, while an 593 agglomerative approach (e.g., PAM) is more suitable than hierarchical partitioning, given our desired 594 objectives and results. Within the eight datasets used in this manuscript, our approach identified as many 595 as 6 different microbiome states that passed robustness challenges. There is no "gold standard" that can 596 be used to assess these results; nevertheless, the states identified seemed to correlate, in most cases, with 597 the biological knowledge about the samples that could be gleaned from the available sample metadata,

598
suggesting that they represent, as intended, genuine biomarkers of biological relevance.

599
Beyond observations about the algorithm itself, the various result data from our study provide 600 observations that are potentially of-interest. First, the relative importance of dominant and non-dominant 601 taxa with respect to cluster number and robustness is non-trivial. Table 1 shows that removing the top 602 1% dominant taxa ("ND" column) sometimes reduces the overall ability of the algorithm to identify 603 clusters; however, in other cases, additional clusters that pass robustness tests, and seem to correlate 604 with biological annotations, are revealed, indicating that in these cases minor population changes are 605 being masked by the dominant species. Previous studies found that enterotypes could largely be defined 606 based on the differing proportions of a small number of dominant species (Gorvitovskaia et al., 2016), 607 and this is largely consistent with our observations. Nevertheless, the inclusion of non-dominant species 608 ("All" column) affected the clustering results in all cases, compared to the use of dominant taxa alone.

609
Focusing on less abundant taxa to characterize microbiome clusters is not common practice, though to the selection of input taxa, and thus users may decide to do these additional studies if it is suitable for 617 their goals.

618
As with most machine-learning approaches, this approach is sensitive to the sample size, failing when 619 there are too few samples to represent the number of true states, or when particular states are exceedingly 620 rare. In such cases, and depending on the final objective, our results suggest that focusing on dominant