Neonatal gut and respiratory microbiota: coordinated development through time and space

Background Postnatal development of early life microbiota influences immunity, metabolism, neurodevelopment, and infant health. Microbiome development occurs at multiple body sites, with distinct community compositions and functions. Associations between microbiota at multiple sites represent an unexplored influence on the infant microbiome. Here, we examined co-occurrence patterns of gut and respiratory microbiota in pre- and full-term infants over the first year of life, a period critical to neonatal development. Results Gut and respiratory microbiota collected as longitudinal rectal, throat, and nasal samples from 38 pre-term and 44 full-term infants were first clustered into community state types (CSTs) on the basis of their compositional profiles. Multiple methods were used to relate the occurrence of CSTs to temporal microbiota development and measures of infant maturity, including gestational age (GA) at birth, week of life (WOL), and post-menstrual age (PMA). Manifestation of CSTs followed one of three patterns with respect to infant maturity: (1) chronological, with CST occurrence frequency solely a function of post-natal age (WOL), (2) idiosyncratic to maturity at birth, with the interval of CST occurrence dependent on infant post-natal age but the frequency of occurrence dependent on GA at birth, and (3) convergent, in which CSTs appear first in infants of greater maturity at birth, with occurrence frequency in pre-terms converging after a post-natal interval proportional to pre-maturity. The composition of CSTs was highly dissimilar between different body sites, but the CST of any one body site was highly predictive of the CSTs at other body sites. There were significant associations between the abundance of individual taxa at each body site and the CSTs of the other body sites, which persisted after stringent control for the non-linear effects of infant maturity. Canonical correlations exist between the microbiota composition at each pair of body sites, with the strongest correlations between proximal locations. Conclusion These findings suggest that early microbiota is shaped by neonatal innate and adaptive developmental responses. Temporal progression of CST occurrence is influenced by infant maturity at birth and post-natal age. Significant associations of microbiota across body sites reveal distal connections and coordinated development of the infant microbial ecosystem. Electronic supplementary material The online version of this article (10.1186/s40168-018-0566-5) contains supplementary material, which is available to authorized users.


Community state type occurrence across chronological and developmental time 190
We next examined the association of CST occurrence with PMA in the 82 pre-and full-191 term infants. Patterns of temporal progression appear to be generally shared across individuals, 192 with a majority of infants manifesting most community state types in their first year, in a similar 193 sequence and at similar ages. These properties allow us to use CSTs to describe the 194 ecological succession and summarize the development of the infant gut and respiratory 195 microbiota in a conceptually and analytically tractable way. As described below, the occurrence 196 of all throat, nasal and gut CSTs were strongly associated with PMA in both pre-and full-term 197 infants (Supplemental Figure 1). 198 Throat CST and PMA. Throat CST 6, being comprised in large part of Staphylococcus, is 199 the only throat CST not dominated by Streptococcus and is typically the first CST to be 200 observed, especially in pre-term infants. CST 6 quickly gives way to CST 1, which makes up the 201 majority of observations in the first 2-3 months of life and is achieved more quickly but sustained 202 for less time in full-term subjects than pre-terms. CST 2 is the next to emerge, reaching peak 203 abundance in the first six months of life, manifesting earlier and with greater frequency in full-204 term than pre-term subjects and being the most common full-term CST from 42-65 weeks PMA. 205 Pre-term subjects exhibit CST 5 more frequently and persistently than full-term subjects, which 206 peaks in prevalence at ~25 weeks of life. CST 3 appears shortly after birth, several months 207 before CST 4; both at low frequency initially and increasing in prevalence for the duration of the 208 sampling period, accounting for a majority of samples beyond 70 weeks PMA and 209 distinguishable only by lowly abundant and rare taxa. 210 Nasal CST and PMA. Pre-term infants overwhelmingly begin life in nasal CST 1, which 211 is characterized by a high abundance of Staphylococcus and Corynebacterium and observed in 212 fewer than 50% of full-term infants. The prevalence of CST 1 is reduced after 40 weeks and it is 213 completely absent by 55 weeks PMA. CST 4, which is dominated by Streptococcus, is 214 observed in the majority of subjects and appears at nearly all time points, but is most prevalent 215 at 37-60 weeks PMA in both pre-and full-term infants. CSTs 2 and 3, which are dominated by 216 Corynebacterium and Streptococcus, are common at all time points after 40 weeks PMA and 217 comprise the majority of all samples beyond 60 weeks PMA. CST 5, which is characterized by a 218 high abundance of Streptococcus, first appears at 50-60 weeks PMA and is observed more 219 frequently in full-term subjects. CSTs 6 and 7, which are rare and are distinguished by high 220 levels of Moraxella and Staphylococcus, respectively, are sporadically observed after 42 weeks 221 PMA with CST 7 being more common prior to 60 weeks PMA and CST 6 being more common 222 for the remainder of the period of observation. 223 Gut CST and PMA. Gut CST 2 comprised the majority of pre-term samples prior to 40 224 weeks PMA and full-term samples prior to 42 weeks PMA, but quickly gives way to CST 1 in 225 most subjects. CSTs 1 and 3 account for the majority of observations through 60 weeks PMA, 226 with CST 3 being more common in pre-term infants and typically occurring after CST 1. CSTs 227 4 and 5 were prevalent beyond 42 weeks PMA and accounted for the majority of samples by 228 one year in both pre-and full-term infants, with CST 4 being especially common later on. 229 Community state type 6 is observed in a minority of pre-and full-term subjects after 42 weeks 230 PMA and peaks in abundance around one year of post-natal age, with no significant association 231 with GA at birth. 232 Overall, a strong temporal structure and ordered progression of CSTs relative to PMA at 233 all three body sites is apparent. While no single CST is observed in all 82 infants, common 234 patterns of sequential CST occurrence at each body site reveal canonical orderings, with the 235 throat progressing through CSTs 6, 1, 2, 5, 4 and 3; nasal through CSTs 1, 7, 4, 2, 3, 6 and 5; 236 and gut through CSTs 2, 1, 3, 6, 4 and 5. However, pre-and full-term infants are initially 237 colonized by distinct CSTs, with CSTs 1, 2 and 6 overrepresented prior to 40 weeks PMA in the 238 nose, gut and throat, respectively. Furthermore, individual infants matched by PMA transition 239 through CSTs at different rates, which suggest factors other than age regulate microbiota 240

progression. 241
Correlations between community state type and PMA in pre-and full-term infants. 242 In order to elucidate the relationship between time and the progression of community 243 types through each body site we further examined the associations between CSTs and time, 244 which can be measured developmentally as PMA or postnatally with WOL. We first fit 245 smoothed curves of the probability of being in a given CST against WOL and GA at birth 246 ( Figure 3) and the probability of being in a given CST against PMA and GA at birth 247 (Supplemental Figure 2) which revealed several canonical temporal trends in the occurrence 248 of CSTs. The chronological CSTs, a minority of all CSTs identified, showed no substantive 249 difference over the first year of life between pre-and full-term infants (e.g. throat CSTs 3 and 4), 250 11 indicating that the week of life, a proxy for environmental exposure, drives the occurrence of 251 these community types. In contrast, the idiosyncratic CSTs appear to be endemic markers of 252 GA at birth, with persistent differences in the probability of occurrence between pre-term and 253 full-term infants for the first year of life (e.g. nasal CSTs 1 and 5, throat CST 6). Lastly, the 254 convergent CSTs showed increased probabilities of occurrence at earlier post-natal ages in full-255 term infants, but CST occurrence probability in pre-term infants reached parity after a post-natal 256 interval proportional to their prematurity (inversely proportional to their GA at birth). 257 We then constructed a single index model of age for each CST, which fit the probability 258 of observing the CST as a function of some combination of GA at birth and WOL 259 (Supplemental Figure 3A). These single index models confirmed the temporal trends 260 characteristic of the three CST types described above and allowed us to quantify the differential 261 effects of time spent pre-and postnatally, with respect to the probability of manifesting a given 262 CST. These models were consistent with the trends described above, with three basic patterns 263 being observed: CSTs for which GA at birth was not significant, but week of life was 264 (chronological); CSTs for which a single index could not be well fit, indicating that no amount of 265 time since birth could make up for disparities in GA at birth (idiosyncratic); and CSTs for which 266 GA at birth and week of life were both significant, such that pre-term infants could catch up to 267 full-term infants after some period of time (convergent). In this latter category of CSTs, the pre-268 term infants typically catch up at a rate of one week of life per week pre-term (i.e. gut CST 2, 269 throat CST 1, 2 and 5), implicating developmental age (PMA) as the primary temporal correlate 270 (Supplemental Figure 3B). 271

Associations between community types and composition across body sites 272
The tendency for CSTs in each body site to depend on PMA and postnatal age 273 suggested potential relationships between CSTs across body sites. As expected, the co-12 occurrence patterns of CSTs across body sites were significantly non-random, as assessed by 275 a Chi-squared test (p-value << 0.001). To further explore these associations, we calculated the 276 pairwise correlations between CSTs observed at each site ( Figure 4). Again, co-occurrence 277 patterns between sites was highly significant, suggesting that the observation of a given CST at 278 one body site is highly predictive of the CST at other body sites. The greatest degree of CST 279 correlation between body sites was observed among nasal CST 1, gut CST 2, and throat CSTs 280 1 and 6, for which all cross-body site pairs were positively correlated. 281 Given the strong associations at all body sites between CST occurrence and infant 282 developmental and chronological age, correlations across body sites are expected. In order to 283 control for these factors and identify potential associations arising from direct or indirect 284 interactions across sites, we further assessed the associations between the CST of each body 285 site and the microbiota composition of the other body sites using a series of linear regression 286 models. As predictors for the abundance of each taxon in a given body site, we first used mode 287 of delivery, GA at birth, and day of life (which was modeled with a natural spline to allow for non-288 linear effects), as well as a per-subject random effect to account for repeated sampling of the 289 same individuals. We then added additional predictive terms for the CSTs of the other body 290 sites and refit the models. Because we sought to identify the relationships across body sites 291 that could not be explained by infant maturity alone, we called significant only those 292 associations between taxon abundance and remote CST for which inclusion of the remote CSTs 293 as terms in the model significantly improved its explanatory power (see methods). We identified 294 significant associations across all pairs of body sites (Supplemental Tables 1 and 2), with the 295 most significant associations identified between CSTs of the nose and gut and taxa in the 296 throat. Fewer associations were significant between the gut and nose. Within each body site, 297 certain taxa were uniquely associated with the CST of only one of the other body sites, while 298 other taxa exhibited significant associations with the CST of both of the other two body sites. 299 taxa themselves were not observed, ruling out direct exchange of these bacteria as the sole 301 explanation for the associations. These taxa included Bacteroides ovatus, Clostridium 302 perfringens, Actinobaculum, Faecalibacterium sp. (Supplemental Table 2). Instead these 303 associations were consistent with the presence of bacteria in one site impacting, or being 304 impacted by, development of microbiota at another site through indirect physiological or 305 metabolic mechanisms. In order to assess cases where taxa were present in both associated 306 sites, including Viellonella, Prevotella and Dorea, we tested the OTU residuals for correlation 307 after adjusting for PMA with a spline and each subject with a mixed effect. There were 308 approximately fifty shared OTUs between each pair of sites, which on average were positively 309 correlated for each site pair. The strongest correlation between shared OTUs was between the 310 nose and throat, followed by the throat and gut, followed by the nose and gut (Supplemental 311

Figure 4). 312
The associations between each set of CSTs and specific taxa (Supplemental Tables 1 313 and 2) was visualized in two ways. First, as a bipartite graph ( Figure 5A-C) in which each site's 314 most taxonomically specific significant taxa were connected to the CSTs of the distal body sites 315 to which they had significant associations at a FDR of 10%. Edge color indicates the direction 316 and significance of the association, either as a decrease or increase in abundance when the 317 associated remote CST is observed. Second, as a volcano plot ( Figure 5D), which indicates the 318 significance (F-test p-value) and the magnitude of the increase in explanatory power (R 2 ) when 319 the CSTs of distal body sites are added to the regression models that include as covariates 320 mode of delivery, GA at birth, day of life (as a natural spline), and a per subject random effect, 321 and taxon abundance as the outcome. We identified 140 unique taxa with significant cross-322 body site associations; 59 in the gut, 61 in the nose, and 66 in the throat, with some taxa being 323 significant in multiple body sites where they occurred. In the gut, 15 taxa were significantly 324 associated with CSTs in both the throat and nose, 11 taxa in the nose were associated with 325 both gut and throat CSTs, and 23 taxa in the throat were associated with both nose and gut 326 CSTs. Among taxa present in the gut, the largest numbers of associations were identified with 327 nasal CST 1 and throat CST 2, with 38 and 13 taxa respectively, and the single most significant 328 association was between Bacteroides ovatus and throat CST 2 ( Figures 5B and D). Notably, 329 B. ovatus was not identified in throat samples but was present in 1% of nasal samples at a low 330 (<1%) abundance. Among taxa present in the nose, the largest numbers of associations were 331 identified with gut CST 1 and throat CST 2, with 26 and 29 taxa respectively, and the single 332 most significant association was between an OTU of Prevotella and throat CST 2 (Figures 5A 333 and D). Among taxa present in the throat, the largest numbers of associations were identified 334 with nasal CST 1 and gut CST 1, with 22 and 36 taxa respectively, and the single most 335 significant association was between Prevotella pallens and nasal CST 6 ( Figures 5C and D). 336 In the gut, an OTU of Dorea exhibited the most significant associations, with six CSTs from the 337 nose and throat found to be significant. In the nose, an OTU of Veillonella had the most 338 associations, with seven CSTs from the gut and the throat. In the throat, the Lachnospiraceae 339 family and an OTU of Veillonella had the most associations, each with six CSTs from the nose 340 and gut. 341

The microbiome is canonically correlated across body sites following time and space 342
These observations prompted us to assess the extent to which the OTU composition of 343 each body site over time can be explained as a function of the OTU composition of the other 344 body sites, without the dimension reduction associated with using DMMs and CST classification. 345 We again paired microbiome samples from different body sites that were acquired at the same 346 visit for each participant, resulting in nasal-gut, nasal-throat and rectal-throat site pairs. We then 347 assessed the correlation of taxa between body sites directly using canonical correlation analysis 348 (CCA), which transforms two sets of multivariate observations into a series of canonical 349 correlates (weighted averages) that maximize the score cross-correlations, quantifying the 350 extent to which two sets of multivariate observations are correlated. Using a cross-validation 351 scheme that held out blocks of individuals, we found in validation data that the subspace cross-352 correlations varied from 0.75 (nasal-throat) to 0.6 (gut-throat), for the first canonical coordinate 353 (CC) (Supplemental Figure 5). Since we previously established that many OTUs vary as a 354 function of time, we anticipated that temporal variation was responsible for much of this 355 correlation. As expected, after adjusting for time by regressing out PMA in each site with a 14 356 degree-of-freedom spline, the time-stabilized correlations were attenuated in all site pairs, but 357 still significantly different from zero in the nasal-gut and nasal-throat pairs (Supplemental 358

DISCUSSION 360
Development of the infant microbiome landscape is a critical factor in overall infant 361 development and long-term health. In this study, we examined the progression of and 362 spatiotemporal interactions between the gut and respiratory microbiota in pre-and full-term 363 infants. Knowledge of inter-site interactions between these anatomical niches forms the basis 364 for understanding microbiota perturbations in sick infants and potential alterations in the 365 microbiota of other sites. We combined a community state type framework with longitudinal 366 modeling to identify significant associations between microbiota across the nose, throat, and gut 367 during early life development in pre-and full-term infants. We demonstrate that the abundance 368 of specific taxa in one body site exhibited strong associations with the community types of the 369 other body sites. Using a natural spline function to control for time and linear regression to 370 control for GA at birth, mode of delivery, and within subject correlation, we found that 371 incorporating the community types of the other body sites significantly improved the explanatory 372 power of our model for the abundances of 140 unique taxa, many of which were present in only 373 one member of a pair of associated body sites, ruling out the possibility of direct transmission as 374 the mechanism of association. We then performed canonical correlation analysis to validate the 375 explanatory power of the composition of each body site for every other body site. While the 376 effect of time accounted for the majority of the canonical correlation between sites, a significant 377 degree of correlation was observed even after controlling for temporal effects. These infants. The influence of the respiratory microbiota on lung immunity and respiratory diseases in 396 high risk pre-term infants underscores the need to better understand initial microbial 397 colonization and temporal dynamics of respiratory microbiota through longitudinal studies as we 398 describe here. 399 The gut and respiratory tracts in infants share the same embryonic origin, with mucosal 400 surfaces composed of columnar epithelial cells that sense the commensal microbiota and in turn 401 shape local and systemic immunity as infants mature and as a function of PMA [24-26, 35, 37]. 402 Changes in infant gut and respiratory microbiota that occur as a result of diet, antibiotics, 403 therapeutics and environmental exposures in the NICU are likely to influence the microbiota at 404 both sites [32,38,39]. The effect of these changes can be illustrated by antibiotic induced 405 alterations of neonatal gut microbiota during the crucial early postnatal period of immune 406 competence, which increase the risk of developing allergic airway disease and other atopies in 407 subsequent childhood [40][41][42]. In adults, common chronic lung diseases, such as asthma and 408 chronic obstructive pulmonary disease (COPD) often coincide with inflammatory bowel disease 409 (IBD) and other chronic gastrointestinal syndromes [37,43,44]. The occurrence of these 410 chronic lung diseases is accompanied by functional and structural changes in the intestinal 411 mucosa and increased intestinal permeability, suggesting that interactions between these two 412 distal sites through the gut-respiratory axis impact adult health and disease [43,45]. These gut-413 respiratory interactions likely function on several levels, ranging from direct transfer of bacteria 414 between these sites through reflux and aspiration to indirect effects from bacterial metabolic 415 products or mucosal immune responses common to both the gut and respiratory tract [33, 35, 416 37, 46]. Taken together, these observations of common developmental origins for the gut and 417 respiratory tracts as well as inflammatory diseases that affect both sites, support potential 418 systemic mechanisms that coordinate microbiota development at these distal sites in infants. 419 The microbiota samples for our study were collected as gut, nasal and throat swabs from 420 pre-and full-term infants. In a previous study, we established the taxonomic similarity of infant 421 gut microbiota samples collected either as rectal swabs or from fecal material on a diaper [32]. 422 When evaluating respiratory samples for this study and their relatedness to lung microbiota, we 423 first considered potential acquisition routes for respiratory microbiota. The lung microbiota in 424 healthy individuals is acquired by direct mucosal dispersion and micro-aspiration from the upper 425 respiratory tract (URT) [47,48]. The microbiota in these sites are taxonomically similar, albeit 426 with differences within the URT subcompartments (nasal cavity, nasopharynx, oropharynx and 427 trachea) and lungs a result of cellular and physiological features, such as oxygen and carbon 428 dioxide tension, pH, humidity and temperature that distinguish these environments and select 429 for particular taxa [47][48][49]. The nasopharynx and oropharynx are the primary sources of lung 430 microbiota in infants, likely due to the anatomy of the infant URT and increased production of 431 nasal secretions, both of which enhance dispersal of microbiota to the lungs [50,51]. With the 432 infant nasopharynx and oropharynx, a primary source of colonizing infant lung microbiota, the 433 nasal and throat samples used in our study as representative proxies of the neonate lung 434 microbiota identified significant associations of taxa and CSTs within the gut-respiratory axis. 435 The orthogonal variation of the nasal and throat microbiota is noteworthy, suggesting that 436 additional analysis of both sites will provide unique insights into gut-respiratory interactions. 437 In our initial observations of the CST microbiota content relative to PMA, we noted that In our examination of the associations between CSTs at each body site and time as 472 measured developmentally by PMA or chronologically by WOL, we identified several canonical 473 temporal trends in the occurrence of CSTs that were confirmed using single index models 474 (Supplemental Figure 3). Three groups of CST patterns were identified: 1) chronological, for 475 which the only factor associated with manifestation of the community type was WOL, 2) 476 idiosyncratic, for which there was a persistent disparity in occurrence patterns associated with 477 GA at birth and 3) convergent, for which there is a temporal off-set between pre-term and full-478 term infants initially, but an eventual convergence. These results suggested that the assembly their microbiota composition than sites that are farther apart. These findings demonstrate that 490 significant canonical correlations exist between the composition of microbial communities 491 across body sites which cannot be entirely attributed to each body site's independent temporal 492 progression or to the repeated sampling of the same individuals. 493

Conclusion 494
Understanding the variation between and within subjects, conditions, and over time as 495 the manifestation of distinct community types provides an attractive conceptual and analytical 496 framework for studying the microbiome [65]. While the extent to which community types are 497 discrete or simply represent dense locations on a continuous gradient appears to vary 498 depending on the conditions being sampled and the definition of "community type", both the 499 theoretical basis for community types and the utility of a community type-based framework has 500 been established [65][66][67][68][69][70][71][72][73]. Significantly, the underlying network structure observed in microbiota 501 gives rise to stable community types and a community type-based model is appealing in that it 502 is analytically tractable and circumvents many of the complications associated with high 503 dimensional compositional data. Furthermore, the utility of community types can be extended to 504 the identification of associations across body sites. Ding and Schloss used Dirichlet multinomial 505 mixture modeling to define canonical community types within 18 adult body sites independently 506 and then demonstrated that while the community types across different body sites were 507 dissimilar in composition, they were predictive of one another [66]. The occurrence of 508 community types simultaneously in different body sites was highly non-random, suggesting an 509 unknown mechanism of coordination or interaction acting at a distance. However, the 510 community type framework may mask important biological variability and lack power to detect 511 specific taxa that serve as superior phenotypic biomarkers [65]. The approaches taken in our 512 work reported here largely mitigate these shortcomings, by employing a sampling scheme that 513 was dense and evenly distributed over gestational ages at birth, week of life, and modes of 514 delivery, thereby making it unlikely that apparent community clusters are the result of a failure to 515 observe intermediate points along a gradient. Our subsampling procedure to determine the 516 number of clusters yielded a robust and parsimonious description of the data. Community types 517 were not confounded within individuals, but shifted in type over the period of observation for 518 each individual and were seen across a plurality of individuals. In this setting, where we sought 519 to characterize the development of respiratory and gastrointestinal microbiota over the first year 520 of life, community types have provided a high-level description of the state and progression 521 which facilitated the interrogation of associations of CSTs with developmental age (PMA) and 522 post-natal chronological age (WOL). 523 In summary, new clinical strategies for establishing and maintaining a homeostatic 524 microbiota are needed for neonates at risk for gut and respiratory dysfunction and immune 525

deficiencies. A greater understanding of infant respiratory microbiota colonization, interactions 526
between the respiratory and gut microbiota and possible developmental coordination between 527 the two body sites are crucial steps in that direction. Our results demonstrate the existence of a 528 host-wide network of associations between microbiota. The fact that these associations cannot 529 be entirely explained by time, subject, or direct exchange of bacteria suggest unobserved 530 factors mediating microbial dynamics and associations between microbiota across 531 environments and at substantial distances. To our knowledge, these observations directly 532 implicate, for the first time, a body-wide systemic mechanism coordinating the abundance and 533 distribution of microbiota during early life development. The methods employed here may 534 facilitate future efforts to evaluate disease, developmental maturity, therapeutic interventions, 535 and dynamic interactions between multiple microbial communities and host systems. 536

Clinical methods 538
All study procedures were approved by the University of Rochester Medical Center 539 (URMC) Internal Review Board (IRB) (Protocol # RPRC00045470) and all subject's caregivers 540 provided informed consent. We sampled 1,079 gut (279 from NICU and 800 post-discharge), 541 1,013 nasal and (262 from NICU and 751 post-discharge) and 538 throat (172 from NICU and 542 366 post-discharge) microbiota samples longitudinally from 38 pre-term and 44 full-term infants. 543 The Infants included in the study were from the University of Rochester Respiratory Pathogens 544

Research Center PRISM study and cared for in the URMC Golisano Children's Hospital Gosnell 545
Family Neonatal Intensive Care Unit (NICU) or in the normal newborn nurseries and birthing 546 centers. Fecal (rectal), nasal, and throat material was collected from pre-term infants from 23 to 547 37 weeks GA at birth (GAB) weekly until hospital discharge and then monthly through one year 548 of age, adjusted for prematurity. Rectal, nasal and throat samples were collected from full-term 549 infants at enrollment and monthly through one year. Each fecal sample was collected by 550 inserting a sterile, normal saline moistened, CopanÒ flocked nylon swab (Copan Diagnostics, 551 Murrieta, CA) beyond the sphincter into the rectum and twirling prior to removal. Nasal and 552 throat samples were similarly collected by inserting and twirling a sterile, moistened swab into 553 the throat or anterior nostril. Each swab was then immediately placed into sterile buffered saline 554 and stored at 4 o C for no more than 4 hours. Samples were processed daily, which involved 555 extraction of the fecal, nasal and throat material from the swabs in a sterile environment and 556 immediately frozen at -80 o C until DNA extraction. All sampling swabs, plasticware, buffers and 557 reagents used for sample collection and extraction of nucleic acids were sterile and UV-558 irradiated to insure no contamination from sources outside of the infant. Raw data from the Illumina MiSeq was first converted into FASTQ format 2x300 paired 575 end sequence files using the bcl2fastq program, version 1.8.4, provided by Illumina. Format 576 conversion was performed without de-multiplexing and the EAMMS algorithm was disabled. All 577 other settings were default. Sequence processing and initial microbial composition analysis 578 were performed with the Quantitative Insights into Microbial Ecology (QIIME) software package 579 [75], version 1.9.1. Reads were multiplexed using a configuration described previously [74]. 580 Briefly, for both reads in a pair, the first 12 bases were a barcode, which was followed by a 581 primer, then a heterogeneity spacer, and then the target 16S rRNA sequence. Using a custom 582 Python script, the barcodes from each read pair were removed, concatenated together, and 583 stored in a separate file. Read pairs were assembled using fastq-join from the ea-utils package, 584 requiring at least 40 bases of overlap for V3V4 sequences and 20 bases of overlap for V1V3 585 sequence, while allowing a maximum of 10% mismatched bases. Read pairs that could not be 586 assembled were discarded. The concatenated barcode sequences were prepended to the 587 corresponding assembled reads, and the resulting sequences were converted from FASTQ to 588 FASTA and QUAL files for QIIME analysis. Barcodes, forward primer, spacer, and reverse 589 primer sequences were removed during de-multiplexing. Reads containing more than four 590 mismatches to the known primer sequences or more than three mismatches to all barcode 591 sequences were excluded from subsequent processing and analysis. Assembled reads were 592 truncated at the beginning of the first 30 base window with a mean Phred quality score of less 593 than 20 or at the first ambiguous base, whichever came first. Resulting sequences shorter than 594 300 bases or containing a homopolymer longer than six bases were discarded. Operational 595 taxonomic units (OTU) were picked using the reference-based USEARCH (version 5.2) [76] 596 pipeline in QIIME, using the May 2013 release of the GreenGenes 99% OTU database as a 597 closed reference [77,78]. An indexed word length of 128 and otherwise default parameters 598 were used with USEARCH. Chimera detection was performed de novo with UCHIME, using 599 default parameters [76]. OTU clusters with less than four sequences were removed, and 600 representative sequences used to make taxonomic assignments for each cluster were selected 601 on the basis of abundance. The RDP Naïve Bayesian Classifier was used for taxonomic 602 classification with the GreenGenes reference database, using a minimum confidence threshold 603 of .85 and otherwise default parameters [79]. 604

CST inference with Dirichlet Multinomial Modeling (DMM). 605
The DMM model was fit using the R package DirichletMultinomial version 1.16.0, R 606 version 3.3.3. Sample composition was represented using normalized counts for each of the 607 most specific operational taxonomic units (OTUs) present in at least 5% of the samples from a 608 given body site. Normalization was performed on a per sample basis by taking the relative 609 abundance of each OTU (after removing OTUs present in less than 5% of samples) and 610 multiplying by 5,000. Resulting non-integer counts were rounded down. In the DMM model, the 611 number of Dirichlet components is a tuning parameter. For each body site, we used 10-fold 612 random subsampling of 80% of the samples to assess uncertainty in model fit for one through 613 ten components, with model fit being assessed as the Laplace approximation of the negative-614 log model evidence. We selected the number of components at each body site corresponding to 615 the lower bound on the standard error of the model fit. We then fit complete models for each 616 body site using all samples and the number of components selected, and used the resulting 617 posterior probabilities to assign each sample to a community state type (CST) corresponding to 618 a Dirichlet component. The CSTs observed in each subject and at each body site over time are 619 represented in Figure 2, which was plotted using the TraMineR package, version 2.0-7. Color 620 changes occur midway between consecutive samples of differing CSTs. Observation time 621 points were quantized for plotting purposes only, and this was done by rounding down to the 622 nearest whole week of post menstrual age. 623

Generalized Additive modeling of CST 624
Generalized additive models (GAMs) were fit using R package mgcv for each CST and 625 site. The probability of being in a CST was modeled (on the linear probability scale) as a 626 smooth function of week-of-life and GA at birth, and a random effect for each individual. 627 Formally, for each CST, for individual i at time t, we fit 628 CST %& = WOL %& , gaBirth %& + participant % + error %& , (1) 629 where WOL %& , gaBirth %& is a smooth function of week of life and GA at birth, participant % is a 630 random intercept for each participant and error %& represents independent, homoscedastic noise. 631 We plotted the fitted CST probability, under model (1) over a range of weeks-of-life for several 632 representative gestational ages, and then compared this estimate to a single index model. 633 The single index model restricts the smooth function f in model (1) to seek a common 634 "time" variable that accounts for both time spent inside, and outside the womb. Symbolically, 635 we require WOL %& , gaBirth %& = • WOL %& + • gaBirth %& ). For instance, if a=1 and b=0 636 then GA has no effect on the CST trajectory, and when a=b then time spent inside the womb 637 has the same effect on the probability of belonging to a CST as time spent outside the womb. 638

CST and Taxa regression 639
We paired microbiome samples from different body sites that were acquired at the same 640 visit for each participant. This generated 3 pairs of sites. The Nasal-Rectal sites had the 641 greatest number of matched sample pairs, with 82 participants having 951 pairs of samples, 642 while the Nasal-Throat sites had the fewest, with 40 participants having 483 sample pairs. The 643 Rectal-Throat sites had 491 sample pairs. We applied arcsin sqrt-tranformation to stabilize the 644 variance of relative abundance and then fit linear mixed effects models to the abundance using the CST 645 of the other two sites as the primary variables of interest. We adjusted as potential confounders which the sample was taken and the surrounding period of time is colored according to the CST 724 of the sample, with color changes occurring at the midpoint between consecutive samples in 725 which different CSTs were observed. For each subject, the black region on the left indicates the 726 period prior to birth and the white region on the right indicates the period after the last sample 727 was taken. In all three body sites, strong temporal structure and ordered patterns of CST 728 progression are evident. For example, CSTs 1, 2, and 6 are overrepresented during the period 729 prior to 40 weeks PMA in the nose, gut, and throat, respectively.