Assessing the Variation within the Oral Microbiome of Healthy Adults

The human oral cavity is inhabited by a diverse community of microbes, known as the human oral microbiome. These microbes play a role in maintaining both oral and systemic health and, as such, have been proposed to be useful biomarkers of disease. However, to identify these biomarkers, we first need to determine the composition and variation of the healthy oral microbiome. In this report, we investigate the oral microbiome of 1,049 healthy individuals to determine which genera and amplicon sequence variants are commonly found between individual oral microbiomes. We then further investigate how lifestyle, anthropometric, and dietary choices impact overall microbiome composition. Interestingly, the results from this investigation showed that while many features were significantly associated with oral microbiome composition, no single biological factor explained a variation larger than 2%. These results indicate that future work on biomarker detection may be encouraged by the lack of strong confounding factors.

impacted by factors such as sex, BMI, or diet could help identify potential interactions between the oral microbiome, health, and disease.
Here, we report the variation within the healthy oral microbiome by examining 741 samples from nonsmoking healthy individuals living within the Atlantic provinces of Canada. We then validated our results on a smaller subset of individuals (n ϭ 308) from the same cohort (Fig. 1). The bacterial oral microbiome composition of these individuals was investigated through 16S rRNA gene sequencing from saliva samples provided by each participant. Compositions were then compared using 41 different variables, including anthropometric, dietary, and sociodemographic factors (Table 1). In this investigation, we determined which of these factors play a role in shaping the oral microbiome and to what extent these factors can explain the overall oral microbiome composition.

RESULTS
The healthy oral microbiome is stable at the genus level but variable at higher resolutions. We examined the oral microbiome composition of the overall cohort containing 1,049 healthy individuals ( Fig. 1) from Atlantic Canada to understand how anthropometric, sociodemographic, and dietary choices could alter oral microbiome composition. We found that 16 genera were found to have a mean relative abundance greater than 1% ( Fig. 2A), with Veillonella having the largest mean contribution (21.49% Ϯ 0.38%), followed by Neisseria (13.04% Ϯ 0.40%), Streptococcus (11.86% Ϯ 0.26%), and Prevotella 7 (11.55% Ϯ 0.24%).
To characterize the core relative abundance of core genera and amplicon sequence variants (ASVs) within the oral microbiome of these samples, the mean relative abundance of genera/ASVs that were present in greater than a specific percentage of samples was analyzed. Interestingly, we found that at the genus level the oral microbiome is relatively stable, with 11 genera (see Fig. S2A and Table S1 in the supplemental material) present in greater than 99% of all individuals, making up, on average, a total relative abundance of 77.82% (Fig. 2B). However, this was not the case when we examined composition at a higher taxonomic resolution. We then found that only 5.17%, on average, of the total relative abundance of the oral microbiome was made up of 3 ASVs (Fig. S2B) shared between 99% of all participants in the study (Fig. 2C). These ASVs were classified as being in the Granulicatella, Streptococcus, and Gemella genera but could not confidently be assigned to a specific species.
Demographic, anthropometric, and lifestyle choices have small but significant impacts on oral microbiome composition. We examined the relationship of both alpha and beta diversities of the oral microbiome between 41 different variables that   significant associations between any of the 41 variables tested and four different alpha diversity metrics (Faith's phylogenetic diversity, number of ASVs, Shannon diversity, and evenness) after correction for multiple testing using linear models that were adjusted for DNA extraction batch (Data Set S1). We did, however, find 10 variables that were associated with differences in beta diversity as measured by both weighted UniFrac (Fig. 3A) and Bray-Curtis dissimilarity ( Fig. 3B) (q Ͻ 0.1 by permutational multivariate analysis of variance [PERMANOVA]) (Data Set S1). We found two additional variables that were only associated with weighted UniFrac distances and three additional variables that were only associated with Bray-Curtis dissimilarity (q Ͻ 0.1 by PERMANOVA). Redundancy analysis (P ϭ 0.001 by analysis of variance [ANOVA]) revealed that multiple anthropometric measures, such as height, fat-free mass, refined grain servings, sleeping light exposure, and waist-to-hip ratio were associated in similar manners. Furthermore, as expected, increases in all of these features were inversely associated with being female (Fig. 3C). As sex plays an important role in determining the height, fat-free mass, and waist-hip ratio of an individual, we attempted to determine whether sex was confounding our results from these variables. A separate analysis on weighted UniFrac distances controlling for sex indicated that fat-free mass (P ϭ 0.02, R 2 ϭ 0.0039) and waist-hip ratio (P ϭ 0.03, R 2 ϭ 0.0039), but not height (P ϭ 0.44, R 2 ϭ 0.0012), were significantly associated with microbial composition despite differences in sex.
Examining the amount of variation explained by each metadata feature by itself after controlling for DNA extraction showed small effect sizes for both weighted UniFrac distances and Bray-Curtis dissimilarities (R 2 ϭ 0.0030 to 0.009) ( Fig. 3A and B). Of the features that were significant, sleeping light exposure explained the least amount of variation in both weighted UniFrac distances (R 2 ϭ 0.0036) and Bray-Curtis dissimilarity (R 2 ϭ 0.0030). We also found that fat-free mass explained the largest Samples were subsampled to a depth of 5,000 reads. Two different metrics measuring amount of variation in both weighted UniFrac (R 2 ϭ 0.009) and Bray-Curtis dissimilarity (R 2 ϭ 0.006). In general, we found that the rankings of effect sizes between these two different metrics agreed ( Fig. 3A and B).
We also examined random forest machine learning classification and regression performance for each of these significant features. We found that, overall, random forest models performed poorly but did show slight associations between some variables (Fig. S3). For example, the area under the receiver operating curve (AUROC) for sex classification was 0.638, indicating slightly better than random performance. Regression models for features such as height and age showed an R 2 of 0.10 and 0.075 with a root mean standard errors of 8.629 cm and 7.635 years, respectively (Fig. S3). Interestingly, some features performed extremely poorly, such as the number of refined grain servings (R 2 ϭ 8.22EϪ6) or vegetable servings (R 2 ϭ 0.004) (Fig. S3).
Examining each significant factor in our weighted UniFrac analysis using a backward-selected multivariate PERMANOVA, we found that 7.0% of total oral microbiome variation could be explained by a total of 6 significant factors, including DNA extraction batch, despite using the same protocol, equipment, and personnel for each round (Table S2). Interestingly, of these 6 factors, DNA extraction number explained a considerable amount of the variation alone (4.18%) (Table S2). We found similar results examining beta diversity variation using Bray-Curtis dissimilarity with a slightly higher number of significant features and lower total variation explained (5.87%) (Table S3). It should be noted that many features were highly correlated with one another (R Ͼ 0.7), and, as such, model selection for these multivariate PERMANOVAs could have suffered due to the collinearity of these features. However, a model containing all features that were significantly associated with either weighted UniFrac or Bray-Curtis dissimilarity during univariate testing explained a similar level of variation for both weighted UniFrac and Bray-Curtis dissimilarity profiles (8.09% and 6.81%).
Redundancy analysis revealed several potential taxonomic associations with various features (P ϭ 0.001 by ANOVA). For example, results for the genus Megasphaera (label 58, Fig. 3C) are in the same direction as those for increasing fat-free mass, height, waist-hip ratio, and daily refined grain servings but in the opposite direction of being female (Fig. 3C). Another uncultured genus in the Veillonellaceae family (label 63) was similarly grouped. The genus Parvimonas (label 38) is in a direction similar to that of increasing age and being female. Both Lautropia (label 71) and Prevotella 2 (label 15) are associated with increasing vegetable intake, and Neisseria (label 76) is associated with increasing nut/seed servings and decreasing refined grain servings (Fig. 3C). The only genus in the phylum Synergistetes that passed the 10% prevalence filtering was found to be associated with increasing juice servings, BMI, and time since last dental appointment. Overall, we found that phyla tended to cluster together, with Firmicutes and Proteobacteria clustering in opposite directions (Fig. 3C).
To help validate the associations we found between features and weighted UniFrac and Bray-Curtis dissimilarities, we analyzed an additional 308 samples from a smaller subset of the Atlantic Partnership for Tomorrow's Health (PATH) cohort that had not completely answered all 41 variables of interest. We found that associations between both beta diversity metrics (weighted UniFrac and Bray-Curtis dissimilarity) and anthropometric features, such as height, weight, waist-hip ratio, and fat-free mass, were recoverable within our smaller cohort (Table 2 and Fig. S4). We were unable to recover any significant taxonomic dietary associations within this smaller validation cohort. We also were unable to recover taxonomic associations between lifestyle variables, such as sleeping light exposure or the time since an individual's last dental visit. The inability beta diversity were tested, weighted Unifrac distances (A) and Bray-Curtis dissimilarity (B), using a PERMANOVA test while controlling for differences in DNA extraction and correction for false discovery (q Ͻ 0.1). Relationships between significant features, samples, and genera that were present in at least 10% of samples were then visualized by redundancy analysis (RDA) on center-log-ratio genus count tables. (C) Genera are colored by phylum and labeled numerically.
Assessing Variation in the Healthy Oral Microbiome to recover these differences could have been due to the highly reduced sample size within this validation cohort.
The abundance of various oral bacterial genera and ASVs are associated with anthropometric measurements and dietary choices in healthy individuals. We next decided to identify genera that were associated with the 15 features previously identified as being associated with beta diversity in either the weighted UniFrac or Bray-Curtis dissimilarity analysis. We found 42 genera (Fig. 4A) and 42 ASVs (Fig. 4B) that had abundance profiles that were significantly associated with at least one of these features after controlling for DNA extraction. We found that sex, height, and fat-free mass shared similar genera and ASV associations. To control for the possibility of sex confounding our height and fat-free mass associations, we reanalyzed the data controlling for sex. We found that no ASVs or genera were significantly associated with fat-free mass after controlling for sex, and only 3 genera, "Chloroplast," "unclassified Burkholderiaceae," and Treponema 2, were significantly associated with height. Interestingly, two of these three genera were not previously associated with height in our initial analysis. These results suggest that many of these features associated with height or fat-free mass are driven by differences in sex. To test this, we also tested for differences in sex while controlling for both fat-free mass and height. Interestingly, we did not find any significantly associated ASVs and only three significantly associated genera, "Defluvittaleaceae UCG 011," Leptotrichia, and Treponema 2.
We did not find any other features that shared similar patterns of taxonomic associations, but there were multiple genera with multiple feature associations. The genus Prevotella 7 had the highest number of features (5) associated with its relative abundance, including four anthropometric measurements (height, fat-free mass, waist size, waist-hip ratio, and weight) and sex. Interestingly, BMI was not significantly associated with any genera or ASVs despite many other anthropometric measures showing strong taxonomic signals. We were unable to identify any single ASVs associated with waist size and weight but were able to identify a small number of genera, including Prevotella 7, which was related to both, and Mogibacterium, which was associated with waist size. We also found that for some phyla, many taxa with significant associations had the same effect size direction. For example, genera in the Actinobacteria or Proteobacteria phyla tended to be negatively associated with fat-free mass, height, and being male. We also found several genera in the Proteobacteria phylum that were significantly associated with the amount of time since an individual's last dental appointment.
In contrast, examining the ASVs associated with each feature, we found that in a small number of cases ASVs in the same genera had opposite directions of association to the same features. For example, two ASVs classified as uncultured Rothia both were significantly associated with age but in opposite directions, suggesting that lower Differentially abundant genera and ASVs whose abundance profiles are associated with features found to influence oral microbiome composition. Genera (A) and ASVs (B) meeting a false discovery rate of q Ͻ 0.1 using the Corncob R package, which uses beta-binomial regressions. Each feature's false taxonomic resolution is required to identify some associations. Furthermore, we also identified cases were ASVs that were associated with a feature were classified in a genus that was found not to be related to that feature. For example, uncultured Selenomonas ASV-4ca02 was strongly associated with being male, even though this entire collective genus was not (Fig. 4). Further examples include ASV-e2cc4, which was classified in the genus Alysiella and significantly associated with reduced refined grain servings. Examples of the opposite occurrence are also present, with genera such as Mycoplasma being associated with age, but no single ASV for this association could be identified.
We further validated our differential abundance analysis using our smaller validation data set and found 8/17 genera associated with sex, 8/16 genera associated with fat-free mass, 5/15 genera associated with height, and 3/11 genera associated with age were recoverable (Fig. S5A). Additionally, the negative association between Prevotella 2 and waist-hip ratio was also verified within this data set. Furthermore, several associations between ASVs and features such as sex (5/14), height (4/12), fat-free mass (2/3), and sleeping light exposure (1/2) were also found within this smaller validation data set (Fig. S5B). All significant effect sizes that were recovered in the validation data set except for one, between sleeping light exposure and ASV-d4746 Streptococcus, remained in the same direction as the original cohort, indicating relationships that were robust to sample choice.
Predicted microbial pathway abundances reveal multiple pathways associated with anthropometric, dietary, age, and sex features. Microbial pathway abundances were predicted using PICRUSt2 (30) to determine potential associations between pathway abundances and features previously identified to be significantly associated with differences in beta diversity. Differential analysis between features and predicted pathway abundances were done using Corncob with Benjamini-Hochberg-corrected P values at an alpha of 0.05, and associations with effect sizes under |0.05| log odds were filtered out. We found 9/15 features originally associated with beta diversity metrics to have at least one predicted pathway association (Fig. 5). Of these features, we found that refined grain servings had the largest number (N ϭ 33) of predicted pathway associations. Of these associations, many were negatively associated with increasing refined grain intake, including various tricarboxylic acid cycle derivatives, glucose and xylose degradation, 2-methylcitrate cycle, and heme biosynthesis. Furthermore, only a smaller number of pathways were associated with increasing refined grain intake, such as phylloquinol biosynthesis and CMP-legionanimate biosynthesis (Fig. 5).
Only one pathway, aerobic respiration I (cytochrome c), was predicted to be associated with increasing vegetable servings, while six pathways were associated in the opposite direction. These pathways included fermentation of carbohydrates into lactate, lactic acid, ethanol, acetate, and formate as well as the biosynthesis of peptidoglycan. We found only one association with salt usage (L-tryosine degradation) and did not find any predicted associations with juice serving intake.
A number of predicted pathway abundances were also associated with various anthropometric features, with waist-hip ratio having the highest number of predicted associations (N ϭ 15) and fat-free mass having only one predicted association (GDP-Dglycero-␣-D-mannose-heptose biosynthesis). We also found a small number of predicted pathway abundances associated with age (N ϭ 7) and sex (N ϭ 2).
Validation analysis on the second smaller data set was only able to validate a minority of predicted pathway associations, many of which were associated with an individual's waist-to-hip ratio (8/15) (Fig. S6). Only 4 of the original 33 predicted pathway associations with refined grain intake were verified within this cohort. These discovery rate was corrected separately, and each was tested to control for differences in DNA extraction and differential variability within that feature. Ordinal variables were converted into a ranked scale for testing, and all features except for sex were scaled. The asterisk indicates that sex was treated as a categorical value; therefore, the magnitude is not directly comparable to other log odd ratios.

FIG 5
Various predicted pathway abundances are associated with features significantly associated with overall microbiome composition. Pathway abundances were predicted from 16S rRNA gene sequencing data using PICRUSt2. Predicted pathway abundances meeting an FDR of Ͻ0.05 and an effect size of |0.05| log odds were considered significant associations using the Corncob R package. Each feature's false discovery rate was corrected separately, and pathways included L-tyrosine degradation, ADP-L-glycero-␣-D-manno-heptose biosynthesis, phylloquinol biosynthesis, and 1,4-dihydroxy-2-naphthoate biosynthesis. Three out of six pathways associated with height and four out of seven pathways associated with age were also found to be significant within the smaller validation data set (Fig. S6).

DISCUSSION
Our analysis of 1,049 healthy ( Fig. 1) individuals from Atlantic Canada revealed that much of the oral microbiome of Atlantic Canadians was made up of 11 "core" genera that belong to six different phyla (Actinobacteria, Fusobacteria, Proteobacteria, Firmicutes, Bacteroidetes, and Fusobacteria). Interestingly, some of these core genera, found in 99% of all samples, were found in relatively low abundance (Ͻ2% mean abundance), indicating that bacteria within the oral microbiome can be consistently observed with minor contributions (see Table S1 in the supplemental material). In contrast, the composition at the ASV level had only 3 ASVs being present in 99% of samples and only contributing 5.17% of the total oral microbiome composition on average. Overall, these results indicate that individuals tend to share similar genera within the oral cavity, but the species/strains shared between individuals can be highly variable. These findings are in line with previous work from the Human Microbiome Project that found the oral microbiome to be relatively similar between individuals at the genus level (1).
We found that various anthropometric and lifestyle features were significantly associated with overall oral microbiome composition; however, they explained only a small amount of total oral microbiome variance while controlling for DNA extraction batch (5.87 to 7.00%) (Tables S2 and S3). We found that fat-free mass explained the largest amount of variance (0.6 to 0.9%) (Fig. 3A and B) of all biological features. While this feature had many differentially abundant genera and ASVs associated with it, we were unable to recover any of them after controlling for differences in sex. This indicates that these associations could be driven by sex and not underlying fat mass; however, we were also unable to recover many relationships between sex and taxonomic abundance while controlling for fat-free mass, indicating that both of these factors significantly confound the other. However, despite these issues there is previous evidence to suggest that some bacteria are related to differences in body size. A study in children found reduced abundance of Veillonella, Prevotella, Selenomonas, and Streptococcus in obese children (31). Interestingly, in our adult population we found similar trends, with members of the Veillonella family being positively associated with increasing fat-free mass and members of the Prevotella genus also being linked with higher fat-free mass. Another publication on the Southern Community Cohort Study found that both Granulicatella and Gemella were associated with obesity (27), which we also found within our cohort at both the genus and ASV level. One interesting result from our study was our inability to identify any genera or ASVs linked to BMI despite numerous relationships between anthropometric measurements being identified. These results indicate that future studies should include sex and other measurements of body composition, such as lean body mass, when looking at relationships between the microbiome and obesity.
We found two genera, Defluvittaleaceae UCG-011 and an uncultured genus from Veillonellaceae, that were strongly associated with being male (Fig. 4A). However, neither of these associations was recovered in our validation cohort, indicating that they could either be false positives or require a larger sample size to recover due to their low mean relative abundance (0.0042% and 0.063%, respectively). Despite this, we were still able to recover eight genus-level associations in our validation data set (Fig. S5A); however, only a few of these associations match those that were previously each was tested to control for differences in DNA extraction and differential variability within that feature. Ordinal variables were converted into a ranked scale for testing, and all features except for sex were scaled. The asterisk indicates that sex was treated as a categorical value; therefore, the magnitude is not directly comparable to other log odd ratios.
reported. Renson et al. found two genera, Lactobacillus and Actinobacillus, to be higher in males, which we did not find in our study (21). This could have been due to multiple differences, including sampling procedures, systemic protocol bias, or the compositional nature of microbiome data (32). Raju et al. found that there was a high relative abundance of Haemophilus in females, which we also found in our study; however, they also found Oribacterium to be increased in females, which was opposite from what was found in this study (31). Differences between these studies and ours can be attributed in part to differences in sample collection procedure and sequencing primers used, highlighting the technical biases in the field (33).
We were unable to recover any taxonomic relationships between dietary features within our validation data set; however, refined grain servings per day had one of the largest impacts on overall oral microbiome composition and microbial pathway potential in our initial analysis. During this initial analysis, we found that bacteria from four genera, Bergeyella, Parvimonas, Veillonella, and Neisseria, decreased in relative abundance with increasing refined grain intake (Fig. 4). Interestingly, refined grain intake had a strong association with inflammatory bowel disease (IBD) in a previous analysis of this cohort (34), and alterations in the oral microbiome have been linked to IBD in the past (35). Work by Said et al. found multiple genera in differential abundance between individuals with and without IBD, including the increased presence of Veillonella in IBD patients (14), which we found to be linked positively with refined grain intake.
We found that of all features significantly associated with overall oral microbiome composition, refined grain intake had the largest number of predicted pathway associations (Fig. 5). Many of these pathways were related to metabolic functions and the biosynthesis of various cofactors and metabolic building blocks, indicating a shift in metabolic potential within the microbial community. This shift is not surprising given that differing levels of refined grain intake could impact the availability of various carbohydrates to oral microbiota. However, it should be noted that only a small number of these pathway associations were verified within our small validation data set.
Other dietary factors we found linked to overall oral microbiome composition in our original analysis include both juice servings and vegetable servings. However, we were only able to find a small number of genera, ASVs, and predicted pathway abundances linked to vegetable serving intake. We found a number of fermentation pathways were predicted to be associated with reduced vegetable intake, indicating a shift in anaerobic activity. While we found a small number of taxonomic associations with juice serving intake, we found no predicted pathway associations. Furthermore, we were unable to recover any taxonomic or pathway associations for both vegetable intake or juice serving intake in our validation data set, indicating the possibility of a false positive or the requirement of a large sample size to see these effects. Previous work within the field has found conflicting evidence on the role of diet impacting oral microbiome composition and may be reflective of different dietary assessment methods.
Looking at all features that were significantly associated with oral microbiome composition together in a single model, we were only able to explain a small portion of the total variance between samples (5.87 to 7.00%). This indicates that while many of these features are significantly related to microbial composition, each one by itself tends to cause only small shifts in overall microbial composition. Furthermore, a majority of the variance accounted for was due to differences in DNA extraction date. This shows that slight technical variations, such as the time when DNA extraction was done, can have large impacts on sample composition, emphasizing the need to control for these technical variations during large population-based studies.
One large limitation to our study was our lack of detailed dental history information from participants. While we did record how recently each individual last visited the dentist, we were unable to retrieve detailed information on dental health, which has been found to have dramatic impacts on oral microbiome composition (17). Furthermore, our study was also unable to capture potential variance that could have been attributed to the time of sampling. Various studies have shown that oral microbiome composition can vary with regard to collection time due to events such as teeth brushing and eating throughout the day (36). These could explain some of the missing variation that was not accounted for in our study; however, it is unlikely to explain all 93.00%, indicating we are still missing a suitable amount of information on what determines an individual's oral microbiome composition.
In conclusion, our study indicates that the healthy oral microbiome is relatively similar between individuals at the genus level and is impacted very little by any one factor. Future studies that attempt to identify oral microbial biomarkers associated with disease may be encouraged by the lack of major confounding variables and may be justified in controlling only for sex, body composition, oral health, and basic dietary information.

MATERIALS AND METHODS
Study design and population. The current study includes the analysis of saliva samples from the Atlantic Partnership for Tomorrow's Health (PATH) study. Atlantic PATH is part of the Canadian Partnership for Tomorrow's Health (CanPath) project, a pan-Canadian prospective cohort study examining the influence of environmental, genetic, and lifestyle factors on the development of chronic disease (37). The applicable provincial and regional ethics boards approved the study protocol, and all participants provided written informed consent prior to participation. The primary inclusion criteria were that participants were aged 30 to 74 years at the time of recruitment and a resident in one of the Atlantic Canadian provinces (Nova Scotia, New Brunswick, Prince Edward Island, and Newfoundland and Labrador). Recruitment and baseline data for all participating regions were collected between 2000 and 2019. Details on participant recruitment and a descriptive cohort profile have been published elsewhere (37). The questionnaire included sociodemographic information, health information, behaviors, environmental factors, and self-reported anthropometric information. Participants also had anthropometric measures (height, weight, waist and hip circumferences, body composition, blood pressure, grip strength, and resting heart rate) and biological samples (blood, urine, saliva, and toenails) collected. Approximately 9,000 participants in the Atlantic PATH cohort provided a saliva sample. Participants were instructed to refrain from eating, smoking, or chewing gum for at least 30 min prior to oral specimen collection. Oral saliva specimens were collected during normal clinic hours, 9:00 a.m. to 7:00 p.m., after completion of the approximately 1-h interview and registration process. Oral samples (3 ml) were collected in sterile 50-ml conical tubes after rinsing with water. Samples were stored at 4°C and batch shipped on ice to the central processing facility at the QEII Health Sciences Centre in Halifax, Nova Scotia. Samples were processed within 24 h of collection, aliquoted into cryovials, and stored at Ϫ80°C until analysis.
The current analysis includes a total of 1,214 saliva samples from healthy Atlantic Canadians living within the provinces of Nova Scotia, New Brunswick, and Prince Edward Island. Based on self-reported data, participants were defined as healthy if they had not been diagnosed with any of the following conditions: hypertension, myocardial infarction, stroke, asthma, chronic obstructive pulmonary disease, major depression, diabetes, inflammatory bowel disease, irritable bowel syndrome, chronic bronchitis, emphysema, liver cirrhosis, chronic hepatitis, dermatologic disease (psoriasis and eczema), multiple sclerosis, arthritis, lupus, osteoporosis, and cancer. A total of 165 of these samples were removed due to insufficient sequencing depth, and of the remaining 1,049 samples, an additional 308 were removed due to incomplete answering of the 41 variables examined in this study. These 308 samples were then used in validation analysis (details below) to confirm findings within the larger 741-participant data set.
Sociodemographic, lifestyle, and anthropometric variables. Questionnaires were used to collect sociodemographic and lifestyle variables. Self-reported variables included age, sex, education level, household income, rural/urban status, province, dental visits, sleep patterns, alcohol consumption, smoking status, and dietary variables, such as food avoidance, the use of specific types of fat/oil, artificial sweetener usage, the frequency of dessert, soda drinks, soy/fish sauce, salt seasoning, and fast food, as well as servings of vegetables, fruit, juice, whole grains, refined grains, dairy products, eggs, fish, tofu, beans, and nuts/seeds. Anthropometric measures were collected by trained personnel in assessment centers. Waist and hip circumferences were measured using Lufin steel tape. Height was measured by a Seca stadiometer. Height and weight measures were used to calculate body mass index (BMI; weight, in kilograms, divided by height, in meters squared). Body weight, fat mass, and fat-free mass were measured using the Tanita bioelectrical impedance device (Tanita BC-418; Tanita Corporation of America Inc., Arlington Heights, IL). Table 1 lists all variables that were used for analysis.
Oral microbiome 16S rRNA sequencing. Frozen saliva samples were thawed at room temperature and aliquoted into 96-well plates. DNA from samples was then extracted using a QIAamp 96 PowerFecal QIAcube HT kit by following the manufacturer's instructions using a TissueLyser II and the addition of Proteinase K. Sequencing of the 16S rRNA gene was performed by the Integrated Microbiome Resource at Dalhousie University. The V4-V5 region was amplified from extracted DNA in a PCR using 16S rRNA gene V4-V5 fusion primers (515FB-926R) (38) and high-fidelity Phusion polymerase. Amplified DNA concentrations were then normalized and pooled to be sequenced on an Illumina MiSeq. The sequencing of samples was conducted over 6 Illumina MiSeq runs producing 300-bp paired-end reads.
16S rRNA gene sequence processing. Primers were removed from paired-end 300-bp sequences using cutadapt (39). Primer-free reads were then stitched together using the QIIME2 (v. QIIME2-2018.8) (40) VSEARCH (41) join-pairs plugin. Stitched reads were then filtered using the QIIME2 plugin q-score-joined using the default parameters. Quality filtered reads were then input into the QIIME2 plugin Deblur (42) to produce amplicon sequence variants (ASVs). Trim length was 360 bp, and the minimum number of reads required to pass filtering was set to 1. Amplicon sequence variants that were found in an abundance of less than 0.1% of the mean sample depth (18) were then removed from analysis. This is to keep in line with the approximate bleed-through rate on an Illumina MiSeq sequencer. After filtering, a total of 13,248 ASVs were recovered. Representative sequences were then placed into the Greengenes 13_8 99% (43) reference 16S rRNA tree using the QIIME2 (2019.7) fragment insertion SEPP (44,45) plugin. Rarefaction curves were then generated using the QIIME2 alpha-rarefaction plugin, and a suitable rarefaction depth of 5,000 was chosen for diversity analysis based on when the number of newly discovered ASVs came to a plateau (see Fig. S1 in the supplemental material). Representative sequences were then assigned taxonomy using a custom-trained V4-V5 16S rRNA naive Bayesian QIIME2 classifier (46) trained on the 99% Silva V132 database (47).
Oral microbiome composition analysis. Taxonomic composition tables were generated using the QIIME2 taxa plugin and collapsed at the genus level. All samples over 5,000 reads in depth (1,049) were subsampled to a depth of 5,000 reads each, and taxa that contributed less than a mean relative abundance of 1% were grouped together in the "Other" category. The composition stacked bar chart was then generated in R using ggplot2 (48), and the x axis was ordered based on the PC1 weighted Unifrac coordinates of each sample.
Core oral microbiome analysis. Taxonomic tables subsampled previously at 5,000 reads were collapsed at the genus and ASV level using QIIME2. To examine the mean relative abundance explained by genera/ASVs at different prevalence levels, we removed genera/ASVs that were not present in various numbers of samples (5 to 99%). After removal of these genera/ASVs, the remaining total mean relative abundance of all genera/ASVs that passed the filtering parameter was calculated.
Oral microbiome alpha diversity analysis. Alpha diversity metrics were generated using QIIME2 (v2019.7) and the previously generated tree containing both representative sequences and reference sequences. All samples were subsampled to a depth of 5,000 reads. Association between four different alpha diversity metrics (Faith's phylogenetic diversity [PD], Shannon diversity, evenness, and number of ASVs) was then tested using general linear models while controlling for DNA extraction. A base model containing only DNA extraction as a covariate and a testing model containing DNA extraction and the covariate of interest were then compared using an ANOVA, and P values were recorded. Recorded P values were then corrected for false discovery (Benjamini and Hochberg [49]) with a chosen alpha of q Ͻ 0.1.
Oral microbiome beta diversity analysis. Beta diversity metrics were generated using QIIME2 and the previously generated phylogeny. All sequences were subsampled to a depth of 5,000 reads based on the plateauing stage of rarefaction plots (Fig. S1). Associations between two different beta diversity metrics (weighted UniFrac distance and Bray-Curtis dissimilarity) were then tested using a PERMANOVA (adonis2 function in Vegan [50]) while controlling for DNA extraction. Marginal P values were then corrected for false discovery (Benjamini and Hochberg), and an alpha value of q Ͻ 0.1 was chosen. Significant features from univariate analysis were then included in a single multivariate model that underwent backwards covariate selection, where each covariation with the highest P value was removed from the model until all features were found to be significant (P Ͻ 0.05). Additional testing using adonis2 on fat-free mass and height were done while controlling for both sex and DNA extraction. Finally, overall relationships between taxa, metadata, and samples were visualized with a redundancy analysis triplot. This plot was constructed using the rda function within the vegan R package. Within this function, nonrarified center-log-ratio genera count tables were filtered for features with at least 10% prevalence and then used as the response variable within the redundancy analysis (RDA) model. Each feature previously associated with either weighted UniFrac or Bray-Curtis dissimilarity profiles were input as explanatory variables within the RDA model. The significance of the RDA model was checked using the function anova.cca within the vegan R package. Finally, visualization of the resulting RDA model was done with the R package ggord (51) using symmetrical species and site scaling.
Differential abundance analysis. Differential abundance analysis was conducted using the Corncob (52) (v 0.1.0) and Phyloseq (53) R packages. A genus-level taxonomic table was generated using QIIME2 (2019.7), and genera that were not found in at least 10% of samples were removed. The 15 covariates that were found to be significantly associated with either weighted UniFrac or Bray-Curtis dissimilarities were chosen for testing. The testing of each covariate was done using the "differentialtest" function in the Corncob package while controlling for differences in DNA extraction and differential variability across DNA extraction and the covariate of interest. Heatmaps were then constructed containing any genera/ ASVs that were significantly associated with at least one of the covariates that were tested.
Prediction of microbial pathway abundances using Picrust2. Amplicon sequence variant abundance tables were rarified at a depth of 5,000 reads and input into the picrust2_pipeline.py script to generate predicted microbial pathway abundances. MetaCyc pathway identifiers were then mapped to their respective pathway names using the picrust2 add_descripition.py script. Differential abundance analysis of predicted pathway abundances using the R package Corncob was done in the same manner as that previously explained for taxonomic data. Only features that were found to be significantly associated with weighted UniFrac or Bray-Curtis dissimilarities were tested. DNA extraction batch and differential variability within the tested feature were controlled for as previously described, and P values were corrected using Benjamini-Hochberg false discovery correction (49). An alpha value of 0.05 was chosen for corrected P values, and pathways with an effect size lower than |0.05| log odds were filtered out.

Validation analysis.
A total of 308 subjects had not completely answered all 41 metadata variables of interest and, therefore, were removed from the original analysis. This smaller data set was used to test our previous results by removing samples during testing of each covariate that had not answered that question on the questionnaire. Both beta diversity analysis and differential abundance analysis on taxa and pathways were carried out in the same manner as that previously explained. Both beta diversity metrics using PERMANOVA tests and differential abundance analysis using Corncob were done in a univariate fashion while also controlling for DNA extraction batch. Furthermore, only features/taxa that were originally identified as being significantly associated with oral microbiome composition in our initial cohort were tested. As there was previous evidence that these features were associated with that covariate/metric, P values were not corrected for false discovery but an alpha value of 0.05 was chosen. Furthermore, to keep with the original pathway analysis, only pathways that had an effect size of |0.05| log odds in the discovery cohort were tested for differential abundance in the validation cohort.
Random forest model training and validation. Nonrarified ASV abundances were converted into relative abundances and used to train random forest classification and regression models for each feature that was significantly associated with either weighted UniFrac or Bray-Curtis dissimilarities. An optimal mtry parameter was chosen using 3-fold repeated cross validation within the caret R package (54). Trained models for each feature were then validated on the holdout validation data set to determine model performance. Model performance for classification was visualized using the PRROC R package (55), and R 2 performance of regression models was determined using the postResample function within the caret R package.
Data availability and materials. All sequencing data have been uploaded to the European Nucleotide Archive and are available under the accession number PRJEB38175. Code used to analyze all data is available at https://github.com/nearinj/Nearing_et_al_2020_Oral_Microbiome. Deidentified metadata used in this project can be accessed by contacting the Atlantic Partnership for Tomorrow's Health project.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.

ACKNOWLEDGMENTS
This research was conducted using Atlantic PATH data and biosamples under application number 2018-103. We thank the Atlantic PATH participants who donated their time, personal health history, and biological samples to this project. We also thank the Atlantic PATH team members for data collection and management. The data used in this research were made available by the Atlantic Partnership for Tomorrow's Health (Atlantic PATH) study, which is the Atlantic Canada regional component of the Canadian Partnership for Tomorrow's Health Project, funded by the Canadian Partnership Against Cancer and Health Canada. The views expressed here represent the views of the authors and do not necessarily represent the views of Health Canada.