Over 50000 metagenomically assembled draft genomes for the human oral microbiome reveal new taxa

The oral cavity of each person is home for hundreds of bacterial species. While taxa for oral diseases have been well studied using culture-based as well as amplicon sequencing methods, metagenomic and genomic information remain scarce compared to the fecal microbiome. Here we provide metagenomic shotgun data for 3346 oral metagenomics samples, and together with 808 published samples, assemble 56,213 metagenome-assembled genomes (MAGs). 64% of the 3,589 species-level genome bins contained no publicly available genomes, others with only a handful. The resulting genome collection is representative of samples around the world and across physiological conditions, contained many genomes from Candidate phyla radiation (CPR) which lack monoculture, and enabled discovery of new taxa such as a family within the Acholeplasmataceae order. New biomarkers were identified for rheumatoid arthritis or colorectal cancer, which would be more convenient than fecal samples. The large number of metagenomic samples also allowed assembly of many strains from important oral taxa such as Porphyromonas and Neisseria. Predicted functions enrich in drug metabolism and small molecule synthesis. Thus, these data lay down a genomic framework for future inquiries of the human oral microbiome.

transport and metabolism, which are reported active up-regulated in Deinococcus 193 during gamma-irradiation 40 (Supplementary Fig. 2b). 194

Distribution of species and strains 195
The new samples from this study differed in oral microbiome composition 196 compared to published samples across geography/ethnicity (Fig. 5a,b). Both the 197 4D-SZ and Yunnan samples abundantly contained many uSGBs (of the top 10 198 abundance species in cohort) such as Neisseria spp., Porphyromonas spp. and kSGBs 199 such as Haemophilus parainfluenzae and Veillonella denticarios, which were rare in 200 the other cohorts (Fig. 5b). Pregnant samples from the U.S. contained Fannyhessea 201 vaginae (the vaginal pathogen previously known as Atopobium vaginae 41 ), 202 Urinacoccus, etc. that were of much lower abundance in other cohorts (Fig. 5b). 203 Samples from Fiji, although not well mapped (Fig. 2a), showed high levels of a few 204 SGBs that were also seen in the RA study from Beijing, China, including an SGB 205 from Saccharibacteria (TM7) (Fig. 5b). 206 At the strain level, the new samples from the current study greatly expanded 207 the genome collection for common taxa such as Neisseria spp., Porphyromonas spp., 208 and Prevotella spp. (Fig. 5c,d). The numbers of publicly available reference genome 209 for the top ten most abundant species in the genera Porphyromonas and Prevotella 210 were less than 10, and less than 100 for the genus Neisseria. Here we obtained more 211 than 1000 genomes for a few of the species, and increased the diversity in all the 212 species in these genera (Fig. 5c). Most of the species with a large number of genomes 213 showed strain-level variations (subspecies). The Prevotella nanceiensis kSGB, for 214 example, included 3 reference genomes that were similar to a few genomes from 215 developed countries, while our samples contributed two large clusters that were more 216 distantly related (Fig. 5d). 217

New disease markers according to the oral genomes 218
To illustrate the utility of our genome collection in metagenomic studies 219 including MWAS, we reanalyzed dental and salivary microbiome data from RA 220 patients and controls 6 . For better confidence in the markers regardless of cohort, we 221 only analyzed SGBs containing >10 genomes. Similar to the original study, oral 222 markers selected by a 5x 10-fold cross-validated gradient boosting algorithm 223 include a number of Gram-negative bacteria e.g. Haemophilus spp., 224 Aggregatibacter spp. enriched in dental samples from healthy volunteers, while 225 only a Pseudomonas SGB and a Enterococcus SGB were selected for RA samples 226 (Fig. 6a) from Actinomyces 44 , was identified as highly predictive of RA. As Actinomyces are 233 the basis for dental attachment of oral bacteria 45 , potential contribution of 234 Pauljensenia spp. to periodontitis in RA patients remains to be explored; the dental 235 microbiome was obviously deranged, consistent with epidemiology 5 . 236 A set of saliva samples from colorectal cancer and controls from France are 237 also available 46 . Here, we found Pauljensenia spp., to be control-enriched, along with 238 Acinetobacter radioresistens, Lachnoanaerobaculum sp., Catonella sp., etc (Fig. 6b). 239 Streptoccocus thermophilus, a species previously found to be enriched in fecal 240 samples from control or adenoma compared to CRC patients 47 was also identified in 241 control saliva. The markers enriched in CRC oral samples are more unexpected. 242 Besides Porphyromonas spp., Prevotella maculosa, we found a Lachnospiraceae 243 SGB (potentially TMA-producing and consistent with gut results 10, 48-50 ), 244 Capnocytophaga leadbetteri, Cardiobacterium hominis, etc. (Fig. 6b). Thus, the 245 substantially expanded collection of oral microbial genomes enabled discovery of new disease markers and genomic representation of previously reported markers, 247 facilitating the shift from fecal to oral microbiome-based diagnosis and therapeutics. 248

Potential functions in drug metabolism and small molecule synthesis 249
Many human target drugs are reported to be metabolited to its inactive form 250 by gut human microbiome 51, 52 or impact the gut bacteria 51-53 . Gut bacteria genes that 251 metabolite 41 human targeted drugs, 6 non-traditional antibacterial therapeutic and 252 key enzymes experimented validated for 12 human diseases were mapped to our oral 253 SGB genomic contents (Supplementary Tables 6). We show that many oral 254 communities share homologous to these gut bacteria encoding enzymes, suggesting 255 the oral microbiome may also play an importance role in medical therapy and disease 256 development (Fig. 7a) uSGBs, and the BGCs coding for bacteriocin, arylpolyene, type III PKS (polyketide 268 synthase) has appeared more than 500 times on the oral bacterial community (Fig. 7b, 269 Supplementary Table 7). For each specie's genome, the size percentage (mean: interpreted the data, prepared the display items and wrote the manuscript. All authors 308 contributed to finalizing the manuscript. 309

COMPETING FINANCIAL INTERESTS 310
The authors declare no competing financial interest. 311 312

ONLINE METHODS 313
The newly cohort and published datasets used in this study. The 2675(2284 saliva 314 and 391 tongue) oral metagenomics samples from Chinese 4D-shenzhen 315 corhorts (Supplementary Tables 1 sheet 4)    Clustering metagenomic genomes into species-level genome bins. The 56,213 380 reconstructed genomes and 190,309 reference genomes were grouped into species 381 -level genome bins(SGBs) by a two-step clustering strategy as reported previously 16 382 with a slight modification. In the first step, all-versus-all genetic distance matrix 383 between the 246,522 genomes was carried out using Mash version 2.0 65 ("-k 21 -s 1e4" 384 for sketching ). Then, hierarchical clustering with average linkage and 0.05 genetic 385 distance cutoff on the distance matrix by fastcluster 66 was resulted to 33008 clusters. 386 Because the Mash will underestimate the distance between the incomplete genomes 67 387 and split same-species genomes into multiple SGBs, we performed clustering base on 388 average nucleotide identity(ANI) in the second step. First, We divide the SGB into 389 known SGB(kSGB) and unknown SGB(uSGB) according to with or without reference genomes. Then, a representative genome was selected for each SGBs. For the kSGB, 391 the genome which has the largest genome size was selected. For the uSGB, all MAGs 392 were rank by completeness(in decreasing order), contamination(increasing), 393 coverage(decreasing), strain heterogeneity(increasing), N50(decreasing). And 394 representative genome was selected as the one minimizing the sum of the five ranks. 395 We recalculated the more precise genetic distance using pyani v0. Although some kSGBs already have taxonomy label, we still using GTDB-Tk to 426 classify them because GTDB-Tk has its own taxonomy classification system that is 427 different from the NCBI taxonomy database. Then above the genus level, we 428 manually removed the classification tag with a single letter suffix (Supplementary 429    (Supplementary table 1 sheet 2), the total 1089 oral metagenomes samples 447 were mapped to these databases respectively using Bowtie2 v2.3.5 with default 448 parameters. The barplot of mapping rate was generated using R package ggplot2 adjust for potential confounders such as gender, age, BMI (Table S1). BMI is only 491 available for RA. Species relative abundances was asin-sqrt transformed as described 492 before 86 . Non-oral SGBs were excluded. Corrected for multiple hypothesis tests was 493 done using FDR. We predicted disease status using gradient boosting model (GBM) 494 in caret 87 R package, such that 80% of the samples were randomly sampled for each 495 estimator. The depth of the tree at each estimator was not limited, but leaves were 496 restricted to have at least 30 instances. We used 4000 estimators with a learning rate 497 of 0.002. All the FDR <1% oral marker SGBs are included in the model as predictors.
To avoid overfitting, 5 repeat ten folds cross validation ROC was used to measure the 499 model performance. VarImp function was used to extract the GBM importance.