A cross-cohort analysis of dental plaque microbiome in early childhood caries

Summary Early childhood caries (ECC) is a multifactorial disease in which the microbiome plays a significant role in caries progression. Understanding microbiome-level changes in ECC is required to develop diagnostic and preventive strategies. In this study, we combined data from small independent cohorts, compared microbiome composition using a unified pipeline, and applied batch correction to avoid the pitfalls of batch effects. Our meta-analysis identified biomarker species shared across studies. We identified the best-performing machine learning method for classifying ECC versus caries-free samples and assessed its performance using a leave-one-dataset-out approach. Our random forest model was found to be generalizable when used in combination with other studies. While our results highlight microbial species potentially involved in ECC and useful for disease classification, we also discuss limitations that can guide future researchers in designing and using appropriate tools for such analyses.

This dataset was composed of participants 5 to 11 years old. Because ECC is defined only for children younger than 6 years, we selected participants younger than 6 years from this dataset: 20 caries-free and 12 with caries.
The NCBI repository had raw FASTQ reads for only 10 caries-free and 11 ECC participants. These numbers were confirmed with the authors, who stated that only good-quality samples were deposited in the NCBI SRA repository.
For this longitudinal study, the authors divided the children into three groups: H2H, children who remained healthy for the entire duration; H2C, children who were caries-free at the time of recruitment but developed caries later; and C2C, children who had caries at the time of recruitment and continued to have caries for the duration of the study. To mimic case-control data, we selected all H2H samples across time points with a confident healthy status (27 samples); from the caries samples, we selected only those with a dmfs value greater than or equal to 5, which can be considered severe ECC (SECC). Samples that retained fewer than 3000 non-chimeric reads after the DADA2 step in QIIME 2 processing were also removed.
Primers: f-primer GAGTTTGATCCTGGCTCAG; r-primer TACCGCGGCTGCTGGCAC.
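The selection rules described above for the longitudinal study can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the `Sample` record, its field names, and the output labels are hypothetical, and only the two thresholds (dmfs of at least 5, at least 3000 non-chimeric reads) come from the text.

```python
from dataclasses import dataclass

MIN_NONCHIMERIC_READS = 3000  # read-depth threshold stated in the text
SECC_DMFS_CUTOFF = 5          # dmfs >= 5 is treated as severe ECC in the text


@dataclass
class Sample:
    # All field names are hypothetical; they are not taken from the study metadata.
    sample_id: str
    group: str             # "H2H", "H2C", or "C2C"
    has_caries: bool       # caries status at this time point
    dmfs: int              # decayed, missing, filled surfaces score
    nonchimeric_reads: int # reads retained after denoising


def select_case_control(samples):
    """Return (sample_id, label) pairs that pass the selection rules."""
    selected = []
    for s in samples:
        # Drop low-depth samples first, regardless of group.
        if s.nonchimeric_reads < MIN_NONCHIMERIC_READS:
            continue
        if s.group == "H2H" and not s.has_caries:
            selected.append((s.sample_id, "caries-free"))
        elif s.has_caries and s.dmfs >= SECC_DMFS_CUTOFF:
            selected.append((s.sample_id, "SECC"))
    return selected


demo = [
    Sample("s1", "H2H", False, 0, 5200),  # kept as caries-free
    Sample("s2", "H2H", False, 0, 2100),  # dropped: too few reads
    Sample("s3", "C2C", True, 7, 4800),   # kept as SECC
    Sample("s4", "H2C", True, 3, 6000),   # dropped: dmfs below cutoff
    Sample("s5", "H2C", False, 0, 5000),  # dropped: healthy but not H2H
]
selected = select_case_control(demo)
```

Applying the read-depth filter before the group rules keeps the two criteria independent, so either threshold can be changed without touching the other.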

Figure S1: Differentially abundant taxa at the species level, identified with DESeq2 on raw abundances without batch correction, related to Figure 7. Only taxa that were significant in the pooled studies were selected for plotting.

Figure S2: Meta-analysis of differentially abundant taxa using values without batch correction, related to Figure 7. (A) Genus level. (B) Species level. The heatmap shows the log odds ratios of differentially abundant taxa with a significant adjusted p-value (adjusted p < 0.05) in the pooled dataset, along with the odds ratio estimates in each dataset. The forest plot shows the 95% confidence interval of the log odds ratio for each taxon in the pooled dataset.

Figure S3: Random forest performance in terms of AUROC with combined genus- and species-level OTUs for the pooled dataset with batch-corrected values, related to Figure 9.

Figure S4: AUROC values for species-level data with CLR normalization and without batch correction, related to Figure 9. The left panel shows the cross-validation results for each dataset. The middle panel shows model performance in the leave-one-dataset-out (LODO) analysis, in which all datasets except one were used for training and the left-out dataset was used for testing to assess the generalizability of the model. The rightmost column shows the cross-validation performance on the pooled dataset. The y-axis represents the number of top OTUs used for model assessment.
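The LODO scheme described in this caption can be sketched as an index-splitting helper. This is an illustrative sketch only: the function name and the toy dataset labels are hypothetical, and no model is trained here. It shows only how each dataset in turn becomes the test set while all others form the training set.

```python
from collections import defaultdict


def leave_one_dataset_out(dataset_labels):
    """Yield (held_out, train_idx, test_idx) once per distinct dataset.

    dataset_labels[i] names the dataset that sample i belongs to.
    """
    by_dataset = defaultdict(list)
    for i, name in enumerate(dataset_labels):
        by_dataset[name].append(i)

    for held_out in sorted(by_dataset):
        test_idx = by_dataset[held_out]
        # Training set: every sample from every other dataset.
        train_idx = sorted(
            i
            for name, idxs in by_dataset.items()
            if name != held_out
            for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy example: five samples drawn from three hypothetical datasets.
labels = ["A", "A", "B", "C", "B"]
splits = list(leave_one_dataset_out(labels))
```

Each split would then be used to fit a classifier on `train_idx` and compute AUROC on `test_idx`. scikit-learn's `LeaveOneGroupOut` provides an equivalent splitter when that library is available.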

Figure S5: AUROC values for genus-level data with batch-corrected values, related to Figure 9. The left panel shows the cross-validation results for each dataset. The middle panel shows model performance in the leave-one-dataset-out (LODO) analysis, in which all datasets except one were used for training and the left-out dataset was used for testing to assess the generalizability of the model. The rightmost column shows the cross-validation performance on the pooled dataset. The y-axis represents the number of top OTUs used for model assessment.

Figure S6: AUROC values for species-level data with imputation and CLR normalization, related to Figure 9. The left panel shows the cross-validation results for each dataset. The middle panel shows model performance in the leave-one-dataset-out (LODO) analysis, in which all datasets except one were used for training and the left-out dataset was used for testing to assess the generalizability of the model. The rightmost column shows the cross-validation performance on the pooled dataset. The y-axis represents the number of top OTUs used for model assessment.
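For readers unfamiliar with the centered log-ratio (CLR) normalization named in these captions, the transform can be sketched as follows. This is a generic illustration rather than the pipeline used in the study: the `pseudocount` parameter stands in for whatever zero-handling or imputation was actually applied, and its value of 0.5 is an assumption.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    Each value becomes log(x_i) minus the mean of log(x) over all taxa.
    The pseudocount (an assumed value) avoids taking log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Toy sample with three taxa, one of which has a zero count.
example = clr([10, 0, 90])
```

By construction the transformed values of a sample sum to zero, so they express each taxon's abundance relative to the sample's geometric mean rather than as a raw count.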