Machine learning classification by fitting amplicon sequences to existing OTUs

ABSTRACT The ability to use 16S rRNA gene sequence data to train machine learning classification models offers the opportunity to diagnose patients based on the composition of their microbiome. In some applications, the taxonomic resolution that provides the best models may require the use of de novo operational taxonomic units (OTUs) whose composition changes when new data are added. We previously developed a new reference-based approach, OptiFit, that fits new sequence data to existing de novo OTUs without changing the composition of the original OTUs. While OptiFit produces OTUs that are as high quality as de novo OTUs, it is unclear whether this method for fitting new sequence data into existing OTUs will impact the performance of classification models relative to models trained and tested only using de novo OTUs. We used OptiFit to cluster sequences into existing OTUs and evaluated model performance in classifying a dataset containing samples from patients with and without colonic screen relevant neoplasia (SRN). We compared the performance of this model to standard methods including de novo and database-reference-based clustering. We found that using OptiFit performed as well as or better than these methods in classifying SRNs. OptiFit can streamline the process of classifying new samples by avoiding the need to retrain models using reclustered sequences.
IMPORTANCE There is great potential for using microbiome data to aid in diagnosis. A challenge with de novo operational taxonomic unit (OTU)-based classification models is that 16S rRNA gene sequences are often assigned to OTUs based on similarity to other sequences in the dataset. If data are generated from new patients, the old and new sequences must be reclustered to OTUs and the classification model retrained. Yet there is a desire to have a single, validated model that can be widely deployed. To overcome this obstacle, we applied the OptiFit clustering algorithm to fit new sequence data to existing OTUs, allowing for reuse of the model. A random forest model implemented using OptiFit performed as well as the traditional reassign and retrain approach. This result shows that it is possible to train and apply machine learning models based on OTU relative abundance data that do not require retraining or the use of a reference database.

There is increasing interest in training machine learning models to diagnose diseases such as Crohn's disease and colorectal cancer using the relative abundance of clusters of similar 16S rRNA gene sequences (1,2). These models have been used to identify sequence clusters that are important for distinguishing between individuals from different disease categories (3). There is also an opportunity to train models and apply them to classify samples from new individuals. For example, a model for colorectal cancer could be trained, "locked down," and applied to samples from new patients.
To apply these models to new samples, the composition of the clusters would need to be independent of the new data. For example, amplicon sequence variants (ASVs) are defined without consideration of sequences in other samples, phylotypes are defined by clustering sequences that have the same taxonomy (e.g., to the same family) when classified using a taxonomy database, and closed reference operational taxonomic units are defined by mapping sequences to a collection of reference OTUs. In contrast, de novo approaches cluster sequences based on their similarity to other sequences in the dataset and can change when new data are added. Although it would be preferable to select an approach that generates stable clusters, there may be cases where OTUs generated by a de novo approach outperform those of the other taxonomic levels. In fact, we recently trained machine learning models for classifying patients with and without screen relevant neoplasias (SRNs) in their colons and found that OTUs generated de novo using the OptiClust algorithm performed better than those generated using ASVs or at higher taxonomic levels (4).
It could be possible to construct reference OTUs and map new sequences to those OTUs to attain performance similar to that seen with the OptiClust-generated OTUs. The traditional approach to reference-based clustering of sequences to OTUs has multiple drawbacks and does not produce clusters as good as those generated using OptiClust (5). Sovacool et al. recently described OptiFit, a method for fitting new sequence data into existing OTUs that overcomes the limitations of traditional reference-based clustering (5). OptiFit allows researchers to fit new data into existing OTUs defined from the same dataset, resulting in clusters that are as good as if they had all been clustered with OptiClust. We tested whether OptiClust-generated OTUs could be used to train models that were then used to classify held-out samples after clustering their sequences to the model's OTUs using OptiFit.
To test how the model performance compared between using de novo and reference-based clustering approaches, we used a publicly available dataset of 16S rRNA gene sequences from stool samples of healthy subjects (n = 226) as well as subjects with screen relevant neoplasia consisting of advanced adenoma and carcinoma (n = 229) (1). For the de novo workflows, the 16S rRNA sequence data from all samples were clustered into OTUs using the OptiClust algorithm in mothur (6) and the VSEARCH algorithm used in QIIME2 (7,8). For both algorithms, the resulting abundance data were then split into training and testing sets, where the training set was used to tune hyperparameters and ultimately train and select the model. The model was applied to the testing set, and the performance was evaluated (Fig. 1A). For traditional reference-based clustering (database-reference-based), we used OptiFit to fit the sequence data into OTUs based on the commonly used Greengenes reference database. To compare with another commonly used method, we also used VSEARCH to map sequences to reference OTUs from the Greengenes database with the parameters used by QIIME2. We used the Greengenes database since it has reference OTUs for use with VSEARCH and prior analysis demonstrated that closed-reference OTU clustering using the Greengenes reference produced higher quality OTUs and recruited a higher fraction of reads than the SILVA or RDP references (5). Again, the data were then split into training and testing sets, hyperparameters tuned, and performance evaluated on the testing set (Fig. 1B). In the OptiFit self-reference workflow (self-reference-based), the data were split into a training and a testing set. The training set was clustered into OTUs and used to train a classification model. The OptiFit algorithm was used to fit sequence data of samples not part of the training data into the training OTUs, and these samples were classified using the best hyperparameters (Fig. 1C). For each of the workflows, the process was repeated for 100 random splits of the data to account for variation caused by the choice of the random number generator seed.
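The repeated-split design described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `repeated_splits`, the 80/20 training fraction, and the use of seeds 1 to 100 are all assumptions, since the text states only that 100 random splits were made.

```python
import random

def repeated_splits(sample_ids, n_splits=100, train_frac=0.8):
    """Yield (seed, train_ids, test_ids) for seeded random partitions.

    train_frac=0.8 is an assumed value; the text does not give the proportion.
    """
    ids = sorted(sample_ids)  # fixed starting order so each seed is reproducible
    n_train = round(len(ids) * train_frac)
    for seed in range(1, n_splits + 1):
        rng = random.Random(seed)  # independent generator per split
        shuffled = ids[:]
        rng.shuffle(shuffled)
        yield seed, shuffled[:n_train], shuffled[n_train:]

# Hypothetical sample identifiers, for illustration only.
samples = [f"sample_{i}" for i in range(20)]
splits = list(repeated_splits(samples))
```

In the self-reference workflow, each training partition would be clustered de novo and each test partition fit to those OTUs, whereas in the other workflows the partitions would be drawn from abundance data that had already been clustered once.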
We first examined the quality of the resulting OTU clusters from each method using the Matthews correlation coefficient (MCC). MCC is an objective metric used to measure OTU cluster quality based on the similarity of all pairs of sequences and whether they are appropriately clustered or not (9). We expected the MCC scores produced by the OptiFit workflow to be similar to those of de novo clustering using the OptiClust algorithm. In the OptiFit workflow, the test data were fit to the clustered training data for each of the 100 data splits, resulting in an MCC score for each split of the data. In the remaining workflows, the data were only clustered once and then split into the training and testing sets, resulting in a single MCC score for each method. Indeed, the MCC scores were similar between the OptiClust de novo (MCC = 0.884) and OptiFit self-reference workflows (average MCC = 0.879, standard deviation = 0.002). Consistent with prior findings, the reference-based methods produced lower MCC scores (OptiFit Greengenes MCC = 0.786; VSEARCH Greengenes MCC = 0.531) than the de novo methods (OptiClust de novo MCC = 0.884; VSEARCH de novo MCC = 0.641) (5). Another metric we examined for the OptiFit workflow was the fraction of sequences from the test set that mapped to the reference OTUs. Since sequences that did not map to reference OTUs were eliminated, if a high percentage of reads did not map to an OTU we expected this loss of data to negatively impact classification performance. We found that loss of data was not an issue since on average 99.8% (standard deviation = 0.7%) of sequences in the subsampled test set mapped to the reference OTUs. This number is higher than the average fraction of reads mapped in the OptiFit Greengenes workflow (mean = 96.8% and standard deviation = 3.5%). These results indicate that the OptiFit self-reference method performed as well as the OptiClust de novo method and is better than using an external database.
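As an illustration of how a pairwise MCC of this kind can be computed, the sketch below counts each pair of sequences as a true positive (similar and in the same OTU), true negative (dissimilar and in different OTUs), false positive (dissimilar but in the same OTU), or false negative (similar but in different OTUs). The function name, the toy distances, and the 0.03 distance threshold are assumptions made for the example and are not taken from the text.

```python
from itertools import combinations
from math import sqrt

def pairwise_mcc(otu_of, distance, threshold=0.03):
    """MCC over all sequence pairs; threshold is an assumed distance cutoff."""
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(otu_of), 2):
        similar = distance[frozenset((a, b))] <= threshold
        together = otu_of[a] == otu_of[b]
        if similar and together:
            tp += 1
        elif not similar and not together:
            tn += 1
        elif together:  # dissimilar pair placed in the same OTU
            fp += 1
        else:           # similar pair split across OTUs
            fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy distances: s1/s2 and s3/s4 are close; every other pair is distant.
names = ["s1", "s2", "s3", "s4"]
distance = {frozenset(p): 0.10 for p in combinations(names, 2)}
distance[frozenset(("s1", "s2"))] = 0.01
distance[frozenset(("s3", "s4"))] = 0.02

good = pairwise_mcc({"s1": "A", "s2": "A", "s3": "B", "s4": "B"}, distance)
poor = pairwise_mcc({"s1": "A", "s2": "A", "s3": "A", "s4": "B"}, distance)
```

Here the first assignment groups exactly the similar pairs and scores 1.0, while the second places a distant sequence with s1 and s2 and separates s3 from s4, which lowers the score to 0.0.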
We next assessed model performance using OTU relative abundances from the training data from the workflows to train a model to predict SRNs and used the model on the held-out data. Using the predicted and actual diagnosis classification, we calculated the area under the receiver operating characteristic curve (AUROC) for each data split. During cross-validation (CV) training, the performance of the OptiFit self-reference and OptiClust de novo models was not significantly different (P-value = 0.066; Fig. 2A), while performance for both VSEARCH methods was significantly lower than the OptiClust de novo, OptiFit self-reference, and OptiFit Greengenes methods (P-values < 0.05). The trained model was then applied to the test data, classifying samples as either control or SRN. The VSEARCH Greengenes method performed slightly worse than the OptiClust de novo method (P-value = 0.030). However, the performance on the test data for the OptiClust de novo, OptiFit Greengenes, OptiFit self-reference, and VSEARCH de novo approaches was not significantly different (P-values > 0.05; Fig. 2B and C). These results indicate that new data could be fit to existing OTU clusters using OptiFit without impacting model performance.
Random forest machine learning models trained using OptiClust-generated OTUs and tested using OptiFit-generated OTUs performed as well as a model trained using entirely de novo OTU assignments. A potential problem with reference-based clustering methods is that sequences that do not map to the reference OTUs are discarded, resulting in a possible loss of information. However, we demonstrated that the training samples represented the most important OTUs for classifying samples. Missing important OTUs is more of a risk when using a database-reference-based method since not all environments are well represented in public databases. Despite this and the lower quality OTUs, the database-reference-based approach performed as well as the models generated using OptiFit. This likely indicates that the sequences that were important to the model were well characterized by the Greengenes reference OTUs. However, a less well-studied system may not be as well characterized by a reference database, which would make the ability to utilize one's own data as a reference an exciting possibility. Our results highlight that OptiFit overcomes a significant limitation with machine learning models trained using de novo OTUs. This is an important result for those applications where models trained using de novo OTUs outperform models generated using methods that produce clusters that do not depend on which sequences are included in the dataset.
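For readers unfamiliar with the AUROC used to compare workflows above, the following sketch computes it as the probability that a randomly chosen SRN sample receives a higher predicted score than a randomly chosen control, with ties counted as one half. This rank-based formulation is a standard equivalent of the area under the ROC curve; it is offered as an illustration and is not the implementation used in the study, and the function name and toy scores are hypothetical.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive (1) and negative (0) scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 if tied.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions: 1 = SRN, 0 = control.
example = auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

In this toy case three of the four SRN/control pairs are ordered correctly, so the AUROC is 0.75; a value of 0.5 would correspond to a model that ranks samples no better than chance.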

Dataset
Raw 16S rRNA gene sequence data from the V4 region were previously generated from human stool samples. Sequences were downloaded from the NCBI Sequence Read Archive (accession no. SRP062005) (1). This dataset contains stool samples from 490 subjects. For this analysis, samples from subjects identified in the metadata as normal, high risk normal, or adenoma were categorized as "normal," while samples from subjects identified as advanced adenoma or carcinoma were categorized as "screen relevant neoplasia." The resulting dataset consisted of 261 normal samples and 229 SRN samples.
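The relabeling described above amounts to a simple mapping from metadata diagnosis to a binary class, sketched below. The exact strings used in the study's metadata file are not given in the text, so the labels here (and the function name `to_class`) are assumptions based on the category names in the preceding paragraph.

```python
from collections import Counter

# Assumed metadata labels, taken from the category names in the text.
NORMAL_LABELS = {"normal", "high risk normal", "adenoma"}
SRN_LABELS = {"advanced adenoma", "carcinoma"}

def to_class(diagnosis):
    """Collapse a metadata diagnosis into 'normal' or 'srn'."""
    key = diagnosis.strip().lower()
    if key in SRN_LABELS:
        return "srn"
    if key in NORMAL_LABELS:
        return "normal"
    raise ValueError(f"unrecognized diagnosis: {diagnosis!r}")

# Hypothetical metadata rows, for illustration only.
diagnoses = ["Normal", "High Risk Normal", "Adenoma", "Advanced Adenoma", "Carcinoma"]
counts = Counter(to_class(d) for d in diagnoses)
```

Raising an error on unrecognized labels, rather than silently assigning a default class, is a deliberate choice in this sketch so that unexpected metadata values are noticed.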

Machine learning
A random forest model was trained with the R package mikropml (v 1.2.0) (14) to predict the diagnosis (SRN or normal) for the samples in the test set for each data split. The training set was preprocessed to normalize OTU counts (scale and center), collapse correlated OTUs, and remove OTUs with zero variance. The preprocessing from the training set was then applied to the test set. Any OTUs in the test set that were not in the training set were removed. P-values comparing model performance were calculated as previously described (15). The averaged receiver operating characteristic (ROC) curves were plotted by taking the average and standard deviation of the sensitivity at each specificity value.
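The key point of the preprocessing described above is that all parameters are learned from the training set and then reused unchanged on the test set. The sketch below illustrates this for centering, scaling, zero-variance removal, and dropping of test-only OTUs. It is written in Python for illustration and does not reproduce the mikropml implementation; the function names are hypothetical, the collapsing of correlated OTUs is omitted, and treating a training OTU that is absent from a test sample as zero abundance is an assumption.

```python
from statistics import mean, stdev

def fit_preprocessing(train_samples):
    """Learn per-OTU (mean, sd) from training samples; drop zero-variance OTUs."""
    otus = sorted({otu for sample in train_samples for otu in sample})
    params = {}
    for otu in otus:
        values = [sample.get(otu, 0.0) for sample in train_samples]
        sd = stdev(values)
        if sd > 0:  # zero-variance OTUs carry no information and are removed
            params[otu] = (mean(values), sd)
    return params

def apply_preprocessing(samples, params):
    """Center and scale with training parameters; test-only OTUs are dropped."""
    return [
        {otu: (sample.get(otu, 0.0) - mu) / sd for otu, (mu, sd) in params.items()}
        for sample in samples
    ]

# Toy relative abundances: Otu2 is constant in training, Otu3 appears only in test.
train = [
    {"Otu1": 0.2, "Otu2": 0.5},
    {"Otu1": 0.4, "Otu2": 0.5},
    {"Otu1": 0.6, "Otu2": 0.5},
]
test = [{"Otu1": 0.8, "Otu3": 0.1}]

params = fit_preprocessing(train)
test_scaled = apply_preprocessing(test, params)
```

Because the test sample is scaled with the training mean and standard deviation, no information from the held-out data leaks into the fitted model.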

FIG 1 Overview of clustering workflows. The de novo and database-reference-based workflows were conducted using two approaches: OptiClust with mothur and VSEARCH as used in the QIIME pipeline.

FIG 2 Model performance of the OptiFit self-reference workflow was as good as or better than that of other methods. (A) Area under the receiver operating characteristic curve (AUROC) during cross-validation (train) for the various workflows. (B) AUROC on the test data for the various workflows. The mean and standard deviation of the AUROC are represented by the black dot and whiskers in panels A and B. The mean AUROC is printed below the points. (C) Averaged receiver operating characteristic (ROC) curves. Lines represent the average true positive rate for the range of false positive rates.