Conditional Forest Models Built Using Metagenomic Data Accurately Predicted Salmonella Contamination in Northeastern Streams

Understanding the associations between surface water microbiome composition and the presence of foodborne pathogens, such as Salmonella, can facilitate the identification of novel indicators of Salmonella contamination. This study assessed the utility of microbiome data and three machine learning algorithms for predicting Salmonella contamination of Northeastern streams. ABSTRACT The use of water contaminated with Salmonella for produce production contributes to foodborne disease burden. To reduce human health risks, there is a need for novel, targeted approaches for assessing the pathogen status of agricultural water. We investigated the utility of water microbiome data for predicting Salmonella contamination of streams used to source water for produce production. Grab samples were collected from 60 New York streams in 2018 and tested for Salmonella. Separately, DNA was extracted from the samples and used for Illumina shotgun metagenomic sequencing. Reads were trimmed and used to assign taxonomy with Kraken2. Conditional forest (CF), regularized random forest (RRF), and support vector machine (SVM) models were implemented to predict Salmonella contamination. Model performance was assessed using 10-fold cross-validation repeated 10 times to quantify area under the curve (AUC) and Kappa score. CF models outperformed the other two algorithms based on AUC (0.86, CF; 0.81, RRF; 0.65, SVM) and Kappa score (0.53, CF; 0.41, RRF; 0.12, SVM). The taxa that were most informative for accurately predicting Salmonella contamination based on CF were compared to taxa identified by ALDEx2 as being differentially abundant between Salmonella-positive and -negative samples. CF and differential abundance tests both identified Aeromonas salmonicida (variable importance [VI] = 0.012) and Aeromonas sp. strain CA23 (VI = 0.025) as the two most informative taxa for predicting Salmonella contamination. Our findings suggest that microbiome-based models may provide an alternative to or complement existing water monitoring strategies. Similarly, the informative taxa identified in this study warrant further investigation as potential indicators of Salmonella contamination of agricultural water. IMPORTANCE Understanding the associations between surface water microbiome composition and the presence of foodborne pathogens, such as Salmonella, can facilitate the identification of novel indicators of Salmonella contamination. This study assessed the utility of microbiome data and three machine learning algorithms for predicting Salmonella contamination of Northeastern streams. The research reported here both expanded the knowledge on the microbiome composition of surface waters and identified putative novel indicators (i.e., Aeromonas species) for Salmonella in Northeastern streams. These putative indicators warrant further research to assess whether they are consistent indicators of Salmonella contamination across regions, waterways, and years not represented in the data set used in this study. Validated indicators identified using microbiome data may be used as targets in the development of rapid (e.g., PCR-based) detection assays for the assessment of microbial safety of agricultural surface waters.

Overall microbiome composition was not associated with Salmonella presence in surface water samples. The overall microbiome composition was not associated with Salmonella contamination (see Fig. S1 in the supplemental material). Consistent with results of permutational multivariate analysis of variance (PERMANOVA), the principal-component analysis (PCA) biplot ( Fig. 1) showed that the sample microbiomes did not cluster based on the presence of Salmonella. The first two components explained a relatively low percentage of variance in the microbiome composition. Specifically, they explained 24.2% of the variance at the species level (Fig. 1A), 22.6% of the variance at the genus level (Fig. 1B), and 21.8% of the variance at the family level (Fig. 1C). PERMANOVA results also did not indicate significant association between microbiome composition and Salmonella isolation (P = 0.091 [species level], P = 0.349 [genus level], P = 0.318 [family level]).
CF models outperformed RRF and SVM models when predicting Salmonella contamination. Regardless of the feature set, the mean area under the curve (AUC) and Kappa score were always higher for conditional forest (CF) than for regularized random forest (RRF) and support vector machine (SVM) when species-level microbiome data were used ( Fig. 2; Table S2). The AUC was also consistently higher for CF models than for RRF and SVM models when genus-level and family-level microbiome data were used. The Kappa values were similar for CF and RRF models when genus-level and family-level microbiome data were used ( We found that the AUC range for CF did not overlap AUC ranges of the other two methods, indicating that CF has outperformed RRF and SVM. Hence, further analyses were carried out using the CF models only. Across all models built using the species-level data, the CF model run on central log-ratio (CLR)-transformed relative abundance data without environmental features had the highest AUC (0.87, SD = 0.10) and second highest Kappa score (0.51, SD = 0.20) (Fig. 2; Table S2). When using genus-level data, the CF model built using the CLR-transformed relative abundance data without environmental features had the highest AUC (0.80, SD = 0.12) and Kappa (0.42, SD = 0.21) score. When using familylevel data, the CF model using the relative abundance data with environmental features resulted in the highest Kappa score (0.34, SD = 0.23), and second highest AUC (0.76, SD = 0.14) ( Fig. 2; Table S2). However, we found that the performance of models built using microbiome data only and that of models using microbiome and environmental factors were not significantly different when cross-validated performance measures were compared (paired t test, P . 0.05).
The two most informative taxa for predicting Salmonella presence both belonged to the Aeromonas genus. CF identified Aeromonas sp. strain CA23, Aeromonas strain CU5, Aeromonas salmonicida, Rhizobium rhizoryzae, and Nocardiopsis alba as the five most informative species for predicting Salmonella contamination (Fig. 3A). Using genus-level microbiome data, CF identified Aeromonas, Tabrizicola, Haematobacter, Defluviimonas, and Rhizobium as the five most informative genera (Fig. 3B). In family-level analysis, Aeromonadaceae, Rhodobacteraceae, and Methanobacteriaceae were identified as the three most informative families. Furthermore, four environmental features (i.e., stream level, dissolved oxygen level, conductivity, and average amount of solar radiation in the last 30 days) were also identified as informative features in the CF model that included environmental features (Fig. 3C).
The differential abundance analysis carried out using ALDEx2 identified two bacterial species, Aeromonas sp. CA23 (P = 0.00159) and Aeromonas salmonicida (P = 0.00172) (Fig. 4) as significantly differentially abundant between Salmonella-positive and Salmonella-negative samples. This was consistent with the results of the differential abundance test using genus-level microbiome data, where two bacterial genera, Aeromonas (P = 0.03246) and Tabrizicola (P = 0.037197) (

DISCUSSION
This study was based on the premise that contamination of surface waters with enteric pathogens such as Salmonella cooccurs with other, more abundant microbiota that are easier to rapidly and directly (e.g., by PCR without culture-based enrichment) detect. Such microbiota may be introduced to the waterway from the same contamination source as the target pathogen or be associated with the environmental conditions that favor pathogen introduction to, or survival in, freshwater systems. As such, these taxa could be useful as indicators of when and where surface waters may by contaminated by Salmonella and thus represent possible microbial targets of rapid tests used to monitor surface water for Salmonella contamination. Hence, we aimed to leverage water microbiome data and machine learning classifiers to identify specific taxa predictive of Salmonella contamination that could be further developed into rapid detection assays (10). For example, taxa that (i) consistently occur in samples contaminated with Salmonella or (ii) are consistently present in a significantly different relative abundance in samples contaminated with Salmonella may be utilized to develop rapid PCR-based indicator assays in preference over the current time-consuming enrichment-based detection method. Data transformation did not have a notable effect on the performance of predictive models. Microbiome data analyses are challenging due to inherent data complexities. These include data sparsity (i.e., many taxa are present in only small proportions of samples, resulting in a large proportion of zero counts), collinearity (i.e., some taxa are highly correlated), imbalanced library sizes, and a well-known "small n, large p" problem (i.e., small number of samples and a large number of taxa). Some of these challenges can be addressed prior to applying machine learning models or can be addressed by applying machine learning methods that are robust to these microbiome data challenges. Here, we used a central log-ratio (CLR) transformation to mitigate the potential effects of different library sizes and to take into consideration the compositional nature of microbiome data (33,34). We found that the microbiome data transformation had an inconsistent effect on a model performance, as it improved the AUC of some models (e.g., CF using genus-and family-level data and SVM using family-level data) while decreasing the AUC of other models (e.g., RRF using genus-and family-level data and SVM using genus-level data), whereas there were no significant differences among different models. Previous studies on gut microbiome data emphasized the importance of data transformation due to the compositionality of the microbiome data (28,30,32). However, similar to our finding, those studies also reported that the performance of tree-based algorithms (i.e., random forest and XGBoost) was not significantly affected by data transformation (32).
Model selection is critically important when applying machine learning algorithms to microbiome data. Different machine learning classifiers address microbiome data challenges to various degrees. Hence, we assessed the performance of three different machine learning algorithms for prediction of Salmonella contamination. At all taxonomic levels of microbiome data used for classification, we found that CF performed better overall than RRF and SVM. This is consistent with another study that compared the abilities of different algorithms to accurately predict foodborne pathogen contamination of surface water and consistently found that CF was a high-performing algorithm (27). However, in that study, the authors also found that RRF and/or SVM performed well. Given the complex relationships that underpin microbial ecosystems, algorithms that use hierarchical relationships between the presence-absence of various taxa (e.g., models that incorporate "interactions" or hierarchy such as CF and RRF) may predict pathogen presence better than algorithms that do not (e.g., SVM).
Differential abundance analysis and conditional forest models identified putative indicators of Salmonella contamination in surface waters. When assessing the association between the overall microbiome composition and Salmonella contamination, associations between certain taxa and Salmonella contamination may be missed due to the large number of taxa included in the analyses. Indeed, we found a lack of association between the overall microbiome composition and the presence of Salmonella based on the principal-component analysis (PCA) and PERMANOVA. Hence, we used differential abundance analysis and CF to identify individual taxa (or their relative abundance) associated with Salmonella contamination.
Both differential abundance analysis and CF identified some of the same taxa predictive of Salmonella contamination in stream water samples, suggesting that the association between Salmonella contamination and the presence of these taxa identified in the present study is robust to the analytical method used. Specifically, we identified two bacterial species (Aeromonas salmonicida and Aeromonas sp. CA23) which were present in a significantly lower relative abundance in samples contaminated with Salmonella using ALDEx2 differential abundance analysis. These taxa were also among the top-ranked factors based on the CF. Aeromonas species have previously been found in natural water and a broad range of foods, as well as human and animal gastrointestinal tracts (35,36). Aeromonas sp. strain CA23 is closely related phylogenetically to A. salmonicida strain CBA100 (37). Aeromonas salmonicida has been found in a variety of aquatic environments and has been identified as a cause of disease in salmonid fish and wild cod (38,39). Aeromonas has also been reported as an opportunistic pathogen in immunocompromised humans (35). However, the data on the relationship between A. salmonicida and Salmonella are limited. McIntosh et al. reported that antibiotic and mercury resistance

Prediction of Salmonella Contamination in Water
Microbiology Spectrum genes were found on an A. salmonicida plasmid that is highly similar to pSN254, a plasmid previously identified in Salmonella (40). This may suggest horizontal gene transfer between the two genera, which requires cooccurrence in an environmental reservoir. Our study found that a higher relative abundance of A. salmonicida was associated with Salmonellanegative samples. This could possibly indicate that A. salmonicida is able to outcompete Salmonella, resulting in exclusion of Salmonella from the water microbiome or vice versa. Furthermore, Aeromonas has morphological and biochemical characteristics similar to those of microorganisms belonging to Enterobacteriaceae, a family of microorganisms that includes Salmonella and is commonly used as an indicator of poor hygienic conditions in food systems (41). Bonadonna et al. reported that the presence of E. coli and fecal coliforms was associated with lower counts of Aeromonas, whereas the prevalence of total coliforms (which include non-Enterobacteriaceae) was associated with higher counts of Aeromonas in bathing waters along the coast of the Adriatic Sea (42,43). Due to limited information available, the ecological relationship between Salmonella and Aeromonas is poorly understood. Hence, more research is needed to characterize it and explain reported associations between these genera. Another genus identified in this study, Tabrizicola, belongs to the family Rhodobacteraceae, which are usually found in aquatic environments, including lakes and wastewater treatment facilities (44)(45)(46). However, the ecological role of Tabrizicola is also poorly understood, which limits our ability to meaningfully interpret our identification of Tabrizicola as predictive of Salmonella contamination.
At a family level, we identified Aeromonadaceae (including Aeromonas), Rhodobacteraceae (including Tabrizicola), Shewanellaceae, and Parvibaculaceae as being strongly associated with accurately predicting Salmonella contamination in the CF analyses and being significantly differentially abundant between Salmonella-positive and -negative samples. All four identified families are known marine or aquatic microbiome members commonly found in natural waters (47)(48)(49). However, their relationship with Salmonella is unknown.
A study by Gu et al. used 16S rRNA gene amplicon sequencing data and found an association between specific microbial taxa and the prevalence and population density

Prediction of Salmonella Contamination in Water
Microbiology Spectrum of Salmonella enterica detected in ponds and wells on the Eastern Shore of Virginia (ESV) between January and December (50). They found that the relative abundance of Sphingomonadales was significantly correlated with S. enterica prevalence and concentration in irrigation ponds and wells (50). However, in our study, we failed to find evidence of an association between Sphingomonadales and Salmonella contamination. This inconsistency could be explained by regional differences in the water microbiome composition or by differences in the microbiomes of ponds, wells, and streams. Similarly, microbiome composition is known to be influenced by environmental features such as natural variation over time, weather, and adjacent land use (22,23,25), and environmental differences between the waterways, water types, regions, or years may also explain the different findings.
Previous studies reported that environmental features had an effect on the probability of isolating Salmonella from freshwater samples (20,50,51). Therefore, it is important to consider environmental condition as supplementary features when developing new tools for predicting Salmonella contamination of freshwater based on the specific taxa. However, in our study, environmental features did not significantly improve the prediction accuracy.
Conclusions. This study applied machine learning classifiers and differential abundance analysis to surface water microbiome data to identify putative novel indicators of Salmonella contamination. We identified Aeromonas and Tabrizicola bacterial genera and Aeromonadaceae, Rhodobacteraceae, Shewanellaceae, and Parvibaculaceae families that warrant further assessment as putative indicators of Salmonella contamination of water from streams. The taxa identified here are potential targets for the development of an alternative or complementary (to E. coli quantification) water quality/safety monitoring strategy focused on mitigating the use of surface waters likely contaminated with Salmonella. Models developed in this study need to be validated on new samples collected from a broader geographic area and over multiple seasons to assess the predictive accuracy of taxa identified here. Furthermore, deeper metagenomic sequencing that would enable metagenome assembly may facilitate identification of putative novel indicators of Salmonella contamination as well as characterization of their functional potential to explain their ecological relationship with Salmonella.

MATERIALS AND METHODS
Sample collection and processing. Water samples were collected from 60 streams in upstate New York State between July and October 2018 as described by Weller et al. (20). All chemical, microbial, and environmental water quality data were previously reported by Weller et al. (20). Briefly, 10-L grab samples were collected from each stream and tested for Salmonella presence. Each grab sample was filtered through modified Moore swabs (mMS). Buffered peptone water supplemented with novobiocin (20 mg/l) was added to Whirl-Pak bags containing mMS and incubated at 35°C for 24 h. After incubation, a BAX real-time PCR screen was used to identify samples that were presumptively positive for Salmonella. Salmonella presence was confirmed using culture-based methods fully described in the work of Weller et al. (20).
Separately, 100-mL grab samples were collected for metagenomic analysis. The 100-mL samples were filtered through a 0.45-mm filter (Nalgene; Thermo Fisher Scientific, Waltham, MA, USA). Filters were then stored at 280°C until DNA extraction.
DNA extraction and microbiome sequencing. DNA was extracted using the DNeasy Power Water kit (Qiagen, MD, USA) per the manufacturer's instructions. Extracted DNA was examined for quality and quantified using Nanodrop One (Thermo Fisher Scientific, MA, USA) and Qubit 3 (Thermo Fisher Scientific, MA, USA), respectively. DNA was then sent to the Penn State Genomics Core Facility for library preparation and sequencing. Libraries were prepared using Nextera XT Flex per the manufacturer's instructions. Pooled libraries were sequenced on an Illumina NextSeq with 150-bp paired-end reads.
Sequence quality control and taxonomic classification. FastQC version 0.11.5 was used to assess read quality using default parameters (52). Illumina adapters and low-quality bases were trimmed using Trimmomatic (v 0.36) (53) with default parameters. Trimmed reads were taxonomically classified using Kraken2 (v 2.1.2) (54), and abundances were inferred using Bracken (v 2.5) (55). The NCBI's RefSeq nucleotide database (v 207) (56) was used to build a Kraken2 database. Any read that mapped to a single reference genome was labeled with the NCBI taxonomic annotation (taxid) corresponding to that reference genome. Any read that mapped to multiple reference genomes or did not meet or exceeded the confidence scoring threshold was assigned a last common ancestor (LCA) taxonomic identification (taxid) (54). Confidence scores were set to 0.1, meaning that at least 10% of the total number of kmers from a read were classified. Bracken was used to estimate the abundance of taxa by redistributing reads in the Prediction of Salmonella Contamination in Water Microbiology Spectrum taxonomy using Bayes' theorem (55). Assigned taxonomy and taxid counts of all samples were merged into a table that was used for downstream analyses. Microbiome analyses. All statistical analyses of microbiome data were performed in R (version 4.1.2; R Core Team, Vienna, Austria) (57), using a compositional analyses framework (33). First, the estimated abundances were transformed using the centered log-ratio (CLR) transformation (34). Ratio transformations capture the relationship between the taxonomic units in the data, and the logarithm of these ratios ensures that the data are symmetric and linearly related (34). Distances between samples were calculated using the Aitchison distance (i.e., Euclidian distance after CLR transformation) to investigate the among-sample differences in compositional microbiome data (33). Principal-component analysis (PCA) was carried out using the 'princomp' function in R on relative abundances of taxids to visualize the ordination and clustering of samples based on the microbiome composition (33). The first two principal components were plotted using the 'ggplot2' package (v 3.3.3) (58). Samples were color coded to visually assess whether they cluster based on the Salmonella presence/absence. Permutational multivariate analysis of variance (PERMANOVA) was carried out to assess statistical associations between microbiome composition and Salmonella presence using the 'adonis' function in the 'vegan' package (v 2.5.7) (59). Differential abundance testing was conducted using the ALDEx2 R package (60) to identify bacterial species, genera, and families that were differentially abundant between Salmonella-positive and -negative samples. Each identified bacterial species, genus, and family were tested using Welch's t test to assess statistical significance of detected differences in their relative abundance. To mitigate reporting falsepositive results, we concurrently carried out differential abundance statistical tests and reported as putative novel indicators only taxa that were identified by both machine learning classifiers and ALDEx2.
Predictive modeling. Three machine learning algorithms (i.e., conditional forest [CF] [61], regularized random forest [RRF] [62], and support vector machine [SVM] with sigmoid kernel [63]) that had previously been applied on microbiome data (64)(65)(66) and were suitable for microbiome data structure (27) were used in this study. Additionally, these algorithms were previously reported to outperform others for predicting Salmonella presence using environmental data (27). These methods were used to develop models that predict Salmonella presence or absence in water samples. Both relative abundance of taxids and CLR-transformed relative abundance of taxids were separately used as features to assess the effect of microbiome data transformation on model performance. Model training and evaluation were performed using the 'mlr' package (v 2.19.0) (67). Tenfold cross-validation repeated 10 times was used to tune hyperparameters to maximize area under the curve (AUC) (68,69). For the CF model, mtry and minbucket (number of taxa included in each splitting) were tuned. For the RRF model, mtry, nodesize (minimum node size of each decision tree), and coefReg (the coefficient of regularization) were tuned. Lastly, for SVM, cost and gamma were tuned to find the best-performing model (see Table S1 in the supplemental material). In total, analyses were conducted using two feature sets, (i) untransformed relative abundances (abundance of specific taxa/total count of taxids) of microbial taxa and (ii) CLR-transformed relative abundances of microbial taxa, to assess whether models perform better on transformed microbiome data. Analyses were also carried out at three different taxonomic levels (i.e., species, genus, and family) and with or without environmental data. Environmental features were provided from Table S1 in the work of Weller et al. (27). In total, 36 models were constructed separately based on the mean AUC and kappa scores ( Table 1). The best-performing model for each combination of algorithm and input data (Table S2) was selected for identifying informative taxa. Paired t test was used assess the statistical significance of differences among selected models (70). Conditional variable importance (CVI) was calculated using the 'party' package (v. 1.3.7) for CF models (61). Conditional variable importance was used to mitigate the problem of complex interaction between bacterial taxa (i.e., highly correlated taxids) by screening correlated variables during the decision tree building process (61).
Data availability. Sequences generated in this study are available in the NCBI Sequence Read Archive database under the BioProject accession number PRJNA849616. Scripts used for bioinformatics and statistical analyses are available in the GitHub repository: https://github.com/tuc289/SurfaceWaterMicrobiome/ tree/master/Year2.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, PDF file, 1.7 MB. We thank Martin Wiedmann, Sherry Roof, and Alexandra Belias for their support as well.