Machine Learning Predicts Biogeochemistry from Microbial Community Structure in a Complex Model System

ABSTRACT Microbial community structure is influenced by the environment and in turn exerts control on many environmental parameters. We applied this concept in a bioreactor study to test whether microbial community structure contains information sufficient to predict the concentration of H2S as the product of sulfate reduction. Microbial sulfate reduction is a major source of H2S in many industrial and environmental systems and is often influenced by the existing physicochemical conditions. Production of H2S in industrial systems leads to occupational hazards and adversely affects the quality of products. A long-term (148 days) experiment was conducted in upflow bioreactors to mimic sulfidogenesis, followed by inhibition with nitrate salts and a resumption of H2S generation when inhibition was released. We determined microbial community structure in 731 samples across 20 bioreactors using 16S rRNA gene sequencing and applied a random forest algorithm to successfully predict different phases of sulfidogenesis and mitigation (accuracy = 93.17%) and sessile and effluent microbial communities (accuracy = 100%). Similarly derived regression models that also included cell abundances were able to predict H2S concentration with remarkably high fidelity (R2 > 0.82). Metabolic profiles based on microbial community structure were also found to be reliable predictors for H2S concentration (R2 = 0.78). These results suggest that microbial community structure contains information sufficient to predict sulfidogenesis in a closed system, with anticipated applications to microbially driven processes in open environments. IMPORTANCE Microbial communities control many biogeochemical processes. Many of these processes are impractical or expensive to measure directly. Because the taxonomic structure of the microbial community is indicative of its function, it encodes information that can be used to predict biogeochemistry. Here, we demonstrate how a machine learning technique can be used to predict sulfidogenesis, a key biogeochemical process in a model system. A distinction of this research was the ability to predict H2S production in a bioreactor from the effluent bacterial community structure without direct observations of the sessile community or other environmental conditions. This study establishes the ability to use machine learning approaches in predicting sulfide concentrations in a closed system, which can be further developed as a valuable tool for predicting biogeochemical processes in open environments. As machine learning algorithms continue to improve, we anticipate increased applications of microbial community structure to predict key environmental and industrial processes.

from three different sections (top, middle, and bottom) of the columns were harvested under anaerobic conditions to understand the microbial diversity of the stationary phase (Table S1).

Sequencing method
Three Illumina MiSeq runs were used to sequence 731 samples (674 effluent samples and 57 sessile samples). Briefly, PCR amplicon libraries targeting the 16S rRNA encoding gene present in metagenomic DNA were produced using a barcoded primer set adapted for the Illumina HiSeq2000 and MiSeq (2). DNA sequence data were then generated using Illumina paired-end sequencing at the Environmental Sample Preparation and Sequencing Facility (ESPSF) at Argonne National Laboratory. Specifically, the V4 region of the 16S rRNA gene was PCR amplified with 515F-806R primers (3) that included sequencer adapter sequences used in the Illumina flowcell (2). The forward amplification primer also contained a twelve-base barcode sequence that supports the pooling of up to 2,167 different samples in each lane (2,4 of each of the products were pooled into a single tube so that each amplicon was represented in equimolar amounts. This pool was then cleaned up using AMPure XP Beads (Beckman Coulter) and then quantified using a fluorometer (Qubit, Invitrogen). After quantification, the molarity of the pool was determined and diluted down to 2 nM, denatured, and then diluted to a final concentration of 6.75 pM with a 10 % PhiX spike for sequencing on the Illumina MiSeq.

Bioinformatics analyses
731 paired-end reads generated from Illumina MiSeq were filtered, denoised, and merged using dada2 (5). Samples from three different runs were processed separately in dada2, considering different error profiles for different runs. The merged reads were inflated to redundant fasta files using deunique_dada2.py (https://github.com/bowmanlab/seq_data_scripts/blob/master/deunique_dada2.py) for analysis using paprica. The output from deunique_dada2.py (.exp.fasta) was analyzed using paprica v0.7.0 (https://github.com/bowmanjeffs/paprica/releases/tag/paprica_v0.7.0) for the determination of community and predicted metabolic structure (6). In brief, paprica places each read on a phylogenetic reference tree created from complete 16S rRNA genes from all completed genomes in GenBank. Placements to terminal branches on the reference tree are referred to as closest completed genomes (CCG), while placements to internal branches are referred to as closest estimated genomes (CEG). The output of the paprica metabolic inference is an estimate of the enzymes and metabolic pathways contained in each member of the community. Further analyses were carried with 16S rRNA gene copy number corrected abundances generated using paprica.
The taxonomic affiliations of the unique sequences were also determined using ROPE (https://github.com/avishekdutta14/ROPE). ROPE (RDP classification of paprica edges) determines the most abundant unique sequence (MAUS) affiliated to a particular taxonomic edge and uses RDP classifier to determine the taxonomic affiliation of MAUS. The taxonomic affiliation of the MAUS is then allocated to the affiliated edge. Another function of ROPE is to assign taxonomy to the unique sequences as obtained from paprica outputs.

Random Forest models
Random forests (RF) (7) classification and regression models were created using randomForest package (8) in the R statistical package to predict the phases and sulfide concentration. For classification-based RF models, relative and absolute percentage abundances of unique bacterial sequences were used as independent variables to predict different phases. Accuracies for such models were determined using confusionMatrix function from the caret package in R (9). For regression-based RF models, relative percentage abundances of unique bacterial sequences were used as independent variables to predict sulfide concentrations. Actual sulfide concentration vs.
predicted sulfide concentrations was plotted, and linear model function (lm) in R statistical package was used to determine the accuracies (from R 2 ) and errors (residual standard error) for the predictions. For all the models, 300 trees were generated, and the default mtry parameter was used for classification, and regression task was used, which is the square root of the number of features (for classification) or 1/3 of the number of features (for regression) randomly picked to split the tree at each node.
The out-of-bag error statistic provided in RF shows the goodness of model fit but not necessarily predictive performance. For this reason, 30% of the observations were randomly withheld and were used to perform more precise model validation. The data that was withheld for validation was termed as validation dataset, and the remaining dataset, which was used to train RF models was termed as the training dataset. Sulfide concentration variations in the validation dataset and training dataset were kept similar to remove the chances of underfitting.
Among 674 effluent samples, sulfide concentration was measured for 649 samples. For regressionbased RF models, two time points (H and I) and outliers based on sulfide concentrations were removed. Samples from timepoints H (dated: 07/16/2019) and I (dated: 07/19/2019) were not considered for constructing regression-based RF models, since the sulfide concentrations for those timepoints were consistently low while compared to the sulfide concentrations from other S phase samples. Microbial diversity data along with other set of physicochemical parameters were studied for these data points, and it was concluded that these consistent lower values of sulfide concentration compared to the neighboring data points might be due to sampling or analytical error. Outliers for sulfide concentration from each phase were determined using Tukey's method (10). An observation was considered to be an outlier when its value was outside the range: [Q1 -1.5 × (Q3 -Q1), Q3 + 1.5 × (Q3 -Q1)], where Q1 and Q3 are the first and third quartiles, respectively.
In order to minimize problems due to overfitting and achieve parsimonious models, the VSURF package (11) was used. This package allows feature selection following three steps: Step 1 eliminates irrelevant variables from the data set, Step 2 selects variables related to the response, and Step 3 refines the variable selection by eliminating redundancy in the set of variables selected in the second step for prediction purpose. Important variables were obtained from the random forest models based on the percentage increase in mean squared error. Percentage increase in mean squared error measures the effect on the predictive power when the value of a specific original variable is randomly permuted (7). If the random permutation drastically changes the predicted value (as measured by the mean squared error), then the original variable is considered critical (12). Differential abundance of the important variables across different samples was analyzed using Canonical analysis of Principal Coordinated (CAP).
A cross-validation experiment was conducted using randomForest package in which samples from a column were used as a validation set, whereas the observations from the remaining 19 columns were used as the training set. Since the microbial communities shifted differently in different columns, this design was made to analyze the robustness of the RF models and to predict the sulfide concentration of samples, which was not used to train the RF models. To compare all the RF models, RSE % was calculated using the following equation RF models were also used to determine the source of the microbial community. The dataset for this model was prepared by including three sessile samples (from the top, middle, and bottom sections of the columns) and three effluent time points of each column (before the columns were sacrificed for sessile community harvesting).      Figure S1: H2S concentration prediction with cross-validation Random Forest models based on relative taxa abundance. Scatter plot of predicted versus actual H2S concentration from validation set of all the models descried in Table 4.  Table 5.