Maxent estimation of aquatic Escherichia coli stream impairment

Background The leading cause of surface water impairment in United States’ rivers and streams is pathogen contamination. Although use of fecal indicators has reduced human health risk, current approaches to identify and reduce exposure can be improved. One important knowledge gap within exposure assessment is characterization of complex fate and transport processes of fecal pollution. Novel modeling processes can inform watershed decision-making to improve exposure assessment. Methods We used the ecological model, Maxent, and the fecal indicator bacterium Escherichia coli to identify environmental factors associated with surface water impairment. Samples were collected August, November, February, and May for 8 years on Sinking Creek in Northeast Tennessee and analyzed for 10 water quality parameters and E. coli concentrations. Univariate and multivariate models estimated probability of impairment given the water quality parameters. Model performance was assessed using area under the receiving operating characteristic (AUC) and prediction accuracy, defined as the model’s ability to predict both true positives (impairment) and true negatives (compliance). Univariate models generated action values, or environmental thresholds, to indicate potential E. coli impairment based on a single parameter. Multivariate models predicted probability of impairment given a suite of environmental variables, and jack-knife sensitivity analysis removed unresponsive variables to elicit a set of the most responsive parameters. Results Water temperature univariate models performed best as indicated by AUC, but alkalinity models were the most accurate at correctly classifying impairment. Sensitivity analysis revealed that models were most sensitive to removal of specific conductance. Other sensitive variables included water temperature, dissolved oxygen, discharge, and NO3. The removal of dissolved oxygen improved model performance based on testing AUC, justifying development of two optimized multivariate models; a 5-variable model including all sensitive parameters, and a 4-variable model that excluded dissolved oxygen. Discussion Results suggest that E. coli impairment in Sinking Creek is influenced by seasonality and agricultural run-off, stressing the need for multi-month sampling along a stream continuum. Although discharge was not predictive of E. coli impairment alone, its interactive effect stresses the importance of both flow dependent and independent processes associated with E. coli impairment. This research also highlights the interactions between nutrient and fecal pollution, a key consideration for watersheds with multiple synergistic impairments. Although one indicator cannot mimic theplethora of existing pathogens in water, incorporating modeling can fine tune an indicator’s utility, providing information concerning fate, transport, and source of fecal pollution while prioritizing resources and increasing confidence in decision making.


Introduction
Rapid urbanization of rural areas causes deterioration in water quality, rendering many water bodies unfit for their domestic and recreational use.An assortment of contaminants is introduced into aquatic systems, but pathogens represent the major cause of stream impairment in the United States (United States Environmental Protection Agency, 2017).
Pathogens are difficult to measure directly because of their sporadic distribution, costly identification, and potential health risks to laboratory workers (Field and Samadpour, 2007).
Most pathogens in aquatic systems stem from human and animal fecal wastes, including direct deposition of feces in water (Vidon et al., 2008), run-off from land with fecal deposits (Tyrrel and Quinton, 2003;Jamieson et al., 2004), and sanitary sewer malfunctions (Ferguson et al., 2003;McLellan and Eren, 2014).To address the difficulties in monitoring specific pathogens, fecal indicator organisms (FIOs) are commonly used to assess the presence of fecal pathogens.
An effective fecal indicator is associated with the presence of specific pathogens, with a straightforward method for enumeration that correlates with magnitude and age of fecal pollution (Savichtcheva and Okabe, 2006;Maier et al., 2009).The use of FIOs such as fecal coliform bacteria and Escherichia coli are traditionally used for determining surface water pathogen impairment (Yates, 2007;US Environmental Protection Agency, 2012).Although these indicators assist in alerting populations when exposure to pathogens is likely, the current approach is limited by using a single indicator such as E. coli for a designated use (Wade et al., 2003;Savichtcheva and Okabe, 2006;Field and Samadpour, 2007).The cosmopolitan nature of E. coli in warm-blooded animals makes them impractical for source identification (Field and PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) Manuscript to be reviewed Samadpour, 2007;Yates, 2007;McLellan and Eren, 2014;Blount, 2015).The ability of E. coli to survive in soils (Lasalde et al., 2005;Ishii et al., 2006), algae (Byappanahalli et al., 2003), and sediments (LaLiberte and Grimes, 1982;Alm and Burke, 2003;Drummond et al., 2015) provide a reservoir for continued persistence and potential to naturalize (Winfield and Groisman, 2003;Lasalde et al., 2005;Luo et al., 2011).These characteristics and deficiencies emphasize the difficulty of single standard FIO monitoring for impairment, stressing the need for additional methods to evaluate source and mechanisms of FIO impairment.
In addition to the above issues, appropriately characterizing FIO impairment for regulation and decision-making is difficult due to complex fate and transport processes (Benham et al., 2006;de Brauwere et al., 2014;Drummond et al., 2015).These complex fate and transport processes include transport through run-off and storm water (Kistemann et al 2002, Lipp et al 2001;McKergow and Davies-Colley, 2010), remobilization from sediments and hyporheic exchange (Drummond 2015;Dwivedi 2016), particle attachment (Characklis, 2005), and UV light exposure (Sinton 2002).Additionally, ecological processes control FIO fate and transport through variable survival patterns of indicators and pathogens (Anderson et al., 2005;Stott et al., 2011), availability of nutrients and organic matter (Surbeck et al. 2010;Perkins et al. 2016), andpredation (McCambridge andMcMeekin, 1981).Appropriately characterizing the physics and ecology driving fate and transport can better inform management decisions for total maximum daily load (TMDL) development, reduction of pollution, and allocation of resources.

Manuscript to be reviewed
Modeling provides flexible approaches to infer sources and processes associated with FIOs and other pathogens, overcoming some of the issues of the single indicator paradigm.Various statistical and machine learning models have been used to approach such problems of incorporating age of fecal pollution for source tracking or detection of viruses (Brion et al. 2002;Black et al., 2007); identifying land use, environmental, and water quality parameters associated with FIOs and pathogens (Brion and Lingireddy, 1999;Viau et al., 2011;Wilkes et al., 2011;Gonzalez et al., 2012;Gonzalez and Noble, 2014;Hall et al., 2014;Herrig et al., 2015;Lušić et al., 2017); determining factors influencing particle attachment and virulence (Piorkowski et al., 2013); and optimizing microbial source tracking (Belanche-Muñoz and Blanch, 2008;Ballestè et al., 2010;Smith et al., 2010;Molina et al., 2014).Some other applications of modeling include using turbidity or rainfall to predict E. coli concentrations at unmonitored sites (Money et al. 2009, Coulliete et al. 2009), estimating E. coli loads using physical, chemical, and biological factors within a neural network (Dwivedi 2013), and hyporheic-groundwater interactions associated with transport of E. coli within sediments porewater (Dwivedi 2016).Modeling can inform decision-makers concerning what drives impairment, addressing some of the shortcomings of a single indicator approach.
Maxent, a commonly used ecological niche model (Phillips et al., 2004;Phillips and Dudik 2008), identified environmental variables associated with probability of E. coli stream impairment, making inferences concerning source and mechanisms driving fecal pollution.Although modeling E. coli using a machine learning model such as Maxent is not a novel approach, e.g.Dwivedi et al. 2013, this  Manuscript to be reviewed quality is associated with E. coli impairment in lower order streams, uses nonparametric bootstrapping as a probabilistic assessment of model performance based on the area under the curve (AUC) of the receiving operator characteristic (ROC), and uses loss of information as an indicator of sensitive variables.Ecological niche models have been utilized for species distribution (Lozier et al., 2009), conservation of rare species (Guisan et al., 2006), invasive species (Thuiller et al., 2005), and disease vector epidemiology studies (Boeckmann and Joyner, 2014), but this is a new application of Maxent to microbial water quality.Additionally, developing models in lower order streams has not been previously reported; this is important because water from low order streams is used for domestic water supply and recreation in many areas of the United States.Manuscript to be reviewed water quality, demonstrates a probabilistic method for variable selection, and reframes the question from "How much E. coli in our watershed?"to "what factors separate E. coli impairment from compliance?," which is useful when evaluating watershed decisions.

Sampling sites and data collection
Sinking Creek is a 1 st to 3 rd Strahler order mixed-use stream that is noncompliant for state of Tennessee standards for fecal coliform and E. coli (Tennessee Department of Environmental and Conservation, 2006).Starting in August 2004, samples were collected by hand in August, November, February, and May of each year until August 2011 as a long-term monitoring plan at 14 sites in Sinking Creek, and samples were analyzed for 10 water quality parameters and populations of E. coli (Figure 1).Specific conductance (conductivity) and water temperature were measured using an Orion 115A+ conductivity meter (Thermo Fisher Scientific, Waltham, MA).The pH was measured using an EL2 portable pH meter (Mettler Toledo, Columbus, OH).Dissolved oxygen was collected using a YSI Model 55 dissolved oxygen meter (YSI Inc., Yellow Springs, OH).Samples for nitrates (NO 3 ), phosphates (PO 4 ), biochemical oxygen demand (BOD), alkalinity, and hardness were collected in clean 2 L polyethylene bottles and stored at 4 o C until laboratory analysis.A flow meter (Global Water, FP101) was placed in the center of the channel to measure stream velocity.Stream width was calculated where the stream velocity was PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) Manuscript to be reviewed measured, and depth was averaged over three points across the stream width.Velocity was multiplied by stream width and average depth to estimate discharge.
NO 3 and PO 4 analyses were performed in triplicate using colorimetric HACH TM methods (Hach, Loveland, CO) and reagents.NO 3 and PO 4 analyses were conducted by adding 10 mL of water to a vial containing NitraVer5 or PhosVer3 for the respective analyses.Vials were shaken to dissolve the reagent and samples were analyzed with a DR890 colorimeter (Hach, Loveland, CO) (HACH Company, 2013).Triplicate sample for alkalinity and hardness were determined using 100 mL sample volumes and a digital titrator (Hach, Loveland, CO) (HACH Company, 2013).
Phenolphthalein and bromcresol green-methyl red indicators were used, and the sample was titrated with 1.6 N sulfuric acid to a grey-green endpoint (Hach Company, 2006).BOD was measured in triplicate using the 5-day BOD test (American Public Health Association, 2005.
Populations of E. coli were determined using the Colilert defined substrate test.Briefly, 97 wells were filled with 100 mL of water sample with the Colilert substrate added.Samples were incubated for 24 hours, and wells that fluoresce under a UV light were considered positive for E. coli, and a most probable number estimate was made based on the number of positive wells in both the large and small wells (American Public Health Association, 2005).If a sample was in excess of the geometric mean United States recreational water quality criteria, the sample site was considered impaired.Impairment was based on recommendation 1, which is a threshold of 126 that corresponds to an illness rate of people (US Environmental Protection Manuscript to be reviewed To get an estimation of land use throughout the Sinking Creek watershed, land cover data were downloaded from the National Land Cover Dataset (NLCD) for 2006 (Fry et al., 2011).Each sampling site's drainage area was delineated using StreamStats version 3 from the United States Geologic Survey (Ries III et al., 2017).Land was grouped into 3 categories; forested, developed, and agricultural.Forested land includes the categories deciduous forest, evergreen forest, and mixed forest.Developed land use includes all developed categories; open space (less than 20% impervious surface), low intensity (20 -49% impervious surface), medium intensity (50 -79% impervious surface), and high intensity (80 -100% impervious surface).
Agricultural land included grassland/herbaceous and pasture/hay.The area of each land use was divided by the total area of the drainage area to get the percentage land use shown in table 1, and sampling sites as well as land cover categories are shown in figure 1.

Modeling background
Maxent is an iterative machine learning model commonly used for mapping species distributions (Phillips, S., Dudík, M., Schapire, 2010).Within the sample space, x, and given a set of environmental features (parameters), f 1 (x), f 2 (x), …, f n (x), the Maxent distribution estimates a vector of feature weights, β = (β 1 β 2 , …, β n ), that maximizes the entropy of the raw distribution of impairments, q β (x), using a Gibbs distribution, (1) Where Z β is the normalization constant that ensures that q β (x) integrates to one over the study area (Phillips, S., Dudík, M., Schapire, 2010).This modeling approach is justified because it provides the maximum information concerning impairment.From a water quality management PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) Manuscript to be reviewed standpoint, this approach is beneficial because decision-makers and stake-holders are more concerned with factors associated with impairment rather than compliance when approaching fecal pollution monitoring and management.
Original features (parameters) can be transformed into quadratic, product, hinge, and threshold feature classes so that complex multivariate responses can be modelled (Phillips and Dudık, 2008), but Maxent incorporates L1 regularization to balance satisfying the constraints on the features while minimizing overfitting.L1 regularization is not unique to Maxent, and is used in many general linear models (Elith et al., 2011).A regularization parameter λ j smooths probability distributions, generating sparse solutions and removing unnecessary features; this shrinks weights to balance fit and complexity (Elith et al., 2011).Because of regularization, Maxent fits a penalized maximum likelihood model equivalent to minimizing the relative entropy dependent on the error-bound constraints Where m is the number of positive samples, n is the number of features, and x is the feature vector for occurrence point i.Equation 2 provides insight into how Maxent uses background data: the first term is larger for models that distinguish between impairment states the best.
The second term represents the regularization, which gets larger as the weights increase,   indicating a complex model more likely to over fit.The output of is termed the raw   () PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) Manuscript to be reviewed distribution, but it is difficult to interpret due to its scale dependence.More background points result in smaller raw values because their sum cannot exceed 1 over a large amount of points (Phillips and Dudık, 2008;Elith et al., 2011;).For this reason, the logistical output of the Maxent model, will be used because it represents the probability of impairment given the sample (), space, x.This is a logistic model using the same set of weights β with the intercept of the model determined by the entropy of , H. The model is shown in Eq. 3 below.  ()

Univariate models
Data were processed using a list-wise deletion process, where individual samples from a site were removed if they were missing a parameter measurement due to laboratory errors, equipment malfunctions, calibration issues, or sites being dry at the time of sampling.A sample of 100 bootstrapped models were developed, and 20% was subsampled for testing validation.
Bootstrapping is a nonparametric resampling technique to make inferences about a population based on resampling from a set population, generating population level statistics, while providing an estimate of uncertainty of those statistics (Campolongo, 1997).For this modeling approach, all background points are used in the development of the null model, and the impaired samples are bootstrapped.Although Maxent can incorporate a wide variety of feature classes, only linear and quadratic feature classes were used to develop action values, or thresholds of impairment.The rationale for using these types of feature classes is for ease of generating action values as well as to assess both linear and non-linear effects of single parameters.
PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) Manuscript to be reviewed The AUC was calculated for the training and testing datasets.The AUC is a metric of performance for binary classification.the true positive prediction rate (sensitivity) and false positive prediction rates (1 -specificity) of each sample are plotted as a ROC for different decision boundaries, and the area under that ROC is integrated.An AUC of 0.5 indicates that the model is no better than random chance, and a value of 1.0 indicates perfect model performance (Zweig and Campbell, 1993;Zou et al., 2007).
The decision boundary (logistic threshold) between impaired and unimpaired samples should maximize accurately predicting impairment (exceedance of the E. coli criteria) while balancing correct negative predictions (Bean et al., 2012).Therefore, maximum test sensitivity and specificity was defined as the appropriate decision boundary.A low sensitivity would indicate poor performance in identifying impairment, while a low specificity would indicate an overcautious model in which resources might be wasted in remediation of false positives.
Accuracy for Maxent models was calculated as follows: , where TP are true Manuscript to be reviewed logistic threshold.Figure 2 demonstrates the concepts of the AUC performance metric, selected decision boundary, and the concept of the action value in relation to the selected decision boundary and Maxent model function (Eq.3).

Multivariate models and sensitivity analysis
Although some authors state that collinearity is not as problematic in Maxent compared to traditional regression approaches, collinearity was explored and subsequently removed using Pearson correlation coefficients (Elith et al., 2011).Variables that were highly correlated (r > 0.8) were evaluated to determine which variable to include based on expertise, connections to previous models, and accuracy metrics within the analysis.The initial multivariate model included all noncollinear variables and were developed using 100 bootstrapped samples like the univariate models, with the addition of product feature classes to incorporate variable interaction.Average variable contribution was determined by calculating the increase in Manuscript to be reviewed testing information gain.Student's t-tests were performed on each jack-knifed model to evaluate significance, and variables were included if the information gain from either the testing or training sets decreased; decrease in information would correspond to a significant loss of information, providing criterium for inclusion of the variable in final models.

Univariate Model Performance
The The plausible explanation of these differences is the latter variables had higher specificity rather than sensitivity at the chosen decision boundary (Table S2).Accuracy was found to be highest for alkalinity and lowest for pH.
PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) Manuscript to be reviewed Action values were developed for 8 significant univariate models by solving for the value of the variable when probability of impairment equals the logistic threshold.For example, the action value for alkalinity is 128 mg/L.This means that E. coli impairment is likely to occur when alkalinity is observed to be higher than this threshold.Action values and 95% confidence intervals are included in table 2. Action function graphs for each significant univariate model are presented in Figure S1 to aid in interpretation of table 2, and summary statistics for each variable are given in table S1.

Multivariate Model Performance
Pearson correlations ranged from -0.269 to 0.834, with three variables identified as collinear; alkalinity, conductivity, and hardness.Conductivity was selected because of its use in previously developed fecal indicator models (Wilkes et al., 2011;Gonzalez et al., 2012;Gonzalez and Noble, 2014;Piorkowski et al., 2013) Manuscript to be reviewed removal of dissolved oxygen improved model performance based on testing AUC, justifying development of two optimized multivariate models; a 5-variable model including all sensitive parameters, and a 4-variable model that excluded dissolved oxygen.
Accuracy of the 5-and 4-variable optimized model was 77.8 %.The patterns of variable contribution were consistent in each model, with water temperature accounting for most of the information gain in each model.Figure 3 shows the variable contribution for the initial multivariate models, each model run during the sensitivity analysis, and the final 4-variable and 5-variable models produced.The information gain for each model is also shown within this figure.
Response surfaces were developed for each of the model runs to assess spatiotemporal trends.
Each grid within the surface represents a single sample, with each sampling site representing a single column.The columns are oriented in a downstream fashion, with headwaters sites starting on the left (SC14) and sites further downstream existing on the right (SC1).The temporal scale is represented by the rows, with each row indicating a specific sampling trip.
Although the data resolution is coarse, the goal is to demonstrate the potential of visualizing trends in the probability of impairment over space and time.Figure 4  Manuscript to be reviewed Generally, the sites influenced by the greatest amount of developed or agricultural land use (SC5 -SC1) had the highest probability of impairment.August had the highest probability of impairment, followed by May, November, and February.Mean probability of impairment and associated 95% confidence intervals are shown in table S3.

Discussion
Over 170,000 miles of U.S. rivers and streams are listed as pathogen impaired based on FIOs.Manuscript to be reviewed The fact that water temperature dominated the information in this model suggests that seasonality plays an important role in E. coli survival.Although fecal indicators and pathogens have been found to possess diverse temperature-survival relationships (Hofstra, 2011;Sterk et al., 2013), the high August probability for E. coli impairment indicates favorable conditions for long-term survival in the summer.Warming due to climate change could exacerbate this condition by increasing those favorable conditions (Weniger et al., 1983;Atherholt et al., 1998;Patz et al., 2000;Guzman Herrador et al., 2015).However, August was not the only month with numerous E. coli impairments.Therefore, monitoring for FIOs only in the summer months could distort estimates of impairment in watersheds with year-round users.
Although discharge was not predictive of E. coli impairment alone, its interactive effect stresses the importance of both flow dependent and independent processes associated with E. coli impairment.Dissolved solutes such as NO 3 and ions measured through conductivity are largely discharge-dependent; however, FIOs are not as strongly dependent on discharge.This flow independence is due to additional ecological mechanisms such as nutrient limitation and competition (Surbeck et al., 2006;Drummond et al., 2015).Various forms of nitrogen are associated with increased concentration of FIOs in certain environments (Carrillo et al., 1985;Herrig et al., 2015), and results of the Maxent models suggest that nutrient loading in the form of NO 3 contributes to E. coli impairment in Sinking Creek.Other studies have found that dissolved organic carbon can affect magnitude and extent of fecal indicators (Surbeck et al., 2010;Blazewicz et al., 2013;Cloutier et al., 2015), but this was not collected during this PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) Manuscript to be reviewed sampling program and was found to be insignificant using BOD as a surrogate for organic pollution.This interaction between nutrient levels and fecal pollution highlights the potential for synergistic effects of different sources of pollution, suggesting a limitation of TMDL development when only considering one pollutant at a time.
Although machine learning application to microbial water quality problems is not unique, this study presents some beneficial techniques in this area of research.First, it demonstrates the ability to open the black box of Maxent, using action values to predict threshold of impairment based on a single variable.Multivariate action functions can be developed as well, but is not presented in this manuscript.The probabilistic approach to model validation and variable selection allows for inclusion of uncertainty, improving on deterministic methods traditionally used for validation and criteria for variable inclusion.Probabilistic methods have been used in TMDLs (Borsuk et al., 2002), frequency of water quality posting errors (Kim and Grant, 2004), and uncertainty of different fecal indicator methodologies (Gronewold et al., 2008); this paper adds to this framework through identifying the probability of stream impairment given a set of environmental variables.This improves confidence in decision-making for implementation of monitoring, management, and remediation strategies.Modeling microbial water quality is a challenge no matter the method used, but this study demonstrates that Maxent provides a valid approach to understand the factors driving impairment.
Streams are dynamic systems with multiple flow regimes, confounding an already difficult modeling process.Understanding how models behave in extreme situations is useful for PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) Manuscript to be reviewed regulation, monitoring, and management of these ecosystems.Over the long-term study periods, samples from both drought and high water conditions were captured.Maxent has been suggested as a strong prediction of extreme values (Petrov et al. 2013), and this study found that Maxent sufficiently predicted impairment during the high flow sampling date of November 11, 2009.Depending on which multivariate model was used, accuracy ranged from 72.8 % to 90.9 % for this sampling date.Five sampling dates resulted in at least one site being dry, indicating drought-like conditions.Maxent correctly predicted impairment in these situations 62.2 % to 73.0 % of the time.This suggests that Maxent can be useful for certain extreme situations, but is highly dependent on the environmental variables used for prediction.
While this study presents proof of concept of using Maxent to infer source and mechanisms of impairment, there are some limitations to this study.Although the dataset has a large time scale (8 years), only collecting from 4 months makes the resolution coarse, reducing the scale at which inferences can be made.The list-wise deletion of samples before univariate modeling removed some data that could inform each of those models; however, using the same series of data in the multivariate models and list-wise deletion are commonly used procedures in statistical models.Future applications of Maxent will improve on the coarse resolution of the data by using monthly and potentially weekly sampling approaches, and research will be developed as to the best approach for handling missing data in Maxent.Manuscript to be reviewed these parameters are informative as a proof of concept for using Maxent as a modeling approach.Future areas of research include using Maxent to optimize water quality monitoring to identify causes of impairment with FIOs and specific pathogens in the most cost-effective way using a variety of microbial, chemical, and physical parameters.

Conclusions
Characterizing E. coli impairment is essential because of the plethora of streams polluted with fecal wastes.This study used Maxent to identify water quality parameters associated with E.
coli impairment in a low-order, mixed-use watershed.Univariate models generated action values, or thresholds of impairment, based on single parameters, while multivariate models extracted information concerning multivariate interaction.We presented a probabilistic Manuscript to be reviewed approach to sensitivity analysis, improving confidence in variable selection.Maxent presents a flexible machine learning approach to aid in understanding mechanisms and sources of fecal pollution as well as a host of other complex decision boundary problems.We demonstrated that:  Models using alkalinity and water temperature were found to be either the most accurate or best performing univariate models; this stresses the importance of discharge composition and seasonality in E. coli impairment.Discharge, however, was not an influential univariate parameters by itself, stressing the importance of flowindependent processes that correlate with impairment.
 Sensitivity analysis indicated that the most information was lost when conductivity was removed from the multivariate models, and water temperature, discharge, dissolved oxygen, and NO 3 represent other sensitive parameters sensitive to E. coli impairment in this watershed.
 Results suggest that E. coli impairment in this stream is driven by seasonality and agricultural run-off.This suggests that multi-month sampling along a stream continuum  Manuscript to be reviewed Manuscript to be reviewed This represents the mean probability of 100 bootstrapped runs.Rows are oriented by each sampling period, while columns represent each sampling site over the length of the stream; left to right indicates flow direction.Black cells denote samples in which a parameter was missing and were excluded from analysis, and while circles with black centers represent samples in which a stream would be identified as impaired in the study.Manuscript to be reviewed   Manuscript to be reviewed Variable contribution and permutation importance for the multivariate models, normalized to percentages.
PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) Manuscript to be reviewed Manuscript to be reviewed information gain associated with a change in each feature for each iteration of the model algorithm, normalized to percentages.The permutation importance of a feature is an indicator of variable sensitivity.In each model run, the feature training presence and background data are randomly permutated, and the resulting drop in training AUC is normalized to percentages for each variable and averaged over the 100 bootstrapped runs.A jack-knife sensitivity analysis was used to determine the best subset of covariates to include in a trimmed model.Each variable was removed from the analysis, and a comparison was made to determine if the removal of a variable caused a significant (p<.05) change in training or PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) It is a difficult task to develop and implement remediation strategies in watersheds with many diffuse causes of fecal impairment, but modeling can increase confidence in decision making through inferring mechanisms and sources of fecal pollution.Incorporating environmental variables into models allows for insights into the ecology of fecal indicators, identifying causes of chronic FIO impairment.Although one indicator cannot mimic the plethora of existing pathogens in water, incorporating modeling can fine tune an indicator's utility, ultimately informing the public concerning health risks, and aiding in overcoming the shortcomings of a single indicator monitoring strategy.


is essential to characterize spatiotemporal variability, importance of flow in relation to other water quality parameters, and the potential synergistic effect of nutrient and fecal pollution.Incorporating modeling can fine tune an indicator's utility, informing the public concerning human health risks, enhancing our understanding of FIOs, assisting in water PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018)Manuscript to be reviewed quality decision-making, and providing input variables for quantitative microbial risk assessment.

Figure 3
Figure 3 Summary of training and testing performance of Maxent models.Performance is based on AUC metrics, accuracy is based on the maximum test sensitivity and specificity decision boundary (logistic threshold), and action values with 95% confidence intervals.If an upper bound of a confidence interval exceeds the maximum sampling value for a set of data, the maximum value is given.*Model was not significant #Values of the variables that corresponded to impairment $ 95% CI for the lower bound of the action value &95% CI for the upper bound of the action value PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018)Manuscript to be reviewed Table2.Summary of training and testing performance of Maxent models based AUC metrics, accuracy based on maximum test sensitivity and specificity decision boundary (logistic threshold), and action values with 95% confidence intervals.If an upper bound of a confidence interval exceeds the maximum sampling value for a set of data, the maximum value is given.*Model was not significant #Values of the variables that corresponded to impairment $ 95% CI for the lower bound of the action value &95% CI for the upper bound of the action value study is unique in the following ways: it focuses on how the water PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018) The motivation for using Maxent to predict E. coli impairment is to investigate how environment, i.e water quality parameters, shapes the niche of E. coli impairment based on a decision boundary; in this case, a water quality standard.A probabilistic procedure for univariate and multivariate model development is presented using nonparametric bootstrapping cross-validation.Univariate models generated action values, or environmental thresholds of impairment, to indicate potential E. coli impairment based on a single parameter.
sampling program resulted in 29 sampling trips over 14 sites, allowing for a potential of 406 samples for analysis.127 samples were removed due to missing information in the dataset, leaving 279 total samples for model development.This included 95 impairments, identified by exceedance of the E. coli recreational water quality standard.Table 1 presents the summary statistics for E. coli and the associated land use in each sampling site's drainage area.Each training set included 279 background points, 76 points for training, and 19 points to evaluate performance on testing data.Table 2 summarizes the training and testing performance of the univariate Maxent models used to identify E. coli impairment based on environmental variables.Water temperature performed best based on AUCs, but had lower accuracy than conductivity, dissolved oxygen, and alkalinity.
Seasonality is demonstrated by influence of temperature in the models, whereas the influence of agricultural run-off is suggested by the other variables and the association between land use and E. coli in the watershed.Even small increases in agricultural land cause substantial increases in E. coli concentrations (Table1), whereas similar increases in developed land do not have the same pronounced effect.This To address these impairments, characterization of sources and transport mechanisms is necessary (United States Environmental Protection Agency, 2017), and statistical models can be used as an inferential tool to overcome these issues.We applied Maxent to identify individual and interacting factors influencing E. coli fate and transport that resulted in impairments using univariate and multivariate approaches.In this particular stream, water temperature, conductivity, discharge, and NO 3 were found to be the most influential group of factors driving fecal pollution.The results indicate that seasonality and agricultural run-off are the suggested causes of impairment in this watershed.studyhighlights the need for multi-month sampling across a stream continuum to truly estimate spatiotemporal variability associated with impairment.PeerJ reviewing PDF | (2018:02:25580:1:1:CHECK 30 Jul 2018)

Table 1 .
Sampling sites, land use, and E. coli concentrations in Sinking Creek.Percentage of each land cover types (Agricultural, Developed, and Forested) as well as E. coli Geometric means (GM), geometric standard deviations (GSD), and maximum and minimum values for each site used in the study.

Table 3 .
Variable contribution and permutation importance for the multivariate models, normalized to percentages.