A data‐driven predictive model for disinfectant residual in drinking water storage tanks

A data‐driven approach is developed and proven for ranking the risk of low disinfection residual in water distribution storage tanks, 1 month ahead. The forecasting methodology uses water quality data collected from drinking water treatment plants, storage tank outlets, and rainfall data as inputs. This methodology was developed and tested with data from a water utility serving more than 5 million people. Results show high‐risk category prediction accuracy of 75%–80%. Using a final year of unseen validation data, more than 90% of the storage tanks ranked in the top 20 by the forecasting methodology experienced low disinfectant residual in the following month. Storage tanks are critical water distribution system infrastructure that are currently managed reactively. The adoption of such readily transferable machine learning approaches enables direct proactive management strategies and efficient interventions that can help ensure drinking water quality.


| INTRODUCTION
Disinfection of drinking water is typically the last stage of the treatment process before water reaches the drinking water distribution system (DWDS). Disinfection is critical for ensuring drinking water is free from bacterial contamination. A residual of the chosen disinfectant is usually retained in the final treated water to limit bacteriological regrowth and provide some protection against contamination events within the DWDS. The most commonly used disinfectant is free chlorine, but some systems use chloramines as the residual to reduce the formation of disinfection by-products (DBPs) (Mian et al., 2018). The concentration of disinfectant residual is generally highest in the water exiting the drinking water treatment plant (DWTP), and the disinfectant is consumed by reactions with water quality constituents as water travels through the DWDS. Disinfectant residual consumption is caused by chemical, biological, and physical reactions in the bulk water and at the pipe wall, which combine with extended water residence time (or water age) to cause loss of residual (Al-Jasser, 2007; Vasconcelos et al., 1997).
Low disinfectant residual concentrations indicate that either excessive reactions over a short period or moderate reactions over an extended period have occurred, and they signal an increased risk of regrowth or of possible but rare contamination events. Water utilities are therefore required to monitor disinfectant residuals as part of regulatory compliance. This is typically achieved using discrete samples taken from across the DWDS, from the DWTPs to storage tanks and consumers' taps. Although online monitoring is growing in application, it is still used less frequently than discrete sampling. In addition to disinfectant residual, other key water quality variables are monitored in different jurisdictions (e.g., iron, manganese, turbidity, coliform bacteria, heterotrophic plate counts), depending on the regulations.
Storage tanks, commonly known in the United Kingdom as Service Reservoirs (SRs), are fundamental components of the DWDS. They are used for reserving water when demand increases, for balancing pressure fluctuations in the DWDS, for providing continuous supply to consumers when an event or a shutdown occurs in some part of the system, and for meeting emergency demands for fire suppression (Brandt et al., 2017). The water residence time in SRs varies from a couple of hours to days depending on the size of the SR, the size of its related DWDS, and the associated demand. Increased SR residence time can and does influence water quality, degrading the disinfectant residual over time. In the worst case, when microbial contaminants are also present in the storage tank, waterborne disease outbreaks and loss of life can result, as happened in Gideon, MO (Clark et al., 1996; Craun & Calderon, 2001). Moreover, Ellis et al. (2018) found that the number of bacteriological failures at SR outlets was double that at DWTPs, indicating the elevated water quality risk that storage tanks can bring to the large populations that they serve. Therefore, consistent monitoring of the water quality exiting these infrastructure components is required to understand their performance and hence to maintain water safety standards before the water reaches consumers' taps.
Water utilities use disinfectant residual sample data to provide assurance of the safety of drinking water and to prioritize interventions in the DWDS such as flushing or SR cleaning. However, this can only ever be a reactive approach and cannot preempt and prevent problems; once low chlorine events are detected, the risk of bacteriological contamination in the DWDS is already elevated. There is hence a need for the ability to predict low disinfectant residual events, particularly in SRs, for proactive protection of public health. There is a family of models that attempt to quantify our understanding of the mechanisms of chlorine decay and that have accurately predicted water distribution system hydraulics and water quality for decades, provided they are well calibrated and maintained (Clark & Sivaganesan, 2002; Rossman et al., 1994; Speight & Boxall, 2015). These can be limited in applicability by uncertainty and by the site-specific coefficients required, particularly the uncertainty of reactions and interactions associated with the pipe wall (Vasconcelos et al., 1997). SRs, in contrast, are dominated by bulk water reactions, with coefficients potentially estimated from jar tests. However, there is less understanding of the residence time distributions within SRs, and hence modeling water quality in storage is more complicated, sometimes requiring tools such as computational fluid dynamics modeling to be applied and well validated for a comprehensive prediction of disinfectant residual (Grayman et al., 2004). There is, however, sufficient historical water quality data available that an alternative approach to the prediction of chlorine residuals, based on machine learning techniques, could usefully be attempted.
Machine learning (ML) is a type of artificial intelligence that develops algorithms based on statistical knowledge to understand patterns in data (MathWorks, 2016). ML algorithms use existing data (training dataset) to create and train models for either understanding data trends (unsupervised learning) or predicting trends based on new unseen data (supervised learning). Supervised learning techniques can be further split into regression and classification categories based on the type of prediction required. Regression techniques are used to predict numerical values, while classification techniques determine an output category (class), such as those used in email software to classify junk email. ML algorithms have been successfully applied in many different fields such as natural language processing, image recognition, finance, policing, and engineering (Jordan & Mitchell, 2015).
Data-driven models based on ML algorithms are trained using available data to connect inputs and outputs with no need for understanding or specifying the complex processes that occur between input and output. In drinking water quality applications, ML algorithms have been successfully used to predict high iron concentrations in DWDS (Kazemi et al., 2023; Mounce et al., 2017), short-term turbidity trends in water distribution trunk mains (Kazemi et al., 2018; Meyers et al., 2017), bacteriological contamination indicators in the DWDS (Mohammed et al., 2017), and factors that cause discoloration (Speight et al., 2019). For prediction of disinfectant residual, Gibbs et al. (2006) used artificial neural networks (ANNs) to understand the connections between chlorine decay at customers' taps and various water quality parameters, and there have been some advances using various ML methods for the prediction of chlorine losses in the DWDS (Garcia et al., 2020; Kyritsakas et al., 2023; Xu et al., 2019). This growing body of knowledge demonstrates that data-driven models, especially those using ML methods, have the potential to transform the sparse water quality monitoring data collected from DWDS into useful information for water utilities. However, to the best of the authors' knowledge, no data-driven research has been conducted to investigate water quality deterioration in SRs, so this study represents an exploration of that gap.

Article Impact Statement
A machine learning model is developed for predicting low disinfection residual events in storage tanks 1 month in advance with 75%–80% accuracy, using historical monitoring samples.

This paper investigates the ability of ML algorithms to predict low chlorine events in SRs. A new methodology is proposed to inform utilities about SRs that are at high risk of experiencing low chlorine, with sufficient time to implement interventions. The outputs are SR risk rankings for the following month based on their relative probability of low chlorine events. This type of information enables water utilities to prioritize their interventions toward the high-risk SRs, transforming reactive management into a proactive practice.

| METHODOLOGY
A data-driven model was developed for the classification of SRs into "High-Risk" or "Low-Risk" classes using historical water quality data from SR and DWTP outlets. ML techniques were applied to data from a large water utility that operates multiple systems in the north of the United Kingdom. Ensemble decision tree ML algorithms were used for the classification, and their results were compared using performance metrics.

| Case study dataset and data preprocessing
The data used in this investigation was taken from a large corporate database of a utility located in the north of the United Kingdom. This utility serves water to more than 5.4 million people via a complicated DWDS consisting of more than 250 DWTPs, more than 1000 SRs, and approximately 45,000 km of pipes. The disinfection type is mainly chlorination, but some systems have switched to chloramination due to the presence of DBPs. The dataset contained water quality data from the DWTP and SR outlets, collected mostly for regulatory purposes during the period January 2012 to November 2021. Turbidity data measured in the raw water before reaching the DWTPs were also included. The dataset contained more than 450,000 SR water quality samples, more than 330,000 DWTP water quality samples, and more than 35,000 raw water turbidity samples.
In the United Kingdom, the regulations require four samples per month for every active SR to measure chlorine concentration (total and free), coliform bacteria, and heterotrophic plate counts (HPCs) at 22 and 37 °C. There are no specific requirements regarding metals and other important water quality variables such as total organic carbon (TOC) and turbidity (DWI, 2020; DWQR, 2019). Detection of coliform bacteria in SR water quality samples is rare (only 0.14% in this dataset) and therefore they were excluded from this analysis. Turbidity and metals data at the SRs were also excluded as there was insufficient data. However, there were a large number of flow cytometry total and intact cell counts (FC_TCCs, FC_ICCs) and temperature data available, collected since January 2015. This data was included in the analysis. The DWTP dataset variables that were selected for analysis were free chlorine, total chlorine, TOC, temperature, FC_TCCs, and FC_ICCs. Temperature, FC_TCC, and FC_ICC measurements in the DWTPs also started in January 2015.
Rainfall has been associated with bacteriological events in the DWDS (Curriero et al., 2001; Kumpel & Nelson, 2013). Therefore, to examine the influence of rainfall in the prediction model, rainfall data covering the complete area of supply for the examined period were collected from the Met Office gauging stations in close proximity to the water utility's DWTPs and SRs (Met Office, 2021).
The raw dataset required preprocessing before being used in the predictive model. This involved three steps: association of the SRs with the other assets; identification of the low chlorine events; and selection of the temporal scale of the analysis.

| Association of service reservoirs with the other water infrastructure
The association between the SRs, the DWTPs that supplied them, and the raw water that fed the DWTPs was achieved using the water utility's asset management information. The association of each SR and DWTP with a Met Office gauging station was achieved using a nearest neighbor Euclidean search. Given the geographically sparse nature of the case study utility's water systems (across a country scale), it is unlikely that a distance-based search would match rainfall gauges to DWTPs or SRs in the wrong systems, so this approach was not verified beyond spot checks.
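As an illustration of the gauge association step, the following is a minimal sketch of a nearest neighbor Euclidean search. It assumes asset and gauge locations are available as planar (x, y) coordinates in a projected system; the asset names and coordinates are hypothetical, and the utility's actual implementation is not described in this paper.

```python
import math

def nearest_asset(point, candidates):
    """Return the id of the candidate closest to `point` (Euclidean distance).

    point: (x, y) tuple in a projected coordinate system (assumed, e.g. metres).
    candidates: dict mapping asset id -> (x, y).
    """
    return min(candidates, key=lambda cid: math.dist(point, candidates[cid]))

# Hypothetical coordinates, for illustration only.
gauges = {"gauge_A": (0.0, 0.0), "gauge_B": (10_000.0, 0.0)}
reservoirs = {"SR_1": (1_000.0, 500.0), "SR_2": (9_000.0, -200.0)}

# Map every SR to its closest rainfall gauge.
sr_to_gauge = {sr: nearest_asset(xy, gauges) for sr, xy in reservoirs.items()}
```

The same function could be reused to match DWTPs to gauges; a spatial index would only become worthwhile for much larger asset counts.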

| Identification of low chlorine events
Low chlorine events were defined as samples in which the measured chlorine was below a certain threshold. This threshold was different for chlorination and chloramination systems. A low chlorine event in the chlorination systems was defined as a sample in which the free chlorine concentration was below 0.25 mg/L, given that the WHO guideline and the utility's suggested chlorine concentration at consumers' taps is 0.2 mg/L (WHO, 2000). For the chloramination systems, the minimum threshold was set to 0.7 mg/L of total chlorine, as per the utility's suggestion. Based on those two thresholds, the low chlorine events were calculated separately for the chlorinated and the chloraminated systems.
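The event definition above can be expressed as a simple rule. The sketch below uses the two thresholds stated in the text; the sample field names (`free_cl`, `total_cl`) and the system type labels are assumptions made for illustration.

```python
FREE_CL_THRESHOLD = 0.25   # mg/L free chlorine, chlorinated systems (from the text)
TOTAL_CL_THRESHOLD = 0.7   # mg/L total chlorine, chloraminated systems (from the text)

def is_low_chlorine(sample, system_type):
    """Flag a sample as a low chlorine event for the given disinfection type.

    sample: dict with hypothetical keys 'free_cl' and 'total_cl' (mg/L).
    system_type: 'chlorinated' or 'chloraminated' (hypothetical labels).
    """
    if system_type == "chlorinated":
        return sample["free_cl"] < FREE_CL_THRESHOLD
    if system_type == "chloraminated":
        return sample["total_cl"] < TOTAL_CL_THRESHOLD
    raise ValueError(f"unknown system type: {system_type!r}")
```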

| Selection of the temporal scale of the analysis
A monthly scale was selected, reflecting the low frequency of low chlorine events and because a monthly prediction time-step gives just sufficient time for proactive interventions. All the variables used as inputs were therefore monthly averaged. The monthly sum of low chlorine events per SR was also calculated.
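A minimal sketch of this monthly aggregation is given below for a single variable. The tuple layout of the samples and the output field names are assumptions for illustration; the study aggregated all input variables in the same way.

```python
from collections import defaultdict
from statistics import mean

def monthly_summary(samples):
    """Aggregate discrete samples to one record per SR and month.

    samples: iterable of (sr_id, 'YYYY-MM', free_cl_mg_per_l, is_low_event).
    Returns {(sr_id, month): {'mean_free_cl': float, 'low_events': int}}.
    """
    grouped = defaultdict(list)
    for sr_id, month, free_cl, low in samples:
        grouped[(sr_id, month)].append((free_cl, low))
    return {
        key: {
            "mean_free_cl": mean(value for value, _ in rows),
            "low_events": sum(1 for _, low in rows if low),
        }
        for key, rows in grouped.items()
    }

# Hypothetical samples, for illustration only.
samples = [
    ("SR_1", "2012-01", 0.40, False),
    ("SR_1", "2012-01", 0.20, True),
    ("SR_1", "2012-02", 0.50, False),
]
summary = monthly_summary(samples)
```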

| Low chlorine event prediction model
The predictive model was designed using a supervised classification approach. Inputs and corresponding output classes were required for training. The input data were the monthly averaged values of the water quality variables. The output class was the group that the SR belongs to, which could be either the no low chlorine event class (Low-Risk) or the event class (High-Risk). The High-Risk class included all SRs that had at least one low chlorine event in the month. The monthly scale design of the model implies a one-month lag between the inputs (variables) and the outputs (classes). So, for example, for January 2012 inputs the corresponding output classes were those of February 2012.
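The one-month lag between inputs and output classes can be sketched as follows. This is an illustrative pairing routine under assumed data structures (dictionaries keyed by SR and 'YYYY-MM' month); the function names are hypothetical.

```python
def next_month(month):
    """Return the 'YYYY-MM' string for the month after `month`."""
    year, m = map(int, month.split("-"))
    return f"{year + (m == 12)}-{m % 12 + 1:02d}"

def build_training_pairs(features, low_events):
    """Pair month n-1 inputs with the month n risk class.

    features: {(sr_id, month): feature_vector} of monthly averaged inputs.
    low_events: {(sr_id, month): number of low chlorine events}.
    """
    pairs = []
    for (sr, month), x in sorted(features.items()):
        target_key = (sr, next_month(month))
        if target_key in low_events:  # skip months with no following observation
            label = "High-Risk" if low_events[target_key] >= 1 else "Low-Risk"
            pairs.append((sr, month, x, label))
    return pairs
```

For example, January 2012 inputs for an SR are labelled with the class observed for that SR in February 2012.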
The model used different ensemble decision trees as the ML algorithms for its prediction. Ensemble decision trees use multiple weak decision trees to improve the predictive model performance (Rokach, 2010). Ensemble decision trees were selected because of their "white-box" approach, as they provide a human-interpretable justification of the prediction results: a variable importance analysis and a class probability for each SR. Two main ensemble decision tree approaches were compared, random forest and boosting (Dietterich, 2000). The model used the inputs and outputs of the years 2012 to 2020 for training, and the year 2021 was then used for testing its predictive performance. As the chlorinated and the chloraminated systems have different water quality behavior, the final input dataset was split into two subgroups, the chlorinated SRs and the chloraminated SRs. The model was trained and tested for these two groups separately. A simplified diagram of the model is shown in Figure 1.

| Input-output variables
The output variable for month n is the class that each SR belongs to, either High- or Low-Risk. The associated input variables are the monthly averaged values of the various water quality parameters and precipitation for month n-1. In recognition that monthly average values obscure potentially useful information, five further calculated variables were included and their influence on model accuracy was explored. These were the monthly standard deviation of free chlorine and of total chlorine per SR, the age of water exiting the SRs (given by the water utility, based on design flows and SR dimensions, as the sum of cascading SR retention times), and the average total and free chlorine per SR in the previous year (n-12). For each output variable (High- or Low-Risk class) at month n, 21 different input variables were used. The input and output variables for a given month n are presented in Table 1.
A schematic that shows the inputs and outputs for a given month n is presented in Figure 2.

| Ensemble decision tree methods
The ensemble decision tree methods tested and compared for their performance in this work were classification random forest (CRF), the main subspace algorithm; AdaBoost, the main boosting algorithm; and two other boosting algorithms, LogitBoost and RusBoost (Breiman, 2001; Friedman et al., 2000; Seiffert et al., 2008).
CRF consists of a number of independent weak decision tree classifiers which contribute equally to the final decision (in classification each tree has one vote). In CRF, each weak tree is grown on a randomly selected sample of the data, and the split at each node is made using one variable from a small randomly selected subset of the total variables. Boosting is an ensemble method that also creates a strong classifier from a number of weak decision trees, but in this case the weak classifiers do not contribute equally to the final decision and each depends on the previous ones. In AdaBoost (adaptive boosting), each newly generated tree aims to correct the errors made by the previous tree by increasing the weights of the misclassified data in the random sampling. Therefore, each new tree is dependent on the previous one. The final decision is made using the weighted contribution of each tree. RusBoost (random under-sampling boosting) uses the AdaBoost algorithm in combination with a random under-sampling method to create more balanced datasets. Finally, LogitBoost (adaptive logistic boosting) follows the AdaBoost approach but applies the logistic regression cost function instead of minimizing the exponential loss. CRF and RusBoost were successfully applied in two different research papers for the prediction of iron exceedance in district distribution areas (Kazemi et al., 2023; Mounce et al., 2017).
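The difference between equal voting and weighted voting can be illustrated with a short sketch. The tree weight used here, alpha = 0.5 ln((1 - e) / e) for a tree with weighted error e, is the textbook discrete AdaBoost formula and is an assumption about the general technique; the exact weighting used by the MATLAB implementation in this study may differ. Votes and error values are hypothetical.

```python
import math

def majority_vote(votes):
    """Random-forest style decision: every tree casts one equal vote.

    votes: list of +1 (High-Risk) or -1 (Low-Risk).
    """
    return 1 if sum(votes) > 0 else -1

def adaboost_alpha(weighted_error):
    """Textbook discrete AdaBoost tree weight; requires 0 < weighted_error < 1."""
    return 0.5 * math.log((1.0 - weighted_error) / weighted_error)

def weighted_vote(votes, errors):
    """Boosting style decision: more accurate trees carry more weight."""
    score = sum(adaboost_alpha(e) * v for v, e in zip(votes, errors))
    return 1 if score > 0 else -1

# Hypothetical example: one accurate tree disagrees with two weak trees.
votes = [+1, -1, -1]
errors = [0.10, 0.40, 0.45]
equal_decision = majority_vote(votes)             # two weak trees win the vote
weighted_decision = weighted_vote(votes, errors)  # the accurate tree dominates
```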
The predictive model was developed in MATLAB 2021b using the Statistics and Machine Learning Toolbox. The number of trees and the minimum number of observations per tree leaf for both the CRF and the boosting methods were set to 1000 and 1, respectively. In CRF, the number of randomly selected variables per node was set equal to the square root of the total number of available variables, following Breiman (2001); that is, when all 21 variables were used for training, the model randomly selected five variables (√21 ≈ 4.6, rounded up) from which the split decision at the node was made. In boosting, the learning rate was set equal to 0.1.

| Data bias and data augmentation methods
The dataset was heavily unbalanced toward the Low-Risk class for both the chlorination and the chloramination SRs, as Table 2 indicates. In general, data bias could skew the training of the model and produce inaccurate predictions. To tackle this problem two types of methods were explored: oversampling the minority class (High-Risk) and under-sampling the majority class (Low-Risk). For the former approach, synthetic (artificial) data of the High-Risk class were required. Two different oversampling methods were used, the synthetic minority oversampling technique (SMOTE) and the adaptive synthetic method (ADASYN). SMOTE generates synthetic data by calculating the difference vector between a random sample of the minority class and one of its neighboring minority samples, multiplying it by a random weight, and adding the result to the original sample (Chawla et al., 2002). ADASYN follows the SMOTE approach, but instead of selecting the minority sample randomly, the minority sample is selected based on the number of majority class samples in its neighborhood, with higher priority given to those surrounded by more majority class samples (He et al., 2008). For the under-sampling approach, the method selected in this work was the random removal of samples belonging to the majority class.
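The two families of balancing methods can be sketched in a few lines. The code below is a simplified SMOTE-style interpolation and a random under-sampler written for illustration only; it is not the implementation used in this study, it omits ADASYN's density-based selection, and the parameter names (`n_new`, `k`, `seed`) are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by interpolation.

    minority: list of equal-length numeric tuples (needs at least 2 points).
    Each synthetic point lies on the segment between a random minority
    sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random weight in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority class samples."""
    return random.Random(seed).sample(majority, n_keep)
```

Because synthetic points are interpolated between existing minority samples, they stay within the region already occupied by the High-Risk class rather than extrapolating beyond it.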
The predictive model was initially tested using the four aforementioned ML algorithms (tests 1-4). Then, each augmentation method was used to reduce the Low-Risk/High-Risk bias and the predictive model was retrained using the updated dataset (tests 5-16). The level of bias reduction was matched to the ML algorithm, in particular for RusBoost, as it is designed for use with unbalanced datasets.

| Performance metrics
Metrics were used to evaluate the performance of the different ML algorithms. The simplest performance metric is Accuracy; however, this was not applied due to the disproportionate number of Low-Risk samples compared to High-Risk ones. Instead, three performance metrics were selected to cover different aspects of the model behavior: true positive rate (TPR), precision, and Matthews correlation coefficient (MCC). TPR is the fraction of actual events that were correctly predicted (true positives), and precision is the ratio of correct positives to the total number of predicted positives (true positives and false positives). These two metrics were used to evaluate the ability to predict High-Risk SRs and the ability to generate a small number of false positives. MCC is a metric that quantifies the overall performance and balance of the model, as it uses all four possible prediction results (true positives, true negatives, false positives, and false negatives) for its calculation (Baldi et al., 2000). MCC values lie between −1 and 1, with −1 indicating complete disagreement between observations and predictions and 1 complete agreement between observations and predictions. The formulas for these metrics are as follows:

\[ \mathrm{TPR} = \frac{TP}{TP + FN} \]

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

\[ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]

where TP, TN, FP, and FN are the True Positives (correctly predicted events), True Negatives (correctly predicted nonevents), False Positives (nonevents incorrectly predicted as events), and False Negatives (events incorrectly predicted as nonevents), respectively. Given that each metric provides information about a different aspect of model performance, the best models were selected using the average value of the three metrics. Using the average, we can see which model predicts the highest number of High-Risk class SRs (high TPR) without creating a large number of false positives (high precision) and with good overall balanced performance (high MCC).
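For reference, the three metrics and their average can be computed from the four confusion matrix counts as in the sketch below. It uses the standard definitions of TPR, precision, and MCC; the function name and the convention of returning 0 for MCC when the denominator is zero are assumptions for illustration.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Return TPR, precision, MCC and their mean from confusion matrix counts.

    Assumes at least one actual event (tp + fn > 0) and at least one
    predicted event (tp + fp > 0).
    """
    tpr = tp / (tp + fn)
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed here): MCC is taken as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "TPR": tpr,
        "Precision": precision,
        "MCC": mcc,
        "Average": (tpr + precision + mcc) / 3,
    }
```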

| Variables importance
To examine the importance of the various input variables on the model's performance, two different approaches were applied. Firstly, the models with a performance metrics average higher than 0.7 were selected and rerun using only the variables indicated by the ML analysis as the most important ones, based on the form of the decision trees (i.e., selected for inclusion in the top levels of the tree and thus scoring higher in the ensemble decision tree variable importance test during the training period). Secondly, the importance of each unique variable in the top two overall models was investigated by retraining the model iteratively without using one variable at a time, then comparing the MCC of the new test with the MCC of its corresponding model when all the variables were used.

| RESULTS

| Model results
In the chlorinated systems, SMOTE and ADASYN were used to generate synthetic data equal to four times the amount of the minority class (High-Risk class) data when AdaBoost, LogitBoost, and CRF were used as ML algorithms, and equal to two times the minority class data when RusBoost was used. This is because RusBoost includes an under-sampling method in its algorithm and is therefore intended to be applied to unbalanced datasets. In the chloraminated systems, there are more SRs included in the Low-Risk class and, thus, SMOTE and ADASYN were used to generate synthetic data equal to two times the minority class data. The random under-sampling approach was used to generate a completely balanced dataset to train the AdaBoost, LogitBoost, and CRF algorithms and to generate a dataset with a Low-Risk to High-Risk class ratio equal to 2 to train the RusBoost algorithm.
The performance metrics of the 16 different initial tests in each of the disinfection systems are presented in Tables 3 and 4. The model tests are named using the acronyms Cl and Clm for the chlorination and the chloramination systems, respectively. For the chlorinated SRs, the best tests, based on the average performance metric (=0.8), were Cl.3, which used AdaBoost with no augmentation method, and Cl.6, which used CRF with SMOTE as an augmentation method. The former test has a higher MCC and precision, which indicates that it is a more stable model, and the latter test has a higher TPR. For the chloraminated systems, the best test was Clm.10 with a performance metrics average of 0.75. Overall, the tests in the chlorinated systems produced better results than those in the chloraminated systems, as indicated by both the MCC and the performance metrics average.
In both types of systems, the AdaBoost tests with augmentation methods produced the highest TPR results, reaching 91% TPR (Cl.7, Cl.11, Cl.15, Clm.7, Clm.11, Clm.15). However, these tests also had low MCC (0.46-0.66) and precision (0.32-0.6) values, indicating that they also produced many false positives that reduced the model's stability. RusBoost without augmentation (Cl.1 and Clm.1) performed better than all the other models in chloramination SRs and better than all except AdaBoost when no augmentation was used. However, RusBoost behaved differently in the two disinfection systems when augmentation methods were introduced, as its performance increased in the chlorinated systems (Cl.5, Cl.9, Cl.13) and decreased in the chloraminated systems (Clm.5, Clm.9, Clm.13). CRF and LogitBoost had the lowest TPR in both disinfection systems when no augmentation method was used, but when SMOTE and ADASYN were introduced, their TPR increased, which also improved their average performance by 4% to 8%. However, these two algorithms behaved differently when the under-sampling method was used, as CRF increased its average performance in both systems while LogitBoost decreased its average performance.

| Variable importance
A critical element of this study was to understand the need for monitoring of different water quality parameters, and thus an analysis of variable importance was performed. In the chlorinated systems, three CRF models (Cl.6, Cl.10, Cl.14), three RusBoost models (Cl.1, Cl.5, Cl.9), one AdaBoost model (Cl.3), and two LogitBoost models (Cl.8 and Cl.12) were selected to be retrained using only their most important variables as indicated by the form of the decision trees. In the chloraminated systems, the selected top performing models were three CRF models (Clm.6, Clm.10, and Clm.14) and three RusBoost models (Clm.1, Clm.5, Clm.13). The important variables and their relative importance were different in each of these models. Free chlorine was consistently one of the most important variables in the chlorinated system models, as was total chlorine in the chloraminated system models.
Examples of the variables' importance in two different tests, test Cl.3 and test Cl.6, are shown in Figure 3. This graph shows that the contribution of monthly average free chlorine (FreeCl) in the Cl.3 model was more significant than that of any other variable. Comparison of the two graphs in Figure 3 clearly shows how the variables' contribution and importance differ from model to model. For example, in Cl.3 the Temperature_DWTP variable is the second most important variable, but in Cl.6 this variable is one of the least important contributors. The retraining of these models was made using only the variables that had a significant contribution relative to other variables. Thus, the number of variables differs in the new tests. Our approach was to include a minimum of four important variables in the second batch of tests even if the importance of one variable was far higher than that of the other variables. We set the maximum number of variables equal to 10 if the importance of these variables was equivalent. The new tests' results are presented in Table 5 (chlorinated systems) and Table 6 (chloraminated systems). In the variables column, the variables used for the training are shown in their order of importance in the initial tests (where all variables were used).
These new tests had worse average performance results than their equivalent initial ones (all variables used), except for test Cl.25, which performed better than Cl.14, and Clm.18, which performed equally to test Clm.6. The worst performance drop appeared in the LogitBoost models, where performance was reduced by up to 26%. This drop could potentially be explained by the fact that in both LogitBoost tests the free chlorine variable (Cl) is absent. However, in most cases the performance drop was insignificant, which indicates that the model could be applied with fewer input variables, and consequently with less computational effort for data preprocessing and training, and still produce accurate results (Cl.18, Cl.20, Clm.19, Clm.20, Clm.22).
In the chlorination systems, the above results indicated that the AdaBoost tests without any data augmentation (Cl.3 becoming Cl.18) had the best performance. They provided the best performance without requiring any additional computational cost for balancing the training dataset, unlike test Cl.6, where SMOTE augmentation is required. In terms of computational cost and data preprocessing time, Cl.18 could be considered the best model, as it provided insignificantly worse results (an average of just 1%) compared with Cl.3 but uses only its most important variables (Table 5).

A sensitivity analysis was conducted based on the best four tests (Cl.3, Cl.18, Clm.6, Clm.18) by retraining them with one permuted variable at a time. For each new model, the difference between the MCC value of the retrained model and the MCC value of the initial test was calculated. Thus, the larger the negative MCC difference, the higher the variable's contribution to the model. A positive difference indicates that the variable has a negative influence and should be removed from the input dataset. The results of this process are shown in Table 7.
For all the CRF tests (chloraminated systems Clm.6, Clm.18), each unique variable makes some contribution to the model's prediction, as Table 7 shows. However, for the AdaBoost model Cl.3 there were variables whose absence increased the model's MCC, indicating that they could be excluded from the training input. These variables were the turbidity in the raw water (1.6% improvement), the free chlorine in the DWTPs (1.1% improvement), and the ICCs in the DWTPs (1.3% improvement). In chlorinated systems, free chlorine was the most important variable for all the tests, as its absence reduced the MCC by 13% for Cl.3 and 17% for Cl.18. In chloraminated systems, total chlorine was the most important variable, as its absence reduced the MCC by 10% for Clm.6 and 12% for Clm.18. Overall, the average MCC drop in the CRF tests was 4% and in the AdaBoost tests was 1%. This indicates that, apart from the chlorine variables (free for chlorination, total for chloramination), the models could still produce reliable results when one of the other input variables is not available.
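The logic of this sensitivity analysis can be sketched as a loop over the input variables. The sketch below uses a leave-one-variable-out variant and takes a hypothetical `train_and_score` callable that trains a model on the given variables and returns its test MCC; the toy scorer and its contribution values are invented purely to make the example runnable and are not results from this study.

```python
def mcc_sensitivity(variables, train_and_score):
    """Return {variable: MCC without it minus baseline MCC}.

    train_and_score: hypothetical callable(list_of_variables) -> MCC.
    A negative difference means the variable contributes to the model;
    a positive difference flags it as a candidate for removal.
    """
    baseline = train_and_score(variables)
    return {
        v: train_and_score([u for u in variables if u != v]) - baseline
        for v in variables
    }

# Toy stand-in for model training (illustrative values only).
toy_contribution = {"FreeCl": 0.13, "Temperature_DWTP": 0.02, "Turbidity_Raw": -0.016}

def toy_train_and_score(selected):
    return 0.5 + sum(toy_contribution[v] for v in selected)

differences = mcc_sensitivity(list(toy_contribution), toy_train_and_score)
removal_candidates = [v for v, d in differences.items() if d > 0]
```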

| Refining the chlorine model
The variables' importance test for Cl.3 indicated that AdaBoost with no augmentation could improve its performance if one of FreeCl_DWTP, ICC_DWTP, or Turbidity_Raw is removed. Therefore, the performance results of these three tests are presented in Table 8. The same table presents the performance results of another test in which all three variables were removed (Cl.29).
Table 8 indicates that the overall average performance was improved when one of these variables was permutated.In addition, these three tests improved each one of the performance metrics with the Cl.26 test having the best overall performance improvement.However, there was no performance improvement when all three of them were permutated (Cl.29).Cl.26 was the best out of all these tests and overall it was the best out of all the predictive model tests implemented in the chlorinated systems SRs.

| Validation of model results using risk ranking
To verify the model's performance, data for January through November 2021 were used to predict SR low residual events for the following month, and the predictions were compared with actual system measurements. Table 9 shows the number of correctly predicted Low-Risk SRs and High-Risk SRs for the best two models for both disinfection types. Note that this table shows the overall results for all the SRs, that is, there are 11 different class predictions for each SR, one for each month.
Ensemble decision trees output a probability of an SR being in one of the two classes, based on the number of trees in the ensemble that vote for that class. Therefore, the larger the number of trees classifying an SR in the High-Risk class, the higher the probability of low residual for that SR. The risk ranking for each SR per month was implemented using the best predictive model for each disinfection system, which were, as Table 9 demonstrates, Cl.26 for the chlorination SRs and Clm.10 for the chloramination SRs. Table 10 shows how many of the top-20 highest risk SRs, as ranked by the predictive model, actually experienced low residual in that month during the investigation period. It should be borne in mind that the top-20 risk ranking list contains different SRs in each month of the investigation period.
The model performance (Table 10) was impacted by months with relatively few (fewer than 20) low residual events, such as March, April, and June. The arbitrary selection of the 20 highest risk SRs might not be appropriate for triggering interventions in all cases and could perhaps be modified to better fit the actual occurrence of low residual events over time.
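The vote-based risk ranking can be sketched as follows. This is a simplified illustration, assuming each tree casts one unweighted vote (boosted ensembles such as AdaBoost usually weight their trees); the function names, SR identifiers, and vote lists are hypothetical.

```python
def high_risk_probability(tree_votes):
    """Fraction of trees voting High-Risk (1) for one SR."""
    return sum(tree_votes) / len(tree_votes)


def top_k_ranking(votes_by_sr, k=20):
    """Rank SRs by descending High-Risk vote fraction and return the top k ids.
    Ties are broken alphabetically so the ranking is reproducible."""
    scores = {sr: high_risk_probability(v) for sr, v in votes_by_sr.items()}
    ranked = sorted(scores, key=lambda sr: (-scores[sr], sr))
    return ranked[:k]


def hits_in_top_k(ranked, observed_low):
    """Count ranked SRs that actually experienced low residual that month."""
    return sum(1 for sr in ranked if sr in observed_low)


# Invented example: four trees vote on three SRs for one month.
votes = {
    "SR_A": [1, 1, 0, 1],
    "SR_B": [0, 0, 0, 1],
    "SR_C": [1, 1, 1, 1],
}
ranking = top_k_ranking(votes, k=2)
print(ranking)
print(hits_in_top_k(ranking, observed_low={"SR_A", "SR_X"}))
```

Because `k` is a parameter, the list length could be adapted to months in which fewer than 20 low residual events are expected.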

| Comparing the best performing models with the last month's measurement approach
A final comparison was made between the best predictive models and the "last-month" approach that is commonly used by water utilities to prioritize interventions in SRs. For this comparison, it was assumed that an SR that failed in one month will also fail in the following month (e.g., an SR that fails in February will also fail in March), and the same performance metrics were used. The results of the "last-month" approach for both the chlorination and the chloramination SRs are presented in Table 11. These results clearly indicate that this approach performs worse than all the model tests presented in Sections 3.1 and 3.2.
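The "last-month" baseline and the three performance metrics used in this paper (true positive rate, precision, and MCC, together with their average) can be sketched as below. The data layout, function names, and example histories are assumptions made for illustration only.

```python
import math


def last_month_pairs(history):
    """history maps an SR id to its monthly 0/1 low-residual record.
    The baseline predicts each month's class as the previous month's class."""
    y_true, y_pred = [], []
    for sr in sorted(history):
        series = history[sr]
        for previous, current in zip(series, series[1:]):
            y_pred.append(previous)
            y_true.append(current)
    return y_true, y_pred


def performance(y_true, y_pred):
    """True positive rate, precision, MCC, and their plain average."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"tpr": tpr, "precision": precision, "mcc": mcc,
            "average": (tpr + precision + mcc) / 3}


# Invented four-month records for two SRs (1 = low residual observed).
history = {"SR_A": [1, 1, 0, 0], "SR_B": [0, 1, 1, 0]}
y_true, y_pred = last_month_pairs(history)
scores = performance(y_true, y_pred)
print(scores)
```

Feeding a model's predictions into `performance` in place of `y_pred` gives a like-for-like comparison with the baseline.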

| Machine learning to predict low residual events in service reservoirs
The ML-based predictive models for both the chlorination and the chloramination SRs provided high levels of performance, with only a few of the tests having unacceptable average performance metrics. The best chlorination model used the AdaBoost ML algorithm with 20 of the 21 available variables (only turbidity in the raw water was excluded), without an augmentation method to account for data bias. The best chloramination model used a CRF ML algorithm with all 21 available variables and the ADASYN augmentation method.
Overall, CRF-based models outperformed the other algorithms. There was just one CRF test with an average performance below 0.7. Among the other models, RUSBoost was the second-best algorithm overall, AdaBoost the third, and LogitBoost the last. AdaBoost precision dropped significantly (by up to 50%) when augmentation methods were used, a finding which indicates that, when the test dataset is unbalanced, generating a fully balanced training dataset using augmentation methods misleads the training of the AdaBoost algorithm into predicting many false positives. Similar behavior was noticed for the LogitBoost algorithm, which also increased its false positive predictions when the training imbalance was reduced using augmentation methods. In addition, this algorithm's performance dropped significantly when the input variables were reduced, and for this reason LogitBoost was the worst algorithm for this water quality problem.
Water distribution systems are complex reactors for water quality, with a number of competing and ill-defined processes taking place concurrently. A prediction with an accuracy of approximately 80% is significantly better than most other predictive approaches to modeling water quality in water storage tanks.

| Input variables
The importance of the variables was different for each model, with free chlorine and total chlorine measured in the storage tank in the previous month being consistently among the most significant variables in the chlorination and chloramination systems, respectively. Intact cell counts, raw water turbidity, and disinfectant residual at the DWTP all contributed to improving the prediction, but their impact was less significant, and excluding one or more of these additional variables would still result in a valid and useful model for this particular dataset. This finding emphasizes the need to perform disinfectant residual monitoring at storage tanks, with even weekly grab samples being sufficient to support this type of predictive model.
The tests exploring permutations of the input variables indicated that all the variables make some contribution to the model's decision trees and, thus, it could be suggested that the more variables are included, the better the model's performance could be. Given the number of different potential causes of low chlorine events in a specific SR, including for example low chlorine in DWTP finished water, changes in water age, and elevated temperature, the performance of the ML approach for utility-wide prediction in over 1000 SRs is remarkable. The ability of the top performing ML models to make quite accurate predictions without other information about the cause (and without requiring the user to presuppose a cause) is a key strength of the approach. It is likely that different model permutations predict certain types of failures better than others. This could also explain why certain parameters that could plausibly have an impact (e.g., raw water turbidity) have not played a significant role in these models: the corresponding root causes of SR low chlorine events are not as common in the dataset.
Future work could therefore extend the variables considered here, for example, to include TOC and metals data that were not available for this research, or to consider subsets of the SR assets grouped geographically. This future work should not necessarily be restricted to data collected and managed by the water utility. Rainfall data were found to contribute to all models, but with significant variation across the different model formulations. Overfitting to input parameters is also a possibility in these types of ML modeling approaches; thus, the importance analysis is a critical step to include in any such analysis.
The analysis showed that flow cytometry was a useful input variable. However, it is not a parameter required by regulations, so it may not be widely available at DWTPs or SRs. It is important to note that usefully accurate predictions were not dependent on the availability of flow cytometry parameters.

| Practical implications
One of the advantages of ensemble decision trees is that they offer human-interpretable insight. Each weak decision tree of the ensemble algorithm can be extracted as an image that shows, for example, the split criteria of the data in the first two leaves of the decision tree. Operators and decision-makers can examine such information over a number of the weak decision trees to gain an understanding of the reasons that the predictive model made a certain classification decision. The risk ranking list that the predictive model generates could be used to direct the water utility's interventions toward the SRs that consistently fail. However, it is important to recognize that these types of ML models do not provide details about the root cause of each individual predicted SR low chlorine event. Identifying the factors that caused disinfectant residual deterioration would require a further analysis of its causes, or the development of a model with root cause identification as its goal.
This research does not definitively recommend one particular model for all utilities to predict low residual events, as the results indicated that there will always be a trade-off between increasing the TPR and decreasing the MCC and precision by generating false positives. The optimal trade-off between these metrics is heavily dependent on the requirements of the proactive management approach that water utilities would like to follow and should be decided at a managerial level, based on the financial resources available for interventions.
Collecting and integrating the various data to create the final dataset used for the predictive model required significant effort and collaboration between the authors and the utility. However, once the dataset was completed, the time required to train the model with a different ML algorithm was less than 10 min. Hence, the investment and change required to unlock the potential of these types of data-driven applications, to better manage drinking water quality and more, is in better storage, linkage, and accessibility of existing data.
Classification approaches, like risk ranking, are a useful outcome for utilities in the management of water quality. Based on weekly grab samples from tanks, the ability to precisely predict a numerical disinfectant residual in the following month would be limited. However, the utility does not need a precise numerical prediction of disinfectant residual to take action; an indication of high risk for low residual is sufficient. A list of the top 20 highest risk SRs is helpful in prioritizing maintenance activities. Many utilities use historical events or the previous month's measurements as the prioritization criteria for current interventions. However, it has been demonstrated that machine learning methods like the one used in this study can perform better than the historical event approach (Kyritsakas et al., 2023).
This paper presents a data-driven predictive model that uses only available water quality monitoring data, so there was no requirement for hydraulic or other mechanistic models or for the calibration associated with their use. These machine learning approaches (random forest and boosting were selected here) can be transferred and applied to other utilities if there are sufficient discrete water quality monitoring data over a time scale suitable to capture past events, which are critical for the learning aspects of the approach. Similarly, these types of machine learning approaches have been demonstrated as valuable for prediction and risk ranking of other drinking water quality parameters beyond disinfectant residual. There is significant potential for machine learning methods to mine useful and actionable information from historical drinking water quality data.

| CONCLUSIONS
This paper demonstrated the ability of data-driven methodologies to predict the disinfectant residual risk class of SRs 1 month in advance. Water quality data collected from different parts of the DWDS and rainfall data were utilized as inputs in the model, and different machine learning methodologies and formulations were applied to decrease the imbalance of the dataset (most of the SRs belonged to the low-risk class) and to identify the key variables that yield better predictions. Based on the results, the key findings are:
• The predictive model reached an overall average performance (of true positive rate, precision, and Matthews correlation coefficient) of 0.82 for the chlorination SRs and 0.76 for the chloramination SRs.
• The importance of different input variables, and their combination, was complex and varied across the different model formulations explored. The selection methodology identified the input variables required to produce results with minimal performance drop and decreased computational time.
• The input variable sensitivity analysis indicated that free chlorine and total chlorine are consistently among the most important input variables in the chlorinated and the chloraminated systems, respectively.
The month-ahead risk ranking outputs of the model are well suited to enabling proactive management by providing sufficient early warning to arrange for additional sampling, flushing, cleaning, or other interventions for the highest ranked SRs. The machine learning approaches applied are of the "open box" type; hence, the form of the decision trees can be interrogated through the parameter importance analysis to gain a deeper understanding of the factors and mechanisms that might have contributed to water quality deterioration, further targeting strategic management and decision-making. The data-driven nature of the approach means that the methodology is generic; it could be readily applied to the SRs of other water utilities or to different areas of the DWDS, depending on data availability. Methodologies like the one presented are an important first step on the pathway toward the Digital Water era.

FIGURE 1 A schematic simplified diagram of the model implementation.

FIGURE 2 Schematic for inputs and outputs for a certain month n.

TABLE 2 Number and percentage of events in the monthly scaled dataset for January 2012-November 2021.

FIGURE 3 Variables' importance for tests Cl.3 (left) and Cl.6 (right). The importance estimates are calculated using the decrease in node impurity at each tree.

TABLE 5 Prediction accuracy with reduced number of variables for the year 2021 in the chlorinated systems.

Input and outputs of the predictive model for month n.

TABLE 1 Prediction accuracy in the chloraminated systems for the year 2021.

Prediction accuracy with reduced number of variables for the year 2021 in the chloraminated systems.

Importance of each variable in the predicting model best tests for chlorinated (left) and chloraminated systems (right).
five input variables in the training period. In the chloramination systems, the best test was Clm.10 with a performance average of 0.75. There were three other models with a performance metrics average of 0.73 (Clm.6, Clm.14, Clm.18), with Clm.18 being considered better than the other two as it required fewer input variables.

TABLE 6

TABLE 9 Best models service reservoir predictions for the testing period (January 2021-November 2021).

TABLE 11 Last month approach performance metrics.

a February 2021 had 13 low chloramination events.
b March 2021 had zero low residual events in both systems.