Development of water quality prediction model for water treatment plant using artificial intelligence algorithms

Hwisu Shin; Yonghoon Byun; Sangwook Kang; Hitae Shim; Sueyeun Oak; Youngsuk Ryu; Hansoo Kim; Nahmchung Jung

doi:10.4491/eer.2023.198

Research

Environmental Engineering Research 2024; 29(2): 230198.

Published online: June 25, 2023

Hwisu Shin, Yonghoon Byun, Sangwook Kang, Hitae Shim, Sueyeun Oak, Youngsuk Ryu, Hansoo Kim, Nahmchung Jung^†

Technological Development Department, Dohwa Engineering, 06178, Republic of Korea

^†Corresponding author: E-mail: ncj9191@dohwa.co.kr, Tel: +82-2-6490-2872

Received April 7, 2023 Revised June 19, 2023 Accepted June 24, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Accurate prediction of water quality changes in the water treatment process is important for optimal decision-making in the design, operation, and diagnosis of water treatment facilities. This study developed artificial intelligence (AI) algorithm models that predict dissolved organic carbon (DOC) removal and disinfection byproduct formation, and compared their predictions with those of existing empirical models to examine the applicability of AI algorithm techniques to water quality prediction in the water treatment process. We also enhanced the empirical models for predicting DOC removal and disinfection byproduct formation using data from a water purification plant in Korea. Six AI algorithm techniques were applied and tested using real-world data. All AI algorithm models outperformed the original empirical models in predicting DOC removal and byproduct formation. For DOC prediction, the multi-layer perceptron (MLP) showed the best performance (R² = 0.9795; root mean square error [RMSE] = 0.0365 mg/L). MLP also showed the best performance in predicting disinfection byproduct formation (R² = 0.9781; RMSE = 0.0008 mg/L). The prediction performance of the AI algorithm models improved with larger sample sizes. With data samples covering approximately 1 year, these models were confirmed to outperform the empirical models.

Keywords: AI algorithm, Disinfection byproduct, DOC, Empirical model, Water treatment

1. Introduction

Water treatment facilities must stably produce purified water of the required quantity and quality. Such facilities should be planned considering various conditions such as facility standards under the Water Supply and Waterworks Installation Act, water source and raw water quality, target water quality, purification methods, surrounding environments, and operation and management techniques. Among these conditions, planners and engineers place particular emphasis on selecting the processes and scales that achieve the purified water quality targets for a given raw water quality. As the Republic of Korea faces large seasonal variations in raw water turbidity, frequent algal occurrences, and strict water quality standards, it has become important to prepare techniques for selecting water treatment methods and process combinations that minimize risks at the design stage of water treatment facilities. The ideal approach is to install pilot plants that combine multiple water treatment processes, conduct experiments on single processes or pilot tests to review the impacts of successive processes, and then select the most effective water treatment process [1]. However, as this approach is time-consuming and costly, its application at the design stage is limited. As an alternative, data-based decision-making techniques via prediction modeling have been studied [2]. Along with the recent expansion of water quality sensors and remote monitoring systems in water treatment facilities, more studies have applied AI algorithms to determine operating parameters (e.g., coagulation, precipitation, and disinfection) or to predict treated water quality by using data on water treatment process operation.
Wu and Lo [3] developed a daily poly-aluminum chloride (PAC) dosage rate determination model using multi-layer perceptron (MLP) and adaptive network–based fuzzy inference system (ANFIS) by setting input variable scenarios for coagulant dosage rates, raw water turbidity, filtered water turbidity, and filtered water chromaticity/color. Heddam et al. [4] developed an alum dosage rate determination model by using radial basis function (RBF) and general regression neural network (GRNN) and setting raw water turbidity, conductivity, pH, water temperature, dissolved oxygen concentration, and UV absorbance as input variables. Kennedy et al. [5] used RBF, GRNN, simulated annealing MLP (SA-MLP), and genetic algorithm MLP (GA-MLP) to predict the turbidity of effluents from settling basins that change in line with the dosages of disinfectant, PAC, and alum in the coagulation treatment process at water treatment facilities. Wang et al. [6] used the random forest (RF) technique to develop a model to predict the turbidity of effluents from settling basins by setting dissolved oxygen concentration, oxygen consumption, raw water pH, raw water temperature, influent flow rates, coagulant dose rate, and raw water turbidity as input variables. Dunnington et al. [7] applied existing physically based model and AI models (e.g., multiple linear regression, gradient-boosted regression tree ensemble, and neural network) to predict dissolved organic carbon (DOC) concentrations remaining after the coagulation process, and then examined prediction accuracy, analysis possibility, and accessibility of AI algorithm in the water treatment processes. Evidently, efforts have been made regarding the application of diverse AI techniques to determine the main factors for optimal design, operation, and management of water purification plants, and to predict impacts on treatment capacities and efficiency in line with variations in raw water quality.

This study utilized AI algorithm techniques, predicted water quality changes and treatment efficiency in the water treatment process, and compared the prediction accuracy of the AI algorithm model with existing empirical models, to examine the applicability of AI algorithm techniques in optimal design and operational decision-making processes of water purification plants. We targeted DOC and disinfection byproducts (i.e., the main water quality items of interest in water treatment and the targets to be controlled) to examine the applicability of the techniques.

In water treatment, DOC influences the quality of treated water and the selection, design, and operation of treatment processes. In terms of process operation, DOC increases the demand for coagulants and oxidants required for water treatment. Additionally, it increases membrane fouling and decreases activated carbon usage [8, 9]. From the perspective of treated water quality, organic matter, together with the chlorine injected for disinfection and for the control of iron and manganese in water purification plants, influences disinfection byproduct formation and biological stability in water distribution systems [10]. Therefore, understanding the changes and characteristics of DOC concentrations in raw water is essential for meeting the target tap water quality and improving treatment efficiency in water purification plants. Several empirical and semi-empirical models of DOC removal in the coagulation process have been suggested [7,11–14]. However, these models were derived from Jar-Test results without considering differences in raw water characteristics or the various operational and environmental factors that can affect DOC removal in the coagulation/precipitation process of actual water purification plants; hence, they are less accurate when applied to water purification plants in Korea.

In addition, DOC causes taste and odor in the water treatment process and reacts with chlorine in the disinfection process, generating disinfection byproducts such as trihalomethane (THM) and haloacetic acid (HAA) [15]. The US EPA classifies several disinfection byproducts, such as CHCl₃, bromoform, and dichloroacetic acid, as Group B2, i.e., probable human carcinogens [16]. Therefore, it is crucial to predict and suppress the formation of disinfection byproducts in the water treatment process. In this respect, the US EPA proposed disinfection byproduct formation models based on chlorine injection points in the water treatment process. However, those models are based on the operational characteristics of water purification plants in the US, which differ from conditions in Korea, and their application scopes do not match the operational features of water purification plants in Korea; hence, there are limitations in directly applying these models in Korea.

Therefore, in this study, we utilized data related to actual operation and water quality measurement in water purification plants in Korea, and developed empirical DOC removal and disinfection byproduct formation models which are appropriate for the conditions in Korea. Furthermore, by utilizing AI algorithm techniques such as decision tree (DT), RF, support vector machine (SVM), k-nearest neighbor (KNN), gradient boosting regressor (GBR), and MLP, we predicted DOC removal in the coagulation process and the formation of disinfection byproducts in clear well, compared each empirical model and AI algorithm technique, and examined the applicability in the water treatment process.

2. Research Methods

In this study, AI algorithm models for water quality prediction in the water treatment process were developed, the applicability of several AI algorithms was examined by comparing the training results with the test results, and the usefulness of the AI algorithms was evaluated against empirical models. The research process for developing the AI algorithm and empirical models is presented in Fig. 1. First, the data for developing each model are collected and the collected data set is preprocessed. Developing an AI model includes separating the training and test data, selecting appropriate hyperparameters for each algorithm, deriving the best AI algorithm, and predicting water quality. The empirical models were formulated considering various operational and environmental conditions, and their parameters were estimated using the Newton-Raphson method.

2.1. Raw Data & Preprocessing

In this study, we analyzed measured and experimental water quality data for each process in the M water purification plant in Korea. The M water purification plant supplies an average of 190,000 m³/d of purified water. This plant uses the standard treatment process of admixture, coagulation, precipitation, and filtration, linked with the advanced treatment processes of post-ozone and activated carbon filtration; purified water is finally produced through chlorine disinfection.

We collected data including sensor-based water quality measurements, experimental, flow, and meteorological data to predict DOC removal and formation of disinfection byproducts per water treatment process. Data were collected for 3 years (from January 5, 2015 to December 30, 2019) and are presented in supplementary material (Table S1).

Before developing the model, a process called data preprocessing is performed to remove and compensate for outliers and missing values that may occur due to various factors such as mechanical problems, communication errors, and equipment maintenance in the data collected for each process. Outliers are defined as negative values in measurement variables, values of unreasonable magnitude and values that are consistently the same as other water quality parameters, considering the characteristics of the measurement variables.
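The screening rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the preprocessing code used in the study: the function name `mask_outliers` and the `upper_limit` threshold are hypothetical, and the rule about values that merely duplicate another water quality parameter is not covered.

```python
def mask_outliers(values, upper_limit):
    """Replace missing, negative, or implausibly large readings with None.

    `upper_limit` is a hypothetical plausibility bound chosen per variable.
    """
    cleaned = []
    for value in values:
        if value is None or value < 0 or value > upper_limit:
            cleaned.append(None)  # mark for later removal or compensation
        else:
            cleaned.append(value)
    return cleaned


# Hypothetical readings containing a negative value, a spike, and a gap.
readings = [0.5, -1.0, 2.0, 999.0, None]
print(mask_outliers(readings, upper_limit=10.0))
```

How the masked entries are then compensated (dropped, interpolated, and so on) is not specified in the text and is left open here.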

2.2. AI Algorithm Model

AI algorithms are widely used in data analysis and prediction modeling. They allow computers to learn relationships in existing data, which can then be used to analyze new data and make predictions. Various AI algorithm techniques have been developed and applied, and the most suitable algorithm depends on the data features and the problem type. For example, SVM and KNN are often used for classification problems, while linear regression and decision trees (DT) are often used for regression problems, although SVM, KNN, and DT each have both classification and regression variants. Additionally, ensemble models such as random forest (RF) and gradient boosting regression (GBR) can be used depending on the data features. Therefore, it is crucial to understand the respective features of the algorithms and select appropriate ones before analysis.

In interdisciplinary areas such as water treatment, where large quantities of data are explored to discover meaningful information in the form of empirical models, patterns, and rules, different techniques based on hydraulic, chemical, and biological theories are used. Because empirical models are built on simplifying assumptions about varied and complex water treatment mechanisms and on instantaneous experimental data that do not reflect all operating and environmental conditions, decision makers question the modeling results. Owing to these weaknesses of empirical models based on limited experimental data, and to the wide range of environmental problems, AI algorithms have emerged as a promising research direction for making sense of multivariate and time series data [17]. Such AI algorithms have proved attractive in many areas. In this study, we utilized DT, RF, SVM, KNN, GBR, and MLP models to examine the applicability of AI algorithm techniques in the water treatment process. We predicted DOC removal in the coagulation/precipitation process and the formation of disinfection byproducts in the clear well. The characteristics of the machine learning techniques used in this study are summarized in Table S2.

2.2.1. AI algorithms

2.2.1.1. DT

DT is a supervised learning model that uses specific decision rules to classify data or perform regression; as the resultant model has a tree structure, it is referred to as a DT. The basic unit of the tree structure is referred to as a node. DT is useful because it is easy to understand and interpret, which makes even complex models easy to inspect.

In this study, we used a DT model based on Scikit-learn. We used the classification and regression trees (CART) algorithm to create binary trees. ‘Random state’ was set to 42 so that the same results are obtained from the same parameters whenever a DT is created, and the final number of ‘Leaf Node’ was set to 10.

2.2.1.2. RF

RF is a type of ensemble-learning method utilized for classification and regression analysis, and it is possible to output classification or regression analysis results from multiple DTs constructed in the learning processes. RF has several strengths in that it can calculate the importance of characteristics of the data through node impurity, be utilized for classification and regression problems, and enables easy treatment of missing values.

In this study, we used a Scikit-learn-based RF model. We applied the mean squared error (MSE) as the criterion for evaluating the quality of splits in the model. “n_estimators” refers to the number of trees in the RF; values from 100 to 1000 were considered to achieve good performance and accurate prediction. We compared results while changing the value within this range and selected “n_estimators” = 500.

2.2.1.3. SVM

SVM is a model that defines a baseline for classification, i.e., a decision boundary, and can be utilized for linear and non-linear classification. Classification relies on the support vectors, i.e., the training data points closest to the decision boundary, and on a hyperplane that serves as the boundary. The model is fitted by finding the hyperplane that maximizes the margin (i.e., the distance between the support vectors and the decision boundary). Its strengths are that it is less prone to overfitting and is easier to use than neural networks.

In this study, we used the Scikit-learn-based SVM model. A non-linear relationship between input values and target variables was modeled by selecting the RBF from among the available kernel functions.

2.2.1.4. KNN

The KNN algorithm is one of the simplest machine-learning algorithms and is used under the assumption that data with similar features tend to belong to similar categories. When new data are given, the prediction is made from the k nearest neighbors among the existing data. It is simple, efficient, and rapid in the training process, and shows excellent performance with numerical data.

In this study, we used the Scikit-learn-based KNN model. “n_neighbors” is the number of nearest neighbors to be used for prediction; five “n_neighbors” were used for the model. As the weight function used for applying predicted values to the model, we used “uniform” under the condition that all neighbors had the same weight.
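To make the neighbor-averaging idea concrete, the sketch below implements KNN regression with uniform weights in plain Python. It illustrates the technique only and is not the Scikit-learn implementation used in the study; the function name `knn_predict`, the Euclidean distance, and the sample data are assumptions for this example.

```python
import math


def knn_predict(train_x, train_y, query, n_neighbors=5):
    """Predict the target of `query` as the unweighted mean of its nearest neighbors."""
    # Rank training points by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:n_neighbors]
    # "uniform" weighting: every selected neighbor contributes equally.
    return sum(train_y[i] for i in nearest) / len(nearest)


# Hypothetical one-feature data set; the last point is far from the query.
train_x = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 3.0, 4.0, 100.0]
print(knn_predict(train_x, train_y, query=(2.0,), n_neighbors=5))
```

With five neighbors the distant point is excluded from the average, which is why the choice of “n_neighbors” matters for noisy data.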

2.2.1.5. GBR

The GBR algorithm is a machine-learning algorithm that combines several weak models (such as DTs) into a stronger model, which improves accuracy compared to the individual models. It is effective for complex data sets since it can handle non-linear relationships between variables.

In this study, we used the Scikit-learn-based GBR model. As the model is robust to overfitting, we applied 10,000 to “n_estimators”, indicating the number of boosting stages, and set “max_depth” (i.e., the maximum depth of the individual regression estimators) to 4. We also selected five as the minimum number of samples required to split an internal node, and applied 0.01 (i.e., lower than the default value of 0.1) to “learning_rate”, representing the contribution of each tree, for more accurate prediction.

2.2.1.6. MLP

MLP consists of several sequentially connected layers of perceptrons and is referred to as a feedforward artificial neural network. It is composed of input, hidden, and output layers, and there can be more than one hidden layer. After multiplying the input value (X) by the weight (W) and adding the bias (B), the resulting value (Y) is derived through an activation function. Since it has non-linear layers, it can represent highly complicated models, and it shows excellent performance when hyperparameters are appropriately adjusted [18].

In this study, we used the Scikit-learn-based MLP model. We used relu as the activation function, which is well suited to non-linear model prediction. We applied 1000 to “max_iter”, representing the maximum number of iterations, and 0.01 to “alpha” (the regularization strength) to prevent overfitting. The hidden layers, which are an important factor in the model, had a 2-layer structure with 50 or 100 neurons, and “n_iter_no_change”, the number of consecutive iterations without improvement after which training stops, was set to 50.
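The layer computation described above (multiply by W, add B, apply the activation) can be sketched as a forward pass in plain Python. This is a minimal illustration, not the Scikit-learn model used in the study: the helper names and the tiny weights are hypothetical, and the output layer is assumed to be linear, as is usual for regression.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def dense(x, weights, biases):
    """Compute W.x + B for one layer; each row of `weights` feeds one neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def mlp_forward(x, layers):
    """Apply relu after each hidden layer and leave the output layer linear."""
    *hidden, (w_out, b_out) = layers
    for weights, biases in hidden:
        x = relu(dense(x, weights, biases))
    return dense(x, w_out, b_out)


# Hypothetical network: 2 inputs, one hidden layer of 2 neurons, 1 output.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # hidden layer (W, B)
    ([[2.0, 3.0]], [0.5]),                   # output layer (W, B)
]
print(mlp_forward([1.0, -2.0], layers))
```

Training adjusts W and B to reduce the prediction error; only the prediction step is shown here.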

2.2.2. Data separation

The preprocessed data were divided into a ‘Training set’ and a ‘Test set’ at a ratio of 0.7 to 0.3, and the machine learning techniques used in this study (DT, RF, SVM, KNN, GBR, and MLP) were trained using the training data. Subsequently, the trained models were validated using the test data to generate a water quality prediction model for the water treatment plant.
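A random 0.7/0.3 split of this kind can be sketched as follows. The function name `split_train_test`, the shuffling procedure, the seed of 42, and rounding the test size up are assumptions for illustration; the text does not state how the split was drawn.

```python
import math
import random


def split_train_test(rows, test_ratio=0.3, seed=42):
    """Shuffle row indices reproducibly and hold out `test_ratio` of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_ratio * len(rows))  # assumed: test size rounded up
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# 1048 records, the total reported for the DOC removal model in Section 3.1.2.
train, test = split_train_test(list(range(1048)))
print(len(train), len(test))
```

Under these assumptions the split of 1048 records gives the same 733 and 315 counts that Section 3.1.2 reports for training and testing.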

2.2.3. Machine learning and testing

Hyperparameters are variables set before training that control how a model learns, and they play a crucial role in determining the model’s performance. In this study, the hyperparameters for each AI algorithm were determined as shown in Table S3.
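For orientation, the hyperparameter values quoted in Sections 2.2.1.1 to 2.2.1.6 can be gathered into a single configuration, keyed by the Scikit-learn regressor to which they would be passed as keyword arguments. This is a sketch of the settings mentioned in the text, not a reproduction of Table S3. The choice of the regressor classes, the mapping of ‘Leaf Node’ to `max_leaf_nodes`, and the `(50, 100)` hidden layer sizes are assumptions.

```python
# Values quoted in the text; anything marked "assumed" is an interpretation.
HYPERPARAMETERS = {
    "DecisionTreeRegressor": {
        "random_state": 42,
        "max_leaf_nodes": 10,  # assumed mapping of the 'Leaf Node' setting
    },
    # The MSE split criterion is omitted because its string name
    # differs between Scikit-learn versions.
    "RandomForestRegressor": {"n_estimators": 500},
    "SVR": {"kernel": "rbf"},
    "KNeighborsRegressor": {"n_neighbors": 5, "weights": "uniform"},
    "GradientBoostingRegressor": {
        "n_estimators": 10000,
        "max_depth": 4,
        "min_samples_split": 5,
        "learning_rate": 0.01,
    },
    "MLPRegressor": {
        "activation": "relu",
        "max_iter": 1000,
        "alpha": 0.01,
        "hidden_layer_sizes": (50, 100),  # assumed; the text says "50 or 100 neurons"
        "n_iter_no_change": 50,
    },
}

for name, params in HYPERPARAMETERS.items():
    print(name, params)
```

Each entry could be unpacked into the corresponding estimator, for example `GradientBoostingRegressor(**HYPERPARAMETERS["GradientBoostingRegressor"])`, provided Scikit-learn is installed.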

2.3. Empirical Model

2.3.1. Empirical DOC removal model

In the coagulation/precipitation process, DOC is removed along with suspended solids and colloidal substances in water through the precipitation of flocs formed by the physicochemical action of coagulants. Edwards [11] developed an empirical DOC removal model based on a Langmuir model through a Jar-Test. The model enables classification of DOC as sorbable and non-sorbable, and the ratio can be shown by the SUVA function, as represented by Eq. (1).

(1)

f_{\mathrm{NDOC}} = K_{1} \cdot \mathrm{SUVA}_{\mathrm{raw}} + K_{2}

where f_NDOC is the fraction of nonsorbable DOC, K₁ is empirical fitting constant 1 [mg·cm/L], K₂ is empirical fitting constant 2, and SUVA_raw is the specific UV absorbance of raw water [L/mg·cm]. Once K₁ and K₂ are determined, the sorbable DOC can be calculated using Eq. (2).

(2)

\mathrm{DOC}_{\mathrm{sorb}} = (1 - f_{\mathrm{NDOC}}) \cdot \mathrm{DOC}_{\mathrm{raw}}

where DOC_sorb is the sorbable DOC concentration [mg/L] and DOC_raw is the raw water DOC concentration. When coagulants come into contact with DOC, the final distribution of DOC between the adsorbent and the solution is reached when the system is in equilibrium. The DOC removed from water is the sorbable DOC minus the sorbable DOC remaining at equilibrium. Based on the Langmuir model, the adsorption equilibrium between coagulant dosage and DOC can be expressed as Eq. (3).

(3)

\frac{\mathrm{DOC}_{\mathrm{sorb}} - \mathrm{DOC}_{\mathrm{eq}}}{C_{d}} = \frac{a \cdot b \cdot \mathrm{DOC}_{\mathrm{eq}}}{1 + b \cdot \mathrm{DOC}_{\mathrm{eq}}}

where DOC_eq is the sorbable DOC in solution at equilibrium [mg/L], C_d is the coagulant dosage [mM Al³⁺ or Fe³⁺], a is the maximum DOC sorption per mM of coagulant [mg/mM Al³⁺ or Fe³⁺], and b is the sorption constant for sorbable DOC [L/mg]. The DOC remaining after coagulation is the sum of DOC_eq remaining at equilibrium and the non-sorbable DOC, i.e., DOC_raw multiplied by f_NDOC, as given by Eq. (4).

(4)

\mathrm{DOC}_{\mathrm{final}} = \mathrm{DOC}_{\mathrm{eq}} + f_{\mathrm{NDOC}} \cdot \mathrm{DOC}_{\mathrm{raw}}
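Eqs. (1) to (4) can be chained into a single calculation. Rearranging Eq. (3) gives a quadratic in DOC_eq, b·DOC_eq² + (1 + a·b·C_d − b·DOC_sorb)·DOC_eq − DOC_sorb = 0, whose positive root is the physically meaningful solution. The sketch below follows that rearrangement; the function names and all numerical inputs are hypothetical and are not the fitted values of this study.

```python
import math


def sorbable_doc_at_equilibrium(doc_sorb, coag_dose, a, b):
    """Solve the Langmuir balance of Eq. (3) for DOC_eq (positive root of the quadratic)."""
    p = 1.0 + a * b * coag_dose - b * doc_sorb
    return (-p + math.sqrt(p * p + 4.0 * b * doc_sorb)) / (2.0 * b)


def doc_after_coagulation(doc_raw, suva_raw, coag_dose, a, b, k1, k2):
    """Predict the DOC remaining after coagulation from Eqs. (1) to (4)."""
    f_ndoc = k1 * suva_raw + k2                                       # Eq. (1)
    doc_sorb = (1.0 - f_ndoc) * doc_raw                               # Eq. (2)
    doc_eq = sorbable_doc_at_equilibrium(doc_sorb, coag_dose, a, b)   # Eq. (3)
    return doc_eq + f_ndoc * doc_raw                                  # Eq. (4)


# Hypothetical inputs chosen only to exercise the formulas.
print(doc_after_coagulation(doc_raw=3.0, suva_raw=2.0, coag_dose=0.1,
                            a=50.0, b=0.5, k1=-0.05, k2=0.5))
```

In the calibrated models, the maximum sorption a would itself be computed from pH (and, in the proposed model, temperature and coagulant dosage) before being passed in.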

Edwards [11] assumed the maximum sorption of flocs, a, to be a function of pH and defined the empirical relationship between them as Eq. (5).

(5)

a_{\mathrm{Edwards}} = x_{1} \cdot \mathrm{pH} + x_{2} \cdot \mathrm{pH}^{2} + x_{3} \cdot \mathrm{pH}^{3}

where x_n represents the fitting constants and pH represents the final process pH value. However, DOC concentrations in water are influenced by diverse operational and environmental conditions, such as raw water features, seasonal influences, and operational characteristics of water purification plants. An accurate prediction of DOC removal should therefore consider diverse empirical coefficients that can affect DOC variations [13].

In this context, we assumed that coagulant dosage (C_d), pH, and temperature would affect the sorption equilibrium of DOC and defined the maximum sorption a (a_proposed) as Eq. (6).

(6)

a_{\mathrm{proposed}} = x_{0} + x_{1} \cdot \mathrm{pH} + x_{4} \cdot \mathrm{Temperature} + x_{5} \cdot C_{d}

2.3.2. Empirical disinfection byproduct formation model

Because the disinfection byproducts generated after the disinfection process are classified as toxic substances, studies have been conducted on predicting the occurrence of disinfection byproducts in the water treatment process and on the diverse factors that can suppress their formation [19]. The US EPA suggested models to predict disinfection byproducts generated by chlorine injection in the water treatment process [20, 21]. Disinfection byproducts were classified into THMs and HAA. THM formation is expressed as a function of DOC, UVA, bromide, pH, temperature, free chlorine, and reaction time, as shown in Eq. (7).

(7)

\mathrm{THM} = A \cdot (\mathrm{DOC} \cdot \mathrm{UVA})^{a} \cdot (\mathrm{Cl}_{2})^{b} \cdot (\mathrm{Br}^{-})^{c} \cdot d^{(\mathrm{pH} - 7.5)} \cdot e^{(\mathrm{Temp} - 20)} \cdot t^{f}

where THM is the trihalomethane concentration [μg/L], DOC is the dissolved organic carbon concentration, UVA is the ultraviolet absorbance at 254 nm [1/cm], Cl₂ is the chlorine dose [mg/L], Br⁻ is the bromide concentration [μg/L], t is the reaction time [h], and (A, a, b, c, d, e, f) are the fitting constants.
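The structure of Eq. (7) is easy to evaluate directly. In the sketch below the function name `thm_formation` and every coefficient value are placeholders chosen for illustration; they are neither the US EPA constants nor the values fitted in this study.

```python
def thm_formation(doc, uva, cl2, bromide, ph, temp_c, time_h, A, a, b, c, d, e, f):
    """Evaluate the power-law THM formation model of Eq. (7)."""
    return (A
            * (doc * uva) ** a
            * cl2 ** b
            * bromide ** c
            * d ** (ph - 7.5)        # pH correction, equal to 1 at pH 7.5
            * e ** (temp_c - 20.0)   # temperature correction, equal to 1 at 20 degrees
            * time_h ** f)


# Placeholder fitting constants (hypothetical).
coeffs = dict(A=20.0, a=0.4, b=0.2, c=0.05, d=1.1, e=1.02, f=0.25)
print(thm_formation(doc=2.0, uva=0.05, cl2=1.5, bromide=30.0,
                    ph=7.2, temp_c=15.0, time_h=4.0, **coeffs))
```

Because the model is a product of powers, doubling the reaction time multiplies the predicted THM by 2^f regardless of the other inputs.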

We used actual water quality data for the M water purification plant and developed a model for predicting disinfection byproduct concentrations and formation processes based on the disinfection byproduct formation model suggested by the US EPA.

2.3.3. Model estimation

The model can be estimated by deriving unknown parameters that minimize the sum of squared errors between the actual values and predicted values, as represented by the following Eq. (8).

(8)

A = \sum_{i=1}^{n} \left[ Y_{i} - \hat{Y}_{i}(\xi_{i} : \chi) \right]^{2}

where A represents the sum of squared errors, Y_i represents the i-th observed value measured in the actual water treatment process, and Ŷ_i(ξ_i : χ) represents the i-th predicted value calculated by the model. χ = (χ⁰, χ¹, …, χ^K) is the vector of unknown parameters to be estimated, and ξ_i = (ξ_i¹, ξ_i², …, ξ_i^K) is the explanatory variable vector representing the operational characteristics that affect the variability of the target water quality. Here, the parameter χ that minimizes the sum of squared errors is calculated according to Eq. (9).

(9)

\frac{\partial A(\bar{\chi})}{\partial \chi^{k}} = 0, \quad (k = 0, 1, \dots, K)

The χ̄ that satisfies these conditions simultaneously is denoted (χ̄⁰, χ̄¹, …, χ̄^K), and in this study it was derived using the Newton-Raphson method.
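The estimation idea can be illustrated with a one-parameter example: Newton-Raphson is applied to the first-order condition dA/df = 0 for a model Ŷ = t^f. The study's models have several parameters, which would require the full gradient vector and Hessian matrix; the scalar case below, including the function name `fit_exponent_newton`, the starting value, and the synthetic data, is an assumption made to keep the sketch short.

```python
import math


def fit_exponent_newton(times, observed, f0=0.5, tol=1e-12, max_iter=50):
    """Minimize A(f) = sum_i (y_i - t_i**f)**2 by Newton-Raphson on dA/df = 0."""
    f = f0
    for _ in range(max_iter):
        grad = 0.0  # first derivative dA/df
        hess = 0.0  # second derivative d2A/df2
        for t, y in zip(times, observed):
            pred = t ** f
            log_t = math.log(t)
            resid = y - pred
            grad += -2.0 * resid * pred * log_t
            hess += 2.0 * (pred * log_t) ** 2 - 2.0 * resid * pred * log_t ** 2
        step = grad / hess
        f -= step  # Newton-Raphson update
        if abs(step) < tol:
            break
    return f


# Synthetic observations generated with a known exponent of 0.3.
times = [1.0, 2.0, 4.0, 8.0, 24.0]
observed = [t ** 0.3 for t in times]
print(fit_exponent_newton(times, observed))
```

Starting from f = 0.5, the iteration recovers the exponent used to generate the synthetic data; with real, noisy data the result depends on the starting value and on the Hessian staying positive.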

2.4. Model Performance Evaluation

The prediction accuracy of the models was evaluated by calculating the R-squared (R²), mean absolute percentage error (MAPE), and root mean square error (RMSE) for the linear regression of the measured and predicted values. In Eqs. (10)–(12), Y_i and Ŷ_i represent the actual and predicted values of the data set, respectively, Ȳ represents the mean of the actual values, and n represents the number of data points. R² indicates the goodness of fit, with values closer to +1 indicating better performance. MAPE and RMSE indicate better prediction performance as they approach 0, and larger positive values indicate lower prediction performance.

(10)

R^{2} = 1 - \frac{\sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}}{\sum_{i=1}^{n} (Y_{i} - \bar{Y})^{2}}

(11)

\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_{i} - \hat{Y}_{i}|}{Y_{i}}

(12)

\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}}{n}}
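The three metrics can be computed in a few lines. The sketch assumes the usual coefficient-of-determination form for R² (one minus the residual sum of squares divided by the total sum of squares about the mean), under which R² becomes negative when a model predicts worse than the mean of the observations. MAPE is returned as a fraction, as in Eq. (11), and can be multiplied by 100 to express it in percent. The function names and sample values are illustrative.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (actual values must be non-zero)."""
    return sum(abs(y - p) / y for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the units of the target variable."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


# Hypothetical DOC values in mg/L.
actual = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 2.0]
print(r_squared(actual, predicted), mape(actual, predicted), rmse(actual, predicted))
```

Predicting the mean for every sample gives R² = 0 under this definition, which is a useful baseline when reading the reported scores.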

3. Results and Discussion

3.1. DOC Removal

3.1.1. Empirical model

Edwards [11] suggested an empirical model for predicting DOC removal by using data from the Coagulation and Softening Database of the American Water Works Association, which were obtained from 39 water source areas and 608 measurement points. However, this model is based on Jar-Tests of raw water in the United States, and Korea differs from the US in raw water characteristics and in the operational features of water purification plants; hence, it is difficult to apply the model directly to the conditions in Korea because prediction accuracy may be lower. When measured DOC values from the M water purification plant were compared with the predictions of the model of Edwards [11], R² was −0.270. As shown in Fig. 2, the predicted values generally overestimated the observed values. Therefore, we used the pH and temperature of raw water, together with the DOC, UV254, and coagulant dosage measured at the receiving well and settling basin of the M water purification plant, to estimate an Edwards–Calib model calibrated for the plant. Furthermore, we suggested a proposed model for more precise DOC removal prediction in actual water plants, which quantifies the maximum sorption of flocs by considering pH, as suggested by Edwards [11], together with additional empirical coefficients for temperature and coagulant dosage. As the M water purification plant employs the alum-based coagulants PSO-M, PAC, and Al₂(SO₄)₃, we developed a model in consideration of only alum. Table 1 shows the details of parameter values and error estimates derived for each model.

As confirmed in Fig. 2, the Edwards–Calib model and the proposed model accurately predicted the observed values. This result appears to stem from the high correlation (r = 0.790) between the DOC observed at the settling basin (the prediction target) and the DOC concentration of raw water (one of the explanatory variables of the model). Compared to the Edwards–Calib model, which only considered pH, the proposed model, which considered pH, temperature, and coagulant dosage, showed higher prediction accuracy. The signs of the estimated coefficients confirm that the maximum sorption increased with higher temperature and coagulant dosage, indicating higher DOC removal efficiency. In the proposed model, the maximum sorption decreased linearly as pH increased; in the Edwards–Calib model, the maximum sorption decreased as pH increased up to 7.5 and increased when the pH was > 7.5, indicating a nonlinear trend.

3.1.2. AI algorithm model

We used 733 of the 1048 datasets collected from January 5, 2015 to December 30, 2019 to train the models, and used the remaining 315 datasets (which were not used for training) to test the predictions. Fig. 3 shows the training and test results of the DOC removal model. As confirmed in Fig. 3, all techniques showed excellent training performance (R² > 0.94, RMSE < 0.066 mg/L, and MAPE < 2.49%). In particular, RF showed the best performance among the models (R² = 0.9972, RMSE = 0.0142 mg/L, and MAPE = 0.572%). When the trained DOC removal models were tested with data that were not used for training, all AI algorithm techniques again showed excellent performance (R² > 0.901, RMSE < 0.0783 mg/L, and MAPE < 3.36%). RF showed the best performance in training; in the test, however, MLP showed the best prediction performance among the models (R² = 0.9795, RMSE = 0.0365 mg/L, and MAPE = 1.513%). Compared with the empirical models (Fig. 2), the AI algorithm techniques were consistently superior in predicting DOC removal.

We also analyzed how performance depends on the sample size for each AI algorithm. Fig. 4(a) shows the changes in prediction performance as the sample size gradually increases from 90 to 1048. When the number of samples was < 360, prediction performance varied widely with the number of samples. In particular, when the number of samples increased from 90 (i.e., the minimum number of data samples for training a model) to 180, prediction performance decreased by > 8.9% on average. However, with 270 data samples the results were similar to those of the empirical models, and with 360 data samples the models showed consistently high performance. In other words, as the amount of data increased, more information regarding the non-linear relationships between variables was provided, and each model identified the basic patterns of the data, leading to more accurate predictions. In conclusion, if approximately 9 months of datasets on water purification plant operation and water quality can be secured, it is possible to create a model that performs similarly to the empirical models, implying that an AI algorithm model applicable to actual conditions in water purification plants can be developed.

3.2. Disinfection Byproduct Formation Model

3.2.1. Empirical model

In this study, based on the EPA study [21], we propose a model to predict the disinfection byproducts generated by chlorine injection in the clear well of the M water purification plant. EPA suggested an empirical model that predicts disinfection byproduct formation from measured changes in byproduct concentrations after chlorine reacts with raw or treated water at each chlorine injection point, as a function of contact time, water temperature, and pH. However, the raw water characteristics and plant operation underlying that model differ from those in Korea, and because the model was derived from experimental results, it is less suited to the operational data of actual water purification plants. We therefore developed a model suited to the operation and water quality of water purification plants in Korea, based on data from the M water purification plant. Furthermore, using AI algorithm techniques, we predicted disinfection byproduct formation and compared the results with those of the empirical models to examine the applicability of these techniques.

The M water purification plant measures and controls nine disinfection byproduct indicators, including THM, CHCl₃, CHCl₂Br, CHBr₂Cl, C₂H₃Cl₃O₂, dibromoacetonitrile (C₂HBr₂N), dichloroacetonitrile (C₂HCl₂N), and HAA. Among them, we estimated models for the formation of THM, CHCl₃, and CHCl₂Br, because enough accumulated water quality data for a meaningful analysis were available for these byproducts for the period from January 5, 2015 to December 30, 2019.

First, the disinfection byproduct generation model was evaluated using the model parameters provided in the literature and M water purification plant data. Table 2 under the ‘Original’ column provides details of parameter values and error estimates of the literature models on disinfection byproducts, and Fig. 5 (a, b, c) shows visualized results regarding correlations between observed and predicted values.

The literature models for the formation of THM, CHCl₃, and CHCl₂Br resulted in R² values of −109.5, −108.7, and −7.506, respectively. A negative R² means that a model predicts worse than the mean of the observations; here the predictions strongly overestimate the observed values. Furthermore, the MAPE values are 311.24%, 247.62%, and 88.26% for the respective disinfection byproduct models, indicating very low prediction accuracy. In other words, the parameters suggested in the literature are difficult to apply to the M water purification plant.
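To make the meaning of a strongly negative R² concrete, the short sketch below uses made-up concentrations (not plant data) and predictions that are three times too high. The numbers and the factor of three are assumptions chosen only to show how systematic overestimation drives R² far below zero.

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [20.0, 25.0, 30.0, 35.0]            # hypothetical concentrations, ug/L
overestimated = [3.0 * o for o in observed]    # predictions three times too high
mean_only = [sum(observed) / len(observed)] * len(observed)

r2_over = r2_score(observed, overestimated)    # about -99.8: far worse than the mean
r2_mean = r2_score(observed, mean_only)        # 0.0: the reference level
print(r2_over, r2_mean)
```

Because the squared error of the overestimated predictions (12600) is about 100 times the spread of the observations about their mean (125), R² falls to roughly −99.8, the same order of magnitude as the values reported for the literature THM and CHCl₃ models.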

For this reason, we derived parameters that reflect the operating conditions of the M water purification plant using the Newton-Raphson method. The error estimates obtained with the derived parameters are presented in Table 2 under the ‘Proposed’ column, and the correlation between observed and predicted values is shown in Fig. 5(d, e, f).
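The idea of a Newton-Raphson parameter search can be sketched for a single parameter. The sketch assumes a least-squares objective and a one-term power model y = scale · x^f with a known scale; the study fits several parameters at once and does not restate its objective function here, so the function name `fit_exponent_newton`, the data, and the starting value are all hypothetical.

```python
import math

def fit_exponent_newton(x, y, scale, f0, tol=1e-10, max_iter=50):
    """Fit f in y = scale * x**f by Newton-Raphson on the least-squares gradient.
    Single-parameter illustration only; not the authors' multi-parameter fit."""
    f = f0
    for _ in range(max_iter):
        grad = 0.0  # first derivative of the sum of squared errors w.r.t. f
        hess = 0.0  # second derivative of the sum of squared errors w.r.t. f
        for xi, yi in zip(x, y):
            pred = scale * xi ** f
            dpred = pred * math.log(xi)      # d(pred)/df
            d2pred = dpred * math.log(xi)    # d2(pred)/df2
            resid = yi - pred
            grad += -2.0 * resid * dpred
            hess += 2.0 * (dpred ** 2 - resid * d2pred)
        step = grad / hess
        f -= step                            # Newton-Raphson update
        if abs(step) < tol:
            break
    return f

# Hypothetical noise-free data generated with f = 0.3 and scale = 10.
contact_time = [2.0, 4.0, 8.0, 12.0, 24.0]
concentration = [10.0 * t ** 0.3 for t in contact_time]

f_hat = fit_exponent_newton(contact_time, concentration, scale=10.0, f0=0.5)
print(round(f_hat, 6))
```

Each iteration divides the gradient of the squared error by its second derivative, so the update converges quickly once the estimate is near the minimum; with noisy plant data and several parameters the same update uses a gradient vector and a Hessian matrix, and a poor starting point can prevent convergence.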

The derived models for the formation of THM, CHCl₃, and CHCl₂Br resulted in R² values of 0.692, 0.628, and 0.596, respectively. Considering that water quality measurement data from the actual water purification plant were used rather than experimental data, the prediction reliability can be regarded as high. In addition, when checking the signs of the parameters of each model, the value of f(time) (i.e., the parameter for contact time between treated water and chlorine) was positive, which explains why disinfection byproduct formation increased with longer contact time and is consistent with the experimental results of US EPA [21]. Similarly, the models explained the tendency for disinfection byproduct formation to increase when the DOC concentration (i.e., the precursor of disinfection byproducts in water), residual chlorine concentration, temperature, and pH were higher.
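The effect of a positive time parameter can be illustrated numerically. The sketch assumes that contact time enters each model as a power term time^f, as in EPA-type power-function models; that functional form is an assumption here because this section does not restate the equation. The exponents are the 'Proposed' f values from Table 2, and `time_factor` is a hypothetical helper name.

```python
# 'Proposed' contact-time parameters f from Table 2.
time_exponent = {"THM": 0.296, "CHCl3": 0.421, "CHCl2Br": 0.340}

def time_factor(f, ratio):
    """Multiplicative change in predicted concentration when contact time is
    scaled by `ratio`, ASSUMING time enters the model as time**f."""
    return ratio ** f

doubling = {name: time_factor(f, 2.0) for name, f in time_exponent.items()}

for name, factor in doubling.items():
    print(f"{name}: doubling contact time scales the prediction by {factor:.3f}")
```

Under this assumption every factor is greater than 1 because every f is positive, and less than 2 because every f is below 1: predicted formation rises with contact time, but less than proportionally.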

3.2.2. AI algorithm model

Fig. 6 shows the training and test results of the model for disinfection byproduct formation. In training, all techniques showed the following performance: R² > 0.88, RMSE < 1.9043 μg/L, and MAPE < 22.91%. In particular, the GBR technique showed the best training performance among the models (R² = 0.9999, RMSE = 0.0407 μg/L, and MAPE = 0.244%). When the trained models were tested on data not used for training, all techniques except DT performed well (R² > 0.92, RMSE < 1.5707 μg/L, and MAPE < 9.50%). Although GBR performed best in training, MLP showed the best prediction performance in the test (R² = 0.9781, RMSE = 0.8486 μg/L, and MAPE = 5.05%). Compared with the empirical models (Fig. 5(d, e, f)), the AI algorithm techniques predicted disinfection byproduct formation consistently better.

We also analyzed how performance depends on the sample size for each AI algorithm. Fig. 4 b) shows changes in the prediction performance of the models as the sample size gradually increased from 90 to 1048. When the sample number was > 270, prediction performance was high (R² > 0.938), although the error between predicted and measured values fluctuated (between −1.4 and 5.1%) as the number of data samples grew. Even with datasets of 90 data samples, it was possible to create a model that performed similarly to the empirical models, and when datasets with > 270 data samples were used, the models improved on the empirical models by a maximum of 41.3%, a minimum of 28.6%, and an average of 35.2% based on R². As with the AI algorithm model for DOC prediction, as the amount of data increased, more information about the non-linear relationships between variables became available, and each model was able to identify the basic patterns of the data, leading to more accurate predictions. In conclusion, if at least 9 months of water quality data can be secured, it is possible to create a model that performs similarly to the empirical models, and a model of disinfection byproduct formation with higher performance can be developed if data for > 9 months are secured.

The developed model for disinfection byproduct formation in this study provides information on concentration prediction and inhibition of formation of disinfection byproducts which are generated during chlorine disinfection. In other words, without directly measuring disinfection byproducts, it is possible to predict the concentrations of generated disinfection byproducts in advance and utilize the data for controlling disinfection byproducts. Among the factors influencing disinfection byproduct formation, temperature and bromide are natural values (measured values), contact time is a designed value, and pH is an operating parameter in the coagulation process; these factors are difficult for water purification plant operators to control. Therefore, to suppress disinfection byproduct formation, the following measures for controlling disinfection byproducts can be considered: injection of an appropriate amount of chlorine to minimize microorganism growth and the occurrence of disinfection byproducts; or injection of an appropriate amount of coagulant to improve DOC removal rates, and introduction of advanced treatment processes.

4. Conclusion

The optimal design and operation of water purification plants require predicting how treatment efficiency and treated water quality change with variations in raw water quality and the operating environment of the planned water treatment process. As it is difficult to consider pilot plants due to budget and time constraints, studies have been conducted on prediction modeling using experimental and operational data. Furthermore, there have recently been more attempts to introduce AI algorithm techniques that use big data from water purification plants to determine the optimal operation parameters and predict the quality of treated water. In this study, we utilized actual operation data from the M water purification plant and developed empirical and AI algorithm models that are appropriate for the plant and that can predict DOC removal in the coagulation/precipitation process and the occurrence of disinfection byproducts due to chlorine disinfection in the clear well. Furthermore, by conducting a comparative analysis of the prediction accuracy of the developed AI algorithm models and the empirical models, we examined the applicability of AI algorithm techniques in the water treatment process.

Based on the suggested model by Edwards [11], we utilized actual water quality data on the M water purification plant and calibrated it for predicting DOC removal. Furthermore, compared to the model of Edwards [11] which assumes maximum sorption only with a function of pH, we attempted to achieve more accurate prediction performances, and suggested a model in consideration of pH, temperature, and coagulant dosage. The Edwards–Calib model and the proposed model accurately predicted measured values; however, the proposed model, which considers factors influencing DOC removal, demonstrated higher prediction performances.

In addition, as a result of predicting DOC removal by utilizing AI algorithm techniques such as DT, RF, SVM, KNN, GBR, and MLP, RF showed the optimal training performance and MLP showed the optimal test result. We also confirmed that all AI algorithm techniques had superior prediction performances to those of the empirical models. Comparing the performances of the AI algorithm techniques based on the sample size showed that the prediction performance improved along with a larger sample size, and the performance became similar to that of the empirical models when the sample size was approximately 270. This implied that it is possible to build an AI algorithm model that is superior to an empirical model if datasets on water purification plant operation and water quality of > 9 months can be secured. As a model based on datasets of > 1 year showed similar performances, it is considered possible to create an efficient model with datasets of > 1 year.

To predict disinfection byproduct formation, we expressed changes in the concentration of disinfection byproducts generated after the reaction between treated water and chlorine as functions of DOC, residual chlorine, bromide, pH, temperature, and reaction time, and developed a prediction model on THM, CHCl₃, and CHCl₂Br based on water quality data in clear well in the M water purification plant. All models showed high prediction reliability when considering the prediction of water quality of actual water purification plants. It also explained the tendency in which disinfection byproduct formation increased, when DOC concentrations, residual chlorine concentrations, temperature, and pH were higher and the response time was longer. On the other hand, as a result of predicting disinfection byproduct formation using AI algorithm techniques, GBR presented the optimal training performance and MLP showed the optimal test result. All AI algorithm techniques were confirmed to have superior prediction performances to those of empirical models.

In this study, we developed empirical and AI algorithm models for DOC removal in the coagulation/precipitation process, and for the formation of disinfection byproducts in the clear well. As DOC and disinfection byproducts are important water quality indicators in operating water purification plants, operators should identify and control the status of DOC removal and disinfection byproduct formation in these plants. However, it is difficult to measure such water quality items automatically with measuring instruments; thus, measurements through laboratory tests are required, which have not been managed in actual water purification plants. The empirical and AI algorithm models developed in this study can be useful for decision-making on water purification plant operation, as they enable prediction in advance without directly measuring water quality items. In particular, because the results of the DOC removal model can be used as input variables for the model of disinfection byproduct formation, the models can be linked to each other. It is also possible to utilize the results as basic data for decision-making on process development for controlling disinfection byproducts, coagulant dosage, and chlorine injections.

Furthermore, we examined the applicability of AI algorithm techniques to water treatment processes. We confirmed that all AI algorithm techniques used in the analyses had consistently superior prediction performance to that of the empirical models, and that AI algorithms are highly applicable to the water treatment sector in terms of prediction performance. However, the higher prediction performance of AI algorithm techniques requires large-scale datasets to be reliably secured. Securing a minimum of 1 year of datasets can guarantee performance similar to that of the empirical models, and higher performance requires a larger number of datasets. The findings should be of interest to and garner support from relevant authorities in terms of policy and budget.

Supplementary Information

eer-2023-198-Supplementary.pdf

Acknowledgments

This work was supported by the World Class 300 Project(R&D) (P0012997, Development of Engineering Platform and Application Software for Infrastructure Design) of the MOTIE(Korea).

Notes

Author Contributions

H.S. (Principal Researcher) developed the conceptualization and methodology, and wrote the manuscript. Y.B. (Researcher) helped in developing the conceptualization and methodology, and wrote the manuscript. S.K. (Department Head), H.S. (Senior Researcher), S.O. (Researcher) and Y.R. (Researcher) helped in developing the methodology of the study. H.K. (Research Director) provided valuable research insights into the study. N.J. (Research Fellow) provided valuable research insights into the study, reviewed the manuscript, and helped with publishing.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

1. Byun YH, Shin HS, Kim HY, Jung NC. Basic study on development of drinking water treatment process simulators. J. Korean Soc. Water Wastewater. 2021;35:351–365. https://doi.org/10.11001/jksww.2021.35.5.351

2. Sahu P, Londhe SN, Kulkarni PS. Modelling dissolved oxygen and biochemical oxygen demand using data-driven techniques. Environ. Eng. Res. 2022;28:210541. https://doi.org/10.4491/eer.2021.541

3. Wu GD, Lo SL. Predicting real-time coagulant dosage in water treatment by artificial neural networks and adaptive network-based fuzzy inference system. Eng. Appl. Artif. Intell. 2008;21:1189–1195. https://doi.org/10.1016/j.engappai.2008.03.015

4. Heddam S, Bermad A, Dechemi N. ANFIS-based modelling for coagulant dosage in drinking water treatment plant: a case study. Environ. Monit. Assess. 2012;184:1953–1971. https://doi.org/10.1007/s10661-011-2091-x

5. Kennedy MJ, Gandomi AH, Miller CM. Coagulation modeling using artificial neural networks to predict both turbidity and DOM-PARAFAC component removal. J. Environ. Chem. Eng. 2015;3:2829–2838. https://doi.org/10.1016/j.jece.2015.10.010

6. Wang D, Chang X, Ma K, Li Z, Deng L. Estimating effluent turbidity in the drinking water flocculation process with an improved random forest model. Water Supply. 2022;22:1107–1119. https://doi.org/10.2166/ws.2021.213

7. Dunnington DW, Trueman BF, Raseman WJ, Anderson LE, Gagnon GA. Comparing the Predictive performance, interpretability, and accessibility of machine learning and physically based models for water treatment. ACS ES&T Eng. 2020;1:348–356. https://doi.org/10.1021/acsestengg.0c00053

8. Son HJ, Hwang YD, Roh JS, et al. Application of MIEX® pre-treatment for ultrafiltration membrane process for NOM removal and fouling reduction. Water Sci. Technol.-Water Supply. 2005;5:15–24. https://doi.org/10.2166/ws.2005.0034

9. Son HJ, Kim SG, Seo CD, Yoom HS, Ryu DC. Evaluation of treatability on DOC and THMs according to periodic cumulative filling of granular activated carbon (GAC). J. Korean Soc. Environ. Eng. 2017;39:513–518. https://doi.org/10.4491/KSEE.2017.39.9.513

10. Hua G, Reckhow DA, Abusallout I. Correlation between SUVA and DBP formation during chlorination and chloramination of NOM fractions from different sources. Chemosphere. 2015;130:82–89. https://doi.org/10.1016/j.chemosphere.2015.03.039

11. Edwards M. Predicting DOC removal during enhanced coagulation. J. Am. Water Work Assoc. 1997;89:78–89. https://doi.org/10.1002/j.1551-8833.1997.tb08229.x

12. Harrington GW, Chowdhury ZK, Owen DM. Developing a computer model to simulate DBP formation during water treatment. J. Am. Water Work Assoc. 1992;84:78–87. https://doi.org/10.1002/j.1551-8833.1992.tb05886.x

13. Kastl G, Sathasivan A, Fisher IA, Van Leeuwen J. Modeling DOC removal enhanced coagulation. J. Am. Water Work Assoc. 2004;96:79–89. https://doi.org/10.1002/j.1551-8833.2004.tb10557.x

14. Moomaw CL. Predictive models for coagulation efficiency in DBP precursor removal [dissertation]. Colorado: Univ. of Colorado; 1992.

15. Brezonik P, Arnold W. Water chemistry: an introduction to the chemistry of natural and engineered aquatic systems. Oxford University Press; 2011. p. 501–503.

16. Westerhoff P, Rodriguez-Hernandez M, Baker L, Sommerfeld M. Seasonal occurrence and degradation of 2-methylisoborneol in water supply reservoirs. Water Res. 2005;39:4899–4912. https://doi.org/10.1016/j.watres.2005.06.038

17. Yu SI, Rhee C, Cho KH, Shin SG. Comparison of different machine learning algorithms to estimate liquid level for bioreactor management. Environ. Eng. Res. 2023;28:220037. https://doi.org/10.4491/eer.2022.037

18. Wu J, Chen S, Liu X. Efficient hyperparameter optimization through model-based reinforcement learning. Neurocomputing. 2020;409:381–393. https://doi.org/10.1016/j.neucom.2020.06.064

19. Zhu G, Zhang S, Bian Y, Hursthouse AS. Multi-linear regression model for chlorine consumption by waters. Environ. Eng. Res. 2021;26:200402. https://doi.org/10.4491/eer.2020.402

20. Water Treatment Plant Model Version 2.0 User’s Manual. 2001. [Cited 05 March 2023]. Available from: https://www.epa.gov/sites/default/files/2017-03/documents/wtp_model_v._2.0_manual_508.pdf

21. Amy GL. Empirically based models for predicting chlorination and ozonation by-products: Trihalomethanes, haloacetic acids, chloral hydrate, and bromate. US Environmental Protection Agency; 1998. p. 68–85.

Fig. 1

Flow chart for Model Simulation procedure.

Fig. 2

Empirical DOC removal model results. a) Edwards model, b) Edwards-Calib, c) Proposed model.

Fig. 3

AI algorithm DOC removal model results. a) DT, b) RF, c) SVM, d) KNN, e) GBR, f) MLP.

Fig. 4

a) AI algorithm DOC removal model results by data sample size. b) AI algorithm THM formation model results by data sample size.

Fig. 5

Empirical THM formation literature model results: a) THM, b) CHCl₃, c) CHCl₂Br. Empirical THM formation derived model results: d) THM, e) CHCl₃, f) CHCl₂Br.

Fig. 6

AI algorithm THM formation model results. a) DT, b) RF, c) SVM, d) KNN, e) GBR, f) MLP.

Table 1

Empirical DOC removal model estimation results

Factor	Edwards	Edwards-Calib	Proposed model
k1	−0.075	−0.126	−0.166
k2	0.560	0.730	0.643
x0 - const	-	-	31.962
x1 - pH	284.000	1933.403	−4.086
x2 - pH²	−74.200	−509.637	-
x3 - pH³	4.910	33.875	-
x4 - temp	-	-	0.170
x5 - alum dosage	-	-	50.668
b	0.147	0.083	0.094
MAE	176.554	0.086	0.078
MAPE	17.129	5.398	4.903
R²	−0.270	0.839	0.873

Table 2

Empirical THM formation model estimation results

Factor	THM		CHCl₃		CHCl₂Br
Factor	Original	Proposed	Original	Proposed	Original	Proposed
A	17.70	10.66	101.0	11.71	7.570	2.895
a – DOC·UVA	0.475	0.402	0.615	0.396	0.443	0.319
b – Cl₂	0.173	0.282	0.699	0.344	0.563	0.268
c – Br⁻	0.246	0.047	−0.468	−0.183	0.074	0.195
d – pH	1.316	1.571	1.099	1.377	1.355	1.676
e – temp	1.036	1.029	1.035	1.023	1.030	1.025
f – time	0.366	0.296	0.336	0.421	0.281	0.340
MAE	31.36	1.425	18.68	1.001	4.021	0.820
MAPE	311.2	16.25	247.6	14.585	87.74	19.833
R²	−109.5	0.692	−108.7	0.628	−7.506	0.596