Effluent Quality Prediction of Papermaking Wastewater Treatment Processes Using Stacking Ensemble Learning

Advanced process modeling methods have been used for prediction and monitoring of key quality indices in wastewater treatment processes. However, single conventional models usually have limited prediction accuracy when predicting the effluent indices in papermaking wastewater treatment processes. To achieve better prediction accuracy and robustness, we propose a stacking ensemble learning (SEL) method which utilizes the advantages of its internal base-learning models. The method combines base-learning algorithms including partial least squares, support vector regression, and artificial neural networks with a meta-learning algorithm, which is a multiple-response linear regression in this work. To evaluate the model performance in practical applications, both real wastewater data and simulation wastewater data are used for modeling. The predicted effluent indices include effluent suspended solid (SS eff), effluent chemical oxygen demand (COD eff), effluent ammonia concentration (S NHeff), and effluent nitrate concentration (S NOeff). Compared with the base-learning algorithms and other ensemble learning methods, the results demonstrate that SEL significantly improves the prediction accuracy and reduces the prediction errors, which provides a new way to achieve real-time monitoring of wastewater treatment processes.


I. INTRODUCTION
The key to improving the quality management efficiency in wastewater treatment processes (WWTPs) heavily relies on the implementation of effective real-time monitoring of the effluent concentrations [1]. In recent years, hardware sensors in WWTPs have been exposed to a series of shortcomings in the monitoring process, such as significant time lags and high maintenance costs [2]. In contrast, soft sensing methods can save measurement costs and improve monitoring quality in WWTPs, being not only economical and reliable but also capable of a dynamic response [3]-[5]. For example, soft sensors can make real-time predictions for key WWTP variables including the concentrations of suspended solids (SS), chemical oxygen demand (COD), total phosphorus (TP), total nitrogen (TN), and daily sewage sludge. For some variables with periodic features, the variable information can be divided into a periodic component and a residual component based on periodic analysis. Soft sensors combined with periodic analysis can achieve a better prediction performance [6]. Nowadays, increasing attention has been paid to advanced monitoring techniques in WWTPs [7], [8].
In recent years, the main conventional methods including partial least squares (PLS), support vector regression (SVR), and artificial neural networks (ANN) have been used for the prediction of wastewater effluent indices. However, the complex characteristics in wastewater treatment processes, such as nonlinearity, time-varying characteristic, and uncertainty make it difficult to obtain a satisfactory prediction result using these conventional methods. For example, the linear method such as PLS usually shows bad modeling performance when nonlinear characteristic exists in the WWTP data [9]. Embedding kernel functions into PLS has been considered as an effective way for improving the prediction performance, in which the original data is transformed into a high-dimensional feature space by nonlinear mapping [10]. Compared with PLS, ANN has better nonlinear fitting performance and adaptive learning ability, which allows ANN to be successfully applied in WWTPs [11], [12]. However, there always exist low efficiency and local minimum problems in the ANN modeling process. To improve its modeling performance, the original ANN model needs to combine with other optimization methods such as fuzzy subtractive clustering and optimize fuzzy rule [13]. Although SVR has proven to work well under limited data sets, the computational cost is relatively large for large-scale data sets [14]. Aiming at curbing this limitation of the conventional SVR, the LSSVR algorithm has been proposed and it can provide a more effective solution by transforming the optimization problem into a set of linear equations problem [15]. Reducing the computing complexity, the improved LSSVR model can be successfully applied for predicting the wastewater effluent indices [16].
However, conventional models inevitably have some limitations. Without further optimization, none of the original models has the capability to interpret the complex characteristics of wastewater treatment processes. Moreover, there always exists a contradiction between model complexity and generalization ability for limited samples. It is difficult and usually impossible for an over-optimized model to reach high prediction performance on all data sets. Fortunately, it has been confirmed that ensemble learning methods can improve prediction accuracy without making the model too complicated [17]. Rather than transforming a single model and hoping that the modified model will display its full potential, ensemble learning methods combine the advantages of different types of models to achieve a better prediction performance. By considering various viewpoints of the training data and multiple training principles, ensemble learning methods can be of great benefit for excavating the potential information between WWTP variables, so the model's generalization ability is greatly increased. Ensemble learning methods represent an important future direction for process modeling.
All of the conventional models are also called base-learners in ensemble learning methods, which improve prediction ability by diversifying their base-learners [18]. At present, ensemble learning methods are generally divided into three types: Bagging, Boosting, and stacking ensemble learning (SEL) [19]-[21]. For Bagging and Boosting, the emphasis is mainly placed on the data resampling technique [22], [23]. Thus, the diversity between the base-learners stems from the multiformity of the training samples. Unlike Bagging and Boosting, SEL pays more attention to the diversity of training principles. More specifically, SEL integrates several distinct base-learning algorithms through a meta-learning algorithm, which aims to improve the prediction accuracy and generalization capability. From the viewpoint of diversity, SEL has a better prospect [24].
While well-trained base-learning algorithms with high prediction accuracy are a prerequisite for SEL, the meta-learning algorithm also determines the quality of SEL to a degree. Multi-response linear regression (MLR) has been confirmed as the most suitable meta-learning algorithm in SEL. Different from the voting or averaging methods in Bagging and Boosting, SEL uses MLR to further generalize the output values of the base-learning algorithms. Previous research has shown that the main superiority of MLR lies in its powerful capability for reducing the variance and bias of different base-learning algorithms [25].
In recent years, SEL has been successfully applied in the industrial field as a real-time prediction method. Divina et al. developed an approach for short-term electricity consumption forecasting based on SEL. Compared with other conventional methods, the proposed method realized an efficient and promising way to solve the forecasting accuracy problem [26]. Khairalla et al. proposed a modified SEL method to predict the average growth rate of total oil demand, which was superior to other benchmark methods in terms of error rate and directional accuracy [27]. Sun et al. successfully applied SEL to the river ice forecasting field and obtained a better prediction result with higher accuracy [28]. In this work, a novel SEL algorithm is proposed to predict the wastewater effluent indices. This article is organized in the following manner. In Section 2, the training and testing processes are illustrated in more detail, then the modeling principles of the base-learning algorithms and the meta-learning algorithm are briefly introduced. In Section 3, data processing and parameter optimization are illustrated first, and then other ensemble learning methods are introduced for comparison. To evaluate the prediction performance of SEL, both real wastewater data and simulation wastewater data are used for modeling. Finally, the conclusions are given in Section 4.

II. METHODS
SEL can improve estimation ability by combining the advantages of several different algorithms. In this work, SEL can be divided into two parts. The first part contains base-learning algorithms and the second part is the meta-learning algorithm.
The base-learning algorithms should be efficient, diversiform, and simple. As a prerequisite for building an ensemble model, strong learning ability of each base-learning algorithm helps improve the prediction performance of SEL. In terms of training principle, the diversity between the base-learners should be as large as possible, which enables SEL to interpret data characteristics from multiple perspectives. Moreover, lower computational complexity is beneficial to further improvement and optimization. Based on the above criteria, PLS, SVR, and ANN are chosen as the candidate base-learning algorithms. Compared with simple averaging and voting strategies, MLR achieves a further generalized result and is also the most commonly used meta-learning algorithm at present.

A. TRAINING AND TESTING PROCESS
For the training process, the three base-learning algorithms, PLS, SVR, and ANN, are defined as ζ1, ζ2, and ζ3, respectively. As the meta-learning algorithm, MLR is defined as ξ. In the training process, a five-fold cross-validation approach is adopted.
The original training set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} is randomly divided into five subsets of the same size, D_1, D_2, D_3, D_4, D_5, where x_i represents the sample features and y_i represents the target value. In each fold, four subsets are used to train the base-learners, and the remaining subset is used to verify the prediction performance and to optimize the specific parameters of the base-learner h_i^(j) obtained from the i-th learning algorithm ζ_i. This process is carried out five times. Each well-trained base-learner h_i^(j) then predicts the samples that did not participate in its training, and the prediction result can be expressed as h_i^(j)(x_i). The secondary training set produced by the three base-learning algorithms can be expressed as D' = {((h_1(x_i), h_2(x_i), h_3(x_i)), y_i), i = 1, ..., m}, where y_i is still the target value in the original training set D. D' is used to train the meta-learner h' = ξ(D') with the meta-learning algorithm ξ.
For the testing process, the five base-learners h_i^(j) produce corresponding predictions for the original test set D_test, which can be described as h_i^(1)(x_t), h_i^(2)(x_t), ..., h_i^(5)(x_t). By averaging these predictions, a prediction vector is obtained as h̄_i(x_t) = (1/5) Σ_{j=1}^{5} h_i^(j)(x_t). The three base-learning algorithms thus produce three prediction vectors, h̄_1(x_t), h̄_2(x_t), and h̄_3(x_t). Because y_t in D_test is unchanged, the secondary test set can be expressed as D'_test = {((h̄_1(x_t), h̄_2(x_t), h̄_3(x_t)), y_t)}. By applying D'_test to the meta-learner h', the final prediction result H(x) = h'(D'_test) is obtained. The implementation process of SEL and the formation of the secondary data set are shown in Figures 1 and 2, respectively.

B. BASE-LEARNING ALGORITHMS
1) PARTIAL LEAST SQUARES
The partial least squares algorithm has been widely used in regression, mainly by finding reasonable latent variables. Assume that the input matrix is X = [x_1, x_2, ..., x_q]_{n×q} and the output matrix is Y = [y_1, y_2, ..., y_p]_{n×p}, where n is the number of samples, and q and p represent the numbers of input and output variables, respectively. To better exploit the variance structures of the process, X and Y are projected into a lower-dimensional space as follows:

X = T P^T + E, (1)
Y = U Q^T + F, (2)

where P and Q are loading matrices, T and U are latent matrices which carry enough variation information in X and Y, and E and F represent residual matrices.

2) SUPPORT VECTOR REGRESSION
The support vector regression algorithm is a modified version of the support vector machine (SVM) which can be used to solve linear and nonlinear regression tasks. It performs well on small-scale data sets by using the structural risk minimization principle. The key to SVR lies in finding a suitable mapping function ϕ(x) between input vectors and output vectors. With the help of ϕ(x), the training data T = {(x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_n, y_n)} are mapped into a high-dimensional feature space. Then, an optimized regression function can be constructed in the new feature space as follows:

f(x) = w^T ϕ(x) + b, (3)

where w is the weight vector and b represents the bias term. By introducing the ε-insensitive loss function, errors within a specified tolerance range can be neglected. Then, the slack variables ξ_i and ξ_i* are added into Equation (3), and the regression task can be transformed into an optimization problem as follows:

min_{w, b, ξ, ξ*}  (1/2)||w||^2 + C Σ_{i=1}^{n} (ξ_i + ξ_i*)
s.t.  y_i − (w^T ϕ(x_i) + b) ≤ ε + ξ_i,
      (w^T ϕ(x_i) + b) − y_i ≤ ε + ξ_i*,
      ξ_i ≥ 0, ξ_i* ≥ 0, i = 1, ..., n, (4)

where C is the regularization parameter. By introducing the Lagrangian coefficients β_i and β_i*, the weight vector w can be expressed as follows:

w = Σ_{i=1}^{n} (β_i − β_i*) ϕ(x_i). (5)

Finally, SVR can be defined by the following regression function:

f(x) = Σ_{i=1}^{n} (β_i − β_i*) K(x, x_i) + b, (6)

where K(x, x_i) corresponds to the kernel function. The radial basis function (RBF) is used in this work, which can be expressed as follows:

K(x, x') = exp(−γ ||x − x'||^2), (7)

where γ is the kernel parameter, used to control the radial range of the kernel function.
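An RBF-kernel SVR of the form in Equations (6) and (7) can be fitted as follows; the one-dimensional sine data and the values of C, ε, and γ are illustrative assumptions:

```python
# Fit an epsilon-insensitive SVR with an RBF kernel on a noisy sine curve.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# epsilon is the insensitive-loss tolerance of Equation (4), C the
# regularization weight, gamma the RBF width of Equation (7).
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)
model.fit(X, y)
score = model.score(X, y)   # R^2 on the training data
```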

3) ARTIFICIAL NEURAL NETWORKS
The artificial neural networks algorithm has powerful self-learning and self-adaptive abilities to reduce prediction error. Figure 3 shows the classic topology of an artificial neural network with three layers: an input layer, a hidden layer, and an output layer. Firstly, it is assumed that the numbers of input layer nodes, hidden layer nodes, and output layer nodes are N, M, and Q, respectively. Moreover, x_i represents the i-th input value in the input layer, w_ij represents the weight from the input layer to the hidden layer, w_jk represents the weight from the hidden layer to the output layer, and θ_j corresponds to the threshold value. The input information is first propagated forward from the input layer to the hidden layer, then the error is propagated backward according to the weight values and the transfer function f(x). Through repeated correction of the weight values, the predicted values gradually approach the actual values. During the whole process, x_j represents the output value of the j-th neuron in the hidden layer, which can be defined as follows:

x_j = f( Σ_{i=1}^{N} w_ij x_i − θ_j ). (8)

The output value of the k-th neuron in the output layer is written as y_k, which can be defined as follows:

y_k = f( Σ_{j=1}^{M} w_jk x_j − θ_k ). (9)

The network weights and the thresholds are adjusted by minimizing the mean square error (MSE), which is defined as follows:

MSE = (1/l) Σ_{n=1}^{l} (t_n − y_n)^2, (10)

where l is the number of training samples, t_n corresponds to the expected value of the neural node, and y_n denotes the predicted value. The MSE is reduced to a certain extent by repeating the back-propagation mechanism. Another stopping criterion of the algorithm is that the training time reaches its maximum.
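A three-layer network trained against the MSE criterion of Equation (10) can be sketched with scikit-learn's MLPRegressor; the synthetic linear target, hidden-layer size, and iteration cap are assumptions chosen for illustration:

```python
# Train a single-hidden-layer network and evaluate its training MSE,
# the quantity minimized by the back-propagation loop.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2 * X[:, 1] + 0.05 * rng.normal(size=300)

ann = MLPRegressor(hidden_layer_sizes=(10,),   # M = 10 hidden neurons
                   max_iter=3000, random_state=0)
ann.fit(X, y)
mse = np.mean((ann.predict(X) - y) ** 2)       # Equation (10) on training data
```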

C. META-LEARNING ALGORITHM
As the most efficient meta-learning algorithm, MLR combines weight values with the output values coming from the base-learning algorithms to obtain a further generalized result. The specific form can be expressed as follows:

y = Σ_k w_k x_k, (11)

where y denotes the final prediction result, w_k corresponds to the weight value, and x_k corresponds to the output value from the k-th base-learning algorithm. All of the weight values are obtained in the training process. By choosing appropriate weight values, MLR makes the sum of squared differences between the predicted values and the real values as small as possible:

min_w Σ_{i=1}^{n} ( y^(i) − Σ_{j=1}^{k} w_j x_j^(i) )^2, (12)

where n is the number of samples, k is the number of base-learning algorithms, y^(i) represents the real value of the i-th sample, and Σ_{j=1}^{k} w_j x_j^(i) represents the predicted value of the i-th sample.
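The least-squares problem in Equation (12) has a direct closed-form solution; the sketch below solves it with NumPy on synthetic base-learner outputs (the 0.5/0.3/0.2 combination weights are invented for the example):

```python
# Solve min_w sum_i (y_i - w . x_i)^2, where x_i stacks the three
# base-learner outputs for sample i.
import numpy as np

rng = np.random.default_rng(4)
base_outputs = rng.normal(size=(50, 3))        # columns: PLS, SVR, ANN outputs
y = base_outputs @ np.array([0.5, 0.3, 0.2])   # true combination weights

w, residuals, rank, _ = np.linalg.lstsq(base_outputs, y, rcond=None)
print(np.round(w, 3))  # → [0.5 0.3 0.2]
```

Because the synthetic target is an exact linear combination, the recovered weights match the true ones; with real base-learner outputs the fit is approximate.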

D. MODELING PERFORMANCE INDICES
To determine whether the final results have a better prediction accuracy, three evaluation indices, the determination coefficient (R^2), the mean absolute percentage error (MAPE), and the root mean square error (RMSE), are used in this work. They are calculated from Equations (13), (14), and (15), respectively:

R^2 = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)^2 / Σ_{i=1}^{n} (y_i − ȳ)^2, (13)
MAPE = (100%/n) Σ_{i=1}^{n} | (y_i − ŷ_i) / y_i |, (14)
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ), (15)

where y_i is the measured value, ŷ_i is the predicted value, and ȳ is the mean of the measured values.
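The three indices are straightforward to compute; the toy measured/predicted values below are invented purely to exercise the formulas:

```python
# Evaluation indices of Equations (13)-(15).
import numpy as np

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def mape(y, yhat):
    return 100 * np.mean(np.abs((y - yhat) / y))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

y = np.array([10.0, 12.0, 14.0, 16.0])
yhat = np.array([11.0, 12.0, 13.0, 17.0])
print(round(rmse(y, yhat), 3))  # → 0.866
```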

III. EXPERIMENTS
A. DESCRIPTION OF TWO DATA SETS
The modeling data used in this work include real wastewater data and simulation wastewater data. The real wastewater data were collected from a papermaking WWTP in China. The data collection system includes various probes, a signal acquisition card, and a power relay output board. As shown in Figure 4, the data include the following variables: wastewater flow rate (Q), influent suspended solid (SS in), effluent suspended solid (SS eff), pH, temperature (T), influent chemical oxygen demand (COD in), effluent chemical oxygen demand (COD eff), and dissolved oxygen (DO). The statistics are listed in Table 1. Among the eight variables, SS eff and COD eff are response variables (target values) which need to be controlled. Both COD eff and SS eff met the Chinese national effluent release standards. The rest of the variables (sample features) are explanatory variables which are closely related to the response variables. The specific correlation coefficients between the response variables and the explanatory variables are presented in Table 2. These variables determine the degradation efficiency of pollutants by affecting the microbial activity of the activated sludge, and thereby indirectly influence the trends of SS eff and COD eff. The simulation wastewater data were generated from benchmark simulation model no. 1 (BSM1), which is designed to simulate a wastewater treatment system. As shown in Figure 5, BSM1 consists of five biological reactors and one secondary sedimentation tank. BSM1 can provide diverse control strategies under three weather conditions: dry weather, rainy weather, and storm weather. In this article, the dry weather data were used for simulation. The sampling period is 14 days and the sampling interval is 15 minutes. Finally, 1345 samples were generated, each of which includes ten variables.
Table 3 displays the specific process variables among which effluent ammonia concentration (S NHeff ) and effluent nitrate concentration (S NOeff ) are response variables and the rest are explanatory variables.
At the beginning of the modeling process, Jolliffe's three-parameters method was used to detect the outliers [29]. Then, the processed data were normalized to ensure that the values of different features have the same dimension. The corresponding transformation formula is as follows:

X* = (X − µ) / σ, (16)

where X corresponds to the original sample vector, µ represents the mean value of the original sample vector, and σ is the standard deviation of the original sample data. Then, all the data were divided into training data and test data. For the real wastewater data, the first 120 samples were used as training data and the remaining 50 samples were used as test data. For the simulation wastewater data, the first 672 samples were used for training, and the remaining 673 samples were used for testing.
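The z-score normalization step can be verified on a toy vector (values invented for illustration); after the transformation, each feature has zero mean and unit standard deviation:

```python
# Z-score normalization as in the transformation formula above.
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])
mu, sigma = X.mean(), X.std()
X_norm = (X - mu) / sigma
```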

B. PARAMETER OPTIMIZATION
Although PLS has been successfully applied in many industrial processes, the calculation process for obtaining latent variables still lacks a uniform standard. In this work, with the help of the variable importance in projection (VIP) scores [30], latent variables with scores higher than 1 are considered valuable for modeling. The kernel function is of key importance in SVR. Among all kernel functions, RBF is the most commonly used one. Its main advantage is that even if prior knowledge of the data is absent, the prediction effect is still robust. Therefore, RBF was adopted in this work. The kernel parameter γ directly affects the complexity of the data distribution in the higher-dimensional space, and hence needs to be confirmed first. If γ is too small, the RBF kernel behaves like a linear kernel; if too large, it behaves like a polynomial kernel. Another important parameter of SVR is the regularization parameter C, also known as the penalty factor, which is used to achieve a compromise between empirical risk and confidence level. If C is too high, the SVR tends to overfit; conversely, a smaller C is often accompanied by underfitting. The grid searching method [31] was adopted to search for a suitable combination of C and γ in this work.
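The grid search over C and γ can be sketched with scikit-learn's GridSearchCV; the candidate grid values and the synthetic data below are illustrative assumptions, not the paper's tuning ranges:

```python
# Cross-validated grid search for the SVR hyperparameters C and gamma.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(120, 1))
y = np.sin(2 * X).ravel() + 0.1 * rng.normal(size=120)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
grid.fit(X, y)
best = grid.best_params_   # the (C, gamma) pair with the best CV score
```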
The key to constructing a well-performing ANN lies in finding a suitable network structure and activation function. Considering that the data size is not large, a three-layer network structure was chosen in this work. The number of hidden layer nodes and the activation functions of the hidden and output layers were selected according to the actual prediction performance. The specific parameters of the base-learning algorithms are shown in Table 4.

C. ENSEMBLE LEARNING MODELS COMPARISON
To compare SEL with other types of ensemble learning methods in terms of prediction accuracy, a random forest (RF) model based on the bagging algorithm and an adaptive boosting (AdaBoost) model based on the boosting algorithm were constructed using the same wastewater data. The base-learners of RF and AdaBoost are both decision trees. Considering that the prediction of wastewater effluent indices is essentially a regression task, the CART (classification and regression tree) was selected as the decision tree, which was not pruned during the modeling process. The main difference between RF and AdaBoost is that the former integrates decision trees in a parallel manner, while the latter integrates them in a serial manner. For RF, the main tuning parameters are the number of decision trees n_tree and the number of candidate variables at each split m_try. For AdaBoost, the main tuning parameters are the number of iterations n_T and the learning rate of the base-learners ν. All of the above-mentioned tuning parameters were obtained by the grid searching method.
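The two comparison models can be sketched as follows; the synthetic data and the specific values standing in for n_tree, m_try, n_T, and ν are illustrative assumptions, not the tuned values from the paper:

```python
# RF (parallel ensemble) and AdaBoost (serial ensemble) over unpruned
# regression trees, mirroring the comparison models described above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

# n_estimators plays the role of n_tree, max_features of m_try.
rf = RandomForestRegressor(n_estimators=100, max_features=2,
                           random_state=0).fit(X, y)
# n_estimators plays the role of n_T, learning_rate of nu;
# max_depth=None keeps the CART base-learners unpruned.
ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=None),
                        n_estimators=50, learning_rate=0.5,
                        random_state=0).fit(X, y)
scores = (rf.score(X, y), ada.score(X, y))   # training R^2 of each ensemble
```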

D. RESULTS AND DISCUSSION
To directly observe the prediction performance of SEL, we provide prediction figures for the real wastewater data and the simulation wastewater data. As shown in Figures 6 and 7, the data points of the two cases are both well modeled. The detailed evaluation results are listed in Tables 5 and 6. It can be seen that SEL achieves the highest prediction accuracy in both cases. For COD eff, compared with the base-learning algorithms PLS, SVR, and ANN, the RMSE of SEL is reduced by 13.97%, 6.18%, and 14.49%, respectively. For SS eff, the RMSE of SEL is reduced by 14.46%, 6.58%, and 10.13%, respectively. In terms of R^2, the prediction accuracy of SEL is also improved significantly, ranging from 5.88% to 26.32% for COD eff and from 6.25% to 21.43% for SS eff. Meanwhile, SEL also demonstrates the superiority of its ensemble strategy over RF and AdaBoost, achieving the minimum RMSE (4.25) and the maximum R^2 (0.72) for COD eff, and the minimum RMSE (0.71) and the maximum R^2 (0.68) for SS eff.
For S NHeff, compared with the base-learning algorithms PLS, SVR, and ANN, the RMSE of SEL is reduced by 63.53%, 49.18%, and 48.33%, respectively. For S NOeff, the RMSE of SEL is reduced by 57.14%, 42.31%, and 47.37%, respectively. In terms of R^2, the prediction accuracy of SEL is also improved significantly, ranging from 4.76% to 15.79% for S NHeff and from 4.88% to 10.26% for S NOeff. Compared with RF and AdaBoost, SEL has the best prediction results, achieving the minimum RMSE (0.31) and the maximum R^2 (0.88) for S NHeff, and the minimum RMSE (0.30) and the maximum R^2 (0.86) for S NOeff.
Considering that base-learning algorithms each have their own learning preferences, their predictive capability may be limited for the papermaking wastewater effluent indices. SEL first uses the original training set to train the base-learning algorithms, then uses the secondary training set generated by the base-learning algorithms to train the meta-learning algorithm. In other words, the output values of the base-learning algorithms are the input features of the meta-learning algorithm. Based on the prediction results of the base-learning algorithms, SEL achieves a better generalization for the effluent indices. In addition, compared with other ensemble learning methods, SEL has higher prediction accuracy. From the viewpoint of theoretical analysis, the superior prediction performance of SEL can mainly be attributed to the following points: (1) By exploring the feature space from different perspectives, SEL can provide a more comprehensive analysis of the complex characteristics in the wastewater data; (2) SEL makes use of the advantages of the different base-learning algorithms while avoiding their respective prediction drawbacks, so as to reduce the risk of trapping into local minima; (3) Compared with the base-learning algorithms, SEL expands the hypothesis space in the modeling process, which may be closer to the real hypothesis of the wastewater treatment process.
(4) From the viewpoint of the diversity and integration of base-learning algorithms, SEL can combine different types of base-learners corresponding to various base-learning algorithms, unlike other ensemble learning methods, and constructing a meta-learning algorithm is a more reasonable way than statistical averaging, which gives SEL a better generalization capacity.
Although SEL has tremendous potential for improving the prediction accuracy on wastewater effluent data, the parameter optimization step is time-consuming. In this work, there are three groups of parameters that need to be determined, which results in an increased running time of SEL, as shown in Table 7. The running time of SEL is mainly spent on the ANN base-learning algorithm. Because ANN uses the back-propagation mechanism for optimizing the hyper-parameters, the network weights and the thresholds in each neuron need to be revised several times to finally reach the precision requirement, which directly increases the total running time. In general, ensemble learning methods take more time than base-learning algorithms for training and prediction. Among the ensemble learning methods, the running times of RF and AdaBoost are relatively shorter than that of SEL. This is mainly because the base-learners of RF and AdaBoost are relatively simple, which gives RF and AdaBoost low computational complexity.
With the ever-increasing amounts of data in WWTPs, the complexity and running time will directly affect its practical application and further development. In future work, besides optimizing the execution efficiency of SEL, more combinations of base-learning algorithms will be studied. Various machine learning algorithms, such as Gaussian process regression, relevance vector machine, gene expression programming, and evolutionary polynomial regression, can be integrated with the existing SEL. Meanwhile, the data set should be expanded to further validate the prediction accuracy of SEL.

IV. CONCLUSION
In this work, stacking ensemble learning is used for modeling the papermaking wastewater treatment process. By predicting COD eff and SS eff from real wastewater data as well as S NHeff and S NOeff from wastewater simulation data, the proposed SEL method successfully interprets the complex characteristics of the wastewater data. The prediction hypothesis space of SEL is more comprehensive for the wastewater treatment process. During the prediction process, the meta-learning algorithm enables SEL to obtain a further generalized result, which makes use of the advantages of each base-learning algorithm while avoiding the risk of trapping into local minima. The simulation results show that SEL has better prediction ability than the base-learning algorithms, including PLS, SVR, and ANN, and than other ensemble learning methods, including RF and AdaBoost. Therefore, applying SEL to the real-time monitoring of wastewater treatment processes has practical reference value and significance. Future work will focus on improving execution efficiency. Besides, more base-learning algorithms should be tried for stacking ensemble learning, and the sample size will be further expanded to verify the applicability of the proposed method.

(Hongbin Liu and Chen Xin contributed equally to this work.)
FENGSHAN ZHANG received the Ph.D. degree from Nanjing Forestry University, Nanjing, China, in 2010. He is currently the Chief Engineer with Shandong Huatai Paper Company Ltd., Dongying, China. His research interests include process modeling and engineering.
MINGZHI HUANG received the Ph.D. degree from the South China University of Technology, Guangzhou, China, in 2010. He has been a Professor with South China Normal University, China, since 2017. He is the author of more than 100 journal articles. His research interests include process modeling, process monitoring, fault detection and diagnosis, and wastewater treatment.