A Feasible Data-Driven Mining System to Optimize Wastewater Treatment Process Design and Operation

: Achieving low costs and high efﬁciency in wastewater treatment plants (WWTPs) is a common challenge in developing countries, although many optimizing tools on process design and operation have been well established. A data-driven optimal strategy without the prerequisite of expensive instruments and skilled engineers is thus attractive in practice. In this study, a data mining system was implemented to optimize the process design and operation in WWTPs in China, following an integral procedure including data collection and cleaning, data warehouse, data mining, and web user interface. A data warehouse was demonstrated and analyzed using one-year process data in 30 WWTPs in China. Six sludge removal loading rates on water quality indices, such as chemical oxygen demand (COD), total nitrogen (TN), and total phosphorous (TP), were calculated as derived parameters and organized into fact sheets. A searching algorithm was programmed to ﬁnd out the ﬁve records most similar to the target scenario. A web interface was developed for users to input scenarios, view outputs, and update the database. Two case WWTPs were investigated to verify the data mining system. The results indicated that efﬂuent quality of Case-1 WWTP was improved to meet the discharging criteria through optimal operations, and the process design of Case-2 WWTP could be reﬁned in a feedback loop. A discussion on the gaps, potential, and challenges of data mining in practice was provided. The data mining system in this study is a good candidate for engineers to understand and control their processes in WWTPs.


Introduction
The wastewater collection and treatment in China have been dramatically developed since 2005, resulting in 54 billion tons of wastewater per year treated by over 3500 municipal wastewater treatment plants (WWTPs) in 2013 [1,2]. As the consequence of the increasing construction of WWTPs, a number of experienced engineers are required to operate the facilities effectively and safely. Meanwhile, the discharging criteria of WWTPs are tightened for stepwise objectives to protect the surface water quality in China [1][2][3]. Thus, process upgrading was stimulated to reserve more redundancy via longer retention time and larger volume of tanks than the necessity. The complicated process and unsatisfied operation triggered the significant increase of energy use in wastewater treatment facilities. Although the average energy intensity of the WWTPs in 2009 was as low as 0.254 kWh·m −3 [2], two-thirds of the WWTPs were in capacities of 1 × 10 4 −5 × 10 4 m 3 ·d −1 with electricity consumption in the range of 0.06−1.42 kWh·m −3 [1]. Moreover, the energy based on chemical oxygen demand (COD) removal  [4]. Thus, to improve the process design and operation in WWTPs is still a serious challenge in China. To deal with above challenges with limited human and capital resources in the low-income countries, a good solution is to transfer matured techniques from the scientific community to engineering practice, for example, the techniques of instrumentation, control, and automation (ICA), which had been well-established in the last century [5,6]. However, the attempts to apply ICA in WWTPs in China had encountered many challenges, for example, high investment but low maintenance of instruments, lack of skilled operators with ICA knowledge, conflicts between designers and operators, and so on. Currently, most of the central control systems in WWTPs in China are only capable of acquiring data rather than taking actions in the process [3]. Without necessary experience and sufficient technical supports, operators in WWTPs may often lose their trust on ICA system when failure responses appeared, for example, fault signals of instruments without regular maintenance. This critical situation requests more feasible methods to restore the engineers' trust by involving them in the technical development, rather than offering them a sound solution to plug and play.
Data mining, usually as the core component of a data-driven decision support system, follows the concept of knowledge discovery in databases and has gained increasing interest in optimizing WWTP processes [7]. Data mining contains a series of technologies to extract useful information from a large database or data warehouse to figure out hidden correlations and possible rules [7]. The data mining procedure is highly affiliated with engineers' operations and habits; thus, both scientists and engineers could share their interests on such a kind of optimizing tool. Data-driven models can support and deepen our understandings of process operations via various mathematical and empirical tools, including artificial intelligence, machine learning, induction learning, and statistical analysis [8,9]. Even using a very simple type of data mining, such as statistical analysis, could be effective to solve painful problems in practice [10].
Data mining in WWTPs has been intensively explored in the past, including pumping station control [11][12][13][14], aeration control [15][16][17], anaerobic digester operation [18][19][20], plant-wide process operation [20][21][22], and effluent water quality prediction [23][24][25]. The data-driven model is free of mechanism and lack of general descriptions of the process dynamics; thus, integrating data mining techniques with mathematic models and the expert systems has become a reasonable choice [26][27][28]. However, although scientific research provided sufficient powerful tools, process operators and managers hesitated to use the tools in their daily operations. As pointed out by a critic review of knowledge-based systems in WWTPs, currently matured and productive data-driven techniques are still limited in simple and classic models [28]. Thus, developing a simple and effective data mining system to satisfy the demands from process engineers is quite necessary to apply such a state-of-art scientific tool to deal with more critical and challenging process operations in WWTPs than ever.
In this study, a data mining system was implemented to optimize the process design and operation of WWTPs in China. In order to develop a feasible data mining system under current constraints, the following two features of the system were focused on: a simple and effective algorithm that is capable of running on the database from small scales to big data platforms; and a friendly web user interface (UI) for users to operate, maintain, and update the data warehouse online. An integral procedure of data mining was proposed for implementation, including data collection and cleaning, data warehouse, data mining, and web UI. A data warehouse was demonstrated on the one-year process data in 30 WWTPs. The performance of data mining was verified in two case WWTPs, in which the effluent water quality and process design were improved. Lastly, the gaps, potentials, and challenges of data mining system in practice were briefly discussed.  Figure 1 shows the conceptual scheme of a data mining system in WWTP. The system consists of data collection, cleaning, warehouse, and mining. First, the process information and raw data were collected from the WWTPs. Data cleaning rules were applied to identify the data missing events, typos, and other mistakes for data quality control. Second, after cleaning, the data were organized into a process database that consisted of the process information, state variables, water quality, and operational parameters. Six derived parameters as sludge removal loading rates were calculated from original variables (Sections 2.2 and 3.1). Third, a series of fact sheets were generated from the process data and compiled into a data warehouse for data mining. Five classification rules from expertise were used as constraints to accelerate the searching algorithm (Sections 2.3 and 3.2). Fourth, a searching algorithm was programmed to find out the best fits of the input scenario in the data warehouse. Usually, five records in the data warehouse with the highest matching rates were selected as the suggestions (Section 2.4). Finally, a friendly web UI was developed to assist the users to input target scenario, execute data mining, and interpret the system outputs (Section 2.5). The system outputs can be used for process design and operation, as explained in case studies in Sections 3.3 and 3.4.

Protocol of Data Mining System
Water 2018, 10, x FOR PEER REVIEW  3 of 14 collected from the WWTPs. Data cleaning rules were applied to identify the data missing events, typos, and other mistakes for data quality control. Second, after cleaning, the data were organized into a process database that consisted of the process information, state variables, water quality, and operational parameters. Six derived parameters as sludge removal loading rates were calculated from original variables (Sections 2.2 and 3.1). Third, a series of fact sheets were generated from the process data and compiled into a data warehouse for data mining. Five classification rules from expertise were used as constraints to accelerate the searching algorithm (Sections 2.3 and 3.2). Fourth, a searching algorithm was programmed to find out the best fits of the input scenario in the data warehouse. Usually, five records in the data warehouse with the highest matching rates were selected as the suggestions (Section 2.4). Finally, a friendly web UI was developed to assist the users to input target scenario, execute data mining, and interpret the system outputs (Section 2.5). The system outputs can be used for process design and operation, as explained in case studies in Sections 3.3 and 3.4.

Data Collection and Cleaning
The raw data were collected from the online database available in WWTPs, as well as the laboratory analysis in field surveys in necessity. The scope of the data collection covered the process information, operational parameters, state variables, influent and effluent water quality, and so on. For example, the process data may consist of hydraulic retention time (HRT), dissolved oxygen (DO), mixed liquor suspended solids (MLSS), flow rates of return sludge (RS) and mixed liquor (IR), chemical oxygen demand (COD), suspended solids (SS), and so on. Because of the different intervals of online data and laboratory analysis, daily average values were used in this study. An example of the raw data in one month is shown in Table S1.
Data cleaning mainly deals with the challenge of missing data. For example, the online data recording might be interrupted, some parameters might be absent in daily laboratory analysis, and outliers often occurred as a result of the process faults or instrumental calibration. There were many efforts against data loss by improving analytical tools of process parameters, for example, the various methods to monitor biological oxygen demand in five days (BOD5) [29]. However, very few of the conceptual sensors have been applied in practice. As one of the alternatives, influent BOD5 could be accurately predicted from regularly analyzed parameters using soft sensor algorithms like an artificial neural network (ANN) and multiple linear regressions (MLR) [23]. For simplicity, an equation like BOD5 = a•COD + b could be used to interpolate the missing data [30]. Moreover, a statistical method was helpful to remove the outliers by three times the standard deviation from the average values.

Data Collection and Cleaning
The raw data were collected from the online database available in WWTPs, as well as the laboratory analysis in field surveys in necessity. The scope of the data collection covered the process information, operational parameters, state variables, influent and effluent water quality, and so on. For example, the process data may consist of hydraulic retention time (HRT), dissolved oxygen (DO), mixed liquor suspended solids (MLSS), flow rates of return sludge (RS) and mixed liquor (IR), chemical oxygen demand (COD), suspended solids (SS), and so on. Because of the different intervals of online data and laboratory analysis, daily average values were used in this study. An example of the raw data in one month is shown in Table S1.
Data cleaning mainly deals with the challenge of missing data. For example, the online data recording might be interrupted, some parameters might be absent in daily laboratory analysis, and outliers often occurred as a result of the process faults or instrumental calibration. There were many efforts against data loss by improving analytical tools of process parameters, for example, the various methods to monitor biological oxygen demand in five days (BOD 5 ) [29]. However, very few of the conceptual sensors have been applied in practice. As one of the alternatives, influent BOD 5 could be accurately predicted from regularly analyzed parameters using soft sensor algorithms like an artificial neural network (ANN) and multiple linear regressions (MLR) [23]. For simplicity, an equation like BOD 5 = a·COD + b could be used to interpolate the missing data [30]. Moreover, a statistical method was helpful to remove the outliers by three times the standard deviation from the average values.

Data Warehouse Construction
The data warehouse is organized in a database with fact sheets for searching the best matches of the input scenario (Supplementary M1 and Tables S2-S5). Six derived variables (Equation (1)), as sludge removal loading rates (L s ), were calculated from the original variables in the process database (Table S6). The water quality indices for calculation consisted of COD, BOD 5 , SS, ammonia nitrogen (NH 3 -N), total nitrogen (TN), and total phosphorus (TP). Only the derived variables instead of all the original variables in the data warehouse were used for data mining, in order to simplify the complexity of the data searching algorithm.
where L s,i is the sludge loading rates as the ith derived variable; Q is the influent flow rate; V is the volume of bioreactors; HRT is the hydraulic retention time of the process, which equal to V/Q; MLSS is the concentrations of mixed liquor suspended solids in the bioreactor; C inf,i and C eff,i are concentrations of ith derived variable in influent and effluent, respectively. The index i is in range of 1−6, representing COD, BOD 5 , SS, NH 3 -N, TN, and TP, respectively. Classification of WWTPs in the data warehouse was applied to set constraints to narrow the data searching field during data mining. Instead of searching all the records in the data warehouse, only the records of WWTPs in the similar classification with the input scenario were firstly searched for matching the user's input. With valuable suggestions from both scientists and engineers, five classification rules were selected to group the WWTPs in the data warehouse (Table S7) where I, F, and D represent the data organization for input scenario, fact sheets, and data warehouse, respectively. WTP j is the information of the jth WWTP, PR m is the value of the mth original process variable in the date t, C k is the value of the kth classification index, x i is the ith derived variable (as sludge removal loading rate) calculated from the input scenario, and (t, y i ) is the ith derived variable calculated from the original variables in data warehouse at date t. There are 30 WWTPs, 6 derived variables, 5 classification indices, and 24 original parameters in the demonstrated data warehouse. Thus, subscript i, j, k, and m in this study are 6, 30, 5, and 24, respectively.

Data Mining
The simple idea of data mining in this study is to dig out the most similar records in the data warehouse to the user's scenario. Here, the similarity was estimated by the Manhattan distance of the six derived variables as sludge removal loading rates (Table S6). An input scenario from different users generally consisted of process information, influent characteristics, and effluent criteria. The five classification indices and six derived variables could be determined by the process shown in the Supplementary Method M1.
The data mining algorithm was given in Supplementary Method M1 and Figure S1. Briefly, it includes six steps. First, penalty function of the input scenario (Equation (3)) was given according to the five WWTP classification rules, which helps to change the optimization with constraints to a non-constrain minimization question. Second, the Manhattan distance between the input scenario and each record in the data warehouse was calculated through Equation (4). Third, the sum of the Manhattan distance and penalty function was determined as the objective function via Equation (5). Fourth, the five best-hits were sorted up using the quick sorting of the objective function values. Fifth, the derived parameters of the best-hits were normalized into a single matching rate in percentile (Equation (6)), which was ranked again through sorting algorithm to obtain the final output. At last, the five best-hit records were shown in the sequence of the matching rates.
where W, E, OF, and T are the penalty, Manhattan distance, objective function, and matching rate, respectively; min n=5 means searching five records with minimized OF values; x and y represent values of second parameters in the input scenario and data warehouse; i is the index of second parameters and ranged from 1 to 6; j is the index of records in the data warehouse and reaches up to 10,000; g is the penalty constant, which is 1 in case of true classification (e.g., processes are same) and 0 if false; k is the index of classification rules from 1 to 5; n is the index of the best hits from 1 to 5; y i,n is the nth hit in the outputs; w i is the weight for the ith derived variable and is 1 in this study; and T n in percentile is the matching rate of the nth hit to the input scenario.

Web User Interface
A web-based and friendly UI of the data mining system was developed as a website application by a commercial coding company in Beijing of China. The data mining software consists of a server subsystem and a client subsystem, the structure of which is shown in Figure S2. The server subsystem was used for data input, storage, calculation, and data warehouse design, and the client subsystem was designed for the users to input WWTP information, to execute the sorting, and to review the output parameters. The client subsystem was developed in HyperText Markup Language (HTML 5), and the server was programmed using the MongoDB Application Programming Interface (API). A web UI as shown in Figure S3 offered users to input the scenario and review the outputs on the website.

Verification by Case Study
Two WWTPs in Wuxi City, Jiangsu Province of eastern China were selected to verify data mining. Case-1 WWTP has the total capacity of 45,000 m 3 ·d −1 using the anaerobic-anoxic-oxic (AAO) process. The industrial wastewater accounts for 15% of the influent, the variation coefficient of which was designed as 1.3. The process operation was poor because of the overloading of influent and lack of experienced engineers, resulting in frequent violation of the discharging criteria. Thus, the purpose of data mining is to optimize the process operation to improve the effluent water quality.
Case-2 WWTP has a total capacity of 40,000 m 3 ·d −1 and industrial wastewater is about 40% of the process influent. The WWTP was requested to upgrade the conventional activated sludge (CAS) process to AAO process for the sake of enhancing biological nutrient removal to meet the Grade I-A criteria. Thus, data mining in this case is to assist the process designers to determine the key process parameters. The influent water quality and discharge criteria for the two cases are shown in Supplementary Table S8.

Data Collection and Cleaning
In total, 30 WWTPs in a good operational state were selected to build the data warehouse for demonstration (Table S9). The dominant process type was AAO, which was operated in 18 WWTPs. SBR, OD, and MBR processes were used in six, four, and two WWTPs, respectively. In the 30 WWTPs, there were 11 WWTPs that had no primary clarifier, 10 WWTPs that had an industrial influent ratio higher than 40%, and 21 WWTPs that were located in eastern China. The daily process data of the 30 WWTPs in 2014-2015 were collected to prepare a database with original variables.
Missing data were recognized and compensated as described in Section 2.2. For example, BOD 5 was often not recorded on weekends, thus a linear correlation between BOD 5 and COD was used to interpolate the missing data. The ratio of BOD 5 /COD is varied in a small range in WWTPs [1,3], for example, 0.3−0.6 in influent and 0.2−0.5 in effluent of the WWTPs in China [30]. Assuming BOD 5 /COD as constant in a short period such as in a week, the missing BOD 5 can be approximated from the COD value in the same day.
After the data cleaning, the original variables were cross-checked to identify possible mistakes like typos. The first rule is that partial concentration should be lower than the total one, for example, BOD 5 < COD, NH 3 -N < TN, and PO 4 -P < TP. The second rule is that ratios of influent concentrations of combined water quality indices over COD were in a reasonable range. For example, the ratio of BOD 5 /COD, SS/COD, TN/COD, and TP/COD can be used to identify the abnormal inputs and typos.

Data Warehouse and Analysis
The demonstrated data warehouse contained about 10,000 pieces of records in one-year daily data of 30 WWTPs. Each record documented 30 variables including 24 original variables and 6 derived variables (Supplementary Method M1). Figure 2 shows the bi-correlation of the derived variables in the data warehouse. Generally, a weak correlation existed among COD, BOD 5 , and SS. The correlation is related to the COD components, in which biodegradable suspended solid is the major component in influent COD. TN had a weak correlation to NH 3 -N because of the similar reason between COD and BOD 5 . Although the derived variables showed weak and positive bi-correlation, all the plots had a dispersed distribution of points, indicating that the six derived variables could be approximated as independent variables for data mining.
When calculating the Manhattan distances, setting x as zero in Equation (3), we can get the maximum Manhattan distance, which represents the essential operational state of the WWTP process. Thus, the deviation of the maximum Manhattan distances of each WWTP reflects its unique feature. As shown in Figure 3, the absolute value of the maximum Manhattan distance indicates the influent loading level of the process, whereas the box shape implies the dispersion and limits of the loadings. In the Figure, No. 11 WWTP had the largest height of the box, which implied a strong variation of influent loadings. The process applied in No. 11 WWTP was Orbal oxidation ditch with the capacity of 50,000 m 3 ·d −1 . Because of the large influent fluctuation, a surface aeration control system had been developed using an optimal energetic model to compromise the aeration and impelling of the aerators [31]. The smallest height of the box was in No. 5 WWTP, which applied AAO process with almost constant influent at 2000 m 3 ·d −1 . All the plants had some outlier points in the cross label in the figure, indicating possible abnormal operational states during the period, for example, shocking influent flow rate, very high COD and SS influent due to illegal discharge, and so on.
The correlation is related to the COD components, in which biodegradable suspended solid is the major component in influent COD. TN had a weak correlation to NH3-N because of the similar reason between COD and BOD5. Although the derived variables showed weak and positive bi-correlation, all the plots had a dispersed distribution of points, indicating that the six derived variables could be approximated as independent variables for data mining. When calculating the Manhattan distances, setting x as zero in Equation (3), we can get the maximum Manhattan distance, which represents the essential operational state of the WWTP process. Thus, the deviation of the maximum Manhattan distances of each WWTP reflects its unique feature. As shown in Figure 3, the absolute value of the maximum Manhattan distance indicates the influent loading level of the process, whereas the box shape implies the dispersion and limits of the loadings. In the Figure, No. 11 WWTP had the largest height of the box, which implied a strong variation of influent loadings. The process applied in No. 11 WWTP was Orbal oxidation ditch with the capacity of 50,000 m 3 •d −1 . Because of the large influent fluctuation, a surface aeration control system had been developed using an optimal energetic model to compromise the aeration and impelling of the aerators [31]. The smallest height of the box was in No. 5 WWTP, which applied AAO process with almost constant influent at 2000 m 3 •d −1 . All the plants had some outlier points in the cross label in the figure, indicating possible abnormal operational states during the period, for example, shocking influent flow rate, very high COD and SS influent due to illegal discharge, and so on.     Figure 4 shows the operational parameters of Case-1 WWTP. In December 2015, the influent flow rates fluctuated in the range of 55,000-89,000 m 3 ·d −1 , with the average at 65,000 m 3 ·d −1 . The volatile suspended solids in mixed liquor (MLVSS) and MLSS were close to 650 and 2400 mg·L −1 , respectively. The typical DO concentration in the aerobic tank was 2.1 mg·L −1 . The effluent water quality of COD, TN, and NH 3 -N was about 50, 25, and 12 mg·L −1 , respectively (Figure 4a-d), which often violated the Grade I-A discharging criteria (Table S8). including process parameters (ORP, DO, MLSS, RS, and IR), HRT (anaerobic, anoxic, and aerobic), and effluent quality (COD, SS, TN, TP, and NH3-N). The possible suggestions for process operations were identified by comparing the current state and the recommended states. For example, the activated sludge should be increased from 2400 mg•L −1 to over 4000 mg•L −1 , which was possible by reducing the discharge of excessive sludge or by adding coagulants before the second clarifier to enhance the sedimentation. Moreover, the DO concentration should be increased to be higher than 3 mg•L −1 to enhance the nitrification, which could be achieved by supplying more air flow to the bioreactors. The flow rates of RS and IR can also be optimized according to the recommendation. After one month of practicing (Figure 4e−h), the MLSS was improved to over 5000 mg•L −1 and effluent water quality indices were better than the discharging criteria.   A data mining system was used to assist the operators to optimize the process operation. The input parameters included influent water quality, process type, volumes of reactors, MLSS, and discharging criteria. The output data (Table 1) consisted of five best-hit records to the input scenario, including process parameters (ORP, DO, MLSS, RS, and IR), HRT (anaerobic, anoxic, and aerobic), and effluent quality (COD, SS, TN, TP, and NH 3 -N).

Data Mining for Operational Optimization
The possible suggestions for process operations were identified by comparing the current state and the recommended states. For example, the activated sludge should be increased from 2400 mg·L −1 to over 4000 mg·L −1 , which was possible by reducing the discharge of excessive sludge or by adding coagulants before the second clarifier to enhance the sedimentation. Moreover, the DO concentration should be increased to be higher than 3 mg·L −1 to enhance the nitrification, which could be achieved by supplying more air flow to the bioreactors. The flow rates of RS and IR can also be optimized according to the recommendation. After one month of practicing (Figure 4e-h), the MLSS was improved to over 5000 mg·L −1 and effluent water quality indices were better than the discharging criteria. Note: a ORP is oxidative-reduction potential in mV; DO is dissolved oxygen in mg·L −1 ; MLSS is mixed liquor suspended solids concentration in mg·L −1 ; RS is the ratio of return activated sludge over influent flowrate; IR is the ratio of internal return of mixed liquor over influent flow rate. b HRT is the hydraulic retention time of bioreactor in h. Ana, Ano, and Oxic represent the anaerobic, anoxic, and aerobic zones in the bioreactor. c COD, SS, TN, NH 3 -N, and TP are water quality indices for chemical oxygen demand, suspended solids, total nitrogen, ammonia nitrogen and total phosphorous, respectively. d T is the matching rate as shown in Equation (6), 100% means a perfect match. Table 2 shows the output of the data mining system for the process design in Case-2 WWTP. HRT, RS, and IR were almost the same in the five best-hits, implying that the designed process parameters are similar to the configuration of the WWTPs in the data warehouse. Effluent water quality of COD, SS, and TP was also similar in the output, indicating effective removal of such contaminants by the designed process. However, there were still obvious differences between input and hits on operational parameters (ORP, DO, and MLSS) and effluent water quality (TN and NH 3 -N), especially after considering the seasonal effects. Values of ORP, DO, and MLSS in winter were all higher than those in summer, partially because of the excessive aeration and accumulated biomass in the bioreactors. Such a process operation is very popular in practice to enhance the nitrification, the efficiency of which would be inhibited by the low temperature in winter. In general, the optimal strategy in the mind of process operators can be reflected but also hidden in the records of the data warehouse.

Data Mining for Process Design
The suggestions on HRT are also important for process designers because HRT can determine the size of tanks, which is strongly associated with capital investments. The operational parameters in the output suggested flexible and resilient process designs, which were helpful to provide possibility and convenience to operators to control the process. Using the data mining output as a feedback suggestion to the proposed process design, the process designers are able to improve their design works continuously.

Gaps between Scientific Knowledge and Practical Demands
The current studies of data-driven models in WWTPs mainly focused on intelligent algorithms. For example, five optimization algorithms have been compared to predict effluent suspended solids [24], including multi-layered perceptron, k-nearest neighbor, multivariate adaptive regression spline, support vector machine, and the random forest. Similar optimizing algorithms have been applied to control pumps to save energy, for example, data-driven prediction model [11] and neuro-fuzzy inference system algorithm [12]. Process operation and control was also a concern of the researchers. A data mining method has been already developed in anaerobic treatment processes to predict methane production [19], for example, using an adaptive neuro-fuzzy inference system algorithm [32] and neural network with a particle swarm optimization algorithm [18]. The algorithm using self-organizing maps (SOM) and neural gas (NG) was used to optimize plant-wide process control [33]. Although accurate models and various systems have been developed to optimize the WWTP process, most tools have not been applied by the process operators and managers in China. They seemed to be frequently challenged by uncertain effluent water quality and maintenance of instruments, thus ignored learning and applying advanced techniques.
A feasible solution to heal the gaps between scientific knowledge and practical demands is to screen the matured techniques from candidates. A critic review of knowledge-based techniques in WWTPs suggested simple but productive techniques using a Gartner hype cycle graph, including classic intelligent models (artificial neural network and fuzzy logic), simple statistical analysis (regression, principal component analysis, and partial least square fitting), and non-smart tools (mass balance and control chart) [28]. The above techniques have been well-established and applied in practice. For example, an artificial neural network with rule induction has been successfully verified in the TELEMAC project of the European Union, which improved the control efficiency and sensor reliability in WWTPs [19]. Principle component analysis has been used to decompose the multi-dimensional data for process visualization in order to diagnose process faults [19]. Partial least square fitting has already been popularly applied in sensor calibration [34]. Clustering based on rules has been used to integrate artificial intelligence and statistic tools into an explicit knowledge model to supply a holistic operation and description of the process [21]. In general, face to face between scientist and engineer is necessary to diffuse the state-of-art tools, such as data mining, to the designers, operators, and managers in WWTPs [35], which is also the incentive of this study.

Advances in the Data-Driven Models in WWTPs
Opposite to mathematical models, which ignore unimportant physical and anthropic details during the abstract of the physical world, data-driven models describe process dynamics from the process states containing sophisticated effects from diverse sources. Data-driven models have been examined in certain units of WWTPs, for example, data mining coupled with an artificial immuno-neural network to control pumping stations [11,12], and neural network-particle swarm optimization to maximize methane production in anaerobic digester [18]. Soft-sensors have been also developed via data mining to estimate the indirect signals, for example, a BOD soft sensor was developed to avoid absence of aeration in bioreactors [23]. Data mining techniques had the functions of illustrating trends of time-series variables, predicting effluent water quality, and suggesting optimal [25]. An impressive example was using time-series data mining to estimate the leakage of the sewer system, which has been successfully implemented in Brussel City [36].
Although data-driven models could favor process optimization and control, risks and faults might also be introduced to the process from improper records in data warehouse. The data-driven models were free of mechanism and lack of general description of the process dynamics; thus, integration of mechanism models, data-driven techniques, and expertise were in urgent need. The description of processes could be more accurate and robust by using an integral system rather than individual techniques, which has been verified in two full-scale WWTPs in 2002 [26]. Another interesting attempt to deal the challenge was to apply the multi-agent algorithm to extract data from the database to generate real-time decisions autonomously [37]. It was still necessary for an expert system to be online 24 h/7 d to notice the operators and supervise the decision-making process. In summary, the fusion of data-driven models with mathematic models and the expert systems represent a promising direction to explore new solutions to optimize the process in WWTPs.

Potential of Data Mining in the Control of WWTP
Aeration and pumping occupied about 50% and 15% of total electricity consumption in typical municipal WWTPs, respectively [3], demanding optimal control of the two units. The potential of data mining techniques on process control in WWTPs has been revealed by previous studies [28], but most currently available studies remained in the stage of process simulation and theoretical analysis. For pumping station control, optimization algorithms included artificial neural network with mixed-integer nonlinear programming [13], artificial neural network with particle swarm optimization [14], and artificial immuno-neural network with diverse optimization algorithm [11,12]. For aeration process control, optimal airflow could be determined from historical data through data mining [15], for example, using autonomous regression spline model to achieve very low aeration intensity in a WWTP in the Mideast area [16]. Another example was using a model-free reinforcement learning agent to optimize the DO set points in bioreactors [17]. Although the performance of the above systems was inspiring, it was quite difficult to apply the data-driven models in practice because of the sophisticated calculation and background knowledge required. A successful example might be the simple fuzzy logic model with a web UI in a WWTP in German, where operational decisions on pumps were generated daily to save energy up to 18.5% [22].
The simplest model performs the best in practice. A statistical analysis revealed that optimal control of aeration and pumps could achieve the sustainable goal of 20-40% energy saving [38]. In actual fact, the simple and classic algorithms worked well in WWTPs [39], for example, feedforward-feedback aeration control in MBR saved 20% of energy consumption [40] and a surface aeration control in OD achieved energy saving of 10% [31]. Energy saving of up to 25% was reported using a decision support system with daily energy auditing [41]. The system consisted of charts, key performance indices, scenario analysis, fully logic induction, and supports from experts. The simple and effective data mining system in this study is also a meaningful step of marching through the Gartner hype cycle graph of knowledge-based techniques in WWTPs [28].

Challenges and Perspectives
Although the data mining system has good potential in practice, there are still serious challenges with its application. First, the performance of data mining strongly depends on the data volume and data quality. The data warehouse demonstrated in this study only emphasized the removal efficiency, which was the current concern of the operators. Considering the new ambitions of WWTPs such as energy-neutral operation, more valuable variables in high data quality need to be included in the data warehouse to achieve solid analyses and generate multiple suggestions. For example, energy performance indices, process variables (SRT, oxygen uptake rates, etc.), and environmental impacts (biogas production and various emissions) could be selected as new derived variables for searching. The open structure of the database in this study allows extension of new variables in the data warehouse.
Second, the uncertainty of data mining should be concerned and clarified. The outputs of data mining were strongly related to the structure of data warehouse. For example, the dominant scenarios in the data warehouse in this study were AAO process (18 out of 30) and location of eastern China (21 out of 30), which were in the simlar classification to the two case WWTPs. This accordance supported the data mining to cover enough records in the data warehouse. For other processes or locations, data mining might fail because of the limited volume of records. Moreover, the noise in the process data without careful identification, such as operational faults and improper decisions, may introduce disturbances into the operation. Data quality in the data warehouse should be more of a concern in future studies, requiring new methods to quantify and control the quality. Anyway, data quality control and data volume expansion usually had to be compromised in a limited investment in practice.
As the experience of this study, data mining in about 7000 records (one-year data of 18 WWTPs) can generate a reasonable suggestion, although it might be far from an accurate suggestion. With the increasing volume and quality of the data warehouse, the data mining outputs could be more accurate and reliable than the current stage.
Third, data mining needs to be oriented to and collaborative with the process operators. Current data mining techniques are not smart enough to deal with sophisticated scenarios. For example, it is easy to extract the seasonal variation from influent flow by cluster analysis [42], but it is quite difficult for data-driven models to identify process malfunction or wrong operation. WWTPs are difficult to control because of their complicated, dynamic, and large delay features [3], thus subjective judgment from engineers' experience is necessary to find out the solution. Joining the process engineers in the development of a data mining system can build the trust of process engineers on the techniques, making the data mining more acceptable and applicable in practice.

Conclusions
In this study, data mining technology has been applied to optimize the design and operation of wastewater treatment processes. An integral procedure was proposed to implement the system, which includes data collection and cleaning, data compiling and warehouse, data mining, and web UI. Currently, a demonstration data warehouse was established from the process data in 30 WWTPs in China. Sludge removal loading rates of key water quality indices were selected as derived parameters for data mining. The bi-correlations and distribution properties of derived parameters in the data warehouse were illustrated and analyzed. As a decision support tool for operators and designers, the data mining system was verified in two case studies, including one case for optimizing process operation and one case for improving process design. A detailed discussion was provided on the gaps, potentials, and challenges of data mining, as well as other data-driven models, in practice. The data mining system in this study is a good example and also a verified solution for engineers to understand and control their processes in WWTPs.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2073-4441/10/10/1342/ s1, Method M1: Organization of data warehouse and an example of data mining; Method M2: Data mining algorithm; Figure S1 Brief description of the algorithm of data mining; Figure S2 User interface (UI) of the web application to achieve human-machine interface; Figure S3 Typical software interface of the supervisory data mining system for WWTPs; Table S1 An example of collected process raw data from WWTPs; Table S2 Process information of the WWTPs in data warehouse; Table S3 The classification from the process information of the WWTPs; Table S4 The original variables in records of the data warehouse after data cleaning; Table S5 The derived variables in records of the data warehouse for searching;