A Cluster-Stacking-Based Approach to Forecasting Seasonal Chlorophyll-a Concentration in Coastal Waters

Prediction of Chlorophyll-a (Chl-a) concentration is significant for marine ecology and environmental protection. This paper presents an integrated approach to forecast seasonal Chl-a concentration in coastal waters. Before modeling, feature construction procedures, such as simplification, combination, and normalization, are conducted to identify the potentially significant features. The feature extraction method based on Random Forest (RF) and eXtreme Gradient BOOSTing (XGBoost) is applied to select relevant variables. Then, we propose a Cluster-stacking-based approach which includes a station-oriented clustering model and a stacking-based regression model. The former model is used to divide the observation stations into several groups, thus partitions the study region into several sub-regions and the study dataset into several subsets according to the corresponding stations. In each subset, single regression models including K-Nearest Neighbor (KNN), Support Vector Regression (SVR), Multi-Layer Perceptron regression (MLP) and XGBoost are established in level 0 space and integrated by RF in level 1 space via stacked generalization. We compare the performance of the Cluster-stacking model with that of Cluster-KNN, Cluster-SVR, Cluster-MLP, Cluster-XGBoost and the regression stacking model without cluster. The model evaluation shows that the Cluster-stacking-based approach outperforms others in forecasting Chl-a concentration with a coefficient of determination (R2) of 0.848 and a mean absolute error (MAE) of 0.665 ug/l.


I. INTRODUCTION
Since phytoplankton is a fundamental component of oceanic food webs [1], Chl-a concentration which is considered as an indicator of phytoplankton biomass reflects the primary productivity of waters. The changes of Chl-a concentration reflect the dynamic changes of water physical and chemical properties [2]. Therefore, the prediction of Chl-a concentration is of great significance for marine ecology and environmental protection [3], such as preventing the red tides and improving water quality. Hence, an effective method to forecast Chl-a concentration is urgently needed to achieve an early warning.
Currently, researches on Chl-a concentration forecasting are commonly based on data-driven methods. Previous The associate editor coordinating the review of this manuscript and approving it for publication was Jing Bi . studies [2], [4] have shown that Chl-a concentration is related to a variety of different water environmental factors, such as temperature, nutrients, and human activities. The datadriven approach is to study the regularity of Chl-a concentration and other environmental factors from observed data. Zeng et al. [5] applied an artificial neural network model and a multi-varieties regression model with weekly collected data to predict Chl-a concentration in urban lakes. Liu et al. [4] used a multivariate statistical approach to model the influence of 18 chemistry variables on Chl-a in a lake with monthly collected data. Malek et al. [6] accessed four predictive ecological models, including fuzzy logic, recurrent neural network, hybrid evolutionary algorithm, and multiple linear regressors to forecast Chl-a concentration with biweekly collected data in a lake. However, the above studies did not achieve ''real'' predictions. That is, they just explored the relationship between Chl-a concentration and other water quality environmental variables in the same period but did not make future predictions, thus they could not give an early warning.
To forecast days-ahead Chl-a concentration and give an early warning, some studies predicted Chl-a concentration of the next periods based on historical data of related factors. For example, Wang et al. [7] designed a Chl-a predicting model with dynamic neural network models in a simulated indoor environment. The prediction performance of the proposed models was examined from 1-hour ahead to 35-hour ahead. Park et al. [8] gave a 7-day early warning prediction of Chl-a concentration in reservoirs using weekly water quality observation data. Their method is based on artificial neural networks and support vector machines. Besides, many other machine learning approaches such as backpropagation artificial neural network, support vector regression and random forests have also been applied to forecasting Chl-a concentration in lakes and rivers [9], [10].
This paper focuses on seasonal Chl-a concentration in coastal waters which is different from indoor environments, reservoirs, rivers or lakes. Due to the high cost of the data collection, the acquisition of monitoring data is limited [11]. As Huang [12] estimated, long-term and large-scale analysis can describe the developing trend of Chl-a concentration better than small range data. Besides, since Chl-a biomass in ocean surface is regulated by a complex interaction of physiological, oceanographic and ecological factors [13], to get an accurate prediction, we applied as many relevant factors as possible in experimental data. Based on these datasets, we propose an effective hybrid approach to give a one-seasonahead prediction of Chl-a concentration from four-season historical observation data.
Hybrid models have been applied in many industries. Most of these models can be regarded as a combination of different data preprocessing, modeling and optimization techniques [14]. For instance, to forecast the exchange rate, Shen et al. [15] proposed an improved deep belief network (DBN) combined with continuous restricted Boltzmann machines, which updated the classical DBN to fit continuous data. In the field of air pollution prediction, Zhu et al. [16] proposed a hybrid model based on complementary ensemble empirical mode decomposition, particle swarm optimization and gravitational search algorithm, support vector regression, generalized regression neural network and grey correlation analysis, to forecast daily PM 2.5 concentration. Bi et al. [17] presented an integrated forecasting method equipped with noise filtering and data frequency representation to predict the amount of workload in future time slots. Although these methods cannot be directly applied to the prediction of Chl-a concentration, they provided approaches to integrating different models and avoiding the insufficiency of single ones.
The Chl-a concentration forecasting approach presented in this paper, the Cluster-stacking-based approach, combines the advantages of both station-oriented clustering and stacking-based regression to improve forecasting performance. The major contributions of this work are as follows.
(1) It proposes a feature extraction procedure for coastal water quality factors including simplification, combination, normalization, and importance analysis.
(2) It proposes a station-oriented clustering model to group the water quality observation stations. By means of clustering, the comprehensive influences of biological, physical and chemical reactions in waters are reflected to that of geographical differences.
(3) It presents a regression-stacking model to deal with regressors, thus the forecast of Chl-a concentration can avoid the insufficiency of single models and achieve an acceptable performance.
(4) It proposes an integrated approach to forecasting Chl-a concentration in coastal waters that includes the station-oriented clustering and the stacking-based regression. The extensive experimental results show that the Clusterstacking-based approach achieves better performance than single models with or without clustering.
The rest of this paper is organized as follows. Section II describes the study region, data source, and data preprocessing. Particularly, data preprocessing includes simplification, combination, normalization, and feature importance analysis. Section III details the Cluster-stacking-based approach which includes a station-oriented clustering model and a stackingbased regression model. The former model is applied to group the observation stations and the latter one is used to predict Chl-a concentration in each group. Section IV discusses the experimental results. Section V concludes the paper and suggests the directions for future works.

A. STUDY AREA AND DATA SOURCE
The study area of this paper is Weihai sea area (latitude 36 • 30 N to 37 • 45 N and longitude 121 • 15 E to 122 • 50 E), which is located in the eastern part of Shandong Peninsula, China [18]. Weihai coastal region has many industrial enterprises and it is also a significant region in terms of population, agriculture and marketing systems [19]. In this paper, we collected an amount of historical water quality monitoring data, including space-time factors, biological factors, chemical factors, physical factors, meteorological factors, hydrological factors, and some other kinds of water quality factors, from the 1st season of 2015 to the 4th season of 2018.
Chl-a distribution in the Weihai coastal area has prominent regional characteristics [20]. Fig.1 illustrates the general distribution of monitoring stations used in this study. As can be seen from Fig.1, the distribution of Chl-a near shore is higher compared with that far shore, particularly in Rushan. And the Chl-a concentration is similar in a relatively large area of seas, such as north of Huancui, east of Rongcheng, south of Wendeng, and south of Rushan.
Apart from spatial distribution characteristics, Weihai coastal region also has temporal distribution characteristics of seasonal average Chl-a concentration. As shown in Fig.2, Chl-a concentration in winter (the 4th season) is lower than that in spring, summer, and autumn. The possible causes  of this phenomenon are the influences of sea wind, cold water mass and atmospheric pressure in different seasons. The spatial and temporal characteristics of Weihai coastal waters make the prediction of Chl-a concentration possible. TABLE 1 lists the observed water indicators used in this research. Since the frequency of sampling is not fixed, usually one to three times a season, we then give the average value of each indicator seasonally to ensure the coherence of the dataset. The column Mean and Skew show the mean value and skewness coefficient of each indicator respectively.
Water indicators in this study can be categorized into spatial, temporal, biological, chemical, physical, meteorological, hydrological, and anthropic factors. For clarity, we locate the spatial-temporal feature by five parameters including longitude (x), latitude (y), sampling depth (Depth), year (Year), and season (Season) of the record date. The biological indicator, Chl-a, is the main object of the study. Physical and chemical indicators listed on the table (NO 2 -N, NO 3 -N, NH 4 -N, DIN, DIP, TN, TP, Si, pH, Sal, DO, COD, Petro, SS, T, Tr) have all been proved to be associated with water quality [4], [8]- [10], [21]. Apart from common physicochemical factors in related studies, some meteorological and hydrological characteristics such as seasonal average wind speed (WindS), wind direction (WindD), watercolor (WaterC), weather (weather), sea condition (SeaC) were also contained. To be specific, WindS and WindD are essential indicators for marine phytoplankton presenceabsence prediction during calm weather [22]. The amount of dissolved and particulate material influences WaterC [23]. weather is of four types, cloudy, fog, rainy, and sunny. According to the sea surface conditions, the shape of wave crests, the degree of ruptures, and the number of foam waves, SeaC is divided into ten levels (0-9).
A multitude of anthropic influences has significantly altered the environmental conditions and the diversity of marine biological communities [24]. However, understanding and predicting the combined impacts of single and multiple stressors are particularly challenging [21]. In this research, impacts of human were concluded as four indicators, including the population of the nearest town or street from the observation station (NP), function name of the location of the observation station (FN) and type of the place where observation stations are located (FL) and the type of the nearest sewage (ST). Six types of FL are agriculture, conservation, industrial, harbor, recreation, and reservation. Five types of ST are the land stream, rural drain, sea farming, town drain, and others.

B. DATA PREPROCESSING
Since nonlinear functions can describe most of the relationships between predictors and target values better than simple linear regressors [14], feature construction and feature extraction are needed before modeling. In this section, three different feature construction procedures, including simplification, combination, and normalization, are used to identify potentially significant features. Then, we applied a feature importance analysis method based on RF and XGBoost for feature extraction.

1) SIMPLIFICATION
We use the Individual Water Quality Index (IWQI) class [25] to evaluate the grades of single factor concentrations. Classification criteria used in this study are listed in APPENDIX and the calculation formula for IWQI is described as (1).
Given a water quality observation factor i, IWQI i is the desired IWQI of i. C i is the concentration of i. C Ui is the upper bound of factor concentrations of a specific water quality grade while C Li is the lower bound. IWIQ Ui and IWIQ Li are the corresponding individual water quality index values of C Ui and C Li , respectively. Then, we simplified the numerical IWIQ of different factors into four rates as IWIQ COD , IWIQ DIP , IWIQ DIN , and IWIQ Petro . Based on these, IWIQ Max is constructed, which is the max value of the four grades described above. Simplification factors are shown in TABLE 2.

2) COMBINATION
There are many parameters to assess water quality according to the national and international criteria and standards. It has been proved that nutrient concentrations and the ratio of them affect Chl-a concentration primarily [26]. Thus, various ratios of different types of nutrients were constructed to improve the precision of model forecasting. In our study, the combination method is described as (2), in which, a and b are two different nutrient concentration factors and a/b is the constructed feature represents a nutrient ratio. TABLE 2 also shows the constructed indicators.

3) NORMALIZATION
As can be seen in  and different order of magnitude. If the analysis is performed directly with original index values, the role of the highervalue indicators in the comprehensive analysis would be highlighted and the effect of lower-level indicators would be relatively weakened. Therefore, to ensure the reliability of forecasting, the original data needs to be normalized. We applied the min-max normalization method [27] described as formula (3) to deal with all features, including origin features and constructed features.
In (3), x * is the normalized value, x is the original value, x min and x max are respectively the minimum and maximum value of the corresponding feature. Fig.3 compares data distributions measured before and after min-max transformation. Taking COD as an example, Fig.3 shows two histograms of data distributions with a kernel density estimate and maximum likelihood Gaussian distribution fitting curve.

4) IMPORTANCE ANALYSIS
To select relevant variables, we acquire the importance scores of each indicator by two tree-based models. Let x = (x 1 , x 2 , · · · , x n ) be a vector of water quality parameters, where x i describes the samples of water quality observation factor i, i ∈ [1, n]. Appling XGBoost [28] and RF [29] models on x, we can get the importance scores of each indicator, as XG i and RF i . For each of these indicators, its importance score is the average of the scores acquired from the above two models [14]. The calculation method is described as (4), where XG i and RF i are the importance scores of water quality observation factor i calculated by XGBoost and RF, respectively. IM i is the importance score of factor i used in this study, which is the mean value of XG i and RF i . Then we dropped the least essential indicators to increase the computation speed and generalization performance. The two base models, XGBoost and RF, generate feature importance scores in different ways. For the XGBoost model, it generates importance score XG i based on the frequency at which x i is used to segment the study data during node segmentation of trees established [30]. RF calculates the importance score RF i by determining the average contribution of x i on each tree [31]. Calculating the average value of XG i and RF i can reduce the influence of the difference caused by different importance analysis methods.

III. METHODOLOGY
The proposed Cluster-stacking-based approach includes two parts, a station-oriented clustering, and a stacking-based regression.

A. STATION-ORIENTED CLUSTERING MODEL
Cluster technique can help to reveal hidden information in a large dataset, hence considering clustering results as input variables for a regression model can improve model performance [32]. In addition, since Chl-a in sea surface has geographical and dynamic features [33], it is necessary to segment the study region by grouping the stations. So, we first propose a station-oriented clustering model to cluster observation stations into several groups, by which, the study region can be partitioned into several sub-regions. The procedure is described as Algorithm 1.

Algorithm 1 The Station-Oriented Clustering
The input of Algorithm 1 contains a water quality observation dataset D = s 1 , s 2 , · · · , s q and a set of clustering methods M = m 1 , m 2 , · · · , m p , p ≥ 1, where s i , 1 ≤ i ≤ q, represents a time series data of seasonal average Chla concentration of a water quality observation station, and q is the scale of historical data used in this study; m i ∈ M represents a specific clustering method, and p is the total number of clustering methods used in this study. The output is a cluster set C = C 1 , C 2 · · · , C k best , where C i ∈ C contains several water quality observation stations, and k best is the optimal number of clusters. An appropriate number of clusters is one of the most influential factors on the results of clustering models. Thus we use Algorithm 1 to select not only the optimum clustering method m best but also the optimum number of clusters k best .
We first let k = 1, 2, · · · , √ q (Step 1). For k j ∈ k and clustering method m i ∈ M (Step 2): Train the clustering model by method m i in total k j clusters based on dataset D (Step 3). Then calculate the Calinski-Harabasz score (Score CH ij ) [34] of the model via (5), where k j is the amount of clusters, B m i k j is the covariance matrix between categories, W m i k j is the covariance matrix of intra-category data and tr is the trace of a matrix. A high CH score means a small covariance of data within categories and large covariance between them. The higher the value of Score CH , the better the performance of a clustering model will be (Step 4). Let Score CH ij p× √ q as a set of CH score got through the loop. Then compare the elements in it (Step 5) and determine the parameters m best and k best corresponding to the optimal clustering model, which is the model with maximum Score CH (Step 6). Divide observation stations into k best number of clusters by the model m best as C = C 1 , C 2 , . . . , C k best (Step 7).
In this paper, we set p = 3 and M = {m 1 , m 2 , m 3 } where m 1 to m 3 are respectively set to K-means Cluster, Agglomerative Cluster, and Gaussian Mixture Cluster. K-means Cluster is a distance-based clustering method [35]; the Aggregative Cluster is a hierarchical clustering method [36], and the Gaussian Mixture Cluster is based on Gaussian mixture method [37]. Each of the three approaches has its advantages for water quality observation data.
K-means model is applicable to cope with the feature of a large amount of water quality observation data because of its fast convergence with high efficiency. Compared with K-means, the Gaussian mixture cluster is more general and can form clusters of different sizes and shapes, which is conducive to clusters with a different number of observation stations. The main advantage of hierarchical clustering is that it can find the hierarchical relationship between clusters, which is beneficial to analyze observation stations with geographical proximity. According to the clusters of observation stations, the sampling dataset we used is thus divided into several smaller sub-datasets, which lays a foundation for the establishment of the regression model.

B. STACKING-BASED REGRESSION MODEL
To avoid the insufficiency of single regression model, stacking strategy that combines multiple regressors or classifiers for predictions has been proved helpful [38]. Hence, we propose a stacking-based regression model to forecast seasonal average Chl-a concentration from a water quality observation dataset. The principle of the integrated approach is to combine the predictions of several base learners to improve the robustness and generality of a single learner [39]. The stacking-based regression model can be concluded as Algorithm 2.
In Algorithm 2, the input contains a water quality observation dataset D = (x, y), in which x = (x 1 , x 2 , · · · , x n ) and y = (y 1 , y 2 , · · · , y n ). x i = x i,1 , x i,2 , · · · , x i,m represents a sample of water quality observation data including historical records of 4 seasons, x i,p , 1 ≤ p ≤ m, represents the value of a water quality factor, and m is the total number of water quality factors used in this study; y i is the observed value of Chl-a concentration corresponding to x i ; n denotes the scale of the dataset. In addition, the stacking-based regression model is composed of two levels. The input also contains a set of basic regression learning machines, where L j , 1 ≤ j ≤ k and a new one L new are applied in level-0 and level-1, respectively, k is the number of basic regression learning machines in level-0. The output is the predicted value of one-season-ahead Chl-a concentrationŷ = ŷ 1 ,ŷ 2 , · · · ,ŷ n , whereŷ i represents the predicted value of y i .
First, we divide the original dataset D into D train (80%) and D test (20%) randomly (Step 1), apply k-fold cross-validation to D train and divide it into k subsets as {D 1 , D 2 , · · · , D k } (Step 2). For each basic learning machine L j in level-0 (Step 3): Let dataset D j , 1 ≤ j ≤ k as a validation set, and D −j = D − D j as a training set in level-0 (Step 4). Then we train the regression model L j on dataset D −j (Step 5), and apply L j to forecast the samples in D j , recording the forecasting result asŷ (j train ). At the meantime, apply L j to forecast the samples in D test , recording the forecasting result asŷ (j test ) (Step 6). Reconstruct the forecasting results asŷ train andŷ test , and let it be a new training set and testing set for layer-1 respectively (Step 7). That is, D new = ŷ train ,ŷ test is the dataset used in level-1 space (Step 8). Applying a new base learning machine L new on D new in level-1 space, we can then obtain the forecasting valueŷ = ŷ 1 ,ŷ 2 , · · · ,ŷ n of Chla concentration (Step 9).
The stacking-based regression model combines KNN, MLP, SVR, XGBoost, and RF. In this paper, k-fold is set to 4, and the regression models above are respectively used as L 1 , L 2 , L 3 , L 4 and L new . The stacking modeling process is shown in Fig.4. Within the five models, KNN is a representative of ''lazy learning''. The complexity of the K-nearest neighbor regression is low [40], which is applicable to the water quality observation data whose volume increases year by year. MLP is based on the present understanding of the biological nervous system. It is a massive parallel system composed of many processing elements connected by links of variable weights [41] and it has been used widely for water quality factor forecasting [42], [43]. SVR is a model based on a Support Vector Machine to solve regression problems [44]. Regress analysis of SVR is performed via introducing a loss function to project the data into a higher-dimensional feature space for linear regression through a nonlinear mapping. As compared to the MLP model normally with multiple local minima, the SVR gives a unique solution resulting from the convex nature of the optimality problem [45]. XGBoost is an integrated algorithm based on classification and regression tree [28]. As a massively parallel and distributed machine learning solution, XGBoost runs faster for model exploration. It aims to prevent over-fitting and optimize the computation resources, which makes the model performs well when applied to the water quality observation dataset with many features. Furthermore, an early stopping strategy is adopted for MLP and XGBoost to prevent overfitting. RF is a bagging algorithm based on decision trees and introduces random attribute selection in the training process of a decision tree [29]. A random forest can describe both linear and nonlinear relationships, without any additional assumptions concerning either the independent or the dependent variables [46]. This enables us to make Chl-a concentration prediction without knowing the correlation between water quality variables.

C. CLUSTER-STACKING-BASED APPROACH
Based on the models above, we give the Cluster-stackingbased approach for the Chl-a concentration prediction in coastal waters, which is illustrated as Algorithm 3. It is an integration of the station-oriented clustering model and the stacking-based regression model, for the usage of how to transform data into an appropriate input matrix, and how to combine Algorithm 1 and Algorithm 2 to forecast Chl-a concentration in coastal waters.
The input is the initial version of the water quality observation dataset D ini = (x ini , y ini ). In D ini , x int = x ini 1 , x ini 2 , · · · , x ini n and y ini = y ini 1 , y ini 2 , · · · , y ini n , where 2 , · · · , x ini i,m represents a sample of water quality observation data including all observed factors apart from Chl-a, x ini i,p , 1 ≤ p ≤ m, represents the value of a water quality factor, and m is the total number of water quality factors used in this study; y ini i represents the corresponding Chl-a concentration with x ini i ; and n denotes the scale of the dataset. In particular, the dataset D ini here is preprocessed already. The outputŷ is the forecasting value of Chl-a concentration.
We first convert D ini into D sta by grouping the dataset D ini according to the station number, making it applicable to Algorithm 1 (Step 1). In D sta = s 1 , s 2 , · · · , s q , s i represents a time series data of seasonal average Chl-a concentration of a water quality observation station, and q is the scale of historical data used in this study. Then let D = D sta as an input, we can get the cluster set C = {C 1 , C 2 , · · · , C K } through Algorithm 1 (Step 2). Convert D ini into D pre = (x pre , y pre ) by combining each sample with four-season historical records which have the same station number, making it applicable to Algorithm 2 (Step 3). In D pre , x = x

IV. RESULTS AND DISCUSSION
We wrote Python code for the integration of the clustering and stacking approach. To complete the seasonal Chl-a concentration forecasting, we set the input variables as the historical records of four seasons before and the output variable as the one-season-ahead seasonal average Chl-a concentration. All historical records were handled with simplification, combination, and normalization. Python packages including sklearn, XGBoost, pandas, numpy, matplotlib and seaborn were used for these methods. Meanwhile, we set all the parameters not mentioned explicitly in this paper by the default value in the above packages. Fig.6 ranks the top 20 most important indicators obtained from RF ( Fig.6(a)) and XGBoost (Fig.6(b)), combined with the average score of these two methods in Fig.6(c). According to the analysis for the survey data of the current season, the significant variables for forecasting one-season-ahead Chl-a concentration is the average sea condition in the last one season (L1_SeaC), followed by average Chl-a concentration in the same season of last year (L4_Chl-a), and the average total nitrogen concentration in last one season (L1_TN).

A. FEATURE IMPORTANCE ANALYSIS
As the most important indicator for one-season-ahead Chl-a forecasting, SeaC reflects the sea surface conditions, the shape of wave crests and their degree of bursting, and the amount of foam waves, which is closely related to the Chl-a concentration. As for the Chl-a concentration (L4_Chl-a), the possible reason is the seasonal difference in growth mechanism of phytoplankton. In spring, seawater phytoplankton is in the period of proliferation, while phytoplankton is in decline in autumn because their growth and reproduction ability are diminished. N , P, Si or the combination factors of them are listed at the top 5 critical factors, such as L1_TN and L4_Si/P. Land source input is the primary source of nutrients in the coastal waters and the main influencing factor of phytoplankton activities. When phytoplankton has steady growth and reproduction ability, they need to absorb nutrients, so N, P, and Si become three of the significant influencing factors. This importance analysis score list also explains the importance of nutrients to plant growth.
As spatial indicators, longitude and latitude of the observation stations are both listed in the top 20, which verifies VOLUME 8, 2020 the necessity of our clustering steps. Besides, many types of chemical and physical indicators or the combination of them are on the list. However, pH is not an important indicator because of its slight change. In Fig.6, some meteorological indicators, hydrological indicators, and indicators related to human impacts are also listed, which indicates the importance of these indicators in the forecasting of Chl-a concentration. It should be noted that constructed factors are shown in none of these three figures. But it doesn't mean that they are not essential. Simplification factors are not listed in the figures but also in top 50%. Categorical variables are not on the list because they are encoded in One-Hot, which decomposes a single indicator into multiple indicators. Hence, these constructed factors are all taken into consideration in this study.

B. STATION-ORIENTED CLUSTERING RESULTS
We tried to determine the appropriate architectures for each clustering model until Score CH reached the optimum case. Fig.7 shows the comparison of three different clustering models, k-means cluster, Aggregative cluster and Gaussian Mixture cluster. In total, different clustering models were evaluated to find the optimal one. We can see from this figure that for any kind of clustering method, Score CH is changing as the number of cluster changes. Finally, as the red point shows, the optimal clustering model among the three models is K-means with seven clusters.
Applying K-means on the coastal waters quality observation datasets, we obtained the clusters of stations. Fig.8 shows the 7 types of sub-regions corresponding to the clusters. We can see from the figure that stations in the same cluster are basically very close in location, further confirming that the cluster idea is practical.
It should be noted that stations of the third type of subregion (blue circle points) are partitioned into 2 parts, as well as the first type (red star points). The possible reason is that the east side of Rongcheng and Southside of Wendeng are both containing bays, which influence the Chl-a concentration in current, wind, and other factors. And both of the two parts are open sea area may be the possible reason for the situation in the third type of sub-region.

C. MODEL EVALUATION
To verify the effectiveness of the Cluster-stacking approach in Chl-a concentration prediction, we compared the performance of the Cluster-stacking model with that of other 5 models, which are Cluster-KNN, Cluster-MLP, Cluster-SVR, Cluster-XGB, and the regression stacking model without cluster. We divided the available dataset, both observed and constructed features, into two sub-sets for training (80%) and testing (20%), and tried to determine the best parameters for each model until they reached the optimum case of R 2 and MAE. The mathematical equations of these statistical indicators are described as (6) and (7).
In (6) and (7),ŷ i denotes the predicted value of the ith observation sample, y i is the real value of the ith observation sample,ȳ i is the average value of all observation samples, and n is the number of the observation samples. The closer R 2 is to 1 and the lower MAE values are, the better the model performance will be. The evaluation results of the models are graphically compared in Fig.9 in the form of scatter plots. The scatter plots can be partitioned into 6 groups corresponding to the 6 models. In each group, the left scatter plot and the right one shows the results of a model on the training set and testing set, respectively. The horizontal and vertical axis respectively express VOLUME 8, 2020  the observed value and predicted value. The black line represents y = x. The red line is the linear regression fitting of the observed value and predicted value. i.e., when a point falls on the black line, the prediction is exact. The closer the red line is to the dark line, the more accurate the prediction is .  TABLE 3 and TABLE 4 list the training and testing results of the 6 models in 7 types of sub-region and the whole region, concerning the evaluation of MAE and R 2 . The results in the column of whole region is the integration of forecasting results in all the sub-regions. In these tables, the bold values show the optimal one. As can be seen, Cluster-stacking model shows the best performance with the maximum R 2 and minimum MAE in most of the sub-regions, both in training and testing phases.
According to the test results of Fig.9, TABLE 3, and  TABLE 4, combined with the station-oriented cluster results, the performance of the regression stacking model with k-means clustering is better than those of any other motheds in whole region, with R 2 of 0.848 and MAE of 0.665 ug/l in the testing phase.  In the training phase, each model in this study can fit the data reasonably well. For all types of sub-regions and the whole region, the Cluster-stacking-based model in the training phase performs well with R 2 higher than 0.95 and MAE less than 0.5 ug/l. Only MLP obtained a little better regression in sub-region of type 2, which can be negligible. However, as for in testing phase, not all models perform well.
MLP performs better in R 2 than other methods in sub-region of type 2, 4, and 7; yet when concerning MAE, KNN performs best in sub-region of type 3 and 4, MLP performs best in sub-region of type 2 and 7. SVR has stable performance in all sub-regions. In total, when referring to the whole region, the Cluster-stacking model shows the best overall performance with the highest R 2 and lowest MAE.

V. CONCLUSION
This paper is based on the historical water quality observation data collected from 2015 to 2018 in Weihai coastal waters. The following conclusions can be drawn from the study.
First, we propose a set of feature extraction methods for water quality factors including simplification, combination, normalization and importance analysis, and a feature selection method based on RF and XGBoost to select relevant variables of Chl-a concentration. The feature importance analysis shows that sea condition in the last season (L1_SeaC) plays an essential role in forecasting seasonal average Chl-a concentration, followed by average Chl-a concentration in the same season of the previous year (L4_Chl-a) and average value of total nitrogen concentration in the last one season (L1_TN).
Second, we propose a station-oriented clustering model to group the observation stations. Due to the spatial characteristic of Chl-a concentration in coastal waters, station clustering has been a crucial step of Chl-a prediction. By means of clustering, the dataset of waters can be partitioned into several subsets, thus the influence of biological, physical and chemical reactions in waters can be simplified to that of geographical differences. In our study, we have tested many cluster models, such as k-means cluster, Aggregative cluster and Gaussian Mixture cluster, with a great number of parametersets, and find that k-means model performs the best. Since the clustering has shown its importance in Chl-a concentration prediction, it can also be treated as a useful step to forecasting many other kinds of water quality factors. Furthermore, it can provide a meaningful idea in water regions partition.
Third, to forecasting seasonal average Chl-a concentration in coastal waters, we present an integrated approach that includes station-oriented clustering and stacking-based regression. The main idea of the station-oriented clustering is to partition the whole water region into several sub-regions which has similar regularity of Chl-a concentration evolution. In each sub-region, the stacking-based regression is applied to forecast Chl-a concentration and avoid the insufficiency of single models. The evaluation demonstrates that the Clusterstacking-based method outperforms the approaches of clustering with single regression models such as Cluster-KNN, Cluster-SVR, Cluster-MLP, Cluster-XGBoost, as well as the regression stacking model without cluster.
The evaluation demonstrates the feasibility of the Clusterstacking-based approach in forecasting Chl-a concentration in coastal waters, but we noticed that there is still a gap between prediction and observation data, so it deserves more efforts to explore the features and refine the model. Besides, how to apply the approach proposed to a broader range of waters will also be our future work. TABLE 5 shows the classification criteria used in this study, which are excerpted from the Water Quality Index of Weihai [25]. WEIJIA JIA received the bachelor's degree in computer science and technology from Northeastern University, China, in 2017. She is currently pursuing the master's degree in computer science and technology with Shandong University, Weihai, China. Her research interests involve data processing, feature extraction, and machine learning application.

APPENDIX
JIE CHENG received the Ph.D. degree in computer science and technology from Shandong University, Weihai, China. She is an Associate Professor with the School of Mechanical, Electrical and Information Engineering, Shandong University. Her research interests are data science and data engineering.
HONGZHI HU received the Ph.D. degree in physical chemistry from Beijing Normal University. He is currently a Senior Engineer with the Weihai Marine and Fishery Monitoring and Hazard Mitigation Center. He is mainly engaged in the marine environment and marine resources monitoring and evaluation work. VOLUME 8, 2020