Gaussian Process Regression Tuned by Bayesian Optimization for Seawater Intrusion Prediction

Accurate prediction of the seawater intrusion extent is necessary for many applications, such as groundwater management or protection of coastal aquifers from water quality deterioration. However, most applications require a large number of simulations usually at the expense of prediction accuracy. In this study, the Gaussian process regression method is investigated as a potential surrogate model for the computationally expensive variable density model. Gaussian process regression is a nonparametric kernel-based probabilistic model able to handle complex relations between input and output. In this study, the extent of seawater intrusion is represented by the location of the 0.5 kg/m3 iso-chlore at the bottom of the aquifer (seawater intrusion toe). The initial position of the toe, expressed as the distance of the specific line from a number of observation points across the coastline, along with the pumping rates are the surrogate model inputs, whereas the final position of the toe constitutes the output variable set. The training sample of the surrogate model consists of 4000 variable density simulations, which differ not only in the pumping rate pattern but also in the initial concentration distribution. The Latin hypercube sampling method is used to obtain the pumping rate patterns. For comparison purposes, a number of widely used regression methods are employed, specifically regression trees and Support Vector Machine regression (linear and nonlinear). A Bayesian optimization method is applied to all the regressors, to maximize their efficiency in the prediction of seawater intrusion. The final results indicate that the Gaussian process regression method, albeit more time consuming, proved to be more efficient in terms of the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination (R2).


Introduction
Seawater intrusion (SI) in coastal aquifers is a complex physical phenomenon, consisting of several physical processes. A number of approaches have been proposed to simulate SI, considering different components. Dispersion mechanisms and water density changes are considered critical components in the accurate representation of SI [1]. Both mechanisms are incorporated in the mathematical description of what is known as variable density (VD) models. Although accurate, VD models are CPU intensive and entail long runtimes because the resulting model equations are solved using complex numerical methods (e.g., finite differences and finite element methods). e time-consuming simulations hinder the exploitation of the high accuracy VD models in applications which require a large number of iterations, such as coastal groundwater management, parameter estimation, sensitivity analysis, and uncertainty analysis. Because of the long runtimes, it is also rather impractical to incorporate VD models in real-time systems, e.g., decision support systems [2]. A common method to tackle the duration problem is the use of very fast approximation models, which could efficiently substitute the original VD models, without compromising the accuracy of the results. ese models are usually called surrogate models, metamodels, model emulators, lower fidelity models, proxy models, and response surfaces [2][3][4][5].
Surrogate model practice is based on the notion that original model response(s) could be approximated by a computationally more efficient model, for a range of values of the selected model variables [5]. In the present study, the Gaussian process regression (GPR) as a surrogate model for SI is examined. Rajabi and Ketabchi [6] summarized the advantages of GPRs compared with other surrogate models in the following: (i) GPRs provide both an approximation of the original high-fidelity model results and a probabilistic estimate of the approximation uncertainties [7,8], (ii) GPRs' structure is relatively simple based on the mean and covariance functions [9], (iii) GPRs are flexible with regard to the probability distributions of the input data, (iv) GPRs can efficiently cope with models of different complexity [10,11], (v) GPRs provide the ability to calculate the mean and standard deviation, and (vi) GPRs provide the ability to incorporate prior knowledge of the outputs in the metamodel construction process [12].
e GPR results are compared with other widely used methods, specifically, linear regression (LR), support vector machine regression (SVMR), binary regression decision tree (BRDT), and ensemble tree learners (ETL). It should be noted that the examined methods are all univariate. A Bayesian optimization is employed in all surrogate models, to improve their efficiency. e remainder of this study is structured as follows: Section 2 presents a brief survey of the related work. In Section 3, the seawater intrusion model is described, whereas section 4 presents the proposed Gaussian process regression scheme and the Bayesian optimization process. In Section 5, the experimental evaluation is provided, and finally, section 6 concludes the paper.

Related Work
Approximation models have been widely used during the last decade in water resources (e.g., [13,14]) and especially in groundwater modelling. Razavi et al. [5] and Asher et al. [2] performed an extended review of surrogate model applications in water resources field. Regarding coastal aquifers, surrogate models have been widely used for the prediction of SI, substituting the complex fluid flow and transport processes. For example, Bhattacharjya et al. [15] employed artificial neural networks (ANN) to approximate densitydependent flow in coastal aquifers. In more recent studies, Roy and Datta [16] used the fuzzy C-mean clustering method to predict SI, while Lal and Datta [17] investigated the ability of Support Vector Machine regression (SVMr) to predict the location of SI toe and concluded that the method surpasses other widely used metamodelling methods, such as genetic programming (GP). A significant number of relative studies are devoted to the use of surrogate models in coastal aquifer management problems to cope with the computational burden, which arises from simulation-optimization schemes [18]. A wellestablished metamodel which is very common in coastal aquifer management literature is the artificial neural networks (ANNs) [5,18]. Bhattacharjya and Datta [19] employed an ANN model to approximate a densitydependent model in a genetic optimization framework. Rao et al. [20] and Kourakos and Mantoglou [21] incorporated ANNs in a simulation-optimization scheme to replace the SEAWAT numerical code. Kourakos and Mantoglou [22] proposed a pumping optimization method based on modular neural networks and an Evolutionary Annealing Simplex optimization algorithm. Ataie-Ashtiani et al. [23] combined a simulation-optimization procedure with ANNs to develop an efficient model for the multiobjective management of groundwater lenses in small islands. Christelis and Mantoglou [24] used cubic radial basis functions (RBFs) in two adaptive metamodeling frameworks: (1) the adaptive-recursive approach and (2) the metamodel-embedded evolutionary strategy. e latter proved to be computationally more efficient, providing solutions near the global optimum. Christelis et al. [25] employed two surrogate-based optimization (SBO) frameworks, under restricted computational budgets to improve the efficiency of optimization algorithms in problems of moderate and large dimensionalities. In a more recent study, Christelis and Mantoglou employed variable-fidelity surrogate models and evolutionary algorithms to calculate the maximum allowed pumping rates in coastal aquifers. Sreekanth and Datta in several studies [26][27][28] examined genetic programming (GP) as a potential surrogate model in multiobjective management of SI in coastal aquifers and compared the proposed method with modular neural network metamodels. Roy and Datta examined several surrogate models to predict SI in coastal aquifers and employed all these models in coastal aquifer management problems [29][30][31]. A review of surrogate models, focusing on SI and coastal aquifer management, is presented by Roy and Datta [32].
Gaussian process metamodels have been widely used in engineering optimization applications, but according to Razavi et al. [5] and Asher et al. [2], they have not yet attracted much attention in groundwater field, particularly SI. Stone [33] presented a Bayesian emulation methodology as an alternative to Monte Carlo in the analysis of stochastic groundwater models. Zhang et al. [34] employed an adaptive Gaussian process-based method to identify contaminant source in groundwater problems, whereas Crevillén-García et al. [35] used Gaussian process method to perform uncertainty analysis in a convectively enhanced dissolution process model. Raghavendra and Deka [36] used GPR and adaptive neuro fuzzy inference system (ANFIS) to forecast groundwater level time series. In a recent study, Rajabi and Ketabchi [6] used Gaussian process emulators in a simulation-optimization framework to address the computational challenges arising from the large number of the required simulations. Roy and Datta [37] incorporated three metamodels, particularly ANFIS, GPR, and multivariate adaptive regression spline (MARS), in a multiobjective optimization framework to quantify the influence of sealevel rise on coastal aquifer management. In this specific study, they concluded that the ANFIS-based metamodel proved to be more efficient and inexpensive compared with the other two metamodels. e authors also performed a comparative analysis between several surrogate models [38], including GPR, in a coupled simulation-optimization methodology under parameter uncertainty. In this specific study, they concluded that the GPR metamodels and their ensemble (EGPR) proved to be more efficient in terms of prediction compared with other similar methods, such as the MARS metamodel and the regression tree (RT) metamodel.

Seawater Intrusion Model
3.1. Variable Density and Salt Transport Model. As mentioned in section 1, VD models are based on the spatial variability of groundwater density, which ranges from saline water density to freshwater density. e driving force of the seawater/freshwater mixing is the dispersion mechanism, which results in the existence of a transition zone across the entire coastline. e width and exact position of the zone depends on the aquifer parameters and the pumping regime. In the current study, thermal and viscosity effects are neglected and the density changes are attributed only to concentration effect. e flow and solute transport equations are used to describe mathematically the VD model. e two equations form a coupled differential equation system, which could be expressed as follows [39]: where ρ is the fluid density, q is the specific discharge vector, ρ s is the density of water entering from a source of leaving through a sink, q s is the volumetric flow rate per unit volume of porous medium representing sources and sinks, S f is the specific storage, h f is the freshwater head, n is the porosity, C is the solute concentration, D is the hydrodynamic dispersion tensor, v is the fluid velocity vector, and C s is the solute concentration of water entering or leaving through sources and sinks, respectively. Because solute reaction is not considered, fluid density is only a function of the solute concentration C, according to the following equation: in which ρ o is the freshwater density, ε is the density difference ratio (equation (4)), C o is the reference concentration, and C s is the maximum concentration. In this study, the following values are used for the parameters of equation (3): ρ o � 1000 kg/m 3 , C o � 0 kg/m 3 , and C s � 35 kg/m 3 . e density difference ratio is expressed as where ρ s stands for the maximum seawater density. In this study, we consider ρ s � 1025 kg/m 3 . e Darcy flux term q of equation (1) for constant viscosity and freshwater properties could be expressed as where q x , q y , and q z are the components of the specific discharge in the principal directions, K fx , K fy , and K fz are the components of the freshwater hydraulic conductivity in the same directions, and ρ f is the freshwater density. Equations (1) to (7) are the mathematical representation of the VD approach of seawater intrusion. e wellestablished SEAWAT code is used to solve numerically the aforementioned equation set. SEAWAT is a modular finite difference computer code created by USGS, which couples MODFLOW and MT3DMS, to solve iteratively the fluid flow and solute transport equations [39].

Coastal Aquifer Case Study.
e VD model is applied on a rectangular-shaped unconfined aquifer. e dimensions of the aquifer model are L � 7000 m, W � 3000 m, and d � 25 m. e examined aquifer geometrically resembles a real coastal aquifer located at the central eastern part of the Greek island Kalymnos, specifically the elongate aquifer underlying the Vathi valley [40,41]. It should be noted that the examined model is an abstraction of the real aquifer, which could be considered as a typical aquifer example for the Aegean Greek islands, in terms of size and shape. Figure 1 outlines the conceptual model of the aquifer model. A hydrostatic boundary condition (BC) is assigned on the seaside boundary. e aquifer is bounded by impermeable geological formations, with the exception of the inland boundary, where a specified flux BC is applied to simulate the lateral inflow from the adjacent aquifer. e groundwater is replenished by a constant recharge, which is uniformly distributed along the entire surface of the aquifer. For simplification purposes, the aquifer is considered homogeneous and an anisotropic factor is assumed, which represents the differential permeability along the vertical direction. Table 1 presents the values of the basic fluid flow and solute transport parameters. An initial simulation for approximately 200 yr without pumping was performed, until steady flow/steady transport conditions are achieved. e final hydraulic head and concentration values of this simulation are used as the initial conditions for the SI simulations, which are related to the training of the surrogate models. All simulations in the current paper are considered steady state, regarding the fluid flow conditions. is assumption resulted in relatively brief VD simulations, which allowed for the creation of an adequate sample for the calibration of the surrogate models. e duration of each simulation was approximately 1-2 min. e simulations were performed in an i7-4770 quad-core processor 3.4 GHz, with 8 GB RAM.
Computational Intelligence and Neuroscience e 0.5 kg/m 3 iso-chlore is considered as an indicative surface, representing the seawater intrusion wedge. Specifically, the location of the intersection of the iso-chlore and the aquifer bottom, known as the toe of the wedge, is used as a measure of the seawater intrusion extend. Further details concerning the calculation of the toe are discussed in the following sections.

Gaussian Process Regression.
Gaussian process regression (GPR) is a nonparametric kernel-based probabilistic model [8]. Just like other Bayesian methods, GPR do not aim at finding "best-fit" models of the data by relating the underling function f(x) to a specific form (e.g., linear or quadratic). Instead, they calculate posterior predictive distributions for new test inputs. Such an approach enables the quantification of uncertainty as regards model estimates, as well as leveraging the understanding of the uncertainty to improve the robustness of predictions on future test points [43]. Gaussian processes can be considered as the extension of multivariate Gaussians to infinite-sized collections of variables of real value. More specifically, a Gaussian process is a collection of random variables f(x) : x ∈ X defined by its mean function μ(x) and a covariance function k(x, x ′ ) so that e above statement can be rewritten as follows: where each dimension of the Gaussian corresponds to an element x from the index set X. Furthermore, the respective component of the random vector represents the f(x) value. Typically, the prior distribution over functions f(·) is expected to be a zero-mean GP prior. Consider a training set L � (x i , y i ) n i�1 of i.i.d. examples from some unknown distribution, where x i ∈ R d and y i ∈ R. A GPR model assumes that a response y i satisfies the following equation: where ϵ i are i.i.d. noise variables, so that ϵ∼N(0, with mean value and covariance defined as respectively. Note that K(X (u) , X) ∈ R n×n is defined as . . , n. e same hold for the K(X, X), K(X (u) , X (u) ), and K(X, X (u) ) cases.
Additionally, y � y 1 , . . . , y n T , X (u) is defined in a similar way. As such, we can estimate any new value as the mean of a posterior predictive distribution. We should also note that with the rise of training samples number, the confidence region size reduces, so as to reflect the decreasing uncertainty in the model estimates.

Alternative Regressors for Comparison.
Regression trees is an alternative approach to nonlinear regression. e core idea lies in sub-dividing the space into smaller regions and then fit simple models to them [42,44]. Provided a training set L, a set of branches is created. Each binary split is performed according to a specific feature (from the m available). en, a new value prediction is defined as where c is the number of observations available at the specific cell. Support Vector Machine regression is another approach. e function used to predict new values (for linear support vector regression) is defined as [45] y where 〈·, ·〉 is the dot product, α i , α * i are Langrage multipliers, so that α i · α * i � 0, i � 1, . . . , n, and b is a bias term. SV algorithm can be made nonlinear by simply preprocessing the training patterns x i using a kernel function k(·, ·). e regression is performed as

Bayesian Optimization of Model Parameters.
A variety of widely used machine learning techniques contain a significant number of parameters to be decided (e.g., SVM kernel type and parameters and ANN layers and type of activation functions). e performance of any algorithm depends on the selection of these hyperparameters [46][47][48]. Typically, hyperparameter tuning involves grid search, random search, and genetic algorithms, among many other techniques [49]. Such techniques require many (nonconvex) function evaluations. Bayesian optimization (BayesOpt) is a surrogate modelling technique that can optimize an objective function that is expensive to evaluate, reducing the number of actual function evaluations required [50,51]. It is built on Bayesian inference and Gaussian processes and is applicable in cases where closed-form expression for the objective function is not known but can obtain observations (possibly noisy) of this function at sampled values.
BayesOpt builds a probabilistic proxy model for the objective, using outcomes of past experiments as training data. e proxy model (e.g., Gaussian process) is much cheaper to calculate but it can provide adequate information on where we should evaluate the true objective function to get a good result. Assume a vector P � p 1 , . . . , p m for a set of m hyperparameters to be tuned. Given a set of training paradigms (x i , y i ) n i�1 , we need to find P * � argmin P g(P| (x i , y i ) n i�1 ), where g is a cost function (e.g., cross-entropy cost and quadratic cost). e entire optimization approach is guided by an appropriate acquisition function (AF), which defines the next point (i.e., set of hyperparameters) to be evaluated. As such, any AF needs to balance between exploration and exploitation.
Exploration refers to region search where the uncertainty is high, expecting to find a new set of parameters that improve model's performance. Exploitation, on the other hand, is a region search close to already calculated high estimated values (i.e., regression performance scores).

Data Preprocessing and Experiment Setup.
e training sample consists of 4000 variable sets. Each set has 40 input variables: (1) the pumping rates of the 10 wells and (2) the distance of the SI toe from 30 observation points, uniformly distributed across the sea boundary. e Latin hypercube sample (LHS) statistical method was used to generate the 4000 pumping rate patterns. As mentioned in Section 2.2, the 0.5 kg/m 3 is selected as an indicative concentration value for the SI extend. e distance between the toe and the observation wells represents the initial position of the SI wedge. Regarding the initial position of the SI toe, the variable sets are divided into four categories of 1,000 samples. In the first category, the concentration results from the zero-pumping rate simulation define the initial location of the SI toe. In the remaining categories, the solute transport and hydraulic head results of the previous category simulations are used as the initial conditions for the following simulations.
e output set consists of 30 variables, which represent the final location of the SI toe, calculated as the distance from the same observation points. Figure 2 presents the initial and final position of the SI toe for a specific set of pumping rates, representing, along with the pumping rates, the input/ output variables used to train the surrogate model.

Hyperparameter Optimization.
Each regressor's hyperparameters were optimized using Bayesian optimization over 5k-fold cross-validation sets. e final parameter values are summarized in Table 2. An interesting remark is that, for the specific setup, simpler models (i.e., least square regression vs linear kernel SVMs, and linear kernel SVM vs Gaussian RBF or polynomial kernels) perform slightly better, during the optimization process. Figure 3 illustrates the normalized performance scores of the investigated regressors for the training set, and Figure 4 provides a further insight into the actual differences (in meters), on average, for the trained models. Errors in estimation do not surpass 10 meters for the GPR and 20 meters for the TreeEns. e other regressors achieve an average error greater than 40 meters. Figure 5 illustrates the optimization time (in minutes) required for the identification of the best possible hyperparameters, using Bayesian optimization. GPR and SVR had significantly higher training times. It is also intriguing that, for different observation points, SVR and TreeEns regressors' optimization times had increased variance. Figure 6 illustrates the case.

Statistical Evaluation.
Statistical errors calculate the sum of differences between actual (simulated) and forecasted (regressor estimated) values. e statistical measurements used were the mean absolute error (MAE) and the root mean square error (RMSE). Low error scores suggest a good regression model.

Computational Intelligence and Neuroscience
An additional performance score, i.e., coefficient of determination, R 2 , is used. R 2 provides a measure of how well-observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. Values close to 1, i.e., R 2 ≈ 1, indicate that the regression predictions approximate extremely well the actual data outputs. Figure 7 illustrates the average performance scores, for the proposed statistical errors. GPR surpasses all other repressors in all performance fields.     Computational Intelligence and Neuroscience

Measuring Actual SI Toe Location Estimation Error.
e statistical errors provide various information regarding the model performance. However, in our case, actual errors in the SI toe location estimations, measured in millimeters (mm), provide a deeper understanding of the regressors' performance. Figure 8 presents the discrepancies between    e comparative results indicate that the GPR method is overall more efficient and achieves more accurate prediction of the SI.
However, average scores fail to indicate the performance for each of the observation points, using each of the proposed regressors. Table 3 provides a further insight into the error values (meters) for each of the observation points.

Analysis of Variance.
To obtain further insights into the results and the relative performance of the different algorithms, we conducted an analysis of variance (ANOVA) on the distance between the SI toe and the observation points score results for the test samples. e MAE score, in meters, represents a significant amount of information about the overall performance. Using this method, we can study the effects that the main design factors have [52]. Table 4 displays the outcomes of ANOVA. In this table, the "Source" column corresponds to the source of variation in data (i.e., the regressors and the observation points). Sum and mean sq. correspond to mean measurements between the m groups and the grand mean; it is a means of quantification of the variability among the groups of interest. For the degrees of freedom (d.f.), it holds that d.f. � m − 1.
e F metric corresponds to the "average" intergroup variability divided by the "average" intragroup variability. e last column includes the p value, which is derived by comparing the F-statistic to an F-distribution with m − 1 numerator degrees of freedom and n − m denominator degrees of freedom, for the total set of n observations.
As can be seen in   Computational Intelligence and Neuroscience difference (HSD) post hoc test is also employed to identify sampling schemes and classifiers that provide the best results, while taking into consideration the statistical significance of the differences between the results. Figure 9 indicates that GPR by far surpasses the other regressors' MAE score. Mean scores for each regressor are shown as 'o'. e average scores from the subgroups in the experiment are also provided, in the form of a horizontal line. Because there is no overlap between the RMSE values for the GPR type compared with the other regressors, GPR scores are clearly statistically better than the others [53,54]. Figure 10 indicates that the best observation point is point no. 30. A slight overlap in MAE subgroups' scores with point 27 (2 nd best observation point) is observed.

Conclusion
e present study performs a comparative analysis of four different surrogate models for the variable density approach of seawater intrusion, in particular Gaussian process regression, binary regression decision tree method, ensemble tree learners, and the support vector machine regression models. Emphasis was given on the optimization of the examined techniques. To this end, a Bayesian optimization procedure of the surrogate models hyperparameters is used. e evaluation results indicate that the GPR method surpasses the other regressors in terms of the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination (R 2 ). It should be noted that GPR is significantly more time  Computational Intelligence and Neuroscience 9 consuming. In summary, the GPR method is a reliable and accurate surrogate model for SI and could be incorporated in a pumping optimization framework in coastal aquifers. Future research will focus on further scrutinizing the effectiveness of the GPR method for saltwater intrusion prediction, compared with other well-established surrogate models.

Data Availability
Data are not publicly available at this point because of intellectual property restrictions, but they can be provided to anyone interested on request. e data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper. Acknowledgments e research leading to these results has received funding from the European Union's Horizon 2020 Research and