Evaporation Rate Prediction Using Advanced Machine Learning Models: A Comparative Study

Accurate estimation of evaporation loss is necessary for scheduling irrigation and calculating irrigation water requirements. In this study, four machine learning (ML) modeling approaches, extreme learning machine (ELM), gradient boosting machine (GBM), quantile random forest (QRF), and Gaussian process regression (GPR), were developed to estimate the monthly evaporation loss at two stations located in Iraq. Monthly climatic parameters were used as input variables for simulating the evaporation rate. Several statistical measures (e.g., mean absolute error (MAE), correlation coefficient (R), mean absolute percentage error (MAPE), and modified index of agreement (Md)), as well as graphical inspection, were used to compare the performances of the applied models. The results showed that the GBM model performed considerably better than the other applied models in predicting monthly evaporation at both stations. For the first case study (Diyala), GBM reduced MAE and RMSE by 7.17% and 21.01%, 16.51% and 15.74%, and 23.14% and 26.64% compared to ELM, GPR, and QRF, respectively. For the second case study (Erbil), the corresponding reductions in MAE and RMSE were 10.88% and 9.24%, 15.24% and 5%, and 16.06% and 15.76% compared to ELM, GPR, and QRF, respectively. The results of the proposed GBM model can therefore assist local stakeholders in the management of water resources.


Introduction
Evaporation plays a major role in the hydrological cycle; therefore, monitoring evaporation is important for managing water resources, optimizing irrigation schedules, and modeling agricultural production [1,2]. In addition, the evaporation rate is important in studying climate change and global warming because evaporation dissipates a large proportion of global precipitation [3][4][5]. The evaporation loss is influenced primarily by the vapor pressure gradient and the available heat energy, which are determined by weather variables such as air temperature, relative humidity, wind speed, and solar radiation [6][7][8]. These variables are strongly associated with other factors such as the current season, time of day, geographical location, and type of climate [9,10]. The evaporation process is therefore extremely nonlinear and complex.
There are two procedures for computing and evaluating evaporation: direct and indirect [11]. Pan evaporation (E pan) is a well-known direct method used extensively for estimating the evaporation rate. However, evaporimeters cannot be placed everywhere, especially in inaccessible regions where precise instrumentation is not possible [12]. Furthermore, installing and maintaining this evaporation equipment in several regions is expensive [13]. The indirect method, in contrast, relies on empirical equations for estimating the evaporation rate [14]. These empirical equations can be established using meteorological and hydrological parameters such as temperature, sunshine hours, wind speed, humidity, and rainfall [15,16]. Precise measurement of some of these meteorological factors requires advanced tools and skilled labor [17]. Instrument malfunctions, improper maintenance, and harsh weather conditions often make it difficult to record these data without errors, yet error-free data are essential for predicting evaporation via empirical equations [18]. Thus, evaporation estimates become unreliable when these factors are measured incorrectly [19].
Indirect systems of estimating evaporation by applying empirical equations are therefore dependent on data and are also influenced by different assumptions. In other words, these approaches are data-sensitive procedures, and the accuracy of prediction depends mainly on the validity of the data [20]. Additionally, such climatic data are generally scarce or hard to find at a particular hydrological station, and they tend to be discontinuous in certain places [21]. Evaporation is difficult to model through empirical techniques because of its extremely complex physical and nonlinear nature. In addition, an empirical model designed for a specific scenario might not perform well in another scenario, requiring recalibration of the coefficients before use. Many empirical models have been proposed in the literature to model evaporation loss [22]. The selection of the predictors is one of the main challenges for the nonlinear regression process. Therefore, creating a robust predictive model using empirical procedures is very difficult.
Many studies have addressed different water-resource problems using artificial intelligence (AI) approaches such as random forest (RF), support vector machine (SVM), extreme learning machine (ELM), feed-forward neural network (FFNN), extra-tree, Gaussian process regression (GPR), gradient boosting model (GBM), and quantile regression forest (QRF) [23][24][25][26][27][28][29]. Goyal et al. [30] presented a study to estimate the daily evaporation loss over subtropical areas using different AI modeling approaches. The study used six meteorological parameters to establish the applied models. The findings showed that the Adaptive Neurofuzzy Inference System (ANFIS) and least square support vector regression (LS-SVR) provided the best accuracy among the models used. Another study [31] estimated the evaporation loss of Lake Beysehir, located in the southern part of Turkey.
This study employed several machine learning approaches coupled with a cross-validation technique to predict the monthly evaporation over that case study area, which is characterized as arid to semiarid. The study found that both ANN and SVR had good prediction accuracy. Qasem et al. [32] developed a hybrid model that combines ML models such as SVR and ANN with wavelet transforms (WT) for modeling the monthly rate of evaporation in arid and humid climates. The results showed that the WT did not significantly enhance the prediction accuracy in some cases, whereas the standard model (ANN) predicted the evaporation rates with satisfactory accuracy. Because ANN performed well in predicting evaporation loss, it is worthwhile to compare it with other machine learning methods such as RF and ELM. A study introduced in [33] compared the performances of ANN and random forest in the prediction of evaporation. Its results showed that RF outperformed ANN and provided very accurate estimates. Furthermore, Althoff et al. [34] used different ML approaches to estimate the evaporation loss from small dams in Brazil. Their findings showed that the performance of RF was very satisfactory in predicting evaporation loss over small dams. Several other studies demonstrated the contribution of AI models to simulating catchment evaporation processes [35][36][37]. Recently, kernel-based models, fuzzy algorithms, and their hybrids with other algorithms have been successfully used for predicting evaporation [38]. However, newly developed gradient boosting models have rarely been applied in modeling reference evapotranspiration worldwide. To our knowledge, no study has focused on evaluating and comparing the capability of newly developed gradient boosting models for evaporation estimation in the arid to semiarid climate zones of Iraq.
Therefore, it is of interest to evaluate the performance of GBM and compare it with reliable AI models, namely, extreme learning machine (ELM), quantile regression forest (QRF), and Gaussian process regression (GPR), for estimating the evaporation rate (E p) in the arid to semiarid climate zones of Iraq. The contribution of this study is to determine the efficiency of the gradient boosting model (GBM) in estimating E p using data collected from two meteorological stations located in Iraq and to benchmark it against ELM, QRF, and GPR. Furthermore, this is the first time that a GBM model has been used to predict the monthly evaporation loss at several stations located in Iraq.

Data and Case Study
Iraq is located in the Middle East and has two major climate zones, semiarid in the south and semihumid in the north [39]. The Iraqi region lacks sufficient water resources and suffers from droughts [40,41]. As temperatures rise in Iraq, surface water availability decreases and groundwater levels in aquifers fall. Iraq's hydrological cycle has been affected severely by evaporation, which currently depletes about 61% of its total precipitation [16,42]. Thus, it is very important to accurately predict the evaporation loss in Iraq. In this study, two case studies were selected to estimate the evaporation rate. The first station is in Diyala state, while the second is in Erbil state (see Figure 1). Diyala is located in the central part of the region, while Erbil is located in the northern region. The evaporation rate was predicted as a function of six meteorological parameters: sunshine hours, minimum and maximum temperature, wind speed, rainfall, and relative humidity.

Gaussian Process Regression. Rasmussen and Williams were the first to introduce Gaussian process regression (GPR) [43]. This approach is a well-known nonparametric method used for solving classification and regression problems. Furthermore, the GPR model has been commonly employed to address several water resources concerns [44][45][46][47]. GPR combines Bayesian learning and kernel machines to form a principled, probabilistic approach to building a regression model. The uncertainty of a model prediction can be output directly alongside the predicted value [48].
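As a minimal sketch of how a GPR returns an uncertainty alongside each prediction, the following NumPy code computes the posterior mean and variance under a zero-mean prior. The squared-exponential kernel, the length scale, the noise level, and the toy data are illustrative assumptions and are not the settings used in this study.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel k(t, t') for 1-D inputs (assumed kernel choice)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gpr_predict(t_train, y_train, t_test, noise=1e-2, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at the test locations."""
    # Kernel matrix of the training inputs plus an assumed noise variance.
    K = rbf_kernel(t_train, t_train, length_scale) + noise * np.eye(len(t_train))
    K_s = rbf_kernel(t_train, t_test, length_scale)
    K_ss = rbf_kernel(t_test, t_test, length_scale)
    # Cholesky factorization gives a numerically stable solve of K^{-1} y.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mean, var

# Toy data (hypothetical): one test point inside the data range, one far away.
t_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.sin(t_train)
t_test = np.array([2.0, 10.0])
mean, var = gpr_predict(t_train, y_train, t_test)
```

In this toy setting the predictive variance is small near the training inputs and returns to the prior variance far away from them, which is the behavior that makes the uncertainty estimate informative.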
In general, a GPR is fully specified by its mean and kernel functions [49]. According to this definition, a GPR is a collection of random variables representing the value of the function f(t) at the given location t. It can be expressed as follows:

$$f(t) \sim \mathcal{GP}\left(m(t), k(t, t')\right), \tag{1}$$

where f(t) is the prior distribution of the regression function, and k(t, t′) and m(t) are the kernel and mean functions, respectively. Considering that the training set T contains a finite number of inputs in matrix form, t 1 , t 2 , . . . , t n , the joint distribution of the GPR is defined as follows:

$$f(T) \sim \mathcal{N}\left(M(T), K(T, T)\right), \tag{2}$$

where M(T) is the mean vector, which can be calculated from the mean function m(t) as follows:

$$M(T) = \left[m(t_1), m(t_2), \ldots, m(t_n)\right]^{\top}. \tag{3}$$

Advances in Meteorology
Moreover, the kernel matrix K(T, T) of the applied model can be determined from the kernel function k(t, t′) as follows:

$$K(T, T) = \begin{bmatrix} k(t_1, t_1) & \cdots & k(t_1, t_n) \\ \vdots & \ddots & \vdots \\ k(t_n, t_1) & \cdots & k(t_n, t_n) \end{bmatrix}. \tag{4}$$

In this study, the mean function is set to zero for simplicity, which yields a widely used GPR prior that has also been adopted in previous studies [43,50]. Finally, (1) is rewritten as follows:

$$f(t) \sim \mathcal{GP}\left(0, k(t, t')\right). \tag{5}$$

3.2. Extreme Learning Machine. Extreme learning machine (ELM) is a single hidden layer feedforward neural network (FFNN) with the advantages of good global search ability, simple structure, fast learning speed, and excellent generalization ability [51]. There are two types of weights in the ELM: the input weights of the hidden layer, which are assigned randomly, and the output weights, which are obtained analytically [52]. In other words, unlike traditional neural networks, the ELM does not require iterative learning [53]. The output weights of the ELM can be computed directly from the generalized inverse of the hidden-layer output matrix, which greatly simplifies the structure of the ELM. The training process of ELM is summarized in the following steps:
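(i) Input the training dataset, and select the ELM's structure (number of hidden nodes) and the activation function of the hidden layer (see Figure 2).
(ii) Calculate the H matrix (output of the hidden layer) as follows:

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_l \cdot x_1 + b_l) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_l \cdot x_N + b_l) \end{bmatrix}, \tag{6}$$

where g is the activation function, x_1, . . . , x_N are the training inputs, and (w_i, b_i), i = 1, . . . , l, are the hidden-node parameters, which are randomly assigned.
(iii) Determine the output weight matrix (β):

$$\beta = H^{+} T, \tag{7}$$

where T is the actual label vector of the training dataset and H + is the Moore-Penrose generalized inverse of the matrix H.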
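The three training steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sigmoid activation, the number of hidden nodes, the normal distribution of the random weights, and the synthetic data are all hypothetical choices, not the configuration used in this study.

```python
import numpy as np

def hidden_output(X, W, b):
    """Hidden-layer output matrix H with a sigmoid activation (assumed)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_fit(X, T, n_hidden=20, seed=0):
    """Train an ELM: random hidden parameters, analytic output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights
    b = rng.normal(size=n_hidden)                # random hidden biases
    H = hidden_output(X, W, b)                   # step (ii)
    beta = np.linalg.pinv(H) @ T                 # step (iii): beta = H^+ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return hidden_output(X, W, b) @ beta

# Synthetic regression data (hypothetical), two inputs and one target.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
T = X[:, 0] ** 2 + 0.5 * X[:, 1]

W, b, beta = elm_fit(X, T)
pred = elm_predict(X, W, b, beta)
train_mae = float(np.mean(np.abs(pred - T)))
baseline_mae = float(np.mean(np.abs(T - T.mean())))
```

Because only the output weights are solved for, and in closed form, no iterative weight updates are needed, which is the property the text refers to as the absence of iterative learning.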

Quantile Random Forest (QRF). Random forest (RF) is a supervised ensemble learning algorithm invented by Breiman [54]. The core concept of this approach is to integrate multiple trees through ensemble learning. RF is a modified version of the bagging algorithm: new datasets S n are drawn from the original dataset by sampling with replacement, and each S n is used to train a tree separately. The CART decision tree is employed in RF as a weak learner; for each tree generated, the required number of features is selected randomly from the features of the original dataset. Thus, in a regression problem, the results of the weak learners (T) are averaged to obtain the final model output. The averaging approach of RF is important for reducing the bias, as well as the variance and the correlation between trees [23]. The quantile random forest (QRF) is considered an improved version of RF that applies quantile regression (QR) instead of averaging when calculating the final form of the target [55]. Furthermore, QRF is a nonparametric approach with a solid theoretical foundation [56]. The conditional distribution of the QRF can be mathematically expressed as follows:

$$F(y \mid X = x) = P(Y \le y \mid X = x) = E\left(\mathbf{1}_{\{Y \le y\}} \mid X = x\right). \tag{8}$$

In RF, the conditional mean E(Y | X = x) is approximated by a weighted mean of the observations. With regard to QRF, E(1_{Y ≤ y} | X = x) is analogously approximated by the weighted average of the observations 1_{y_i ≤ y}:

$$\hat{F}(y \mid X = x) = \sum_{i=1}^{n} w_i(x)\, \mathbf{1}_{\{y_i \le y\}}. \tag{9}$$
The steps below illustrate the QRF algorithm:
(i) Grow k decision trees T(θ t ), t = 1, . . . , k, as in random forests (RF), and record the observations that fall into each node of every tree.
(ii) For X = x, drop x down all decision trees and determine the observations that share its node in each tree. The weight w i (x, θ t ) of each observation i ∈ {1, . . . , n} is computed for every tree, and the final weight w i (x) is obtained by averaging these weights over the trees.
(iii) For all y ∈ R, calculate the estimate of the distribution function with (9) using the weights obtained in step (ii).
Figure 3 presents the flowchart of the QRF model.
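Steps (ii) and (iii) can be illustrated with a small pure-Python sketch. It assumes that the forest has already been grown and that the leaf reached by every training observation in every tree is known; the leaf assignments, the target values, and all function names below are hypothetical and serve only to show how the weights and the conditional distribution are formed.

```python
def qrf_weights(train_leaves, query_leaves):
    """Average the per-tree weights w_i(x, theta_t) over all trees.

    train_leaves[t][i] is the leaf of training observation i in tree t;
    query_leaves[t] is the leaf reached by the query point x in tree t.
    """
    n_trees = len(train_leaves)
    weights = [0.0] * len(train_leaves[0])
    for leaves, q_leaf in zip(train_leaves, query_leaves):
        members = [i for i, leaf in enumerate(leaves) if leaf == q_leaf]
        for i in members:
            # Each observation sharing the leaf gets 1 / (leaf size) in this tree.
            weights[i] += 1.0 / (len(members) * n_trees)
    return weights

def conditional_cdf(y, y_train, weights):
    """Weighted empirical distribution function evaluated at y."""
    return sum(w for w, yi in zip(weights, y_train) if yi <= y)

def conditional_quantile(q, y_train, weights):
    """Smallest observed value whose estimated CDF reaches q."""
    cumulative = 0.0
    for yi, w in sorted(zip(y_train, weights)):
        cumulative += w
        if cumulative >= q - 1e-12:
            return yi
    return max(y_train)

# Hypothetical forest with two trees and four training observations.
y_train = [10.0, 20.0, 30.0, 40.0]
train_leaves = [[0, 0, 1, 1], [0, 1, 1, 1]]
query_leaves = [1, 1]

weights = qrf_weights(train_leaves, query_leaves)
cdf_at_25 = conditional_cdf(25.0, y_train, weights)
median = conditional_quantile(0.5, y_train, weights)
```

Unlike the plain RF average, the weighted distribution function makes any conditional quantile available, so prediction intervals can be read off in the same way as the median.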

Gradient Boosting Machine. The gradient boosting machine (GBM) model is one of the best-known supervised algorithms and was introduced as a robust technique for solving classification and regression problems [57]. A decision tree is a fast algorithm, but it suffers from instability; GBM was introduced to address this problem [58][59][60]. Furthermore, GBM combines the advantages of decision trees and boosting algorithms [61]. GBM formulates boosting as gradient descent and, hence, is very useful for classification and regression problems [62]. Boosting is a constructive scheme of ensemble formation in which new weak base models are added successively, each trained on the error of the whole ensemble from the previous iteration; these base learners need only achieve a slightly lower error rate than random guessing. In this way, the learning mechanism fits new models sequentially to produce a more precise estimate of the response variable. Figure 4 shows the structure of the gradient boosting machine regression model. The approach of the GBM model can be illustrated in several steps as follows:
(i) The GBM is initialized with a constant value that minimizes the loss function.
(ii) In each training iteration, the negative gradient of the loss function is computed for the current model at each x i and used as the residual value.
(iii) A new regression tree is trained to fit the residuals obtained in step (ii).
(iv) The current regression tree is added to the previous model, and the residuals are updated.
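The four steps can be sketched as follows for the squared-error loss, for which the negative gradient is simply the residual. This is a minimal illustration, not the implementation used in this study: the depth-one regression trees (stumps), the learning rate, the number of rounds, and the twelve monthly values are all assumptions made for the example.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump (single split) on 1-D inputs."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gbm_fit(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared-error loss with stumps as weak learners."""
    f0 = sum(y) / len(y)              # step (i): constant minimizing squared loss
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Step (ii): negative gradient of squared loss = residuals.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)   # step (iii): fit a tree to the residuals
        stumps.append(stump)
        # Step (iv): add the shrunken tree to the ensemble and update predictions.
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: f0 + learning_rate * sum(s(xi) for s in stumps)

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

# Hypothetical monthly values with a seasonal shape, for illustration only.
x = list(range(1, 13))
y = [50.0, 70.0, 120.0, 180.0, 260.0, 340.0, 380.0, 360.0, 280.0, 180.0, 90.0, 55.0]

model = gbm_fit(x, y)
mse_constant = mse(y, [sum(y) / len(y)] * len(y))
mse_gbm = mse(y, [model(xi) for xi in x])
```

The learning rate shrinks the contribution of each tree, so the ensemble improves in many small steps; this is a common design choice in gradient boosting and is assumed here rather than taken from the study.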

Statistical Evaluation Metrics.
The four applied models were compared and assessed to select the best model for predicting monthly evaporation. Five statistical criteria, root mean square error (RMSE), mean absolute error (MAE), correlation coefficient (R), mean absolute percentage error (MAPE), and modified index of agreement (Md), were used to assess the models' performances in the training and testing phases. The mathematical expressions of these parameters are given in [64]. In these expressions, EP ob,i and EP sm,i are the actual and predicted monthly evaporation values at the i-th record, respectively, EP ob and EP sim (with overbars) are the mean observed and predicted monthly evaporation values, and n is the number of records.
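For orientation, the five criteria can be computed as in the sketch below. The formulas are the commonly used textbook definitions and are an assumption here, since the exact expressions of [64] are not reproduced in the text; in particular, MAPE is written as a fraction rather than a percentage, and Md follows the absolute-value form of the index of agreement. The observed and simulated values are hypothetical.

```python
import math

def rmse(obs, sim):
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

def mae(obs, sim):
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)

def mape(obs, sim):
    # Expressed as a fraction (assumed); observed values must be nonzero.
    return sum(abs(o - s) / abs(o) for o, s in zip(obs, sim)) / len(obs)

def correlation(obs, sim):
    """Pearson correlation coefficient R."""
    mean_o = sum(obs) / len(obs)
    mean_s = sum(sim) / len(sim)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)

def modified_index_of_agreement(obs, sim):
    """Md with absolute differences (assumed form of the modified index)."""
    mean_o = sum(obs) / len(obs)
    numerator = sum(abs(o - s) for o, s in zip(obs, sim))
    denominator = sum(abs(s - mean_o) + abs(o - mean_o) for o, s in zip(obs, sim))
    return 1.0 - numerator / denominator

# Hypothetical observed and simulated monthly evaporation values.
observed = [100.0, 200.0, 300.0]
simulated = [110.0, 190.0, 310.0]

scores = {
    "RMSE": rmse(observed, simulated),
    "MAE": mae(observed, simulated),
    "MAPE": mape(observed, simulated),
    "R": correlation(observed, simulated),
    "Md": modified_index_of_agreement(observed, simulated),
}
```

Lower RMSE, MAE, and MAPE indicate better agreement, whereas R and Md approach 1 for a perfect model.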

Results and Discussion
In this study, four machine learning modeling approaches were developed to select the best model for predicting monthly evaporation. The four models (QRF, ELM, GBM, and GPR) were trained and validated using climate data collected from two different locations in Iraq. About seventy percent of the available data were used for calibration, and the remaining thirty percent were used for validating the predictive models. The models were assessed using different statistical criteria as well as graphical presentations.
For the case studies, the performances of the applied models during the training phase are summarized in Table 1 [...] performance in the simulation of the evaporation rate for both case studies according to the obtained statistical parameters.
To assess the prediction accuracy of the applied models for the two case studies, boxplot diagrams were constructed to show visually how closely the predicted values match the observed evaporation rates. The performances of the four applied models in predicting the monthly evaporation rate for the two case studies are illustrated in Figures 5 and 6, respectively. The clearest observation was the inability of the GPR model to estimate evaporation with acceptable accuracy. Moreover, this model could not provide satisfactory predictions, especially for the higher and lower values of evaporation. In contrast, both figures show that the GBM was superior because the median calculated for that model was very close to the actual value. Additionally, it simulated the higher and lower values of evaporation more successfully than the other models.
Although the GBM model was successful in estimating monthly evaporation during the training phase, it is essential to evaluate the proposed model with the testing dataset. Training results may give a misleading assessment because the model is trained using known inputs and their corresponding targets [65]. The testing phase is therefore crucial for assessing the quality of the predictive models, since it reveals their ability to generalize and to avoid overfitting [66]. The assessment of the applied models during the testing phase for the first case study (Diyala state) is presented in Table 3. The superiority of the GBM model over the other models in estimating the monthly evaporation is evident from the table. More specifically, the GBM model produced a satisfactory estimate with an RMSE of 28.478, MAE of 21.541, MAPE of 0.181, R of 0.976, and Md of 0.987. However, the QRF provided the worst prediction accuracy among the applied models. With respect to case study 2 (Erbil state), the performance of the GBM according to [...] The reported results for both case studies showed that the GBM significantly outperformed the other machine learning models. The superiority of this model can be measured by its capacity to reduce the MAE and RMSE for both stations during the testing phase (see Figure 7). For the first case study, GBM reduced MAE and RMSE by 7.17% and 21.01%, 16.51% and 15.74%, and 23.14% and 26.64% compared to ELM, GPR, and QRF, respectively. For the second case study (Erbil state), the corresponding reductions in MAE and RMSE were 10.88% and 9.24%, 15.24% and 5%, and 16.06% and 15.76% compared to ELM, GPR, and QRF, respectively.
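The text does not state how the enhancement percentages are computed. A plausible reading, assumed here, is the relative reduction of the error with respect to the comparison model, which can be sketched as follows. The function name and the two error values are hypothetical and are not taken from the tables.

```python
def percent_reduction(error_reference, error_proposed):
    """Relative error reduction (%) of the proposed model versus a reference model."""
    return 100.0 * (error_reference - error_proposed) / error_reference

# Hypothetical error values, for illustration only (not taken from the tables).
reference_rmse = 40.0
proposed_rmse = 30.0
reduction = percent_reduction(reference_rmse, proposed_rmse)
```

Under this definition, a positive value means that the proposed model has the lower error, and the same function applies unchanged to MAE.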
The visual assessment presented in Figures 8 and 9 showed that the monthly evaporation rates estimated by GBM for both stations during the testing phase were very close to the observed values. Moreover, statistical parameters such as the median and the highest and lowest values were very similar to the actual values. However, these figures showed that the GPR model performed poorly in both case studies compared to the other models.
For further assessment, Taylor diagrams were created using the prediction values obtained from the four models for both stations (see Figures 10 and 11). The advantage of the Taylor diagram is that it assesses the compared models against the actual data using three statistical parameters (standard deviation, root mean square error, and correlation coefficient). In addition, the evaporation rates obtained from each model and the actual values were plotted on a polar diagram. It can be seen from the figures for both stations that the GBM model was located closer to the actual values than the other models.

Conclusions
The evaporation rate is a significant element of the hydrological cycle, and the evaporation process is very complicated and stochastic in nature. In this paper, the capability of artificial intelligence models, namely, ELM, QRF, GBM, and GPR, was evaluated in the prediction of monthly evaporation at two stations located in Diyala and Erbil states, Iraq. The input parameters include meteorological data such as sunshine hours, minimum and maximum temperature, wind speed, and relative humidity. The models were assessed using different statistical criteria as well as graphical plots. The findings of this study revealed that the GBM modeling approach performed excellently in predicting the monthly rate of evaporation at both stations, with minimum forecasting errors. However, the QRF model showed the poorest performance compared with the other applied models. Overall, the results indicate that the suggested predictive model (GBM) is a promising technique for these regions; thus, it may assist local stakeholders in the management of water resources.

Recommendations
The recommendations for future research are as follows:
(i) This study recommends using the adopted GBM model to estimate monthly evaporation rates at several stations located in the middle and southern parts of Iraq. The GBM model showed good prediction accuracy in areas located in the eastern and northern parts of Iraq; thus, it is important to investigate the ability of this model to estimate evaporation in other regions.
(ii) The application of a feature selection tool is very important for choosing the most suitable input variables, thus reducing the model complexity [13,67].
(iii) The GBM model could be integrated with novel bio-inspired algorithms to enhance its prediction performance, thereby producing more accurate predictions [68][69][70].

Data Availability
The data are available upon request from the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.