Estimating the Pan Evaporation in Northwest China by Coupling CatBoost with Bat Algorithm

Accurate estimation of pan evaporation (Ep) is vital for the development of water resources and for agricultural water management, especially in arid and semi-arid regions where it is difficult to set up facilities and to measure pan evaporation accurately and consistently. Besides, combining pan evaporation models with pan coefficient (kp) models is a classic way to assess the reference evapotranspiration (ET0), which is indispensable for crop growth, irrigation scheduling, and economic assessment. This study evaluated the potential of a novel hybrid machine learning model coupling the Bat algorithm (Bat) with gradient boosting with categorical features support (CatBoost) for estimating daily pan evaporation in arid and semi-arid regions of northwest China. Two other commonly used algorithms, random forest (RF) and the original CatBoost (CB), were also applied for comparison. Daily meteorological data for 12 years (2006–2017) from 45 weather stations in arid and semi-arid areas of China, including minimum and maximum air temperature (Tmin, Tmax), relative humidity (RH), wind speed (U), and global solar radiation (Rs), were used as inputs to the three models to explore their ability to predict pan evaporation. The results revealed that the newly developed Bat-CB model (RMSE = 0.859–2.227 mm·d−1; MAE = 0.540–1.328 mm·d−1; NSE = 0.625–0.894; MAPE = 0.162–0.328) was superior to RF and CB. In addition, CB (RMSE = 0.897–2.754 mm·d−1; MAE = 0.531–1.77 mm·d−1; NSE = 0.147–0.869; MAPE = 0.161–0.421) slightly outperformed RF (RMSE = 1.005–3.604 mm·d−1; MAE = 0.644–2.479 mm·d−1; NSE = −1.242–0.894; MAPE = 0.176–0.686), which handled the erratic changes of pan evaporation poorly. Furthermore, the improvement of Bat-CB over CB and RF was more comprehensive and obvious in the seasonal and spatial performance. Overall, Bat-CB has high accuracy, robust stability, and great potential for Ep estimation in arid and semi-arid regions of northwest China, and the findings of this study are equally relevant for adjacent countries.


Introduction
Evaporation is a key component of meteorological science, water resources evaluation, and the hydrological cycle [1,2]. Accurate simulation of evaporation contributes to many areas, including hydrology and water resources management, agricultural activities, irrigation scheduling, and water conservation, especially in arid regions [3,4]. However, evaporation is extremely difficult to represent effectively because of the complex interactions between the land and atmosphere systems [5]. Nowadays, the methods for determining evaporation are generally divided into estimation by models and direct measurement

Water 2021, 13, 256

Ep prediction in Poyang Lake Basin of southern China. The results showed that the FPAELM model performed best among these models at all stations. Seifi, A., et al. [25] evaluated the capability of three novel ANN models hybridized with the Genetic Algorithm (GA), Grey Wolf Optimization (GWO), and the Whale Optimization Algorithm (WOA) under five different climate conditions in Iran for estimating Ep. They concluded that the ANN-GA model performed better than the other two models in estimating daily Ep and that the hybrid ANN model represented the input-output relationships effectively.
The Bat algorithm (BA) is a meta-heuristic algorithm based on swarm intelligence for global optimization [26]. Owing to its high accuracy, effectiveness, and maneuverability in optimizing parameters, the Bat algorithm has been applied in fields such as environmental resource scheduling, flood routing, rainfall forecasting, and evapotranspiration estimation [27–31]. Dong, J., et al. [32] concluded that the ELM model coupled with the Bat algorithm showed the highest accuracy and stability in estimating daily dew point temperature among ten models (including KNEA, GA-ELM, POS-ELM, ANN, ANFIS, RF, SVM, ELM, and MARS). The Bat algorithm therefore has considerable potential for self-improvement and for optimizing non-linear parameters. To the best of our knowledge, meta-heuristic algorithms such as the Bat algorithm have seen minimal application in the hydrological field, and none in Ep prediction.
Besides the Bat algorithm, tree-based ensemble models such as gradient boosting with categorical features support (CatBoost) have their own advantages and comparable predictive potential. CatBoost is a novel gradient boosting technique proposed by Yandex. Owing to its good performance, it has been applied in many fields, such as weather forecasting, media popularity prediction, and reference evapotranspiration estimation [33,34]. Huang, G., et al. [35] compared the CatBoost model with SVM and RF models for estimating reference evapotranspiration in humid regions of China and found that CatBoost was significantly superior not only in accuracy and stability but also in computing time and memory usage. Zhang, Y., et al. [20] further evaluated the feasibility of the CatBoost model for estimating ET0 under the arid and semi-arid conditions of Northern China, using the generalized regression neural network (GRNN) and random forest (RF) models for comparison. Their findings showed that CatBoost held the same advantage over GRNN and RF and was the best alternative for estimating ET0.
Nevertheless, the different random permutations produced in CatBoost may have a great impact on the results. Besides, CatBoost has more parameters to set than other machine learning models, which increases the possibility of falling into local optima. Coupling an efficient search algorithm with CatBoost is a workable way to overcome this weakness, and the Bat algorithm stands out for its excellent global search ability. In addition, there seems to be no literature on using the Bat algorithm to optimize CatBoost, which is a potential application in hydrology, agriculture, and environmental fields, especially in pan evaporation estimation.
Thus, the objectives of this study were to (1) investigate the capability and usability of the hybrid model coupling CatBoost with the Bat algorithm (Bat-CB) for estimating Ep in arid and semi-arid regions of northwest China; and (2) evaluate the generalization performance of Bat-CB under seasonal and geographic conditions through weather data from 45 stations, in comparison with the CatBoost and RF models.

Random Forest (RF)
The random forest not only has striking predictive accuracy and widespread application in classification and regression but also handles the features in a dataset powerfully [36]. Besides, random forest is a compatible algorithm. Iwendi, C., et al. [37] found that an ensemble random forest outperformed all the other methods considered for improving intrusion detection systems. Therefore, random forest serves as the standard method against which the forecasting capacity of the two aforesaid advanced algorithms is examined. Based on the classification and regression tree (CART), random forest uses ensemble strategies such as bootstrap and bagging to handle high-dimensional regression problems [31,38,39].
To build a group of trees, random forest draws training subsets randomly from the original dataset by the bootstrap method, that is, sampling with replacement, and grows each tree until the minimum number of nodes is reached. The data not sampled from the original dataset are called "out-of-bag" (OOB) data and can be used to calculate the out-of-bag error, which is an unbiased estimate of the prediction error [38]. Additionally, RF grows trees unpruned, and each node is split using the best predictor among a randomly chosen subset of predictors rather than the best one among all predictors, which makes the model robust against overfitting. Finally, the forecast is determined through a bagging procedure that aggregates the predictions of all trees in the ensemble. More details about random forest can be found in Breiman, L. [38]. The structure of the RF is shown in Figure 1.
From the algorithm described above, only two parameters need to be tuned: the number of ensemble trees (ntree) and the number of predictors randomly selected at each node (mtry). The ntree value should be set so that every input is predicted enough times without increasing computing time excessively. As for mtry, the default values differ between classification and regression. The two optimal parameters vary among stations and were tuned by repeated training until the optimum appeared.
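The bootstrap sampling, the out-of-bag set, and the role of ntree and mtry described above can be illustrated with a minimal sketch. It uses only the Python standard library and does not train any trees; the function names and the values of `n_tree`, `m_try`, and `n_samples` are illustrative assumptions, not the settings used in this study.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample (with replacement); return in-bag and out-of-bag indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag

def choose_candidate_predictors(n_predictors, m_try, rng):
    """Pick the m_try predictors that one node is allowed to split on."""
    return sorted(rng.sample(range(n_predictors), m_try))

rng = random.Random(0)
n_tree, m_try = 100, 2              # illustrative values only
n_samples, n_predictors = 1000, 5   # five meteorological inputs, as in this study

oob_fractions = []
for _ in range(n_tree):
    in_bag, oob = bootstrap_split(n_samples, rng)
    oob_fractions.append(len(oob) / n_samples)

mean_oob = sum(oob_fractions) / n_tree
candidates = choose_candidate_predictors(n_predictors, m_try, rng)
print(f"mean OOB fraction over {n_tree} trees: {mean_oob:.3f}")
print(f"predictors considered at one node: {candidates}")
```

Roughly a third of the records fall outside each bootstrap sample, which is why the OOB records can serve as a built-in validation set for every tree.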

Gradient Boosting with Categorical Features Support (CB)
CatBoost, a novel machine-learning algorithm based on the gradient boosting decision tree (GBDT) algorithm, has been shown to surpass other advanced GBDT algorithms such as XGBoost and LightGBM in many respects, particularly when dealing with large amounts of data and many features. The enhancements are mainly reflected in three areas. First and foremost, traditional GBDT algorithms generally cope with categorical features by a method named Greedy Target Statistics (Greedy TS), which is quite efficient but subject to an inherent problem of conditional shift. To avoid this problem, CatBoost applies an approach that relies on the ordering principle so that it can overcome target leakage. This approach makes the whole dataset available for training and handles categorical features during training time. Specifically, CatBoost performs a random permutation of the dataset and selects one categorical feature, then calculates an average label value over the examples with the same category value placed before the selected example in the permutation. According to Prokhorenkova, L., et al. [40], if a permutation $\theta = [\sigma_1, \sigma_2, \ldots, \sigma_n]_n^T$ is sampled from the given dataset, the category value $x_{\sigma_p,k}$ is substituted with (Equation (1)):

$$\hat{x}_{\sigma_p,k} = \frac{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}] \, Y_{\sigma_j} + \beta P}{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}] + \beta} \quad (1)$$

In Equation (1), the bracket $[\cdot]$ equals 1 when the condition holds and 0 otherwise, $Y_{\sigma_j}$ is the label of the j-th example in the permutation, P is a prior value, and β is the weight of the prior value. The prior is usually the average label value in the dataset, and it helps reduce the noise from low-frequency categories.
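The ordered target statistic can be sketched in a few lines of Python. This is a minimal illustration that assumes the commonly published form of Equation (1), in which each example is encoded from the labels of same-category examples that precede it in the permutation, smoothed by a prior P with weight β. The function name and the toy data are hypothetical, and the inputs of this study are all numeric, so the example only illustrates the mechanism.

```python
def ordered_target_statistic(categories, labels, permutation, prior, beta):
    """Encode one categorical feature using only examples earlier in the permutation."""
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in permutation:
        cat = categories[idx]
        label_sum = sums.get(cat, 0.0)   # sum of labels seen so far for this category
        seen = counts.get(cat, 0)        # number of earlier examples in this category
        encoded[idx] = (label_sum + beta * prior) / (seen + beta)
        # Only now does the current label become visible to later examples.
        sums[cat] = label_sum + labels[idx]
        counts[cat] = seen + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]      # toy categorical feature
labels = [1.0, 0.0, 0.0, 1.0, 1.0]          # toy targets
permutation = [0, 1, 2, 3, 4]               # identity permutation for readability
prior = sum(labels) / len(labels)           # average label value, here 0.6
encoded = ordered_target_statistic(categories, labels, permutation, prior, beta=1.0)
print(encoded)
```

Because an example never sees its own label, the first occurrence of each category falls back to the prior, which is how the ordering avoids target leakage.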
Secondly, another pivotal enhancement of CatBoost is the conversion from the traditional gradient boosting algorithm to ordered boosting, which addresses the otherwise inevitable problem of gradient bias in the iteration process and increases the generalization ability. When GBDT substitutes categorical features with numerical values through a target statistic, the conditional distribution for a training example is not identical to that for a test example. Training a model without the specific sample, so that the residuals of the model are unshifted, can solve the gradient bias issue; however, this is difficult to carry out in practice. CatBoost generates multiple random permutations through a method inspired by the ordering principle, which can reduce overfitting efficiently and enhance the robustness of the models [40].
Thirdly, in handling categorical features, CatBoost constructs combinations of categorical features in a greedy way and uses these combinations as additional features. Namely, CatBoost combines the categorical features already present in the built trees with all categorical features in the dataset. This method helps the models capture high-order dependencies more easily and further improves the accuracy of estimation.
Another notable feature is that CatBoost uses oblivious decision trees as the base predictors. Such trees apply the same splitting criterion across an entire level of the tree, which makes them balanced, less prone to over-fitting, and faster at testing time.
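A short sketch may help to show why oblivious trees are fast to evaluate: since every node on a level shares one split, a prediction reduces to building a bit index from one comparison per level. The tree below is hypothetical; its features, thresholds, and leaf values are invented for illustration and are not taken from the fitted models.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate an oblivious tree.

    splits: one (feature_index, threshold) pair per level, shared by all nodes on that level.
    leaf_values: 2 ** len(splits) values, indexed by the bits of the comparison results.
    """
    leaf = 0
    for feature_index, threshold in splits:
        bit = 1 if x[feature_index] > threshold else 0
        leaf = (leaf << 1) | bit
    return leaf_values[leaf]

# Hypothetical depth-2 tree on [Tmax (degrees C), RH (%)]; all numbers are made up.
splits = [(0, 25.0), (1, 40.0)]
leaf_values = [2.0, 1.0, 8.0, 5.0]   # cool/dry, cool/humid, hot/dry, hot/humid

hot_dry = oblivious_tree_predict([32.0, 20.0], splits, leaf_values)
cool_humid = oblivious_tree_predict([10.0, 70.0], splits, leaf_values)
print(hot_dry, cool_humid)
```

A tree of depth d therefore needs exactly d comparisons and one table lookup per prediction, regardless of the input.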

Bat Algorithm Coupling with CatBoost (Bat-CB)
The bio-inspired bat algorithm is a metaheuristic algorithm originally introduced by Yang, X. S. [26], which mirrors the foraging behavior of microbats. In the search process, each bat emits high-frequency pulses to search for targets and analyzes its echolocation characteristics (i.e., velocity, loudness, and frequency), which helps it locate the target and strengthens its search ability. Mathematically, the bat algorithm can be implemented as follows:

1. Generate a population of bats for the simulation, and assign each bat an initial velocity $v_i$, frequency $f_i$, and position $x_i$.

2. From the first iteration to the maximum iteration, update the three characteristics at time t by Equations (2)-(4), given here in the standard formulation of Yang [26]:

$$f_i = f_{\min} + (f_{\max} - f_{\min})\,\beta \quad (2)$$

$$v_i^t = v_i^{t-1} + (x_i^{t-1} - x^*)\, f_i \quad (3)$$

$$x_i^t = x_i^{t-1} + v_i^t \quad (4)$$

In Equations (2)-(4), β ∈ [0, 1) is a random vector drawn from a uniform distribution, $f_{\min}$ and $f_{\max}$ are the lower and upper bounds of the frequency, $f_i$ controls the step length of bat movement, $x_i^t$ and $v_i^t$ are the updated positions and velocities of the bats at time t, respectively, and $x^*$ is the current best position (solution), namely, the position found after comparing the fitness values of all bats within the population.

3. Generate a random number (rand) as the criterion for whether the current solution needs improvement. If the random number is higher than $A^t$, the bats update their best positions through a random walk:

$$x_{new} = x_{old} + \mathrm{rand} \cdot A^t \quad (5)$$

where rand ∈ [−1, 1] and $A^t$ is the average loudness of all bats at time t.

4. Generate another random number. If rand < $A_i$ and $f(x_i) < f(x^*)$, accept the solution from the last step and update the emission rate $r_i$ and the loudness $A_i$ of each bat by:

$$A_i^{t+1} = \alpha A_i^t \quad (6)$$

$$r_i^{t+1} = r_i^0 \left[1 - \exp(-\gamma t)\right] \quad (7)$$

where α and γ are constants with 0 < α < 1 and γ > 0.

The iterations (steps 2 to 4) continue until the maximum number of iterations is reached. Finally, the fitness values of all bats are ranked to obtain the best position. The structure of the bat algorithm is shown in Figure 2.
In this study, the bat algorithm was integrated with CatBoost models for estimating pan evaporation. As stated previously, the parameters have a large impact on the final performance of CatBoost. In theory, choosing appropriate parameters for the CatBoost model strengthens the gradient boosting function and improves the forecasting ability remarkably. In Bat-CB, three vital parameters of the CatBoost model, namely the number of trees to grow (n rounds), the learning rate (eta), and the maximum depth of trees (depth), were optimized by the bat algorithm.
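The coupling described above can be sketched as a bat search over the three parameters (n rounds, eta, depth). The sketch follows the standard formulation of the bat algorithm, in which the local random walk is triggered when a random number exceeds the pulse rate, and it should not be read as the exact implementation used in this study. The objective `toy_validation_error` is a hypothetical stand-in for "train CatBoost with these parameters and return the validation RMSE", and the bounds, the step scale of 0.01, and all default constants are assumptions.

```python
import math
import random

def clip(position, bounds):
    """Keep every coordinate inside its (low, high) bound."""
    return [min(max(v, lo), hi) for v, (lo, hi) in zip(position, bounds)]

def bat_search(objective, bounds, n_bats=20, n_iter=100,
               f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, r0=0.5, seed=0):
    """Minimize objective(position) with a basic bat algorithm."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    loud = [1.0] * n_bats   # loudness A_i
    rate = [0.0] * n_bats   # pulse emission rate r_i, grows towards r0
    i_best = min(range(n_bats), key=fit.__getitem__)
    best, best_fit = pos[i_best][:], fit[i_best]
    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency, velocity, and position updates.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq for v, x, b in zip(vel[i], pos[i], best)]
            cand = clip([x + v for x, v in zip(pos[i], vel[i])], bounds)
            if rng.random() > rate[i]:
                # Local random walk around the best solution, scaled by mean loudness.
                cand = clip([b + rng.uniform(-1.0, 1.0) * mean_loud * 0.01 * (hi - lo)
                             for b, (lo, hi) in zip(best, bounds)], bounds)
            cand_fit = objective(cand)
            if rng.random() < loud[i] and cand_fit < fit[i]:
                # Accept the candidate, lower the loudness, raise the pulse rate.
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))
            if cand_fit < best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit

def toy_validation_error(params):
    """Hypothetical stand-in for the validation RMSE of a trained model."""
    n_rounds, eta, depth = round(params[0]), params[1], round(params[2])
    return ((n_rounds - 600) / 1000) ** 2 + (eta - 0.1) ** 2 + ((depth - 6) / 10) ** 2

bounds = [(100, 1000), (0.01, 0.3), (2, 10)]   # assumed ranges for n rounds, eta, depth
best, best_fit = bat_search(toy_validation_error, bounds)
print([round(best[0]), round(best[1], 3), round(best[2])], best_fit)
```

The integer parameters are rounded inside the objective so that the search itself can stay continuous; in a real run the objective would dominate the cost, so the numbers of bats and iterations would need to be chosen with the training time in mind.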

Study Area
The study area, which covers nearly one-sixth of China, comprises most of Xinjiang and the northwestern regions of Gansu, Ningxia, and Inner Mongolia. It lies adjacent to Central Asia, far from the seas and little influenced by the summer monsoon and humid ocean air. It therefore has a typical temperate continental climate characterized by heat, dryness, abundant sunshine, and scarce precipitation. The annual precipitation mostly ranges from 100 mm to 300 mm, while the annual evaporation is higher than 1500 mm and even reaches 3000 mm. The multi-year mean relative humidity is about 50.33%, far below that in the eastern and western regions of China (mostly 60% to 80%). Another noteworthy characteristic is that evaporation varies greatly with the seasons: evaporation in summer is 10-30 times that in spring and winter.
For these reasons, the water shortage is extraordinarily severe in the arid and semi-arid regions of Northwest China and is the biggest obstacle to the socio-economic development of the study area. Solving the water shortage in the study area is a research hotspot with great potential value and significance. Additionally, the typical temperate continental climate, which covers about three-fifths of Asia, is the most widely distributed climate in Asia. The results of this study may therefore have general significance for areas with a similar climate.

Dataset
In this study, a continuous, long-term series of daily meteorological data from 45 weather stations in Northwest China during 2006-2017 was selected for model training and testing. Five meteorological parameters, namely minimum air temperature (Tmin, °C), maximum air temperature (Tmax, °C), relative humidity (RH, %), wind speed (U, m s−1), and sunshine duration (N, h), were considered in the models. Global solar radiation (Rs, MJ m−2 d−1) data are insufficient because few stations in the study area can measure this parameter directly. Thus, Rs was calculated from the radiation on a completely clear day (R0) and the sunshine duration (N, h) through the empirical Ångström-Prescott model (A-P model) according to Fan, J., et al. [41]. The other four parameters, along with the sunshine duration and the completely clear day data, were obtained from the National Meteorological Information Center (NMIC), with quality control by the China Meteorological Administration (CMA) (http://data.cma.cn/). The observed pan evaporation, taken as the true values, was measured with evaporation pans at the 45 stations. The data were divided into two groups: one group (2006-2013) was used to develop and train the three artificial intelligence models, and the other group (2014-2017) was used for model testing. The statistical properties of the daily data at the selected 45 stations are shown in Table 1.

Statistical Analysis
Four statistical evaluation measures were used to comprehensively evaluate the performance of the different methods for pan evaporation estimation. The equations are as follows:

(i) Root mean square error (RMSE)

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(Y_{EST,i} - Y_{OBS,i}\right)^2} \quad (8)$$

(ii) Mean absolute error (MAE)

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left|Y_{EST,i} - Y_{OBS,i}\right| \quad (9)$$

(iii) Nash-Sutcliffe Efficiency (NSE)

$$NSE = 1 - \frac{\sum_{i=1}^{n} \left(Y_{EST,i} - Y_{OBS,i}\right)^2}{\sum_{i=1}^{n} \left(Y_{OBS,i} - Y_{OBS,MEAN}\right)^2} \quad (10)$$

(iv) Mean absolute percentage error (MAPE)

$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left|\frac{Y_{EST,i} - Y_{OBS,i}}{Y_{OBS,i}}\right| \quad (11)$$

In Equations (8)-(11), $Y_{EST,i}$ and $Y_{OBS,i}$ are the estimated and observed pan evaporation, respectively, $Y_{OBS,MEAN}$ is the average value of the observed pan evaporation, and n is the number of observations.

Table 2 shows the overall performance of the three machine learning methods at the 45 stations during the training and testing stages. In the training data, the three models showed high consistency among the different statistical indicators and value categories. The … 16.8% and 54.2%; the RMSE maximum decreased by 23.6% and 61.9%, while the minimum decreased by 4.3% and 16.9%. Evidently, RF achieves the best fit in training but performs worst in testing, which indicates that the RF model has the most serious over-fitting problem among the three models. This is in accordance not only with the predictions of ET0 by Zhang, Y., et al. [20], who reported that CatBoost had a smaller over-fitting problem than the RF and GRNN models in all input combinations, but also with the estimation of dew point temperature by Dong, J., et al. [32]. In particular, the RF estimates at some stations have quite large errors with no limit on excessive dispersion, while the CB and Bat-CB models perform better at the stations with large errors.
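The four evaluation measures (RMSE, MAE, NSE, and MAPE) can be computed with a few lines of standard-library Python. The sketch below assumes the usual definitions, with MAPE expressed as a fraction rather than a percentage to match the value ranges reported in this paper; the function name and the sample values are illustrative only.

```python
import math

def evaluate(estimated, observed):
    """Return RMSE, MAE, NSE, and MAPE for paired estimates and observations.

    MAPE is returned as a fraction and requires all observed values to be non-zero.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    errors = [e - o for e, o in zip(estimated, observed)]
    sum_sq = sum(d * d for d in errors)
    rmse = math.sqrt(sum_sq / n)
    mae = sum(abs(d) for d in errors) / n
    nse = 1.0 - sum_sq / sum((o - mean_obs) ** 2 for o in observed)
    mape = sum(abs(d) / abs(o) for d, o in zip(errors, observed)) / n
    return {"RMSE": rmse, "MAE": mae, "NSE": nse, "MAPE": mape}

observed = [2.0, 4.0, 6.0, 8.0]     # illustrative pan evaporation, mm/d
estimated = [2.5, 3.5, 6.5, 7.0]    # illustrative model output, mm/d
scores = evaluate(estimated, observed)
print(scores)
```

An NSE of 1 indicates a perfect match, while a negative NSE, such as the lower bound reported for RF, means the model is worse than simply predicting the observed mean.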

Results and Discussion
Nevertheless, in some studies an optimization algorithm did not improve on the original model [6,42,43]. Comparing the hybrid model with the original model and with another commonly used model whose parameter combination is not sophisticated was therefore indispensable for validating the novel model [8]. Here, the improvement of the Bat-CB model is relatively limited at the stations where the errors of the three methods are relatively small. Correspondingly, Bat-CB yields positive results for the stations and indicators whose values are relatively high. Specifically, compared with CB, Bat-CB shows a more conspicuous improvement in controlling the maximum value and the SD value in the testing stage (decreases of 22.1% and 31.3% on average over RMSE, MAE, and MAPE) than in the other three statistical indicators, namely the mean, the minimum, and the median (decreases of 13.1%, 0.7%, and 12.0%, respectively). It is noteworthy that Bat-CB outperformed CB in both the calibration and validation stages, which shows that the bat algorithm helps CB overcome the overfitting problem and improves the accuracy of prediction substantially. In general, the above statistical results preliminarily show that the Bat-CB model is superior to the CB and RF models.
Although Table 2 presents the four statistical indicators of the estimates as a whole, visualizing the performance of the models at every station is also necessary and convincing. Accordingly, Figure 3 presents the estimation results of the three models for the four statistical indices at the 45 stations. The models showed a consistent trend across the four indicators at the stations, which complements the previous conclusions. Bat-CB performed best at the majority of stations, while RF was the least satisfactory of the three models. Moreover, the degree of Bat-CB's advantage varied among stations. However, the superiority of the models was not absolute: RF still had considerable predictive ability at stations 51709 and 51232, and CB performed slightly better than Bat-CB at stations 52546, 52652, and 52674, especially in the RMSE and NSE indicators. Further research is still required to explain the advantages of the three models at various stations. Nevertheless, Bat-CB undoubtedly had the best predictive ability and the most robust stability at most stations among the three models.

To further observe the performance of the models, we randomly selected six stations dispersed across the study region and drew scatter plots of measured and simulated pan evaporation values (Figure 4).
It is conspicuous that the scatter plots of measured and simulated values of the three methods during the test period differ significantly. For better illustration, the linear fit equation and the coefficient of determination (R2) are included, which provide a general way to measure the overall adequacy of a model.
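The linear fit and the coefficient of determination shown with the scatter plots can be obtained by ordinary least squares. The following sketch assumes a simple regression of simulated on observed values; the function name and the data are illustrative, chosen so that the fitted slope is below 1.

```python
def linear_fit(observed, simulated):
    """Least-squares line simulated = slope * observed + intercept, with R^2."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(simulated) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    syy = sum((y - mean_y) ** 2 for y in simulated)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, simulated))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

observed = [1.0, 2.0, 3.0, 4.0, 5.0]     # illustrative values, mm/d
simulated = [1.0, 1.8, 2.6, 3.4, 4.2]    # lies exactly on 0.8 * observed + 0.2
slope, intercept, r_squared = linear_fit(observed, simulated)
print(slope, intercept, r_squared)
```

A slope below 1 combined with a high R2, as in this toy case, means the model tracks the observations closely but increasingly underestimates the larger values.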
The three methods all perform well when the evaporation is small (<4 mm d−1), but when the evaporation is larger than 4 mm d−1, the RF scatter points deviate from the 1:1 line, and all sites show obvious overestimation or underestimation. Given the special climate of the study area, the applicability and accuracy of RF are reduced and unconvincing in such arid and semi-arid areas, where the evaporation is far above normal levels for a long period every year. The CB method was slightly better than the RF method in general, but it highly overestimated evaporation at stations 52203 and 52681 when the evaporation was very large.
Compared with CB, Bat-CB has a higher R2 and converges more closely to the 1:1 line at all six sites without exception, as the plots show (Figure 4). This indicates that the Bat algorithm improves CatBoost at almost every site. Although the improvement from the bat algorithm is not so conspicuous at a minority of stations such as 52203, the bat algorithm clearly helps CB avoid the overfitting problem at sites 51704 and 51567.
It is noteworthy that both CB and Bat-CB generally underestimated the values (the slope is <1), which is contrary to the view in some studies that heuristic models overestimate high pan evaporation values [44]. In terms of the distribution of points, it is interesting that the forecasted values are distributed evenly around the observed values, especially when the values are large. This phenomenon is a signal to check the models in practical application, and the results can be averaged or the step span shortened to approach the real values efficiently. RF, however, has highly overestimated clusters of points at high values at four of the six stations (51567, 51931, 52203, and 52681), that is, two-thirds of them, which may even give the false impression that the results are convincing. RF cannot be corrected by the above methods. As a whole, Bat-CB avoids the problems of CB and RF with high accuracy and shows the strongest stability at every station, particularly in the estimation of large values.
On the other hand, the distribution of the absolute error (AE) can be used to further evaluate the applicability of the models (Figure 5).

AE is the absolute error, computed as AE = |FOR − OBS| (ideal value = 0). Hence, a model with more occurrences of AE close to zero is generally better and more convincing. At the above six stations, the three machine learning methods all have about 50% of points with AE below 0.4 mm·d−1. From 0 to 2 mm·d−1 of AE, the proportion of data gradually decreases. However, the percentage of points with AE larger than 2 mm·d−1 is mostly 10-20%. For the RF model in particular, errors larger than 2 mm·d−1 were more frequent than for the other models. In contrast, the proportion of AE larger than 2 mm·d−1 for Bat-CB was the most stable across the six stations, ranging from 0.06 to 0.13, and was the lowest at sites 51156, 51931, and 52681. For the two classes AE < 0.4 mm·d−1 and AE of 0.4-0.8 mm·d−1, the Bat-CB model had the highest proportion at four of the six stations (51156, 51567, 51704, and 51931), significantly higher than the CB and RF models. Although the RF model had a slightly higher percentage of points with AE < 0.4 mm·d−1 at station 52203 than the other two models (about 0.02 higher), it also had the highest percentage of points with AE > 2 mm·d−1 (about 0.19 higher), which is an important reason for the poor performance of the RF models. Generally, RF performed far worse than Bat-CB and CB at stations 51156, 51931, 52203, and 52681, and CB was mildly inferior to RF and Bat-CB at stations 51567 and 51704.
Bat-CB, by contrast, showed greater stability and higher accuracy than the other two models. This outcome is consistent with the results shown in the scatter plots (Figure 4). Accordingly, the AE histogram also confirmed that the main advantage of the more accurate Bat-CB model is that it reduces the proportion of points with large errors.
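The AE grouping discussed above can be illustrated with a minimal Python sketch. It computes AE = |FOR − OBS| for each day and reports the share of points in each error class. Only the 0.4, 0.8, and 2 mm·d−1 thresholds come from the text; the intermediate edges (1.2 and 1.6) and the names `EDGES`, `LABELS`, and `ae_proportions` are assumptions made for illustration, not the authors' code.

```python
from bisect import bisect_right

# Class edges in mm/d. Only 0.4, 0.8 and 2.0 appear in the text;
# 1.2 and 1.6 are assumed here to fill the range evenly.
EDGES = [0.4, 0.8, 1.2, 1.6, 2.0]
LABELS = ["<0.4", "0.4-0.8", "0.8-1.2", "1.2-1.6", "1.6-2.0", ">=2.0"]


def ae_proportions(forecast, observed):
    """Return the fraction of days falling in each absolute-error class."""
    if len(forecast) != len(observed) or not forecast:
        raise ValueError("forecast and observed must be non-empty and equal in length")
    counts = [0] * len(LABELS)
    for f, o in zip(forecast, observed):
        # AE = |FOR - OBS|; a value equal to an edge goes to the upper class.
        counts[bisect_right(EDGES, abs(f - o))] += 1
    total = len(forecast)
    return {label: c / total for label, c in zip(LABELS, counts)}
```

Calling `ae_proportions` once per model at a given station would give the proportions that a histogram of this kind compares, for example the share of points in the last class, which the text identifies as the main weakness of RF.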

Seasonal Effects on the Performance of Machine Learning Models
From the point of view of water resources management, the error at the monthly scale, especially across different seasons, is also imperative for the application of the models. At present, most machine learning models are vulnerable when handling sudden dramatic changes [8,45]. In this study, the stability and the accuracy of the models are equally important, especially when the data are extremely abnormal, like pan evaporation in the summer of arid regions. The average deviation of the three machine learning models across all stations for each month is shown in Table 3 and Figure 6 through four statistical indicators (MAE, MAPE, RMSE, and NSE). In brief, the main difference among the three algorithms is that their performance differs greatly from April to October, while they are almost indistinguishable from November to March. The Bat-CB model had the best performance from April to October on every indicator, followed by the CB model and the RF model. In addition, since the study area is located in the northern hemisphere, the absolute errors RMSE and MAE in summer are significantly higher than those in winter (December to February). RF's poor ability to operate in a non-stationary environment and adapt to erratic changes makes it prone to abrupt failures caused by seasonal variations of pan evaporation. RF had by far the worst NSE and MAPE values in September and performed unstably across months, which confirmed the aforesaid disadvantage. Additionally, using relative forecasting errors such as MAPE and NSE to assess the capacity of models under different conditions is necessary [45,46]. From the perspective of relative error (MAPE), the errors of the Bat-CB model are not significantly different among months. This indicates that the Bat-CB model is better balanced and performs well and robustly in different seasons.
Thus, Bat-CB may be a suitable hybrid model for overcoming the long-standing problem that most models cannot work effectively under suddenly changing conditions.
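As an illustration of the monthly evaluation described above, the following Python sketch groups daily observed and forecast values by calendar month and computes the four indicators for each month. It uses the standard textbook definitions of MAE, RMSE, MAPE (as a fraction), and NSE, which are assumed here to match those used in the study. The function names and the `(month, observed, forecast)` record layout are hypothetical.

```python
import math
from collections import defaultdict


def indicators(observed, forecast):
    """Standard MAE, RMSE, MAPE (fraction) and NSE for paired series.

    Assumes no observed value is zero (for MAPE) and that the observed
    series is not constant (for NSE).
    """
    n = len(observed)
    diffs = [f - o for o, f in zip(observed, forecast)]
    sq_err = sum(d * d for d in diffs)
    mean_obs = sum(observed) / n
    return {
        "MAE": sum(abs(d) for d in diffs) / n,
        "RMSE": math.sqrt(sq_err / n),
        "MAPE": sum(abs(d) / abs(o) for d, o in zip(diffs, observed)) / n,
        # NSE = 1 - sum((obs - fc)^2) / sum((obs - mean(obs))^2)
        "NSE": 1.0 - sq_err / sum((o - mean_obs) ** 2 for o in observed),
    }


def monthly_indicators(records):
    """records: iterable of (month, observed, forecast) tuples, month in 1..12."""
    by_month = defaultdict(lambda: ([], []))
    for month, obs, fc in records:
        by_month[month][0].append(obs)
        by_month[month][1].append(fc)
    return {m: indicators(o, f) for m, (o, f) in sorted(by_month.items())}
```

Running `monthly_indicators` once per model makes the seasonal contrast easy to inspect: absolute errors (MAE, RMSE) scale with the magnitude of summer evaporation, whereas MAPE and NSE are relative measures and so are better suited to comparing months.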

Spatial Effects on the Performance of Machine Learning Models
Like seasonal variation, the spatial performance of machine learning models is another far-reaching factor reflecting their generalization ability. The performance of machine learning models for estimating pan evaporation differs visibly among weather stations [20,23,42,47]. Marking the positions of the 45 stations on a map and mapping the RMSE value to color is an efficient visualization method for exploring the spatial generalization ability of the three models (Figure 7). Figure 7 shows that Bat-CB had the lowest RMSE at most stations, while RF performed worst overall. This result, together with the large number of weather stations, again supports the high prediction accuracy, robust stability, and reliable generalization of Bat-CB.

However, some individual stations, such as stations 51855 and 51573, did not reach satisfactory results with any of the three models, probably because these stations are affected by complicated terrain and an unstable climate. Thus, model performance varied considerably at such individual stations, and further study is necessary in this area. Additionally, it is noteworthy that the several stations located in the Turpan Depression, in the middle of Xinjiang province, performed worst, possibly because of the extremely large values of pan evaporation there. In contrast, pan evaporation at the stations in the western and northern regions of the study area can be forecasted by Bat-CB with very small error, which indicates that Bat-CB may have potential application in the boundary regions and in adjacent countries such as Russia, Kazakhstan, Iran, and Kyrgyzstan.
Figure 7. Spatial distribution of the RMSE during the testing stage generated from the three models for pan evaporation forecasting.
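The per-station comparison behind a map of this kind can be sketched in Python as follows: compute the testing-stage RMSE of each model at each station and record which model is lowest. The nested dictionary layout and the function names are hypothetical, and the station keys and values in any example call are placeholders, not results from the study.

```python
import math


def rmse(observed, forecast):
    """Root mean square error of paired observed and forecast values."""
    n = len(observed)
    return math.sqrt(sum((f - o) ** 2 for o, f in zip(observed, forecast)) / n)


def rank_models_by_station(station_data):
    """station_data: {station_id: {"obs": [...], "models": {name: [...]}}}.

    Returns, per station, the RMSE of every model and the name of the
    model with the lowest RMSE.
    """
    summary = {}
    for station_id, data in station_data.items():
        scores = {name: rmse(data["obs"], fc) for name, fc in data["models"].items()}
        summary[station_id] = {"rmse": scores, "best": min(scores, key=scores.get)}
    return summary
```

The per-station RMSE values returned here are what would be mapped to marker color at each station's coordinates, and counting the `"best"` entries gives a quick check of how many stations each model wins.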

Conclusions
The study established a novel hybrid machine learning model (Bat-CB) and evaluated its ability to accurately estimate pan evaporation in the arid and semi-arid zones of northwest China. The CatBoost coupled with Bat algorithm (Bat-CB) model, along with the original CatBoost and a commonly used tree-based RF algorithm, was fed with meteorological data (Tmax, Tmin, Rs, RH, and U) from 45 stations during 2006–2017 and evaluated through four statistical indicators (RMSE, MAE, MAPE, and NSE). The results showed that Bat-CB exhibited suitable accuracy and stability in arid and semi-arid regions and was conspicuously superior to CB and RF. CB has a slight advantage over RF, which shows poor ability to handle large erratic changes such as those of pan evaporation in arid regions. The improvement brought by the Bat algorithm was conspicuous and was expressed in almost every indicator and aspect compared to the original CatBoost. In the seasonal performance analysis, Bat-CB was better balanced across months and exhibited more accuracy and robustness from April to October in comparison with RF and CatBoost. In the spatial performance analysis, the results again confirmed that Bat-CB had the strongest predictive ability for pan evaporation and indicated that further study of spatial generalization is still essential. Nevertheless, different combinations of meteorological inputs and more climate types were not covered in this study. Further research exploring applications in other climates and under missing or limited meteorological data would be valuable. Overall, Bat-CB had a powerful ability for pan evaporation forecasting and clearly outperformed CatBoost and RF in many respects, especially in arid and semi-arid areas.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.