Evaluation of soft computing and regression-based techniques for the estimation of evaporation

The estimation of evaporation at the field as well as the regional level is required for the efficient planning and management of water resources. In the present study, artificial neural network (ANN) and multiple linear regression (MLR)-based models were developed to estimate pan evaporation on the basis of one day-lagged rainfall ($P_{t-1}$), one day-lagged relative humidity ($RH_{t-1}$), current day maximum temperature ($T_{\max}$) and minimum temperature ($T_{\min}$). These were selected as the most effective parameters on the basis of cross-correlation. The performance of the models was evaluated using the correlation coefficient (r), root-mean-square error (RMSE) and Nash–Sutcliffe efficiency (coefficient of efficiency, CE) during calibration and validation periods. Based on the comparison, the ANN model (4-9-1), with sigmoid as the activation function and Levenberg–Marquardt as the learning algorithm, was selected as the best performing model among all ANN models. The values of r, CE and RMSE for the training and validation periods were 0.885, 0.785 and 1.00 mm/day and 0.889, 0.782 and 1.01 mm/day, respectively, for the ANN model (4-9-1). The corresponding values for the selected MLR model were 0.835, 0.698 and 1.19 mm/day and 0.866, 0.750 and 1.15 mm/day, respectively. Based on the sensitivity analysis, $RH_{t-1}$ was identified as the most effective parameter, followed by $P_{t-1}$, $T_{\max}$ and $T_{\min}$. The developed model can be utilized as an alternative for the estimation of evaporation at the regional level with limited input data.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/). doi: 10.2166/wcc.2019.101

Aparajita Singh, R. M. Singh, Ashish Kumar, Subodh Hanwat, V. K. Tripathi (corresponding author)
Department of Farm Engineering, Institute of Agricultural Science, Banaras Hindu University, Varanasi, Uttar Pradesh 221005, India
E-mail: tripathiwtcer@gmail.com

A. R. Senthil Kumar
National Institute of Hydrology, Roorkee 247667, Uttarakhand


INTRODUCTION
Evaporation is a major diffusive process of the hydrological cycle in which water changes from the liquid state to the vapor state through the transfer of heat energy. It plays a vital role in water resource planning and development in arid and semi-arid climatic regions (Shirsath & Singh ). The management of scarce water resources is important for sustainable crop production (Shirgure & Rajput ). In hot climates, evaporation from rivers, canals and open-water bodies accounts for an important share of water loss. Evaporation losses are significant even in humid areas, but the cumulative mean precipitation over these areas can mask them, so they are not ordinarily recognized except during rainless periods. The estimation of evaporation is required in water balance computations, irrigation management, crop yield forecasting, river flow forecasting and ecosystem modeling in hydrology, agronomy, forestry and land resource planning (Terzi & Keskin ).
Irrigation scheduling is based on evapotranspiration, which is estimated from evaporation; this provides a systematic basis for efficient irrigation scheduling. Farm irrigation systems and water resource development projects are also designed using the long-term mean value of evaporation, together with the magnitude and variation of evaporation losses (Shiri et al. ). There is a growing demand for evaporation data for studies of surface water and energy fluxes, especially those addressing the impacts of global warming (Xu & Singh ). The accurate estimation of evaporation is fundamental for the effective management of water resources. Therefore, the need for reliable models for quantifying evaporation losses from increasingly scarce water resources is greater than ever before (Tabari et al. ).
Obtaining appropriate and consistent evaporation measurements over a long duration, whether by direct or indirect methods, is always challenging for researchers. Direct measurement methods use the United States Weather Bureau (USWB) Class A pan evaporimeter and eddy correlation techniques, whereas indirect methods use meteorological variables to estimate evaporation. The most widely used crop monitoring and forecasting models of the Food and Agriculture Organization (FAO) are based on evaporation estimates (Gommes ). Yihdego & Webb () evaluated the seasonal differences between measured pan evaporation and estimated evaporation for Lake Burrumbeet, Australia, and showed that evaporation is fully radiation driven and that the effect of wind is minimal. Although various empirical formulas are available, their performance is not satisfactory because of the complex nature of the evaporation process and the non-availability of input data. Therefore, the estimation of evaporation from easily available meteorological parameters is a good alternative.
Evaporation is a complex and nonlinear phenomenon, as it depends on several interacting meteorological factors such as temperature, humidity, wind speed and bright sunshine hours (Xu & Singh , ). Unfortunately, reliable estimates of evaporation are extremely difficult to obtain because of complex interactions between the components of the soil-plant-atmosphere system. This is partly because a wide range of data types and expertise is required (Goyal et al. ). Many conventional models, such as empirical, regression-based and conceptual models, have been developed to estimate evaporation from a large quantity of data, but they have produced less accurate results. Models developed from meteorological data, such as multiple regression models, involve empirical relationships that account to some extent for local conditions. Therefore, most of these models may give reliable results when applied to climatic conditions similar to those for which they were developed. Without local or regional calibration, however, their use under greatly different climatic conditions may give results that differ considerably. No single model is universally adequate under all climatic conditions, so it is difficult to select the most appropriate evaporation model for a given region.

Study area
The present study was carried out in Roorkee, having an [...] June) and 20 to 40 °C in the rainy season (July to September). The humidity ranges from 30% to 100%. The soil is alluvial soil of the Ganga plain, derived from the soft dolomitic rocks of the Himalayas. Due to its location away from any major water body and its proximity to the Himalayas, the study area of Roorkee has an extreme and erratic continental climate, with an average annual wind speed of 4.9 m/s (http://nihroorkee.gov.in/location.html; accessed 27 September 2018). Daily rainfall (P), minimum temperature ($T_{\min}$), maximum temperature ($T_{\max}$) and relative humidity (RH) data from January 2001 to December 2013 were collected from NIH (), Roorkee Observatory.

Artificial neural network
An ANN is an information processing structure that consists of several interconnected processing elements called nodes, which are analogous to the neurons of the human brain.
Each node combines some inputs and produces an output [...] consists of nodes that receive the hidden layer output and send it to the user. Feedforward means that all the interconnections between the layers propagate forward to the next layer. The main advantage of feedforward neural network (FFNN) structures is that they are easy to handle and can approximate any input/output map (Hornik ). The log-sigmoid activation function was used in this study to introduce nonlinearity into the model. Each node is a simple processing element that responds to the weighted inputs that it receives from other nodes. The receiving node sums the weighted signals from all nodes to which it is connected in the preceding layer.
The net input to a node is the weighted sum of all the incoming signals, as represented in the following equation:

$$ net = \sum_{i=1}^{n} w_i x_i \tag{1} $$

where $x_i$ is the signal received from the i-th node of the preceding layer, $w_i$ is the weight of that connection and n is the number of incoming connections. The activation function, y, is a nonlinear function of the net input and is described by the logistic sigmoid function as per the following equation:

$$ y = \frac{1}{1 + e^{-net}} \tag{2} $$
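The node computation described above can be illustrated with a short sketch. It assumes a 4-9-1 geometry with log-sigmoid hidden nodes and a linear output node, as reported for the selected model; the function names, the random weights, the example inputs and the omission of bias terms are assumptions made for illustration only, not the trained network from the study.

```python
import math
import random


def log_sigmoid(net):
    """Logistic (log-sigmoid) activation: maps a real net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))


def net_input(weights, inputs):
    """Weighted sum of the incoming signals to one node."""
    return sum(w * x for w, x in zip(weights, inputs))


def forward(inputs, hidden_weights, output_weights):
    """One pass through a single-hidden-layer feedforward network.

    Hidden nodes apply the log-sigmoid; the single output node is linear.
    Bias terms are omitted here (an assumption) to mirror the plain
    weighted-sum description in the text.
    """
    hidden = [log_sigmoid(net_input(w, inputs)) for w in hidden_weights]
    return net_input(output_weights, hidden)


# Hypothetical, untrained weights for a 4-9-1 geometry.
random.seed(0)
n_inputs, n_hidden = 4, 9
hidden_weights = [[random.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
output_weights = [random.uniform(-1, 1) for _ in range(n_hidden)]

# Hypothetical normalized inputs: lagged rainfall, lagged RH, Tmax, Tmin.
x = [0.10, 0.65, 0.80, 0.55]
estimate = forward(x, hidden_weights, output_weights)
```

In a trained model the weights would come from the learning algorithm (Levenberg–Marquardt in the study) rather than from a random generator.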

Network architecture
The network geometry is generally highly problem-oriented, and the optimal network geometry is obtained through the trial-and-error [...]

Multiple linear regressions
MLR is a multivariate statistical technique used to model the linear relationship between a dependent variable y and two or more independent variables by fitting a linear equation to observed data. A value of the dependent variable y is associated with each set of values of the independent variables. The regression equation of y can be written as follows:

$$ y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + C $$

where y is the response (dependent) variable, $x_1, x_2, \dots, x_n$ are the independent variables, $m_1, m_2, \dots, m_n$ are the regression coefficients and C is a constant.
When there is a linear relationship between the dependent variable and the independent variables, MLR is the standard method for estimating the response of the dependent variable to the various independent variables.
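As an illustration of how the regression coefficients and the constant can be obtained, the following sketch fits the linear equation by ordinary least squares using the normal equations. The solver, the function name `fit_mlr` and the synthetic data are assumptions for illustration; the source does not state which software or fitting routine was used.

```python
def fit_mlr(rows, y):
    """Least-squares fit of y = m1*x1 + ... + mn*xn + C.

    rows: list of observations, each a list of n independent variables.
    Returns (coefficients [m1..mn], constant C).
    Raises ZeroDivisionError if the inputs are perfectly collinear.
    """
    # Append a constant 1 to each row so that C is fitted as a coefficient.
    X = [list(r) + [1.0] for r in rows]
    k = len(X[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    b = [0.0] * k
    for i in reversed(range(k)):
        s = sum(A[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (A[i][k] - s) / A[i][i]
    return b[:-1], b[-1]


# Synthetic data generated from y = 2*x1 - 3*x2 + 1 (hypothetical values).
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [2.0 * x1 - 3.0 * x2 + 1.0 for x1, x2 in rows]
coeffs, intercept = fit_mlr(rows, y)
```

Because the synthetic data are exactly linear, the fit recovers the generating coefficients up to floating-point error; real meteorological data would leave a residual.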

Data formulation
The daily observed data of relative humidity (RH), rainfall (P), minimum temperature ($T_{\min}$) and maximum temperature ($T_{\max}$) [...] were used for validation. These datasets were selected based on a trial-and-error method.

Normalization of input data
The input values were normalized (using Equation (5)) between 0 and 1 before being passed into the neural network, since the output of the sigmoid function is bounded between 0 and 1 [...] where $R_i$ is the real value applied to neuron i, $N_i$ is the sub[...]
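A common way to scale inputs into the range 0 to 1 is min-max scaling, sketched below. This is an assumption: Equation (5) is not legible in this copy, so the exact normalization used in the study may differ. The function name and the temperature values are hypothetical.

```python
def min_max_normalize(values):
    """Scale a series linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical daily maximum temperatures in degrees Celsius.
tmax = [18.5, 24.0, 31.2, 39.8, 35.1]
scaled = min_max_normalize(tmax)
```

The minimum and maximum should be taken from the calibration data and reused for the validation data, so that both periods are scaled identically.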

Coefficient of efficiency
Based on the standardization of the residual variance with the initial variance, the CE was used to compare the relative performance of the two approaches and is given by the following equation:

$$ CE = 1 - \frac{\sum_{j=1}^{n} (y_{ej} - y_j)^2}{\sum_{j=1}^{n} (y_{ej} - \bar{y}_{ej})^2} \tag{6} $$
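The CE can be computed directly from paired observed and predicted series. The sketch below follows the standard Nash–Sutcliffe form (one minus the ratio of residual variance to the variance of the observations); the function name and the sample values are hypothetical.

```python
def coefficient_of_efficiency(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - residual variance / initial variance."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    initial = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / initial


# Hypothetical evaporation values in mm/day.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
ce = coefficient_of_efficiency(observed, predicted)
```

A value of 1 indicates a perfect fit, and a value of 0 means the model performs no better than predicting the mean of the observations.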

Root-mean-square error
RMSE is used to compare the model's simulated responses with the recorded watershed responses. The RMSE is zero for a perfect fit, whereas increasing values indicate larger discrepancies between predicted and observed values. The lower the RMSE, the more accurate the estimation. It is given by the following equation:

$$ RMSE = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (y_{ej} - y_j)^2} \tag{7} $$
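A minimal sketch of the RMSE calculation follows; the function name and the sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between paired observed and predicted values."""
    n = len(observed)
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / n)


# Hypothetical evaporation values in mm/day.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
error = rmse(observed, predicted)
```

The result carries the units of the data, which is why the RMSE values in this study are reported in mm/day.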

Correlation coefficient
The correlation coefficient (r) is an indicator of the degree of closeness between observed and predicted values. If the observed and predicted values are completely independent, the correlation coefficient is zero. It is given by the following equation:

$$ r = \frac{\sum_{j=1}^{n} \{(y_j - \bar{y}_j)(y_{ej} - \bar{y}_{ej})\}}{\sqrt{\sum_{j=1}^{n} (y_j - \bar{y}_j)^2 \sum_{j=1}^{n} (y_{ej} - \bar{y}_{ej})^2}} \times 100 \tag{8} $$

where $y_j$ is the predicted value, $y_{ej}$ is the observed value, $\bar{y}_j$ is the mean of the predicted values, $\bar{y}_{ej}$ is the mean of the observed values and n is the number of observations.
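The correlation coefficient can be computed as in the sketch below. It returns r as a fraction between -1 and 1, matching the values quoted in the text (for example 0.885), so no percentage scaling is applied; the function name and the sample values are hypothetical.

```python
import math


def correlation_coefficient(observed, predicted):
    """Pearson correlation between paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((p - mean_pred) * (o - mean_obs)
              for o, p in zip(observed, predicted))
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    return cov / math.sqrt(var_pred * var_obs)


# Hypothetical evaporation values in mm/day.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
r = correlation_coefficient(observed, predicted)
```

Unlike the CE and RMSE, r measures only linear association, so a model with a systematic bias can still score close to 1.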

Selection of input vector
The cross-correlation analysis between the dependent and independent variables was carried out to select the significant input vector. To identify the input vector, a detailed cross-correlation analysis of the following variables was done.
(i) Daily rainfall (P) values with daily evaporation values.
(ii) Daily maximum ($T_{\max}$) and minimum ($T_{\min}$) temperature values with daily evaporation values.
(iii) Daily relative humidity (RH) values with daily evaporation values.
The cross-correlation graph was plotted between the inputs used in the analysis and pan evaporation. It was observed that the daily rainfall (P) and RH values are negatively correlated with the evaporation values (Figures 3(a), 3(b), 4(a) and 4(b)).
The correlation analysis was also done to find the exact lag between the dependent and independent variables. The correlation coefficients were estimated between daily pan evaporation and the different input parameters: daily rainfall (P), daily relative humidity (RH), daily maximum temperature ($T_{\max}$) and daily minimum temperature ($T_{\min}$).
The maximum positive correlation coefficient estimated between $T_{\max}$ and evaporation was 0.72, whereas that between $T_{\min}$ and evaporation was 0.53. Negative correlation coefficients of −0.030 and −0.610 were found between evaporation and rainfall (P) and between evaporation and relative humidity (RH), respectively, as shown in Table 1. A negative correlation means that evaporation decreases with an increase in rainfall and RH.
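The lag search described above can be sketched as a correlation between evaporation on day t and an input on day t minus the lag. The helper names and the synthetic series are assumptions for illustration; the series is constructed so that evaporation depends exactly on the previous day's RH.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def lagged_correlation(target, driver, lag):
    """Correlation between target[t] and driver[t - lag], for lag >= 0."""
    if lag < 0 or lag >= len(target):
        raise ValueError("lag must satisfy 0 <= lag < len(target)")
    return pearson(driver[:len(driver) - lag], target[lag:])


# Synthetic series: evaporation on day t is a decreasing linear function
# of relative humidity on day t-1 (hypothetical numbers).
rh = [40.0, 55.0, 70.0, 65.0, 50.0, 80.0, 60.0, 45.0]
evap = [5.0] + [10.0 - 0.1 * v for v in rh[:-1]]

r_lag0 = lagged_correlation(evap, rh, 0)
r_lag1 = lagged_correlation(evap, rh, 1)
```

Comparing the coefficients across lags shows which lag gives the strongest relationship; in this constructed example the lag of one day gives a coefficient of essentially -1.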
The inputs for the ANN model, based on the correlation analysis, were one day-lagged rainfall ($P_{t-1}$), one day-lagged relative humidity ($RH_{t-1}$), current day maximum temperature ($T_{\max}$) and current day minimum temperature ($T_{\min}$), as shown in the following equation:

$$ E(t) = f(P_{t-1}, RH_{t-1}, T_{\max}, T_{\min}) $$

There was one node in the output layer, which estimates the evaporation E(t).
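Assembling this input vector from daily series amounts to pairing the previous day's rainfall and RH with the current day's temperatures and evaporation. The sketch below shows one way to do this; the function name and the sample values are hypothetical.

```python
def build_samples(rain, rh, tmax, tmin, evap):
    """Pair (P[t-1], RH[t-1], Tmax[t], Tmin[t]) with the target E[t].

    The first day is dropped because it has no previous-day values.
    """
    samples = []
    for t in range(1, len(evap)):
        inputs = (rain[t - 1], rh[t - 1], tmax[t], tmin[t])
        samples.append((inputs, evap[t]))
    return samples


# Hypothetical three-day record.
rain = [0.0, 5.0, 2.0]      # mm
rh = [60.0, 80.0, 70.0]     # %
tmax = [30.0, 28.0, 31.0]   # degrees Celsius
tmin = [20.0, 19.0, 21.0]   # degrees Celsius
evap = [4.0, 3.0, 4.5]      # mm/day

samples = build_samples(rain, rh, tmax, tmin, evap)
```

The same samples can feed both the ANN and the MLR model, since the two use identical inputs.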

Development of ANN- and MLR-based models
The estimation equation developed through MLR analysis computes the daily pan evaporation with the same input parameters used in the ANN model. The one day-lagged rainfall ($P_{t-1}$), one day-lagged RH ($RH_{t-1}$), daily $T_{\max}$ and daily $T_{\min}$ were taken as the independent variables, and pan evaporation was taken as the dependent variable for the MLR model. The regression equation developed for the present study is given in the following equation:

$$ E = m_1 P_{t-1} + m_2 T_{\max} + m_3 T_{\min} + m_4 RH_{t-1} + C $$

The coefficients $m_1$, $m_2$, $m_3$ and $m_4$ were found from the regression analysis between the dependent variable and the independent variables. The regression coefficient values were $m_1 = -0.010$ between evaporation and daily rainfall, $m_2 = 0.199$ and $m_3 = 0.008$ between daily evaporation and daily maximum and minimum temperature, respectively, and $m_4 = -0.044$ between evaporation and daily RH.
From the regression analysis, it was observed that rainfall and RH were negatively related to evaporation. The multiple linear equation was developed as per the following equation:

$$ E = -0.010\,P_{t-1} + 0.199\,T_{\max} + 0.008\,T_{\min} - 0.044\,RH_{t-1} + 0.313 $$

The number of neurons in the hidden layer was explored from 1 to 10 based on the trial-and-error procedure. The transfer functions of the hidden and output layers were log-sigmoid and pure linear, respectively, in the training of the ANN model.
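The fitted regression can be evaluated directly for a given day. The sketch below assumes the coefficient pairing described in the text (−0.010 with $P_{t-1}$, 0.199 with $T_{\max}$, 0.008 with $T_{\min}$, −0.044 with $RH_{t-1}$, constant 0.313). The input units (rainfall in mm, RH in percent, temperature in degrees Celsius), the function name and the example day are also assumptions.

```python
def mlr_evaporation(rain_prev, rh_prev, tmax, tmin):
    """Daily pan evaporation (mm/day) from the fitted linear equation.

    Coefficient pairing follows the text: rainfall and RH carry negative
    coefficients, the two temperatures carry positive ones.
    """
    return (-0.010 * rain_prev
            + 0.199 * tmax
            + 0.008 * tmin
            - 0.044 * rh_prev
            + 0.313)


# Hypothetical day: no rain and 50% RH on the previous day, 35/22 degrees C.
estimate = mlr_evaporation(rain_prev=0.0, rh_prev=50.0, tmax=35.0, tmin=22.0)
```

For this hypothetical day the equation gives about 5.25 mm/day. Raising the previous day's RH or rainfall lowers the estimate, consistent with the negative relationships noted in the text.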

Performance evaluation of ANN and MLR models
On the basis of the performance of the ANN models during the calibration and validation periods, the EVAP9 model with structure ANN (4-9-1), with sigmoid as the activation function [...] respectively, as given in Table 2, was selected from all the structures.
Although the ANN structure 4-7-1, with seven neurons in the hidden layer, gave a better result, the differences between the results of these two structures were negligible. In addition, when the model was iterated further with more than 10 neurons in the hidden layer, its performance fluctuated (decreasing and then increasing), which might have led to overfitting. The performance of the [...] Table 3.

Sensitivity analysis
To check the effect of each input parameter on pan evaporation, a sensitivity analysis was performed using the selected ANN (4-9-1) model. The effectiveness of each parameter was judged by omitting the parameters one at a time from the selected four-input model, as depicted in Table 4. These models were explored in the ANN with the sigmoid activation function, Levenberg–Marquardt as the learning algorithm and nine neurons in the hidden layer.
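The reduced input sets used in such a leave-one-out sensitivity analysis can be enumerated as in the sketch below. The variable labels and the function name are hypothetical, and training and scoring each reduced model is outside the scope of the sketch.

```python
INPUTS = ("P_t-1", "RH_t-1", "T_max", "T_min")


def leave_one_out_sets(names):
    """Map each omitted input to the tuple of inputs that remain."""
    return {omitted: tuple(n for n in names if n != omitted)
            for omitted in names}


reduced = leave_one_out_sets(INPUTS)
for omitted, remaining in reduced.items():
    print(f"without {omitted}: train on {', '.join(remaining)}")
```

Each reduced set would be trained with the same settings, and the input whose removal degrades r, CE and RMSE the most is taken as the most effective parameter.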
The model without $RH_{t-1}$ had very low values of r (0.78) and CE (0.723) during training. Similar values of r and CE were observed during the testing period. The higher value of [...]

CONCLUSIONS
The estimation of evaporation at the regional level with limited meteorological data is cumbersome because of the complex nature of the evaporation process. It can be performed readily using an ANN with the available measured meteorological parameters. Evaporation is more highly correlated with current day minimum and maximum temperatures than with one day-lagged rainfall and RH.