Estimation of actual evapotranspiration in barley crop through a generalized linear model

Evapotranspiration is a key variable of the water cycle. Its calculation requires several ground data that frequently are not available. This study contains a detailed method and measurements of meteorological and energy balance variables that can be used to estimate the daily actual evapotranspiration (ETa). A linear generalized model is obtained to calculate the ETa from common variables measured in meteorological stations. The method showed a good performance over a barley crop of easthern Argentine Pampas and can be applied and tested in other great plains. Measurements of soil-plant-atmosphere are included The routines to reproduce the method are included The generalized method allows the calculation of daily ETa over crops and was tested over barley crops


Introduction
Evapotranspiration represents about 80% of the water transferred from soil-plant into the atmosphere over great plains [1] . The calculation of actual evapotranspiration (ETa) usually requires measurements of several variables of the atmosphere (e.g. air temperature and humidity, solar radiation) and surface (e.g. temperature, soil moisture), with nonlinear and complex interactions. The application of artificial intelligence has contributed to the estimation of the evapotranspiration, considering the fluctuations of the local climate [2] . However, given that these techniques usually limit the understanding of the relationships between the involved variables, linear generalized models (GLM) could be a possible solution. In GLM, the dependent variable is linearly related to the factors and co-variables through a link function. This study describes the method and data used to build a GLM to monitor ETa over rain-fed barley crops.
The study was carried out during the development of a barley crop (from August to December 2019 and 2020) in eastern Argentine Pampas (La Alcira station: 37,49 °S, 58,90 °W, 186 m.a.s.l.). The climate is temperate and subhumid. The soil type is Typic Argiudoll. Barley was sown by direct seeding with 0.17m between furrows. Ground measurements were collected from an energy balance station connected to CR10 0 0 and CR30 0 data loggers (Campbell Scientific Inc.), monitoring the variables detailed in Table 1 .
A GLM was obtained considering the most important variables for ETa estimation. The mathematical expression of the GLM is: where Y i is the i th observation of the dependent variable (ETa), x ij is i th observation of the j th independent variable ( j = 1, 2, ..., n ), β j represents parameters to be estimated, β 0 is the intercept, and link is the link function.
On the other hand, ETa was calculated using the water balance method (WB) considering soil moisture and crop physiological characteristics. The daily WB used to calculate soil water storage where S f and S i are the final and initial soil water storage in the root zone, Ex f and Ex i are the last and initial water excess accumulated in the soil, P is the rainfall, RO is the surface runoff, and DP is the deep percolation. Subsurface horizontal water movements have been dismissed due to their values are lower than the vertical ones in plain environments. The units of Eq. 2 are in mm d −1 . ETa was calculated as the residual term of Eq. 2 and was later used as the dependent variable of our generalized linear model.
The difference S f -S i was estimated using the SOILVue10 sensor, which measures the volumetric soil moisture (SM). Water storage is directly proportional to SM [m 3 m −3 ] times the depth covered by the sensor. We considered daily averages of S at a depth of 0.05-0.3 m. In the analysis period, no significant water excesses or losses due to deep percolation were recorded to calculate ETa, which can be corroborated by the water table levels in areas surrounding the experimental plot (see sheet "Water_table," Data_GLM.xlsx, https://data.mendeley.com/datasets/77r6w44xbp/draft? a=5d9be36a-ae9a-4c9f-b526-121d1d37b6eb). On the other hand, no surface runoff was recorded during the visual control of the plot.
A supervised method of learning was chosen to train the model. The flowchart of ET GLM model is shown in Fig. 1 . It shows the phases of the framework: i) Data collection, ii) Data analysis and data preprocessing, and iii) grid search of best hyperparameters. In step i) the response variable (ETa) and the explanatory variables were collected, where SM is the volumetric soil moisture [m 3 /m 3 ], NDVI is the normalized difference vegetation index, Tr is the crop surface temperature measured at 45 °[ °C], Rin is the incoming (305 to 2800 nm) solar radiation at surface [MJ/m 2 /d], T [ °C] and RH are the air temperature and relative humidity (%), respectively. These variables are in a daily temporal resolution.
In stage ii) the final dataset was created. Previously to build the model, an exploratory analysis of data was carried out. The variables with linear dependence (-0.7 > r of Spearman > 0.7) were discarded and the most explanatory ones (p < 0.01) were selected. Statistical assumptions about the heterogeneity of variance structure and the distribution of statistical residuals were tested [3] . Finally, in step iii) a set of optimal hyperparameters for a learning algorithm was chosen in order to avoid the overfitting. For this, the K-fold cross-validation (K-CV) method was selected. K-CV is a resampling procedure used to evaluate machine learning models on a limited data sample. This technique aims to partition the training and test data about K times. K-CV is simple to understand and because it generally results in a less optimistic estimate of the model skill than other methods, such as a simple train-test split. The Python language and the Pandas [4] , NumPy [5] and Scikit-Learn [6] libraries were used to develop the procedure. The routines are included in the file named MethosdX_Notebook: https://data.mendeley.com/datasets/77r6w44xbp/draft?a=5d9be36a-ae9a-4c9f-b526-121d1d37b6eb Thus, with a total of 189 measurements, the model (mod) was obtained: Rin, NDVI, T, RH and Tr were measured at 2 m height while SM was measured at 20 cm depth. It should be noted that the algorithm randomly uses the data to adjust and validate the model. The used data are included in the link https://data.mendeley.com/datasets/77r6w44xbp/draft? a=5d9be36a-ae9a-4c9f-b526-121d1d37b6ebXX On the other hand, ETa measurements described a Poisson-Gamma distribution ( Fig. 2 a). This distribution has a complex probability density function. Thus, the distribution was adjusted to a Tweedie distribution [7] . The algorithm sklearn.linear_model.TweedieRegressor [6] allowed us to find the optimal link function to adjust the coefficients to the ETa distribution ( Eq. 4 ). About the validation, the comparison between GLM results and the results obtained with the WB is included in Fig. 2 b. About the performance of the model, the root mean square error (RMSE) was 0.577 mm/d, the Mean Absolute Error (MAE) was 0.473 mm/d, and the Determination Coefficient (R2) was 0.81.
The proposed GLM is suitable to follow a crop surface (from partially vegetated to full cover) with meteorological data measured near the canopy (2m height) and soil moisture measured in the root zone. It can be applied in other areas with data commonly obtained in typical energy balance or meteorological stations. The model showed the forcing variables of the soil-plant-atmosphere system that influence evapotranspiration.
Given the frequent limitation of ground data, the method can be used with satellite data, reanalysis or a combination of them. For example, the variables involved in Eq. 3 are available in the NASA-POWER dataset [8] , which have a good correlation with field data [9 , 10] . Likewise, NDVI data are available in several missions (e.g. MODIS products MOD13C2, MOD13A1 version 6 [1 , 11] ). It should be noted that a calibration model is needed to be applied in other areas with different biophysical characteristics to obtain good results, considering the variables involved in the method for those areas.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.