Advances in Water Resources Potential of satellite and reanalysis evaporation datasets for hydrological modelling under various model calibration strategies

Twelve actual evaporation datasets are evaluated for their ability to improve the performance of the fully distributed mesoscale Hydrologic Model (mHM). The datasets consist of satellite-based diagnostic models (MOD16A2, SSEBop, ALEXI, CMRSET, SEBS), satellite-based prognostic models (GLEAM v3.2a, GLEAM v3.3a, GLEAM v3.2b, GLEAM v3.3b), and reanalysis (ERA5, MERRA-2, JRA-55). Four distinct multivariate calibration strategies (basin-average, pixel-wise, spatial bias-accounting and spatial bias-insensitive) using actual evapora- tion and streamﬂow are implemented, resulting in 48 scenarios whose results are compared with a benchmark model calibrated solely with streamﬂow data. A process-diagnostic approach is adopted to evaluate the model responses with in-situ data of streamﬂow and independent remotely sensed data of soil moisture from ESA-CCI and terrestrial water storage from GRACE. The method is implemented in the Volta River basin, which is a data scarce region in West Africa, for the period from 2003 to 2012. Results show that the evaporation datasets have a good potential for improving model calibration, but this is dependent on the calibration strategy. All the multivariate calibration strategies outperform the streamﬂow-only calibration. The highest improvement in the overall model performance is obtained with the spatial bias- accounting strategy ( + 29%), followed by the spatial bias-insensitive strategy ( + 26%) and the pixel-wise strategy ( + 24%), while the basin-average strategy ( + 20%) gives the lowest improvement. On average, using evaporation data in addition to streamﬂow for model calibration decreases the model performance for streamﬂow (-7%), which is counterbalance by the increase in the performance of the terrestrial water storage ( + 11%), temporal dynamics of soil moisture ( + 6%) and spatial patterns of soil moisture ( + 89%). In general, the top three best performing evaporation datasets are MERRA-2, GLEAM v3.3a and SSEBop, while the bottom three datasets are MOD16A2, SEBS and ERA5. However, performances of the evaporation products diverge according to model responses and across climatic zones. These ﬁndings open up avenues for improving process representation of hydrological models and advancing the spatiotemporal prediction of ﬂoods and droughts under climate and land use changes.


Introduction
Assessing the spatiotemporal variability of hydrological processes is the crux of effective water resource management. Global warming is expected to intensify (i.e., accelerate) the hydrological cycle, thus increasing or decreasing evaporation depending on places ( Donat et al., 2016 ;Famiglietti and Rodell, 2013 ;Huntington, 2006 ). Minimum air temperature tial in (semi-)arid catchments, where evaporation represents the dominant outgoing water flux. In such environments, a classical hydrological model calibration on streamflow alone could lead to important model biases. More generally, for catchments with strong anthropogenic influences on the water cycle (e.g., irrigation schemes, dams, etc.) or in data scarce regions, the inclusion of evaporation data in model calibration has the potential to better constrain a hydrological model rather than streamflow alone ( Becker et al., 2019 ;Jiang and Wang, 2019 ). Accordingly, calibrating hydrological models solely based on streamflow is not sufficient to guarantee an accurate representation of the hydrological system because streamflow is the result of several inter-linked processes, thereby it masks spatial heterogeneity ( Tobin and Bennett, 2017 ;Wambura et al., 2018 ). This limitation in model implementation has been overcome by the advent of multivariate calibration techniques using satellite-based datasets, which offers models a chance for better spatial heterogeneity ( Efstratiadis and Koutsoyiannis, 2010 ;Rakovec et al., 2016 ).

List of symbols
In the quest to improve process representation in hydrological models, large scale distributed hydrological modelling faces the challenging requirement of spatially explicit observational datasets for model setup and performance evaluation Fatichi et al., 2016 ;Hrachowitz and Clark, 2017 ). Correspondingly, Satellite Remote Sensing (SRS) and reanalysis datasets of various hydrological processes have been used as input data Maggioni and Massari, 2018 ) or as calibration and evaluation data for hydrological models ( Koppa and Gebremichael, 2020 ;McCabe et al., 2017 ). Evaporation estimates from SRS are increasingly used in multivariate calibration of hydrological models. Because it is a key indicator of surface water availability, evaporation is an essential source of information for better constraining the spatiotemporal representation of processes in hydrological models Cui et al., 2019 ;Talsma et al., 2018 ). The increasing availability and diversity of gridded evaporation datasets has triggered many evaluation and comparison studies ( Long et al., 2014 ;Vinukollu et al., 2011b ), which highlight significant differences between the datasets and thereby indicate underlying uncertainty in the evaporation estimates López et al., 2017 ). The uncertainty stems from the strong variability of bio-geophysical variables that drive evaporation (e.g., albedo, net radiation, surface roughness and temperature) and the diversity of the model structures, model parametrizations and input datasets used to estimate evaporation ( Badgley et al., 2015 ;Wang and Dickinson, 2012 ;Zhang et al., 2020 ). Therefore, the choice and use of SRS evaporation data in hydrological modelling should be done cautiously, particularly in catchments with strong anthropogenic influences ( Senkondo et al., 2019 ;Yang et al., 2016 ). Generally, the following four approaches are adopted to evaluate gridded evaporation products: (i) analysis of the variance between several products (e.g., Jimenez et al., 2011 ;Khan et al., 2018 ;Mueller et al., 2011 ;Senkondo et al., 2019 ;Trambauer et al., 2014 ); (ii) point-to-pixel comparison with ground-based measurements (e.g., Chen et al., 2014 ;McCabe et al., 2015 ;Michel et al., 2016 ;Ramoelo et al., 2014 ;Velpuri et al., 2013 ); (iii) hydrological consistency by water balance calculation (e.g., Liu et al., 2016 ;McCabe et al., 2008 ;Miralles et al., 2016 ;Wang et al., 2018 ;Weerasinghe et al., 2020 ); and (iv) assessing the ability of evaporation datasets in improving the parameter estimation of hydrological models (e.g., Demirel et al., 2018 ;Immerzeel and Droogers, 2008 ;Jiang et al., 2020 ;Pomeon et al., 2018 ;Winsemius et al., 2008 ).
Assessing the uncertainty of evaporation estimates at large-scale is challenging due to the limited availability of ground-based measurements ( Bhattarai et al., 2019 ;Ceperley et al., 2017 ). The uncertainty in evaporation varies in space and according to climate regions ( Blatchford et al., 2020 ;Vinukollu et al., 2011a ). In evaluating the contribution of gridded evaporation datasets to hydrological model calibration, some limitations can be denoted in previous studies. Most previous studies only use or compare few evaporation datasets (e.g., Kunnath-Poovakka et al., 2016 ;Vervoort et al., 2014 ), and rarely (if any) investigate the use of reanalysis datasets (i.e., retrospective analysis; cf. Bosilovich et al., 2008 ), which are an important source of spatial evaporation estimates ( Feng et al., 2019 ). Usually, a lumped or semi-distributed model is used (e.g., Odusanya et al., 2019 ;Rientjes et al., 2013 ), which does not harness the full potential of the gridded evaporation datasets that is their spatial patterns ( Armstrong et al., 2019 ;Stisen et al., 2018 ). Most studies do not test different model calibration strategies, with some exceptions that use a semi-distributed model (e.g., Herman et al., 2018 ;Rajib et al., 2018 ). Finally, a few studies use a bias-insensitive metric to focus only on the spatial patterns of gridded evaporation products ( Dembélé et al., 2020a ;Koch et al., 2018 ).
This study aims to fill current knowledge gaps by evaluating the utility of nine satellite-based (prognostic and diagnostic) and three reanalysis evaporation datasets in improving the performance of a distributed hydrological model using four distinct calibration strategies. This study does not intend to quantify the intrinsic accuracy of the evaporation products nor determine whether a product is better than the others in terms of absolute values. Rather it strives to evaluate their ability to improve the simulations of a distributed hydrological model when used as a calibration variable. Besides the high number and diversity of gridded evaporation datasets evaluated, the novelty of this study is the implementation of four distinct model calibration strategies with a fully distributed hydrological model, the evaluation of the model responses with multiple variables (i.e., streamflow, soil moisture and terrestrial water storage) to test evaporation error propagation on other hydrological processes, and the application of the experiment in a large basin spread across four eco-climatic zones with considerable anthropogenic influence. This study strives to answer two inter-related research questions. Firstly, what is the ability of satellite and reanalysis evaporation datasets to improve the overall predictive skill of a fully distributed hydrological model? Secondly, how important is the model calibration strategy in improving the representation of hydrological processes? The proposed research is carried out in the Volta River basin located in West Africa, using the mesoscale Hydrologic Model (mHM) over a period of ten consecutive years (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012).
The VRB extends over four eco-climatic zones (i.e., Sahelian, Sudano-Sahelian, Sudanian and Guinean) characterized by increasing vegetation density and receiving increasing precipitation from north to south ( Fig. 1 a; Table 1 ). The information provided in Table 1 is obtained from the global aridity index database ( Trabucco and Zomer, 2018 ), and the WFDEI meteorological data (see Table 2 ; Weedon et al., 2014 ) for the period 1979-2016. Fig. 1  The climate is driven by the Inter-Tropical Convergence Zone (ITCZ) and varies from sub-humid in the south to semi-arid in the north ( Dembélé et al., 2019 ;FAO/GIEWS, 1998 ;Mul et al., 2015 ). Actual evaporation varies between 70% and 90% of annual rainfall in the basin.

Distributed hydrological model
The fully distributed mesoscale Hydrologic Model (mHM) is a conceptual model that simulates dominant hydrological processes (e.g., evaporation, soil moisture and discharge) per grid cell in the modelling domain ( Kumar et al., 2013 ;Samaniego et al., 2010 ). Samaniego et al. (2011) provide a schematic representation of the processes accounted for in mHM. A multiscale routing model based on the Muskingum-Cunge method ( Cunge, 1969 ) is used for the routing of the total grid-generated runoff through the river network ( Thober et al., 2019 ). The sub-grid variability of the basin physical characteristics (e.g., soil texture, land cover) is accounted for with a multiscale parameter regionalization technique . The model parameters (e.g., hydraulic conductivity, soil porosity) are linked to the basin physical characteristics via pedo-transfer functions and global parameters. Thirty-six global parameters (cf. Pokhrel et al., 2008 ) are tuned during model calibration for this study (Table S6). mHM was found reliable for modelling hydrological processes in the VRB ( Dembélé et al., 2020a ). The version 5.9 of mHM is used in this study.
Reference evaporation ( E ref ) is calculated following the method of Hargreaves and Samani (1985) , which was found to be reliable for semiarid regions like the VRB ( Bai et al., 2016 ;Er-Raki et al., 2010 ;Gao et al., 2017 ). E ref is formulated as follows: where R a (MJ/m 2 /day) is the extraterrestrial radiation computed based on the latitude of the location and the day of the year ( Allen et al., 1998 ),  See Table 3 = 2.45 MJ/kg is the latent heat of vaporization of water. The unit of radiation is converted into equivalent water evaporation in mm/day with the ratio R a / . The differences in advection or vapor transfer effect are compensated by the constant = 0.0023, and T avg , T max and T min represent the daily average, maximum and minimum air temperature in degrees Celcius (°C) at a given location.
Potential evaporation ( E p ) is calculated by adjusting E ref with a dynamical scaling function ( F DS ) based on leaf area index ( Allen et al., 1998 ;Demirel et al., 2018 ), therefore accounting for vegetation-climate interactions Birhanu et al., 2019 ;Jiao et al., 2017 ). E p is formulated as follows: where a is the intercept term, b is the vegetation dependent component, and c represents the degree of nonlinearity of the leaf area index ( I LA ). The coefficients a, b , and c are determined through model calibration.
In this study, actual evaporation ( E a ) is defined as the sum of transpiration and evaporation from interception, land and water bodies ( Coenders-Gerrits et al., 2020 ;Shuttleworth, 1993 ). E a is calculated as a fraction of E p from soil layers depending on soil moisture availability and the rooting depth ( Feddes et al., 1976 ). Soil moisture is estimated by a multi-layer infiltration capacity approach adopting a three-layer soil scheme (0-5, 5-30 and 30-100 cm depths). Terrestrial water storage at each grid cell is the sum of the surface and subsurface water storage (i.e., lakes, wetlands, soil moisture reservoirs, interflow and baseflow).
More information on the calculation of hydrological processes in mHM can be found in the work of Kumar (2010) and Samaniego et al. (2010) .

Input datasets
The morphological datasets (i.e., elevation, slope, land cover, etc.) and meteorological datasets (i.e., rainfall and air temperature) used to set up and run the distributed model are described in Table 2 . The meteorological datasets are selected based on their accuracy ( Dembélé and Zwart, 2016 ), and their suitability to plausibly represent hydrological processes in the VRB ( Dembélé et al., 2020b ). All morphological data are resampled to a resolution of 1/512°(~200 m at the equator) using the nearest neighbor technique, while the meteorological data are resampled to 0.0625°(~7 km) using bilinear resampling. Due to the good ability of mHM in parameter transferability across scales ( Dembélé et al., 2020a ), the model is run at daily time step with a spatial discretization of 0.25°(~28 km) corresponding to 619 modelling grid cells in the basin, which attenuate the computational demand. In-situ streamflow data and gridded datasets of evaporation are used for model calibration ( Section 2.4 ). The description of the evaporation datasets is provided in Section 2.3 .
The quality control and the gap-filling of the streamflow data are described by Dembélé et al. (2019) . In addition to in-situ streamflow, SRS datasets of soil moisture and terrestrial water storage are used to evaluate the model performance after calibration. The surface soil moisture ( S u ) dataset is obtained from ESA CCI  and represents the first soil layer (i.e., 2-5 cm depth). The blended product of both passive and active microwave products is used here ( Gruber et al., 2017 ;Liu et al., 2012 ;Wagner et al., 2012 ). The terrestrial water storage ( S t ) anomaly data (release RL05) is obtained from GRACE ( Landerer and Swenson, 2012 ;Swenson, 2012 ). The ensemble arithmetic mean of different solutions from three processing centers (i.e., Center for Space Research at University of Texas, Geoforschungs Zentrum Potsdam and Jet Propulsion Laboratory) is used in this study because it has been shown to be more effective in reducing noise in the Earth's gravity signal than the individual solutions ( Sakumura et al., 2014 ).
Four versions of the GLEAM product are evaluated. They differ in terms of input data used for their production and in terms of their spatiotemporal coverage (cf. Table 1 in Martens et al., 2017 ). The version v3.3 differs from the v3.2 in the following forcing datasets: surface radiation, near-surface air temperature and land cover maps. The versions v3.3a and v3.2a are produced with reanalysis, satellite and gauge-based datasets, while the versions v3.3b and v3.2b are mainly produced with satellite datasets.
Considerable differences can be observed both in the temporal dynamics and the spatial patterns of the 12 gridded evaporation datasets across the climatic zones in the VRB ( Figs. 2 and 3 ).

Model calibration and evaluation
The modelling period extends from 2000 to 2012 with 3 years of model warm-up (2000-2002), 6 years for calibration (2003)(2004)(2005)(2006)(2007)(2008), and 4 years for evaluation (2009)(2010)(2011)(2012). Available daily in-situ streamflow datasets from 11 gauging locations are used for model calibration and evaluation, while monthly datasets of E a ( Table 3 ) are used for model calibration, and monthly datasets of S u (ESA CCI) and S t (GRACE) are used for model evaluation. All the E a datasets are rescaled to 0.25°using bilinear interpolation to match the modelling spatial resolution, and sub-monthly data are aggregated to monthly resolution. Only the first soil layer of mHM is compared to the ESA CCI data, which represents the surface soil moisture.
First, a streamflow-only calibration is adopted as benchmark. Then, the contribution of evaporation datasets in improving hydrological model calibration is tested by simultaneously constraining the model with streamflow and each of the twelve gridded evaporation datasets using four calibration strategies for each (Fig. 4) . Therefore, 48 scenarios (i.e., 12 datasets times 4 calibration strategies) are developed and compared to the benchmark calibration to evaluate the impact of different calibration strategies on model performance. The dynamically dimensioned search algorithm ( Tolson and Shoemaker, 2007 ) is used for parameter estimation, using 5,000 iterations for each of the 48 scenarios and for the benchmark model calibration. The computational runtime is about 6 days for each of the 49 model simulations on a computer Intel Xeon Processor E5-2697 v3 with 64 GB of RAM.

Calibration on streamflow data -benchmark
The benchmark calibration (case Q) is elaborated by calibrating the hydrological model solely with streamflow ( Q ) data. The objective func- Table 3 Gridded actual evaporation datasets. The characteristics of the datasets are those used in this study although the same datasets can be available from the data providers with different versions and spatiotemporal resolutions.   tion for case Q ( Φ Q ) to be minimized is obtained by calculating the average Kling-Gupta efficiency ( E KG ) over the 11 gauging points for Q in the basin, and subtracting it from 1. Φ Q ranges from its optimal value that is 0 to positive infinity, and is formulated as follows: where g is the total number of streamflow gauging stations in the basin, E KG is the modified Kling-Gupta efficiency ( Kling et al., 2012 ) calculated for the observed ( Q obs,i ) and modelled ( Q mod,i ) streamflow of the i th gauging point. E KG is composed of the Pearson correlation coefficient ( r ), the bias term ( , i.e., the ratio of the means) and the variability term ( , i.e., the ratio of the coefficients of variation, cf. Eq. 8 ). E KG ranges from negative infinity to its optimal value that is 1. A model is better than the mean observed flow if E KG > -0.41 ( Knoben et al., 2019 ).  Dembélé et al. (2020a) is used to quantify the degree of reproduction of the spatial patterns of hydrological processes. The proposed spatial pattern efficiency ( E SP ) metric only considers the spatial pattern of the underlying variables and ignores their absolute values. E SP is an integrated measure of the dynamics, the spatial variability, and the locational matching of grid cells between the modelled ( X mod ) and observed ( X obs ) variables. With X obs and X mod composed of n cells, E SP is defined as follows: = mod ∕ mod obs ∕ obs and (8) where r s is the Spearman rank-order correlation coefficient with d i the difference between the ranks of the i th cell of X mod and X obs , is the variability ratio (i.e., the ratio of the coefficients of variation), and the spatial location matching term calculated as the root mean squared error ( E RMS ) of the standardized values (z-scores, Z X ) of X mod and X obs ( Dembélé et al., 2020a ). E SP ranges from negative infinity to its optimal value that is unity. E SP = 0 when there is a moderate relationship between the ranks of the observed and modelled variables (i.e., r s = 0.55), and E SP = -0.67 when the ranks are not related (i.e., r s = 0). More details on E SP are provided by Dembélé et al. (2020a) .

Multivariate calibration strategies.
Four multivariate calibration strategies with distinct objective functions are proposed to simultaneously consider Q and E a data as calibration variables. Each objective function ( Eqs. 11 ,15 ,17 and 19 ) is formulated based on the Euclidean distance approach ( Eq. 10 ), in which all elements are equally weighted ( Khu and Madsen, 2005 ). The Euclidian distance ( D E ) between two points X and Y of coordinates ( x 1 , x 2 ,…, x n ) and ( y 1 , y 2 ,…, y n ) in an n -dimensional space ( Upton and Cook, 2014 ) is given by: The four multivariate calibration strategies differ from each other based on the formulation of the sub-objective function for E a (i.e., Φ E a ), while Φ Q remains unchanged. The observed variable ( E a,obs ) and the modelled variable ( E a,mod ) of actual evaporation are represented each by a 3D array of dimension [ M x N x T ], with M the number of rows, N the number of columns, and T the number of time steps ( Fig. 5 ). For this study, M represents 40 latitude rows, N represents 36 longitude columns and T represents 72 months (i.e. calibration period). The modelling domain has 1440 grid cells of which 619 are active (i.e. grid cells representing the basin area). The inactive grid cells are masked out during the calculation of the performance metrics.
In the following, a is a 2D array of actual evaporation represented by all cells ( i,j ) in the spatial domain Ω, and a is the time series of actual evaporation for a given cell at row i and column j . The four multivariate calibration strategies are defined as follows: 1-Temporal basin average (BA) : the matching of the observed and modelled E a is done on basin-averaged time series. The sub-objective function ( Φ E a _ BA ) to be minimized is obtained by calculating the E KG for the observed and modelled time series of basin average actual evaporation ( a , obs and a , mod ) and subtracting it from 1. The objective function ( Φ BA ) is formulated as follows: 2-Temporal pixel-wise (PW) : the matching of the modelled and observed E a is done individually on the time series of each grid cell in the basin. The E KG is calculated for the observed and modelled time series of E a at each grid cell in the basin, and the sub-objective function ( Φ E a _ PW ) to be minimized is the average of the E KG calculated for all grids, subtracted from 1. The objective function ( Φ PW ) is formulated as follows: 3-Spatial bias-accounting (SB) : the matching of the modelled and observed E a is done for all pixels at each time step. The E KG is calculated at each time step between all the pixels of the observed and modelled E a . The sub-objective function ( Φ E a _ SB ) to be minimized is the average of the E KG calculated for all time steps, subtracted from 1. The objective function ( Φ SB ) is formulated as follows:

4-Spatial bias-insensitive (SP)
: the sub-objective function ( Φ E a _ SP ) to be minimized is similarly calculated as for the SB calibration except that a bias-insensitive metric (i.e., E SP ) is used as skill score instead of E KG . The objective function ( Φ SP ) is formulated as follows: All the objective functions ( Φ BA , Φ PW , Φ SB and Φ SP ) vary between their optimal value that is 0 and positive infinity. Except case SP that is a bias-insensitive approach, other calibration strategies consider the absolute values (i.e. raw data) of evaporation.

Model output evaluation
In addition to daily streamflow data, independent monthly datasets of satellite-based soil moisture ( S u ) and terrestrial water storage ( S t ) ( Table 2 ) are used for model evaluation. The temporal dynamics of streamflow is evaluated with E KG . Both the temporal dynamics and the spatial patterns of modelled S u are evaluated using the Pearson correlation coefficient ( r ) and the spatial pattern efficiency metric ( E SP ), while only the temporal dynamics of modelled S t is assessed using r , due to the coarse spatial resolution of the GRACE data. The skill scores for the temporal dynamics are calculated for each pixel (or gauging point) in the basin, while spatial skill scores are calculated per time step.

Results
In general, the trend of the model performance (high vs. low scores) among the scenarios (i.e., evaporation datasets vs. calibration strategies) is conserved between the calibration and the evaluation periods for all variables. Therefore, the following results are presented for the entire simulation period that comprises the calibration and evaluation periods whose results are additionally provided in the supplementary materials.

Model performance for multiple hydrological processes in the VRB
The model performance for various hydrological processes in the VRB reveals the potential of SRS and reanalysis evaporation datasets to improve the model responses if the appropriate calibration strategy is used ( Fig. 6 ). Detailed results are provided in the supplementary material ( Figures S1-S2).
For Q , the benchmark model (i.e., Q-only) yields a median E KG of 0.69. In the multivariate calibration scenarios (i.e., Q + E a ), the E KG of Q varies between 0.42 for SEBS with case PW to 0.73 for CMRSET with case SP. The best multivariate calibration strategy is the case SB with an average E KG of 0.68 and 75% of the evaporation datasets producing a higher model performance than the benchmark, followed by case SP ( E KG = 0.67), case PW ( E KG = 0.63) and case BA ( E KG = 0.60). The top 3 best evaporation datasets for the average E KG of Q over the calibration strategies are GLEAM v3.2b ( E KG = 0.71), GLEAM v3.3b ( E KG = 0.71) and GLEAM v3.3a ( E KG = 0.70), while the worst are ALEXI ( E KG = 0.61), SEBS ( E KG = 0.56) and ERA5 ( E KG = 0.51). The decrease in the model performance for Q in the multivariate calibration might be an artifact caused by equifinality (i.e., non-uniqueness of model parameters; Beven, 2006 ;Savenije, 2001 ) that occurred with the Q-only calibration, which gives more degrees of freedom for constraining the model parameter space Dembélé et al. (2020a) . In fact, Figures S68-S71 show that the high performance for Q achieved with the Q-only calibration is obtained at the expense of poor performance for other hydrological processes, while the multivariate calibrations with Q + E a result in parameter sets that provide equivalent model performance for Q but higher model performance for S t and S u . These results support the findings of Dembélé et al. (2020a) , thereby confirming the pitfalls of the Q-only calibration ( Bouaziz et al., 2020 ;Stisen et al., 2018 ).
The evaporation datasets show a high potential to improve the temporal dynamics of modelled S t as 79% of the multivariate calibration scenarios outperform the case Q ( r = 0.73) with an average r of 0.81. The  Fig. 6. Model performance in the entire simulation period (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012) for the temporal dynamics (a, b, c) of streamflow ( Q ), terrestrial water storage ( S t ) and soil moisture ( S u ), and the spatial patterns (d) of S u in the Volta River basin. The x-axis gives the objective functions for different model calibration strategies. The y-axis indicates the variables used for the model calibration. Q-only is the benchmark calibration. Circle color represents the median model performance obtained with 11 gauging points for Q , with 52 pixels for S t , with 619 pixels for S u (c) or with 120 months for S u (d). The color bars show the skill score (i.e., E KG , r, E SP ). Circle size represents model performance variability in terms of relative Interquartile Range (IQR) computed as (IQR -IQR min )/(IQR max -IQR min ). The IQR range for the 49 scenarios is given in the subplot titles. The best model is the bluest and smallest circle, while the worst model is the reddest and largest circle. lowest average r of S t among the calibration strategies is given by case SP ( r = 0.79), but interestingly, it outperforms the case Q. The highest performances for median r of S t are obtained with MERRA-2 ( r = 0.87), SSEBop ( r = 0.87) and ALEXI ( r = 0.84) while lowest performances are given by SEBS ( r = 0.77), MOD16A2 ( r = 0.76) and ERA5 ( r = 0.74).
The temporal dynamics of modelled S u show a higher model performance than that of S t , with an average r of 0.91 and 98% of the multivariate calibration scenarios that outperform the case Q ( r = 0.86). The case BA performs similarly to case PW and case SP with an average r of 0.91 across evaporation datasets, and outperformed by case SB ( r = 0.93). The highest r of S u is 0.93, and it is obtained with GLEAM v3.3a, MERRA-2 and GLEAM v3.2a, while the weakest scores are obtained with MOD16A2 ( r = 0.91), SEBS ( r = 0.89) and ERA5 ( r = 0.89).
The representation of the spatial patterns of S u improves for all the multivariate scenarios as compared to the case Q. However, case SB ( E SP = 0.0) has the highest average performance across the evaporation datasets, while case BA ( E SP = -0.04) has the lowest performance. The best evaporation datasets for the simulation of the spatial patterns of S u considering all the calibration strategies are SSEBop ( E SP = 0.02), MERRA-2 ( E SP = 0.01) and GLEAM v3.3a ( E SP = 0.01), while the worst are MOD16A2 ( E SP = -0.04), SEBS ( E SP = -0.07) and ERA5 ( E SP = -0.09).
In general, it is observed that the model performances for S t and S u improve for most of the multivariate calibration scenarios. Among the multivariate calibration strategies, case SB gives the best results considering the average model performance for all variables ( Q, S t and S u ). In Fig. 7 b, the highest average relative change in model performance by multivariate calibration strategies as compared to the Q-only calibration is obtained with case SB is ( + 29%), followed by case SP ( + 26%), case PW ( + 24%) and case BA ( + 20%). Consequently, all grid-based model calibration strategies outperform the basin-average calibration (case BA), which gives the lowest average model performance. These results highlight the value of calibrating hydrological models on the full extent of gridded evaporation datasets. It is noted that, in most of the scenarios, calibrating the model only on the spatial patterns (case SP) of the evaporation datasets, thereby ignoring their absolute values, improves the predictive skill of the model with a higher performance than that of case Q and case BA ( Fig. 6 ). With these findings, case SP (i.e., only spatial patterns of evaporation datasets) can be preferred to case SB (i.e., absolute values of evaporation datasets) for the calibration of hydrological models because the propagation of the errors of the evaporation estimates into the modelling process can be case specific and depends on the model structure. This observation also draw attention on the adequacy of the model for a given experiment ( Addor and Melsen, 2019 ). Moreover, for a given evaporation dataset, the spatial variation in biases in the estimates can lead to contrasting performance across regions ( Jung et al., 2019 ;Nicholson, 2000 ).
The spatial patterns of ALEXI, followed by those of MOD16A2 and CMRSET, were the least informative for the model calibration in case SP ( Fig. 6 ). Considering all hydrological processes and model calibration strategies, the top three best performing evaporation datasets are MERRA-2, GLEAM v3.3a and SSEBop, while the bottom three datasets are MOD16A2, SEBS and ERA5 ( Figs. 6 and 7 a). However, it is noteworthy that they outperform the Q-only calibration when used in case SP, meaning that only their spatial patterns improve the model performance. In general, the versions 3.3 of GLEAM show a slightly higher model performance than the versions 3.2.

Impact of calibration strategies on spatial patterns
Different model calibration strategies result in different spatial patterns of modelled E a and S u , as shown for selected scenarios in Fig. 8 . Additional maps of E a , S u and S t for other scenarios are provided in the supplementary material ( Figures S58-59, Figures S65-66 and Figures  S38-39), along with those obtained with the Q-only calibration ( Figure  Fig. 7. Relative change in the hydrological model performance as compared to the Q-only calibration strategy when (a) adopting a multivariate calibration strategy ( Q + E a ) with twelve evaporation datasets, or (b) using four different calibration strategies. The values on the line from each vertex to the center of the polygons give the relative difference in model performance as compared to the Q-only calibration. The change in model performance is given in percentage of the performance metrics (i.e., E KG , r, E SP ) for streamflow ( Q ), terrestrial water storage ( S t ), and soil moisture ( S u ). For instance in (a), using SSEbop in multivariate calibration increases E SP of S u by 110%, while in (b), the calibration strategy case BA decreases E KG of Q by -13%, as compared to the Q-only calibration.  -term (2003-2012) average of annual (a) actual evaporation ( E a ) and (b) soil moisture ( S u ) obtained by calibrating the mHM model with two evaporation datasets (y-axis, blue font) and different calibration strategies (x-axis, red font). A min-max normalization of the values allows rescaling them between 0 and 1 to emphasize the spatial patterns and use a unique color scale. S12). In general, the south to north gradient of increasing aridity observed with modelled E a and S u is well depicted for all the calibration strategies. However, considerable mismatches in the variability of the patterns are observed among the calibration strategies. Such discrepancies in spatial patterns have implications for water resources assessment including water accounting, flood and drought monitoring and prediction ( AghaKouchak et al., 2015 ;Klemas, 2014 ;Teng et al., 2017 ;West et al., 2019 ). Knowing when flood or drought events occur is important, but knowing the spatial extent of the event is crucial for deploying efficient adaptation and mitigation strategies ( Brunner et al., 2020 ;Diaz et al., 2019 ;He et al., 2020 ). Consequently, improving the representation of the spatial patterns of hydrological processes should be a key consideration in modelling with spatially distributed models. The comparison of the maps of modelled E a ( Fig. 8 a) with the reference ALEXI dataset ( Fig. 3 ) reveals that the spatial patterns of E a in the case BA show the highest mismatch with the reference, thereby unveiling the Fig. 9. Model performance of streamflow ( Q ), terrestrial water storage ( S t ) and soil moisture ( S u ) using various evaporation datasets and model calibration strategies with the mHM model. The results are shown for the four climatic zones in the Volta River Basin (VRB) over the simulation period (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012). Each score for a given evaporation product represents the average over the scores obtained with 4 calibration strategies, while the score is the average over scenarios with 12 evaporation datasets for each calibration strategy. The skill scores of the temporal dynamics are obtained with the Kling-Gupta efficiency ( E KG ) for Q and the Pearson's correlation coefficient ( r ) for S t and S u . The spatial pattern efficiency ( E SP ) is used to assess the spatial representation of S u . The benchmark model performance is given for the Q-only calibration as a reference. The skill scores are ranked from the highest (bluest) to the lowest (reddest). potential pitfalls of the basin-average calibration. The highest improvement of the spatial patterns of S u is obtained with case SB ( + 101%), followed by case SP ( + 93%) and case PW ( + 85%), while case BA ( + 76%) provides the least improvement ( Fig. 7 b).

Analysis per climatic zone
The analysis of the model performance according to climatic zones is done for all hydrological processes except Q because of the few gauging stations, which are unevenly distributed across the VRB (cf. Fig. 1 ).
The results reveal contrasting model performances across climatic zones ( Fig. 9 ). Additional results are provided in the supplementary material ( Figures S3-S11).
In average, the model performance at predicting the temporal dynamics of S t and the spatial patterns of S u is higher in the intermediate climatic regions (i.e., Sudano-Sahelian and Sudanian zones) than in the driest and wettest regions (i.e., Sahelian and Guinean zones) of the VRB. In terms of the temporal dynamics of S u , the model performance decreases slightly from the driest to the wettest regions.
In general, the multivariate calibration scenarios with evaporation datasets lead to a higher model performance for all hydrological processes in all climatic zones, as compared to the Q-only calibration ( Fig. 9 , Figure S13). Although the GLEAM products globally perform well, they are the least effective at predicting the temporal dynamics of S t in the Sahelian zone. The top three best evaporation datasets for improving the average model performance across climatic zones are SSEBop, MERRA-2 and ALEXI for the temporal dynamics of S t , MERRA-2, GLEAM v3.3a and GLEAM v3.2a for the temporal dynamics of S u , and GLEAM v3.3b, GLEAM v3.3a and GLEAM v3.2a for the spatial patterns of S u . Similar to results obtained at the entire VRB scale ( Section 3.1 ), MOD16A2, SEBS and ERA5 still show the lowest contribution in improving the model performance across different climatic zones. More details on the ranking of the evaporation datasets are provided in the supplementary material (Tables S1-S5; Figures S14, S21 and S40).
Contrary to the basin scale analysis, case SP is the least efficient calibration strategy per climatic zones. This result can be justified by the fact that the model calibration on the spatial patterns of evaporation datasets is done at the scale of the VRB. Consequently, the spatial variability of hydrological processes at large scale is not representative of the climatic zones where the patterns are more homogenous. For regions with strong spatial variability, sub-region model calibration on spatial patterns can be a way forward in overcoming the pitfalls of domain-wide calibration, thereby ultimately resulting in a higher model performance.

Discussions
The results presented are primarily valid for the VRB because errors in evaporation estimates are known to vary according to region ( Hartanto et al., 2017 ;Sörensson and Ruscica, 2018 ). However, the innovative model evaluation approach proposed in this study is not location-specific and can be applied to other regions using a grid-based hydrological model. The robust evaluation approach for evaporation datasets is based on multiple hydrological processes, and responds to the recent call of the scientific community for process-oriented diagnostics of earth system models ( Maloney et al., 2019 ;Melsen et al., 2016 ). The same methodology can be deployed for the evaluation of other spatial observational datasets such as gridded soil moisture estimates using robust metrics ( Dong et al., 2019 ), and subsequently be used to estimate their value for hydrological modelling. Moreover, this study highlights the potential of satellite and reanalysis evaporation datasets to improve the representation of various hydrological processes, which might guide modellers in choosing the adequate product for their applications, and support the data developers in their quest of improving global estimates of evaporation ( McCabe et al., 2019 ). It must be noted that the overall ranking in Fig. 9 does not systematically determine whether a dataset is good or bad, rather it shows which evaporation product provides the highest or lowest model performance for a given hydrological flux or state variable. Only the skill scores allow a judgement on the ability of a given dataset to yield a good model performance.
There might be uncertainties related to the rescaling of the evaporation datasets from their native spatial resolutions to that of the hydrological modelling resolution. However, for a fair comparison, all the datasets should be used at the same spatiotemporal resolution. The high performance of the GLEAM datasets is most likely due to the integration of soil moisture information in the calculation of actual evaporation. Therefore, GLEAM is expected to perform well for hydrological modelling. Surprisingly, MERRA-2 ranks among the best evaporation datasets notwithstanding its coarse spatial resolution. The high performance of MERRA-2 can be explained by its high temporal resolution, which might compensate its lower spatial resolution. In general, there is no clear tendency of SRS datasets to outperform the reanalysis datasets, and vice versa. Thus, besides their use as forcing data, reanalysis datasets represent a valuable source of information for the calibration of hydrological models. Satellite data of soil moisture and terrestrial water storage used for model evaluation in this study are not free of errors. However, at large scale and in poorly gauged regions, they are the only source of data that can be used for spatial model evaluation ( Peters-Lidard et al., 2019 ).
Overall, this study strives to provide solutions to some of the current challenges in hydrology (i.e., modelling methods, uncertainty and spatial variability; Blöschl et al., 2019 ). The proposed methodology represents an innovative way to use satellite and reanalysis datasets to improve process representation in hydrological models ( Clark et al., 2015 ;Peters-Lidard et al., 2017 ), develop hydrological water accounting ( Hunink et al., 2019 ) and advance prediction in ungauged basins ( Hrachowitz et al., 2013 ;Zhang et al., 2019 ).
Future studies should test the advantages of a multi-scale calibration framework that accounts for both domain-wide and sub-domain spatial heterogeneity, which might lead to better prediction of spatial patterns across climatic zones at large scale. Calibration strategies with multiple non-commensurable variables as well as spatial patterns is a way forward in advancing process representation in hydrological models ( Dembélé et al., 2020a ;Nijzink et al., 2018 ;Zink et al., 2018 ). More importantly, the choice of the calibration strategy or the objective function is determinant for high model performance, mainly under a changing environment ( Fowler et al., 2018 ;Schaefli et al., 2010 ). Uncertainties in the structure of the mHM model might influence the modelled hydrological processes. For instance, the calculation of actual evaporation ( Section 2.1 ) might be subject to a double accounting of stress from both the leaf area index and soil moisture. However, in such case, the sources of stress can be proportionally adjusted by the model parameters during the calibration. Further work should investigate the impact of different model structures on the methodology proposed in this study, and assess the parameter sensitivity of mHM depending on calibration strategies and variables, which is beyond the scope of the current study. Besides improving model calibration, satellite and reanalysis datasets can play an important role in identifying deficiencies in model structures and contribute to model improvement ( Hulsman et al., 2020 ). As the accuracy of the satellite and reanalysis datasets depends on the quality of the input meteorological datasets used for their production, it is also important to assess the impact of precipitation datasets on evaporation modelling ( Dembélé et al., 2020b ;Mao and Wang, 2017 ;Or and Lehmann, 2019 ).
Finally, an ensemble product that merges different evaporation datasets is a potential way forward in reducing regional uncertainties and thereby improving global estimates (da Motta Paca et al., 2019 ;Jiménez et al., 2018 ). Such advances in evaporation product development can facilitate prediction in ungauged basins using earth observations.

Conclusion
Four model calibration strategies are used to evaluate twelve satellite and reanalysis datasets in the large transboundary Volta River basin located in West Africa. The experiment is done with the mHM model over the period 2003-2012. The key findings can be summarized as follows: -Satellite and reanalysis datasets can improve the predictive skill of the hydrological model if the appropriate calibration strategy is used. -Overall, MERRA-2, GLEAM v3.3a and SSEBop individually provide the highest contribution in improving the model performance.
-Model calibration on the full extent of the gridded evaporation datasets result in a higher model performance than calibration on basin-average estimates. -Using only the spatial patterns of gridded evaporation data for model calibration, and not their absolute values, yields higher model performance than classical approaches based on basin average evaporation signal or based only on streamflow. -Contrasting spatial patterns of soil moisture are obtained depending on the modelling scenarios, with differences in the model performances according to climatic zones.
These findings contribute to solving current challenges related to large-scale hydrological modelling and provide avenues for improving process representation with the use of increasingly available satellite and reanalysis datasets. Improving the representation of the spatial patterns of hydrological processes should be a key consideration in modelling with spatially distributed models, which would allow better predicting floods and droughts. Moreover, the results provide insights to the developers of the evaporation datasets and might serve of guidance for future developments. However, a replication of the proposed methodology to evaluate evaporation datasets should be applied in other regions with different hydro-climatic conditions, and with different hydrological and land surface models. Future work should also investigate the possibility of prediction in ungauged basins solely from earth observation datasets, which are increasingly and readily becoming available.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments and Data
We thank the providers of the datasets used in this study (see Tables 2 and 3 ). We are grateful to the streamflow data providers (DGRE of Burkina Faso, HSD of Ghana and the Volta Basin Authority). Moctar Dembélé was supported by the Swiss Government Excellence Scholarship (2016.0533 / Burkina Faso / OP) and the Doc.Mobility fellowship (SNF, P1LAP2_178071) of the Swiss National Science Foundation. Bettina Schaefli and Natalie Ceperley were supported by a research grant of the Swiss National Science Foundation (SNF, PP00P2_157611). We thank the three anonymous reviewers for their constructive feedback. The entire modelling database can be freely accessed at http://doi.org/10.5281/zenodo.3800416

Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.advwatres.2020.103667 .