Assessing robustness in global hydrological predictions by comparing modelling and Earth observations

ABSTRACT Hydrological modelling to support hypotheses on Earth system boundaries or the accelerating water crisis is nowadays done at the global scale, with difficulties associated to model uncertainties. Here we bring a robustness analysis of internal model variables as an additional tool for model evaluation using data from six Earth observation products and the global catchment model World-Wide HYPE in a comparative study. The assessment shows that: (i) variables have high agreement in mid-latitude temperate regions; (ii) the variables with higher agreement, and associated with good model performance in streamflow, were actual evapotranspiration, fractional snow cover and snow water equivalent; and (iii) changes in total water storage showed very poor agreement, probably due to an insufficient number of aquifers in the model set-up. We propose this procedure as a standard complementary method in global hydrological modelling, highlighting the importance of justifying models before using them for scenario analysis or water accounting.


Introduction
The first global hydrological model (GHM) appeared in 1989 and described the global hydrological cycle in broad terms (Sood and Smakhtin 2015).Since then, different modelling communities followed, using different discretization schemes and approaches to describe dynamics of hydrological variables (Bierkens et al. 2015) and develop GHMs or continental models for different purposes (Archfield et al. 2015).For instance, hydrological communities moved towards global water security modelling (Arnell 1999, Vörösmarty et al. 2000, Döll et al. 2003) as well as forecasting river flow at the global scale (e.g.Alfieri et al. 2023, Harrigan et al. 2023).
Meanwhile, the climate community moved towards Earth system models with improved hydrological components (e.g.Tang et al. 2007, Rost et al. 2008, Pokhrel et al. 2014, Wada et al. 2014), and the meteorological land-surface community coupled atmospheric and water cycle processes (e.g.Liang et al. 1994, Best et al. 2011, Guimberteau et al. 2012) for integrated forecasting.All these global models are gridded and many are not calibrated or evaluated against streamflow observations but rather are used in multi-model ensembles.Accordingly, recent systematic evaluation of several such GHMs found weak to poor performance of river discharge in many basins around the globe (Gädeke et al. 2020, Krysanova et al. 2020).
The catchment hydrology community, with a long tradition in calibrating and evaluating hydrological models, has traditionally worked at scales up to the river basin.Thanks to the increasing computational capacity as well as the growing trend towards open data, catchment modelling has now expanded to also encompass continental scales (Widén-Nilsson et al. 2007, Pechlivanidis and Arheimer 2015, Donnelly et al. 2016).Along these lines, Arheimer et al. (2020) set up the Hydrological Prediction for the Environment (HYPE) model, a processoriented semi-distributed hydrological model, at the global scale.World-Wide HYPE (WWH) uses openly accessible data, such as river discharge after quality checks (Crochemore et al. 2020), and all datasets required for the catchment delineation, soil, and land use classification, and model forcing.The stepwise calibration strategy used in the set-up of WWH was designed to represent river discharge worldwide (Arheimer et al. 2020).
Nevertheless, the goal of GHMs is broader than just simulating streamflow; they also aim at supporting hypotheses about Earth system boundaries (Rockström et al. 2023) or how to manage the accelerating water crisis (e.g.Hoekstra et al. 2012, Bierkens et al. 2021, GCEW 2023).This demands an understanding of the global water cycle and reproducing the different water cycle components in addition to river discharge, for instance global evapotranspiration (Pimentel et al. 2023), runoff from soils (Santos et al. 2022) or groundwater fluxes (De Graaf et al. 2019).It is well known that correctly reproducing river discharge does not guarantee that other hydrological state variables and fluxes in a catchment are correctly represented; models may reproduce river discharge well, but not for the correct reasons (Kirchner 2006), which could be due to the problem of equifinality (Beven 2006, Muñoz et al. 2014, Her and Chaubey 2015).Therefore, evaluating internal variables in hydrological models is important to ensure the correct representation of the different water cycle components (Montanari et al. 2008, Lindström et al. 2010, Rakovec et al. 2016, Li et al. 2018).
Nowadays, another independent source of data is available from Earth observation (EO).The huge efforts made in the last few decades to observe the Earth from space have made accessible a sizeable number of products representing hydrological variables (mostly covering at least the last decade).For instance, potential and actual evapotranspiration (PET and AET; Mu et al. 2011), snow cover (Hall et al. 2002), snow water equivalent (SWE; Takala et al. 2011), soil moisture (SM; Dorigo et al. 2017) or changes in water storage (ΔS; Han et al. 2005) are some of the hydrological variables retrieved from EO. Sometimes, the spatiotemporal resolution of these products makes their use at the local scale difficult; however, they offer great opportunities at the global scale.
Even though this remote sensing information is usually referred to as observation, in most cases the final variable provided is based on post-processing and combination with other datasets.For instance, EO-based evapotranspiration is derived from surface temperature observations, using meteorological information and specific assumptions about vegetation cover (Mu et al. 2007); soil moisture is derived from microwave information assuming changes in the dielectric constant (Notarnicola et al. 2008); and ΔS values are derived from variation in the gravity field (Landerer and Swenson 2012).Therefore, these products include not only uncertainties due to the measuring devices, but also uncertainties related to the algorithms used for translating the observed magnitudes to the final hydrological variables.Here, we will refer to these data as "EO products" to underline that they are not direct observations, and uncertainties are intrinsically included in all of them, just like in hydrological models.
There is a clear need to assess the performance of GHMs using streamflow data (e.g.Krysanova et al. 2018Krysanova et al. , 2020)).In addition, we propose that an in-depth assessment of GHMs against not only streamflow but also remote-sensing information would help ensure that model results are robust.Since both hydrological models and the EO products have uncertainties and constitute a combination of direct measurements and equations, we do not aim to evaluate them against each other.Doing so would assume that one is inherently more trustworthy than the other, which is sometimes done in hydrological studies (e.g.Xu et al. 2014, Rajib et al. 2016, Demirel et al. 2018, Huang et al. 2018, Han et al. 2019, Dile et al. 2020).We prefer to consider both datasets as plausible quantifications of a certain variable.
Therefore, we propose to assess the degree to which these quantifications agree or correspond as an indicator of the results' robustness (see definitions by e.g.Lloyd 2015, Schupbach 2018, Harris and Frigg 2023).Thus, we define a variable as robust if there is agreement on it among several independent data sources (e.g.river gauges, process-based modelling, EO products) at the global scale.Such robustness would give more confidence when using hydrological model results, for instance in impact analysis (e.g.climate scenarios), operational hydrological forecasts, and understanding global hydrological storages and fluxes.Finding robust variables calls for a multi-source and multi-variable analysis.In this context, the main objectives of this study are twofold: first, to compare estimates of hydrological variables from WWH and EO products to assess the robustness of results; and, second, to promote the concept of evaluating GHMs in the light of EO products and thereby go beyond in situ river discharge observations to multi-variable estimation of the water cycle.
This work examines research question No. 16 of the 23 unsolved problems in hydrology (UPH; Blöschl et al. 2019): "How can we use innovative technologies to measure surface and subsurface properties, states, and fluxes, at a range of spatial and temporal scales?" p. 1148.

Data
We compared estimates of six hydrological variables from two types of data sources for the period 2000-2014 (Table 1): EO and dynamically modelled outputs from the GHM WWH.The procedures on how to estimate the hydrological variables in the model and in each EO product are presented below.

Potential evapotranspiration
PET is a concept commonly used in hydrology as an intermediate component describing the energy-based demand when calculating AET.PET is defined as the evaporation from a vegetated surface with an unlimited amount of water available and without advection or heat-storage effects *Mean size of 1020 km 2 (5th percentile: 64 km 2 ; 50th percentile: 770 km 2 ; 95th percentile: 2185 km 2 ).(Thornthwaite 1948).We use the moderate resolution imaging spectroradiometer (MODIS) global PET product (MOD16; Mu et al. 2007Mu et al. , 2011) ) at monthly resolution for EO, which is the most standardized remote-sensing-based evapotranspiration product at global scale.MOD16 is based on the Penman-Monteith equation (Penman 1948, Monteith 1965), which combines energy and water balance using standard climatological records, such as radiation, temperature, humidity, and wind speed, and introducing aerodynamics and surface resistance from the vegetation canopy.Due to its physical parameterization, this product needs a high number of input datasets to calculate the final evapotranspiration.These include remote sensing products and meteorological data.
The remote sensing products it utilizes are: (1) collection 5 climate modellling grid (CMG) MODIS albedo band 10 from MOD43C1 (Schaaf et al. 2002, Jin et al. 2003, Salomon et al. 2006); (2) global collection 4 MODIS land cover type 2 (MOD12Q1; Friedl et al. 2002) and (3) global MODIS collection 5 of fraction of photosynthetically active radiation (FPAR) and leaf area index (LAI) (MOD15A2; Myneni et al. 2002).Daily air pressure, air temperature, humidity, and radiation from the reanalysis product of the Global Modelling and Assimilation Office (GMAO) are used as meteorological forcing (see Mu et al. 2011 for specific details).The final PET provided by the MOD16 product is calculated as the sum of the evaporation from rain intercepted by the canopy before it reaches the soil, the potential plant transpiration, the evaporation from wet soil and the PET from the soil components.The MODIS PET product has been assessed by comparison with other distributed PET estimates (Velpuri et al. 2013, Faisol et al. 2020).However, its accuracy is highly dependent on the meteorological data and the PET equation used in the calculation.
In WWH, PET (epot in HYPE terminology) is calculated using three different model equations: (1) the Jensen-Haise equation (Jensen and Haise 1963) in temperate areas; (2) the modified Hargreaves equation (Hargreaves and Samani 1982) in tropical and arid areas; and (3) the Priestley-Taylor equation (Priestley and Taylor 1972) in snow/ice-dominated regions.These algorithms have been judged the most appropriate in previous HYPE set-ups in the selected regions (Donnelly et al. 2016, Andersson et al. 2017, MacDonald et al. 2018).Equation parameters for PET calculations have been estimated for each hydrological response unit, (HRU; corresponding to different land cover) using river discharge from representative gauged catchments (Arheimer et al. 2020, Santos et al. 2022).

Actual evapotranspiration
MOD16 was also used as the remote sensing-based product for AET.The Penman-Monteith equation is behind this product (see Section 2.1).However, in this case, the final AET is calculated as the sum of different evaporation components: evaporation losses from the wet canopy, actual plant transpiration and actual soil evaporation.This product was validated against eddy covariance measurements worldwide, with varying results in terms of accuracy (Hu et al. 2015).For instance, Mu et al. (2011) showed a mean bias in daily AET of 0.31 mm/ day −1 at 46 sites in North America.In South America, Ruhoff et al. (2013) found that the relative error (RE) in AET was about 19% at two sites over the Rio Grande (Brazil).In Europe, Hu et al. (2015) obtained correlation coefficients ranging within the interval 0.45-0.98 and a mean bias within the interval −0.9 to 1.11 mm/day −1 .In Asia, Kim et al. (2012) found AET correlations ranging from 0.27 to 0.82 and biases ranging from −24 to 2.3 mm/8 days −1 , over 17 sites.
In WWH, AET (evap in HYPE terminology) may occur from the two topsoil layers (some 1-2 m depending on the HRU), considered as the plant rooting depth.In each layer, AET is assumed equal to PET if the soil moisture is above a certain threshold (field capacity) and decreases until the zeroevaporation threshold is reached (wilting point).The AET is assumed to decrease exponentially with depth and is divided between the two layers.AET is set to zero for temperatures below a certain threshold and for negative PET estimation (Lindström et al. 2010).For the HRU corresponding to water bodies (e.g.rivers, floodplains, and lakes), AET is assumed equal to PET when the air temperature is above a certain threshold.This evaporation is limited by the volume of water in the water body.

Fractional snow cover
For fractional snow cover (FSC) we used MOD10CM, which is a global 0.05 × 0.05° resolution monthly mean snow cover extent product derived from the daily equivalent MOD10C1 (Hall and Riggs 2007).Two filters are applied to consider only relevant days in the monthly mean estimation.The first filter is used to remove days with high cloud coverage.A daily clear index (CI) is defined and only days above a threshold (70%) are used in the monthly mean.The second filter is applied to filter out the cells in which the snow cover extent is less than a threshold (10%).Similarly, the daily FSC product, MOD10C1, has spatial aggregation at 0.05 × 0.05° of the 500 × 500 m daily MOD10A1, which contains the original snow cover extension in terms of normalized difference snow index (NDSI; Dozier 1989).This index is calculated based on the physical properties of snow.
Snow has a very high reflectance in the visible region and very low reflectance in the shortwave infrared region of the electromagnetic spectrum.A cell with an NDSI value greater than zero is considered to have some snow presence.The NDSI from MOD10A1 is interpreted as a binary map to build the MOD10C1 product.MODIS snow products have been substantially assessed against ground data.An overall accuracy of about 93%, with variations by land-cover type and snow condition, was reported by Hall and Riggs (2007).In mountain areas these accuracies depend less on land-cover type and more on the period of the year (Parajka et al. 2012, Notarnicola et al. 2013).The errors are usually lower in winter than in spring (Gascoin et al. 2015, Pimentel et al. 2018).In addition, Hall and Riggs (2017) stated that MOD10CM, compared to other monthly snow maps, tends to better agree during the winter months than in the fall when snow is accumulating.
WWH calculates FSC (cfsc in HYPE terminology) based on the methodology of Samuelsson et al. (2011).The snow cycle is divided in three phases: (1) an accumulation phase in which the snow cover increases as function of the SWE to a maximum extent threshold; (2) a consolidation phase, in which the snow cover extent is redistributed based on topographic characteristics of the HRU (e.g.mean elevation and standard deviation of this elevation) and the maximum extent is reached; (3) a melting phase, in which the snow cover extent decreases exponentially.For seasons when the snowpack does not reach the definition of large snowpack, the first equation is used during the whole season.

Snow water equivalent
The Global Snow Monitoring for Climate Research (GlobSnow v. 2.0; Takala et al. 2011) is used as the remote sensing source for SWE.This product is based on a retrieval algorithm that assimilates synoptic weather station data on snow depth with satellite passive microwave radiometer data from the Scanning Multichannel Microwave Radiometer (SMMR), Special Sensor Microwave Imager (SSM/I) and Special Sensor Microwave Imager/Sounder (SSMIS) sensors (Pulliainen et al. 1999, Takala et al. 2011).Daily, weekly, and monthly time series of SWE, covering more than 30 years, are produced at 25 km resolution over the Northern Hemisphere (latitudes ranging from 35° to 85°), except for mountainous areas and Greenland, due to limitations in the satellite information and retrieval algorithm.The monthly product was used in this study.The retrieval accuracy of GlobSnow SWE product was evaluated against 1264 different snow survey locations in Russia (Hancock et al. 2013).The comparison showed a root mean square error of 32.5 mm for snowpacks with SWE values ranging between 0 and 150 mm; this error increases to 43.5 mm when including all snowpacks analysed (Luojus et al. 2016).GlobSnow SWE product was also evaluated using snow surveys in Finland and Canada, with root mean square errors of 30 and 100 mm, respectively (Mortimer et al. 2020).
In WWH, all precipitation falling under a temperature threshold is considered to be snow and is added to the snowpack.Moreover, a portion of the precipitation happening within a temperature range centred on this temperature threshold is also defined as snow and added to the snowpack.SWE losses occur by evaporation and melting.Evaporation from the snowpack is calculated as a fraction of the PET in each catchment, which is calibrated for each HRU in the model.Snowmelt is a function of two melting ratios, that are temperature-and radiation-dependent, respectively.A snow cover scaling is also applied in each catchment to only consider covered areas.

Soil moisture
SM from the European Space Agency (ESA) CCI SM (ESA multi-decadal, global satellite-derived dataset as part of the Climate Change Initiative programme; Dorigo et al. 2017) is used as the remote sensing source data.Here we used the COMBINED product, which is a combination of active and passive microwave products and was found to outperform the single sensor input products (Dorigo et al. 2017).The ESA CCI SM COMBINED product has global coverage, daily temporal resolution and 25 × 25 km grid size.
There is no clear agreement on the soil depth that these products can represent; however, several studies suggested that the SM measurements could be extrapolated to 20-25 cm (Zeng et al. 2015, Wigneron et al. 2017, Ma et al. 2019).The product provides several uncertainty flags for removing possible errors in measurements and retrieval, which we have considered as a quality check of the data.The overall accuracy, which was calculated as unbiased root mean square error, of the ESA CCI SM product is 0.04 m 3 m −3 .Moreover, based on various studies, Dorigo et al. (2017) concluded that this product generally matches in situ observation relatively well in temperate climates, over grassland and agricultural areas, and in semi-arid regions, but has difficulty reflecting the temporal dynamics in the driest and wettest areas.
In WWH, the SM is the result of the water balance in the soil.A fraction of the rainfall and snowmelt infiltrate into the topsoil, the remaining fraction being diverted into surface and macropore flows.This diversion only occurs if a maximum water content is exceeded.This maximum threshold in a soil layer is determined by three model parameters defined per layer that represent the total porosity of each layer.Soil moisture above the threshold may percolate down to the next layer.The percolation is determined by the soil type influencing the maximum percolation capacity and the available space in the soil layers below (Lindström et al. 2010, Arheimer et al. 2020).In WWH, the soil is divided into three layers (e.g.Santos et al. 2022).Here, we only used the water available in the first layer (sml1), which has a constant depth of 25 cm.

Variation in ΔS
Gravity Recovery and Climate Experiment (GRACE) Tellus Release 05 is the dataset used as the EO product to assess variations in water storage (ΔS).GRACE Tellus derives ΔS relative to a time-mean baseline measuring changes in the gravity field (Wahr et al. 1998), which can be mainly associated with variations in all hydrological storage.Here, we used, as recommended, the ensemble of the spherical harmonic data produced by three centres: the Center for Space Research (CSR), GeoForschungsZentrum (GFZ), and the Jet Propulsion Laboratory (JPL).This ensemble is multiplied by a scaling grid, which restores most of the energy removed by the Gaussian and degree 60 filters applied in the production of the spherical harmonic solutions and reduces uncertainties.These uncertainties, in terms of mean square error, are generally lower than 2 cm for large river basins and about 3-4 cm for smaller river basins (Landerer and Swenson 2012).GRACE Tellus has a global coverage of monthly data at 1° × 1° spatial resolution.
In WWH we derive the ΔS through a post-processing procedure applied to model outputs.All ΔS in the model (i.e. in aquifers, glaciers, lakes, floodplains, rivers, snow, and soil) are summarized for each catchment and aggregated to the monthly time scale.Monthly ΔS values are derived in relation to a reference baseline, which is calculated as the mean ΔS for the same years as the baseline in GRACE Tellus.

Methods
For each of the six hydrological variables listed in Table 1, EO and WWH time series were compared at the coarser scale.The PET, AET, FSC, SWE and SM from EO products were aggregated to the model's spatial scale (on average ~1000 km 2 ) by calculating spatial average values for each catchment and month, taking into consideration specific EO products' warning flags.Hydrological model anomalies in ΔS were computed at GRACE resolution (1° × 1°).
The variables and analysis methods were chosen to examine the robustness in results from WWH for three main applications: (i) assessing long-term water resource availability under present and future climates, (ii) operational forecasting of seasonal variation in river discharge, and (iii) better understanding hydrological processes worldwide.First, connected to climate application, long-term means (defined as the means of the annual values during 2000-2014) were used in the analysis of raw EO and model values in terms of their relative difference formulated based on the RE: where x represents the long-term mean of the selected variable during the period 2000-2014, and the sub-indices EO and WWH represent the EO product and the WWH modelled hydrological variable, respectively.Positive (negative) values indicate that the WWH-modelled value is greater (smaller) than the EO product value.
Second, monthly means were used to assess the seasonal dynamics for each variable within the study period.This assessment was carried out using the statistical Kling-Gupta efficiency metric (KGE; Gupta et al. 2009; Equation (1)) and its components r, β and α (Equations ( 2)-( 4)).These components are related to the most common metrics used to assess model performance in catchment hydrology: Pearson correlation coefficient (CC), RE and relative error of standard deviation (RESD): where: and x represents the monthly time series of each variable, x their long-term means, and σ their standard deviations.The sub-indices EO and WWH represent the EO product and the WWH modelled hydrological variable, respectively.
Finally, connected to process understanding, we evaluated the relation between (i) the WWH model performance in terms of discharge and (ii) the agreement between EO products and internal WWH model variables.The WWH model had already been calibrated and evaluated using KGE as a metric for model performance between simulated and in situ river flow measurements at 5338 sites for the period 1981-2012 (Arheimer et al. 2020), of which 4367 sites where applicable in the current study for the time period with EO data (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014).The results of Arheimer et al. (2020) show a median monthly KGE of 0.4 with spatial variations in model performance.For instance, better performances (KGE > 0.6) were found in Europe, Japan, Southeast Asia, and the USA, as well as in parts of Canada, Russia, and South America.
Moreover, the WWH model shows potential in reproducing several flow signatures (Arheimer et al. 2020), such as high monthly flows, spatial variability of high flows, duration of low flows, and constancy (see definition by Colwell 1974) of daily flow.Nevertheless, robust signals of additional hydrological model variables would indicate that the model represents river discharge for the right reasons.To analyse this, we grouped catchments globally into four groups based on their KGE value for discharge in the 4367 gauging stations with data available in the period 2000-2014, each group representing one quartile of the KGE distribution.In each group, the six internal variables (PET, ET, FSC, SWE, SM and ΔS) were assessed using the same metrics as described in Equations ( 1)-( 5).

Results and discussion
The results from comparing hydrological variables from GHM and EO products around the globe show both similarities and differences, which, in coherence with previous studies, are sometimes difficult to explain (cf.Döll et al. 2014, Bierkens et al. 2021).Nevertheless, we conducted a thorough analysis of the six variables studied with the ambition to learn about process representation.Below we discuss potential reasons for disagreements, and we found that uncertainties could be referred to both data sources.
In the discussion that follows, we use the term "robustness," which has various meanings in science (O'Loughlin 2021).We examine our results from a model-based robustness analysis perspective, which is commonly used in climate science (Lloyd 2015, Schupbach 2018).According to this definition, a result is regarded as well supported if a sufficiently diverse set of models agree on it (Harris and Frigg 2023) when there is no "ground truth" available.Using this definition, we found the long-term mean of the variables PET, AET, FSC and SWE to be robust, SM to be semi-robust and ΔS to be weak.For seasonal dynamics, none of the variables were robust at the global scale (only regionally) except for PET.To a large extent this can be referred to uncertainties in the EO products, but also to the GHM (see below).

Long-term patterns
Comparison of annual means for the six variables from WWH and EO products, respectively, showed similarities in the global patterns but also major differences, which were quantified in relative terms (Fig. 1) and analysed as follows.
For PET, the general decreasing gradients from the equator to the poles are captured in both datasets.However, when looking into PET relative differences between the two datasets (Fig. 1, row 1, column 3), one can see some local discrepancies.PET results are more similar in the southern hemisphere than in the northern hemisphere.WWH gives lower values of PET in latitudes greater than 30°N with relative differences between −25 and −75%.The different equations used for estimating PET in WWH (Arheimer et al. 2020) and via the EO product (Mu et al. 2011) might be the main reason for this lack of agreement (Pimentel et al. 2023).
In the case of AET, the latitudinal gradients are also well captured in both datasets (Fig. 1, row 2).Again, spatial heterogeneities can be identified in the comparison of the two AET maps.The regions with good agreement between the two analysed data sources (relative differences within the interval from −25 to 25%) are: the Amazonas, central Africa and Indonesia (AET values in the range 1000-2000 mm); eastern North America and central Europe (AET values in the range 500-1000 mm); and western North America, eastern Europe and western Russia (AET values in the range 0-500 mm).In contrast, the regions with lower agreement are southern South America, the Sahel, southern Africa, India, the Mekong region, and Australia.In all these regions AET is estimated up to 75% higher in WWH than in the EO product.The results for eastern Russia and northern Canada also differ, but in this case AET is estimated up to 50% lower in WWH than in the EO product.These differences could be due to systematic errors in the MODIS product, which have been observed in previous studies.When validating the MODIS AET product against other remote sensing products at the global scale, Miralles et al. (2016) highlighted systematic lower values for the MODIS AET product in tropical and sub-tropical regions but larger values in high latitudes of the northern hemisphere.
Snow is distributed similarly worldwide in both WWH and EO products (Fig. 1, rows 3 and 4).In the northern hemisphere, FSC values match well in latitudes between 40°N and 80°N.Values are above 0.4 in both datasets, and relative differences lie within the interval −25 to 25%.However, we also found low agreement in certain areas of this latitude band: the coastal areas of Alaska, Canada, and western Europe.The non-permanent character of snow in these areas, with values of FSC within the interval 0-0.2 (Fig. 1, row 3, columns 1 and 2), leads to drastic relative differences when comparing the two datasets (Fig. 1, row 3, column 3).Greenland and the islands located above 70°N also show low agreement in FSC.The analysis is biased by the limitations of the EO product in capturing snow information during winter months due to the northern nights (Lee et al. 2006, Notarnicola 2020, Rößler et al. 2021).The discrepancy also reflects the different methodologies used for separating ice and snow in WWH (Lindström et al. 2010) versus the EO product (Hall and Riggs 2017).In latitudes between 30° and 40°N, FSC generally shows low agreement between the two data sources.WWH simulates FSC in lower latitudes, where the EO product does not retrieve snow.In the southern hemisphere, the agreement of FSC is low in the Andes and New Zealand, with high relative differences also due to the non-permanent character of snow.The seasonal character of snow in areas with highly seasonal snow (e.g.Sierra Nevada in Spain, Atlas Mountain Range in Morocco, Sierra Nevada in California, Andes in Chile) is not well represented, neither in the WWH model nor in the EO product.Higher resolution, better meteorological data, and/or specific calibration would be needed to capture these local phenomena (Marchane et al. 2015, Cornwell et al. 2016, Pimentel et al. 2017b).
In terms of SWE, the EO product is limited to the northern hemisphere at latitudes above 30°N.This variable shows lower agreement than FSC (Fig. 1, rows 3 and 4, column 3).Relative differences higher than 75% are observed in all regions with data, which indicate low robustness in these estimates.WWH gives lower SWE values than the EO product in Alaska, western North America, and China.In contrast, the EO product provides lower SWE in eastern North America and western Europe.In this case the EO product is not conditioned by the northern night as it operates in the microwave region of the spectrum; however, more uncertainty is introduced due to the major complexity of the retrieval algorithms (Takala et al. 2011).
The long-term mean values of SM in the upper soil layer are more homogeneous in WWH than in the EO product.Nevertheless, both identify the lowest SM values (0-0.1 m 3 m -3 in the same areas -the Sahara, Arabian and Gobi deserts -and the highest values (0.2-0.4 m 3 m −3 ) in latitudes higher than 50°N.When comparing patterns of long-term SM values, extremely dry areas (e.g.Sahara Desert, Arabian Peninsula, Karakum Desert and Gobi Desert) correspond to locations where the WWH model shows much lower values of SM than the EO product, reaching RE values in the interval −50 to −75%.Deserts have also been identified as areas where EO products have low correlation with reanalysis data (e.g.ERA-Interim/Land, Dorigo et al. 2017).Dorigo et al. (2010) attributed these differences to the filling of temporal gaps in the passive microwave time series with lower quality active microwave observations.In contrast, WWH shows larger values than the EO product in a more dispersed fashion across the northern hemisphere without following a clear pattern (e.g.northwestern North America, Siberia, and the Tibetan Plateau).Previous assessments of the EO product have shown poor correlation with in situ measurements in high latitudes, including the high-latitude boreal forest and the Tibetan Plateau (Dorigo et al. 2017).
For ΔS, the results show a general discrepancy between WWH and EO (Fig. 1, row 6, columns 1 and 2).The relative differences show a much lower ΔS in WWH than in the EO product.This difference is almost homogeneous worldwide, with relative differences below −75% for 60% of the global land mass.Small overestimation spots appear dispersedly all over the globe, being significant only in parts of northern Canada and Siberia.This difference might be due to a lack of deep aquifers in these regions, as the WWH model mainly simulates subsurface storage regularly interacting with surface flow (Arheimer et al. 2020).Despite these high differences in ΔS values, the two datasets agree on the sign of ΔS in some regions (i.e.whether the total water storage is increasing -purple in Fig. 3 -or decreasing -orange in Fig. 3).For instance, negative ΔS values in central Canada, the Amazonia and northwestern Siberia, and positive ΔS values in northwestern Canada and Australia, are found in both datasets.
To summarize, global patterns in long-term annual means are similarly represented for PET, AET, FSC and SWE, partly similarly represented for SM and deviate for ΔS when comparing the datasets, i.e.WWH and EO products.

Seasonal dynamics
Seasonal dynamics were compared using KGE and its components.Regarding evaporation variables, the KGE values for PET point to the regions with similar monthly characteristics in EO and WWH datasets (KGE greater than 0.5; row 1, column 1 in Fig. 2): the eastern USA; the southern part of South America; central, eastern, and southern Europe; southern Africa; and Australia.In these areas, CC, RE and RESD have values close to the optimum.In contrast, northern latitudes (above 50°N), equatorial regions (10°N-10°S), and eastern China show low KGE values (KGE lower than 0.25), due to clear biases in mean and standard deviation (row 1, columns 3 and 4 in Fig. 2).Nevertheless, all regions but the equatorial regions show an agreement in temporal dynamics (based on CC).In the northern latitudes and equatorial regions WWH provides clearly lower values than the EO product (RE and RESD lower than −25%), while in eastern China, WWH has clearly higher values (RE and RESD higher than 25%).These discrepancies could be related to the selection of PET methods in the set-up of the WWH model or to limitations of the MODIS products (Pimentel et al. 2023).For instance, Velpuri et al. (2013) demonstrated that a bias in seasonality was introduced in the MODIS product by the GMAO meteorological data and uncertainty was introduced when filling some gaps for evapotranspiration calculation.
For AET, KGE values are generally lower than for PET, and have different spatial distribution.The largest KGE differences appear in southwestern Canada, the western and some part of central USA, South America, equatorial and southern Africa, the Mediterranean region, central Asia, parts of India, China, the Mekong region and Australia, where the model shows higher values in mean and standard deviation (RE and RESD higher than 25%).Interestingly, these regions generally show a good correlation coefficient (except for the Amazon basin, where CC is lower than 0.25), which indicates that the temporal dynamics of the AET process are well captured but not the magnitude.Low monthly correlation values between EO products and flux tower observations in this region have been reported by Paca et al. (2019).Overall, there is a general tendency to underestimate seasonal AET values in the selected EO product (Michel et al. 2016, Xu et al. 2019, Zhu et al. 2022).
The snow variables, FSC and SWE, show quite similar KGE distributions.They show higher agreement in northern latitudes (above 50°N) than in southern latitudes.The high agreement occurs in regions where snow is permanent  , 2000-2014, in terms of Kling-Gupta efficiency (KGE) and each of its components: Pearson correlation coefficient (CC), relative error (RE) and relative error of standard deviation (RESD) (columns) for each of the six selected hydrologic variables (rows): potential evapotranspiration (PET), actual evapotranspiration (AET), fractional snow cover (FSC), snow water equivalent (SWE), soil moisture (SM), and changes in water storage (ΔS).High agreement is represented in blue, while low agreement is represented in red (column 1 and 2) yellow and green (column 3 and 4).
during the whole snow season and therefore both the WWH model and EO products can more easily capture snow cycles and their magnitude.However, some spots also present differences in these areas, such as northern Siberia and the Canadian Artic Archipelago for FSC, and some catchments in eastern USA, western Russia, and northern Canada for SWE.Large differences appear in lower latitudes (KGE lower than 0. 5), which could be considered transition zones, where snow dynamics are more changeable within and between years and, consequently, more difficult to capture (Fayad et al. 2017, Pimentel et al. 2017a).
The low KGE values in FSC over these transition regions are combinations of low values in the three KGE components.For instance, monthly correlation is low, with CC values lower than 0.25.The MODIS EO product used for FSC has limitations in capturing timing in these regions, especially the end of the spring melting (Rittger et al. 2013, Yang et al. 2015, Cornwell et al. 2016, Marchane et al. 2017, Pimentel et al. 2017b, Polo et al. 2020).This fact also influences RE and RESD values, with WWH producing lower values and lower deviations than the EO product in western USA, Chile and the Mediterranean Basin (negative RE and RESD) and higher values and higher deviations than the EO product (positive RE and RESD) in eastern USA and parts of the Tibetan Plateau (Fig. 2, row 3, columns 2-4).
Regarding areas with low KGE values for SWE, the correlation between WWH and the EO product is good (CC values higher than 0.75), showing agreement in temporal dynamics.However, WWH gives much higher snow amounts (RE and RESD greater than 75%) than the EO product (positive RE and RESD) in eastern USA and Europe, and lower (RE and RESD lower than −75%) SWE values than the EO product (negative RE and RESD) in middle USA and eastern Asia (row 4, columns 3 and 4 in Fig. 2).These errors could be related to the more complex snow dynamics over these areas not being correctly captured by the model or to the uncertainty associated with a shallow snowpack in the EO product -or to both sources of error.Takala et al. (2011) affirm that the uncertainty in the SWE EO product increases if the snowpack is smaller than 150 mm, which could correspond to shallow snowpacks common over these zones.In any case, the snow variables, and especially FSC due to its low correlation, were found to display lower agreement than PET and AET in terms of seasonal dynamics.
Regarding SM, KGE values show a clear gradient along latitudes in the northern hemisphere.While KGE values above 50°N indicate low agreement (KGE lower than 0), more reasonable values are observed for areas below this latitude (KGE higher than 0.25).This gradient is clearly due to a bad correlation between the two datasets above 50°N, as shown by the patterns in CC (Fig. 2 row 5, column 2) and due to a large difference in RESD, with generally lower SM variability in the model (Fig. 2 row 5, column 4).Dorigo et al. (2017) point out that, in northern latitudes, seasonal gaps due mainly to the presence of dense forests and frozen soils constitute a challenge for retrieving SM from remote sensing.This may be the cause for the mismatch observed in our comparison.However, interestingly, the results observed in terms of the mean bias are very good, with optimum values reached almost worldwide (Fig. 2, row 5, column 3).Exceptions to this good RE are mainly observed in northern Africa and central Asia (RE lower than −0.75) in areas with deserts, which have previously been highlighted as challenging areas (Dorigo et al. 2017).There is a risk that the unclear soil depth representation in the EO product (see Section 2.5) might be shallower and more spatially variable than the first layer of the WWH soil (fixed at 25 cm in this analysis), which thus must be changed to correspond to the EO product.
KGE values for ΔS are globally quite low (KGE lower than 0, for 75% of the total land area covered by GRACE), which means that there is a large discrepancy between the two methods of estimation.Biases in means are larger and differ between regions, being highly skewed one way or the other, also where other researchers have found high correlation between hydrological modelling and this ΔS EO product (e.g.Rakovec et al. 2016, Cáceres et al. 2020).This indicates that the changes in volume might be due to underestimation of aquifers in WWH (see above).
This general mismatch between the two sources is, however, less related to errors in temporal dynamics (CC greater than 0.5, for 53% of the total land area covered by GRACE).We find good agreement in ΔS for March to May (average relative differences between the data sources of 3.6, −2.5 and −21.1% for each month) and discrepancies in February and June to November (mean relative difference between the data sources of 33.5, −39.1, −31.0, −30.9, −42.8, −50.9%, −66.1% for each month) globally (Fig. 3).The biases most negatively affecting the overall agreement in Fig. 1 (row 6) are for December and January (mean relative differences between the data sources of −116.5 and 355.8%).These more extreme discrepancies are found in the northern hemisphere.
To summarize, besides PET the seasonal dynamics of the studied variables can be considered robust at the regional scale but not globally.To a large extent this can be referred to uncertainties in the EO products but also to the GHM.

Hydrological understanding
Here, we evaluate whether the WWH model predicts streamflow for the right reasons, i.e. whether the numerical process descriptions based on our hydrological understanding can be justified from internal model variables.For that, we compared the relationship between (i) the WWH model performance in terms of discharge and (ii) the agreement between EO products and internal WWH model variables (in terms of KGE and its components).We assume that, if the agreement with an EO-product variable is correlated with the WWM model performance in discharge, the representation of the variable in the model may be the cause of the WWH performance in discharge.Alternatively, if they are anti-correlated, the WWH model discharge may be right for the wrong reasons, or wrong for the right reasons.
The KGE performances for discharge do not appear to be linked with PET since the four catchment groups show a similar spread in PET agreement and acceptable median values of KGE (around 0.5).The components of the KGE show that even though CC is very high (almost 1) for catchments with high discharge KGE, these catchments are also characterized by substantial negative biases in mean and standard deviation in PET with clearly lower WWH model estimates (median values about −25 and −20%, respectively).These results are in line with the previous conclusions obtained for PET in WWH, suggesting that the spatial selection of the PET models may not be optimal in the model set-up (Pimentel et al. 2023).No clear relation exists between the KGE in discharge and SM either.
Poor similarities in SM are found for KGE (median value about 0.2), because of poor correlation (median CC around 0.6) and a poor RESD (median value about −35%).We see a tendency for poorer SM with improved KGE for discharge, which would indicate that the model is right for the wrong reason, but probably reflects the fact that different soil depths are being compared.Hence, the upper soil layer must be adjusted in upcoming WWH versions, as it is crucial in the interaction between land and atmosphere (Katul et al. 2012) and robustness would be important for analysing societal challenges linked to climate change (Van Loon et al. 2016, Döll et al. 2018, Konapala et al. 2020, Gudmundsson et al. 2021).
ΔS values in the selected catchments show very similar patterns regardless of river flow model performances, with KGE median values of about −0.5, CC about 0.4, RE about 0% in three groups out of four, and RESD about −35%.Therefore, it can be concluded that there is no clear relationship between the performance in discharge, in terms of KGE, and the agreement between WWH and EO estimates for these three analysed variables (PET, SM and ΔS).
However, the results highlight a clear relationship between good performances in modelled discharge and agreement in AET, FSC and SWE.Higher agreement in KGE, CC and RESD in terms of AET is associated with high performances in discharge.Only small mismatches are observed for the median RE, which is within 10% for all four groups.These results indicate that the calibration of the parameters controlling AET in WWH succeeds despite the observed disagreement in PET between EO product and model.
The same relationship is found for FSC, where a higher agreement corresponds to higher KGE values in discharges.Beside snow extent, snow amount seems to follow the same pattern, showing a relation between performances in river discharges and agreement in SWE.A higher agreement in SWE is found when model performance in discharge improves (red box plots for KGE show a median KGE in SWE about −0.05, while blue box plots show a median KGE value in SWE of about 0.2).This could be explained by the location of the catchments, with good-KGE performance in areas with low uncertainties in the SWE provided by the EO products, while catchments with poor performance are located in areas with more ephemeral snowpacks (see above).
To summarize, the results show that AET and the snow variables FSC and SWE are robust variables in the hydrological predictions that improve discharge performance at the global scale, and thus contribute to WWH being right for the right reasons.PET and ΔS, on the other hand, did not influence the model performance in water discharge, while SM may be right for the wrong reasons (but the comparison was biased by soil depths).

Conclusions
This work highlights the value of comparing hydrological models at the global scale with various EO products to check the robustness of internal model variables in capturing longterm annual means and seasonal (monthly) patterns to indicate where and when the model is more likely to be "right" for the right reasons.
We propose that hydrological scientists working with scenarios at the global scale should always analyse the modelbased robustness to justify their results, either in ensemble modelling of calibrated models (e.g.Krysanova et al. 2017Krysanova et al. , 2020) ) or -as in this study -by multi-variable comparison with independent data sources.Preferably, more data sources should be used than in our example; and, ideally, an intergovernmental panel should be established for global water assessments to support policymakers by compiling results from numerous researchers.Only if several sources of data agree should the result be regarded as well supported and to provide credibility when applying GHM results in policy-oriented research on global challenges (e.g.Hoekstra et al. 2012, De Graaf et al. 2019, Rockström et al. 2023).Robustness analysis will also show model limitations and guide model developers regarding where to concentrate their efforts.For instance, the results imply that the current version of WWH should not be applied for analysing deep-groundwater quantities and that results for soil moisture should be treated with caution.
We found that the general patterns of four out of six analysed variables in the hydrological cycle -PET, AET, FSC, and SWE -show similarities between WWH catchment modelling and EO products at the global scale, while a larger discrepancy was found for SM, and very poor agreement for ΔS.Dissimilarities in results indicate that hydrological variables above the ground and earlier in the flow path, such as evaporation and snow, are more robust than the variables describing subsurface downstream processes, such as SM distribution and ΔS.The latter reflects complex processes that can be challenging both to describe with hydrological models and to observe via satellites.The results support previous indications that the WWH model underestimates water in aquifers and thus needs improved representation of groundwater storage.This limits its application in climate change impact studies of future water resources and scenarios for pumping of groundwater reservoirs.
The regional analysis shows that there is little agreement between WWH and EO products in regions with extreme characteristics, such as cold regions (Canadian prairies), arid regions (western USA, deserts), highly forested areas (Amazonas), and transition zones (Sahel and Mediterranean Basin).This indicates that the particularity of these regions calls for specific regional modelling and monitoring approaches rather than continental or global approaches.On the contrary, in some of the temperate regions at mid-latitudes, e.g.eastern USA and central Europe, the long-term mean of the analysed variables (i.e.PET, AET, FSC and SM) showed a reasonable agreement, which means that global approaches can be applied for long-term analysis.For temporal patterns, monthly values of snow variables did not show agreement, which supports previous doubts on quality of the EO products in some snowy regions.We also showed a large discrepancy in terms of SM, indicating that a thinner first layer of soil in the WWH model set-up is probably needed for a higher agreement with EO products.
With respect to equifinality in dynamic model parameters, it is interesting to note that good agreement of AET and snow variables in specific catchments was also linked to good model performance in river discharge at their outlets.This suggests that these processes are comparatively well described in the WWH model.There were no clear indications of good discharge performance associated with bad internal model representation.
Overall, we found the robustness analysis useful, and we recommend the presented approach using EO products combined with in situ river gauges as a standard method in hydrological modelling at the continental or global scale.This approach goes beyond the more traditional evaluation of model performance against river discharges without considering the additional variables from EO products as a ground truth.Hence, the model-based robustness analysis takes advantage of innovative technology and new data without including more uncertainty.This study thus advances UPH No. 16 (Blöschl et al. 2019), by showing the potential and challenge of innovative technologies to measure global heterogeneity of surface and subsurface properties of water state variables and fluxes.This method furthers the identification of caveats in understanding and reproducing the global hydrological cycle in models, without forgetting that EO products also have uncertainties.

Figure 1 .
Figure 1.Comparison of the long-term annual means for the analysed period (2000-2014) of the six analysed variables (rows): potential evapotranspiration (PET), actual evapotranspiration (AET), fractional snow cover (FSC), snow water equivalent (SWE), soil moisture (SM), and changes in water storage (ΔS) for values simulated by World-wide HYPE (WWH; column 1), values from the selected Earth observation (EO) products (column 2) and relative difference (column 3).The relative difference uses the EO product as the baseline.

Figure 2 .
Figure2.Agreement in monthly dynamics for the study period, 2000-2014, in terms of Kling-Gupta efficiency (KGE) and each of its components: Pearson correlation coefficient (CC), relative error (RE) and relative error of standard deviation (RESD) (columns) for each of the six selected hydrologic variables (rows): potential evapotranspiration (PET), actual evapotranspiration (AET), fractional snow cover (FSC), snow water equivalent (SWE), soil moisture (SM), and changes in water storage (ΔS).High agreement is represented in blue, while low agreement is represented in red (column 1 and 2) yellow and green (column 3 and 4).

Figure 3 .
Figure 3. Monthly long-term mean of changes in water storages (ΔS) obtained from World-wide HYPE (WWH) (columns 1 and 3) and Earth observation (EO) products from Gravity Recovery and Climate Experiment (GRACE) (columns 2 and 4) for the period 2000-2014.

Figure 4 .
Figure 4. (a) Map of World-wide HYPE (WWH) model performance in terms of river discharges assessed in 4367 catchments with available river discharge observations.These catchments are divided into four groups using the quantiles of the Kling-Gupta efficiency (KGE) distribution for WWH model performance.(b) Agreement of WWH and Earth observation (EO) products in terms of internal hydrologic variables for the period 2000-2014: potential evapotranspiration (PET), actual evapotranspiration (AET), fractional snow cover (FSC), snow water equivalent (SWE), soil moisture (SM), and changes in water storage (ΔS) (y-axis) for each category of WWH flow performance with respect to in situ observations (x-axis).Each column represents a different metric: KGE, Pearson correlation coefficient (CC), relative error (RE) and relative error of standard deviation (RESD).Optimum values are represented by the dashed green line and good performance interval by green area.Note: the 4367 river discharge observation stations were not used in the WWH calibration process.