Evaluating the surface energy partitioning in ERA5

Climate reanalyses provide a plethora of global atmospheric and surface parameters in a consistent manner over multidecadal time scales. Hence, they are widely-used in many fields, and an in-depth evaluation of the different variables provided by reanalyses is a necessary means to provide feedback on the quality to their users and the operational centers producing these data sets, and to help guiding their development. Recently, the European Centre for Medium Range Weather Forecast 5 (ECMWF) released the new state-of-the-art climate reanalysis ERA5, following up on its popular predecessor ERA-Interim. Different sets of variables from ERA5 were already evaluated in a handful of studies, but so far, the quality of surface energy partitioning has not been assessed yet. Here, we evaluate the surface energy partitioning over land in ERA5, and concentrate on the appraisal of the surface latent heat flux, surface sensible heat flux, and Bowen ratio against different reference data sets and using different modelling tools. Most of our analyses point towards a better quality of surface energy partitioning in ERA5 10 than in ERA-Interim, which may be attributed to a better representation of land-surface processes in ERA5, and certainly to the better quality of near-surface meteorological variables. One of the key shortcomings of the reanalyses identified in our study is the overestimation of the surface latent heat flux, which – although substantially lower than in ERA-Interim – still remains in ERA5. Overall, our results indicate the high quality of the surface turbulent fluxes from ERA5 and the general improvement upon ERA-Interim, thereby endorsing the efforts of ECMWF to improve their climate reanalysis and to provide useful data to 15 many scientific and operational fields.

Small-scale MOdeling (COSMO) REAnalysis version 6 (COSMO-REA6) from the German weather service, perform better than ERA5. Also surface wind fields have been shown to be accurately represented in ERA5 (Olauson, 2018), mainly as a result of the relatively high spatial resolution. Other studies have focused on the validation of vertical profiles of atmospheric properties such as humidity and temperature, typically revealing that the representation of these fields is better in ERA5 than in various other data sets, including its predecessor ERA-I (e.g. Brunamonti et al., 2019;Graham et al., 2019;Zhang and 5 Cai, 2019). Finally, Albergel et al. (2018) compared the quality of ERA-I and ERA5 by forcing the Interactions between Soil, Biosphere, and Atmosphere (ISBA) land-surface model with meteorological parameters derived from both reanalyses and comparing the simulated land-surface parameters from ISBA to independent data from satellite observations and in situ measurements. Based on their study, Albergel et al. (2018) concluded that forcing the model with ERA5 surface meteorology yielded consistently better estimates of hydrological states and fluxes. 10 Despite the importance of an accurate representation of the processes involved in the surface energy partitioning, at present and to the authors best knowledge, no study has directly evaluated the partitioning of energy in ERA5 into the two major surface turbulent fluxes (i.e. the surface sensible and latent heat fluxes). As surface energy partitioning acts as a nexus between the land surface and atmosphere, such an analysis might provide useful insights to further improve the modelling of this coupled system, and to advance the quality of future reanalyses. Therefore, the objective of this study is to evaluate the surface turbulent fluxes 15 (and their ratio; i.e. the Bowen ratio) from ERA5 for the period 1983-2014 at different spatio-temporal resolutions. Several experiments are conducted using various observational data sets and modelling tools to evaluate the spatial and temporal variability of these variables at different scales, ranging from point to catchment-scale and sub-daily to yearly scales. The paper is organised as follows: in Sect. 2 we describe the experimental set-up and the data sets used in this study, and provide a brief overview of the key differences between ERA-I and ERA5. In Sect. 3 we describe the results of our experiments and 20 discuss the quality of surface energy partitioning in both reanalyses; concluding remarks are summarised in Sect. 4.
2 Methods and data 2.1 ERA5 and ERA-I data ERA5 is the latest state-of-the-art reanalysis produced at ECMWF (Hersbach et al., 2019), replacing the widely-used ERA-I (Dee et al., 2011). A first segment of the data set, covering the period 2010-2016, was released early 2017, about a decade 25 after the successful release of ERA-I. Compared to ERA-I, which uses IFS cycle 31r1, ERA5 is produced using an improved version of ECMWF's modelling and data assimilation system (IFS cycle 41r2) and ingests information from a substantially larger volume of improved observations, resulting in a high-quality reanalysis of global atmospheric, oceanic, and land-surface fields at hourly time steps, 137 vertical pressure levels, and horizontal resolution of approximately 31 km. Several advancements upon ERA-I are expected to affect the surface energy partitioning in ERA5 (Hersbach et al., 2019), including (1) a better forcing 30 of solar irradiance, greenhouse gases, and stratospheric sulphate aerosols, which affect the available energy at the surface, (2) a substantially higher spatial resolution, allowing for a more realistic representation of surface-atmosphere interactions in complex terrain such as mountainous or coastal regions, and (3) a better land-surface model, namely the Hydrology Tiled ECMWF Scheme for Surface Exchanges over Land (H-TESSEL), which has a demonstrated high skill to simulate surface turbulent heat fluxes in offline experiments (Balsamo et al., 2015;Albergel et al., 2012;Balsamo et al., 2008).
Here, the surface sensible heat flux, surface latent heat flux, and Bowen ratio derived from both ERA5 and ERA-I are evaluated globally for the period 1983-2018 (i.e. the period for which reference data are available; see Sects. 2.2 and 2.3) and the global domain. Next to the turbulent fluxes and the Bowen ratio, precipitation, 2-meter air temperature, and surface 5 radiation components (from which the surface net radiation is calculated) are processed. These variables are used to disentangle the role of the improved atmospheric data versus the more evolved land-surface model in ERA5.

FLUXNET 2015
In situ eddy-covariance data for the turbulent fluxes are obtained from the FLUXNET 2015 synthesis data set covering the period 1991-2014. The data set is processed as by Martens et al. (2017), including (1) masking of rainy intervals at hourly 10 time steps to remove unreliable measurements due to wet sensors, (2) removing gap-filled data records, and (3) aggregating to both 3-hourly and daily temporal resolutions. Note that for the temporal aggregation, 20% of the higher resolution data within the interval are allowed to be missing. Aiming at the calculation of robust validation statistics, only sites with at least 365 daily records (i.e. at least one full year of data) after masking are retained, resulting in a validation sample of 143 quality-checked eddy-covariance sites (Fig. 1). Note that the same set of towers is used in the sub-daily (i.e. 3-hourly) and daily evaluations of 15 the turbulent fluxes, making the validation metrics between experiments inter-comparable. As the temporal variability of the turbulent fluxes is strongly influenced by the seasonal cycle of its main drivers, the performance of the land-surface scheme in response to anomalous weather conditions (i.e. with respect to the seasonal cycle) might be masked when raw time series are analysed. As such, most of the analyses in this study are based on standardised anomalies to better evaluate the skill of ERA5 in capturing the effect of specific meteorological conditions on the surface energy partitioning. To calculate the climatology, 20 only FLUXNET sites with a minimum record length of five years are considered, resulting in 77 eddy-covariance towers when anomalies are used for validation ( Fig. 1). Note that about 50% of these selected sites have a record length of more than 10 years, with a maximum of 21 years.
The Bowen ratio at each eddy-covariance site is calculated at daily temporal resolution as the ratio of the surface sensible heat flux and the surface latent heat flux. The Bowen ratio might be highly unstable when the turbulent fluxes are small compared 25 to the measurement error of the eddy-covariance system, even at the daily temporal resolution. Therefore, outliers in the in situ time series of the Bowen ratio are masked by removing records outside the following window: [q 25 − 1.5(q 75 − q 25 ); q 75 + 1.5(q 75 − q 25 )], where q 75 and q 25 are the 75% and 25% quantiles of the Bowen ratio time series, respectively . Next to the turbulent fluxes, measurements of surface net radiation, near-surface air temperature, and precipitation at the eddy-covariance sites are processed the same way, except for the masking of rainy intervals. These variables are typically 30 not recorded at each eddy-covariance site and are therefore only available at 83 sites in total.
For each eddy-covariance site in the validation data set, the variables from the corresponding ERA5 and ERA-I grid cell are extracted at their native spatial resolution and both as 3-hourly and daily (temporal) aggregates. The Bowen ratio from both reanalyses is calculated and processed in a similar manner as for the in situ data; i.e. only at the daily temporal scale and outliers 4 https://doi.org/10.5194/gmd-2019-315 Preprint. Discussion started: 21 January 2020 c Author(s) 2020. CC BY 4.0 License. are removed. Eddy-covariance sites located within the same grid cell of the reanalysis are treated separately in the validation to avoid potential problems resulting from merging sensors with different absolute values and gaps in their record (Martens et al., 2017).

Catchment water and energy-balance data
If changes in water storage are neglected, the catchment-scale latent heat flux can be calculated as precipitation minus river 5 discharge; both averaged over a sufficiently long time period Liu et al., 2014;Wang and Dickinson, 2012;Miralles et al., 2011;Vinukollu et al., 2011). By taking into account the latent heat of vaporisation and the density of water: where λ is the latent heat of vaporisation of water (assumed to be constant; 2260·10 3 J kg −1 ), ρ is the density of liquid water (assumed to be constant; 1000 kg m −3 ), E is terrestrial evaporation (m s −1 ), λρE is the surface latent heat flux (W m −2 ), P is 10 precipitation rate (m s −1 ), and Q is the river discharge (m s −1 ). The assumption that changes in catchment water storage can be ignored requires the consideration of a sufficiently long period compared to the concentration time of the catchment; as in Miralles et al. (2016), a yearly aggregation period will be considered here.
A similar reasoning as for the catchment mass balance can be made in terms of energy balance: when changes in energy storage can be neglected, the energy balance at the catchment implies that the surface sensible heat flux can be calculated as 15 the difference between surface net radiation and the sum of ground and latent heat fluxes: where H is the surface sensible heat flux (W m −2 ), R n is the surface net radiation (W m −2 ), and G is the ground heat flux (W m −2 ). Combining Eqs. 1 and 2 thus provides a means to evaluate the long-term average catchment-scale Bowen ratio, derived from surface net radiation, ground heat flux, precipitation, and river discharge as: where β (-) is the Bowen ratio.
In this study, Eqs 1-3 are used in combination with an observational data set of river discharge covering the period 1983-2014 to derive an annual benchmarking data set of turbulent fluxes and Bowen ratio at the catchment scale. Discharge measurements are obtained from the Global Runoff Data Centre (GRDC), providing data for nearly 4000 catchments with a daily or 25 monthly temporal resolution. As in Miralles et al. (2016), records with data artifacts are first removed based on an exhaustive visual screening, and only catchments with an area larger than 2500 km 2 are considered. In addition, only catchments with a gridded area (on a regular 0.25 • latitude-longitude grid) deviating less than 20% from the area reported by GRDC are retained.
If measurements are recorded at multiple locations and thus for different drainage areas (particularly in Central Europe), measurements further downstream are favoured. By doing so, catchments are selected without any spatial overlap (due to possible sub-catchments measured upstream). After this initial filtering, data available at the daily scale are first aggregated to monthly values, given that at least 25 days per month are present. To reduce the impact of (e.g.) human disturbances such as large-scale groundwater pumping or regulations of river flow, non-overlapping moving windows containing monthly data of 15 years are 5 calculated. Any catchments for which the average of a window is exceeded more than three standard deviations by the mean of the subsequent window are discarded, to remove catchments where obvious disturbances occur during the study period.
Finally, monthly averages are aggregated to annual averages, conditioning on at least 10 months per year being present.
Surface net radiation and precipitation to derive catchment-scale validation data for the turbulent fluxes and the Bowen ratio

15
The Global Land Evaporation Amsterdam Model (GLEAM) is a process-based semi-empirical model designed to estimate terrestrial evaporation and its separate components at the global scale from satellite observations alone (Miralles et al., 2011).
In summary, GLEAM first calculates potential evaporation using the Priestley and Taylor equation (Priestley and Taylor, 1972) for four land cover fractions per grid cell: (1) low vegetation, (2) tall vegetation, (3) bare soil, and (4) open water. Estimates of potential transpiration (for the first two fractions) are converted into actual transpiration by applying an empirical multiplicative 20 stress factor. The latter is calculated as a function of vegetation optical depth -which is used as a proxy for vegetation water content (Liu et al., 2013(Liu et al., , 2011 -and root-zone soil moisture. The root-zone soil moisture in GLEAM is calculated using a multi-layer soil water balance model driven by precipitation, and is further optimised using a Newtonian Nudging data assimilation scheme (Martens et al., 2017. For the bare soil fraction, the evaporative stress factor is calculated based on surface soil moisture alone, while for the open-water fraction, no evaporative stress is considered (i.e. actual equals potential 25 evaporation). Finally, for grid cells covered by snow, sublimation is calculated using the Priestley and Taylor equation with a specific set of parameters (Murphy and Koop, 2005). The fraction of precipitation intercepted by the vegetated surface and directly evaporated back into the atmosphere (i.e. rainfall interception loss) is only calculated for the fraction of tall vegetation.
For this purpose, the implementation of Gash's analytical model of rainfall interception (Gash, 1979) by Valente et al. (1997) is used. Ultimately, the total evaporative flux is then calculated by summing the fluxes calculated for the four cover fractions.

30
For a detailed description of GLEAM, we refer the readers to Martens et al. (2017Martens et al. ( , 2016 and Miralles et al. (2011Miralles et al. ( , 2010. Here, GLEAM is used as a tool to assess quality differences in some key meteorological drivers of the turbulent fluxes, derived from ERA5 and ERA-I, and to explore the skill of the land-surface model implemented in ERA5 (H-TESSEL) to accurately model the control of the land surface on the turbulent heat fluxes. To do so, GLEAM is forced by an up-to-date version of the GLEAM v3a forcing data base described in Martens et al. (2017), which uses near-surface air temperature and surface net radiation from ERA-I (hereafter referred to as GLEAM+ERA-I). Next, GLEAM is also forced using the same data set, but with near-surface air temperature and surface net radiation from ERA5 (hereafter referred to as GLEAM+ERA5).
Although GLEAM has been designed to target the accurate estimation of terrestrial evaporation (or surface latent heat flux), here, the surface sensible heat flux, calculated as the residual of the energy balance ignoring changes in energy storage (Eq. 2), 5 and the Bowen ratio from GLEAM are also evaluated. GLEAM is run for the period 1989-2015 -where 1989 is used as a spin-up year (Martens et al., 2017) -and the output from both experiments is evaluated against the eddy-covariance data (Sect. 2.2) at the daily temporal scale.

CLASS4GL boundary layer model
The Chemistry Land-surface Atmosphere Soil Slab (CLASS) model for GLobal studies (CLASS4GL; http://class4gl.eu) is 10 a free software tool designed to investigate the dynamics of the ABL and its sensitivity to different land and atmospheric conditions using data from weather balloons (Wouters et al., 2019). The core of CLASS4GL is the ABL model CLASS, which is coupled to a soil-vegetation module allowing the simulation of the diurnal evolution of the ABL with a temporal resolution of 60 seconds. The platform is able to mine appropriate observational data from global radio soundings, satellite data, and reanalysis data from the last 40 years to constrain and initialise the ABL model. Its interactive interface automatises 15 multiple simulations of the ABL in parallel and allows to perform global perturbation experiments. It aims at fostering a better understanding of land-atmosphere feedbacks and to disentangle the drivers of (extreme) weather conditions globally.
Here, CLASS4GL is used as a tool to assess whether the surface energy partitioning in ERA5 has been improved upon ERA-I. Therefore, CLASS4GL is forced with the turbulent fluxes derived from both ERA5 and ERA-I to simulate diurnal tendencies of potential temperature, humidity, and mixed-layer height. The latter are evaluated against direct observations from balloon 20 soundings from across the globe, representative of different environmental and climate conditions (Wouters et al., 2019). These soundings are sourced from the Integrated Global Radiosonde Archive (IGRA; Durre et al. (2006)) and are screened for the observation time and quality. A detailed description of this data set, together with a description of the processing and quality checks can be found in Wouters et al. (2019). The data set used in this study consists of 18,000 quality-checked morningafternoon sounding pairs from 121 locations across the globe from 1981 to 2018. 25 As described by Wouters et al. (2019), the evaporative fraction derived from reanalysis data (either ERA-I or ERA5) is used to guide the simulations of the ABL diurnal evolution, and the resulting afternoon profiles of humidity, potential temperature, and ABL height. The simulations of CLASS4GL are subsequently validated against afternoon profiles from balloon soundings.
Alltogether, this experiment assesses whether the surface energy partitioning in ERA5 may yield a more realistic diurnal evolution of the ABL in comparison to the surface energy partitioning in ERA-I. Bowen ratio (daily resolution), for ERA5 (green) and ERA-I (yellow), respectively. As shown, statistics are consistently better for ERA5 than for ERA-I, with typically higher R and lower MAD against in situ measurements, even though the bias remains relatively similar. This indicates that ERA5 is better capturing the temporal dynamics in surface energy partitioning, both at 10 sub-daily and daily temporal resolutions. Especially for the daily-aggregated surface sensible heat flux, a clear improvement can be seen, with the median R of ERA5 across all reference sites approaching the 75% percentile of the ERA-I distribution.
For the surface latent heat flux and Bowen ratio, improvements are less remarkable, but still consistent. Notably, for both ERA-I and ERA5, validation statistics are generally better for sensible than for latent heat fluxes (see higher median R and lower MAD for sensible heat fluxes, irrespective of data set and temporal aggregation). Albeit the differences in pre-processing with an overall higher quality (green color) in the sensible and latent heat fluxes from ERA5. However, it can be argued that there is a tendency of ERA-I to perform better than ERA5 in warm and dry regimes, especially for the latent heat flux and 25 Bowen ratio. These climates are, nonetheless, only sampled by three eddy-covariance towers and thus results may not be generalised. In addition, conclusions based on the performance in certain climate regimes should be interpreted with care, as FLUXNET sites are not uniformly distributed: mild climates are generally over-represented and most sites are located in Europe and the Continental United States (CONUS), as shown in Fig. 1 and described in Baldocchi et al. (2001).
Presumably, much of the improvement in surface energy partitioning in ERA5 over ERA-I can be attributed to a better 30 representation of land-surface processes in the more advanced H-TESSEL land-surface model. The better performance of H-TESSEL -in reference to TESSEL, its antecessor used in ERA-I -was already illustrated by Balsamo et al. (2015), who compared the quality of different land-surface variables from ERA-I and ERA-I/Land over the Northern Hemisphere. ERA-I/Land is in essence an offline simulation of H-TESSEL forced with atmospheric data derived from ERA-I. Although quality differences between ERA-I and ERA-I/Land cannot only be attributed to the land-surface scheme but also to the different model set-up (i.e. online vs. offline simulation), Balsamo et al. (2015) argued that most of the improvement was due to the land-surface model. As H-TESSEL is now also implemented in ERA5, analogous improvements can thus be expected in ERA5 over ERA-I regarding the simulation of land-surface variables.

5
Despite the fact that several studies have shown the high performance of H-TESSEL as compared to TESSEL for simulating a variety of land-surface parameters (e.g. Balsamo et al., 2015;Albergel et al., 2012), Balsamo et al. (2015) showed that improvements in the turbulent fluxes of ERA-I/Land over ERA-I could not be uniquely linked to the different land-surface scheme. Hence, the better quality of surface energy partitioning in ERA5 is, most likely, not only owed to an improved parameterisation of the land surface, but also a better quality of the atmospheric drivers, simulated by the coupled atmospheric 10 model, which is constrained by a 4D-variational data assimilation of a large number of quality-controlled observations (Hersbach et al., 2019). The better quality of some key meteorological parameters is confirmed by the results presented in Fig. A3, which shows violin plots of the validation statistics for surface net radiation, 2-meter air temperature, and precipitation at the FLUXNET sites, for 3-hourly and daily temporal resolutions, respectively. Although statistics from ERA5 are better at both temporal resolutions, especially the sub-daily variability of all three variables has been substantially improved over ERA-I, 15 which may largely be the result of a better modelling of cloud properties in ERA5 (Hersbach et al., 2019).
Finally, as described in Sect. 2.1, one of the key improvements in ERA5 upon its predecessor is the higher spatial resolution at which atmospheric and land processes are resolved. However, Fig. A2 shows that when ERA5 is linearly re-sampled to the spatial resolution of ERA-I, statistics calculated against eddy-covariance measurements only change marginally. Nevertheless, such an analysis only gives a crude idea of the impact of the spatial resolution as (1) due to non-linear processes and feedback 20 mechanisms, a simple re-sampling of the model output does not properly represent the effect of the high-resolution numerical modelling; (2) the effect is expected to be the highest in complex terrain such as mountainous regions, coastal areas, or highlyheterogeneous landscapes, which are under-represented in the FLUXNET data base; (3) representativity errors -resulting from the relatively small footprint of eddy-covariance towers as compared to model grid cells -remain considerable at the spatial resolution of ERA5. and GLEAM+ERA-I can be found in mild climates as indicated in Fig. 5a, there is no clear tendency of GLEAM+ERA5 to perform better under specific climatic conditions. The surface sensible heat flux and Bowen ratio from GLEAM+ERA5, on the other hand, tend to degrade in quality (compared to GLEAM+ERA-I) when the climate gets drier and colder. It should be 10 emphasised here again that GLEAM has been specifically designed to estimate the latent heat flux, thus the surface sensible heat flux -calculated here as the residual from the energy balance -has not been subject to equally extensive validations than its latent counterpart, and is prone to be more uncertain.
The turbulent fluxes and Bowen ratio from GLEAM+ERA5 can also be directly compared to ERA5 to provide a crude evaluation of the skill of H-TESSEL as compared to the simpler land-surface scheme in GLEAM. Figure 4b shows that ERA5 15 is better capturing the temporal dynamics of the anomalies, generally resulting in lower MAD and higher R for all variables.
Only in terms of the bias, ERA5 is overall performing worse than GLEAM+ERA5, especially for the surface latent heat flux, which is consistently overestimated in ERA5 for almost all in situ sites (close to 75% of the sites have a positive bias, Fig. 4b).
This results in a median MD of 9 W m −2 compared to the slight underestimation of -2 W m −2 for GLEAM+ERA5 at daily time scales. The positive bias in the surface latent heat flux from ERA5 is very similar to the one from ERA-I, with a median 20 MD of 10 W m −2 across all in situ sites at daily resolutions (Fig. 2). The tendency to overestimate the latent heat flux in ERA-I has been previously reported in different studies (Michel et al., 2016;Miralles et al., 2016;Balsamo et al., 2015;Decker et al., 2012), and important changes in the IFS have thus not been able to mitigate this bias in ERA5. Given the interaction between the coupled atmospheric and land-surface model in the reanalysis, the consistent positive bias in the surface latent heat flux is potentially affected by both components of the modelling framework. Although it is hard to identify the exact 25 cause of this bias, it might be induced by the overestimation of the number of wet days typically found in reanalysis data sets (Beck et al., 2019), combined with precipitation rates that are often underestimated (Beck et al., 2019), and vegetation density that might be overestimated (Král, 2011). This presumably results in an overestimation of the interception loss (Král, 2011), an important component of the total latent heat flux in densely-vegetated regions (Martens et al., 2017;Miralles et al., 2010). Note that this hypothesis is partially supported by our analysis: despite the fact that a positive bias can be found virtually 30 everywhere, the strongest biases are typically found in densely-vegetated sites (not shown). We should emphasis here, however, that biases calculated against eddy-covariance measurements have to be interpreted with care, given the representativity errors, and provided that turbulent heat fluxes are thought to be generally underestimated by the eddy-covariance technique, especially in case of the surface latent heat flux (Beer et al., 2010). When the seasonal cycle is not removed prior to the evaluation ( Fig. A4b), GLEAM+ERA5 seems to perform equally good or slightly better than ERA5, indicating that GLEAM+ERA5 is 35 10 https://doi.org/10.5194/gmd-2019-315 Preprint. Discussion started: 21 January 2020 c Author(s) 2020. CC BY 4.0 License. marginally better than ERA5 at capturing the seasonal dynamics (Fig. A4b), but worse at capturing the response of surface energy partitioning to short-term anomalies in meteorological conditions (Fig. 4b).
Nonetheless, Fig. 5b shows that for the surface latent heat flux, the better performance of ERA5 over GLEAM+ERA5 is mainly due to its better statistics in relatively wet or cold climatic regimes. In drier regimes and, especially warm regions (mainly located along the west coast of the CONUS and few eddy-covariance sites in Australia; Fig. 1), GLEAM+ERA5 seems 5 to better capture the anomalies of the surface latent heat flux, which might indicate that H-TESSEL has room to improve the response to water stress. For the Bowen ratio, similar conclusions may be drawn, even though the quality of the sensible heat flux in ERA5 is consistently better than in GLEAM+ERA5.

Evaluation using catchment energy-balance data
As described in Sect. 2.3, observations of river discharge may be combined with precipitation, net radiation, and ground ERA-I generally underestimates the flux, ERA5 overestimates it in about 70% of the catchments. In addition, the absolute bias of the surface sensible heat flux from ERA5 is higher than in ERA-I in 55% of the catchments. However, this potential overestimation is not confirmed by the in situ validation presented in Sect. 3.1.1 (Fig. 2), where the surface sensible heat flux from both reanalyses appeared nearly unbiased. Finally, for the Bowen ratio, estimates of ERA5 appear better in about 60% of the catchments, arguably reflecting the improvement in the surface latent heat flux. Finally, it should be emphasised here that the quality of the catchment-scale sensible heat flux (and Bowen ratio) estimates used as reference is potentially lower than that of the surface latent heat flux, as (1) the assumption that the ground heat flux is a fixed fraction of the surface net radiation only affects the estimates of the sensible heat (Eq. 2), and (2) the estimates of 5 sensible heat flux depend on the estimates of surface latent heat (Eq. 2), resulting in a propagation of errors which is difficult to assess. Hence, the catchment-scale evaluation of the surface sensible heat flux should be more carefully interpreted. Figure 10 shows the validation of the estimated afternoon ABL properties from CLASS4GL forced with the surface energy partitioning from ERA-I (on the one hand) and ERA5 (on the other hand). The validation is performed by comparison against 10 a global archive of balloon soundings (Sect. 2.5). Results are shown for the diurnal temporal change (tendency) of potential temperature (dθ/dt), humidity (dq/dt), and mixed-layer height (dh/dt). The overall performance at reproducing the diurnal ABL tendencies is improved when CLASS4GL is forced with ERA5 instead of ERA-I. This is the case for all statistical scores being considered and for each ABL variable being analysed. In addition, this is also the case in most Köppen-Geiger climate classes, which suggests that the higher performance is consistent across climate regimes. The largest improvement in simulated Similar values typically found in literature -although based on different land-surface models or retrieval algorithms, input data 5 sets, or region considered (e.g. areas permanently covered by snow or ice included or not) -range between 55·10 3 km 3 and in line with results previously reported in literature (e.g. Miralles et al., 2016;Wild et al., 2015;Mueller et al., 2013;Jiménez et al., 2011, and references therein) where similar biases were found for ERA-I. Figure 11 shows that the lower globally-averaged surface latent heat flux from ERA5 mainly results from reduced values 15 along the east coast of the CONUS, the south of Europe, the Sahel, India, and large parts of South America. These regions align well with the areas identified in Miralles et al. (2016) where ERA-I seemed to strongly overestimate the surface latent heat flux, and thus point to a better performance of ERA5 in these specific regions, although positive biases still prevail (Fig. 7).

Evaluation using CLASS4GL
The surface latent heat flux from ERA5 is higher than in ERA-I in only a few areas, such as the central CONUS, eastern Australia, and eastern Europe. For the surface sensible heat flux, differences between ERA5 and ERA-I are clearly defined, 20 with substantially higher values in the equatorial forests and lower values in (semi-)arid regions in the case of ERA5.

Conclusions
This study evaluated the surface energy partitioning over land in ECMWF's latest reanalysis ERA5 by assessing the quality of the surface latent heat flux, surface sensible heat flux, and Bowen ratio at different spatio-temporal scales and using different validation approaches. Results were also compared with the predecessor ERA-I for reference. Different in situ validation data 25 sets -including eddy-covariance, river discharge, and balloon sounding data -were used to validate the reanalysis fields, and GLEAM and CLASS4GL were adopted as modelling tools to evaluate the surface energy partitioning in both reanalyses.
In a first experiment, the turbulent fluxes and the Bowen ratio from the reanalyses were directly compared against eddycovariance measurements from the FLUXNET 2015 data set. The analysis revealed that ERA5 performed consistently better than ERA-I for all variables analysed, both at daily and sub-daily temporal resolutions, resulting in lower MAD and higher 30 R against in situ data. The differences were most clear when anomaly time series were analysed, indicating that -although statistics also improved in case of the raw time series -ERA5 is substantially better capturing the response of surface energy partitioning to specific meteorological events. As one of the key changes in ERA5 is the use of the state-of-the-art H-TESSEL 13 https://doi.org/10.5194/gmd-2019-315 Preprint. Discussion started: 21 January 2020 c Author(s) 2020. CC BY 4.0 License. land-surface model, an important part of the improvements may be attributed to the improved land parameterisation. However, a validation of some key meteorological variables against in situ measurements also showed better quality of these parameters from ERA5 than from ERA-I. These results were largely confirmed by an experiment where GLEAM was forced with meteorological fields retrieved from both reanalyses, showing a higher quality of the output based on ERA5 forcing data. Finally, although ERA5 did not seem to perform particularly better than ERA-I in specific climates, it was shown that GLEAM forced 5 with ERA5 meteorology performed better than ERA5 in terms of estimating the surface latent heat flux in warm and dry regimes, indicating possible shortcomings in the land-surface scheme to capture the response of surface energy partitioning to heat and drought stress in ERA5.
In a second experiment, catchment-scale turbulent fluxes derived using discharge, precipitation, net radiation, and ground heat flux data were used to verify the bias in the annual turbulent fluxes from ERA-I and ERA5. Here, a substantial overesti-10 mation of the surface latent heat flux from ERA-I became evident. On the other hand, the surface sensible heat flux appeared generally underestimated. While the biases in ERA5 for the surface latent heat flux were found to be lower -a strong reduction was found along the east coast of the CONUS and in the south of Europe -a general tendency to overestimate the latent heat flux still remains in ERA5. In case of the surface sensible heat flux on the other hand, the sign of the bias reversed (i.e. in ERA5 the flux tends to be overestimated) and increased in absolute value.

15
A better quality of the surface energy partitioning in ERA5 was also confirmed by an experiment where CLASS4GL was forced with the evaporative fraction from ERA-I and ERA5. Simulations of the diurnal evolution of the ABL were validated against a global archive of balloon soundings. CLASS4GL forced with ERA5 showed an overall better skill for simulating the diurnal boundary layer dynamics than when forced with ERA-I. Especially in reproducing the tendencies of specific humidity, CLASS4GL seemed to strongly benefit from the seemingly better surface energy partitioning in ERA5, resulting in a substan-20 tially lower bias. The latter could be attributed to the lower bias in the surface latent heat flux in ERA5 than in ERA-I. Since In summary, based on the validation data and tools used in this study, the quality of the turbulent fluxes (and near-surface meteorology) from ERA5 shows a higher accuracy upon than its predecessor. Although biases (especially in the surface latent