1 Introduction

Snow plays a substantial role in the climate system (Vavrus 2007; Thackeray et al. 2019). It affects soil, ground, and land surface thermodynamic properties, land–atmosphere coupling, the surface radiation budget, and the hydrological cycle (Xu and Dirmeyer 2013; Henderson et al. 2018; You et al. 2020b). Besides its direct impact on local meteorological conditions, such as decreased air temperatures, terrestrial snow cover can also influence atmospheric phenomena at larger spatial and longer temporal scales; e.g., snow cover on the ground and its variations have been linked to general circulation anomalies, Rossby wave altering, sudden stratospheric warming events, and the East Asian summer monsoon (Saito and Cohen 2003; Orsolini and Kvamstø 2009; Xiao and Duan 2016; Lü et al. 2020). A widespread decrease in mean snow depth has been observed over Europe in recent decades in accordance with global warming (Fontrodona Bach et al. 2018). The reduction of snow cover is projected to continue throughout the twenty-first century, as the fraction of liquid precipitation and snowmelt increase (Räisänen and Eklund 2012; de Vries et al. 2014; Mudryk et al. 2020). As our understanding of future changes is mainly based on climate model experiments, it is crucial to assess the performance of numerical simulations in terms of snow-related variables.

Earth system models participating in the Coupled Model Intercomparison Project Phase 6 (CMIP6) show acceptable performance in simulating snow extent and snow mass (Mudryk et al. 2020). It has been indicated, however, that kilometer-scale regional climate models (RCMs) might be needed to resolve important local-scale processes affecting snow accumulation, evolution, and ablation, especially over areas with complex topography (Ikeda et al. 2010). Accordingly, numerous studies evaluated the performance of convection-permitting RCMs over mountainous terrain (e.g., Minder et al. 2016; Wang et al. 2018; He et al. 2021). RCMs with horizontal grid spacings of the order of 10 km have also been assessed for snow variables (e.g., Klehmet et al. 2013; Terzago et al. 2017; McCrary et al. 2017; Matiu et al. 2019). These studies suggest that RCMs can capture many aspects of snow cover evolution but also exhibit substantial biases. Uncertainty in snow modeling by RCMs can arise from several factors like forcing data, land cover and vegetation parameters, model configuration, and physical parameterizations (Ikeda et al. 2010; Terzago et al. 2017; Krinner et al. 2018; Yang et al. 2020; Li et al. 2021).

An important factor in the accurate simulation of snow cover evolution is the land-surface model (LSM) (Krinner et al. 2018). The Noah-MP LSM (Niu et al. 2011), coupled with the WRF model, is widely popular and is also utilized in this study. Noah-MP offers several different options for the representation of snow-related processes, such as the partitioning of precipitation into rainfall and snowfall, snow surface albedo, the snow and soil temperature time scheme, and the lower boundary condition of soil temperature. These settings have been shown to influence snow simulations to a varying extent, especially during the melting period (Jiang et al. 2020; You et al. 2020a; Li et al. 2022a). Liu et al. (2017) argued that the partitioning of precipitation into rain and snow has a substantial role in modeled snowpack behavior and suggested diagnosing precipitation type based on the microphysics parameterization of the atmospheric model.

The poor performance of RCMs regarding standard climatological variables has been linked to deficiencies in the representation of snow processes; e.g., WRF configurations participating in the European Coordinated Regional Downscaling Experiment (EURO-CORDEX) tend to underestimate air temperatures significantly over snow-covered surfaces (García-Díez et al. 2015; Katragkou et al. 2015). During our preliminary sensitivity tests conducted with WRF over the Pannonian Basin region, located within Central and Eastern Europe, we also implied that winter and spring cold biases are closely related to excessive snowfall and snow depth in the model (Varga and Breuer 2020). Therefore, the findings of this study could potentially contribute to a better understanding of model errors pointed out in earlier works.

A major challenge in evaluating simulated snow variables is the lack of spatially and temporally extensive, reliable observations (Alonso-González et al. 2018; Mudryk et al. 2020; Li et al. 2022b). Because of shortcomings in both in situ and remotely sensed datasets, and the unavailability of raw snow variables in numerical model output, several studies used snowfall proxies constructed from more common parameters like precipitation and temperature (Frei et al. 2018; Lin and Chen 2022). Reanalysis data have also been extensively utilized in snow research. For example, the most recent products of the European Centre for Medium-Range Weather Forecasts (ECMWF), the ERA5 (Hersbach et al. 2020), and its land version, ERA5-Land (Muñoz-Sabater et al. 2021), have been assessed for snow variables over high-altitude regions (Orsolini et al. 2019; Lei et al. 2022; Li et al. 2022b). Mortimer et al. (2020) concluded that ERA5 performs well compared to other datasets when validated against snow course measurements from Russia, Finland, and Canada. Nevertheless, they encourage using multiple data sources and types when evaluating snow-related parameters.

This study aims to evaluate snow depth (SD) from several observation-based and state-of-the-art reanalysis datasets over the Pannonian Basin region (part of Central and Eastern Europe). In addition, WRF regional climate simulations are also included in the verification. We use in situ SD measurements as a reference for comparison. Such an investigation of observational and model-based snow datasets has not yet been carried out for this region. The results from this study aim to shed light on the usefulness of measurement-based datasets and RCM simulations for snow research over mid-latitude, low-elevation continental regions. Furthermore, most of the previous RCM validation studies focused on mountainous terrain. However, also over flat areas, the significant and extensive cold bias in RCM-simulated surface air temperatures has been linked to discrepancies in simulated snow cover (García-Díez et al. 2015; Varga and Breuer 2020). Specifically, the snow depth overestimation and the excessive persistence (i.e., unrealistically slow melting) of the snow cover in spring result in too high albedo values, low solar radiation absorption at the surface, and consequently reduced air temperatures. Exploring SD errors could therefore aid our understanding of model performance regarding more common climate variables over a wide range of spatial domains.

The paper is structured as follows. The “Data” section describes the different data sources utilized in the study. The “Methods” section outlines the analysis methods. The “Results and discussion” section presents and discusses the results. Finally, the “Conclusions and outlook” section summarizes the main findings and provides a brief outlook for future work.

2 Data

We use weather station observations to evaluate snow depth from a combined spaceborne and ground-based snow water equivalent (SWE) product (CGLS), a dataset constructed from temperature, precipitation, and relative humidity (CARPATCLIM), the ERA5 and ERA5-Land reanalyses, and WRF regional climate simulations over a Central-Eastern European domain. Table 1 summarizes the basic features of the datasets investigated in this study. A brief description of each data source and the verification approach is provided below.

Table 1 Overview of the gridded snow datasets used in this study

2.1 Station observations

In situ (weather station) snow depth observations for the 1985–2010 period were obtained from the Integrated Surface Database (ISD) (https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database, accessed 13 Dec 2022). The ISD repository is produced by the National Oceanographic and Atmospheric Administration’s National Centers for Environmental Information (NOAA NCEI) in the USA. The database contains hourly surface-based synoptic observations from multiple data sources (Smith et al. 2011). The compilation process of ISD consists of several objective quality control algorithms aiming to eliminate erroneous data records. Figure 1 presents the geographical locations of the four Hungarian weather stations included in this study. These stations were selected because of their sufficiently long and continuous snow depth records, following a manual inspection of available data series from several measurement sites in the area. The geographical characteristics and identification numbers of the stations are listed in Supplementary Table S1.

Fig. 1
figure 1

Study region with terrain height from the 0.1° CARPATCLIM dataset. Red points mark the locations of the weather stations included in the analysis. The red rectangle represents the area for the spatial averaging. For a geographical context, see the cyan rectangle in Fig. 2

Fig. 2
figure 2

Geographical coverage and terrain height of the WRF model domains (red rectangles). The cyan rectangle represents the study region (see Fig. 1)

2.2 CGLS dataset

The Copernicus Global Land Service (CGLS) daily gridded snow water equivalent (SWE) product is generated by combining satellite-derived information (passive microwave radiometer brightness temperature and optical snow extent) with surface-based synoptic snow depth measurements (Pulliainen 2006; Takala et al. 2011). The dataset is available at a 0.05° latitude–longitude grid for the Northern Hemisphere, with a temporal coverage starting from January 2006 and a new product generated near real time. We calculated snow depth by dividing the SWE values by 240 kg m−3 (Sturm et al. 2010), a temporally and spatially constant snow density applied in the retrieval algorithm.

2.3 CARPATCLIM dataset

The CARPATCLIM dataset consists of quality-checked, homogenized, harmonized, and interpolated daily ground-based observations of more than ten meteorological parameters collected from several Central and Eastern European countries (Szalai et al. 2013; Spinoni et al. 2015). The data is available at a 0.1° grid for the 1961–2010 period and covers the Carpathian Region between the geographical coordinates 44°–50°N and 17°–27°E. Due to the scarcity of reliable long-term station observations for the interpolation procedure, the snow depth variable has been derived from daily grids of mean air temperature, precipitation, and relative air humidity using a complex snow model (Cheval et al. 2014; Chervenkov and Slavov 2016). Standard climate variables from the CARPATCLIM dataset, such as temperature and precipitation, have been extensively used as a reference for regional climate model verification (e.g., Kotlarski et al. 2019; Vautard et al. 2021). However, a detailed comparison of the snow depth records to other data sources has not yet been carried out.

2.4 ERA5 reanalysis

ERA5 is the latest, fifth-generation global atmospheric reanalysis product of ECMWF (Hersbach et al. 2020). It is based on the Integrated Forecasting System (IFS) Cy41r2, ECMWF’s operational medium-range forecasting system from March to November 2016. The ERA5 reanalysis offers a wide range of quality-assured meteorological variables from 1959 to the present, at 0.25° (≈31 km) spatial and 1-hourly temporal resolution. With 137 sigma-pressure model levels in the vertical, ERA5 provides an exceptionally detailed description of the atmospheric processes. Observations are ingested into ERA5 using a state-of-the-art hybrid incremental 4D-Var data assimilation (DA) system. The snow DA is based on a two-dimensional optimal interpolation scheme for in situ snow depth observations from the synoptic network. Starting from 2004, the DA algorithm also incorporates satellite-based snow cover information for the Northern Hemisphere from the Interactive Multisensor Snow and Ice Mapping System (IMS). In this work, SWE and snow density values were used to compute the actual snow depth, as the latter is not directly available from the ERA5 dataset.

2.5 ERA5-Land reanalysis

ERA5-Land is produced by high-resolution numerical integrations of the ECMWF land surface model (the Carbon Hydrology-Tiled ECMWF Scheme for Surface Exchanges over Land (CHTESSEL)), forced by downscaled meteorological data from the ERA5 reanalysis, including lapse rate correction (Muñoz-Sabater et al. 2021). Thus, it can be viewed as an enhanced version of the land component of ERA5. The various land variables are available at a 0.1° grid (≈9 km) from 1950 to the present, at 1-hourly time intervals. No direct data assimilation is performed in the production of ERA5-Land. As in the case of ERA5, we computed actual snow depth using SWE and snow density. Muñoz-Sabater et al. (2021) note that despite its higher horizontal resolution, ERA5-Land does not necessarily perform better in capturing snow depth values than the original ERA5 reanalysis, especially over areas with a dense network of observations for the assimilation procedure.

2.6 WRF model simulations

Regional climate simulations were produced for the 1985–2010 period using the WRF model version 4.2 (Skamarock et al. 2019). The configuration consists of two one-way nested domains on a Lambert conformal projection (Fig. 2). First, a model run with a grid spacing of 50 km (hereafter WRF50) was produced forced by ERA5 reanalysis data as initial and boundary conditions. Then, the 50-km run was further downscaled to a 10-km grid (hereafter WRF10). The simulations start on 1 July 1984 at 0000 UTC and end on 1 January 2011 at 0000 UTC. The first six months were considered a spin-up time and discarded from the analysis. No spectral nudging was applied, and the model was integrated continuously (i.e., without periodical reinitialization from ERA5). The vertical discretization was achieved using 61 sigma-pressure model levels, with a higher resolution near the surface and the tropopause. The model top was set to 50 hPa. Output fields were written every 3 h. Table 2 shows the main physical parameterization schemes used in the WRF runs. The configuration is based on our preliminary sensitivity studies and further test experiments (Varga and Breuer 2020). Surface snow processes in the RCM simulations are accounted for using the Noah-MP LSM, which consists of an advanced 3-layer snow model that describes the evolution and physical mechanisms of the snowpack and predicts snow depth values (Niu et al. 2011; He et al. 2021). Noah-MP offers numerous adjustable options to represent different land-surface processes, some of them directly affecting snow variables. A so-called namelist file, containing all model settings, can be found in Supplementary information.

Table 2 Physical parameterization schemes used in the WRF regional climate simulations

3 Methods

3.1 Comparison methods

The evaluation is carried out using daily snow depth (SD) values as a baseline for the calculations. The weather stations included in this study generally report SD two times a day (at 0600 UTC and 1800 UTC), which were averaged out to obtain the daily SD observation. We also computed daily mean SD from 3-hourly WRF, ERA5, and ERA5-Land time series (reanalysis data were downloaded at 3-hourly increments to match the WRF output frequency). The CGLS and CARPATCLIM datasets are inherently available at a daily time scale. Considering the different temporal coverage of the data sources utilized in this study, we decided to perform the analysis for two separate periods. The primary study period is chosen as 1985–2010 based on the temporal coverage of the WRF simulations. In addition, results from the observational and reanalysis datasets are also presented for the 2006–2010 period so that CGLS can be included in the analysis. The study region is selected based on the geographical extent of the CARPATCLIM dataset shown in Fig. 1. For the pointwise comparison with station observations, SD from the geographically nearest grid point was extracted from each dataset. A similar approach was taken by, for example, Li et al. (2022b). The stations are located at altitudes around or below 200 m above sea level.

Besides the pointwise evaluation, we also compare the grid-based datasets in terms of spatial averages for a flat lowland area within the study region (Fig. 1). For this purpose, all datasets were interpolated to a uniform 0.5° latitude–longitude grid, the resolution of which is close to the coarsest dataset used in the analysis (WRF50). For SD, ordinary block kriging interpolation (Kottek and Rubel 2007) was applied.

The results are presented in the form of average daily SD values for the 2006–2010 and 1985–2010 periods. We also present the mean error (ME), mean absolute error (MAE), root-mean-square error (RMSE), and Pearson correlation coefficient (PCORR) calculated between the daily SD series of station observations and the different data sources. Only the months between November and March are considered when detectable snow cover might be present in the study area. The spatial distribution of monthly mean SD derived from each gridded database is also displayed.

3.2 Uncertainty sources and dataset limitations

The data types used in this study unavoidably have potential shortcomings (Mudryk et al. 2020). Uncertainty in long-term station observations can arise from possible condition changes around the location and the subjectivity introduced by the involvement of a human surveyor. Moreover, SD at synoptic measurement sites is usually only recorded one or two times a day (Dong 2018), which is not necessarily sufficient to calculate a representative daily mean value. Signals detected by microwave sensors are sensitive to various snowpack characteristics that potentially affect measurement accuracy. In addition, the algorithms used in the retrieval process are based on assumptions regarding some parameters (Mortimer et al. 2020). Optical sensors only provide information on cloud-free days (Dong 2018). Combining remotely sensed products with surface-based observations, as in the case of the CGLS dataset, can fill in data gaps and reduce retrieval uncertainties. The temporally and spatially fixed snow density value (240 kg m−3) utilized in the CGLS retrieval algorithm, despite being justified by earlier studies, is also a possible error source (Takala et al. 2011). CARPATCLIM consists of SD series derived from a limited number of meteorological parameters using an independent snow model, which is prone to many simplifications and crude assumptions. In the case of the reanalyses and WRF RCM simulations, errors can arise from model formulation (e.g., spatial and temporal discretization, model resolution, and physical parameterizations).

The nearest grid point method also has potential limitations, as the grid cell value might not be representative of the station observation. However, as the stations are located in an extensive low-altitude region (with terrain height mainly below 200 m), elevation differences between the measurement sites and the corresponding grid points are generally small. The comparison of spatially averaged SD values from the gridded datasets is intended to increase the robustness of the results, as the number of stations included in the analysis is very limited.

4 Results and discussion

In this section, evaluation results are first presented for the 2006–2010 period, covered by all observational and reanalysis products. Then, the analysis is extended to a 26-year period (1985–2010) for the datasets with a sufficiently long temporal coverage, including the WRF regional climate simulations.

4.1 Evaluation for the 2006–2010 period

Table 3 presents the statistical parameters (Pearson correlation coefficient, ME, MAE, RMSE) calculated between station measurements and the gridded observation-based datasets. The ERA5 reanalysis and the CGLS product perform remarkably well, with correlation coefficients exceeding 0.9 and ME values close to zero.

Table 3 Pearson correlation coefficient (PCORR), mean error (ME), mean absolute error (MAE), and root-mean-square error (RMSE) of daily mean snow depth from the different gridded datasets, compared to station observations (period: November to March 2006–2010)

In the case of ERA5, the correlation coefficient even reaches 0.99 for two stations. The good performance of ERA5 and CGLS demonstrates the advantages of using direct observations in the construction of snow datasets, e.g., through the advanced data assimilation system of ERA5 and the incorporation of ground-based snow depth (SD) measurements to complement satellite retrievals in CGLS. ERA5-Land and CARPATCLIM display slightly lower correlation coefficients in the range of 0.8–0.9 and 0.7–0.8, respectively. With a PCORR number of 0.55, the CARPATCLIM dataset performs the worst in the case of the Debrecen station. The highest error numbers can also be observed for CARPATCLIM, followed by ERA5-Land. Muñoz-Sabater et al. (2021) noted that ERA5-Land does not necessarily perform better than ERA5 over relatively data-rich and non-mountainous regions, which is confirmed by our results.

The mean day-to-day variation of SD in the 2006–2010 period is generally reproduced by all datasets (i.e., the day-to-day variations are similar in the observations and the different data products), with CARPATCLIM and ERA5-Land displaying noteworthy biases (Fig. 3). The mean errors of ERA5 and CGLS mostly remain below 1 cm and 2 cm, respectively, which suggests an exceptionally good agreement with station observations. On the other hand, biases of the CARPATCLIM dataset reach 4–7 cm in the case of the Budapest and Pécs measurement sites. For the other two stations, the overestimation of CARPATCLIM remains below 2 cm. The occasional noisiness of the data is presumably a consequence of the relatively short (5 years) study period and the fact that snow cover has considerable year-to-year variability in the investigation area with intermittent snow-free periods even in the winter months (i.e., during the 5-year period, it is likely that for a given day and station, there is no snow cover on the ground in any of the years). The discrepancies of the CARPATCLIM dataset likely originate from the application of an independent snow model to derive SD from the predictor variables, as its temperature and precipitation records have been proven to be of high quality and utilized in several RCM verification studies (e.g., Kotlarski et al. 2019; Vautard et al. 2021). In addition, the relationship between standard meteorological parameters and SD is not straightforward, and the snow model might not be appropriately tuned for this particular area. Daily mean SD is overestimated in the ERA5-Land reanalysis, too, although the magnitude of the positive bias is lower compared to CARPATCLIM (mostly less than 3 cm). Moreover, ERA5-Land slightly underestimates peak SDs in the case of Budapest at the beginning of February. Supplementary Fig. S1 displays daily mean SDs from the different datasets averaged across the four stations in the 2006–2010 period.

Fig. 3
figure 3

Daily mean snow depth from station observations (OBS) and the different gridded datasets in the 2006–2010 period

To address the possible representativeness issue arising from the nearest grid cell method, Fig. 4 presents daily mean SDs from the gridded datasets in the 2006–2010 period, spatially averaged over the geographical area shown in Fig. 1. The relative performance of each dataset is very similar in the case of the pointwise comparison and the spatial averages. Moreover, the areal mean patterns shown in Fig. 4 are almost indistinguishable from those obtained by averaging out daily mean SDs across the four stations (Supplementary Fig. S1). The ERA5 and CGLS datasets are very close to each other; meanwhile, ERA5-Land displays slightly higher values (by less than 1 cm). Mean SDs from CARPATCLIM are 1–2 cm higher compared to the other data sources in January and February.

Fig. 4
figure 4

Daily mean snow depth from the different gridded datasets in the 2006–2010 period, spatially averaged over the region shown in Fig. 1

The spatial distribution of mean February SD is similar among the gridded observation-based and reanalysis datasets (Fig. 5). SD minima can be observed over southeastern Hungary and northern Serbia. SD values increase from the south toward the north in every dataset. Although the differences are small over the low-altitude regions of Hungary, CARPATCLIM displays slightly larger values throughout the country than the other data sources. Over the ranges of the Carpathian Mountains, marked peak values are present in CARPATCLIM and ERA5-Land, presumably caused by the better description of topography at the higher horizontal resolution, as such pronounced maxima cannot be observed in the case of ERA5. Mountain regions are masked out in the CGLS product. In summary, Fig. 5 indicates a considerable dataset uncertainty over complex terrain.

Fig. 5
figure 5

Spatial distribution of mean February snow depth from the different gridded datasets in the 2006–2010 period

However, quantitative verification in mountainous areas is out of the scope of this paper. Several studies evaluated the ERA5 and ERA5-Land reanalyses over regions with complex topography, indicating the superiority of ERA5-Land. Muñoz-Sabater et al. (2021) demonstrated an improved performance of ERA5-Land compared to ERA5 over mid-altitude mountains due to the enhanced horizontal resolution. Larger SDs in ERA5-Land compared to ERA5 agree better with in situ observations over the Tibetan Plateau according to Lei et al. (2022). Li et al. (2022b) indicated that both ERA5 and ERA5-Land overestimate SD in the Tianshan Mountains (Central Asia), although the magnitude of the positive bias is larger in ERA5. Maps of the monthly mean SD for December, January, and March are shown in Supplementary Figs. S2, S3, and S4. Conclusions for the other months are identical to those discussed above in the case of February.

4.2 Evaluation for the 1985–2010 period

Table 4 displays the statistical metrics calculated between the different datasets and the station observations for the 1985–2010 period. The correlation coefficients of ERA5 are around 0.96–0.97. The PCORR numbers of ERA5-Land and CARPATCLIM are in the range of 0.83–0.91 and 0.73–0.82, respectively.

Table 4 Pearson correlation coefficient (PCORR), mean error (ME), mean absolute error (MAE), and root-mean-square error (RMSE) of daily mean snow depth from the different gridded datasets, compared to station observations (period: November to March 1985–2010)

Correlation coefficients of the WRF regional climate simulations are between 0.52 and 0.65, implying a moderate performance compared to the observation-based and reanalysis datasets. Error numbers (ME, MAE, and RMSE) are the lowest for ERA5, followed by CARPATCLIM and ERA5-Land, displaying errors of similar magnitude. Error values are the largest for the WRF simulations. Positive MEs indicate a general overestimation for every dataset, although with different magnitudes. Reducing the grid spacing from 50 to 10 km in WRF does not improve the results considerably.

The observation-based datasets, the reanalysis products, and the WRF regional climate simulations correctly depict the day-to-day variability of measured mean SDs in the 1985–2010 period (Fig. 6). However, magnitude errors exist, occasionally reaching significant values in the case of the RCM runs. The overestimation of ERA5 is almost negligible, remaining below 1 cm. ERA5-Land and CARPATCLIM show a comparable positive bias of 2–3 cm in the case of the Budapest and Debrecen stations in January, and excessive SDs persist throughout the entire period. The overestimation of ERA5-Land is lower (remains below 1 cm) in the case of Pécs and Szeged than for the other two stations. The WRF simulations significantly overestimate daily SD values. In the ablation period (February and March), WRF model biases exceed 5 cm (and even reach 8 cm) for some measurement sites. The melting period is delayed in the case of all four stations, but discrepancies are more severe for the sites located further north (Budapest and Debrecen). A similar SD overestimation has been found across the Tianshan Mountains in the melting period by Li et al. (2022a) using the Noah-MP land-surface model. You et al. (2020a) indicated that the different options for the description of physical processes in Noah-MP affect SD in the ablation period, and the overestimation might be reduced by the optimal selection of model settings (e.g., using a semi-implicit snow temperature time scheme accelerated snow ablation compared to a full-implicit scheme). The delay of snowmelt impacts land-surface interactions and can have a deteriorating effect on simulated surface air temperatures, as shown in the case of the ERA5 reanalysis by Lei et al. (2022). Our results indicate that the increase in horizontal resolution does not lead to better model performance in the WRF simulations. We assume that the beneficial effects of the reduced grid spacing would be more evident over regions with complex topography. In addition, we propose that the significant overestimation of SD is a reasonable explanation for the extensive cold bias found in previous RCM evaluation studies in winter over large parts of Europe (García-Díez et al. 2015; Katragkou et al. 2015; Varga and Breuer 2020). The delayed ablation might also indicate problems in the representation of snowmelt processes in Noah-MP that should be addressed in further model improvement efforts. Supplementary Fig. S5 displays daily mean SDs from the different datasets averaged across the four stations in the 1985–2010 period.

Fig. 6
figure 6

Daily mean snow depth from station observations (OBS) and the different gridded datasets in the 1985–2010 period

The inspection of spatially averaged daily mean SDs from the grid-based datasets confirms the findings of the pointwise comparison (Fig. 7). The ERA5 reanalysis displays the lowest values, followed by ERA5-Land, CARPATCLIM, and the WRF regional climate simulations. Surprisingly, in most of the months, WRF10 presents even higher SDs than WRF50, i.e., the overestimation increases with the decreased grid spacing. On the other hand, in the ablation period (late February–March), WRF10 SDs are slightly below those of WRF50, suggesting that snowmelt processes might be better resolved at the higher horizontal resolution. However, the better performance of WRF10 is not as consistent at the station scale as in the spatial averages.

Fig. 7
figure 7

Daily mean snow depth from the different gridded datasets in the 1985–2010 period, spatially averaged over the region shown in Fig. 1

The excessive snow cover in the RCM simulations might at least partially originate from the overestimation of precipitation, as suggested by Orsolini et al. (2019) for the ERA5 reanalysis and Frei et al. (2018) for the EURO-CORDEX regional climate model ensemble. Indeed, WRF simulations participating in EURO-CORDEX overestimate winter precipitation over Europe (García-Díez et al. 2015; Vautard et al. 2021). Moreover, we also encountered a general wet bias in winter and spring during our preliminary sensitivity tests with the WRF model over the Pannonian Basin region (Varga and Breuer 2020). Lin and Chen (2022) demonstrated a significant overestimation of snowfall in the CMIP6 ESM ensemble and the ERA5 reanalysis over many parts of the world, using proxy snowfall data constructed from 2 m temperature and precipitation. According to their results, snowfall overestimation is the largest in spring, and the simulated snowfall duration is considerably longer than in reality. In summary, the overestimation of precipitation and, therefore, snowfall is quite common in state-of-the-art model-based datasets. It is challenging, however, to separate the relative contribution of precipitation and temperature errors, and land-surface model deficiencies to SD biases (McCrary et al. 2017).

The significant February–March SD overestimation in the WRF simulations is alarming for several reasons. First, the unrealistically deep and persistent snow cover can induce excessive snow-albedo feedback (Minder et al. 2016) that would lead to further atmospheric cooling in the model, resulting in an increased amount of solid precipitation in late winter and early spring. Moreover, the representation of albedo over snow surfaces might essentially (i.e., without major SD biases) be inaccurate. For example, Minder et al. (2016) demonstrated a large overestimation of surface albedo in the snowmelt season over the central Rocky Mountains, which they attributed to the lack of considering impurities deposited onto the snow in the LSM. Second, the positive albedo bias likely affects the surface energy budget and air temperatures aloft. In addition, the role of the temperature-albedo feedback is larger in spring when solar radiation fluxes are increasing (Mudryk et al. 2020). Finally, Thackeray et al. (2019) stated that the snow-atmosphere coupling is the strongest during the springtime melting period, and biases in snow cover can have a long-lasting impact on soil moisture.

Figure 8 shows the spatial distribution of the mean February SD from the observation-based CARPATCLIM dataset and the reanalysis products for the 1985–2010 period. The conclusions are very similar to those for the 5-year period (Fig. 5), except for ERA5 also displaying peak values over the Carpathian Mountains. Nevertheless, high-elevation maxima are more spatially extensive and larger in magnitude in the case of CARPATCLIM and ERA5-Land. Over the low-altitude Hungarian areas, CARPATCLIM SDs are generally higher compared to the reanalysis products. Similar monthly mean SD maps for December, January, and March are shown in Supplementary Figs. S6S7, and S8.

Fig. 8
figure 8

Spatial distribution of mean February snow depth from the different gridded datasets in the 1985–2010 period

The spatial distribution of the mean February SD bias of the WRF simulations with respect to CARPATCLIM, ERA5, and ERA5-Land can be seen in Fig. 9. Generally, an increasing overestimation of SD can be observed from southwest to northeast within the investigation area, implying that the snow excess is also greater where average SDs are greater. Over southwestern Hungary and northern Croatia, the WRF regional climate simulations overestimate SD compared to the ERA5 and ERA5-Land reanalyses but slightly underestimate it with respect to the CARPATCLIM dataset. The contrasting sign of the biases highlights the difficulty of snow verification originating from dataset uncertainties, pointed out by several earlier studies (e.g., Mortimer et al. 2020; Li et al. 2022b). The observational uncertainty is amplified over the Carpathian Mountains, where WRF50 consistently overestimates monthly mean SDs compared to ERA5 but underestimates them with respect to CARPATCLIM and ERA5-Land. On the other hand, WRF10 displays excessive SDs compared to every dataset over the high-elevation regions of the Carpathians (Fig. 9). Therefore, the 10-km model generates a deeper snowpack over the mountains than the 50-km simulation, presumably because of higher precipitation amounts due to the enhanced resolution and better-resolved orography. Terzago et al. (2017) also stated that dataset uncertainty can be very high over mountainous areas, undermining our ability to quantitatively evaluate RCM simulations over complex terrain. However, according to Terzago et al. (2017), despite the considerable spread in the observational products, many EURO-CORDEX regional climate models display even larger SWE values than any reference dataset over the greater Alpine region. Over low-altitude areas, the bias patterns of WRF50 and WRF10 are very similar regarding spatial distribution and magnitude. Bias maps for December, January, and March are shown in Supplementary Figs. S9, S10, and S11.

Fig. 9
figure 9

Spatial distribution of the mean February snow depth bias of WRF50 (left column) and WRF10 (right column) in the 1985–2010 period, compared to the CARPATCLIM, ERA5, and ERA5-Land datasets (first, second, and third row, respectively)

5 Conclusions and outlook

This study evaluated snow depth (SD) from a combined satellite-based and in situ SWE product (CGLS), a dataset constructed from temperature, precipitation, and relative humidity using a snow model (CARPATCLIM), two state-of-the-art reanalyses by ECMWF (ERA5 and ERA5-Land), and WRF regional climate model simulations at horizontal grid spacings of 50 km and 10 km, using weather station SD observations as a reference. The study area covered the Pannonian Basin region (part of Central and Eastern Europe). Results were presented separately for the 2006–2010 and 1985–2010 periods. The main findings of the paper are summarized below.

The ERA5 reanalysis and the CGLS product represent SD at the station scale remarkably well, with correlation coefficients above 0.9 and mean errors close to zero. ERA5 and CGLS outperform ERA5-Land and CARPATCLIM, highlighting the importance of using in situ snow observations in constructing such datasets. Therefore, the CGLS dataset and the ERA5 reanalysis can be used with high confidence for snow research over the continental, low-altitude regions of Central and Eastern Europe and presumably other flat mid-latitude areas of the world as well. It is unclear if the high quality of the ERA5 SDs is maintained away from the observation sites participating in the data assimilation algorithm. CARPATCLIM tends to overestimate SD compared to the other datasets, implying that although its temperature and precipitation records are of high quality and frequently utilized in Central European climate studies, the SD records must be used with caution. ERA5-Land also overestimates SD, although to a lesser extent than CARPATCLIM. Therefore, the higher resolution of ERA5-Land does not bring improvement compared to ERA5 over the flat, low-altitude study region, suggesting that the role of the increased resolution is more important over areas with complex topography.

The WRF regional climate simulations overestimate SD, especially at the end of winter and the beginning of spring, i.e., in the snowmelt period. The overestimation is presumably caused by temperature and precipitation errors and deficiencies in the representation of snow-related physical processes in the Noah-MP LSM. The increase in horizontal resolution does not result in performance improvement regarding SD over the flat investigation area, suggesting that the mechanisms responsible for the overestimation are not scale-dependent. We propose that the excessive snow cover might cause discrepancies in simulated land-surface interactions and possibly explain cold biases found in previous RCM verification studies over Europe. Further research is needed to explore the connection between the overestimation of SD and low-temperature biases. Moreover, our future work will focus on improving SD simulations in WRF by testing different Noah-MP options for the parameterization of snow-related physical processes. A detailed investigation of the reasons behind SD overestimation must be conducted as a prerequisite for model improvement. This study shows that SD errors can be substantial even in regions where snow does not exert a major influence on the local climate and should be considered in model evaluation.

The results of this paper highlight the need for adequately verified, good-quality snow datasets. We found a high level of disagreement between the different data sources, especially over the mountain ranges of the Carpathians. Even the sign of the model biases depends on the reference product choice. Despite the high need for an accurate representation of snow-related processes in climate models, verifying snow variables remains challenging due to the lack of reliable observational datasets with sufficiently high spatial and temporal coverage.