The Year of Polar Prediction site Model Intercomparison Project (YOPPsiteMIP) phase 1: project overview and Arctic winter forecast evaluation

. Although the quality of weather forecasts in the polar regions is improving, forecast skill there still lags behind lower latitudes. So far there have been relatively few efforts to evaluate processes in numerical weather prediction systems using in situ and remote sensing datasets from mete-orological observatories in the terrestrial Arctic and Antarctic compared to the mid-latitudes. Progress has been limited both by the heterogeneous nature of observatory and forecast data and by limited availability of the parameters needed to perform process-oriented evaluation in multi-model forecast archives. The Year of Polar Prediction (YOPP) site Model Inter-comparison Project (YOPPsiteMIP) is addressing this gap by producing merged observatory data ﬁles (MODFs) and merged model data ﬁles (MMDFs), bringing together observations and forecast data at polar meteorological ob-servatories in a format designed to facilitate process-oriented evaluation. Anevaluation of forecast performance was performed at seven Arctic sites, focussing on the ﬁrst YOPP Special Ob-serving Period in the Northern Hemisphere (NH-SOP1) in February and March 2018. It demonstrated that although the characteristics of forecast skill vary between the different sites and systems, an underestimation in boundary layer temperature variability across models, which goes hand in hand with an inability to capture cold extremes, is a common issue at several sites. It is found that many models tend to under-estimate the sensitivity of the 2 m air temperature (T2m) and the surface skin temperature to variations in radiative forcing, and the reasons for this are discussed.


Introduction
Recent decades have seen a marked increase in human activity in the polar regions leading to an increasing societal demand for weather and environmental forecasts (Emmerson and Lahn, 2012;Goessling et al., 2016).Despite this growing need, the skill of weather forecasts in the polar regions lags behind that of the mid-latitudes (Jung et al., 2016;Bauer et al., 2016).This is partly the result of the relatively low density of conventional observations at high latitudes compared to mid-latitudes (Lawrence et al., 2019) but is also related to the occurrence of meteorological situations and phenomena that are historically difficult to model, such as stable boundary layers (e.g.Atlaskin and Vihma, 2012;Sandu et al., 2013;Holtslag et al., 2013) and mixed-phase clouds (e.g.Pithan et al., 2014Pithan et al., , 2016;;Solomon et al., 2023), and the importance of coupling between the atmosphere and snow and ice surfaces (e.g.Day et al., 2020;Batrak and Müller, 2019;Svensson and Karlsson, 2011).
The ability of climate models to represent atmospheric processes in polar regions has recently been assessed highlighting deficiencies in near-surface and boundary layer properties (Pithan et al., 2014;Svensson and Karlsson, 2011;Karlsson and Svensson, 2013).Since many climate models are based on global weather forecasting systems, understanding the causes of forecast error after 1-2 d may help develop understanding of the sources of error in climate models (Rodwell and Palmer, 2007).Nevertheless, until recently there has been little focus on evaluating numerical weather prediction (NWP) models using in situ data from the terrestrial Arctic and Antarctic (Jung and Matsueda, 2016;Jung et al., 2016).
Recent studies, conducted as part of the World Weather Research Programme's Polar Prediction Project (PPP, Jung et al., 2016), have started to address this gap, assessing the skill of both the large-scale circulation (Bauer et al., 2016) and surface weather properties (Køltzow et al., 2019).The Year of Polar Prediction (YOPP) site Model Intercomparison Project (YOPPsiteMIP) was designed to build on these earlier studies by utilising process-level data from polar observatories to diagnose the causes of forecast error from a process perspective and ultimately inform model development.Although process-oriented evaluation studies focussing on polar processes are not new, those that have been done have tended to focus on one or two sites or a specific field campaign (see Day et al., 2020;Batrak and Müller, 2019;Miller et al., 2018;Tjernström et al., 2021;Kähnert et al., 2023, for some recent examples).A key aim of YOPPsiteMIP is to provide a pan-Polar perspective on forecast evaluation and process representation.
YOPPsiteMIP participants were asked to provide data in so-called merged data files (MDFs), which includes both merged observatory data files (MODFs), for observatory data, and merged model data files (MMDFs), for model data.These data standards, which were developed specifically for YOPPsiteMIP, are described by Uttal et al. (2024).Using this common file format, with consistent naming and metadata, facilitates equitable and efficient comparisons between models and observations.This standardisation of the data from different observatories also aids interoperability in the sense that the same evaluation code can be applied at different sites.These MDF filetypes were developed as part of PPP, following the FAIR (Findable, Accessible, Interoperable, Reusable) data principles (Wilkinson et al., 2016).Details of the MDF concept and specifics of the data processing chain for producing MDFs are described in Uttal et al. (2024).
The observatories selected for YOPPsiteMIP represent a geographically diverse set of locations (see Mariani et al., 2024).At these sites a wide range of instruments measuring properties of the air, snow and soil are employed, extending far beyond the traditional synoptic surface and upper-air observation network, which are collected for use in the production and evaluation of NWP systems (Uttal et al., 2015).Taken together, the observations collected at these observatories offer opportunities to develop a deeper understanding of the physical processes governing the weather in the polar regions, their representation in forecast models and how this varies from site to site.The processes and phenomena targeted in YOPPsiteMIP include boundary layer turbulence, surface exchange (including over snow and ice) and mixedphase clouds.
A benefit of organising coordinated evaluation involving several NWP systems and multiple sites is that it helps clarify if the issues revealed by the analysis are model or location specific.The modelling community has organised model inter-comparisons to target various atmospheric processes relevant for Arctic conditions (e.g.Cuxart et al., 2006;Pithan et al., 2016;Tjernström et al., 2005;Sedlar et al., 2020;Solomon et al., 2023), with each using its own protocol for data sharing.However, the newly developed standardisation of the observational and forecast model data developed for YOPPsiteMIP is planned to be used for future MIIPs (Model Intercomparison and Improvement Projects).Converging on a standard like this will aid interoperability, making it easier for model developers to expand their evaluation to new sites or observational campaigns but also to other models or forecasting systems.
MDFs were requested for the locations listed in Table 1 and shown in Fig. 1 during the YOPP Special Observing Periods (SOPs), during which the observations taken at many polar observatories (e.g. the frequency of radiosondes) was enhanced (see Lawrence et al., 2019;Bromwich et al., 2020).For the Northern Hemisphere the periods February-March 2018 and July-September 2018 were selected and named NH-SOP1 and NH-SOP2, respectively.For the Southern Hemisphere or SH-SOP the period November-February 2018/2019 was chosen.At the time of publication MMDFs have been produced and archived from seven NWP systems for these periods, and all of the sites listed have MMDFs from at least one model.MODFs have been produced and archived for seven of the sites so far and it is hoped that addi-tional MODFs will be produced in the future to fill the gaps, particularly in the Southern Hemisphere.
The purpose of this paper is twofold.First, it seeks to document the first version of the YOPPsiteMIP dataset along with a basic description of the forecasting systems and their respective MMDFs that are archived at the YOPP Data Portal, hosted by the Norwegian Meteorological Institute (MET Norway).Second, the paper presents a multi-site evaluation of seven forecasting systems during NH-SOP1 at seven Arctic observatories that have produced MODFs.The locations are indicated by the white stars in Fig. 1a, and the MODFs and full details of the sites are described in Mariani et al. (2024).
The seven Arctic sites used for evaluation in this study cover both high Arctic and sub-Arctic climate zones.Tiksi, Utqiaġvik, Iqaluit, Ny-Ålesund and Eureka all sit in the Arctic tundra characterised by low vegetation.The remaining two sites Whitehorse and Sodankylä are sub-Arctic, with higher vegetation corresponding to the boreal cordillera and taiga ecozones, respectively.Whitehorse, Iqaluit, Ny-Ålesund and Eureka are characterised by complex topography in the surrounding area, whereas the other sites are flatter.All the sites are in close vicinity to either frozen ocean (sea ice) or frozen inland waterbodies at this time of year, and the land surrounding each observatory is covered in snow throughout the period February-March 2018.A visual representation of the model grids with respect to the landscape surrounding these stations can be seen in Fig. 2 of Mariani et al. (2024) in which a more detailed description of the site characteristics may be found.

Description of simulations, model formulation and output protocol
To date, six NWP centres have submitted forecasts from seven forecasting systems for NH-SOP1 and NH-SOP2, with two systems submitted for the SH-SOP (see The following three systems are regional: the Canadian The domain boundaries of the regional forecasting systems can be seen in Fig. 1 (note that only two of the observatories are within the AROME domain).The forecasts analysed here were initialised at 00:00 UTC for each day of the SOPs (although 12:00 UTC forecasts are also available on the archive for many of the systems).The forecast lead time varies between the different systems but all forecasts are at least 2 d long (see Table 2 and Figs. 2 and 3).
The files for some of the systems (CAPS, SLAV, ARPEGE, AROME-MF) are provided with multiple grid points centred on the observatory location.For others only a single grid point was provided.Multiple grid points centred around the observatory location were requested because many of the observatories are located in the vicinity of coasts, which leads to representativeness issues when comparing the land-based observation to model output for grid points being partially or entirely over the ocean.In this study, when there are multiple grid points we choose the closest 100 % land point to the supersite location, with the exception of CAPS, for which the central grid point within a beam of 7 × 7 grid points was considered (since nearest to the observation site) and ICON which provided the single closest grid point to the station location.As a result, the evaluation utilises a 100 % land grid box at all models and locations, with the exception of ICON, which has 23 % land cover at the Utqiaġvik and 73 % at Ny-Ålesund, and CAPS, which has 37 % land cover in Utqiaġvik; 71 % and 77 % in Tiksi and Iqaluit, respectively; and over 90 % land cover for the other sites.Comparison of the CAPS grid points surrounding Utqiaġvik with each other indicated that the evaluation would not be much influenced by the choice of grid cell (not shown) since during the Arctic winter the frozen ocean grid points have similar properties to the snow-covered land surface (e.g. when analysing the surface energy budget sensitivity to radiative forcing in Sect.3.4).The grid resolutions range from 2.5 to ∼ 30 km, and the model time step varies from 1.5 to 7.5 min (see Table 2).
The models have quite a diverse mixture of formulations for atmospheric dynamics, land surface, sub-grid-scale parameterisations, and initialisation and data assimilation procedures.More details about the simulations with specific models are provided below and a summary of the key model components and parameterisations used in each model is included in Table 3      vided at the model time step (7.5 min) for a single model grid point closest to the observatory.In addition to the grid point data, a number of parameters (including albedo, surface temperature, and surface energy fluxes) are provided on the land surface model tiles to enable detailed evaluation of processes even at heterogeneous sites.A complete description for the two versions of the IFS can be found at the following link: https://www.ecmwf.int/en/publications/ifs-documentation (last access: 10 July 2024).

ARPEGE-MF
The version of ARPEGE submitted to YOPPsiteMIP was a pre-operational version based on the cy43t2_op1 operational system but coupled with the 1D sea ice model GELATO (Bazile et al., 2020).The resolution of the model used for these simulations is the same as is used operationally at Météo-France, which is variable (using a stretching factor of 2.2) with the pole (highest resolution of 7.5 km) over France for NH-SOP1 and NH-SOP2 and over Antarctica in SH-SOP and 105 vertical levels.The horizontal resolution is about 8-9 km over the North Pole, and time series have been provided for the three SOPs in the MMDF format for the 21 YOPP observatories with an hourly output for both state variables (instantaneous) and fluxes (accumulated).

SLAV-HMRC
MMDFs were produced by the SLAV model (Tolstykh et al., 2018) for both NH-SOP1 and NH-SOP2 containing 7 d forecasts starting at 00:00 UTC.The output is available for four horizontal grid points surrounding selected observatories every 15 min (i.e.every fourth time step).Depending on variable, the output is instantaneous or a 15 min averaged value.Data for 13 of the Arctic observatories in Table 1 are provided.The selection of observatories is based on model resolution in latitude, which is relatively low, i.e. ∼ 16 km in northern polar areas; in addition, the ao2 point is not included because the model grid does not contain the poles.

ICON-DWD
MMDFs from DWD's ICON (Zängl et al., 2015) are available from February 2018 to June 2020 containing 7.5 d forecasts starting at 00:00 and 12:00 UTC for Sodankylä, Ny-Ålesund and Utqiaġvik (Barrow).The mesh width is 13 km.Different model versions are used during this period.In February icon-nwp-2.1.02was used followed by icon-2.

AROME-MF
The AROME-MF system from Météo-France and AROME-ARCTIC from MET Norway are both configurations of the same model system but use different parameterisations of turbulence, shallow convection, cloud microphysics, and sea ice.The system used for the YOPPsiteMIP differs from the operational AROME-France configuration (Seity et al., 2011) and the version evaluated for NH-SOP1 in Køltzow et al. (2019) in that it is coupled with the GELATO 1D sea ice model.However, the domain (see Fig. 1a) and horizontal and vertical grids are exactly the same as the AROME-ARCTIC operational system (see Sect. 2.6).The ICs and LBCs are interpolated from the global model ARPEGE-MF simulation described above (Sect.2.2).The MMDF files have been produced for Ny-Ålesund, Sodankylä and Pallas with hourly output.

Output format
For each forecast initial time and each forecasting system a single netCDF file containing all variables was archived following the MMDF format, which use the same nomenclature, metadata and structure as the MODFs.In order to be able to assess process representation, the YOPPsiteMIP protocol requested that atmospheric fields were provided on native model vertical levels and all fields should be provided with high frequency (every 5 or 15 min), ideally at the frequency of the model time step if practical to support detailed process investigations without the confounding effect of time averaging.
The actual variables archived, frequency and number of grid points vary from model to model.For example, ECCC provided a comprehensive set of parameters for the CAPS model focusing on precipitation and clouds microphysics to allow studies on the representation of different types of hydrometeors by the P3 scheme (Morrison and Milbrandt, 2015;Morrison et al., 2015;Milbrandt and Morrison, 2016).A full list of requested variables, along with a schema for producing the MDFs, is given in a document known as the H-K Table (Hartten and Khalsa, 2022).The table is available in both human and machine-readable form (PDF and JSON, respectively).The H-K Table relies on standards and conventions commonly used in the Earth sciences, including netCDF encoding with CF naming and formatting conventions, and is an evolving document that is expected to evolve to fulfil the requirements of future MMDFs and MODFs.The prescribed metadata make data provenance clear and encourage proper attribution of data origin (see further information in Uttal et al., 2024).
Although we only focus on model performance during NH-SOP1, a full set of MMDFs and MODFs was produced for both SOPs.The MODFs for Iqaluit (Huang et al., 2023b), Whitehorse (Huang et al., 2023a), Utqiaġvik (formerly known as Barrow; Akish and Morris, 2023c), Eureka (Akish and Morris, 2023a), Tiksi (Akish and Morris, 2023b), Ny-Ålesund (Holt, 2023) and Sodankylä (O'Connor, 2023) are described in detail in Mariani et al. (2024) along with descriptions of the site geography.MMDFs have also been produced for the SH-SOP with the ECMWF-IFS and ARPEGE models (See Table 2), but no MODFs for the Antarctic observatories have been produced yet.
3 Evaluation of basic surface meteorology and vertical profiles

Evaluation and scores
As mentioned in the Introduction, the combination of MODFs and MMDFs allow detailed process-oriented diagnostics to be performed for the models.However, it is first important to assess what the errors are for standard variables such as 10 m wind speed and 2 m temperature.This first step is important because if they are stationary with lead time one can simply consider a 24 h time range in the forecasts such as T + 25 until T + 48 (the second day of the forecast), simplifying the analysis.
The 2 m temperature errors during February and March 2018 have quite different properties at each site and for each model (Fig. 2).The models are typically too warm at Utqiaġvik and Tiksi and too cold at Ny-Ålesund and Whitehorse, with the sign of the bias varying between the models at Iqaluit and Eureka.At both Sodankylä and Whitehorse, which are situated at lower latitudes than the other sites, there is a distinct diurnal cycle in the bias and standard deviation that is not there at higher-latitude sites.At both sites the night-time temperature bias is typically more positive than the daytime bias, indicating an underestimate of the diurnal temperature range.In the case of the CAPS and the IFS, the bias in the diurnal cycle at these observatories are representative of those seen over wider region (e.g.Casati et al., 2023, andHaiden et al., 2018).
In terms of wind speed, the forecasts all have a positive wind speed bias at Utqiaġvik and a negative bias at Iqaluit and Whitehorse (Fig. 3).At Tiksi, Eureka, Sodankylä and Ny-Ålesund, the sign of the bias varies between the models.Interestingly, the largest inter-model spread and biases in wind speed is observed at the sites surrounded by the most complex orography (i.e.Iqaluit, Ny-Ålesund, Eureka and Tiksi; see Fig. 2 of Mariani et al., 2024), likely due to the difficulties in representing the mesoscale flow patterns typically generated in such locations.Interestingly, there does not seem to be an obvious benefit from the increased resolution, with the AROME configurations and CAPS model actually having worse biases than the lower-resolution global models at Ny-Ålesund.
Although there is some sub-daily variability with a diurnal frequency in the bias that is more pronounced in the wind speed bias (Figs. 2 and 3), the size of the biases does not grow dramatically with time.Thus, we consider a 24 h time range between the T + 25 and T + 48 forecast steps (i.e. the second day of the forecast) to be representative of the general error, simplifying the analysis.

Vertical profiles
To gain further insights we investigate the vertical structure of the errors by comparing the model output to observahttps://doi.org/10.5194/gmd-17-5511-2024 Geosci.Model Dev., 17, 5511-5543, 2024 tions from radiosonde and tower.To do this the model and tower data were thinned to the same frequency as the radiosonde prior to calculating the median and inter-quartile range shown in Figs. 4 and 5.The median temperature and specific humidity within the boundary layer is overestimated at Tiksi, Eureka, Utqiaġvik and Iqaluit (see Fig. 4), and the models underestimate the strength of temperature and humidity inversions as a result.The picture is more mixed at Ny-Ålesund and Sodankylä where most models are too cold and humid, and two out of the three models are too dry at Whitehorse.The biases in the upper-air temperatures, 2 m air temperature and the surface skin temperature tend to go hand in hand, i.e. the model with the warmest or coldest surface temperature tends to have the warmest or coldest 2 m and upper-air temperatures.As a result, the mean 2 m temperature errors seen in Fig. 2 give a sense of the sign of the error in the lowest 100 m or so of the atmosphere.This coupling between the lowest model level, the surface skin temperature and the 2 m temperature is to be expected since the 2 m temperature is a diagnostic calculated as a function of the lowest atmospheric model layer and the surface skin temperature.
Air temperature variability in the lower boundary layer is generally underestimated by the models, except at Iqaluit (Fig. 5).This generally translates to an underestimation of the 2 m temperature variability at these sites.Interestingly, at Ny-Ålesund some models severely overestimate the 2 m temperature variability despite underestimating the variability aloft, possibly due to the overestimation of the surface skin temperature variability.For specific humidity the observed inter-quartile range tends to sit within the range of the models; however, it is overestimated at Eureka and underestimated at Tiksi and Whitehorse in the lower boundary layer.
The median of the modelled wind speed is too high in the boundary layer at Sodankylä, Utqiaġvik and Tiksi but more mixed at other sites (Figs. 4 and 5).The variability in the wind speed is within the model range, with the exception of Iqaluit, where it is underestimated.The overestimation of the wind speed at these sites is likely a contributing factor in the underestimation of the temperature and humidity inversions, since a positive bias in the wind speed will drive excessive turbulent mixing of heat and moisture inhibiting the decoupling of near-surface and upper-air temperatures that occurs during periods of radiative surface cooling and low wind (Van de Wiel et al., 2017).Other factors which could play a role are the radiative forcing at the surface or the response of the surface to radiative forcing.Both aspects will be addressed in the following subsection.

Links between errors in boundary layer temperature variability and surface radiation
In this section we investigate the role of radiative forcing in the underestimation of near-surface and boundary-layer temperature variability at Sodankylä, Utqiaġvik and Tiksi where the models underestimate the temperature variability.At these sites all upwelling and downwelling radiation components are available in the NH-SOP1 MODFs allowing us to investigate whether the suppressed temperature variability is related to suppressed variability in the radiative forcing at the surface, a lack of sensitivity of the near-surface temperature to radiative forcing or something else.The boxplots shown in Fig. 6a-c confirm the underestimate of near-surface temperature inter-quartile range (IQR) at Tiksi (except CAPS), Sodankylä, and Utqiaġvik, and further show that the cold tail of the distribution is generally shorter in the models meaning there is a warm bias during cold periods.The warm bias in cold conditions is well known at Sodankylä and is typical of NWP systems (see Atlaskin andVihma, 2012, andDay et al., 2020), but this feature has not been shown before at the other two sites to our knowledge.
The models typically also show differences in the distribution of the downwelling radiation at the surface, LW↓ + SW↓ compared to observations (Fig. 6d-f).The IQR is underestimated at Tiksi (except for CAPS) and Utqiaġvik.However, at Sodankylä all the models overestimate the IQR (except for CAPS) but also do not capture the highest values of incident radiation observed at the top of the distribution.Since errors in the incident radiation likely relate to interactions with clouds, which are not included in this iteration of the MODFs, we will not investigate the causes of these discrepancies between the observed and forecast radiation distributions further, leaving this for a more focussed future study, and we will instead move on to focus on the response of the near-surface air temperature and the surface energy budget.
As LW↓ + SW net is the effective radiative forcing for the surface skin temperature (and indirectly for the 2 m temperature), errors in 2 m air temperature are either due to errors in this driving term itself, the relationship between LW↓ + SW net and 2 m temperature, or more likely a combination of both (assuming that errors in advection are negligible).Because the model median surface albedo (except for SLAV at Tiksi) is close to the observed estimate (Fig. 7), we can focus on how 2 m temperature varies as a function of LW↓ + SW net , to more deeply investigate the causes of error.
At Sodankylä, Tiksi and Utqiaġvik all the models have a warm 2 m temperature bias at low levels of incoming radiation (LW↓ + SW net ) (see Fig. 8).At Tiksi, Utqiaġvik and Sodankylä the overall sensitivity of T2m to radiative forcing, as measured by the slope of the regression coefficient between 2 m temperature and LW↓ + SW net is underestimated in all the models with one exception.The AROME-Arctic model seems to be too sensitive at Sodankylä according to this diagnostic, but it captures the observed temperature range at low levels of LW↓ + SW net .
Note that the LW components used for Sodankylä in this study are not those provided in the NH-SOP1 MODF, which  are collected at the top of the 45 m tower, rather they are from a dedicated radiation tower located near the sounding station where the downwelling component is at a height of 16 m and the outgoing is at 2 m.These were swapped due to a concern over the accuracy of the LW radiation data collected at the met tower (Roberta Pirazzini, personal communication, 2023).
To investigate the role of surface-atmosphere decoupling in the 2 m temperature cold-tail warm bias and lack of 2 m temperature variability at low levels of incident radiation, we plot the thermal stratification as a function of near-surface wind speed at the three sites (Fig. 9) for situations where the model or observed LW↓ + SW net is below the 20th percentile.In the observations one can see the typical pattern seen at other sites (e.g.Van de Wiel et al., 2017) that shows that inversions are weak for strong winds, whereas large inversions are found under weak wind conditions with a transition found between those regimes at some critical wind speed.The models generally capture this qualitative regime behaviour (Fig. 9), although the magnitude of the thermal stratification, the wind speed and the critical wind speed for the regime transition varies between the models.

Surface energy budget sensitivity to radiative forcing
Further insight into the role of the land surface and surface exchange processes in the T2m errors outlined in the previous section, particularly the lack of T2m sensitivity to radiative forcing, can be gained by constructing surface energy budget sensitivity diagrams, following Miller et al. (2018) and Day et al. (2020).The idea here is that the surface energy budget can be separated into a "driving term" (LW↓ + SW net ) and "response terms" (sensible heat flux (SHF), latent heat flux (LHF), ground heat flux (GHF), and LW↑).The relationship between the driving term and each response term can be summarised with regression coefficients; e.g. for the SHF the following equation is used: where each of the α values can be interpreted as a coupling strength parameter between the driving term and each response term.These α values provide direct information  sion coefficients one can derive physical understanding of the causes of model error.
Note that in convective cases the main driver of turbulent heat fluxes is indeed the convective instability at the surface driven by radiative forcing.However, in stratified conditions the main driver of turbulence in the boundary layer (and of the sensible and latent heat fluxes) is the mechanical forcing, i.e. the large-scale wind speed (Van Hooijdonk et al., 2015;Van de Wiel et al., 2017;Vignon et al., 2017).As a result, one expects the turbulent fluxes to have little sensitivity to the radiative forcing in stable conditions, with the ground heat flux taking a larger role in balancing changes in radiative forcing and the converse in convective cases (see Day et al., 2020).As a result, at Utqiaġvik and Tiksi where stable conditions dominate, the ground heat flux varies with changes in radiative forcing more than the turbulent fluxes, as indicated by higher regression coefficients.At Sodankylä there is more of an even partitioning between the turbulent fluxes and the ground heat flux into the snow.
It is clear from Figs. 10-12 that all the models generally underestimate the surface temperature sensitivity to radiative forcing at Sodankylä, Utqiaġvik and Tiksi because the rate of change in LW↑ with changes in radiative forcing, LW↓ + SW net , i.e. α LW↑ , is typically too low (i.e.α LW↑ mod < α LW↑ obs ).Since the 2 m temperature diagnostic in the models is calculated as a function of the surface skin temperature, the underestimation of the 2 m temperature and LW↑ sensi-tivity to radiative forcing and the positive bias in those variables in cold conditions are likely to be closely related (i.e.comparing Fig. 8 to Figs. 10-12).For example, at Sodankylä the CAPS model T2m and upwelling longwave (LW↑) sensitivities are very close to what is observed, AROME-Arctic slightly overestimates these sensitivities while SLAV underestimates them.A similar proportionality can be seen between these properties of the models at the other two sites.Note that because the LW↑ at Sodankylä was observed at 2 m and thus has a rather small footprint compared to the sensor on the 16 m mast, the sensitivity is more representative of the bare snow than the forest canopy.As a result, one might expect the area mean LW↑ sensitivity to be higher than the value presented here.
This mismatch in terms of LW↑ sensitivity goes hand in hand with differences in the other α coefficients, and by comparing the sensitivities of the other response terms in the surface energy budget we can develop some hypotheses about what is leading to this mismatch in surface temperature sensitivities.For example, at Utqiaġvik, all the models tend to overestimate the sensitivity of the GHF, α GHF , which was calculated as the residual of the observed radiative and turbulent fluxes.This can be an indication of non-sufficient thermal representation of the land surface, e.g. a lack of a multilayer snow model (e.g.Day et al., 2020;Arduini et al., 2019).Unfortunately, we are not able to perform a similar calculation to that performed for Sodankylä to estimate the GHF, as  the longwave observations thought to be most reliable are not co-located with the other flux observations, or at Tiksi, as we do not have the turbulent fluxes in the MODF.As a result, we cannot calculate the GHF as a residual of the other terms.
Where we have turbulent flux observations, we can also evaluate the α SHF and α LHF terms.At Utqiaġvik, an underestimation of the sensitivity of the turbulent fluxes, too low α SHF and α LHF in the ARPEGE and SLAV models goes hand in hand with an overestimation of α GHF mentioned above.The IFS and ECCC models are closer to observations, with smaller values of α GHF and larger values of α SHF and α LHF .At Sodankylä, the α SHF varies quite a bit from model to model, but all the models where the LHF was available overestimate the α LHF .
At all three sites the relative size of the coefficients varies, with α LW↑ , α SHF and α GHF typically being an order of maghttps://doi.org/10.5194/gmd-17-5511-2024 Geosci.Model Dev., 17, 5511-5543, 2024 nitude larger than α LHF .This is likely to be typical of cold and dry snow-covered environments where the magnitude of the latent heat flux is low.However, the difference in the relative size of the other three terms varies quite a bit between sites with, for example, the turbulent flux playing a larger role at Sodankylä than at Tiksi and Utqiaġvik at this time of year.This reflects the larger surface roughness at Sodankylä associated with the trees at this site.Before moving on it is worth noting that as well as being used to develop hypotheses about the causes of errors related to the surface energy budget, these process diagrams and sensitivity metrics could also be applied to test new configura-tions of NWP systems with modifications to the land surface, boundary layer or related schemes and evaluate whether such modifications are improving the dynamic behaviour with respect to the surface energy budget in line with observed behaviour or not.

Evaluation of wind stress and sensible heat flux
The previous examples highlight discrepancies between forecast and observations and provide hints as to which processes are responsible for the documented errors.The observed conditions also provide multi-variate targets for updated forecasting systems.However, the observations can also help us https://doi.org/10.5194/gmd-17-5511-2024 Geosci.Model Dev., 17, 5511-5543, 2024 evaluate a specific process and thereby target a specific parameter or parameterisation to change.
The Sodankylä and Utqiaġvik MODFs include turbulent fluxes and profiles of wind speed and temperature, allowing us to investigate the parameterisation of turbulent exchanges of heat and momentum at the surface.Turbulent surface fluxes in NWP models are often parameterised according to Monin-Obukhov (M-O) similarity theory where they are related to the gradient in the lowest atmosphere (e.g.Beljaars and Holtslag, 1991): where τ is the wind stress; U is the wind speed; θ is potential temperature; ρ is the air density and the transfer coefficients; and C M and C H , used to in each computation, are a function of the roughness length of momentum and heat, z oM and z oH , and a stability parameter.In these equations, U ref and θ ref are the wind speed and potential temperature at a reference height, respectively, which in the case of the models is the lowest atmospheric model level, the height of which varies from around 10 to 30 m above the surface depending on the model (see Table 3).Successfully parameterising τ and SHF relies on defining a reasonable function for C M and C H and selecting the appropriate parameters and a proper aggregation of the fluxes in the cases of a tiled surface.Because we have observed and forecasted values for both the fluxes and the bulk parameters in Eqs. ( 2) and (3), we can diagnose how appropriate the choices in each model are for the conditions at a particular site.This is done by examining the relationship between the bulk parameters, U and θ , and the fluxes τ and SHF (see , as done previously by Tjernström et al. (2005) and more recently by Day et al. (2020).
In the case of wind stress, in neutral conditions, the points in Figs. 13 and 14 would sit on the straight line following where z ref is the height of the lowest model level, k is the von Kármán constant and z 0m is the aerodynamic roughness length.The slope of this line is determined by z 0m .However, this formula provides an overly simplified view as the atmospheric stability varies from neutral conditions, and as a result there is scatter in the values of τ for any given wind speed.
The relationship between τ and U for Sodankylä (Fig. 13) differs between the models and between the models and the observations.An estimate of the observed roughness length was calculated following the equation above after selecting for neutral conditions, and the value is presented in Table 4 along with the value used in each of the models.In the AROME-Arctic and ICON models, τ increases too slowly with increasing U .This is consistent with the fact that the roughness length for momentum is too low in these models, which have roughness lengths an order of magnitude lower than that derived from observations (see Table 4).Increasing z 0m in the AROME-Arctic and ICON models would likely reduce the positive bias in the wind median wind speed profile seen in Fig. 4; however, the other models that have roughness lengths closer to what was observed also have a positive wind speed bias, suggesting another cause.
Interestingly, all models fail to adequately capture the spread of τ for a given value of U , likely because the models underestimate the atmospheric stability as is suggested by the weaker than observed thermal stratification indicated by in Figs.4d and 5d.A more detailed study including numerical experimentation would be needed to demonstrate this further.
At Utqiaġvik, the aerodynamic roughness length is 3 orders of magnitude lower than at Sodankylä, reflecting the difference in surface type: snow-covered tundra compared to the forested taiga of northern Finland (Table 4).Here the IFS and SLAV models have roughness lengths close to those derived from observations, whereas ARPEGE and ICON have values that are higher.As a result, for a given wind speed the surface stress is too high in these two models (Fig. 14).
The scatterplots for the sensible heat flux (Figs. 15 and 16) also provide some insights into the differences in the process representation between the models.All the models capture the link between the SHF and the temperature gradient, T , dictated by M-O theory (see Eq. 3); however, the shape of the relationship varies between the models.For example, for the ARPEGE and AROME-MF models the sign of the sensible heat flux does not change in a binary way with T , instead there is spread in the location along the x axis where this occurs.This could be due to differences in the numerical formulation of the models, i.e. the time step at which the flux and temperature terms are stored, or due to the fact that we are looking at the grid box mean values where the fluxes are aggregated from values computed on different surface tiles.At Sodankylä, the IFS, SLAV and AROME-ARCTIC model have a clear tapering in the scaled sensible heat flux towards zero for high values of T .However, AROME-MF, ARPEGE and ICON do not have such a tapering, and the scaled heat flux continues to grow with larger T , which is qualitatively inconsistent with the observations and will lead to higher fluxes in very stable conditions inhibiting cooling of the surface.There is also a clear difference in the range of T between the different models; however, in the models this is an aggregate of different surface types representing forest canopy top, bare snow and frozen water, and because we do not have a trustable observation of the temperature of the top of the canopy frozen water during freezing conditions, it is not clear what the realistic range should be.Note also that the SHF at Sodankylä is measured at 24.5 m, and for process consistency T is calculated using the air temperatures observed at 18 and 32 m, which is not directly comparable with the models.Note that at Sodankylä the SHF is measured at 24.5 m, and for process consistency T is calculated using the temperatures observed at 18 and 32 m, meaning that it is not directly comparable with the models that use the skin temperature, T skin , and the lowest model level, T lml .Except for ICON, differences between the models at Utqiaġvik are less pronounced.IFS, SLAV and ARPEGE have quite a similar shape, and all underestimate the magnitude of the scaled heat flux for low values of T , potentially due to the slow bias in wind speeds near the surface.Note that the large values of T for the SLAV model are because the lowest model level is at ∼ 30 m compared to ∼ 10 m for the other models.Note that the ICON model has a large fraction of open ocean in the grid cell considered, and therefore the model tends to be biased towards convective conditions (i.e.most points are in the top-left quadrant of Fig. 16 where the sensible heat flux is heating the atmosphere), this is likely the main reason for the warm bias in surface skin temperature and 2 m air temperature.For the other models shown in Fig. 16, the grid point considered is 100 % land.

Conclusions and future plans
In this paper we have outlined the motivation for YOPP-siteMIP; documented the current status of the YOPPsiteMIP forecast MMDF data archived on the YOPP data portal (hosted by MET Norway); and presented some multi-model forecast evaluation examples to demonstrate the utility of the MMDFs and MODFs using data from the YOPP NH-SOP1, which occurred during February and March 2018.The main conclusions from this analysis are as follows.
-Near-surface temperature and wind speed forecast errors vary considerably between the different sites, reflecting both a range of climate conditions and forecast performance across the selected sites.-A common feature of several sites, namely Sodankylä, Barrow, Tiksi and Eureka, is a warm bias during periods of extreme cold that goes hand in hand with a lack of temperature variability in the lowest ∼ 100 m of the atmosphere.
-This lack of variability is investigated further at Utqiaġvik, Tiksi and Sodankylä where radiation components were observed and provided in the MODFs and MMDFs, which enabled us to investigate the sensitivity of T2m to radiative forcing.
-At all three sites the models tend to underestimate the sensitivity of T2m and the surface skin temperature (or LW↑) to variations in radiative forcing and do not capture extreme minima in these variables, although the AROME-Arctic and CAPS models perform better in this regard.
-At Utqiaġvik and Sodankylä, since turbulent fluxes were also provided, we were able to investigate the link between these fluxes and the bulk parameters.This highlighted the following points.
-Differences were found in the parameterisation of turbulent fluxes, particularly the specification of the roughness length for momentum, which varies by a little less than an order of magnitude between different models.
-The high importance of the atmosphere-to-snow heat flux was also noted, particularly at the Utqiaġvik and Tiksi sites, where stable conditions dominate.Note that despite this importance, this flux is not observed at these sites.
Process studies that compare point observations to gridded model output need to be carried out in awareness of sub-tile representativeness issues.For fine-resolution models it is always recommended to provide output from multiple grid points (as in this study) centred on the observatory to be able to pair land-based observations to a model tile with dominant land cover.For coarse-resolution models, we recommend providing variables for the different sub-tile components (bare soil, vegetation, water, ice, . ..).The more the site characteristics are matched to the correct model output, the more reliable diagnosis on the model capability to reproduce the observed physical process.In this study we found that the land-ocean contrast in the Arctic in winter does not significantly affect the surface energy budget sensitivity to radiative forcing in the CAPS model (in Sect.3.4, the oceanhttps://doi.org/10.5194/gmd-17-5511-2024 Geosci.Model Dev., 17, 5511-5543, 2024 dominated Utqiaġvik grid points of CAPS do not stand out with respect to the other models) because the frozen ocean has similar characteristics to the snow-covered land surface.
On the other hand, the ICON model, which has very low sea ice values (∼ 10 %), has much warmer temperatures than the other models at Utqiaġvik, and as a result the sensible heat flux behaves differently compared to the other models.Accounting for the land-ocean contrast will be crucial in the sea-ice-free summer NH-SOP2 period that will be evaluated in the future.The development of the MODFs and MMDFs is ongoing and will be completed in phases.The initial phase was to collect basic meteorology data and the main components of the radiation budget.Work on this initial phase is completed, and the next phase will provide a wider range of parameters (e.g.turbulent fluxes and cloud parameters) included in the MODFs.This is a more complicated but very necessary step since the models differ significantly in terms of surface heat and momentum fluxes and cloud properties (not shown).There are also plans to extend the MODF and MMDF concept to Antarctica, focussing on the Southern Hemisphere SOPs.These future phases of the YOPPsiteMIP will allow more detailed studies of, for example, the following avenues: cloud cover, microphysics, and radiative forcing; assessment of forecast models in Antarctica; testing of specific model developments; observatory representativeness studies.This will allow a more process-focussed understanding of the forecasts in the YOPPsiteMIP archive but also provide a test bed for model developers to use when testing new model formulations relevant for the Arctic.Further details on the MODF concept and the NH-SOP1 and NH-SOP2 MODFs can be found in Uttal et al. (2024) and Mariani et al. (2024), respectively.A Python-based toolkit for producing the MODFs is under development, and it is hoped will speed up and simplify the production of MODFs and facilitate timely evaluation of forecast models to inform the model development process.
Appendix A: All MMDF and MODFs are available on the YOPP Data Portal (https://yopp.met.no(last access: 16 July 2024), hosted by the Norwegian Meteorological Institute, for perpetuity (i.e.longer than 10 years).The YOPP Data Portal is relying on the Arctic Data Centre (https://adc.met.no, last access: 16 July 2024) for data stewarding and the YOPPSiteMIP data can be programmatically accessed using the machine interface for the Arctic Data Centre or can be accessed directly from https://thredds.met.no/thredds/catalog/alertness/YOPP_supersite/obs/catalog.html(last access: 16 July 2024) for the MODFs and https://thredds.met.no/thredds/catalog/YOPPSiteMIP-models/catalog.html (last access: 16 July 2024), for the MMDFs.

Figure 1 .
Figure 1.Maps of the ERA5 2 m temperature climatology (1990-2019) for February-March (time of NH-SOP1) for the Arctic (a) and for November-February (SH-SOP) for the Antarctic (b).The observatories used in YOPPsiteMIP are marked with stars.White stars indicate the sites where MODFs are currently available, which are the subject of this study; black stars indicate the sites whose MODFs are not yet complete.The orange and green boxes depict the extent of the ECCC-CAPS and AROME-Arctic domains, respectively.

Figure 2 .
Figure2.Mean bias (solid lines) and standard deviation (dashed lines) of the 2 m temperature error (in °C) at each observatory (see Fig.1a) for forecasts initialised at 00:00 Z during NH-SOP1, described in Table2.Night-time periods (with mean SW ↓ < 15 W m −2 ) are indicated with grey crosses along the x axis.

Figure 3 .
Figure 3. Mean bias (solid lines) and standard deviation (dashed lines) of the 10 m wind speed error (in m s −1 ) at each observatory for forecasts initialised at 00z during NH-SOP1.Night-time periods (with mean SW↓ < 15 W m −2 ) are indicated with grey crosses along the x axis.

Figure 4 .
Figure 4. Median temperature (left), specific humidity (middle) and wind speed (right) from the radiosonde (solid black line), the tower (dashed black line), and the numerical models (during the second day of the forecast: colour lines).The mean surface skin temperature is indicated by a dot and 2 m temperature (left), 2 m specific humidity (middle) and 10 m wind speed (right) are shown with a square.Note that wind speed and humidity profiles from the tower are not available in the Tiksi and Ny-Ålesund MODFs, respectively.The numbers in the left-hand panels correspond to the verification sample size, which was dictated by the availability of radiosonde profiles.

Figure 6 .
Figure 6.Boxplots of T2m (a-c) and LW↓ + SW↓ (d-f) for Sodankylä, Utqiaġvik and Tiksi in observations and during the second day of the forecast.The text above the boxplots states the median (and inter-quartile range) of each distribution, which are also shown by the orange line and box edges, respectively.The 5 %-95 % range is plotted by the whiskers and points outside this are shown in dots.

Figure 7 .
Figure 7. Boxplots of surface albedo for Sodankylä, Utqiaġvik and Tiksi in observations and during the second day of the forecast.The text above the boxplots states the median (and inter-quartile range) of each distribution, which are also shown by the orange line and box edges, respectively.The 5 %-95 % range is plotted by the whiskers and points outside this are shown in dots.

Figure 8 .Figure 9 .
Figure 8. Scatterplots of 2 m temperature as a function of LW↓ + SW net for Sodankylä, Utqiaġvik and Tiksi (from left to right) for the second day of the forecast.The regression slope between the 2 m temperature and the LW↓ + SW net is stated in the title for the observations (in grey) and each model (various colours).

Figure 10 .
Figure 10.Process relationship diagrams and sensitivity parameters for upwelling longwave radiation (LW↑; left), sensible heat flux (SHF; middle left), latent heat flux (LHF; middle right) and ground heat flux (GHF; right) at Utqiaġvik.Observed values are shown in grey, model values during the second day of the forecast are shown in colour.The line of best linear fit is shown for observations (grey line) and each model (pink line).The sensitivity parameters, α, describing the coupling strength between the driving (LW↓ + SWnet) and each response term are printed above each diagram, with the observational (modelled) relationship on the left (right).

Figure 13 .
Figure 13.Scatterplots of wind stress vs. the square of the near-surface (lowest model level) wind speed at Sodankylä.The observed points are shown in black, while hourly values during the second day of the forecast forecast are shown in colour.

Figure 14 .
Figure 14.The same as Fig. 13 but for Utqiaġvik.

Figure 16 .
Figure16.The same as Fig.15but for Utqiaġvik.Note that for the observations T is calculated using the 10 m air temperature and an estimate of the surface temperature from an infrared sensor.

Table 1 .
. List of YOPPsiteMIP observatory locations: name, name as used in filenames (shown in italics), latitude, longitude and elevation.Where an elevation range is stated, this is because the instruments at a given observatory extend over a range of values due to variations in local topography.

Table 2 .
Summary of forecasting systems.

Table 3 .
Details of physical processes and parameterisations of the forecasting systems (see Appendix A for list of acronyms).Note that slashes are used in the right-hand column to separate information on the dynamical core of each model: model grid type/numerics/hydrostatic assumption.
https://doi.org/10.5194/gmd-17-5511-2024Geosci.Model Dev., 17, 5511-5543, 2024 Donlon et al., 2012)and Sea Ice Analysis (OSTIA;Donlon et al., 2012)product.To represent variations in subgridscale surface characteristics ICON uses a tile approach.Since 16 July 2018 the tile values of surface fluxes and other tiledependent variables are included in the MMDFs in addition to the grid average values.Hourly output is available based on a time step of 120 s.CaLDAS).The CAPS time series are produced for a beam of 7×7 grid points centred on each of the 12 landbased Arctic observatories listed in Table1.Time series up to 48 h lead time are made available for the daily runs initialised at 00:00 UTC.The data are archived with a time frequency of 7.5 min, equivalent to five time steps of 90 s each.
dataset came into operation; however, for the three data points provided the changes were less than 1 m in height.The sea ice analysis used in ICON was based on the Real-Time Global SST high-resolution analysis of NCEP until 16 July 2018.Since then it has been based on the Operational Sea

Table 4 .
Roughness lengths for momentum (m) at Sodankylä and Utqiaġvik from observations and models.For the models the mean is stated, while the range of values is stated in parentheses.

Table of
Code and data availability.Apart from the ECMWF-IFS, for which an open-access version of the code is available here: https: //confluence.ecmwf.int/display/OIFS(last access: 16 July 2024), the model codes are not open access.