An NWP Model Intercomparison of Surface Weather Parameters in the European Arctic during the Year of Polar Prediction Special Observing Period Northern Hemisphere 1

Increased human activity in the Arctic calls for accurate and reliable weather predictions. This study presents an intercomparison of operational and/or high-resolution models in an attempt to establish a baseline for present-day Arctic short-range forecast capabilities for near-surface weather (pressure, wind speed, temperature, precipitation, and total cloud cover) during winter. One global model [the high-resolution version of the ECMWF Integrated Forecasting System (IFS-HRES)] and three high-resolution, limited-area models [Applications of Research to Operations at Mesoscale (AROME)-Arctic, Canadian Arctic Prediction System (CAPS), and AROME with Météo-France setup (MF-AROME)] are evaluated. As part of the model intercomparison, several aspects of the impact of observation errors and representativeness on the verification are discussed. The results show how the forecasts differ in their spatial details and how forecast accuracy varies with region, parameter, lead time, weather, and forecast system, and they confirm many findings from mid- or lower latitudes. While some weaknesses are unique or more pronounced in some of the systems, several common model deficiencies are found, such as forecasting temperature during cloud-free, calm weather; a cold bias in windy conditions; the distinction between freezing and melting conditions; underestimation of solid precipitation; less skillful wind speed forecasts over land than over ocean; and difficulties with small-scale spatial variability. The added value of high-resolution limited-area models is most pronounced for wind speed and temperature in regions with complex terrain and coastlines. However, forecast errors grow faster in the high-resolution models. This study also shows that observation errors and representativeness can account for a substantial part of the difference between forecast and observations in standard verification.
Corresponding author: Morten Køltzow, famo@met.no. This article is licensed under a Creative Commons Attribution 4.0 license (http://creativecommons.org/licenses/by/4.0/). Køltzow et al., August 2019. DOI: 10.1175/WAF-D-19-0003.1. 2019 American Meteorological Society.


Introduction
The Arctic is experiencing rapid changes in its harsh climate and environment; for example, the observed annual averaged near-surface temperatures at Svalbard are now increasing at between 1.04° and 1.76°C decade⁻¹ (Hanssen-Bauer et al. 2019). Anticipated increases in ship traffic, resource exploitation, tourism, and other activities (WMO 2017) call for accurate and reliable weather predictions for safe and efficient operations. Despite improved Arctic forecast skill in recent decades (Bauer et al. 2016; Jung and Leutbecher 2007), Jung et al. (2016) argue that existing numerical weather prediction (NWP) systems do not meet existing user requirements. Furthermore, forecast errors in the Arctic are larger than at lower latitudes (e.g., Nordeng et al. 2007; Bauer et al. 2016; Gascard et al. 2017). Nordeng et al. (2007) argue that the main reasons for this are the sparse conventional observational network and the small spatial scales of many (high-impact) Arctic-specific weather phenomena. NWP systems are also often developed and tuned with a focus on mid- and lower-latitude weather.
Arctic verification studies of global model systems often use model analyses as truth, given the relative sparseness of observations. However, this introduces uncertainty in the interpretation: the spread between analyses is larger than at lower latitudes because the analyses are less constrained by observations and are closer to their inherent model climatology (Jung and Matsueda 2016; Bauer et al. 2016). Bauer et al. (2016) found that verifying near-surface temperatures in the Arctic against observations gave substantially larger errors compared to verifying against model analyses. To establish the state of the art on Arctic forecast capabilities, more verification of near-surface parameters, including snow and sea ice characteristics, is needed (Jung et al. 2016).
The use of regional models can, compared with global models, improve forecast accuracy by the use of optimized physics for the targeted area and finer horizontal and vertical resolution (Jung et al. 2016). However, operational convection-permitting models have only recently started to appear for the Arctic domain. Müller et al. (2017) and Yang et al. (2018) describe added value from operational high-resolution HIRLAM-ALADIN Research on Mesoscale Operational NWP in Euromed (HARMONIE)-Applications of Research to Operations at Mesoscale (AROME) runs in the Arctic compared to coarser-resolution systems. Furthermore, specific Arctic weather phenomena, often connected to high-impact weather, have been studied in both global and regional high-resolution models. Models have been compared with field observations and used as a tool to better understand the investigated phenomena. For example, polar lows have received substantial attention (e.g., Kristjánsson et al. 2011), but remain a challenge in operational forecasting because of their rapid growth and mesoscale nature (e.g., Spengler et al. 2017). Arctic cyclones (e.g., Yamagami et al. 2018) and sudden stratospheric warming events (e.g., Jung and Leutbecher 2007; Karpechko 2018) have also been the subject of recent Arctic forecast skill evaluations. Other examples of high-impact weather that have been studied are severe precipitation events at Svalbard (Hansen et al. 2014; Serreze et al. 2015) and maritime icing on vessels (Sultana et al. 2018; Samuelsen 2018).
The difference between a forecast value (grid box average from an NWP system) and a point observation can be decomposed into model, observation, interpolation, and representativeness errors (Kanamitsu and DeHaan 2011). The latter three components are nonnegligible for verification studies, in particular in the Arctic environment characterized by spatiotemporal sparseness and uncertainty in the observations (Casati et al. 2017). The observation uncertainty has been neglected in verification practices for several decades. As forecast capabilities improve, however, a larger part of the forecast-observation difference is due to observational uncertainty and representativeness mismatch rather than to model errors alone, in particular for short-range forecasts.
The Year of Polar Prediction (YOPP), with extra availability of observations and model simulations (Jung et al. 2016), is a great opportunity to improve our understanding of forecast capabilities in the Arctic. In this study we compare three high-resolution regional NWP systems and one global NWP system during the YOPP Special Observing Period Northern Hemisphere 1 (SOP-NH1, 1 February-31 March 2018) with focus on surface weather parameters. In addition, issues related to observational uncertainty are discussed to improve our interpretation of the verification results.
The NWP systems are briefly described in section 2, together with the observations and weather during YOPP SOP-NH1. The models are compared in terms of objective verification scores in section 3, including a discussion on some aspects of observation errors and representativeness issues. In section 4, two cases of high-impact weather are discussed before we summarize main findings in section 5.

a. NWP systems
The NWP systems included in the comparison are the high-resolution version of the global ECMWF Integrated Forecasting System (IFS-HRES) with 9-km grid spacing (Buizza et al. 2017) and the three regional convection-permitting NWP systems: AROME-Arctic with 2.5-km grid spacing (Müller et al. 2017; Bengtsson et al. 2017), the Canadian Arctic Prediction System (CAPS) with 3-km grid spacing (G. C. Smith et al. 2019, unpublished manuscript), and AROME with Météo-France setup (MF-AROME) with 2.5-km grid spacing (Seity et al. 2011). Apart from spatial resolution, the four forecast systems differ in their model formulations, initialization methods, and in lateral and surface forcing (details in Table 1). IFS-HRES and AROME-Arctic forecasts are taken from daily operational runs and include data assimilation. CAPS and MF-AROME have been set up as a dedicated effort during YOPP and are initialized from global models without direct assimilation of observations. Furthermore, AROME-Arctic and MF-AROME are both configurations of the same model system but use different parameterizations in the turbulence representation and for shallow convection, and in addition a sea ice model is used in AROME-Arctic. Despite their differences, they all provide short-range forecasts for a common domain covering northern Scandinavia, the Barents Sea, and Svalbard (Fig. 1) during YOPP SOP-NH1.

b. Observations
In this study, we use quality controlled observations from the Norwegian Meteorological Institute (MET Norway; eklima.met.no). The quality control system consists of both automatic and human quality control routines to flag or remove suspicious or erroneous observations (Kielland 2005). In this study we only use observations flagged as high-quality observations. Pressure and temperature observations are from instantaneous measurements, while 10-m wind speed is the mean wind over the last 10 min. Total cloud cover is visually observed, which has some implications for the verification process, for example, the observations represent a larger spatial area, are taken less frequently, and have different uncertainty characteristics than automatic cloud cover observations (Mittermaier 2012). Furthermore, most of the precipitation gauges have single-Alter shields (or are less shielded) implying an undercatch of solid precipitation (Rasmussen et al. 2012).
To stratify the verification we divide the observation sites into six regions (Fig. 1): Svalbard (14 stations), islands (3 stations), coast (40 stations), fjords (39 stations), inland (25 stations), and mountains (9 stations). The assignment of each station to a region is done subjectively by operational forecasters at MET Norway based on their knowledge about individual stations.
Over the open ocean, we utilize near-real-time data from the global Advanced Scatterometer (ASCAT) coastal wind product on a 12.5-km grid (Verhoef et al. 2012). The ASCAT wind products, provided by the EUMETSAT Ocean and Sea Ice Satellite Application Facility (OSI SAF), include a thorough quality control. We utilized only the data with the highest-quality flags. For the model comparison, the ASCAT data were reprojected on the intercomparison domain, and NWP model data were regridded onto a 12.5-km grid corresponding to the grid spacing of the ASCAT data.
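The regridding step above can be illustrated with a minimal sketch. The interpolation method used in the study is not specified, so the block averaging below is an assumption, as are the function name `block_mean` and the synthetic field; it only shows how a 2.5-km field could be coarsened by an integer factor of 5 to a 12.5-km grid.

```python
import numpy as np

def block_mean(field, factor):
    """Coarsen a 2-D field by averaging non-overlapping factor x factor blocks.

    Hypothetical helper: assumes the coarse grid is aligned with the fine grid
    and that the grid-spacing ratio is an integer (e.g., 12.5 km / 2.5 km = 5).
    """
    field = np.asarray(field, dtype=float)
    ny, nx = field.shape
    # Trim trailing rows/columns so both dimensions are divisible by the factor.
    ny_trim, nx_trim = ny - ny % factor, nx - nx % factor
    trimmed = field[:ny_trim, :nx_trim]
    blocks = trimmed.reshape(ny_trim // factor, factor, nx_trim // factor, factor)
    # nanmean skips missing values (e.g., points masked as NaN).
    return np.nanmean(blocks, axis=(1, 3))

# Synthetic 10 x 10 "2.5-km" field coarsened to a 2 x 2 "12.5-km" field.
fine = np.arange(100, dtype=float).reshape(10, 10)
coarse = block_mean(fine, 5)
```

Averaging rather than subsampling is chosen here so that the coarsened model field represents an area comparable to the scatterometer footprint; an operational implementation would also have to handle map projections and land or sea ice masks.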

c. Weather during YOPP SOP-NH1
February 2018 was dominated by high pressure systems over Scandinavia and low pressure activity in the Iceland-Greenland Sea, which led to a negative temperature anomaly over northern Scandinavia and warm anomalies over the ocean and at Svalbard (ECMWF Copernicus Climate Change Service, https://climate.copernicus.eu/). In March, the pressure patterns were less consistent, but on average a high pressure system was present north of Svalbard with a low pressure system northeast of Scandinavia organizing the advection of cold air southward over the Barents Sea. In March a positive temperature anomaly was only present in the northwestern part of the intercomparison domain. The sea ice concentration anomaly was negative for the entire domain and period.
The North Atlantic Oscillation (NAO) is the dominant mode of variability in the North Atlantic region from synoptic to interannual and decadal time scales (Woollings et al. 2015). The NAO index indicates that February was an unusual month. An NAO index of 1.58 is the fifth-highest value for all February months from 1950 to 2018 (NOAA Climate Prediction Center, https://www.cpc.ncep.noaa.gov/). An NAO index of −0.93 in March indicates a clear difference in weather during the two months. However, March (ranked as the fifteenth-lowest value of all March months) was not as extreme in terms of NAO as February.

Model intercomparison
In the following we first present a general overview of verification results before we focus on individual parameters. At the end of the section some aspects of observation, interpolation and representativeness errors are discussed.

a. General verification
Standard deviation of the error (SDE) and bias, averaged over all stations, for mean sea level pressure (MSLP), 2-m air temperature (T2), and 10-m wind speed (WS10) are presented in Fig. 2, with information about statistical significance. It is important to note that errors in Fig. 2 are not weighted and therefore do not represent the model domain average, but the average errors over the irregularly distributed observational network shown in Fig. 1. The initial MSLP SDEs are small for all forecast systems, but increase rapidly with lead time. For lead times shorter than +10 h AROME-Arctic has significantly smaller SDEs than IFS-HRES, while after +30 h the opposite is true. CAPS has significantly larger errors than the other models after +12 h; this is not found in the driving Canadian Global Deterministic Prediction System and is under investigation. While AROME-Arctic and IFS-HRES show a negligible bias, CAPS develops a positive bias after a few hours. Only forecasts initialized at 0000 UTC are included in the statistics, and the results indicate a small common diurnal cycle in SDE with a maximum error in the morning (+6 and +30 h).
The T2 forecasts show a large SDE already in the analysis (3°-4°C), and the increase with lead time is more moderate than for MSLP. Furthermore, a diurnal cycle in SDE is present with higher accuracy during daytime, in the presence of solar radiation and higher temperatures, while larger errors are found during nighttime (similar to MSLP). While AROME-Arctic and MF-AROME only have minor biases, CAPS and IFS-HRES show a diurnal cycle with a cold bias during daytime. Most of the differences seen between model performances are significant.
AROME-Arctic and MF-AROME show slightly lower SDE for WS10 than CAPS and IFS-HRES (only significant for the shortest lead times). The largest difference between the models is found in the biases, which for most lead times are statistically significant. AROME-Arctic and CAPS have negligible biases, while IFS-HRES and MF-AROME on average underestimate WS10 by ~1 m s⁻¹. Only a weak diurnal cycle is seen in WS10 biases (maximum underestimation during daytime). A short spinup time of CAPS WS10 from the initial conditions is seen.
TABLE 1 (fragment; surface scheme entries as recovered). ... (Balsamo et al. 2009; Dutra et al. 2010); SURFEX with 1-layer snow scheme; prognostic water equivalent, snow density, and surface albedo (Douville et al. 1995); ISBA with 1-layer snow scheme; prognostic water equivalent, snow density, and surface albedo (Noilhan and Planton 1989; Bélair et al. 2003); SURFEX with 1-layer snow scheme; prognostic water equivalent, snow density, and surface albedo (Douville et al. 1995).
For all three parameters, errors grow more slowly in IFS-HRES than in the three high-resolution models (i.e., the added value of high-resolution models is dependent on lead time). In the case of MSLP, which is a surface field but represents a vertically integrated quantity, this reflects the leading role of the IFS in terms of synoptic-scale dynamics (Haiden et al. 2018a). In the case of T2 and WS10, the apparently slower error growth actually results from larger errors already at initialization time in the IFS compared to the higher-resolution models, as can be seen in Fig. 2.
FIG. 1. Model integration domains: CAPS is employed inside the black frame, AROME-Arctic and MF-AROME are inside the blue frame, and IFS-HRES has global coverage. The model intercomparison area is inside the blue domain. Norwegian SYNOP observations used for verification are plotted as black (3 island stations), yellow (14 Svalbard stations), orange (40 coast stations), blue (39 fjord stations), green (25 inland stations), and red (9 mountain stations) circles. Not all stations observe all parameters. Shown in gray colors is the sea ice concentration from IFS-HRES 0000 UTC 1 Mar and in green/brown colors the model topography from AROME-Arctic.
Statistics averaged over all stations, as presented in Fig. 2, may hide important information. Figure 3 shows verification of MSLP, T2, WS10, daily precipitation (precip24), and total cloud cover (TCC; no observations available in mountain areas) forecasts for different regions (see section 2b for details). To give information about statistically significant differences between regions and forecast systems, 95% confidence intervals are calculated by bootstrapping (not shown). For MSLP, T2, WS10, and TCC these confidence intervals are 0.1 hPa, 0.14°C, 0.13 m s⁻¹, and 3.3%, respectively. Differences seen for these parameters are therefore mostly significant. For daily precipitation, the uncertainty is much higher due to fewer observations, and the differences are not all significant.
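The bias, SDE, and bootstrapped confidence intervals used above can be sketched as follows. This is a minimal illustration on synthetic data, not the verification code of the study: the resampling unit (individual forecast-observation pairs), the number of resamples, and the names `bias_and_sde` and `bootstrap_ci` are assumptions.

```python
import numpy as np

def bias_and_sde(forecast, observed):
    """Return (bias, SDE) of forecast minus observation.

    Bias is the mean error and SDE is the standard deviation of the error,
    so that RMSE**2 = bias**2 + SDE**2.
    """
    err = np.asarray(forecast, dtype=float) - np.asarray(observed, dtype=float)
    return float(err.mean()), float(err.std(ddof=0))

def bootstrap_ci(forecast, observed, statistic="sde", n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the bias or SDE.

    Assumption: forecast-observation pairs are resampled independently;
    serially correlated data would call for block resampling instead.
    """
    index = {"bias": 0, "sde": 1}[statistic]
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    rng = np.random.default_rng(seed)
    samples = np.empty(n_boot)
    for i in range(n_boot):
        pick = rng.integers(0, f.size, size=f.size)  # sample pairs with replacement
        samples[i] = bias_and_sde(f[pick], o[pick])[index]
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Synthetic T2 example: forecasts with a 0.5 degC warm bias and 2 degC random error.
rng = np.random.default_rng(42)
obs = rng.normal(-5.0, 4.0, size=500)
fc = obs + rng.normal(0.5, 2.0, size=500)
bias, sde = bias_and_sde(fc, obs)
sde_lo, sde_hi = bootstrap_ci(fc, obs, statistic="sde")
```

Two forecast systems can then be considered to differ significantly for a given score when their intervals do not overlap, which is a conservative reading of the intervals quoted in the text.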
The first feature to notice is the huge spread in forecast accuracy across regions, parameters, and models. Furthermore, no model is superior for all parameters and regions. IFS-HRES verifies consistently better for MSLP than AROME-Arctic and CAPS across regions. The inaccurate treatment of lateral boundary forcing in regional models is discussed, for example, by Warner et al. (1997) and Davies (2014) and may explain part of this behavior. Other possible explanations are better assimilation of large-scale weather in global models, tuning of global models with focus on synoptic cyclones (e.g., Sandu et al. 2013), more small-scale noise in higher-resolution systems, and for AROME-Arctic the use of 6-h older LBC from IFS-HRES. Furthermore, all models have a pronounced positive MSLP bias in mountain regions (and inland and in Svalbard) most likely to be attributable to the uncertainty in reduction of observations and/or forecasted pressure to MSLP (Pauley 1998).
The largest T2 errors are found inland, in mountains (IFS-HRES and MF-AROME), and at Svalbard (CAPS). The bias varies from −4°C (CAPS at Svalbard) to +1°C (MF-AROME at islands), while SDE varies from ~1°C (IFS-HRES at islands) to ~6°C (MF-AROME inland). The CAPS bias at Svalbard (stations at the coast and in fjords) is related to an unrealistic sea ice cover around Svalbard (not shown). In general, small forecast errors are found where sea surface temperatures, which most of the time are reasonably well described in the models, have a substantial influence (i.e., coasts, fjords, and islands). Also, WS10 biases vary substantially across regions and models from −3 m s⁻¹ in mountain regions (IFS-HRES and MF-AROME) to +1.5 m s⁻¹ at islands (AROME-Arctic and CAPS). In general, SDE can be expected to scale with wind speed itself and is therefore higher in windier regions. However, forecast accuracy of WS10 is not fully evaluated by SDE and bias, and other aspects will be discussed below. While precip24 scores vary across regions and models, some common significant features are low (high) SDE at islands (mountains and fjords) and a positive bias in mountain regions. In addition, AROME-Arctic and MF-AROME forecast less precipitation than CAPS and IFS-HRES (significant at the coast, fjords, and inland). Undercatch of solid precipitation in observations (Rasmussen et al. 2012) is a severe problem for precipitation verification at high latitudes and/or altitudes. This is not taken into consideration in Fig. 3 (but discussed below); hence, we suspect that the positive bias in the mountains is actually smaller, that the small positive bias at the coast and in the fjords for IFS-HRES and CAPS most likely will change to a negative bias, and that the underestimations of AROME-Arctic and MF-AROME are actually even more pronounced.
For TCC, forecast characteristics are more dependent on the forecast model than on the region. IFS-HRES has a smaller SDE than the other forecast systems, which at least partly can be attributed to manual observations representing a larger area, and a more binary cloud cover field in the high-resolution models. IFS-HRES has a positive bias and AROME-Arctic and MF-AROME have smaller biases, while CAPS has a negative bias partly related to a long spinup of cloud properties, which currently is under investigation.
Ideally, (gridded) high-resolution observation datasets are needed to evaluate spatial patterns in the forecasts. However, in this study we use point observations for verification. We therefore calculate the correlation between all observation sites for T2, WS10, TCC, and precip24. Correlations are then averaged in bins by the distance between the stations and plotted as variograms in Fig. 4 (Marzban et al. 2009). A rapid decorrelation with distance indicates stronger dominance of small-scale features. The observations of WS10 show a steep drop of correlation (approximately 0.35 after 100 km), followed by TCC (approximately 0.55 after 100 km), precip24 (approximately 0.6 after 100 km), and T2 (approximately 0.7 after 100 km). It should be noted that WS10, T2, and TCC are hourly data while precipitation is a daily total due to the limited availability of hourly precipitation data. A shorter accumulation period for precipitation would most likely reduce the spatial correlation. In general, IFS-HRES has a higher spatial correlation than the other models, which is expected due to the coarser horizontal resolution. Furthermore, none of the models is able to reproduce the very steep observed spatial decorrelation of WS10. For T2 and WS10, AROME-Arctic and MF-AROME are closest to the observed decorrelation, while CAPS best matches the observed curve for precip24 up to about 400 km. The lower spatial resolution of IFS-HRES shows up most clearly for precipitation. That the models find the small scales difficult to simulate is not surprising, given that the effective resolution of the models is even coarser than the model grid spacing (Skamarock 2004). For distances beyond ~650 km the correlations are mainly between the Norwegian mainland and Svalbard, where the forecasts underestimate (overestimate) the correlation of observed temperature (precipitation).
FIG. 4. Variograms showing spatial correlation between sites in observations and forecasts. Correlation as a function of distance between SYNOP sites is calculated, and the average over stations with similar distances is plotted. Observations are in green, IFS-HRES in red, AROME-Arctic in blue, CAPS in black, and MF-AROME in cyan. Parameters are 2-m air temperature, 10-m wind speed, daily precipitation, and total cloud cover.
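The distance-binned station correlations described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code behind Fig. 4: station positions are taken as planar coordinates in km (a real network would need great-circle distances), time series are assumed complete, and the function name `correlation_by_distance` and the bin edges are hypothetical.

```python
import numpy as np

def correlation_by_distance(series, x_km, y_km, bin_edges_km):
    """Average pairwise station correlation in distance bins.

    series: array of shape (n_times, n_stations), no missing values.
    x_km, y_km: planar station coordinates in km (assumption).
    Returns one mean correlation per bin; NaN where a bin holds no pairs.
    """
    series = np.asarray(series, dtype=float)
    x = np.asarray(x_km, dtype=float)
    y = np.asarray(y_km, dtype=float)
    corr = np.corrcoef(series, rowvar=False)  # (n_stations, n_stations)
    dist = np.hypot(x[:, None] - x[None, :], y[:, None] - y[None, :])
    upper = np.triu_indices(series.shape[1], k=1)  # count each station pair once
    pair_corr, pair_dist = corr[upper], dist[upper]
    which_bin = np.digitize(pair_dist, bin_edges_km) - 1
    n_bins = len(bin_edges_km) - 1
    mean_corr = np.full(n_bins, np.nan)
    for b in range(n_bins):
        selected = which_bin == b
        if selected.any():
            mean_corr[b] = pair_corr[selected].mean()
    return mean_corr

# Three synthetic stations: two close and identical, one distant and anticorrelated.
series = np.array([[1.0, 1.0, 4.0],
                   [2.0, 2.0, 3.0],
                   [3.0, 3.0, 2.0],
                   [4.0, 4.0, 1.0]])
binned = correlation_by_distance(series, [0.0, 10.0, 500.0], [0.0, 0.0, 0.0],
                                 bin_edges_km=[0.0, 100.0, 1000.0])
```

Applying the same function to observed and forecast time series at the station locations gives curves that can be compared directly, since both are sampled on the same irregular network.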

b. Temperature
All forecast systems struggle with T2 forecasts inland (Fig. 3). To better understand this problem we verify T2 stratified by TCC and WS10 (Fig. 5) and separate results into 0600, 1200, and 1800 UTC (when reliable cloud observations are available). T2 forecast errors increase in cloud-free conditions, while an increase in TCC reduces forecast errors. During calm conditions, a large spread in errors is seen as well as a positive bias, while errors are reduced in windy conditions, but with a small negative bias for all models. The fact that this negative bias is present throughout the day points to turbulent mixing rather than cloud effects; however, only IFS-HRES and MF-AROME show an underestimation of wind speed in cases where both observations and forecasts are >3 m s⁻¹. For both the TCC and WS10 stratifications, the forecast errors decrease in the presence of solar radiation and higher temperatures (i.e., smaller errors at 1200 UTC). The conditional verification has some uncertainties for a number of reasons: 1) the need to set specific thresholds for cloud-free, cloudy, calm, and windy situations; 2) there is no one-to-one relation between TCC and cloud radiative effect; 3) dependence on the weather development prior to the verification time; and 4) limited sample size in terms of station number and total number of pairs of observations. Nevertheless, the increase in T2 forecast error during calm, cloud-free conditions without the presence of solar radiation points toward issues in the representation of the stable boundary layer as a common problem for all forecast systems. Haiden et al. (2018b) have recently investigated this problem for the IFS. They found that in areas with persistent snow cover the nighttime drop of T2 in the model is underestimated due to the use of a single-layer snowpack representation. It distributes the surface cooling over the entire depth of the snow, thereby underestimating the speed and magnitude of the near-surface drop in snow temperature, which adversely affects T2 evolution. Furthermore, nighttime wind speeds near the surface tend to be too high in low-wind conditions, which contributes to a positive bias, as well as a higher RMSE, in T2.
FIG. 5. Conditional verification of T2 for inland stations. Box-and-whiskers plot of T2 errors (forecasted minus observed) conditioned by (top) TCC and (bottom) wind. Cloud-free is defined as TCC less than 30% and cloudy as TCC larger than 70%. Calm conditions are defined as WS10 less than 1.5 m s⁻¹ and windy conditions as WS10 larger than 3 m s⁻¹. Each box is divided into models (IFS-HRES in red, AROME-Arctic in blue, CAPS in black, and MF-AROME in cyan) and time of day. Number of cases is plotted at top, and outliers are omitted to increase readability in plots.
When the near-surface energy budget is determined by local processes, the representation of surface conditions becomes critical. We choose two days with cold temperatures and two days with more mixed conditions for reruns of MF-AROME starting from AROME-Arctic initial surface conditions.
In the original runs MF-AROME initial surface conditions are interpolated from the coarser-resolution global ARPEGE model, while AROME-Arctic performs its own surface analysis. The results (Table 2) show that the initial differences in the analysis explain almost the entire difference found in T2 errors (approximately a difference in SDE of 1 K in Fig. 3) between AROME-Arctic and MF-AROME. Table 3 shows that the height differences are substantial between model and actual station elevation for some regions and models. This contributes to the T2 difference between models and observations. However, no height correction between model and station height is applied in the verification process since this potentially can introduce errors and noise during stable conditions. Furthermore, the implementation of well-behaving height corrections during stable conditions is beyond the scope of this study, but will potentially reduce the errors (e.g., Sheridan et al. 2010).
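The conditional verification of T2 by cloud cover and wind speed discussed above can be illustrated with a short sketch. The thresholds follow the Fig. 5 caption (TCC below 30% or above 70%, WS10 below 1.5 m s⁻¹ or above 3 m s⁻¹); conditioning on the observed rather than the forecast TCC and WS10 is an assumption, as are the function name `stratify_t2_errors` and the sample values.

```python
import numpy as np

def stratify_t2_errors(t2_fc, t2_obs, tcc_obs, ws10_obs):
    """Bias and SDE of T2 errors (forecast minus observed) per weather class.

    tcc_obs in percent, ws10_obs in m/s. Classes follow the Fig. 5 thresholds;
    conditioning on observed values is an assumption of this sketch.
    """
    err = np.asarray(t2_fc, dtype=float) - np.asarray(t2_obs, dtype=float)
    tcc = np.asarray(tcc_obs, dtype=float)
    ws = np.asarray(ws10_obs, dtype=float)
    classes = {
        "cloud-free": tcc < 30.0,
        "cloudy": tcc > 70.0,
        "calm": ws < 1.5,
        "windy": ws > 3.0,
    }
    scores = {}
    for name, mask in classes.items():
        n = int(mask.sum())
        scores[name] = {
            "n": n,
            "bias": float(err[mask].mean()) if n else float("nan"),
            "sde": float(err[mask].std(ddof=0)) if n else float("nan"),
        }
    return scores

# Four synthetic inland cases: two clear and calm, two overcast and windy.
scores = stratify_t2_errors(
    t2_fc=[-10.0, -8.0, -5.0, -2.0],
    t2_obs=[-14.0, -9.0, -4.0, -1.0],
    tcc_obs=[10.0, 20.0, 80.0, 100.0],
    ws10_obs=[0.5, 1.0, 4.0, 6.0],
)
```

Cases with intermediate cloud cover or wind speed fall into no class, which is one reason the sample size per class becomes a limiting factor.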
The ability to forecast thawing conditions (T2 above 0°C) is assessed using traditional categorical scores evaluated from the contingency table: the equitable threat score (ETS) and frequency bias index (FBI). The ETS is an accuracy measure evaluated from the threat score = hits/(hits + false alarms + misses), which then is modified for hits obtained by a random forecast. The FBI assesses the bias in the forecasted frequency of an event [see Wilks (2011, chapter 8) or Jolliffe and Stephenson (2012) for more details]. Forecast skill varies across regions (Fig. 6), from highest skill at islands, decreasing via coast to fjords to inland stations, but with slightly higher skill in the mountains (temperature more decoupled from surface) and at Svalbard (coast and fjord stations). The low skill inland is consistent with the larger T2 errors there, due to the generally higher T2 variability away from the coasts, and lower representativeness due to small-scale terrain features. In general, AROME-Arctic shows a similar or slightly better performance than the other models for all regions, which at least partly can be explained by high-resolution surface analysis and better representation of the topography.
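The two categorical scores defined above can be written out as a short sketch using the standard contingency-table definitions. The function name `ets_and_fbi`, the strict "greater than" event definition, and the sample values are assumptions for illustration.

```python
import numpy as np

def ets_and_fbi(forecast, observed, threshold):
    """Equitable threat score and frequency bias index for the event value > threshold.

    ETS = (hits - hits_random) / (hits + false alarms + misses - hits_random),
    where hits_random is the number of hits expected from a random forecast.
    FBI = (hits + false alarms) / (hits + misses).
    """
    f_event = np.asarray(forecast, dtype=float) > threshold
    o_event = np.asarray(observed, dtype=float) > threshold
    hits = int(np.sum(f_event & o_event))
    false_alarms = int(np.sum(f_event & ~o_event))
    misses = int(np.sum(~f_event & o_event))
    total = f_event.size
    hits_random = (hits + false_alarms) * (hits + misses) / total
    denominator = hits + false_alarms + misses - hits_random
    ets = (hits - hits_random) / denominator if denominator != 0 else float("nan")
    observed_events = hits + misses
    fbi = (hits + false_alarms) / observed_events if observed_events else float("nan")
    return ets, fbi

# Thawing example (T2 above 0 degC): 3 hits, 1 false alarm, 1 miss, 3 correct negatives.
t2_forecast = [1.0, 2.0, -1.0, -3.0, 0.5, -2.0, 3.0, -4.0]
t2_observed = [0.5, -1.0, -2.0, -3.0, 1.0, 2.0, 1.0, -1.0]
ets, fbi = ets_and_fbi(t2_forecast, t2_observed, threshold=0.0)
```

In this example the event is forecast as often as it is observed (FBI of 1), while the ETS of 1/3 shows that only part of the hits exceed what a random forecast with the same frequencies would achieve. The same function applies to wind speed or precipitation by changing the threshold.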
Not all of the model differences show up in the objective verification since observations are not available for all areas. To supplement the evaluation of T2 we therefore show the average of hourly forecasts for day 2 during YOPP SOP-NH1 (Fig. 7). All models behave very similarly over the open ocean. However, over areas covered by sea ice (upper-left part of domain), CAPS and, in particular, MF-AROME have lower temperatures than IFS-HRES and AROME-Arctic. This suggests that these T2 differences are due to differences in the representation of sea ice (see section 2a). Further inland at Svalbard and in the mountainous areas at the border between Norway and Sweden, MF-AROME is clearly colder than the other models (e.g., the negative bias in the mountains seen in Fig. 3). A similar behavior is found in the Alps for MF-AROME due to an underestimation of cloud cover (Vionnet et al. 2016). In general, the high-resolution models forecast the lowest minimum temperatures and, as expected, have more small-scale details than IFS-HRES (see also Fig. 4).
TABLE 2. T2 errors inland for MF-AROME, for MF-AROME initialized by AROME-Arctic surface conditions, and AROME-Arctic. Errors are averaged over lead times (+1, +2, +3, . .

c. Wind speed
In addition to overall error metrics such as SDE, knowledge about forecast skill as a function of wind speed is of practical interest in the prediction of high-impact weather. This aspect is evaluated in Fig. 8 by the ETS and FBI obtained from a contingency table for different wind speed thresholds. The relative differences in skill between the forecasts are more pronounced for these metrics than in the SDE (Figs. 2 and 3). The frequencies of occurrence of the highest wind speeds are underestimated by CAPS, MF-AROME, and IFS-HRES, while AROME-Arctic is closer to the observed frequency. The skill (ETS) reflects the forecast climatologies, with AROME-Arctic scoring better than the other models, followed by CAPS, MF-AROME, and IFS-HRES. Large intermodel differences over land can be attributed to different representations of local processes; for example, AROME-Arctic applies a smaller surface roughness than MF-AROME. The benefit of higher spatial resolution for the prediction of high-wind events is shown by the low ETS values of IFS-HRES. Figure 9 again shows ETS and FBI for all lead times from +25 to +48 h, but this time against scatterometer-estimated wind speed in the Barents Sea (details in section 2). Forecasts are more similar and perform better than over land. However, when wind speeds exceed 12-13 m s⁻¹ the models start to diverge and IFS-HRES underestimates (MF-AROME and AROME-Arctic overestimate) the observed frequency. For wind speeds up to 12-13 m s⁻¹, AROME-Arctic and IFS-HRES have higher skill than MF-AROME and CAPS. Above 12-13 m s⁻¹, the relative skill of IFS-HRES compared with AROME-Arctic is reduced at the same time as IFS-HRES starts to underestimate the observed frequencies.
Since all forecast climatologies are quite similar over the ocean, we speculate that the higher skill of AROME-Arctic and IFS-HRES (<12 m s⁻¹) originates from more accurate initial conditions. In a case study in section 5a, this is further investigated by using initial conditions from AROME-Arctic in a MF-AROME run.
WS10 forecasts are more skillful over ocean than over land (Figs. 8 and 9) in spite of the added predictability that may be expected from topographic and coastline forcing. However, the representativeness of observations is an issue in the verification process, especially in complex terrain. Since the scatterometer-estimated wind speed represents a coarser resolution (grid size 12.5 km) over a relatively homogeneous ocean, we argue that differences in observation representativeness (discussed further in section 3f) explain a large part of the difference in ETS between land and ocean.
To get a more complete overview of forecast differences in wind speed, forecast averages during YOPP SOP-NH1 are shown in Fig. 10. Wind speed forecasts over the ocean show clear similarities, with slightly less wind in IFS-HRES and slightly more in AROME-Arctic and MF-AROME. Over sea ice areas the forecast systems are also very similar, but MF-AROME has slightly lower wind speed than the three other forecast systems. However, when comparing land areas we find large differences. In general, AROME-Arctic, followed by CAPS, forecasts windier conditions, most pronounced over Svalbard and in the mountain regions, which agrees with the objective verification. As for temperature, IFS-HRES shows smoother patterns than the high-resolution models. A closer inspection shows that CAPS also has smoother fields, in agreement with Fig. 4.

d. Precipitation
To assess the forecast capabilities for precip24 further, we use ETS and FBI (Fig. 11). The MF-AROME forecasts have a similar frequency of occurrence as the observations (FBI ~1) except for the highest precipitation amounts, while CAPS forecasts precipitation too frequently. AROME-Arctic overestimates the number of precipitation events but underestimates the frequency of events between 5 and 25 mm day⁻¹. The underestimated precipitation frequency originates mainly from coast and fjord regions (not shown). IFS-HRES produces small amounts of precipitation too frequently, which is a known problem, and underestimates the frequency of heavy precipitation. This is in part related to the coarser resolution of IFS-HRES, which means that the parameterization schemes represent the precipitation averaged over a wider area; this tends to generate trace amounts of precipitation and to reduce intense precipitation values. The forecast skill measured by ETS also reflects the forecast climatologies to some extent. MF-AROME and IFS-HRES score better than AROME-Arctic and CAPS. In general, forecast skill decreases for high-precipitation events.
Observations of solid precipitation are associated with a high uncertainty due to wind-induced undercatch (Rasmussen et al. 2012). The undercatch varies with the type of precipitation gauge, windshield configurations, and the weather itself. In this study, most of the precipitation gauges are Geonor rain gauges with single-Alter shields, and for 21 of them precipitation, temperature, and wind speed are measured hourly and the undercatch of solid precipitation can be estimated. We use Eq. (4) in Kochendorfer et al. (2017), Eq. (13) in Wolff et al. (2015), and Eq. (4) in Smith (2007) to adjust the observed precipitation. Figure 12 shows the accumulated precipitation from YOPP SOP-NH1, averaged over these 21 sites, from the four model systems, from the raw measurements, and from the adjusted measurements. The precipitation is divided into rain, mixed precipitation, and solid precipitation by temperature thresholds. CAPS and IFS-HRES have more precipitation than the AROME models (as in Fig. 3), but all models slightly overestimate the raw measurements. Despite spread between the adjusted precipitation estimates, all models clearly underestimate the adjusted mixed-phase and solid precipitation. The possible underestimation is so large that it raises a question about the adjustment of the observations. However, at Šihččajávri (68.7550°N, 23.5369°E) two estimates of accumulated precipitation during YOPP SOP-NH1 are available. One is based on a precipitation gauge and one is derived from changes in observed surface snow water equivalent, provided by the Norwegian Water Resources and Energy Directorate. While the precipitation gauge-based estimate gives 19.7 mm, the increase in snow water equivalent gives 59.0 mm, indicating a substantial underestimation by the precipitation gauge in support of the adjusted accumulations in Fig. 12.
Ideally, the verification with adjusted precipitation should have included other metrics than only the accumulated precipitation (e.g., skill scores). However, in single cases the undercatch is also influenced by particle shape, fall speed, and other microphysical properties in such a way that unrealistic errors will be introduced in skill verification. The adjustment algorithm therefore performs best averaged over many cases and is most appropriate for the estimation of systematic errors.

The gross features of forecasted spatial precipitation patterns are similar for all models (Fig. 13). All forecasts show maximum precipitation over steep topography at Svalbard, along the Norwegian coast and mountains, and at Nova Zemlja. However, the amplitude of the precipitation differs between models; that is, the high-resolution models produce higher maxima connected to the topography than IFS-HRES, which has smoother precipitation (in agreement with Fig. 4). Another difference is that AROME-Arctic and MF-AROME have less precipitation over the ocean (and coast and fjords) than the other models. Note also that IFS-HRES has slightly more precipitation in sea ice-covered areas, which may be important, for example, when forcing sea ice models.

e. Total cloud cover
The large-scale spatial patterns of TCC are similar in all forecast systems, but regional differences are found in the forecast climatologies (Fig. 14). All forecasts agree on a cloudy atmosphere over the ocean, but CAPS has a lower TCC, while MF-AROME and IFS-HRES have a higher TCC. A noticeable difference in total cloud cover between the two AROME models is expected due to differences in their turbulence schemes (Bengtsson et al. 2017). Another noticeable feature is the maximum in cloud cover from IFS-HRES on the east side of the mountains at the border between Norway and Sweden, which is not seen in the high-resolution models. The differences in forecasted cloud cover call for further investigation with more appropriate cloud observations (e.g., satellite-based measurements), which is beyond the scope of this intercomparison study.

f. Observation, interpolation, and representativeness errors
The difference between forecasts and observations can be divided into model, observational, interpolation, and representativeness errors (Kanamitsu and DeHaan 2011). The actual performance of NWP systems will become apparent only by taking the latter three components into consideration. In particular, for short-term forecasts with relatively small forecast errors, all components contribute significantly. We have tried to minimize the observational error by employing quality-controlled observations and taking the undercatch of precipitation into account.
A station measurement for MSLP, T2, WS10, and precip24 represents a point-observation, which differs from what the gridbox value in a NWP system represents. This is due to subgrid phenomena (e.g., small-scale precipitation) and local effects, which cannot be reproduced by the model. Some representativeness issues are therefore present. To estimate these we include a simple example based on the approach of Göber et al. (2008). If several observations exist within a model grid box their average is assumed to represent an approximation of the grid box mean and will be treated as a ''perfect'' forecast.

FIG. 12. Accumulated precipitation (estimated by temperature thresholds; rain in red, sleet in black, and solid precipitation in blue) for AROME-Arctic, CAPS, IFS-HRES, and MF-AROME with lead times from +25 to +48 h, observed precipitation from Geonor rain gauges with single-Alter shields, and observed precipitation corrected with Wolff et al. (2015), by Kochendorfer et al. (2017), and by Smith (2007). The accumulated precipitation amounts are averaged over 21 stations.
However, the perfect forecast will not get perfect scores (e.g., SDE will not be 0 unless all observations are the same apart from constant differences), and the resulting error can be regarded as the representativeness error between a point and grid box average. Due to the sparse observational network a general estimate is difficult to establish. However, the two stations Tromsø (69.6536°N, […] Table 4 we verify a perfect forecast constructed by averaging these two observations and compare with the 4 NWP systems verified for the same observation sites. The representativeness part, estimated by the perfect forecast error divided by the NWP forecast error, is relatively small for MSLP (6%-11%), but higher for T2 (19%-35%), WS10 (36%-42%), and precip24 (15%-20%). Note that these are conservative estimates (for this kind of coastal location) since two stations are insufficient for generating a true grid box average. If the results from this example are more generally valid they can explain parts of the large (small) initial errors for WS10 and T2 (MSLP) in Fig. 2, supported by Fig. 4 showing the rapid spatial decorrelation of wind speed. In addition, the better verification scores over ocean than over land discussed in section 4c can also be explained by representativeness issues. Haiden et al. (2012) have used the ''perfect forecast'' approach to estimate the effect of representativeness on precipitation scores such as ETS and FBI for a grid spacing of 25 km in Europe. They obtained a maximum achievable ETS around 0.75 and an FBI of 1.05. In summary, NWP forecasts perform better than the first impression given by verification statistics, and interpreting NWP output as point forecasts leads to scale mismatch effects that need consideration.
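The ''perfect forecast'' estimate described above can be sketched as follows. The sketch rests on several assumptions that are not stated in the text: SDE is computed as the population standard deviation of forecast minus observation, the errors from the two stations are pooled into one sample, and the model series are the forecast values matched to each station. All function and variable names are hypothetical.

```python
import math


def sde(forecast, observed):
    """Standard deviation of the forecast-minus-observation error (population form)."""
    errors = [f - o for f, o in zip(forecast, observed)]
    mean_error = sum(errors) / len(errors)
    variance = sum((e - mean_error) ** 2 for e in errors) / len(errors)
    return math.sqrt(variance)


def perfect_forecast(obs_a, obs_b):
    """Average of two co-located station series, used as a proxy for the grid box mean."""
    return [(a + b) / 2 for a, b in zip(obs_a, obs_b)]


def representativeness_share(obs_a, obs_b, model_a, model_b):
    """Ratio of the perfect-forecast SDE to the model SDE, pooled over both stations."""
    perfect = perfect_forecast(obs_a, obs_b)
    pooled_obs = obs_a + obs_b  # assumption: both stations pooled into one sample
    sde_perfect = sde(perfect + perfect, pooled_obs)
    sde_model = sde(model_a + model_b, pooled_obs)
    return sde_perfect / sde_model
```

Because a constant offset between forecast and observation does not contribute to SDE, the perfect forecast scores 0 only when the two stations differ by a constant, consistent with the remark above. The ratio returned by `representativeness_share` corresponds to the percentages quoted for MSLP, T2, WS10, and precip24, under the stated assumptions.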
To estimate the sensitivity of the results to the interpolation method we calculate the root-mean-square error (RMSE) by using nearest grid point (used in all verification above) and bilinear interpolation methods. For MSLP the changes are negligible (less than 0.5%), while bilinear interpolation reduces errors for T2 (less than 4%), WS10 (less than 3%), precip24 (less than 2%), and TCC (less than 2%). Furthermore, we also upscale the three high-resolution models to a grid spacing comparable to IFS-HRES. In general, the RMSE changes by less than 5% with a few exceptions. For T2 the errors are reduced (in particular the nonsystematic part) by ~10% at the coast and island stations. An interpretation is that the high-resolution models have too sharp temperature gradients along the coast and a smoother field reduces the number of large errors. On the other hand, in the fjords and inland the systematic T2 error increases by 6%-7%. The interpretation is that the upscaling creates an undesirable mix of characteristics (e.g., fjords, valleys, mountains) in these areas. For WS10 (TCC) the errors increase (decrease) by less than 5%. Daily precipitation scores improve with upscaling inland (6%), while decreasing in fjords (5%).
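The two interpolation methods compared above can be illustrated with a minimal sketch. It assumes a regular grid with unit spacing, a station position given in fractional grid-index coordinates (x along columns, y along rows) that lies inside the grid, and a grid of at least 2 × 2 points. Operational verification works on projected latitude-longitude grids, so this is an illustration of the principle only, with hypothetical function names.

```python
import math


def nearest_value(grid, x, y):
    """Value at the grid point closest to the fractional index position (x, y)."""
    i = min(max(int(math.floor(y + 0.5)), 0), len(grid) - 1)
    j = min(max(int(math.floor(x + 0.5)), 0), len(grid[0]) - 1)
    return grid[i][j]


def bilinear_value(grid, x, y):
    """Bilinear interpolation from the four grid points surrounding (x, y)."""
    j0 = min(max(int(math.floor(x)), 0), len(grid[0]) - 2)
    i0 = min(max(int(math.floor(y)), 0), len(grid) - 2)
    tx = x - j0  # fractional distance along the row
    ty = y - i0  # fractional distance along the column
    top = (1 - tx) * grid[i0][j0] + tx * grid[i0][j0 + 1]
    bottom = (1 - tx) * grid[i0 + 1][j0] + tx * grid[i0 + 1][j0 + 1]
    return (1 - ty) * top + ty * bottom


def rmse(forecast, observed):
    """Root-mean-square error of paired forecast and observed values."""
    squared = [(f - o) ** 2 for f, o in zip(forecast, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

Applying `rmse` once to the nearest-grid-point values and once to the bilinear values at all stations gives the kind of relative change reported above. Bilinear interpolation smooths the field between grid points, which is one plausible reason why it slightly reduces errors for parameters with strong local gradients.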

High-impact weather case studies
To supplement the summary verification, we look in more detail at two high-impact cases during YOPP SOP-NH1: 1) a mesoscale low pressure system in the Barents Sea and 2) a severe precipitation event at Svalbard.

a. Mesoscale low pressure system in the Barents Sea
In a southerly flow, a mesoscale disturbance with deep convection (Fig. 15a) and strong winds (Fig. 15b) developed south of the sea ice edge in the Barents Sea on 24 March 2018. Based on model analyses (a small spread between models exists) the low was located east of Bear Island at 1200 UTC (marked with L in Figs. 15a,b). All NWP systems develop a mesoscale disturbance with 24-h lead time (Figs. 16a-d) and also 48 h ahead (not shown). However, wind speed, minimum pressure, and location vary. IFS-HRES forecasts (Fig. 16a) are less intense (higher minimum pressure and less windy) compared to the high-resolution models (Figs. 16b-d). The high-resolution models forecasted 25 m s⁻¹ (AROME-Arctic), 24 m s⁻¹ (MF-AROME), and 22 m s⁻¹ (CAPS) as maximum wind speed, while the maximum wind speed in ASCAT measurements is 22 m s⁻¹. In comparison, IFS-HRES forecasted a maximum wind speed of 20 m s⁻¹. At the Bear Island meteorological station (74.5°N, 19.0°E; marked as a red circle in Figs. 15 and 16) the maximum observed wind speed during the day is 19 m s⁻¹ compared to 16 m s⁻¹ from IFS-HRES, 18 m s⁻¹ from AROME-Arctic and CAPS, and 19 m s⁻¹ from MF-AROME. The observed wind speed stays close to its maximum for 6 h. This duration, together with good timing of the maximum wind, is also seen in all models with the exception of MF-AROME, which only gives a wind speed peak for 1 h. At Bear Island the minimum pressure in all forecasts is almost identical, about 2 hPa higher than observed.
The location of the mesoscale disturbance is similar in IFS-HRES forecasts for +24 and +48 h (50-100-km misplacement). However, the location of the system varies more with lead time in the high-resolution models (not shown). A closer inspection of the wind pattern of MF-AROME (Fig. 16d) indicates a significant change in location compared with IFS-HRES, AROME-Arctic, and CAPS forecasts and available observations and analysis. To investigate this further, reruns of MF-AROME with initial conditions from AROME-Arctic were performed. Only changing the initial surface conditions (Fig. 16e) did not improve the location of the mesoscale disturbance. However, additionally changing the upper-air initial conditions in MF-AROME by using analysis from AROME-Arctic (Fig. 16f) improved the low pressure position significantly (misplacement reduced from approximately 230 to 90 km).

TABLE 4. […] Göber et al. (2008) and for IFS-HRES, AROME-Arctic, CAPS, and MF-AROME during YOPP SOP-NH1. The last row shows the percentage of SDE from perfect forecast for the model with lowest/highest error.
In this case all forecast systems simulate the mesoscale low pressure system. The benefit of IFS-HRES was more consistent forecasts of location for different lead times, while the high-resolution models better captured the highest wind speeds, in agreement with earlier studies (e.g., McInnes et al. 2011). A poor location of the system in the +24-h forecast from MF-AROME was drastically improved by changes in the initial conditions.

b. Precipitation event at Svalbard
On 26 February 2018, a high pressure system over northern Scandinavia and a low pressure system west and north of Svalbard provided favorable conditions for the transport of heat and moisture (mainly below 800 hPa) toward Svalbard. This type of atmospheric large-scale setup is responsible for a majority of the high-impact precipitation events (rain on snow) at Svalbard, which have a substantial impact on infrastructure, society, and wildlife (Serreze et al. 2015; Hansen et al. 2014). The maximum precipitation measured was 61.0 mm in 36 h at Ny-Ålesund (marked with A in Fig. 17). This might seem small compared to midlatitude extreme values, but 46.0 mm (measured in the first 24 h of the period) was the fourth-largest daily accumulated precipitation amount between August 2008 and August 2018. In addition, METAR temperature observations indicate that the majority of the precipitation was rain on frozen ground. Already on 28 February the daily mean temperature was close to −10°C and stayed below −10°C for the next two weeks, maintaining the surface ice conditions. Precipitation forecasts for the Svalbard area (36-h accumulations) are shown in Fig. 17. All forecasts have a general agreement with the observations (Table 5) in that the highest precipitation amounts are in the northwest of Svalbard (point A; Ny-Ålesund), but which model is closest to observed values varies between observation sites (Table 5). Furthermore, the high-resolution models have more spatial details and higher maximum values than IFS-HRES (Fig. 17). However, the local details are difficult to verify due to the lack of observations. One exception is the area around Longyearbyen (points B, C, and D), where there are sharp gradients in the observations from Platåberget, 450 m MSL (point B) 13.4 mm, Svalbard Airport (point C) 17.2 mm, and Adventdalen (point D) 2.8 mm.
MF-AROME was able to capture some local differences with forecasts between 4.1 and 21.5 mm (36 h)⁻¹ in the same area (see reduced precipitation in Adventdalen east of points B, C, and D). It should be noted that even if local maximum precipitation values are higher in the high-resolution forecasts, the average precipitation over the entire Svalbard archipelago is 18%-26% higher in IFS-HRES. It is important to correctly forecast precipitation type in these situations. Since direct observations of precipitation type are rare in time and space we use 2-m air temperature as a proxy. Evidently, such a proxy has limitations since it neglects information about the temperature and humidity profile. Averaged over the 36-h precipitation accumulation period the forecasts have negative temperature biases: IFS-HRES, −2.0°C; AROME-Arctic, −1.4°C; CAPS, −1.8°C; and MF-AROME, −2.3°C, indicating too much solid precipitation and too little rain. If we assume that the precipitation will be rain when the temperature exceeds +1°C (Jennings et al. 2018), we find that the forecasts suggest that 70% (AROME-Arctic), 16% (IFS-HRES), 5% (CAPS), and 43% (MF-AROME) of the precipitation fell as rain at the observation sites. However, the METAR observations in Ny-Ålesund and Longyearbyen indicated rain for most of the period, and if we replace the forecasted temperature with observed temperature and keep the 1°C threshold we get approximately 80% as rain.
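The temperature-proxy estimate of the rain fraction can be sketched as below. The sketch assumes paired series of precipitation amounts and 2-m temperatures at the same time steps and counts an amount as rain only when the temperature is strictly above the +1°C threshold; the time resolution and the handling of values exactly at the threshold are assumptions, and the function name is hypothetical.

```python
def rain_fraction(precip, temperature, threshold_c=1.0):
    """Fraction of accumulated precipitation falling while T2 exceeds the threshold."""
    total = sum(precip)
    if total == 0:
        return float("nan")
    # Count an amount as rain only when the paired 2-m temperature is above the threshold.
    rain = sum(p for p, t in zip(precip, temperature) if t > threshold_c)
    return rain / total
```

Evaluating the same precipitation series once with forecast temperatures and once with observed temperatures reproduces the comparison made above: a cold temperature bias of about 2°C can move a large part of the precipitation from rain to solid precipitation even when the precipitation amounts themselves are unchanged.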
In summary, the potential added value of the high-resolution models for this case is associated with higher maximum precipitation and a redistribution of the precipitation patterns (forced by topography). In addition, the high-resolution models have the potential to improve precipitation type in complex terrain, compared to IFS-HRES.

Summary
In this study, short-range forecasts from one global (IFS-HRES) and three regional NWP systems (AROME-Arctic, CAPS, and MF-AROME) are compared in the European Arctic (Fig. 1). The model intercomparison seeks to establish a baseline or reference for Arctic forecasting capabilities of near-surface parameters as suggested by Jung et al. (2016). The forecast systems differ in model formulation, resolution, initialization methods, and lateral boundary forcing (Table 1). IFS-HRES and AROME-Arctic are operational systems (i.e., real-time multiple daily runs), which include data assimilation, while CAPS and MF-AROME are specific contributions to YOPP and initialized from global models. The lateral boundary conditions for the three regional systems are taken from different global forecast systems; IFS-HRES forces AROME-Arctic, GDPS forces CAPS, and ARPEGE forces MF-AROME. Differences in forecast characteristics, weaknesses, and strengths therefore can have a variety of sources which are not always easy to pinpoint. The comparison is performed for YOPP SOP-NH1, a winter period with availability of extra radiosondes in the Arctic which are expected to improve the actual forecast skill. The period includes a range of large-scale flow configurations and periods with both positive (February) and negative NAO values (March).

FIG. 16. MSLP and 10-m wind speed forecasts with +24-h lead time for (a) IFS-HRES, (b) AROME-Arctic, (c) CAPS, (d) MF-AROME, (e) MF-AROME with surface initial conditions from AROME-Arctic, and (f) MF-AROME with surface and upper-air initial conditions from AROME-Arctic. Notice that MSLP is not available from MF-AROME. The red circle indicates Bear Island.
Forecast accuracy varies across regions, parameters, lead times, and NWP systems, and no NWP system is superior to the other systems in all aspects. However, compared to the other models, AROME-Arctic has the advantage of surface and upper-air assimilation (as IFS-HRES), high horizontal resolution (as MF-AROME and CAPS), and model development with a focus on the specific area of this comparison. These advantages are reflected in the verification, where AROME-Arctic on average performs better than the other models.
There is a general agreement between models on the larger-scale patterns of average cloud cover, temperature maxima, and wind speed maxima over ocean areas; temperature minima over sea ice; and precipitation maxima connected to topographic and coastal forcing. IFS-HRES verifies best regarding MSLP, but all systems are in good agreement with observations (SDEs less than 1 hPa initially and 2 hPa or less after +48 h). Larger differences between forecasted and observed MSLP are mainly found in mountain areas, where it is problematic to reduce surface pressure to mean sea level as shown by Pauley (1998).
Several common model deficiencies are noted, although their magnitude varies between the different NWP systems. Problems associated with T2 forecasts inland in cloud-free and calm conditions during nighttime, related to the representation of the stable boundary layer, are well known and studied (e.g., Sandu et al. 2013; Haiden et al. 2018b; Esau et al. 2018). In contrast, all models show a cold bias under windy conditions. Another common deficiency is the low skill in distinguishing between freezing and nonfreezing conditions inland, which is important for Arctic infrastructure, society, and wildlife (Hansen et al. 2014). For wind speed forecasts, the models find it difficult to reproduce the high spatial variations of WS10 over land, and the high-resolution models generally forecast more wind than IFS-HRES [e.g., as seen in DuVivier et al. (2017) and Walsh et al. (2007)]. However, in particular over the ocean, the skill is not necessarily improved by finer horizontal resolution alone (similar to results from Kalverla et al. 2019). Furthermore, adjusting for the undercatch of solid precipitation in observations reveals that most likely all forecast systems have too little precipitation in the area studied. This is an important finding because this feature is not apparent if undercatch in observations is not considered in the verification process (which often is the case).
For near-surface weather parameters (i.e., T2, WS10, precipitation) there are also several examples of differences in local forecast skill between NWP systems; for example, a cold bias is found related to overestimation of sea ice in the surroundings of Svalbard for CAPS, while AROME-Arctic has a pronounced underestimation of precipitation at the coast and fjords (still under investigation). Furthermore, over land, IFS-HRES and MF-AROME underestimate the wind speed. In particular, at higher elevations (e.g., mountains, the inner part of Svalbard) the two models have less wind than AROME-Arctic, which on average has the highest wind speeds (Fig. 10). In addition, wind forecasts over ocean from MF-AROME and CAPS are less accurate than those from IFS-HRES and AROME-Arctic. The sensitivity to initial conditions is investigated in a rerun of MF-AROME. The original initial conditions (dynamical adaptation from the global ARPEGE model) are replaced with initial conditions from the AROME-Arctic data assimilation. For a case study with a mesoscale low in the Barents Sea the new initial conditions improve the location of the mesoscale low by 140 km. Similar runs with initial surface conditions from AROME-Arctic in MF-AROME reduce the MF-AROME T2 errors to the same level as AROME-Arctic and highlight the importance of surface assimilation, as also shown in Randriamampianina et al. (2019). The forecast climatologies also reveal that there are differences that are not evaluated in this study due to the sparseness of observations. This includes differences over areas covered by sea ice (e.g., T2, TCC, and precip24), ocean areas (TCC, precip24), and inland and mountain areas at Svalbard (e.g., WS10 and T2). A comparison of forecasted TCC with satellite-based TCC estimates would be a natural extension of this work, together with the use of available field campaign, ship, and buoy data over the sea ice and ocean.
Regional high-resolution models can add value compared to global models by using finer resolution and domain-tailored process representations (Jung et al. 2016). In this study, the added value of the high-resolution models compared to IFS-HRES is most pronounced and significant for WS10 and T2 in regions with complex terrain and coastlines, as also found in numerous other studies (e.g., Rummukainen 2016; Schellander-Gorgas et al. 2017). In contrast, in this study the added value is negligible or negative for some parameters and regions, for example, MSLP and total cloud cover in general, and for temperature and wind speed at islands. In addition, it is shown that the errors grow faster in the high-resolution models, indicating that the added value of high-resolution models depends on lead time.
In polar regions, the limited availability of reliable observations is one of the greatest challenges in the verification process (Casati et al. 2017). Furthermore, verification often compares grid box values with point observations. It is important to acknowledge that differences between forecasts and observations arise from observation, interpolation, and representativeness errors in addition to model errors. In this study, it was found that observation errors and representativeness issues contribute substantially to the difference between forecasted and observed WS10, T2, and (solid) precipitation. We found large initial errors for WS10 (SDE ~2.5 m s⁻¹) and T2 (SDE ~3°C) indicating observation representativeness issues. In addition, an example from two observation sites situated close to each other shows that the subgrid variability, even for high-resolution models, for this particular example contributes a large part of the difference between predicted and observed WS10 (~40%), T2 (~25%), and daily precipitation (~15%). Furthermore, more skillful WS10 forecasts are seen over ocean (against ASCAT data) than over land (against SYNOP), which may be due to representativeness issues of wind observations (Wieringa 1996). As the forecast systems improve, and in particular for short-range forecasts, it is important to quantify and understand all error components and interpret results accordingly.