A Performance Baseline for the Representation of Clouds and Humidity in Cloud‐Resolving ICON‐LEM Simulations in the Arctic

In the context of Arctic amplification many of the feedback mechanisms, decreasing or enhancing the warming, involve clouds and water vapor. Currently, there is a gap in understanding the role of clouds which leads to uncertainties in climate simulations. Modeling frameworks such as the ICOsahedral Non‐hydrostatic model (ICON) are used to understand the Arctic atmospheric processes as well as predict future changes. In this study, we challenge ICON in the large‐eddy setup (ICON‐LEM) by performing cloud‐resolving simulations over parts of Svalbard, including Ny‐Ålesund. We ran daily simulations over 5 months and analyzed the column above Ny‐Ålesund. The local supersite's observations enabled us to create a baseline for the model performance focusing on the representation of liquid water and water vapor. We narrow in on possibilities to improve the cloud microphysical representation based on statistical evaluations, not just single cases. We found an astonishing agreement between most of the analyzed variables. For instance, the model integrated water vapor showed only a low bias of 0.21 kg m−2. The number of cloudy days is slightly higher in the model (+4%). Further, we found that the model produces an unrealistically high number of pure ice clouds. Small to medium precipitation events are similar in amount and time but the number of strong precipitation events is underestimated. Further results are discussed and show that ICON‐LEM is a useful tool to study the Arctic. With this thorough analysis, we highlight the value of local cloud‐resolving simulations to understand changes in the Arctic atmosphere.

Feedback mechanisms, which may either increase or decrease the warming, are manifold and their impacts and interactions are difficult to quantify (Goosse et al., 2018). Several of these feedback mechanisms involve water vapor and clouds. Clouds impact the radiation budget, and changes in cloud cover, optical thickness, or phase composition can have both warming and cooling effects. Current climate models suggest that the cloud feedback is possibly of smaller importance (Block et al., 2020; Pithan & Mauritsen, 2014). However, these models are still prone to large uncertainties in the representation of clouds, which also result in large uncertainties in climate prediction. These uncertainties stem from insufficient process understanding and the misrepresentation of clouds in models, and they make climate predictions vary widely (Schneider et al., 2017). Kay et al. (2016) discuss the connection of these knowledge gaps with the challenges of accurately measuring cloud properties on large and small scales.
Water vapor is another crucial variable and there is a clear positive feedback of water vapor versus the downward longwave radiation flux (Ghatak & Miller, 2013). Long-term trend analyses, such as in Rinke et al. (2019) which showed the consistent moistening of the Arctic over two decades, are important from a climate perspective. But measuring water vapor with high spatiotemporal resolution in the Arctic is difficult. Satellite retrievals provide helpful data but are biased and can show large differences depending on the product (Crewell et al., 2021). Dedicated campaigns in the Arctic such as Multidisciplinary drifting Observatory for the Study of Arctic Climate (Shupe et al., 2022), Arctic CLoud Observations Using airborne measurements during polar Day (Wendisch et al., 2019) or Surface Heat Budget of the Arctic Ocean (Uttal et al., 2002), provide more detailed data with the disadvantage of being limited in area and time. On the other hand, supersites, such as Ny-Ålesund (Svalbard), provide long-term and continuous observations of the Arctic atmosphere. This is necessary to establish changes in the local climate. A study by Maturilli and Kayser (2017) for instance, analyzing the radiosonde measurements from Ny-Ålesund for two decades, showed the increase of integrated water vapor (IWV) during the winter, which generally is the season with the lowest humidity.
In light of the lack of high-quality observations in the Arctic, models can help to support the analysis and close some of the gaps. One of these models is the ICOsahedral Non-hydrostatic model (ICON), which has been primarily developed for the mid-latitudes. This model consists of a non-hydrostatic dynamical core and can be used with different modules for climate simulations, numerical weather prediction, and large-eddy simulations (Dipankar et al., 2015). For the mid-latitudes, ICON is a reliable and tested model although, as in other models (Morrison et al., 2020), clouds and precipitation remain persistent challenges. This could also be seen in the highly resolved ICON-LEM simulations performed over Germany by  and Heinze et al. (2017). Deficiencies in the microphysics schemes implemented in ICON have been pinpointed in other studies as well (Karrer et al., 2021; Kretzschmar et al., 2020; Ori et al., 2020). As interest in the Arctic environment has been growing, ICON has been used for the higher latitudes in different setups for case studies and campaigns (Bresson et al., 2022; Gruber et al., 2019; Wendisch et al., 2019).  performed high-resolution simulations using ICON-LEM for several days focusing on a small domain around Ny-Ålesund. They showed that the model can capture local humidity variability and is capable of representing the mixed-phase clouds, although deficiencies in the microphysical parameterization are also discussed. They further explored the benefits of using resolutions up to 75 m. The work presented here can be seen as a more detailed extension of this.
It is clear that to better understand and predict the atmospheric processes in the Arctic, we must know how well the models can represent them. Only then can we effectively work on improving the models beyond single cases. ICON has not yet been evaluated for a large set of cases covering a diverse range of synoptic events in the Arctic. For this reason, we set up a semi-operational workflow and ran daily cloud-resolving simulations for a complex domain in the Arctic. The setup is described in Section 2.1. Using the supersite location Ny-Ålesund, we were able to exploit a variety of observational data (Section 2.2). The results of the evaluation using 5 months of these daily simulations can be found in Section 3. First, we evaluate the performance of the model by exploring the impact of the topography on the flow (Section 3.1) and then focus on the typical modeling weak spots: clouds, humidity, and precipitation (Sections 3.2-3.5).

The ICON Model and Simulation Setup
The ICOsahedral Non-hydrostatic (ICON) unified modeling framework was originally developed by the German Weather Service and the Max Planck Institute for Meteorology. We used the ability of the model to run as a numerical weather prediction model (ICON-NWP) as well as a large-eddy model (ICON-LEM). This enables consistency throughout the workflow as the dynamical core of ICON is the same for both versions. The differences lie in the parameterizations of sub-grid scale processes such as turbulence and cloud fraction (Dipankar et al., 2015). Further, in ICON-NWP we use 150 vertical levels versus 100 levels in ICON-LEM, as the higher horizontal resolution of ICON-LEM is computationally more expensive. Both setups use the hybrid Smooth LEvel VErtical coordinates SLEVE (Leuenberger et al., 2010; Schär et al., 2002), with the layer thickness for ICON-LEM increasing from 20 to 140 m in the lowest 3 km. One reason why Ny-Ålesund is challenging is that the topography plays an important role in the mountainous area of Svalbard. Following the assumption that the resolution is fine enough to resolve the impact of the orography on the flow, the parameterization of sub-grid scale orographic drag is turned off. The radiation is computed using the Rapid Radiative Transfer Model (Barker et al., 2003; Mlawer et al., 1997) in both setups. Continuing with further differences, the cloud-resolving grid in ICON-LEM makes it possible to use an all-or-nothing scheme for the cloud cover. In ICON-NWP this is computed using a diagnostic PDF (Köhler et al., 2011). The convection in ICON-NWP still relies on the parameterization by Tiedtke (1989) and Bechtold et al. (2008), whereas the convection parameterization is turned off in ICON-LEM. Fundamental to the value of the large-eddy simulations is the 3D Smagorinsky scheme used for turbulent diffusion in comparison to the prognostic TKE computation implemented for ICON-NWP. Further, the use of the two-moment cloud microphysics scheme by Seifert and Beheng (2006) enables a better representation of the cloud micro-scale processes in ICON-LEM in contrast to the single-moment scheme used for ICON-NWP.
Our workflow consisted of two simulations for each day. The first simulation, with 2.4 km resolution, used ICON-NWP and the second, with 600 m resolution, ICON-LEM. The simulation setup used by  for ICON-LEM was adopted for this purpose.
Here we focus mainly on the results of the simulations using ICON-LEM. In two cases the ICON-NWP results are included to highlight improvements of the ICON-LEM but also to show when 2.4 km can be a sufficient choice. As can be seen in Figure 1 the domain for ICON-LEM only covers a limited circular area with 110 km diameter centered around Ny-Ålesund and including the neighboring fjord Kongsfjorden. Therefore, it was necessary to provide forcing data. For this, we acquired data from the daily global operational ICON simulations, performed by the German Weather Service with a resolution of 13 km. Remapping the 13 km data to 600 m is not ideal due to the large resolution jump. Hence, we decided to additionally run ICON-NWP with 2.4 km resolution for a limited domain covering most of Svalbard and including parts of the Arctic Ocean surrounding the archipelago. These intermediate simulations were forced using the ICON global simulations and the output was remapped to force the ICON-LEM runs. Both simulations were initialized at 00:00 UTC and simulated a 24 hr period. For the analysis, the first 3 hr were excluded to avoid the spin-up period.
The simulation set used for this study includes all available daily simulations from August to December 2020. During this period the fjord remained ice free. Only some variables are stored as this enables us to use a very high temporal resolution of 9 s. This subset is a Meteogram containing the output of a single column. For our purposes, the Meteogram for the grid cell in which the Arctic Research Base "AWIPEV" (see next section) in Ny-Ålesund is located is returned.

Observations and Data Processing
The observations for Ny-Ålesund are obtained from the aforementioned AWIPEV station. An overview is given in Table 1. The data that we used include the daily operational 12:00 UTC radiosondes (Maturilli, 2020), which are typically launched around 11:00 UTC. The lowest 100 m have larger uncertainties because the radiosondes are closer to the balloon while the attachment cord is still unraveling. For the comparison between the radiosondes and the simulations, only a subset of the model data was used. Depending on the variable evaluated, either the 12:00 UTC Meteogram was selected or the average of the Meteograms from 11:00 to 11:20 UTC was taken. This is mentioned accordingly in the results. The precipitation is measured by the University of Cologne's Pluvio precipitation gauge. Several corrections following Wolff et al. (2015), Førland and Hanssen-Bauer (2000), and Kochendorfer et al. (2017) have been applied to the original Pluvio data to account for wind-induced loss and give an idea of the uncertainty of the observed precipitation estimate.

The Humidity and Temperature Profiler HATPRO
The vertically integrated cloud water (liquid water path [LWP]) and IWV are taken from the Humidity And Temperature PROfiler (HATPRO) of the Alfred Wegener Institute for Polar and Marine Research (Ebell & Ritter, 2022). HATPRO is a passive microwave radiometer (MWR) providing brightness temperature measurements at 14 frequencies in two different spectral bands with a temporal resolution of about 1 s. Based on these brightness temperatures, LWP and IWV have been retrieved (Nomokonova et al., 2019). Uncertainties for LWP are typically 20-25 g m−2 and for IWV smaller than 1 kg m−2. The data is interpolated to 1-min values for the comparison with the model data. In this analysis, the measurements have been discarded in cases where the quality flag indicated precipitation. The precipitation is registered by a sensor on the instrument. Precipitation might result in a wet radome, leading to erroneous LWP and IWV estimates. A blower prevents the deposition of snow on the instrument. An additional check was applied to exclude times when the MWR brightness temperatures of one frequency channel were spectrally inconsistent with the other channels of the same band. To compare the observed LWP and IWV to the model, we applied an artificial flagging in the model data as well by excluding all time periods where liquid precipitation is above 0.05 mm in the model. This leads to a justifiable comparison, but one must keep in mind that it is still likely that cases remain in the model data which would be excluded in the observational set.

Separating Cloudy and Clear Columns
To distinguish between cloudy and clear columns the Cloudnet data products were used. Cloudnet was specifically developed by Illingworth et al. (2007) to facilitate the operational evaluation of clouds in NWP models using ground-based observations. In particular, the Cloudnet target classification product enables the user to differentiate between water in different phases, aerosols, and other non-hydrometeors in the cloud radar height bins. The column height reaches up to 12 km.
To create this target classification, several ground-based remote sensing instruments are combined. For AWIPEV, this includes the AWI Ceilometer CL51 and the University of Cologne 94 GHz cloud radar JOYRAD-94. The Cloudnet target classification also needs thermodynamic information (temperature, pressure, and humidity profiles) which is taken from the ICON-NWP global operational runs (13 km resolution). For the ICON-LEM data, the creation of such a target categorization is done by applying thresholds to the hydrometeor concentrations of the different particles (ice, cloud droplet, raindrop, graupel, and hail). For this, a common threshold of 10−8 kg kg−1 was applied to the concentrations in the columnar output of the Meteogram.
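The threshold-based separation of the model columns can be illustrated with a minimal Python sketch. The function name, the dictionary layout of the hydrometeor fields, and the rule that a column counts as cloudy as soon as any species exceeds the threshold in any level are assumptions made for illustration; only the common threshold of 10−8 kg kg−1 is taken from the text.

```python
import numpy as np

# Common threshold from the text (kg kg-1).
THRESHOLD = 1e-8


def cloudy_column_mask(hydrometeors, threshold=THRESHOLD):
    """Flag each time step whose column contains hydrometeors.

    hydrometeors: dict mapping a species name (hypothetical keys such as
        "ice" or "cloud") to an array of shape (time, level) holding the
        mass concentration in kg kg-1.
    Returns a boolean array of shape (time,), True for cloudy columns.
    Assumption: one exceedance in any species and any level is enough.
    """
    stacked = np.stack([np.asarray(v, dtype=float) for v in hydrometeors.values()])
    # Axis 0 is the species, axis 2 the model level; axis 1 (time) is kept.
    return (stacked > threshold).any(axis=(0, 2))
```

Restricting the check to a subset of species (for example, leaving out rain) only requires passing a smaller dictionary.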

Boundary Layer and Free Atmospheric Flow
As mentioned before, the terrain around Ny-Ålesund provides the opportunity to test the capability of the model to capture the flow in the boundary layer (BL) and free troposphere. Both can be distinguished very clearly from each other and the sea-land breeze, as well as katabatic winds in the fjord, are well-studied phenomena (Beine et al., 2001; Esau & Repina, 2012; Vihma et al., 2011). The radiosondes from 1993 to 2014 analyzed by Maturilli and Kayser (2017) show the southeast low-level flow which is channeled through the fjord (see Figure 1 in Maturilli and Kayser (2017)). At roughly 1 km height the impact of the topography has decreased so far that the flow begins to rotate toward a south-westerly flow, which then dominates in the free troposphere. Reassuringly, both aspects of the flow can be seen very clearly in the ICON-LEM and ICON-NWP simulations.
To depict this in Figure 2 only the 12:00 UTC profiles from the simulations were used. The radiosonde profiles were used as reference. The height bins are computed using the lowest vertical resolution of the three data sets which is the ICON-LEM, as these simulations only have 100 model levels in total. For the wind direction, 10° bins are used. The advantage of using the higher resolved simulation becomes clearer when looking at single layers in the BL. The smoothed distribution of the wind direction for the fifth model level (≈180 m) is shown in Figure 3. The maxima are underestimated in both model versions but the higher resolved simulations capture the accumulation around 110° more accurately, although it is slightly shifted southwards. The higher accuracy for the ICON-LEM simulations makes sense as the higher resolution can resolve more of the flow and has a better representation of the topography. As the topography input is limited in resolution we can see though that for other study purposes the 2.4 km resolution of ICON-NWP may be sufficient regarding the wind flow. Here ICON-NWP too captures the main features of the flow accurately.
Another phenomenon of the BL around Ny-Ålesund is orographically induced waves. We were able to observe these in the model (not shown). These waves occur when the air masses overflow the mountains to the south of Ny-Ålesund. In some of the simulations they led to high vertical winds (>4 m s−1) and typical wave patterns in the wind signal. Understanding how they influence clouds would need further research.

Temperature and Humidity
The representation of the vertical profiles in the lower levels plays an important role for the low-level clouds in the model. Therefore, we investigated the temperature and humidity using the daily soundings. The model is able to capture the general structure of the temperature and humidity profiles well. Figure 4 depicts a close-up of the mean profiles below 2.5 km with the interquartile ranges. It is noticeable that there is on average a cold bias of −0.37 K in the model throughout the troposphere (below 10 km). The bias increases toward the surface and reaches −1.33 K at 130 m (Figure 4a). Nomokonova et al. (2019) found that this negative bias in the BL was even stronger in ICON-NWP global simulations with 13 km resolution.
Concerning the humidity, one can see an underestimation of specific humidity in the model which increases toward the surface (Figure 4b). The cold bias, on the other hand, balances out the lack of humidity and consequently, the mean relative humidity profiles agree well (Figure 4c). A relative humidity inversion can be seen around 100 m which is slightly more pronounced in the observational mean than in the model mean. Between 0.5 and 1.5 km, ICON-LEM is on average drier but has a larger spread. This is the layer where the majority of the low-level mixed-phase clouds develop.

Relative Humidity Modes
When one now expands the view to include the variability of the relative humidity in each level an interesting feature becomes visible in the observations. As depicted in Figure 5, the radiosondes show a bi-modal distribution of the relative humidity up to a height of 1 km. Especially between 500 m and 1 km, this is visible. When looking at the simulations only one mode is visible. This feature is not restricted to specific seasons and is connected to the drift of the radiosondes. To show this, the trajectories were split into two groups. As the criterion, the mean relative humidity between 500 and 600 m is selected. In this height range, the two modes can be distinguished well from each other. In Figure 6 the histogram makes the two modes visible and based on the distribution a threshold of 87% relative humidity is chosen to distinguish the two sets. Using this the trajectories are visualized in Figure 7 where it becomes clear that the majority of the mode with the higher relative humidity stems from drifts toward and over the fjord.
This gives some insights into the spatial variability of the humidity in the local area around Ny-Ålesund. The evaporation from the fjord seems to cause a higher humidity over the fjord. It further highlights the necessity for caution when comparing the single-column output of a model with non-stationary observational sets. Here, large differences between the model and the radiosondes can partially be explained by the trajectory of the radiosonde. Therefore, studies containing only a single or few cases must take this into account.
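The grouping of the soundings into the two humidity modes can be sketched as follows. The function name and the array inputs are hypothetical, and assigning a profile that lies exactly on the threshold to the higher mode is an assumption; the 500 to 600 m layer and the 87% threshold are taken from the text.

```python
import numpy as np


def rh_mode(height_m, rh_percent, lower_m=500.0, upper_m=600.0, threshold=87.0):
    """Assign one radiosonde profile to the 'high' or 'low' humidity mode.

    height_m, rh_percent: 1-D arrays of equal length for a single sounding.
    The criterion is the mean relative humidity between lower_m and upper_m.
    """
    height = np.asarray(height_m, dtype=float)
    rh = np.asarray(rh_percent, dtype=float)
    in_layer = (height >= lower_m) & (height <= upper_m)
    if not in_layer.any():
        raise ValueError("profile has no samples in the selected layer")
    mean_rh = rh[in_layer].mean()
    # Assumption: values exactly at the threshold count as the higher mode.
    return "high" if mean_rh >= threshold else "low"
```

Applying this function to every sounding yields the two sets whose drift trajectories can then be plotted separately.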

Integrated Water Vapor
In this section, we analyze the capability of the model to accurately capture the IWV. We use two different approaches: First, using the radiosondes in comparison to 20-min time averages from the model and HATPRO and second evaluating the entire time period using HATPRO and ICON-LEM.
As mentioned in Section 2.2, the IWV can be retrieved from the HATPRO as well as calculated from the radiosondes directly. At first, we look at how well both the model and HATPRO compare to the radiosondes, which we use as reference. For the calculation of the IWV from the radiosondes, we used the saturation vapor pressure formula derived by Goff and Gratch (1946). As the operationally launched radiosondes are only available once a day at 12:00 UTC, the IWV from HATPRO and the model were computed using a 20-min average from 11:00 to 11:20 UTC. In Figure 8, both (a) HATPRO and (b) ICON-LEM are plotted against the IWV from the radiosondes.
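As a rough illustration of the column integration, the IWV follows from the standard relation IWV = (1/g) ∫ q dp. The sketch below assumes that specific humidity is already available on the sounding levels; the conversion from the measured relative humidity via the Goff and Gratch (1946) saturation vapor pressure is deliberately left out, and the function name and the trapezoidal rule are choices made here rather than the procedure of the study.

```python
import numpy as np

G = 9.80665  # standard gravity in m s-2


def integrated_water_vapor(pressure_pa, specific_humidity):
    """Return IWV in kg m-2 from a profile of pressure (Pa) and q (kg kg-1).

    Integrates q over pressure with the trapezoidal rule and divides by g.
    The levels may be given in any order; they are sorted by pressure first.
    """
    p = np.asarray(pressure_pa, dtype=float)
    q = np.asarray(specific_humidity, dtype=float)
    order = np.argsort(p)
    p, q = p[order], q[order]
    layer_mean_q = 0.5 * (q[1:] + q[:-1])
    return float(np.sum(layer_mean_q * np.diff(p)) / G)
```

For a constant specific humidity of 2 g kg−1 between 1,000 and 500 hPa this gives roughly 10.2 kg m−2.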
One can see that the model has a larger spread but a smaller bias (−0.11 kg m−2) than HATPRO (−0.41 kg m−2). This bias becomes more visible in HATPRO for larger values, as a linear regression (red line in Figure 8a) shows. This typical increase of uncertainties for larger values was also found by Nomokonova et al. (2019). Figure 8 also shows that there are fewer cases with higher values for the HATPRO selection. This is likely connected to the flagging applied for HATPRO and reveals an issue when exploring cases with high IWV. In such cases, precipitation is more likely, which leads to a reduction of cases with high IWV in the observations and thus to a bias in the cases studied. This case reduction limits the generalization of the comparison between model and observations.
The time series of the IWV for the entire period (Figure 9) show that, in general, the model captures the IWV very well, with only a small bias of 0.19 kg m−2 and a root mean squared error of 1.14 kg m−2 when comparing ICON-LEM to HATPRO. It can be seen in Figure 9 that the IWV follows a prominent seasonal trend and decreases toward November and December, which is a typical feature to be expected for this location. Additionally, one can see that the shape of the frequency distributions of IWV (Figure 10a [...] difference between HATPRO and the radiosondes into account by using the aforementioned linear regression (Figure 8a) to adjust the HATPRO measurements. Comparing these adjusted values to the simulations, though, showed only minor differences in the already good agreement between the model and the observations (Bias = −0.14 kg m−2 and RMSE = 0.94 kg m−2).
Figure 6. Histogram for the mean relative humidity between 500 and 600 m for all radiosondes from August to December 2020. The dashed yellow line marks the 87% value which is used as threshold to differentiate between radiosonde profiles which belong to the higher or the lower mode.
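For reference, bias and root mean squared error of two matched time series can be computed as in the following sketch. The sign convention (model minus observation), the function name, and the handling of flagged samples as NaN are assumptions for illustration; the text does not spell these details out.

```python
import numpy as np


def bias_and_rmse(model, observation):
    """Return (bias, rmse) of model minus observation.

    Both inputs are 1-D series on a common time grid. Samples where either
    series is NaN (for example, flagged precipitation periods) are skipped.
    """
    mod = np.asarray(model, dtype=float)
    obs = np.asarray(observation, dtype=float)
    valid = ~(np.isnan(mod) | np.isnan(obs))
    diff = mod[valid] - obs[valid]
    return float(diff.mean()), float(np.sqrt(np.mean(diff ** 2)))
```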
Because water vapor is crucial for the development of clouds, we further investigated whether there would be differences in the IWV histograms between the cloudy and clear columns (see methods in Section 2.2.2). This separation of columns yields a higher cloud occurrence in ICON-LEM (77%) in comparison to the observed cloud occurrence of 73% (Table 2). Both values, though, are in the range expected for an Arctic location and are sensitive to the chosen thresholds to distinguish clear and cloudy cases. Interestingly, this differentiation between cloudy and clear columns does lead to larger differences in the IWV histograms, although these too are not substantial (Figures 10b and 10c). Noticeable is the IWV range between approx. 13 and 20 kg m−2 in Figures 10b and 10c. In this range, ICON-LEM has more clouds and less clear sky than what is found in the observations. It is interesting that for these higher values HATPRO clearly shows that clear-sky cases exist (Figure 10c), but the cloud microphysics parameterization in ICON-LEM almost always produces clouds above an IWV of around 15 kg m−2.

Liquid Water Path
The LWP is helpful to get an impression of whether the hydro- and thermodynamical processes in the model are able to convert the available water vapor into liquid water. It is not straightforward to retrieve vertical profiles of liquid water content, and such retrievals have high uncertainties (Ebell et al., 2010). Thus, we used Cloudnet to distinguish between clear versus cloudy columns (see Section 2.2.2). By using the LWP we evaluate if there is liquid in the cloudy columns. Based on that we distinguish between columns with liquid and mixed-phase clouds in contrast to columns which only contain ice clouds. The sensitivity of HATPRO leads to the necessity of setting a threshold for the LWP. We tried different LWP lower thresholds (5, 10, and 20 g m−2) to determine how sensitive the results are to the threshold and used the accuracy of HATPRO as the highest threshold. In HATPRO the percentage of columns containing liquid ranged, depending on the threshold, from 70% to 53% (51,424-39,293 cases of total cloudy columns). This is a stark contrast to the 34%-26% (28,090-21,764 cases of total cloudy columns) of cloudy columns which contain liquid in ICON-LEM. Table 2 summarizes these numbers. One feature noticeable from these numbers is the average lack of liquid water in the model. In turn, this implies that too many pure ice clouds are produced in the model in contrast to what is observed. It appears that the model's microphysical implementation favors the formation of ice particles when clouds develop. Similar findings have also been shown by Nomokonova et al. (2019) for the same location using the ICON-NWP model with 13 km resolution and a one-moment microphysical scheme.
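The threshold sensitivity test can be expressed compactly, as in the sketch below. The function name and inputs are hypothetical, and treating a column with LWP equal to the threshold as liquid-containing is an assumption; the three thresholds are those named in the text.

```python
import numpy as np


def liquid_fraction(lwp_g_m2, cloudy, thresholds=(5.0, 10.0, 20.0)):
    """Fraction of cloudy columns that contain liquid, per LWP threshold.

    lwp_g_m2: 1-D array of liquid water path in g m-2.
    cloudy: boolean array of the same length, True for cloudy columns.
    Returns a dict mapping each threshold to the liquid-containing fraction.
    """
    lwp = np.asarray(lwp_g_m2, dtype=float)
    is_cloudy = np.asarray(cloudy, dtype=bool)
    n_cloudy = int(is_cloudy.sum())
    if n_cloudy == 0:
        raise ValueError("no cloudy columns in the sample")
    return {
        t: float(np.sum(is_cloudy & (lwp >= t)) / n_cloudy) for t in thresholds
    }
```

The complement of each fraction corresponds to the cloudy columns classified as containing only ice.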
For the comparison of the LWP distributions, we selected only cases with LWP ≥ 10 g m−2 (Figure 11). Evaluating the remaining data showed that the LWP is more skewed toward higher values in ICON-LEM with a mean of 168 g m−2 whereas HATPRO shows no LWP larger than 1,171 g m−2 and has a mean of 91 g m−2. Figure 11b makes it clear though that cases with high LWP (>400 g m−2) are very rare also in ICON. The close-up using the linear scale (Figure 11b) shows that the distribution for the values below 400 g m−2 is similar. Looking at Figure 11a, which uses a logarithmic scale, a surprising extended tail of high LWP values in ICON-LEM becomes visible. These cases of high LWP are generally connected to precipitation but are not filtered out either because the precipitation does not reach the ground or because it occurs with a time shift. From these findings, the picture emerges that liquid-containing clouds seem to contain more LWP in the simulations than what is measured.
With the advantage of using this large data set covering many cases where there was no rain or drizzle, we can assume a certain reliability of the statistics. Combining the above-mentioned findings and the high accuracy of the IWV points toward shortcomings in the microphysical representation of production and growth of cloud droplets.

Simulated and Measured Precipitation
Continuing in the direction of droplet and ice crystal growth, the question arises of how well the precipitation is represented for this specific location. As mentioned in Section 2.2, one challenge for the precipitation measurements is to correct for the wind-induced undercatch, especially in the case of solid precipitation. A total of 368 hr with precipitation were observed between August and December 2020. We computed the Heidke skill score (HSS), based on the hourly resolved values, which gives an estimate of the quality of the precipitation occurrence prediction. With HSS = 0.43 we show that the hourly precipitation/no precipitation occurrence is well simulated. The false alarm rate, indicating when the model predicts precipitation but there is none, is quite low with 0.08 and a proportion correct of 0.87 signals that the time of precipitation/no precipitation is mostly met.
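The categorical scores follow from a 2 × 2 contingency table of hourly precipitation occurrence. The sketch below uses the standard definitions with hits a, false alarms b, misses c, and correct negatives d: HSS = 2(ad − bc) / [(a + c)(c + d) + (a + b)(b + d)] and proportion correct = (a + d)/n. Two points are assumptions: the false alarm rate is computed here as b/(b + d), since the text does not state which variant was used, and an hour counts as wet when its sum exceeds a hypothetical threshold of 0 mm.

```python
import numpy as np


def occurrence_scores(obs_mm, model_mm, threshold_mm=0.0):
    """Return HSS, false alarm rate, and proportion correct for hourly sums.

    obs_mm, model_mm: 1-D arrays of hourly precipitation on the same hours.
    threshold_mm: hypothetical wet/dry threshold (an assumption).
    """
    obs = np.asarray(obs_mm, dtype=float) > threshold_mm
    mod = np.asarray(model_mm, dtype=float) > threshold_mm
    a = float(np.sum(mod & obs))    # hits
    b = float(np.sum(mod & ~obs))   # false alarms
    c = float(np.sum(~mod & obs))   # misses
    d = float(np.sum(~mod & ~obs))  # correct negatives
    hss = 2.0 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
    return {
        "hss": hss,
        "false_alarm_rate": b / (b + d),  # assumed definition, see lead-in
        "proportion_correct": (a + d) / (a + b + c + d),
    }
```

An HSS of 1 indicates a perfect occurrence forecast and 0 indicates no skill beyond random chance.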
Another aspect regarding the precipitation is the intensity. To evaluate this the hourly accumulated precipitation is used. Figure 12 depicts the distribution of the hourly precipitation sums as simulated by ICON-LEM in comparison to the Pluvio observations using the Wolff correction (P-Wolff). From Figure 12 it is clear that the majority of the hourly accumulated precipitation is below 1 mm and few cases have more than 2 mm. Creating robust statistics for cases with higher values was therefore not possible with our limited data set. Nevertheless, the 95th percentile is slightly lower for ICON-LEM (1.75 mm) than for P-Wolff (2.03 mm). This points toward an underestimation in ICON-LEM of the precipitation amount for the more extreme precipitation cases during the analyzed period.
To gain an insight into the total amount of precipitation, the focus is now shifted toward the daily sums of precipitation. In Figure 13 the cumulative sums of the different accumulated daily precipitation classes are shown for the observations (original and corrected values) and the simulations. The spin-up time (00:00-03:00 UTC) is excluded in both sets. Further, the ICON-NWP values are included to highlight improvements gained from increasing the resolution. The total precipitation amount measured at the AWIPEV station for the analyzed time lies between 187 and 270 mm (olive and green lines in Figure 13). This range takes the different corrections into account and therefore gives an idea of the measurement uncertainty. If no corrections were applied, the total precipitation amount would lie at the lower limit of 187 mm (light green line). The corrections thus add up to 44% to the originally measured precipitation amount. Days with accumulated precipitation higher than approx. 10 mm are the ones which the model is more likely to underrepresent. For instance, the number of days with daily precipitation sums of more than 10 mm is 5 for ICON-LEM and just 1 for ICON-NWP, in contrast to 7 for the observations (P-Wolff). Even though days with these larger precipitation amounts are also rare in the observations, they contribute more than 50% to the total precipitation amount. So it is even more important for the models to capture these events. It is noticeable though that ICON-LEM (206 mm) improves the simulated amount significantly in comparison to the ICON-NWP (107 mm) simulations. What is important to keep in mind here is that they use different microphysics schemes, one-moment for ICON-NWP versus two-moment in ICON-LEM. That ICON-LEM achieves a greater accuracy is somewhat expected behavior, as local extremes are smoothed out more at coarser horizontal resolution, whereas relevant processes such as clouds are better resolved at finer resolution. Additionally, the microphysical processes and hydrometeors are represented in more detail in ICON-LEM.

Conclusions and Outlook
We presented the results of an extensive evaluation of ICON-LEM using a large data set of daily simulations. It covers 5 months from August to December 2020. These simulations were performed semi-operationally using a setup with lateral boundary conditions and heterogeneous surface types. It provided a remarkable basis for comparisons with different observational products. The motivation was to create a thorough and elaborate reference for those who apply this model in the Arctic. Further, we explored some limitations of our approach where we combine model data and observations and highlighted features special to Ny-Ålesund and its surroundings. Yet the results are not only relevant for the Arctic, but also from a general model development perspective, especially for the representation of clouds. The different aspects covered were (a) the representation of the wind, (b) humidity and temperature in the BL, (c) the IWV and LWP in the Ny-Ålesund columns, and (d) the precipitation. For the analysis, several instrument products from the AWIPEV super-site were used. These were an MWR, Cloudnet, a rain gauge, and radiosondes.
Starting with the wind, we found that the model captures the two different wind regimes in the BL and the free atmosphere very well. In the BL, the higher resolution proves helpful, leading to a more accurate distribution that shows the channeling of the flow through the Kongsfjorden. The temperature profile is represented correctly but shows a cold bias in the model values, especially toward the surface. The relative humidity comparison in the BL proved challenging, as we found that the drift of the radiosondes creates a clear signal in the soundings. We therefore emphasize the need to consider the local variability in such diverse surroundings. The increase of the bias toward the surface also motivates a closer look at aspects such as the representation of surface types.
Using the Cloudnet target classification, we showed that ICON-LEM simulates around 4% more clouds than were observed. The cloud occurrence was in a typical range for Ny-Ålesund in both model and observations. The exact definition of what counts as a cloud depends on thresholds and sensitivities in the observations as well as in the model, which may affect the cloud occurrence results. The similarities between the modeled and observed IWV were striking. The model captures long- and short-term changes very well. An exception is clear-sky cases with large IWV values, which the model underestimates. Another interesting finding was the abundance of ice clouds in the model. Using the LWP (>5 g m−2), we differentiated between columns containing liquid and those containing only ice. This distinction was then used to evaluate the occurrence of liquid water. In the observations, up to 70% of the columns contained liquid, compared with only up to 34% in the model. Further, the LWP data from HATPRO contain no values above 1,171 g m−2. This limit arises partly because such high values are associated with precipitation cases, which are likely to be filtered out by the instrument's quality flag.
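The LWP-based distinction between liquid-containing and ice-only columns can be sketched as follows. This is a minimal illustration, assuming the input has already been restricted to cloudy columns and that missing retrievals are marked as NaN or None. The function name and the example values are hypothetical. Only the 5 g m−2 threshold comes from the text.

```python
import math
from typing import Iterable, Optional

LWP_THRESHOLD = 5.0  # g m-2, the threshold quoted in the text


def liquid_fraction(lwp_values: Iterable[Optional[float]],
                    threshold: float = LWP_THRESHOLD) -> float:
    """Fraction of valid cloudy columns whose LWP exceeds the threshold."""
    # Drop missing retrievals (None or NaN) before counting.
    valid = [v for v in lwp_values if v is not None and not math.isnan(v)]
    if not valid:
        return float("nan")
    liquid = sum(1 for v in valid if v > threshold)
    return liquid / len(valid)


# Made-up LWP values in g m-2, for illustration only (not measurements).
example_lwp = [0.0, 2.0, 7.5, 40.0, float("nan"), 120.0]
print(liquid_fraction(example_lwp))
```

Running the same function on the observed and on the modeled LWP series would yield the two fractions compared above. The result is sensitive to the chosen threshold and to how flagged or missing data are handled.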
Combining the findings for IWV and LWP, there are three limitations in ICON which are likely linked: (a) too efficient ice production, (b) overestimation of cloud occurrence, and (c) too much liquid water when droplets form. The linking factor is the cloud microphysical parameterization in the model. This includes not only how processes are implemented but also how the parameters are set. In first sensitivity studies, we saw that switching to an alternative cloud condensation nuclei (CCN) scheme with different CCN numbers changed the hydrometeor composition. This will be part of further, more detailed research.
For the precipitation, we found that ICON-LEM agrees better than ICON-NWP with the measured accumulated precipitation during that period. Both models underestimated the cases with high daily precipitation amounts.
ICON-LEM shows a distribution of the hourly accumulated precipitation similar to that of the observations. However, ICON-LEM overestimates the frequency of occurrence of low hourly precipitation sums and underestimates the occurrence of very high ones. During the evaluated period there were few cases with strong precipitation, but these were important for the total precipitation amount.
This evaluation encourages the use of realistic high-resolution simulations to study changes in the Arctic related to Arctic amplification. Further, it enables us to focus on improving the cloud microphysical processes. In particular, we saw that the formation of cloud droplets and ice particles can be improved. Additionally, it should be noted that the setting of this mountainous region and the strong contrasts in the surface composition highlight the adaptability of the model to complex terrain. Generally, the advantages of being able to perform realistic simulations for these challenging environments are not limited to the Arctic climate. In summary, this analysis shows the feasibility of running high-resolution simulations over longer time periods when focusing on a specified area of interest, and how we can benefit from the statistical analysis.