Evaluation of the WMO-SPICE transfer functions for adjusting the wind bias in solid precipitation measurements

The World Meteorological Organization (WMO) Solid Precipitation Inter-Comparison Experiment (SPICE) involved extensive field intercomparisons of automated instruments for measuring snow during the 2013/2014 and 2014/2015 winter seasons. A key outcome of SPICE was the development of transfer functions for the wind bias adjustment of solid precipitation measurements using various precipitation gauge and wind shield configurations. Due to the short intercomparison period, the dataset was not sufficiently large to develop and evaluate transfer functions using independent precipitation measurements. The present analysis uses data collected at eight SPICE sites over the 2015/2016 and 2016/2017 winter periods, comparing 30-minute adjusted and unadjusted measurements from Geonor T-200B3 and OTT Pluvio 2 precipitation gauges in different shield configurations to the WMO Double Fence Automated Reference (DFAR) for the verification of the transfer functions. Performance is assessed in terms of relative total catch (RTC), root mean square error (RMSE), Pearson correlation (r), and Nash-Sutcliffe Efficiency (NSE) for all precipitation types, and for snow only. The evaluation shows that the performance varies substantially by site. Adjusted RTC varies from 54% to 123%, RMSE from 0.07 mm to 0.38 mm, r from 0.28 to 0.94 and NSE from -1.88 to 0.89, depending on precipitation phase, site, and gauge configuration. Generally, windier sites such as Haukeliseter (Norway) and Bratt’s Lake (Canada) exhibit a net under-adjustment (17% to 46%), while the less windy sites such as Sodankylä (Finland) and Caribou Creek (Canada) exhibit a net over-adjustment (2% to 23%). Although the application of transfer functions is necessary to mitigate wind bias in solid precipitation measurements, especially at windy sites and for unshielded gauges, the inconsistency in the performance metrics among sites suggests that the functions should be applied with caution.


Introduction
The World Meteorological Organization (WMO) Solid Precipitation Intercomparison Experiment (SPICE) was a Commission for Instruments and Methods of Observation (CIMO) initiative to assess and compare instruments and methods for measuring solid precipitation (Nitu et al., 2012; Nitu et al., 2018). The objectives were: 1) to make recommendations for appropriate automated field reference systems; and 2) to provide guidance on the performance and operation of automated systems for measuring solid precipitation and snow on the ground. SPICE was motivated by the need for accurate and homogenized solid precipitation measurements. For example, such measurements are required for climate trend analysis in northern regions (e.g. Førland and Hanssen-Bauer, 2000; Ohata, 2001; Scaff et al., 2015). Following historical works on adjusting the systematic undercatch of solid precipitation measurements due to wind (Goodison, 1978; Sevruk et al., 1991; Goodison et al., 1998; Sevruk et al., 2009; Smith, 2009; Wolff et al., 2015; Kochendorfer et al., 2017a; Buisan et al., 2017), a methodology and a set of widely applicable transfer functions for the adjustment of high-resolution (i.e. 30-min) precipitation measurements were developed. The SPICE transfer functions discussed in this study were developed for single Alter-shielded or unshielded automated precipitation gauges (Kochendorfer et al., 2017b). Because this evaluation builds directly on the Kochendorfer et al. (2017b) SPICE work, the SPICE methodology is described below in more detail and that work is henceforth cited as K2017b. [...] Table 1). The DFAR was developed for use as the field reference configuration for SPICE (Nitu, 2012; Nitu et al., 2016; Nitu et al., 2018).
Descriptions of each site, complete with detailed layouts and photos, are available in the WMO-SPICE site commissioning reports (http://www.wmo.int/pages/prog/www/IMOP/intercomparisons/SPICE/SPICE.html) and in the WMO-SPICE final report (IOM 131, found at http://www.wmo.int/pages/prog/www/IMOP/publications-IOM-series.html; Nitu et al., 2018).
The DFAR consisted of either a Geonor T-200B3 or an OTT Pluvio 2 automated weighing precipitation gauge with a single Alter shield inside the same large, octagonal double fence used for the WMO Double Fence Intercomparison Reference (DFIR). The DFIR was employed as a manual field reference configuration during previous solid precipitation intercomparisons (Yang et al., 1993; Goodison et al., 1998). The DFAR incorporated a precipitation detector to reduce the probability of including false precipitation reports in SPICE data analysis. The result was the development of a high-confidence reference precipitation data set called the Site Event Data Set (SEDS; Reverdin, 2016). In the SEDS, a precipitation event was a 30-minute period during which a precipitation detector (typically an optical disdrometer capable of identifying the occurrence of even light precipitation) observed precipitation for at least 18 minutes (60% of event duration) and the DFAR measured ≥ 0.25 mm. The justification for the filtering criteria used in K2017b is detailed in Kochendorfer et al. (2017a). The SEDS was then used to produce the SPICE transfer functions. By combining data from each site in Table 1, the intent was to make these multi-site transfer functions universally applicable.
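The SEDS event criterion described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, argument names, and sample values are hypothetical and are not taken from the SPICE processing code.

```python
def is_seds_event(detector_minutes, dfar_accum_mm,
                  min_detector_minutes=18, min_dfar_mm=0.25):
    """Return True if a 30-minute period meets both SEDS criteria.

    detector_minutes: minutes (out of 30) in which the precipitation
        detector reported precipitation.
    dfar_accum_mm: DFAR accumulation over the same 30 minutes.
    """
    return (detector_minutes >= min_detector_minutes
            and dfar_accum_mm >= min_dfar_mm)


# Hypothetical 30-minute periods: (detector minutes, DFAR accumulation in mm)
periods = [(30, 0.60), (18, 0.25), (17, 0.40), (25, 0.10)]
events = [p for p in periods if is_seds_event(*p)]
print(events)
```

Both conditions must hold, so a period with ample accumulation but only 17 minutes of detected precipitation is excluded, as is a period with 25 minutes of detection but less than 0.25 mm in the DFAR.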

The transfer functions presented in K2017b were developed for both unshielded and single Alter-shielded automated precipitation gauges by combining observations from the Geonor T-200B3 and OTT Pluvio 2 gauges (hereafter referred to as the sensors under test, or SUT), after the authors demonstrated that the unshielded catch from both SUT types was very similar. Using the SEDS, further processing was applied to the SUT data (as justified in Kochendorfer et al., 2017a) using a minimum 30-minute threshold defined as the median of the ratio of the SUT accumulation to that of the DFAR, for the precipitation event threshold of 0.25 mm. This prevented the results from becoming biased toward the gauge used in the SEDS event selection. For wind speed, K2017b used the wind speed measurements available at each site and applied the log-profile law to produce a 30-minute average wind speed for both gauge height (Ugh) and the standard 10 m height (U10m), either scaling speeds up to U10m or down to Ugh, depending on which measurement height was deemed to provide the best wind speed data at each site. Catch efficiencies (CE) were then calculated for each event as the ratio of the 30-minute SUT precipitation accumulation to that from the DFAR over the same period. From K2017b, two functional forms were used to fit the data:

$$CE = e^{-a U \left(1 - \tan^{-1}\left(b\,T_{air}\right) + c\right)} \qquad (1)$$

$$CE = a\,e^{-b U} + c \qquad (2)$$

where U is wind speed (specifically either Ugh or U10m) in m s⁻¹, Tair is air temperature in degrees C, and a, b, and c are coefficients to fit the data to the model. Eq. 1 and Eq. 2 listed here are referred to as Eq. 3 and Eq. 4 in K2017b.
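The log-profile conversion between measurement heights mentioned above can be illustrated with a short Python sketch. The roughness length `z0` and displacement height `d` used as defaults here are illustrative assumptions, not values stated in this text, and the function name is hypothetical.

```python
import math


def log_profile_wind(u_measured, z_measured, z_target, z0=0.01, d=0.4):
    """Scale a wind speed from one height to another with the log law.

    u_measured: wind speed (m/s) observed at height z_measured (m).
    z_target:   height (m) at which the wind speed is wanted.
    z0, d:      roughness length and displacement height (m); both are
                assumed values for illustration only.
    """
    return (u_measured
            * math.log((z_target - d) / z0)
            / math.log((z_measured - d) / z0))


# Reduce a 10 m wind speed to a nominal 2 m gauge height, then scale back up.
u_gh = log_profile_wind(5.0, z_measured=10.0, z_target=2.0)
u_10m = log_profile_wind(u_gh, z_measured=2.0, z_target=10.0)
print(u_gh, u_10m)
```

Because the same profile is used in both directions, scaling down to gauge height and back up to 10 m recovers the original speed; the choice of `z0` and `d` controls how strongly the speed changes with height.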
To reduce the impact of fewer events at higher wind speeds and the potential impacts of blowing snow, the SEDS data were filtered further to remove events with wind speeds higher than 7.2 m s⁻¹ (9 m s⁻¹) at gauge height (10 m).
The key difference between Eq. 1 and Eq. 2 is the inclusion of temperature dependency in Eq. 1. This allows for a continuous (3-dimensional) transfer function at all temperatures without having explicit knowledge of the precipitation phase. This curve is shown in Fig. 2. Kochendorfer et al. (2017a, 2017b) examined the use of more complex transfer function forms for adjusting solid precipitation measurements, such as the sigmoid function used by Wolff et al. (2015), but found that the simpler forms had bias and RMSE characteristics similar to those of the more complex forms.
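To make the contrast between the two functional forms concrete, the following Python sketch evaluates both. It assumes the forms published as Eq. 3 and Eq. 4 in K2017b (an exponential in wind speed with an arctangent temperature term, and an exponential in wind speed alone). The coefficient values are hypothetical placeholders chosen only to give plausible curves; they are not the published SPICE coefficients.

```python
import math


def ce_eq1(u, t_air, a, b, c):
    """Temperature-dependent form: CE = exp(-a*U*(1 - atan(b*Tair) + c))."""
    return math.exp(-a * u * (1.0 - math.atan(b * t_air) + c))


def ce_eq2(u, a, b, c):
    """Wind-only form: CE = a*exp(-b*U) + c (fitted separately per phase)."""
    return a * math.exp(-b * u) + c


# Hypothetical coefficients, for illustration only.
EQ1 = dict(a=0.05, b=0.7, c=0.6)
EQ2 = dict(a=0.7, b=0.2, c=0.35)

for t_air in (-10.0, -2.0, 2.0):
    print(t_air, ce_eq1(5.0, t_air, **EQ1))
for u in (0.0, 2.0, 5.0):
    print(u, ce_eq2(u, **EQ2))
```

With the first form, catch efficiency falls smoothly as the air temperature drops at a fixed wind speed, so no phase classification is needed. The second form depends on wind speed only, and its value at zero wind is a + c, which can exceed 1 for some fitted coefficients.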

Following the end of the SPICE project, the performance of the SPICE transfer functions was evaluated as described in K2017b, using the same SEDS data used to develop those transfer functions. As discussed above, the data from all eight sites in

RMSE, and PE were slightly improved after adjustment, but these also varied by site. K2017b also showed that, in general, the mountainous sites experienced larger errors after adjustment, with one mountainous site (WFJ) being over-adjusted and the other two (HKL and FOR) being under-adjusted.

The impetus for this extended evaluation was twofold: 1) The methodology used during SPICE for developing and evaluating the transfer functions used only a subset of the observed data (the SEDS). Although this was a robust methodology for developing transfer functions, it did not provide a comprehensive evaluation of the adjustments under circumstances more typical of users collecting precipitation data in the field, where the data are not filtered as heavily to remove small amounts.
2) The dataset used for the evaluation of transfer functions in K2017b was not completely independent of that used to develop the functions. A robust assessment requires additional data collected following the end of the SPICE intercomparison period that was not used in the development of the transfer functions.

This evaluation will examine the performance of the SPICE transfer functions for precipitation measurements from each of the eight intercomparison sites shown in Fig. 1. [...] were obtained from the eight SPICE sites. The data were quality controlled using the same techniques employed in SPICE (Nitu et al., 2018), which involved automated range and jump checks and supervised removal of remaining outliers.
Where available, service logs were provided by site hosts to assist in data quality control and the identification of outliers due to servicing (e.g. rapid drops or increases in precipitation gauge bucket weights) or maintenance (e.g. instrument malfunctions or other human interventions that may impact the data). The Geonor T-200B3 gauges employed three transducers, and the outputs from all three were averaged to produce a single time series. Next, the high-resolution precipitation data were subjected to the same Gaussian filter as the SPICE data used in K2017b to dampen high-frequency noise; however, it was decided not to develop an event database such as the SEDS, but rather to use an alternate process to develop a consistent 30-minute precipitation time series that more closely reflects real-world precipitation datasets. This alternate process involves the use of a modified "Brute-Force" filter initially

Neutral Aggregation Filter
The NAF algorithm removes noise in cumulative precipitation time series by iteratively balancing positive and negative noise and accumulating positive changes exceeding the noise by a user-defined threshold (Δ*, e.g. 0.05 or 0.2 mm, depending on the gauge precision) (Smith et al., 2019) such that the total accumulated positive increases in bucket weight after filtering are forced to equal the total end-of-season bucket weight. The algorithm removes random and systematic diurnal noise, but does not account for signal drift (an example of signal drift is a decrease in weight that occurs due to evaporation of water from the gauge bucket). Signal drift can result in estimation errors, which can be mitigated using an iterative manual process, with the NAF output as a first guess. This process is called NAF Supervised (NAF-S) and lets the user select the beginning and the end points of segments within the time series where evaporation is occurring. The process then removes these segments so that they have no impact on the time series.
Because there is some user subjectivity in selecting the beginning and end points of impacted segments, this process is completed by a single user employing pre-determined and consistent criteria. Although beyond the scope of the present work, testing NAF and NAF-S on both simulated and observed precipitation time series over an entire winter season, including both noise and evaporation drift, showed the technique to be effective with low error as compared to the control. The end product of the NAF-S filter is a clean, time-consistent 1-minute accumulating time series with preserved data gaps (i.e. no gap filling) for each gauge configuration for each season.

Amalgamation and adjustment
To produce accumulation periods consistent with the K2017b validation, the 1-minute NAF-S precipitation accumulation time series were resampled to 30-minute accumulation amounts by differencing the bucket weights between the start and end of each 30-minute period. The continuity of the time series was maintained, despite data gaps, through the assumption that the gauge continues to accumulate precipitation during logger or power outages. In the event of missing data, precipitation accumulation during an outage was calculated as the difference in bucket weight between the start and end of the outage and recorded as an accumulation at the end of the outage. Although this preserves the total accumulation during the outage, information related to the timing of the events during the outage is not preserved, and the accumulation data need to be flagged. Protocols for adjusting the data for undercatch are noted below.
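The resampling and gap handling described above can be sketched as follows. This is a simplified illustration in plain Python, not the processing code used for the dataset: it only inspects the bucket weight at each 30-minute boundary, represents missing values as `None`, and uses hypothetical flag names.

```python
def resample_30min(bucket_mm, step=30):
    """Difference 1-minute cumulative bucket weights at 30-minute boundaries.

    Returns one (accumulation_mm, flag) pair per 30-minute period.
    A missing boundary weight yields (None, "missing"); the accumulation
    over the whole outage is reported at the first valid boundary after
    it and flagged "gap".
    """
    boundaries = bucket_mm[::step]
    out = []
    last_valid = boundaries[0]      # last non-missing boundary weight
    gap_open = last_valid is None   # True while inside an outage
    for weight in boundaries[1:]:
        if weight is None:
            out.append((None, "missing"))
            gap_open = True
        elif last_valid is None:
            # No earlier reference weight exists, so no difference is possible.
            out.append((None, "missing"))
            last_valid, gap_open = weight, False
        else:
            out.append((weight - last_valid, "gap" if gap_open else "ok"))
            last_valid, gap_open = weight, False
    return out


# Hypothetical 91-minute record with an outage covering the 60-minute boundary.
bucket = [10.0] * 30 + [10.4] * 30 + [None] * 30 + [11.0]
result = resample_30min(bucket)
print(result)
```

The sum of the reported accumulations equals the change in bucket weight over the full record, which is the property the gap-handling assumption is meant to preserve; only the timing within the outage is lost.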
The 1-minute wind speeds (U10m and Ugh where available) and air temperatures (generally measured at 1.5 m) were averaged over the same 30-minute periods as the accumulated precipitation amounts. Site-specific details on the ancillary measurements can be found in the SPICE site commissioning reports (referenced in Section 1). If more than 10 minutes of wind or temperature data were missing in any 30-minute period, the data were flagged as missing and were not used in the adjustment procedure.
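A minimal Python sketch of this averaging rule is given below. The function name and the use of `None` for missing 1-minute values are assumptions made for illustration.

```python
def mean_30min(values, max_missing=10):
    """Average 1-minute values over a 30-minute period.

    Returns None (flagged as missing) if more than max_missing of the
    1-minute values are missing; otherwise the mean of the valid values.
    """
    valid = [v for v in values if v is not None]
    if len(values) - len(valid) > max_missing:
        return None
    return sum(valid) / len(valid)


# Hypothetical wind speed records with 10 and 11 missing minutes.
usable = mean_30min([2.0] * 20 + [None] * 10)
rejected = mean_30min([2.0] * 19 + [None] * 11)
print(usable, rejected)
```

A period with exactly 10 missing minutes is still averaged, while one with 11 missing minutes is rejected, matching the "more than 10 minutes" wording above.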
The resultant time series for each test gauge at each site were adjusted separately using both Eq. 1 and Eq. 2. Each 30-minute accumulation was adjusted individually if the following conditions were met: 1) both the start and end bucket weights for the 30-minute period were not missing, such that the differential could be determined; and 2) no more than 10 of the 30 one-minute values of either wind speed or temperature were missing. Periods that did not meet these criteria were preserved in the time series, but were flagged as being unadjusted and were not included in the validation.
For adjustments using Eq. 1, the pre-determination of the precipitation phase was not necessary, as the transfer function is continuous with temperature and not directly dependent on precipitation phase. For the purpose of adjusting precipitation using Eq. 2, phase is determined by air temperature using the phase regimes outlined in Section 1.1, with rain assuming a catch efficiency of 1 (and therefore not being adjusted). The relative total amounts (Snow/Mixed/Rain) as measured by the DFAR at each site are shown in Table 1. The same maximum wind speed thresholds for adjustments were employed here as in K2017b, which were 7.2 m s⁻¹ and 9.0 m s⁻¹ at gauge height (generally 2 m above ground) and at 10 m height, respectively. Wind speeds above these thresholds were set at the threshold value to avoid over-adjustment and the increased uncertainty in the transfer functions above the wind speed threshold. Figure 2 suggests that Eq. 2 can exceed a catch efficiency of 1 at low wind speeds. There is no obvious physical explanation for the portion of the Eq. 2 catch efficiency function (originally published by K2017b) that is > 1.0, and this is related to the empirical fit of the catch efficiency curve to the original SPICE data. For this current assessment, calculated catch efficiencies > 1 were infrequent and occurrences were automatically set to 1.
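The two safeguards described above (capping the wind speed at the threshold and limiting the catch efficiency to 1) can be sketched in Python as follows. The helper names are hypothetical, and the example catch efficiency curve uses made-up coefficients chosen so that it exceeds 1 at calm conditions; it is not a published SPICE transfer function.

```python
import math


def adjust_accumulation(accum_mm, wind_speed, ce_func, wind_cap):
    """Adjust one 30-minute accumulation for wind-induced undercatch.

    wind_speed is clamped to wind_cap before the catch efficiency is
    evaluated, and the catch efficiency is clamped to a maximum of 1 so
    that the adjustment never reduces the measured amount.
    """
    u = min(wind_speed, wind_cap)
    ce = min(ce_func(u), 1.0)
    return accum_mm / ce


def example_ce(u):
    """Hypothetical wind-only catch efficiency curve (illustrative values)."""
    return 0.7 * math.exp(-0.2 * u) + 0.35


GAUGE_HEIGHT_CAP = 7.2  # m/s, the gauge-height threshold quoted in the text

calm = adjust_accumulation(1.0, 0.0, example_ce, GAUGE_HEIGHT_CAP)
windy = adjust_accumulation(1.0, 20.0, example_ce, GAUGE_HEIGHT_CAP)
at_cap = adjust_accumulation(1.0, 7.2, example_ce, GAUGE_HEIGHT_CAP)
print(calm, windy, at_cap)
```

Under these assumptions a calm-wind measurement is returned unchanged even though the raw curve gives a catch efficiency above 1, and any wind speed beyond the cap produces the same adjustment as the cap itself. Rain, which is assigned a catch efficiency of 1, can be handled by passing a function that always returns 1.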

The resulting data (https://doi.org/10.1594/PANGAEA.907379; Smith et al., 2019b) includes a sub-set of adjusted and unadjusted 30-minute precipitation amounts for each SUT gauge configuration at each site (adjusted using Eq. 1 and Eq. 2 and using either Ugh or U10m or both where available), the 30-minute DFAR data, and the accumulated gap preserved time series for each with flags identifying the periods that were not adjusted.

Four complementary statistical metrics are used to assess the performance of the transfer functions for adjusting precipitation. These are relative total catch (RTC), root mean square error (RMSE), Pearson correlation (r) and percentage of events within 0.1 mm of the DFAR-reported values (PE). RTC is the ratio of accumulated catch of the SUT to that of the reference over the same period and using the same filter or thresholds (i.e. for snow). RTC is expressed as a percentage of the reference precipitation amount and reflects the capability of the transfer functions to adjust seasonal and long-term precipitation totals, with a perfect seasonal adjustment having an RTC of 100%. Hence, it can be used to assess the improvement in the seasonal bias following the adjustment.
Root mean square error is used to estimate the magnitude of uncertainty in the 30-min unadjusted and adjusted SUT measurements relative to the DFAR. The ideal value of the RMSE is 0 mm, indicating perfect agreement between the 30-min SUT observations and those from the reference; however, the variability in the magnitude of the RMSE amongst the sites should be interpreted with caution, as a small RMSE at a site with low precipitation rates may be more significant than a higher RMSE at a site that has higher precipitation rates. For that reason, the RMSE will be used mainly to assess the relative performance of Eq. 1 and Eq. 2 at each site, as well as the relative performance of the transfer functions for the single Alter-shielded and unshielded gauges.
The Pearson correlation assesses the strength of the linear correlation between the 30-min measurements made with the SUT and the reference DFAR, before and after adjustments using Eq. 1 and Eq. 2. This metric provides an overall indication of changes in the differences between SUT and reference measurements, but does not provide an indication of changes in the bias or the magnitude of differences.
The PE statistic is the percentage of 30-min events measured by the SUT that differ from the DFAR by less than a pre-defined threshold (0.1 mm in this analysis), and is calculated as:

$$PE = \frac{n_T}{n} \times 100\%$$

where n is the number of 30-min events considered in the analysis and nT is the size of the subset of n where |DFAR - SUT| < 0.1 mm. PE was introduced as a metric for assessing transfer functions in K2017b, with the 0.1 mm threshold based on the measurement uncertainty of the reference configuration as investigated during WMO-SPICE (Nitu et al., 2018). Ideally, applying a transfer function to adjust the systematic bias (undercatch) in precipitation measurements would increase the magnitude of SUT observations to be closer to the reference measurements, thereby increasing PE. Perfect agreement between the reference and the SUT, within the 0.1 mm threshold, would produce a PE of 100%. Hence, PE provides an additional event-based perspective on how adjustments impact bias and uncertainty that is not captured by the other metrics used in the assessment.
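The four metrics can be computed from paired 30-minute series as in the following Python sketch. It follows the definitions given in the text; the function names and the sample values are hypothetical and are not data from this study.

```python
import math


def rtc(sut, ref):
    """Relative total catch: SUT total as a percentage of the reference total."""
    return 100.0 * sum(sut) / sum(ref)


def rmse(sut, ref):
    """Root mean square error of the SUT relative to the reference (mm)."""
    return math.sqrt(sum((s - r) ** 2 for s, r in zip(sut, ref)) / len(ref))


def pearson_r(sut, ref):
    """Pearson correlation between SUT and reference measurements."""
    n = len(ref)
    mean_s, mean_r = sum(sut) / n, sum(ref) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(sut, ref))
    var_s = sum((s - mean_s) ** 2 for s in sut)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_s * var_r)


def pe(sut, ref, threshold=0.1):
    """Percentage of events with |reference - SUT| below the threshold."""
    n_t = sum(1 for s, r in zip(sut, ref) if abs(r - s) < threshold)
    return 100.0 * n_t / len(ref)


# Hypothetical 30-minute accumulations (mm) for a reference and one SUT.
dfar = [0.5, 1.0, 0.2, 0.8]
gauge = [0.45, 0.7, 0.2, 0.6]
print(rtc(gauge, dfar), rmse(gauge, dfar), pearson_r(gauge, dfar), pe(gauge, dfar))
```

In this made-up example the gauge is strongly correlated with the reference while still undercatching in total, which illustrates why the correlation alone says nothing about bias and why the metrics are read together.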
The assessment considers overall performance for all precipitation types combined, as well as for snow alone. In the latter case, the assessment of snow adjustments using Eq. 1 employs the same temperature threshold for snow as

Time series
The impact and the performance of transfer functions for adjusting precipitation can be examined by comparing the accumulation time series for unadjusted and adjusted data to the reference. [...] with the same shield configuration was present at a site, and where more than one wind speed height was available, results for only one gauge and wind speed height were selected for illustrative purposes. The precipitation amounts vary by site and season, but the general trends in SUT undercatch as compared to the DFAR are consistent. Accordingly, the impact of the adjustment also appears to be consistent. Without considering precipitation phase partitioning, unadjusted precipitation (solid lines) relative to the DFAR was always lowest at the windy sites of XBK and HKL. Unadjusted, the unshielded gauges (blue lines) always caught less precipitation than the single Alter-shielded gauges (red lines) at all sites and during both winters. The transfer function (only the temperature-dependent Eq. 1 adjustments are shown here) appears to be less effective at the windier sites (wind speeds during precipitation events are shown in Table 2), with a substantial undercatch remaining after adjustment. Further, precipitation is over-adjusted at some of the less windy sites (CAR, CCR, and SOD).

Relative total catch
The RTC metrics for the single Alter-shielded gauges are shown in Table 2 and for the unshielded gauges in Table 3 for Eqs. 1 and 2, combining the data from the two intercomparison seasons (CCR being the exception, since data are only available for 2016/2017). RTC is shown for all precipitation phases and for snow, and for both wind measurement heights (U10m and Ugh), where available. If there are multiple gauges per site, the RTC is reported separately for each gauge and for the combined gauge dataset. Figure 5 summarizes the snow RTC in the tables for the combined results for both winter seasons. When more than one gauge exists, the Ugh was used for the adjustment for all sites except FOR, which only reported U10m.
The RTC values in Tables 2 and 3 indicate that the unadjusted catch for snow is lower than the unadjusted catch for all precipitation types at most of the sites. The magnitude of the difference depends on the relative amounts of solid and liquid precipitation received during the season, as well as the wind speeds during snow. The biggest difference in the unadjusted RTC values between all precipitation types and snow occurred at sites where more rain occurred during the intercomparison season, as the gauge catch for rainfall is naturally higher (Yang et al., 1998; Smith, 2008) and biases the total catch. At XBK, CAR, FOR, and HKL, removing rain and mixed precipitation from the statistics had a large impact on the RTC, both pre- and post-adjustment, and provides a more realistic metric for assessing how well the transfer functions are performing for the adjustment of snow measurements.
Although the sample size is smaller (fewer unshielded than single Alter-shielded gauges), the unadjusted catch of the unshielded gauges (Table 3) was lower than the unadjusted catch of the single Alter-shielded gauges (Table 2). From the single Alter-shielded RTC in Table 2, focusing on snow, the differences between the Eq. 1 and Eq. 2 results were small, varying within 1% to 2%. This can also be seen in the combined results in Fig. 5. The difference between Eq. 1 and Eq. 2 was greater for the unshielded gauges (Table 3), with Eq. 1 performing better than Eq. 2 for snow at XBK (+12%) and HKL (+10% using Ugh). Equation 2 tended to under-adjust the unshielded gauges more than Eq. 1 at XBK and HKL.
At sites with both wind speed heights available for use in the adjustment (XBK, CAR, HKL, and MAR), the data shown in Table 2 for single Alter-shielded gauges suggest that using Ugh reduces the extent of over-adjustment (CAR, MAR) or under-adjustment (XBK and HKL) relative to U10m (i.e. adjustments closer to 100%). This holds for the unshielded gauge adjustments in Table 3, with the exception of MAR, which shows a large under-adjustment using Ugh as compared to U10m. This may suggest that Ugh wind speeds are biased low at MAR, which is consistent with comments made by Kochendorfer et al. (2017a) that the gauge-height wind measurements were shadowed in some directions. Although the sample size was small, there is reason to suggest from an RTC perspective that Ugh outperforms U10m when used in Eq. 1 and Eq. 2 for adjusting snow measurements. For this reason, where available, the Ugh rather than U10m wind adjustments are shown in Fig. 5 and subsequent figures.
The differences in RTC for adjusted measurements from single Alter-shielded gauges versus those from unshielded gauges are mixed. At the windier HKL and XBK sites, Fig. 5 suggests that the adjusted RTC for the unshielded gauge is just as high as or higher than for the single Alter-shielded gauge (Eq. 2 at XBK being the exception). The over-adjustments at SOD and CCR are exaggerated for the unshielded gauge, but the unshielded adjustment is closer to 100% at CAR and WFJ. MAR is an outlier in Fig. 5, possibly due to the potential issue with Ugh, but the U10m adjustment of the unshielded gauge (Table 3, snow) has an RTC closer to 100% (105% and 106% for Eq. 1 and Eq. 2, respectively) than the single Alter-shielded gauge (117%).
Including the RTC for both wind speed heights but excluding the combined gauge statistics, the adjustment for snow using Eq. 1 increases the mean catch efficiency for the single Alter-shielded gauge from 61% to 88% and for the unshielded gauge from 48% to 92%.

When considering each gauge at each site, the anticipated decrease in the RMSE following adjustment (whether for all precipitation types or snow only) is not universal. This is illustrated in Fig. 6, which shows the RMSE results for combined SUT snow datasets before and after adjustment using U10m (where possible). For single Alter-shielded gauges, the RMSE increases with adjustment at CAR, MAR, SOD and WFJ (although the increase is small at all sites but WFJ). The decrease in RMSE is small at XBK and CCR. The differences between RMSE results using Eq. 1 and Eq. 2 for single Alter-shielded gauges are insubstantial (< 0.005 mm). These differences are larger for the unshielded gauges, as shown in Fig. 6 (as high as 0.05 mm).

Pearson correlation
Similar to previous metrics, the single Alter-shielded and unshielded r-values are shown separately in Tables 6 and 7 respectively, and are plotted for snow in Fig. 7. In theory, the adjustment using a transfer function should strengthen the linear relationship (increase r) between the adjusted and the reference measurements by removing the non-linearity associated with wind bias.
For single Alter-shielded gauges, unadjusted r-values for all precipitation types range from 0.83 at HKL to 0.96 at

For both all precipitation phases and snow only (Fig. 7), the differences between r-values following the application of Eq. 1 and Eq. 2 are negligible for the single Alter-shielded gauges. For the unshielded gauges, Eq. 2 results in higher r values than Eq. 1, but the differences are very small (< 0.03).
For sites with both wind speed measurement heights, the correlations appear to be independent of the measurement height. The only exception is for the unshielded adjustment of snow measurements at MAR, where correlations based on transfer function application using Ugh are substantially lower than those using U10m. This likely results from shadowing effects on the Ugh data.

Percentage of events within 0.1 mm
As shown in Table 8, the PE for the single Alter-shielded gauges did not generally increase following adjustment, contrary to what was observed in K2017b. Similar to the observed trends for r and RMSE, the change in PE following adjustment of the Alter-shielded gauges was small, typically within ± 5%, with the only exception at FOR (+10% for snow).
The unshielded gauges showed a greater change in PE following adjustment (Table 9). For all precipitation, the difference in PE before and after adjustment was generally small (MAR and FOR being the exceptions with a 10% increase), but the change for snow was more substantial. From Fig. 8, PE for snow was found to decrease by more than 6% at some sites following adjustment (XBK, CCR, SOD) and increase by more than 9% at others (CAR, MAR, WFJ), with changes varying between -8% and +17%.
Similar to the other metrics, PE does not show any consistent advantage to either Eq. 1 or Eq. 2, with differences being less than 2% (most being less than 1%). There also does not appear to be any clear or consistent advantage to using Ugh vs. U10m, with differences in PE values generally within 2%.

4 Discussion
The current application of the universal transfer functions developed in SPICE to two winter seasons of precipitation data at eight locations produced variable results depending on site location. The discussion will focus on snow to avoid the complex influence of precipitation phases, in varying proportions, on the assessment results. Based on the relative total catch results, the transfer functions tend to under-adjust snow for single Alter-shielded gauges at the windy sites of XBK and HKL, with mean 10 m wind speeds during snow of 6.1 and 5.3 m s⁻¹, respectively (Table 2). The results from FOR were similar, despite relatively lower mean wind speeds of 4 m s⁻¹ at 10 m. The SOD, CCR, WFJ, and MAR sites were characterized by the lowest wind speeds, with mean values at 10 m smaller than 3.2 m s⁻¹. For single Alter-shielded gauges at these sites, the RTC following adjustment varied from 105% at SOD and CAR to 113% at MAR and WFJ. The above trends are similar to the bias results of K2017b for all precipitation phases, but more pronounced due to the focus on snow.
The adjusted RTC results showed greater variability for unshielded gauges (Table 3), performing better at some sites and poorer at others relative to the performance for the single Alter-shielded gauges (Table 2). Similarly, variable trends were observed in the adjusted PE results for unshielded and Alter-shielded gauges (Tables 9 and 8 respectively).
The adjustment of the unshielded gauges at the windy sites (XBK and HKL) was found to bring the RTC closer to 100% than the adjustment for the Alter-shielded gauges; however, upon examination of the RMSE results (Fig. 6), the errors associated with adjusting the unshielded gauges were generally higher compared to those associated with the single Alter-shielded gauges. Insight into the higher RMSE values for unshielded gauges is provided by the PE results (Table 9 and Fig. 8). At XBK, the PE dropped by 7% after adjusting the unshielded gauge, indicating that more measurements were pushed outside of the ±0.1 mm threshold relative to the DFAR than were pushed inside the threshold by the adjustment. The PE dropped after adjustment for the unshielded gauges at CCR and SOD as well, but this can be attributed to over-adjustment as shown by the RTC (Fig. 5). In summary, since the unadjusted RTC values for unshielded gauges are lower (especially at windy sites), the adjustments for unshielded measurements are necessarily larger, which magnifies signal noise and other random measurement errors. This propagation of errors through adjustment is discussed in greater detail by Kochendorfer et al. (2018).

The evaluation of transfer function performance is complicated by observations for which the DFAR detects a measurable amount of precipitation, but the SUT does not. A gauge measurement of zero cannot be adjusted with a transfer function. This impacts the performance metrics and cannot be ignored in the context of real-world applications of transfer functions (e.g. adjusting precipitation measurements for use in forecast validation). The limited utility of transfer functions in this regard is due to the configuration's capability to catch snow and not because of the transfer function. To assess the relative impact of these types of events on the evaluations, descriptive statistics were calculated for events at XBK, HKL, and SOD.
For XBK, there were 498 30-minute snow events during which the DFAR measured an accumulation value greater than zero. The single Alter-shielded gauge did not report precipitation during 285 of those events, which accounted for 14% of the total DFAR accumulation over both measurement seasons and had a mean wind speed of 6 m s⁻¹ (7.5 m s⁻¹) at gauge height (10 m). The number of events during which the unshielded gauge did not report precipitation was even higher: 376 events in total, accounting for 24% of the total DFAR precipitation, characterized by lower mean wind speeds ( [...]

Even though some adjusted RTC values are closer to 100% for unshielded gauges (such as at HKL, CAR and WFJ), the RMSE was generally lower and the PE consistently higher for single Alter-shielded gauges. Combined with a lower frequency of missed measurements, this supports the use of more shielding for solid precipitation measurements.
In general, the application of transfer functions resulted in under-adjustment at the windier sites and over-adjustment at the less windy sites; however, there is no clear relationship between the mean wind speed at a site and transfer function performance. The general performance of the transfer functions for single Alter-shielded gauge measurements, from the perspective of RTC, likely also depends on other factors, such as crystal characteristics (Thériault et al., 2012) or aerodynamic peculiarities at the intercomparison sites affecting the representative wind speed measurements (as discussed in K2017b). Based on the present results, we found that the transfer function
The differences in performance between Eq. 1 and Eq. 2 for adjusting snow measurements from single Alter-shielded gauges were small: RTC generally varied by less than 2%, and RMSE, r and PE were nearly identical. This was likely an artifact of the way that phase was determined in the methodology; even though CE is a function of air temperature in Eq. 1, the data for snow are a subset of the precipitation data based on the same phase discrimination used for Eq. 2. However, the metrics for all precipitation types were also similar, which is consistent with the results in K2017b, and suggests that transfer function selection is essentially a matter of user preference. In that respect, the "simpler" Eq. 2 is less simple in that the user is required to determine the phase based on temperature thresholds, while Eq. 1 requires no phase discrimination. The decision would appear to be more complicated for unshielded gauges, likely because of the increased uncertainty in both the measurement and the adjustment. Generally, the differences are still quite small, but Eq. 2 shows a slight advantage in RMSE and r. As with the single Alter-shielded gauges, the decision likely should be based on personal preference. It would be interesting, however, to explore refining the coefficients for Eq. 2 using optical disdrometers or present weather sensors to identify dominant precipitation phase and employing such instruments when performing adjustments. It may also be worthwhile to assess the performance of Eq. 2 while using hydrometeor temperature approximation, as described in Harder & Pomeroy (2013), for phase discrimination.
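The practical difference between the two equations can be illustrated with a short sketch. It assumes the functional forms given in K2017b: for Eq. 1, an exponential in wind speed modulated by an arctangent of air temperature, and for Eq. 2, an exponential in wind speed with phase-specific coefficients. These forms are not reproduced in this section and should be checked against the original equations. The coefficients and the temperature thresholds below are hypothetical placeholders chosen only so that the code runs; they are not the published SPICE values. Any limits on the valid wind speed range are omitted.

```python
import math

# Hypothetical coefficients for illustration only; NOT the published values.
EQ1_COEFFS = {"a": 0.03, "b": 1.0, "c": 0.7}
EQ2_COEFFS = {
    "snow": {"a": 0.7, "b": 0.2, "c": 0.3},
    "mixed": {"a": 0.3, "b": 0.2, "c": 0.7},
}


def ce_eq1(wind, t_air, a, b, c):
    """Assumed Eq. 1 form: CE = exp(-a * U * (1 - atan(b * T_air) + c))."""
    return math.exp(-a * wind * (1.0 - math.atan(b * t_air) + c))


def ce_eq2(wind, a, b, c):
    """Assumed Eq. 2 form: CE = a * exp(-b * U) + c."""
    return a * math.exp(-b * wind) + c


def classify_phase(t_air, snow_max=-2.0, rain_min=2.0):
    """Phase from air temperature (deg C); thresholds are hypothetical."""
    if t_air < snow_max:
        return "snow"
    if t_air > rain_min:
        return "rain"
    return "mixed"


def adjust_eq1(measured, wind, t_air, coeffs=EQ1_COEFFS):
    """Adjust a 30-min accumulation with Eq. 1; no phase step is needed."""
    # A zero measurement stays zero: dividing by CE cannot recover a miss.
    return measured / ce_eq1(wind, t_air, **coeffs)


def adjust_eq2(measured, wind, t_air, coeffs=EQ2_COEFFS):
    """Adjust with Eq. 2, which first requires a phase classification."""
    phase = classify_phase(t_air)
    if phase == "rain":
        # Assumption of this sketch: liquid precipitation is left unadjusted.
        return measured
    return measured / ce_eq2(wind, **coeffs[phase])
```

The sketch makes the trade-off concrete: Eq. 1 takes air temperature directly, whereas Eq. 2 has a simpler catch efficiency curve but needs the extra `classify_phase` step and a separate coefficient set for each phase. Replacing `classify_phase` with output from a disdrometer or present weather sensor is the refinement suggested in the text.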
Only four of eight test sites measured both U10m and Ugh, so it is difficult to draw conclusions regarding the influence of wind speed measurement height on the performance of the transfer functions. For those four sites, the RTC using Ugh was closer to 100% for many of the adjustments, but RMSE and PE varied by site and r-values were nearly identical. The Ugh is a direct measurement of the wind speed at gauge height, and does not rely on the potentially problematic assumption that wind speed at 10 m is representative of wind speed at gauge height. This assumption relies on the estimation of surface roughness, which changes with vegetation cover, snowfall, and drifting. It also neglects the impact of increasing snow depth on the relationship between the gauge height wind speed and the 10 m height wind speed. However, depending on the distance between the SUT and the Ugh measurement, the Ugh measurement may also not be representative of the wind speed at the SUT due to interference from instruments, wind shields, and other obstructions between the wind sensor and the gauge.
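The sensitivity of a height-adjusted wind speed to surface roughness and snow depth can be sketched with the standard neutral logarithmic wind profile. This is an illustration under stated assumptions, not the procedure used at the SPICE sites: the default gauge height of 2 m and roughness length of 0.01 m are hypothetical, the function name `gauge_height_wind` is invented for the example, and snow depth is treated simply as raising the surface beneath both sensors.

```python
import math


def gauge_height_wind(u10, gauge_height=2.0, z0=0.01, snow_depth=0.0,
                      ref_height=10.0):
    """Estimate gauge-height wind speed from a 10 m wind speed.

    Uses a neutral log profile, U(z) proportional to ln(z / z0), with
    heights measured from the current snow surface.

    u10          : wind speed at ref_height (m/s)
    gauge_height : gauge orifice height above bare ground (m), hypothetical
    z0           : surface roughness length (m), hypothetical
    snow_depth   : snow depth (m), reduces both effective heights
    """
    h_gauge = gauge_height - snow_depth
    h_ref = ref_height - snow_depth
    if z0 <= 0 or h_gauge <= z0 or h_ref <= h_gauge:
        raise ValueError("heights must satisfy z0 < gauge height < ref height")
    return u10 * math.log(h_gauge / z0) / math.log(h_ref / z0)


# Compare a bare surface, a deep snowpack, and a rougher surface.
for label, kwargs in [
    ("bare ground", {}),
    ("1 m of snow", {"snow_depth": 1.0}),
    ("rougher surface", {"z0": 0.1}),
]:
    print(label, round(gauge_height_wind(6.0, **kwargs), 2))
```

Under these assumptions, both a deeper snowpack and a rougher surface lower the gauge-height estimate for the same 10 m wind speed. A fixed conversion factor will therefore drift over a season, which is one reason a direct Ugh measurement can be preferable where it is representative of the gauge location.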
As noted in K2017b, discrepancies in the various wind speed measurements (whether instrument, height, or exposure related) make it difficult to ascertain any advantage or disadvantage of using one wind speed height over the other. It is recommended to use the best wind speed data available at a given site for transfer function adjustment, but to be cognizant of the issues related to spatial representation of wind speed at the site. Additional uncertainty related to wind speed may be attributed to the variability within the 30-min mean period. Although this was not included in the current analysis, previous work by Wolff et al. (2015) and Nitu et al. (2018) at HKL showed that the impact of high-frequency variability in the wind speed over 30-min periods on transfer functions was negligible.

Conclusions
The evaluation of the performance of WMO SPICE transfer functions using an independent, post-SPICE dataset showed that the performance varies by site and shield configuration and is considerably reduced when only assessing their performance for snow. Generally, the application of the transfer functions to measurements from sites with higher wind speeds resulted in an under-adjustment, while producing an over-adjustment for measurements from less windy sites. This trend was not universal, which indicates that the performance is also linked to local climatic conditions affecting snowfall characteristics. On average, the transfer functions resulted in an increase in the RTC of snow measurements from single Alter-shielded gauges (unshielded gauges) from 61% (48%) to 88% (92%), but also produced an under-adjustment as low as 54% and an over-adjustment as high as 123%. Although the RTC values imply improved transfer function performance when adjusting unshielded gauges relative to single Alter-shielded gauges, the higher RMSE and lower PE for unshielded adjustments suggest otherwise. Further, the unshielded gauges were shown to completely miss a larger proportion of events and accumulated precipitation relative to the DFAR than the shielded gauges, raising the critical point that precipitation that is not recorded by the gauge configuration cannot be adjusted. The differences in performance observed for Eq. 1 and Eq. 2 were small enough that the choice of transfer function should largely depend on the availability of observed precipitation phase data as well as user preference.
With only four sites collecting wind speed data at both 10 m and gauge height, it was difficult to determine if the wind speed measurement height significantly affected transfer function performance. RTC was generally closer to 100% when gauge height winds were used for the adjustment, but the RMSE, PE and correlation results were mixed.
Regardless, and perhaps more importantly, users must also carefully consider potential issues with obstructions and spatial representativeness when selecting a wind speed measurement.
Ultimately, eight DFAR intercomparison sites were insufficient to address the variability in performance of the SPICE transfer functions and more intercomparison sites with a DFAR are needed in various cold region climate regimes for more thorough assessments, a key recommendation from the WMO-SPICE project (Nitu et al., 2018). For the most part, and especially at locations that experience relatively high wind speeds during snowfall events, the application of the adjustment improved the usability of the observations. This study also suggests a high degree of uncertainty in applying these adjustments in networks that geographically span many different climate regimes, and additional work is required to assess and minimize that uncertainty.

Data availability
The quality controlled 30-minute data set used in this publication is available at T.L., and C.S. were core participants of the WMO-SPICE data team and were instrumental in the collection and provision of these data.

Competing interests
The authors declare that they have no conflict of interest.

The authors would like to thank Eva Mekis of Environment and Climate Change Canada for providing an internal review of this manuscript. We would like to acknowledge the organizations that collected and provided the data for