Comparison of nonhomogeneous regression models for probabilistic wind speed forecasting

In weather forecasting, nonhomogeneous regression is used to statistically postprocess forecast ensembles in order to obtain calibrated predictive distributions. For wind speed forecasts, the regression model is given by a truncated normal distribution where location and spread are derived from the ensemble. This paper proposes two alternative approaches which utilize the generalized extreme value (GEV) distribution. A direct alternative to the truncated normal regression is to apply a predictive distribution from the GEV family, while a regime switching approach based on the median of the forecast ensemble incorporates both distributions. In a case study on daily maximum wind speed over Germany with the forecast ensemble from the European Centre for Medium-Range Weather Forecasts, all three approaches provide calibrated and sharp predictive distributions with the regime switching approach showing the highest skill in the upper tail.


Introduction
Reliable forecasts of wind speed are a necessity in a wide range of applications such as agriculture, most modern means of transportation and wind energy production. Wind power, as a renewable and emission-free alternative to fossil fuels, has been growing rapidly over the last decade. In Europe, wind power's share of total installed power capacity amounted to about 11.4% at the end of 2012 and it has increased fivefold since 2000 (European Wind Energy Association, 2012). For wind energy production, accurate forecasts of wind speed at different lead times are required to regulate electricity markets, to schedule maintenance and, more generally, to improve the competitiveness of wind power compared to sources of electricity which allow for dispatchable generation (Genton and Hering, 2007; Pinson et al., 2007; Lei et al., 2009). In many of these applications and for weather warnings, high wind speeds are of particular importance.
The focus of this article is daily forecasts with medium-range lead times of 1–3 d. In this setting, forecasts are usually based on outputs from numerical weather prediction (NWP) models which use physical descriptions of the atmosphere and oceans to propagate the state of the atmosphere forward in time based on the current weather conditions. Moreover, to account for uncertainties in the knowledge of the initial state of the atmosphere and in the numerical model, the NWP models are often run several times with different initial conditions and/or numerical representations of the atmosphere, resulting in an ensemble of forecasts (Leutbecher and Palmer, 2008). The development of ensemble prediction systems plays a key role in the transition from deterministic to probabilistic forecasting and has become an established part of weather and climate prediction (Palmer, 2002).
Probabilistic forecasts are essential in many applications in that they allow for quantification of the associated prediction uncertainty. Furthermore, optimal decision making requires probabilistic forecasts, particularly for rapidly fluctuating resources such as wind energy where the optimal point forecast depends on permanently changing market features (Pinson et al., 2007; Thorarinsdottir and Gneiting, 2010). See also Gneiting (2011) for a detailed discussion of optimal deterministic forecasts based on probabilistic forecasts. While ensemble systems are valuable in this context, they are finite and do not provide full predictive distributions. Also, they tend to be underdispersive and subject to a systematic bias, and thus they require some form of statistical post-processing (Hamill and Colucci, 1997). Statistical post-processing methods for ensembles of wind speed forecasts include the ensemble Bayesian model averaging (BMA) method of Sloughter et al. (2010) and the nonhomogeneous regression (NR) or ensemble model output statistics (EMOS) approach developed by Thorarinsdottir and Gneiting (2010). BMA predictions are given by weighted mixtures of parametric densities or kernels, each of which depends on a single ensemble member, with the mixture weights determined by the performance of the ensemble members in the training period. For wind speed, Sloughter et al. (2010) apply a mixture of gamma distributions; see also Courtney et al. (2013).

*Corresponding author. e-mail: Sebastian.Lerch@uni-heidelberg.de

Tellus A 2013. © 2013 S. Lerch and T. L. Thorarinsdottir. This is an Open Access article distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The NR method of Thorarinsdottir and Gneiting (2010), on the other hand, applies a single normal distribution truncated at zero, where the location parameter is an affine function of the ensemble members and the scale parameter is an affine function of their variance. In a comparison study, the two methods show very similar predictive performance (Thorarinsdottir and Gneiting, 2010). The NR method has been extended to wind gusts (Thorarinsdottir and Johnson, 2012) and a BMA approach for wind direction is proposed in Bao et al. (2010). Pinson (2012), Schuhen et al. (2012) and Sloughter et al. (2013) study statistical post-processing of bivariate wind vector ensembles.
Hourly average wind speeds are usually modelled using lognormal, gamma, Rayleigh or Weibull densities, with the Weibull model showing the best performance in many case studies, see for example Garcia et al. (1998) and Celik (2004). Here, we consider forecasts of daily maximum wind speed and the predictive distributions are conditioned on the ensemble forecast, the situation for which the post-processing approaches of Sloughter et al. (2010) and Thorarinsdottir and Gneiting (2010) were developed. As daily maximum wind speeds are block maxima, results from extreme value theory imply that the generalised extreme value (GEV) distribution provides a suitable model (Coles, 2001). GEV distributions have especially received attention in modelling maxima of wind and gust speed observations over long return periods, typically 50 yr, see Palutikof et al. (1999) and references therein. Friederichs and Thorarinsdottir (2012) apply a GEV model for probabilistic predictions of daily peak wind speed.
We propose to combine the NR ensemble post-processing framework originally proposed by Gneiting et al. (2005) and later extended by Thorarinsdottir and Gneiting (2010) with results from extreme value theory. That is, we apply a predictive distribution from the GEV family, where the location and scale parameters depend on the ensemble member forecasts. To illustrate the difference between the two NR approaches, Fig. 1 shows the predictive distributions for Frankfurt Airport on 19 March 2011. Both the truncated normal (TN) and the GEV model correct the negative bias and the underdispersion of the ensemble, while the GEV density is less symmetric and exhibits a heavy right tail. We further investigate a regime-switching approach, which issues a TN predictive density when the ensemble median forecast takes a low value and a GEV predictive density for high values of the ensemble median.
The remainder of the article is organised as follows. In Section 2, the ensemble forecasts and the observational data are introduced. In Section 3, we review the NR technique and describe our extensions using GEV distributions and a regime-switching combination model. Section 4 summarises the probabilistic scores used for estimating the model parameters and evaluating the competing forecasting procedures. In particular, we discuss how appropriately weighted proper scoring rules recently proposed in the economic literature (Diks et al., 2011; Gneiting and Ranjan, 2011) can be used to assess the predictive performance for high wind speeds. In Section 5, we report the results of a case study on daily maximum wind speed over Germany for lead times of 1–3 d with the ensemble issued by the European Centre for Medium-Range Weather Forecasts (ECMWF) from May 2010 to April 2011. We close with a discussion in Section 6.

Data
We consider an ensemble forecast with 50 members of near-surface (10-m) wind speed obtained from the global ensemble prediction system of the ECMWF. Ensemble forecasts for lead times up to 10 d ahead are issued twice a day at 00 UTC and 12 UTC, with a horizontal resolution of about 33 km and a temporal resolution of 3–6 hours. To account for uncertainties in the initial conditions and the numerical model, the ensemble members are generated from random perturbations in initial conditions and stochastic physics parametrisation (Molteni et al., 1996; Leutbecher and Palmer, 2008; Pinson and Hagedorn, 2012). The ensemble members are thus statistically indistinguishable and can be treated as exchangeable (Fraley et al., 2010). We restrict our attention to the ECMWF ensemble run initialised at 00 UTC and lead times of 1–3 d. To obtain predictions of daily maximum wind speed, we take the daily maximum of each ensemble member at each grid point location. For instance, one-day ahead forecasts are given by the maximum over lead times of 3, 6, …, 24 hours. The forecasts are verified over a set of 228 synoptic observation stations over Germany, see Fig. 2. The observations are hourly observations of 10-minute average wind speed which are measured over the 10 minutes before the hour. To obtain daily maximum wind speed, we take the maximum over the 24 hours corresponding to the time frame of the ensemble forecast. In the following, the terms wind speed and daily maximum wind speed are used synonymously. Ensemble forecasts at individual stations are obtained by bilinear interpolation of the gridded model output. The results presented below are based on a verification period from 1 May 2010 to 30 April 2011, consisting of 83 220 individual forecast cases. Additionally, we use data from 1 February 2010 to 30 April 2011 to obtain training periods of equal lengths for all days in the verification period and for model selection purposes.
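As an illustrative sketch (the paper contains no code; Python and the cell-local coordinates are our assumptions), bilinear interpolation of the gridded model output to a station location inside one grid cell can be written as:

```python
def bilinear(f00, f10, f01, f11, tx, ty):
    """Bilinear interpolation within a single grid cell.

    f00, f10, f01, f11 are the model values at the four surrounding
    grid points (0,0), (1,0), (0,1), (1,1), and tx, ty in [0, 1] give
    the station's fractional position within the cell.
    """
    return ((1.0 - tx) * (1.0 - ty) * f00 + tx * (1.0 - ty) * f10
            + (1.0 - tx) * ty * f01 + tx * ty * f11)
```

The same weights are applied independently to every ensemble member and lead time before the daily maximum is taken.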

NR prediction models
The NR methodology was originally developed for sea-level pressure and surface temperature under a normal predictive distribution (Gneiting et al., 2005), see also Hagedorn et al. (2008) and Kann et al. (2009) for further applications. Thorarinsdottir and Gneiting (2010) extend the framework to wind speed using a normal distribution truncated at zero, while Thorarinsdottir and Johnson (2012) apply the same setup to predict gust speeds based on NWP forecasts of wind speed and gust factors. A bivariate normal model for wind vectors is discussed in Schuhen et al. (2012). An NR framework for quantitative precipitation has recently been proposed by Scheuerer (2013) using a GEV model for the precipitation accumulations.

TN model
Let Y denote wind speed and X_1, …, X_k the corresponding ensemble member forecasts. The predictive distribution for Y is given by a TN distribution with a cut-off at 0,

Y | X_1, …, X_k ~ N_[0,∞)(μ, σ²),  (1)

where the location parameter μ = a + b_1 X_1 + … + b_k X_k is an affine function of the ensemble forecasts and the scale parameter σ² = c + dS² is an affine function of the ensemble variance S². As the ensemble members are exchangeable, the coefficients of the location parameter are constrained to be equal, b_1 = … = b_k (Fraley et al., 2010). The cumulative distribution function of the TN distribution is given by

F(z) = [Φ((z − μ)/σ) − Φ(−μ/σ)] / [1 − Φ(−μ/σ)]

for z ≥ 0, and 0 otherwise, where Φ denotes the cumulative distribution function of the standard normal distribution.
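The TN cumulative distribution function above can be sketched as follows; this is a minimal Python illustration (not code from the paper), with μ and σ assumed to have been linked to the ensemble as described:

```python
import math

def norm_cdf(z):
    """Standard normal CDF Phi, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tn_cdf(z, mu, sigma):
    """CDF of a normal distribution with location mu and scale sigma,
    truncated at zero as in eq. (1):
        F(z) = [Phi((z - mu)/sigma) - Phi(-mu/sigma)] / [1 - Phi(-mu/sigma)]
    for z >= 0, and 0 otherwise."""
    if z < 0:
        return 0.0
    lower = norm_cdf(-mu / sigma)
    return (norm_cdf((z - mu) / sigma) - lower) / (1.0 - lower)
```

The renormalisation by 1 − Φ(−μ/σ) ensures that all probability mass lies on the non-negative half-line.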
Fig. 2. Map of Germany showing the locations of the 228 synoptic observation stations used in this study. The stations at Frankfurt Airport, Greifswalder Oie and Kap Arkona are indicated by individual markers.

GEV model
As an alternative to the TN model in (1), we consider a model based on extreme value theory. The cumulative distribution function of the GEV distribution with location parameter μ, scale parameter σ and shape parameter ξ is given by

G(z) = exp{−[1 + ξ(z − μ)/σ]^(−1/ξ)} for ξ ≠ 0, and G(z) = exp{−exp[−(z − μ)/σ]} for ξ = 0.  (2)

For ξ ≠ 0, this distribution is defined on the set {z ∈ ℝ : 1 + ξ(z − μ)/σ > 0}, where the parameters satisfy μ, ξ ∈ ℝ and σ > 0.
For ξ > 0, G is of Fréchet type with a heavy right tail and it holds that z ∈ [μ − σ/ξ, ∞). We obtain the Fréchet type in approximately 99.5% of our forecast cases. We estimate the parameters of the model in (2) without any constraints on the parameter values. It is thus possible to obtain non-zero probabilities of negative wind speed. However, as our data consist of daily maximum wind speeds, we find that this rarely happens in practice. The probability of negative wind speed is larger than 1% in about 0.1% of the forecast cases and it never exceeds 5%.
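A minimal sketch of the GEV cumulative distribution function in (2), covering both the ξ ≠ 0 case and the Gumbel limit ξ = 0 (Python is our choice; the paper contains no code):

```python
import math

def gev_cdf(z, mu, sigma, xi):
    """CDF of the generalized extreme value distribution (eq. 2).

    For xi != 0: G(z) = exp(-[1 + xi*(z - mu)/sigma]^(-1/xi)) on the
    support where 1 + xi*(z - mu)/sigma > 0; for xi = 0 (Gumbel type):
    G(z) = exp(-exp(-(z - mu)/sigma)).
    """
    if sigma <= 0:
        raise ValueError("scale parameter must be positive")
    s = (z - mu) / sigma
    if xi == 0.0:
        return math.exp(-math.exp(-s))
    t = 1.0 + xi * s
    if t <= 0.0:
        # below the lower endpoint (xi > 0) or above the upper one (xi < 0)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))
```

For ξ > 0 (the Fréchet case obtained in almost all forecast cases here), the support starts at μ − σ/ξ, so negative wind speeds receive positive probability only when this lower endpoint is negative.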
To link the parameters of the predictive GEV distribution to the ensemble, we apply the Bayesian covariate selection algorithm described in Friederichs and Thorarinsdottir (2012) to the data from 1 February 2010 to 30 April 2010. In this analysis, we assume a constant shape parameter ξ, while the location μ and the scale σ may depend on the ensemble mean X̄ and the ensemble variance S²,

μ = μ_0 + μ_1 κ_1 X̄ + μ_2 κ_2 S²,  σ = σ_0 + σ_1 ν_1 X̄ + σ_2 ν_2 S²,

where μ_i, σ_i ∈ ℝ for i = 0, 1, 2 and κ_i, ν_i ∈ {0, 1} for i = 1, 2.
For 100 000 iterations of the Metropolis-within-Gibbs algorithm with a burn-in period of 20 000 iterations, we obtain very high posterior inclusion probabilities for the ensemble mean X̄, while κ_2 = 1 or ν_2 = 1 holds for less than 0.1% of the posterior sample for each parameter. In our subsequent predictions for the test set from 1 May 2010 to 30 April 2011, we thus set μ = μ_0 + μ_1 X̄ and σ = σ_0 + σ_1 X̄ under the constraint that σ > 0, as the results of Friederichs and Thorarinsdottir (2012) indicate that an identity link on σ results in minimally improved performance compared to the logarithmic link for the estimation procedure described in Section 4.

Regime-switching combination model
The third model we consider is a regime-switching method which combines the TN approach in (1) and the GEV approach in (2). Conditional on the median X_med of the ensemble predictions, we either issue a TN or a GEV predictive distribution independently at each station. That is, for a model threshold θ ∈ ℝ₊, we define the predictive distribution by

Y | X_1, …, X_k ~ N_[0,∞)(μ, σ²) if X_med < θ, and Y | X_1, …, X_k ~ GEV(μ, σ, ξ) if X_med ≥ θ.  (3)

Here, the parameters of the TN and GEV models depend on the ensemble forecast as described above. However, we train the TN model only on training data for which it holds that X_med < θ. Similarly, the parameters of the GEV distribution are learned from data where X_med ≥ θ. The model threshold θ is selected by comparing predictive performance over a range of possible thresholds based on the out-of-sample data from 1 February to 30 April 2010. Generally, thresholds between 7 and 8 m s⁻¹ prove optimal, which approximately corresponds to the 75th to 85th percentiles of the median ensemble predictions over the verification period. These results are discussed in detail below. Under this model, the probability of negative wind speed is less than 1.4 × 10⁻⁵ for all forecast cases. To illustrate the effect of the regime switching in (3), Fig. 3 shows the median of the post-processed predictive distribution as a function of the ensemble median for the three post-processing methods proposed here. While the plots for the TN and the GEV models display a linear though slightly heteroskedastic relationship between the two, this relationship is piecewise linear for the combination method. In particular, as the parameters of the GEV regime are learned from data where X_med ≥ θ only, the median of the post-processed distribution is generally higher than when the GEV distribution is used for all forecast cases.
Under the regime-switching framework in (3), stations with similar ECMWF median forecasts might have somewhat different post-processed forecasts when these fall close to the threshold θ, see the example in Fig. 4. However, as our data set consists of a collection of individual stations that are separated in space, this does not imply discontinuities in the resulting predictions. If a continuous spatial forecast is needed, a single TN or GEV model as presented above might be more appropriate.
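The regime selection in (3) amounts to a one-line decision on the ensemble median; a sketch (Python is our choice, and the default threshold value θ = 7.5 m s⁻¹ is the one found optimal in the case study):

```python
import statistics

def select_regime(ensemble, theta=7.5):
    """Regime-switching rule of eq. (3): issue the TN model when the
    ensemble median falls below theta, and the GEV model otherwise.
    theta = 7.5 m/s is the threshold selected in the case study."""
    return "GEV" if statistics.median(ensemble) >= theta else "TN"
```

In the full method, the TN and GEV parameters are each estimated only from training cases falling in their own regime.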

Parameter estimation and prediction verification
The aim of the prediction is to 'maximize the sharpness of the predictive distribution subject to calibration' (Gneiting et al., 2007). Calibration is a joint property of the predictive distribution and the associated observation. It essentially requires that the observation is indistinguishable from a random draw from the predictive distribution. Sharpness refers to the concentration of the predictive distribution and is a property of the forecasts only. Anderson (1996) and Hamill and Colucci (1997) propose verification rank (VR) histograms as a graphical tool to assess the calibration of ensemble predictions. VR histograms show the distribution of the ranks of the observations when pooled within the ordered ensemble predictions. For a calibrated ensemble, the observations and the ensemble predictions should be exchangeable, resulting in a uniform VR histogram. The continuous analogue of the VR histogram is the probability integral transform (PIT) histogram (Dawid, 1984). The PIT is the value of the predictive cumulative distribution function at the realised observation. Again, for calibrated forecasts, the PIT values should follow a uniform distribution. To quantify the deviation of VR histograms from uniformity, Delle Monache et al. (2006) propose the reliability index Δ. Here, we define Δ to apply equally to VR and PIT histograms and let it be given by

Δ = Σ_{i=1}^{m} | f_i − 1/m |,  (4)

where m denotes the number of classes in the histogram, each of which has expected relative frequency 1/m, and f_i denotes the observed relative frequency in class i.
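The reliability index Δ in (4) can be computed directly from a sample of PIT values; an illustrative Python sketch in which the number of bins m is an argument (the paper does not prescribe one):

```python
def reliability_index(pit_values, m):
    """Reliability index Delta of eq. (4): the sum over the m histogram
    bins of |f_i - 1/m|, where f_i is the observed relative frequency in
    bin i. PIT values are assumed to lie in [0, 1]."""
    n = len(pit_values)
    counts = [0] * m
    for p in pit_values:
        counts[min(int(p * m), m - 1)] += 1  # clamp p == 1.0 into last bin
    return sum(abs(c / n - 1.0 / m) for c in counts)
```

A perfectly uniform PIT histogram gives Δ = 0; larger values indicate stronger deviations from calibration.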

Proper scoring rules
Scoring rules assign numerical values to forecast–observation pairs and provide summary measures of predictive performance. Forecasting methods can be compared in this manner by averaging their scores over a test set. If the scoring rule evaluates the full predictive distribution, it can simultaneously address calibration and sharpness.
A scoring rule is proper if the expected score is minimised when the true distribution of the observation is issued as the forecast (Bröcker and Smith, 2007). Proper scores thus prevent hedging strategies. Popular examples of proper scoring rules are the logarithmic or ignorance score (Good, 1952),

LogS(F, y) = −log f(y),  (5)

where f denotes the density of F and y denotes the corresponding observation, and the continuous ranked probability score (CRPS) (Hersbach, 2000),

CRPS(F, y) = ∫_{−∞}^{∞} (F(z) − 1{y ≤ z})² dz,  (6)

where the distribution F is assumed to have a finite first moment and y again denotes the corresponding observation. We furthermore use the absolute error |x − y| for the point forecast x given by the median of the predictive distribution as a deterministic measure of accuracy. The median of the predictive distribution is the Bayes predictor under the absolute error loss function (Gneiting, 2011). Here, the scoring rules are negatively oriented in that a smaller score denotes a better performance.
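For a discrete forecast with equally weighted members, such as the raw ensemble or the climatological forecast evaluated in Section 5, the CRPS in (6) admits the closed-form kernel representation CRPS(F, y) = E|X − y| − ½ E|X − X′|; a self-contained sketch (Python is our choice):

```python
def crps_ensemble(members, y):
    """CRPS of eq. (6) evaluated exactly for a discrete forecast with
    equally weighted members, via the kernel identity
        CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    e_xy = sum(abs(x - y) for x in members) / m
    e_xx = sum(abs(a - b) for a in members for b in members) / (m * m)
    return e_xy - 0.5 * e_xx
```

For a one-member "ensemble" the score reduces to the absolute error, consistent with the point-forecast interpretation above.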

Evaluation of forecasts for high wind speeds
Despite the variety of theoretically justifiable methods to evaluate probabilistic forecasts, it is not obvious how to assess the predictive performance in the tails of the distribution, for example in the case of extreme wind speed observations. A natural approach is to select extreme events while discarding non-extreme events, and to proceed using standard evaluation procedures. However, it can be shown that restricting proper scoring rules to subsets of events results in improper scoring rules. This approach is thus bound to discredit even the most skilful forecasters (Gneiting and Ranjan, 2011). Instead, weighted scoring rules that emphasise specific regions of interest can be constructed. Gneiting and Ranjan (2011) propose the threshold-weighted continuous ranked probability score (twCRPS),

twCRPS(F, y) = ∫_{−∞}^{∞} (F(z) − 1{y ≤ z})² w(z) dz,  (7)

where w(z) is a non-negative weight function on the real line. For w(z) ≡ 1, the twCRPS reduces to the original CRPS in (6). If the interest lies in the right tail of the distribution, we may set w(z) = 1{z ≥ r}. Note that the CRPS in (6) represents an integral of the Brier score (Brier, 1950) over all possible thresholds. The twCRPS in (7) with w(z) = 1{z ≥ r} thus allows us to simultaneously assess the exceedance probabilities for all thresholds greater than or equal to r. Similarly, Diks et al. (2011) propose proper weighted versions of the logarithmic score in (5).
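The twCRPS in (7) with the indicator weight can be approximated by numerical integration, as done for the case study; a hedged sketch in which the truncation point `upper` and resolution `n` are our assumptions (they must be chosen so that 1 − F(upper) is negligible):

```python
def tw_crps(cdf, y, r, upper=60.0, n=20000):
    """Threshold-weighted CRPS of eq. (7) with indicator weight
    w(z) = 1{z >= r}, approximated on [r, upper] by the midpoint rule.
    `cdf` is the predictive cumulative distribution function."""
    dz = (upper - r) / n
    total = 0.0
    for i in range(n):
        z = r + (i + 0.5) * dz
        total += (cdf(z) - (1.0 if y <= z else 0.0)) ** 2 * dz
    return total
```

With r below the support of the forecast and observation, the approximation recovers the unweighted CRPS of eq. (6).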

Optimum score estimation
The framework of proper scoring rules may also be applied to parameter estimation. Following the general optimum score estimation approach of Gneiting and Raftery (2007), the parameters of a distribution are determined by optimising the average value of a proper scoring rule as a function of the parameters over a training set. Optimum score estimation based on the logarithmic score in (5) thus corresponds to maximum likelihood (ML) estimation. Minimum CRPS estimation, that is, optimum score estimation based on the CRPS in (6), provides a robust alternative to ML estimation if closed-form expressions for the CRPS of the distribution family of interest are available.
Following Thorarinsdottir and Gneiting (2010), we estimate the parameters of the TN model using minimum CRPS estimation. Friederichs and Thorarinsdottir (2012) derive a closed-form expression of the CRPS for the GEV distribution and compare ML and minimum CRPS estimation for their analysis of peak wind speed. For our data set, ML estimation proved to be more parsimonious and numerically stable. There is no analytical solution of the corresponding ML minimisation problem (Coles, 2001). However, numerical approximations can be obtained using standard algorithms for any given data set (Prescott and Walden, 1980). For the regime-switching combination model in (3), minimum CRPS estimation is applied for the parameters of the TN distribution and ML estimation for the parameters of the GEV distribution. For all three methods, the parameters are estimated over a rolling training period consisting of the forecast–observation pairs of the last m days. The parameters are estimated regionally in that training data from all stations are pooled together.
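Optimum score estimation can be illustrated on a toy example outside the paper's models: for the normal family, minimising the mean logarithmic score of eq. (5) over a training set recovers the ML estimates in closed form (the TN and GEV cases used in the paper instead require numerical optimisation):

```python
import math

def mean_log_score(mu, sigma, obs):
    """Mean logarithmic score (eq. 5) of a N(mu, sigma^2) forecast over a
    training set; optimum score estimation minimises this average."""
    return sum(0.5 * math.log(2.0 * math.pi * sigma ** 2)
               + (y - mu) ** 2 / (2.0 * sigma ** 2) for y in obs) / len(obs)

def ml_estimate(obs):
    """Closed-form minimiser of the mean log score for the normal family:
    optimum score estimation under the log score is ML estimation."""
    n = len(obs)
    mu = sum(obs) / n
    sigma = math.sqrt(sum((y - mu) ** 2 for y in obs) / n)
    return mu, sigma
```

Minimum CRPS estimation works the same way, with the mean CRPS replacing the mean log score as the objective.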

Results
Here, we present the results for 1–3 d ahead probabilistic forecasts of daily maximum wind speed over Germany produced by the three different post-processing methods presented in Section 3. The verification period covers 1 yr, from 1 May 2010 to 30 April 2011.

Selection of training period and regime-switching threshold
The results presented here are based on a rolling training period of length m = 30 d for all methods. We have also performed the same analysis for training periods of length m = 20, 25, …, 50 d. In general, shorter training periods allow for a rapid adaption to changes in environmental conditions while longer training periods reduce the statistical variability in the parameter estimation. We found that the performance scores reported in Table 1 change by less than 1% for the different values of m, and, in accordance with the results of Thorarinsdottir and Gneiting (2010) and Thorarinsdottir and Johnson (2012), we conclude that the methods are robust against changes in m.
The model threshold θ for the regime-switching combination model in (3) is determined by computing the mean CRPS for a range of threshold values over an out-of-sample training period from 1 February 2010 to 30 April 2010. Using a rolling training period of m = 30 d, we obtain the optimal score for θ = 7.5 m s⁻¹, see Fig. 5. A sensitivity analysis shows that the results in Table 1 are nearly constant for values of θ between 7 and 8 m s⁻¹ and various choices of m. The threshold of 7.5 m s⁻¹ approximately corresponds to the 80th percentile of the median ensemble predictions in the verification set. Over the verification period, a GEV distribution is used in approximately 18% of the forecast cases.
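The threshold selection described above is a plain grid search; a sketch in which `mean_crps` stands for a hypothetical function mapping a candidate threshold to the mean CRPS of the regime-switching model on the out-of-sample period (not an API from the paper):

```python
def select_threshold(candidates, mean_crps):
    """Grid search for the regime-switching threshold theta: evaluate the
    mean CRPS over the out-of-sample period for each candidate value and
    keep the one with the best (smallest) score."""
    return min(candidates, key=mean_crps)
```

In the case study, evaluating candidates between 7 and 8 m s⁻¹ yields θ = 7.5 m s⁻¹.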

One-day ahead predictive performance
We compare the three ensemble post-processing methods discussed above to the raw, unprocessed ECMWF ensemble and a climatological reference forecast. For each day, the climatological reference forecast is obtained from the observed wind speeds in the 30-d training period used for the parameter estimation of the post-processing methods. VR and PIT histograms for the ensemble and the three post-processing methods are shown in Fig. 6. The ECMWF forecasts are underdispersive, with too many observations falling outside the ensemble range. This deficiency has repeatedly been observed for various ensemble prediction systems. Possible causes for the ECMWF ensemble in this case are underdispersiveness of the underlying model, unsatisfactory modelling of the uncertainty using random perturbations, and spatial and temporal interpolation and smoothing issues, see e.g. Hamill and Colucci (1997), Palmer (2002) and Raftery et al. (2005).
All three post-processing methods significantly improve the calibration of the ensemble. While the GEV forecasts are slightly overdispersive, their PIT histogram shows smaller deviations from uniformity than that of the TN forecasts. The PIT histogram of the combination model resembles the PIT histogram of the TN technique, with minor improvements for large PIT values. The PIT histograms thus indicate that the GEV distributions tend to have minimally too heavy tails, while the upper tails of the TN distributions seem slightly too light. The combination model somewhat compensates for this. Substantial improvements in calibration compared to the raw ensemble are also indicated by the values of the reliability index Δ in (4). For all observations pooled together, the reliability index of 1.02 for the ECMWF ensemble predictions is reduced to 0.19 for the TN model, 0.12 for the GEV model and 0.16 for the combination model. The post-processing methods further lead to an improvement in local calibration at all observation stations. If the 228 stations are considered individually, the median reliability index of 0.97 for the ensemble predictions reduces to 0.49 for the TN model, 0.53 for the GEV model and 0.48 for the combination model. Table 1 shows the mean CRPS, the MAE and the average coverage and width of 80% prediction intervals for the competing forecasts. Here, the point forecast evaluated by the MAE is given by the median of the corresponding predictive distribution. For calibrated forecasts, the average coverage of 80% prediction intervals should be close to 80%, and narrower average prediction intervals indicate sharper forecasts. For discrete distributions such as the ECMWF ensemble and the climatological forecast, the CRPS can be calculated explicitly, see e.g. Berrocal et al. (2008). The CRPS for the TN model and the GEV model is calculated as described in Thorarinsdottir and Gneiting (2010) and Friederichs and Thorarinsdottir (2012), respectively.
The ECMWF ensemble predictions outperform the climatological reference forecast and provide sharp prediction intervals at the cost of being uncalibrated. All post-processing methods outperform the ensemble predictions, with the GEV method showing small improvements in mean CRPS compared to the TN method. The regime-switching combination method performs best in terms of both mean CRPS and MAE, slightly improving the results of the GEV method. Note that due to the heavier tails, the GEV model generally results in wider prediction intervals than the TN model. Figure 7 compares the station-specific predictive performance of the individual post-processing models as a function of the site-specific average observed wind speed. Figure 7(a) and (b) indicates that the overall improvements of the GEV and the regime-switching combination model over the TN model are mainly due to improvements at stations with high average observed wind speeds. There appears to be no obvious pattern for stations with high average wind speeds when comparing the GEV model and the regime-switching combination model in Fig. 7(c). However, the combination model outperforms the GEV model for most stations with average observed wind speeds below 7 m s⁻¹.

One-day ahead performance in the upper tail
With a focus on the performance in the upper tail, Table 2 shows values of the mean twCRPS for the competing forecasts, where we have employed the indicator weight function w_r(z) = 1{z ≥ r} for r = 10, 12 and 15 m s⁻¹. The threshold values approximately correspond to the 90th, 95th and 98th percentiles of the marginal distribution of the wind speed observations. The twCRPS is here calculated using numerical integration methods. All three post-processing methods improve the ECMWF ensemble predictions, and the GEV approach outperforms the TN method. The regime-switching combination of the two models further improves the performance. Note that the relative improvement over the TN method for the upper tail is comparatively larger than the improvement under the unweighted CRPS in Table 1. Similar rankings hold for any value of r between 10 and 20 m s⁻¹. We further consider the threshold-weighted continuous ranked probability skill score (twCRPSS) given by

twCRPSS(F, y) = 1 − twCRPS(F, y) / twCRPS(F_ref, y),

where F_ref denotes the predictive cumulative distribution function of a reference forecast, in our case the TN method. The twCRPSS is positively oriented and can be interpreted as an improvement over the reference forecast. Figure 8 shows the twCRPSS for the GEV and the regime-switching combination method as a function of the threshold r for the indicator weight function, using the TN method as a reference forecast. For all thresholds and both models, the twCRPSS is strictly positive, indicating an improved predictive performance compared to the TN model, with the regime-switching combination method showing greater improvement. In general, the score values increase for larger threshold values, with the largest differences obtained for threshold values around 14 m s⁻¹.
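Given per-case twCRPS values for a model and the reference, the skill score reduces to one line; a sketch (aggregating by comparing mean scores over the test set is our reading of the definition):

```python
def tw_crpss(scores_model, scores_ref):
    """Threshold-weighted continuous ranked probability skill score:
    1 minus the ratio of the mean twCRPS of the model to that of the
    reference forecast. Positive values indicate an improvement over
    the reference; 0 means equal skill."""
    n = len(scores_model)
    return 1.0 - (sum(scores_model) / n) / (sum(scores_ref) / n)
```

Here the reference scores would come from the TN method, as in Fig. 8.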

Performance for longer lead times
For lead times of two and three days, we obtain similar results as above. Table 3 shows the mean CRPS, MAE, and average coverage and width of 80% prediction intervals for those lead times. Compared to the one-day ahead forecasts, forecasts for longer lead times result in slightly less accurate predictions and wider prediction intervals. The ECMWF ensemble predictions exhibit wider prediction intervals compared to the one-day ahead forecasts, resulting in small improvements in calibration. However, the ensemble predictions are still underdispersive and the three post-processing methods significantly improve the predictive skill of the ensemble. The differences among the three post-processing models are less pronounced than for one-day ahead forecasts. The results for the mean twCRPS and lead times of two and three days are shown in Table 4. We obtain the same ranking as for the one-day ahead forecasts, while the differences in the scores for the three models are small. Note that while the post-processing methods show a slight decline in accuracy from two to three days ahead, the scores for the upper tail of the ensemble are identical for these lead times.

Discussion
We propose two extensions to the NR ensemble post-processing approach of Thorarinsdottir and Gneiting (2010) employing GEV predictive distributions for daily maximum wind speed. In a case study over Germany using the exchangeable ECMWF ensemble, all three NR methods significantly improve the calibration as well as the overall skill of the raw ensemble. The best method according to our results is a regime-switching method, where a TN model is applied when we expect low winds, while a GEV framework is used when high winds are expected. In our GEV approach, we have not accounted for the possibility of the method predicting negative wind speeds. While this rarely happens in our case study, different results are likely to be obtained for less extreme wind variables. In an application to quantitative precipitation, Scheuerer (2013) considers the GEV distribution to be left-censored at zero, assigning all mass below zero to exactly zero. This approach seems very appropriate for precipitation, where there is often high probability of zero precipitation. However, it seems less appropriate for wind variables. Instead, one might consider a truncation of the GEV distribution similar to the TN distribution in (1).
The regime-switching combination of the TN and the GEV models offers several starting points for further extensions and potential improvements. For each forecast case, either the TN or the GEV model is selected based on the median ensemble prediction falling below or exceeding a fixed threshold θ. Instead of assuming a fixed threshold value, it might be interesting to develop an adaptive method to automatically estimate θ, for example based on station-specific information or other weather variables. Alternatively, improvements of the predictive performance for extreme events might be achieved by considering a mixture model using a TN distribution for the bulk of the distribution, and an adaptive generalised Pareto distribution (GPD) for the tail. Bentzien and Friederichs (2012) propose such a mixture model for precipitation using lognormal and gamma mixtures for the bulk of the distribution and an adaptive GPD tail which is able to significantly improve the predictive performance for extreme quantiles, see also Frigessi et al. (2002). Similarly, all three methods could be extended by allowing for local adaption of the parameter estimation. This might be obtained with a geostatistical approach analogous to Kleiber et al. (2011a, b), where the spatially varying parameters are estimated locally and interpolated to locations without available observations.
When assessing the predictive performance in the upper tail, we have focused on the twCRPS with a simple indicator weight function. Under this weight function, it is not possible to distinguish between forecasts with the same behaviour on [r, ∞) but different behaviour on (−∞, r). Following Gneiting and Ranjan (2011), we have also performed our analysis using weight functions of the form w_r(z) = Φ(z | μ_r, σ_r²), where Φ(· | μ_r, σ_r²) denotes the cumulative distribution function of a normal distribution with mean μ_r and variance σ_r². For example, we might set μ_r = r and σ_r² = 1, or σ_r² = S², where S² is the sample variance of the wind speed observations during the out-of-sample training period from 1 February to 30 April 2010. The average values of the twCRPS using these weight functions result in the same ranking and in comparable relative improvements as reported above for the indicator weight function. We have therefore focused on the indicator weight function, which is computationally less demanding.
Our data set does not contain any missing forecasts or observations. Single missing observations would presumably only have negligible effects on the predictions since observations from all stations are pooled together for the parameter estimation. Bayesian adaptions of ensemble post-processing methods as proposed by Bishop and Shanley (2008) and Di Narzo and Cocchi (2010) allow for incorporation of uncertainty in both observations and parameters and might help to overcome the effects of missing data. However, they are computationally quite demanding compared to the frequentist approach taken in this work and might thus be infeasible for a large number of stations and observations. For further relevant work, see also Crochet (2004) and references therein.