Simultaneous Ensemble Postprocessing for Multiple Lead Times with Standardized Anomalies

Separate statistical models are typically ﬁt for each forecasting lead time to postprocess numerical weather prediction (NWP) ensemble forecasts. Using standardized anomalies of both NWP values and observations eliminates most of the lead-time-speciﬁc characteristics so that several lead times can be forecast simulta- neously. Standardized anomalies are formed by subtracting a climatological mean and dividing by the climatological standard deviation. Simultaneously postprocessing forecasts between 1 12 and 1 120 h increases forecast coherence between lead times, yields a temporal resolution as high as the observation interval (e.g., up to 10min), and speeds up computation times while achieving a forecast skill comparable to the conventional method.


Introduction
Weather forecasts are important for many aspects of life whether professional or private. These weather forecasts rely mainly on numerical weather prediction (NWP) models, which solve partial differential equations. However, it is impossible to solve these equations exactly and errors occur, which grow with forecasting horizon. Additionally, because of limited computational power, the numerical grid of these forecast models has to be coarse and subgrid processes have to be parameterized. The systematic component of the resulting errors can be corrected with a statistical model trained on numerical forecasts and their matching observations of previous forecast times (Glahn and Lowry 1972). Errors and resulting corrections are larger over complex terrain, such as the Alps, where real and NWP model topography differ strongly.
Additionally, weather centers compute not one but an ensemble of NWP forecasts in order to account for uncertainties introduced in initial conditions and parameterizations (Lorenz 1982;Leutbecher and Palmer 2008). The goal of the postprocessing is then to obtain calibrated probabilistic forecasts. Two commonly used methods are nonhomogeneous Gaussian regression  and Bayesian model averaging .
Output from NWP models is typically made available for lead-time intervals between 1 and 6 h. It is then usually postprocessed separately for each lead time and location (weather station), which can become computationally expensive when performed for many stations and/or lead times. Scheuerer and Büermann (2014) proposed a method to simultaneously postprocess spatially using anomalies. Stauffer et al. (2017) and Dabernig et al. (2017) took this method a step further by standardizing the anomalies. Their basic idea is to fit a statistical model on standardized anomalies (Wilks 2011) instead of the direct observations and ensemble forecasts. Since these standardized anomalies should not have any station-specific characteristics anymore, one statistical model can be used for all locations within a certain region.
We adapt the method of Dabernig et al. (2017), but instead of forecasting several stations simultaneously, we forecast several lead times simultaneously, which is computationally more efficient. Standardized anomalies remove lead-time-specific and seasonal characteristics of the data. Conventional methods are usually trained on a sliding window  to account for seasonal differences in the relationship between forecasts and observations. Standardized anomalies do not contain these season-specific characteristics anymore, so longer training datasets can be used. The implicit assumption is, however, that the NWP forecast skill does not vary seasonally. It will be shown that the third advantage of this approach is to provide calibrated forecasts at a much higher temporal resolution than the NWP forecasts by using a high-resolution observation climatology.
The remainder of this article is structured as follows: the data are described in the next section, followed by the methodology. Subsequently, results for a single station and for multiple stations are presented with a discussion and the conclusions completing this article.

Data
The proposed method is tested for the topographically complex terrain of South Tyrol in northern Italy from 1 February 2010 to 31 January 2016. Local time in this region is 1 h ahead of UTC. The forecast variable y is the temperature measured 2 m above ground. Ten-minute averages are available at 39 weather station locations. The predictor variable is the 2-m temperature of the ensemble NWP model from the European Centre for Medium-Range Weather Forecasts (ECMWF). The model has 51 members. Data are bilinearly interpolated from the 32-km model grid (T639) to the measurement sites. Lead times from the 0000 UTC run between 112 and 1120 h are used with an interval of 3 h.

Method
a. Nonhomogeneous Gaussian regression Gneiting et al. (2005) introduced nonhomogeneous Gaussian regression (NGR) for statistically postprocessing of ensemble forecasts. The observations are assumed to follow a normal distribution with regression equations for both the mean m and the standard deviation s: log(s) 5 c 0 1 c 1 log(s) , with regression coefficients b 0 , b 1 , c 0 and c 1 . Here m depends on the ensemble mean m, and s depends on the ensemble spread s. We adapted the method of Gneiting et al. (2005)

b. Standardized anomalies
The basic idea of using standardized anomalies is visualized in Fig. 1. While the annual cycles clearly differ for the two different lead times and times within the diurnal cycle (top panel), the standardized anomalies for both lead times are centered around zero and have no pronounced annual cycle anymore (bottom panel). This may allow the use of the same coefficients for both lead times and to fit only one statistical model.
To obtain standardized anomalies, the climatological mean m y is subtracted from the observations and divided by a climatological standard deviation s y : The standardized anomaliesm and the corresponding s are calculated as follows: where ens is the 51-member ensemble forecast and m ens and s ens are the climatological mean and standard deviation, respectively. The m ens is the same for all members. Them is then the mean of all 51 standardized ensemble member anomalies ands is their standard deviation. The details for computing m y , m ens , s y , and s ens will be shown in section 3d.

c. Standardized anomaly model output statistics (SAMOS)
Standardized anomalies remove seasonal and leadtime-specific characteristics and as a result several lead times of the same station can be fitted and forecast together, if the climatology is able to capture these different characteristics. Therefore, Eq. (1) is fitted with standardized anomalies (ỹ,m, ands ) of the observations and predictions instead of the observations and predictions themselves. Since the coefficients (b 0 , 1 b 1 , c 0 , and c 1 ) are fitted on data without lead-time specifics they are representative for every lead time. One underlying assumption is that the systematic error of the forecast uncertainty provided by the spread of the NWP ensemble does not change with lead time.
Once the coefficients are determined, a rearranged form of Eqs. (2) and (3) can be used for forecasting with new data:m Since the coefficients do not depend on lead time, forecasts are no longer restricted to the lead times of the NWP forecasts. And importantly, the higher temporal resolution (e.g., 10 min) of the observation climatology (m y and s y ) provides additional information about the diurnal cycle in between lead times. Temporal information for individual events of abrupt temperature changes such as frontal passages, however, is solely provided by NWP data.

d. Climatology
To obtain forecasts with a high temporal resolution, a temporally highly resolved climatology is necessary. It can be produced with Generalized Additive Models for Location, Scale, and Shape (GAMLSS; Rigby and Stasinopoulos 2005;Stasinopoulos and Rigby 2007) as in Dabernig et al. (2017). With GAMLSS one model for climatological mean and standard deviation for all lead times and all points in between can be calculated simultaneously. GAMLSS is similar to NGR but can also include nonlinear effects: To calculate a climatology, the daily variation has to be considered as well as the seasonal change. The nonlinear models for m y and s y are therefore as follows: m y 5 b 0 1 f (yd, dm) and s y 5 g 0 1 g(yd, dm), (10) with b 0 and g 0 as regression coefficients and f and g as smooth functions that capture the interaction between the day of the year (yd) and the minute of the day (dm, 0-1439) with a cyclic cubic regression spline (fitted with the R package gamlss). Eight degrees of freedom are used in the function for the day of the year, and 10 degrees of freedom are used for the minute of the day. The resulting terms (''effects'') of the fitted climatological mean based on 10-min measurements are illustrated in Fig. 2. Figure 2a shows the seasonal effect exemplified by four different times of the day indicated by the colored lines. The seasonal amplitude for the morning hours (bright colors) is smaller than in the afternoon (darker colors) and the shape and date of the peak also vary. Figure 2b shows the diurnal effect exemplified by four different months. The minimum temperature occurs around 0900 UTC in January (brightest line) and about 3 h earlier in July (darkest line). These figures show the need for modeling an interaction term between seasonal and diurnal effects. The GAMLSS model provides then a climatology for every day of the year and every minute of the day.
In addition to climatologies of the observations, climatologies of the ensemble forecasts are also required, which can be computed analogously. However, since forecasts in our case are only available every 3 h instead of every 10 min, we model only a seasonal effect separately for each lead time and interpolate linearly in between. Equation (10) is replaced by m ens 5 b 0 1 f (yd) and s ens 5 g 0 1 g(yd), (11) where nonlinear effect for the day of the year uses eight degrees of freedom.

Different models
Three different models with standardized anomalies are compared against a reference to investigate the advantages of SAMOS. The reference is an NGR [ensemble model output statistics (EMOS); Gneiting et al. 2005] fitted separately for each 3-hourly lead time with a sliding window of the previous 30 days as training data.
The SAMOS variations differ in the calculation of the observation climatology and NGR. The observation climatology can either be calculated at every lead time individually as in Eq. (11) or at all lead times simultaneously as in Eq. (10). As second difference, Eqs. (7) and (8) can either be calculated separately for each lead time or simultaneously for all lead times. The three SAMOS variations are as follows: the standardized anomalies is also fitted separately for all lead times, but one single NGR equation is calculated simultaneously for all lead times. d The third model is SAMOS full as described in the methods section with one simultaneous climatology over all available observations and one simultaneous NGR for all 3-hourly lead times.
A forecast example of SAMOS full for a particular initial day is shown in Fig. 3. Whereas conventional methods such as EMOS only produce forecasts at the 3-hourly lead times, SAMOS full also provides forecasts in between. The forecasts in between the 3-hourly lead times follow the shape of the observation climatology (dashed line in Fig. 3). Additionally, SAMOS fully inflates the uncertainty of the raw NWP ensemble to correct for its underdispersion (Hopson 2014;Möller and Groß 2016;Dabernig et al. 2017).
All models are trained on data independent from that used for verification. For the reference model EMOS this is achieved by training it with data of a rolling window from the 30 days prior to the forecast day and verifying it with the remaining days. For the SAMOS variants this is achieved by tenfold cross validation. The complete dataset of 6 yr is split into 10 consecutive parts. Each tenth is then forecast and verified with a model trained on the remaining nine parts.

Results
The performance of these forecasts is evaluated using different metrics. The mean absolute error (MAE) is used as deterministic measure while the CRPS is used to evaluate the probabilistic performance. Both are calculated separately for each lead time and model with tenfold cross-validated data. The metrics are shown as skill score relative to a reference model (ref): skill score 5 1 2 score score ref .
To investigate calibration, the root-mean-squared error (RMSE) of the predicted meanm is compared to the predicted standard deviationŝ.
In the following, we first show results for a single station (Auer) before aggregating results of all 39 stations.

a. Single station
To evaluate how well the models are calibrated, the RMSE and the standard deviation of the forecasts are compared in Fig. 4. A well-calibrated forecast should have a predicted standard deviation similar to the RMSE of the predicted mean. Figure 4a shows that the standard deviation of the reference EMOS model is less than the RMSE of the mean and the model at all lead times and is thus underdispersive. Standard deviation and RMSE of SAMOS lead-time-wise (Fig. 4b), on the other hand, are almost identical, showing the model to be almost perfectly calibrated. When the NGR equations are no longer fitted separately but simultaneously for all lead times (dotted line in Fig. 4b) results remain similar (RMSE not significantly different to SAMOS lead-time-wise) except for an overdispersion in the afternoon of the first four forecast days. This remains unchanged when the climatologies are also fitted simultaneously (SAMOS full, Fig. 4c).
The results for the deterministic MAE and the probabilistic CRPS metrics are illustrated in Fig. 5. Forecasts in between the 3-hourly lead times for EMOS, SAMOS lead-time-wise, and SAMOS leadtime-wise simultaneous are interpolated with a natural spline function to the 10-min resolution. A natural spline was used instead of a linear interpolation to better capture the diurnal cycle of the temperature. Forecast quality varies with the time of the day. It is best (lowest values) in the second half of the night and the morning (Fig. 5a). This pattern is already present in the raw ECMWF forecasts (not shown). The three SAMOS models outperform the EMOS reference model most of the time with the largest improvement in the afternoon. The effect of fitting climatologies and NGR equations simultaneously (SAMOS full) instead of separately for each lead time (SAMOS-ltw) on the forecast quality is shown in Fig. 5b by the CRPS skill score with SAMOS-ltw as reference. The 90% confidence interval of the mean CRPS skill score is computed from 500 bootstrap samples and shown as a gray area in Fig. 5b. On average, the forecasts of SAMOS full are half a percent worse at the 3-hourly lead times (red dots) but similar or better in between with the biggest improvements farthest from the lead times.

b. All stations
So far we have only evaluated the performance at a single forecast location (Auer). To obtain more general results, the individual results of the 39 stations in the region of South Tyrol will be combined. The MAE and CRPS skill scores with EMOS as reference are averaged over all stations for each lead time. One set of skill scores is computed at the 3-hourly resolution of the NWP model; the other at the 10-min resolution of the observations. The results are shown in Fig. 6. The deterministic MAE at the 3-hourly lead times, for which NWP input is available, is slightly but not significantly better than the EMOS reference for SAMOS lead-time-wise and SAMOS lead-time-wise simultaneous, and slightly worse for SAMOS full. In contrast, the probabilistic CRPS metric of all SAMOS models on the 3-hourly lead times is significantly better than that of EMOS. This difference between MAE and CRPS shows that the SAMOS models capture the forecast FIG. 4. RMSE/standard deviation at the 3-hourly lead times (start time is 0000 UTC) for (a) EMOS, (b) SAMOS lead-time-wise, and (c) SAMOS full, with SAMOS lead-time-wise simultaneous as referenced in gray. RMSE/standard deviation are derived on independent data with tenfold cross validation. uncertainty better than EMOS, which can also be seen in Fig. 4. SAMOS full is still slightly worse than the other SAMOS variations.
At the 10-minute resolution, however, SAMOS full performs as well as the other SAMOS variants. The better performance of SAMOS full in between the 3-hourly ECMWF lead times compensate for the slightly worse performance at these lead times.
6. Discussion and conclusions Dabernig et al. (2017) and Stauffer et al. (2017) showed that postprocessing NWP ensemble forecasts with standardized anomalies instead of the forecast variable itself allows the use of one single statistical model for forecasts at an arbitrary location. The traditional approach of ensemble model output statistics (EMOS; Gneiting et al. 2005) thus becomes standardized anomaly model output statistics (SAMOS). The current paper takes this approach from the spatial into the temporal domain and fits one single statistical model to several lead times. This improves the forecast skill, increases the temporal resolution of the forecasts, and reduces the time needed to compute calibrated probabilistic forecasts. The concept is demonstrated for forecasts of 2-m temperature in the region of South Tyrol in Italy. The topographic complexity of this region and the ensuing large differences between real and NWP model topography create larger systematic NWP model errors than over flat terrain. In addition, these systematic model errors vary diurnally and seasonally.
Standardized anomalies remove diurnal and seasonal cycles from the data. Consequently, all available data can be used to fit the postprocessing statistical models. Season-specific characteristics have also been handled by using a moving training window of subseasonal width-typically 30 days (e.g., Gneiting et al. 2005). Being able to use all data instead of only data within the window makes the fit more stable and increases the forecast skill. In addition, the statistical model is formulated over a wider range of weather events and covers more of climatologically possible scenarios. The training data could be expanded even further and the forecast skill improved even further with reforecasts as shown by Stauffer et al. (2017). Even then events more extreme than contained in typical reforecast periods of 20-30 years can occur. As long as the NWP model is capable of producing forecasts that are extreme compared to the NWP climatology, SAMOS should be able to produce forecasts that are extreme compared to the observation climatology.
The underlying assumption for using standardized anomaly data over all seasons is that systematic errors in the NWP ensemble spread have no strong seasonal FIG. 5. (a) MAE and CRPS for different models from 0000 UTC to lead times 112 to 1120 h at 10-min resolution (raw ECMWF forecasts have on average a CRPS of 5.7 and an MAE of 6.2.). MAE/CRPS are aggregated over all lead times and are derived on independent data via tenfold cross validation. (b) CRPS skill score for SAMOS full with SAMOS lead-time-wise as reference at 10-min resolution. The red dots mark the 3-hourly lead times of the ensemble forecasts. The gray shading represents a 90% confidence interval determined via error bootstrapping. dependence. If that assumption were violated, an additional seasonal term in the NGR equation [Eq.
(3)] could be used to accommodate this dependence.
Both observations and NWP model output can be transformed to standardized model anomalies by subtracting the respective means and dividing by the respective standard deviations. Merely replacing the direct values, which are used in the EMOS postprocessing approach with standardized anomalies in the SAMOS approach (termed SAMOS lead-timewise) improves the mean absolute error (MAE) as deterministic metric and the continuous ranked probability score (CRPS) as probabilistic measure. SAMOS also improves the calibration and reduces the underdispersion of the NWP model output even more than EMOS.
When the nonhomogeneous regression models are no longer fit separately for each lead time but simultaneously for all lead times (SAMOS lead-time-wise simultaneous) the verification metrics are similar. Computational speed is increased and the complexity of running a probabilistic postprocessing system in an operational setting with hundreds or thousands of locations is reduced. SAMOS full forecasts are an order of magnitude faster to compute than EMOS over a simulated year of operational forecasts.
Finally, when the statistical models for mean and standard deviations of observations and NWP model output needed for the computation of standardized anomalies are also fit simultaneously for all times (SAMOS full), the temporal resolution of the forecasts can become as high as the temporal resolution of the observations. Output of ensemble NWP models is typically available every few hours (3 in our case), while measurements are typically available several times per hour (every 10 min in our case). The increase in temporal resolution is not simply an interpolation but provides additional information contained in the highly resolved seasonally varying daily cycle of the observations. Abrupt temperature changes from particular synoptic or mesoscale events without a pronounced diurnal cycle such as a frontal passage are smoothed out in the climatology of the observations. The arrival time of such an event can therefore not be more finely resolved than in the NWP model. Nor can a possible timing error of such an event be corrected. Multivariate methods such as ensemble copula coupling (Schefzik et al. 2013) would be needed for such a correction.
An alternative approach to achieving a temporal resolution of the forecasts as high as the resolution of the observations is to first interpolate the raw ensemble NWP model output to observation times and then fit a statistical model separately for each lead time. Five-day FIG. 6. Comparison of the skill score of different models relative to EMOS. Boxplots contain MAE/CRPS point averaged over all stations for each of the (left) 37 (total) 3-hourly lead times, and (right) 649 (total) 10-min lead times.
forecasts at a 10-min resolution would then require 649 separate models per station, which is impractical.