D-vine-copula-based postprocessing of wind speed ensemble forecasts

Current practice in predicting future weather is the use of numerical weather prediction (NWP) models to produce ensemble forecasts. Despite enormous improvements over the last few decades, they still tend to exhibit bias and dispersion errors and, consequently, lack calibration. Therefore, these forecasts may be improved by statistical postprocessing. In this work, we propose a D-vine-copula-based postprocessing for 10 m surface wind speed ensemble forecasts. This approach makes use of quantile regression related to D-vine copulas, which is highly data driven and allows one to adopt more general dependence structures than the state-of-the-art zero-truncated ensemble model output statistics (tEMOS) model. We compare local and global D-vine copula quantile regression (DVQR) models to the corresponding tEMOS models and their gradient-boosting extensions (tEMOS-GB) for different sets of predictor variables using one lead time and 60 surface weather stations in Germany. Furthermore, we investigate which types of training periods can improve the performance of tEMOS and the D-vine-copula-based models.


INTRODUCTION
Weather prediction is usually based on results of numerical weather prediction (NWP) models. These are deterministic models describing the dynamical physics of the atmosphere by a system of partial differential equations (Kalnay, 2002). The current state of the atmosphere is run forward in time to predict future atmosphere states. However, the solutions depend heavily on the initial conditions and model formulations. Consequently, NWP models suffer from various uncertainties. To address this problem, ensemble prediction systems are commonly used. The NWP model is run multiple times with different model and/or initial and boundary conditions (Gneiting et al., 2005; Leutbecher and Palmer, 2008). Consequently, the forecast ensemble obtained can be seen as a probabilistic forecast, which allows one to quantify forecast uncertainty (Palmer, 2002). In practice, the NWP ensemble forecasts suffer from biases and dispersion errors, and thus may benefit from statistical postprocessing based on past data to improve calibration and forecast skill. One well-known postprocessing model is the so-called Ensemble Model Output Statistics (EMOS; Gneiting et al., 2005). This method is often used to provide a full predictive distribution from the ensemble forecasts. The original EMOS was developed for Gaussian-distributed weather quantities (e.g., temperature). As the Gaussian assumption might not be valid for other weather variables, modifications of the original model were introduced (e.g., Schefzik et al., 2013; Gneiting and Katzfuss, 2014; Hemri et al., 2014).
However, these models mostly assume a specific distribution and a linear relationship between the (transformed) distribution parameters and their predictors, which might be too restrictive in certain cases. Therefore, we suggest a copula approach, allowing the construction of (non-)parametric marginal models for each variable separately and to join the variables with an appropriate dependence structure characterized by a flexible copula.
In this article, our proposed approach for dependence modelling relies on vine copulas and allows us to describe various types of dependencies between the variables considered. Vine copulas are not completely new in the ensemble postprocessing domain; for example, Klein et al. (2016) and Möller et al. (2018) made use of them. To build a vine copula we employ a pair-copula construction (PCC; Joe, 1996; Aas et al., 2009), which means that the joint dependence can be built up by only bivariate building blocks. We focus on a specific class of vine copulas, namely the D-vine (drawable vine) copula, and use the corresponding D-vine copula quantile regression (DVQR) of Kraus and Czado (2017) for the postprocessing. Specifically, we postprocess 10 m wind speed forecasts at 60 surface stations in Germany for a single lead time based on data from 2016 to 2020. We build local and global D-vine copula models and compare different settings for the predictor variables, including wind speed ensemble forecasts only, as well as additional ensemble forecasts from other weather variables and station-specific information. We also analyse different training period types and lengths. Furthermore, we investigate which weather variables and dependence patterns are most meaningful. The proposed postprocessing approach is compared with local and global zero-truncated EMOS (tEMOS) models and their gradient-boosting extensions (tEMOS-GB). The overall goals of this study are to highlight the competitiveness of DVQR with the state-of-the-art tEMOS and tEMOS-GB models, to point out in which settings DVQR works as well as or even better than the tEMOS or tEMOS-GB model, and to outline directions in which further research can be pursued. To the best of our knowledge, this study is the first that investigates DVQR for wind speed ensemble postprocessing.
The rest of the article is organized as follows: Section 2 describes the observation and forecast data and the types of training periods used for the case study. In addition, some notation is introduced. In Section 3 we review the benchmark tEMOS model and its tEMOS-GB version, followed by a description of the D-vine copula postprocessing. The verification methods used for the comparison of the different models are explained in Section 4. A large verification study on the D-vine copula postprocessing is conducted in Section 5. We close with a conclusion and outlook in Section 6.

DATA
For this study, we focus on postprocessing 10 m surface wind speed forecasts at 60 surface stations in Germany, visualized in Figure 1, at a forecast lead time of 24 hr. The hourly average wind speed observations are obtained from the DWD Climate Data Center (CDC) (2021). The ensemble forecasts are provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) (2021), consisting of 50 perturbed ensemble members and a control forecast. The time period considered ranges from January 2, 2016, to December 31, 2020, excluding leap year days. These forecasts are initialized at 1200 UTC on a grid of 0.25° × 0.25°. The gridded data are bilinearly interpolated to the observation stations. In addition to the target variable (wind speed), we add several auxiliary ensemble predictor variables (see Table 1). The wind speed is obtained by calculating ws = √(u² + v²). In the following, we will denote the ensemble mean of an m-member ensemble X_{q,1}, … , X_{q,m} for a weather variable q by X̄_q and the related control forecast by X_q^ctrl. The corresponding realizations are indicated by lowercase letters x_{q,1}, … , x_{q,m}, x̄_q, and x_q^ctrl, respectively. Finally, we would like to mention that we treat all predictor variables as continuous variables. Missing values in the forecast and observation data are imputed by exponentially weighted moving averages, using the 4 days before and after the position of the missing value. This is carried out using the R-package "imputeTS" of Moritz and Bartz-Beielstein (2017).
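As a small illustration, computing wind speed from the two wind components and filling a gap with a two-sided exponentially weighted average can be sketched as follows (a minimal Python sketch; the half-life weighting is an assumption for illustration and not the exact algorithm of "imputeTS"):

```python
import numpy as np

def wind_speed(u, v):
    # wind speed from the 10 m wind components: ws = sqrt(u^2 + v^2)
    return np.sqrt(u**2 + v**2)

def ewma_impute(x, window=4, halflife=2.0):
    # fill missing values with an exponentially weighted average of the
    # observed values up to `window` days before and after the gap;
    # the half-life of 2 days is a hypothetical choice for illustration
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for i in np.where(np.isnan(x))[0]:
        idx = np.arange(max(0, i - window), min(len(x), i + window + 1))
        mask = ~np.isnan(x[idx])
        if mask.any():
            w = 0.5 ** (np.abs(idx - i) / halflife)
            out[i] = np.sum(w[mask] * x[idx][mask]) / np.sum(w[mask])
    return out
```

Because the weights decay symmetrically around the gap, an isolated missing value between two neighbours is replaced by their weighted mean.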
Furthermore, we split our nearly 5-year dataset into two parts; that is, a testing period and a validation period (see Figure 2). The period of June 1, 2017, to May 31, 2018, is used as the testing period for parameter tuning. The days before each of these test days can be used for training. After having identified the most suitable model variant based on the results of the testing period, we evaluate these models in the validation period from June 1, 2018, to December 31, 2020. Again, the days before each validation day can be used for model estimation.
As the selection of the type of training period can affect the predictive performance and the computational costs of a model, we consider different types of training periods for the tEMOS and the DVQR model. This can be seen as a small extension of the work of Lang et al. (2020), who compared different types of training periods for the postprocessing of near-surface air-temperature forecasts by EMOS. We test the following sliding windows for tEMOS and DVQR to forecast day t:
• Rolling window: Use the days {t − n, … , t − 2, t − 1} to forecast day t for a fixed n ∈ N. The rolling window is a dynamic training period and was originally introduced by Gneiting et al. (2005).
• Refined rolling window: Use the days {t − n, … , t − 2, t − 1} in the year of the forecast day t and the days {t − n, … , t − 1, t, t + 1, … , t + n} in the previous k ∈ N years for a fixed n ∈ N. This is a dynamic training period and was first employed by Möller et al. (2018) in a D-vine copula postprocessing for 2 m surface temperature to take seasonal effects into account.
• Monthly window: Use the last n − 15 or n − 16 days of the month preceding date t, and the n days before and after the 15th/16th of the month to which t belongs in the previous k ∈ N years, for a fixed n ∈ N. This is a semi-dynamic training period and a compromise between the refined rolling window and the yearly window. It allows for strong seasonal patterns and, at the same time, it is computationally more economical than the refined rolling window, as a model only needs to be estimated 12 times a year.
• Seasonal window: Divide the year into two seasons, April-September and October-March, where the seasonal data of the previous k years are used. This type of seasonal window has already been applied to postprocess total cloud cover by Hemri et al. (2016) and can be seen as a semi-static training period.
• Yearly window: Use the last k ∈ N years before the year of day t. This is a static training period that has recently been used by, for example, Baran et al. (2021) for the postprocessing of total cloud cover. By construction, this type of training period does not capture seasonal patterns, in contrast to the previous training periods; any time-specific or seasonal behaviour that needs to be accounted for must be included directly in the model. However, only one model must be estimated to forecast a whole year, which can be computationally beneficial.
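For concreteness, the index sets of the two dynamic windows above can be sketched as follows (a Python sketch with hypothetical helper names; dates are handled as `datetime.date` objects, and the leap-day issue does not arise because leap days are excluded from the dataset, see Section 2):

```python
from datetime import date, timedelta

def rolling_window(t, n):
    # rolling window: the n days {t-n, ..., t-1} immediately before forecast day t
    return [t - timedelta(days=i) for i in range(n, 0, -1)]

def refined_rolling_window(t, n, k=1):
    # refined rolling window: {t-n, ..., t-1} in the current year plus the
    # symmetric window {t-n, ..., t+n} around the same date in the k previous years
    days = rolling_window(t, n)
    for j in range(1, k + 1):
        centre = t.replace(year=t.year - j)
        days += [centre + timedelta(days=i) for i in range(-n, n + 1)]
    return days
```

For example, with t = June 1, 2018, n = 3, and k = 1, the refined rolling window contains the last three days of May 2018 plus the seven days around June 1, 2017.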
We select the appropriate length of each period by the minimal average continuous ranked probability score (CRPS; see Section 4) in the testing period over all stations, where k = 1 is fixed for all windows due to the dataset construction. For tEMOS-GB we do not use a sliding window approach. Instead, we add the sine-transformed day of the year X doy as a further predictor variable to the variable set of tEMOS-GB to account for seasonality and allow for a fair comparison.
The model estimations were carried out on the cluster at the Information Systems and Machine Learning Lab (ISMLL) at the University of Hildesheim. The cluster provides 17 nodes with 40 cores each to run code in parallel. It uses "slurm" (SchedMD, 2022) as a batch queuing system, which schedules compute jobs and the corresponding requested resources. All computations are carried out with R (R Core Team, 2020), version 3.6.3. Furthermore, we recorded the total computation time T in minutes for the estimation of each method (over all stations).

POSTPROCESSING METHODS
In this section, we briefly present the postprocessing methods compared. We start with tEMOS, also known as the zero-truncated non-homogeneous Gaussian regression model, followed by its gradient-boosting extension (tEMOS-GB). Afterwards, we present the D-vine-copula-based postprocessing.

tEMOS
The EMOS model proposed by Gneiting et al. (2005) is one of the most frequently used parametric postprocessing methods. EMOS models for different weather quantities differ in the predictive distribution family and in the link function(s) connecting the parameters of the predictive distribution to the ensemble members. Owing to the distribution and non-negativity of the response variable wind speed, we follow Thorarinsdottir and Gneiting (2010) and consider for EMOS the zero-truncated normal distribution N_0(μ, σ²) as parametric family, with location parameter μ ∈ R, scale parameter σ > 0, and left cut-off at zero. The probability density function f is given as

f(y) = φ((y − μ)/σ) / (σ Φ(μ/σ)) for y ≥ 0, and f(y) = 0 otherwise, (1)

where φ and Φ respectively denote the probability density function and cumulative distribution function of the standard normal distribution. It should be noted that several other parametric distributions have been suggested for wind speed postprocessing (e.g., Lerch and Thorarinsdottir, 2013; Baran and Lerch, 2015; Scheuerer and Möller, 2015). However, as the differences in the results across different parametric distribution assumptions are often small, we consider our assumption as sufficient to obtain proper conclusions. In the tEMOS model, where y stands for a realization of the response variable Y (wind speed) and x_{ws,1}, … , x_{ws,m} are realizations of the corresponding ensemble forecasts X_{ws,1}, … , X_{ws,m}, we consequently assume for the conditional predictive density Y | x_{ws,1}, … , x_{ws,m} ∼ N_0(μ, σ²), with distribution parameters declared as

μ := a_0 + a_1 x_{ws,1} + … + a_m x_{ws,m}, (2)
σ² := b_0 + b_1 s², (3)

The variable s := √((1/m) Σ_{i=1}^m (x_{ws,i} − x̄_ws)²) denotes the ensemble standard deviation, x̄_ws := (1/m) Σ_{i=1}^m x_{ws,i} stands for the ensemble mean, and the regression coefficients a_1, … , a_m and b_0, b_1 can take on any real number. However, as our forecast ensemble consists of 50 perturbed ensemble forecasts and one control forecast, we have two groups of exchangeable members, where the members within each exchangeable group receive the same coefficient in the location parameter.
We therefore let

μ := a_0 + a_1 x̄_ws + a_2 x_ws^ctrl, (4)

with ensemble mean x̄_ws and control forecast x_ws^ctrl := x_{ws,51}. The scale parameter is defined as in Equation (3). Furthermore, the coefficients a_0, a_1, a_2, b_0, b_1 ∈ R are estimated over a training period of N days by optimizing the sum of a proper verification score with the Broyden-Fletcher-Goldfarb-Shanno algorithm, for which we take the CRPS. According to Gebetsberger et al. (2018), this score can yield more robust estimations than, for example, the logarithmic score, and it can be calculated analytically. The implementation is based on the R-package "crch" by Messner et al. (2016). Messner et al. (2017) suggested an extension of EMOS which allows one to select relevant predictor variables X_1, … , X_p by a gradient-boosting approach, which is especially useful if the number of predictor variables becomes large. Similar to the zero-truncated EMOS model, a zero-truncated normal predictive distribution is assumed.
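A minimal sketch of the local tEMOS estimation in Python (for simplicity, the coefficients are estimated by maximum likelihood for the zero-truncated normal rather than by CRPS minimization, and the scale coefficients are kept positive via an exponential link; the function names are hypothetical):

```python
import numpy as np
from scipy.stats import norm, truncnorm
from scipy.optimize import minimize

def tn0_logpdf(y, mu, sigma):
    # log-density of N_0(mu, sigma^2): a normal truncated from below at zero
    return norm.logpdf(y, mu, sigma) - norm.logcdf(mu / sigma)

def fit_temos(x_mean, s2, y):
    # mu = a0 + a1 * ensemble mean; sigma^2 = exp(b0) + exp(b1) * ensemble variance
    def nll(p):
        a0, a1, b0, b1 = p
        mu = a0 + a1 * x_mean
        sigma = np.sqrt(np.exp(b0) + np.exp(b1) * s2)
        return -np.sum(tn0_logpdf(y, mu, sigma))
    res = minimize(nll, x0=[0.0, 1.0, 0.0, 0.0], method="Nelder-Mead",
                   options={"maxiter": 2000})
    return res.x

# example: simulate training data from the true model and refit
rng = np.random.default_rng(1)
x_mean = rng.uniform(2.0, 10.0, 2000)
s2 = np.full(2000, 1.0)
mu_true = 0.5 + 0.9 * x_mean
y = truncnorm.rvs(-mu_true, np.inf, loc=mu_true, scale=1.0, random_state=rng)
coef = fit_temos(x_mean, s2, y)  # coef[1] should recover a1 close to 0.9
```

The control-forecast term of Equation (4) is omitted here to keep the sketch short; adding it is a one-line change to the location model.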

tEMOS-GB
The basic idea of boosting is to initialize all coefficients for μ and σ at zero and to iteratively update only those that correspond to the predictor that improves the predictive performance the most. Using the gradient of the loss function, the predictor with the highest correlation to the gradient is selected, and the corresponding coefficient is updated by taking a step in the direction of steepest descent. This procedure is carried out until a stopping criterion is reached, to avoid overfitting. The implementation is based on the R-package "crch" by Messner et al. (2016). We tune the tEMOS-GB model by grid search with respect to the loss function (log-likelihood or CRPS), the maximum number of boosting iterations (100, 500, 1,000, 2,000), the stopping criterion, and the step size.
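The selection-and-update loop can be sketched as follows (a simplified Python sketch for a location-only squared-error loss with standardized predictors; tEMOS-GB itself boosts both location and scale coefficients under the truncated-normal likelihood):

```python
import numpy as np

def boost_location(X, y, iterations=500, step=0.05):
    # X: (n, p) matrix of standardized predictors, y: (n,) response
    n, p = X.shape
    beta = np.zeros(p)
    intercept = y.mean()                  # initialize at the unconditional mean
    for _ in range(iterations):
        resid = y - intercept - X @ beta  # negative gradient of the squared-error loss
        # pick the predictor most correlated with the gradient
        # (for standardized columns, |X^T resid| is proportional to the correlation)
        j = int(np.argmax(np.abs(X.T @ resid)))
        # move its coefficient a small step towards the least-squares update
        beta[j] += step * np.dot(X[:, j], resid) / np.dot(X[:, j], X[:, j])
    return intercept, beta
```

With enough iterations this converges towards the least-squares fit while visiting irrelevant predictors rarely, which is what makes early stopping act as variable selection.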

D-vine copula postprocessing
As the assumption of a normal distribution and of a linear relationship between observations and forecasts might sometimes be too restrictive, we propose a more general approach in which we model the dependence between the weather variable Y considered and the forecasts X 1 , … , X p by a copula independently of the marginal distributions of these variables. In this section we first briefly describe the concept of a copula before introducing the class of D-vine copulas and the corresponding quantile regression approach.

Copulas
Standard multivariate distributions, such as the multivariate normal, are often restricted in their marginal and joint behaviour; for example, they assume that all marginals are of the same type. The use of copulas allows one to overcome this problem. A p-dimensional copula C is a multivariate distribution function on [0, 1]^p with uniform margins. According to Sklar's theorem (Sklar, 1959), for every multivariate distribution F of p continuous variables X_1, … , X_p there exists a copula C such that

F(x_1, … , x_p) = C(F_1(x_1), … , F_p(x_p)),

where F_ℓ, ℓ = 1, … , p, are the marginal distributions. Consequently, every multivariate distribution F can be decomposed into a copula C modelling the dependence and its univariate marginal distributions. Therefore, copulas can be used to construct a wide range of distributions. The corresponding p-dimensional joint density function f can be expressed by

f(x_1, … , x_p) = c(F_1(x_1), … , F_p(x_p)) ∏_{ℓ=1}^p f_ℓ(x_ℓ),

where c denotes the copula density function of the copula C and f_ℓ, ℓ = 1, … , p, are the marginal density functions of the variables X_1, … , X_p.
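Sklar's decomposition can be illustrated by sampling: draw from a copula and plug the uniforms into arbitrary inverse margins (a Python sketch using a Gaussian copula with illustrative gamma and normal margins; the parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm, gamma

rng = np.random.default_rng(0)
rho = 0.7
# step 1: sample (U1, U2) from a bivariate Gaussian copula with correlation rho
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=100000)
u = norm.cdf(z)
# step 2 (Sklar): apply arbitrary inverse margins to the copula sample,
# here a skewed gamma margin (wind-speed-like) and a normal margin
x1 = gamma.ppf(u[:, 0], a=2.0, scale=1.5)
x2 = norm.ppf(u[:, 1], loc=10.0, scale=3.0)
# the margins of (x1, x2) are exactly gamma/normal,
# while the dependence between them comes entirely from the copula
```

Changing the margins leaves the dependence untouched, and changing the copula leaves the margins untouched, which is exactly the separation the postprocessing approach exploits.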

D-vine copula
As multivariate copulas, such as the elliptical and Archimedean copulas, are not adaptable enough, since they usually presume that all pairs of variables have homogeneous dependence structures, there was a need to extend them. Bedford and Cooke (Bedford and Cooke, 2001; Bedford and Cooke, 2002) achieved this by developing the so-called PCC, where the joint dependence is built up by only bivariate copulas using conditioning. As there are many ways to build a PCC, Bedford and Cooke (2002) introduced a graphical structure called a "regular vine". A regular vine consists of a set of nested trees, where the edges in one tree become the nodes of the next. A "regular vine copula" is obtained by specifying bivariate copulas, so-called pair-copulas, on each edge of the trees, where the corresponding marginal distributions are uniform on [0, 1]. A "D-vine" is a subclass of a regular vine in which each tree is a path (e.g., see Figure 3). If the order of the variables in the first tree of the D-vine is determined, all following trees are automatically specified. A D-vine copula is consequently a regular vine copula whose tree structure is a D-vine. Based on the D-vine, we can derive with Sklar (1959) the corresponding D-vine density function

f(x_1, … , x_p) = ∏_{ℓ=1}^p f_ℓ(x_ℓ) ∏_{j=1}^{p−1} ∏_{i=1}^{p−j} c_{i,i+j;i+1,…,i+j−1}( F(x_i | x_{i+1}, … , x_{i+j−1}), F(x_{i+j} | x_{i+1}, … , x_{i+j−1}) ),

where c_{i,i+j;i+1,…,i+j−1} denotes the bivariate copula (pair-copula) density associated with the conditional distribution of (X_i, X_{i+j}) given X_{i+1}, … , X_{i+j−1}. The conditional distribution functions F(· | ·) can be recursively computed using the chain rule of differentiation (Joe, 1996). To allow for easy estimation, we make a simplifying assumption (Stöber et al., 2013); that is, we assume that the pair-copulas of conditional distributions are independent of the values of the variables on which they are conditioned.

FIGURE 3: Four-dimensional D-vine tree structure and corresponding pair-copula densities.
The following density decomposition shows an exemplary four-dimensional D-vine density with order 1-2-3-4 (arguments omitted for brevity):

f(x_1, x_2, x_3, x_4) = f_1 f_2 f_3 f_4 · c_{1,2} c_{2,3} c_{3,4} · c_{1,3;2} c_{2,4;3} · c_{1,4;2,3},

and Figure 3 illustrates its graph representation. Each edge of the nested set of trees corresponds to a pair-copula density in the D-vine density.
For a short overview of vine copulas, see Czado and Nagler (2022), for example; for a detailed introduction, see Czado (2019).
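A small numerical check of the decomposition, assuming Gaussian pair-copulas and standard normal margins (in that special case the D-vine density coincides with a trivariate normal whose (1,3)-correlation follows from the partial correlation; a Python sketch):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal  # multivariate_normal: reference check

def gauss_pair_density(u1, u2, r):
    # bivariate Gaussian copula density c(u1, u2; r)
    z1, z2 = norm.ppf(u1), norm.ppf(u2)
    q = (r * r * (z1 * z1 + z2 * z2) - 2 * r * z1 * z2) / (2 * (1 - r * r))
    return np.exp(-q) / np.sqrt(1 - r * r)

def h(u1, u2, r):
    # conditional distribution C_{1|2}(u1 | u2) of the Gaussian copula (h-function)
    return norm.cdf((norm.ppf(u1) - r * norm.ppf(u2)) / np.sqrt(1 - r * r))

def dvine3_density(x, r12, r23, r13_2):
    # three-dimensional D-vine density with order 1-2-3 and N(0,1) margins:
    # f1 f2 f3 * c_{12} c_{23} * c_{13;2}(F(x1|x2), F(x3|x2))
    u = norm.cdf(x)
    tree1 = gauss_pair_density(u[0], u[1], r12) * gauss_pair_density(u[1], u[2], r23)
    tree2 = gauss_pair_density(h(u[0], u[1], r12), h(u[2], u[1], r23), r13_2)
    return np.prod(norm.pdf(x)) * tree1 * tree2
```

For Gaussian pair-copulas, r13_2 is the partial correlation, and the implied full correlation is ρ13 = ρ13;2 √((1 − ρ12²)(1 − ρ23²)) + ρ12 ρ23, which gives a convenient correctness check against the trivariate normal density.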

D-vine copula quantile regression
In the context of quantile regression, the aim is to model the response variable Y given a set of predictor variables X_1, … , X_p. In mathematical notation, this means to estimate the conditional quantile function

q_α(x_1, … , x_p) := F^{−1}_{Y|X_1,…,X_p}(α | x_1, … , x_p) (10)

for quantile level α ∈ (0, 1). Given a set of predictors X_1, … , X_p (forecasts), one can predict quantiles of the response (weather quantity) to characterize the conditional predictive distribution, which corresponds to the postprocessed distribution. Therefore, quantile regression approaches are also frequently used for ensemble postprocessing. Bremnes (Bremnes, 2004a; Bremnes, 2004b) used (local) linear quantile regression, Taillardat et al. (2016) applied quantile regression forests, Ben Bouallègue (2017) employed penalized quantile regression, and Bremnes (2020) used quantile regression functions based on neural networks and Bernstein polynomials. For the DVQR developed by Kraus and Czado (2017), one is interested in estimating the inverse in Equation (10); that is,

F^{−1}_{Y|X_1,…,X_p}(α | x_1, … , x_p) = F_Y^{−1}( C^{−1}_{V|U_1,…,U_p}(α | F_1(x_1), … , F_p(x_p)) ), (11)

where V := F_Y(Y) and U_ℓ := F_ℓ(X_ℓ) for ℓ = 1, … , p. Hence, the conditional quantile function can be obtained through estimating the marginals F_Y and F_ℓ for ℓ = 1, … , p, as well as the conditional copula distribution function C_{V|U_1,…,U_p}. Doing so, we get the estimated quantile function

q̂_α(x_1, … , x_p) = F̂_Y^{−1}( Ĉ^{−1}_{V|U_1,…,U_p}(α | F̂_1(x_1), … , F̂_p(x_p)) ).

The marginal distributions are estimated non-parametrically with a univariate kernel density estimator using the Gaussian kernel implemented in the R-package "kde1d" of Nagler (2018). The conditional copula distribution C_{V|U_1,…,U_p} can be computed recursively, and a simple inversion then yields C^{−1}_{V|U_1,…,U_p}. The only remaining question is in which order the variables Y, X_1, … , X_p should appear in the D-vine. By construction, the variable Y needs to be the first node in the first tree of the D-vine (in Figure 3, node 1). As the order of X_1, … , X_p is not obviously predetermined, one can select the most informative predictors and order them according to their predictive strength.
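The one-predictor case of Equation (11) can be sketched directly: for a Gaussian copula, the inverse conditional distribution C^{−1}_{V|U} has a closed form (a Python sketch with rank-based empirical margins; a hypothetical stand-in for the "vinereg" machinery, not its actual algorithm):

```python
import numpy as np
from scipy.stats import norm, rankdata

def dvqr_quantile(y_train, x_train, x_new, alpha):
    n = len(y_train)
    # pseudo-observations V = F_Y(Y), U = F_X(X) via rank-based empirical margins
    v = rankdata(y_train) / (n + 1)
    u = rankdata(x_train) / (n + 1)
    # fit the Gaussian copula parameter on the normal scores
    rho = np.corrcoef(norm.ppf(v), norm.ppf(u))[0, 1]
    # empirical CDF value of the new predictor observation
    u_new = (np.sum(x_train <= x_new) + 0.5) / (n + 1)
    # inverse h-function of the Gaussian copula: C^{-1}_{V|U}(alpha | u_new)
    w = norm.cdf(rho * norm.ppf(u_new) + np.sqrt(1 - rho**2) * norm.ppf(alpha))
    # plug into the empirical quantile function of Y
    return np.quantile(y_train, w)
```

By construction the predicted quantiles are monotone in alpha, so quantile crossing cannot occur, which is one of the advantages of the copula formulation.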
Kraus and Czado (2017) propose a sequential forward selection approach that selects the most important predictors based on an (AIC/BIC)-corrected conditional log-likelihood. The idea is to parsimoniously add the predictor variable that improves the BIC-corrected conditional log-likelihood the most, until no further improvement is found. The estimation of the DVQR is carried out with the R-package "vinereg" by Nagler (2020), allowing 33 bivariate copulas: the elliptical copulas (Gaussian and Student-t), the Archimedean copulas (Clayton, Gumbel, Frank, Joe, BB1, BB6, BB7, and BB8) and rotated versions thereof, as well as the non-parametric independence copula and the transformation kernel copula (TLL). Elliptical copulas have an elliptical shape in the contour plot, and any departure from an elliptical shape may indicate non-Gaussian dependence. Therefore, Archimedean copulas are provided, which exhibit a pear or bone shape in the contour plot. The non-parametric independence copula has a circular shape in the contour plot, and the transformation kernel copula can approximate any dependence shape. Consequently, we cover a large variety of possible dependence patterns with this copula set.
In this way, we obtain a quantile regression model that overcomes typical issues of quantile regression, such as quantile crossings, transformations, collinearity, and interactions of variables (Kraus and Czado, 2017). In addition, the DVQR imposes less restrictive model assumptions, since it allows for a more flexible modelling of the dependence structure and the marginals.

VERIFICATION METHODS
As argued by Gneiting et al. (2005) and Gneiting and Raftery (2007), the general aim of probabilistic forecasting is to maximize the sharpness of the predictive distribution subject to calibration. "Calibration" refers to the statistical consistency between the predictive cumulative distribution function (CDF) F and the associated observation Y. Consequently, it is a joint property of the predictions and the verifications. "Sharpness" refers to the spread of the predictive distribution F and is a property of the probabilistic forecasts only. The more concentrated the forecast, the sharper the forecast, and the sharper the better, subject to calibration. In the following, we present methods to measure calibration and sharpness that will be used in the subsequent application. A continuous predictive probabilistic forecast F is calibrated if F(Y) is uniformly distributed (Dawid, 1984). For the evaluation of calibration, a so-called probability integral transform (PIT) histogram can be used as a visual tool, where the PIT values are obtained by evaluating the predictive CDF F at the validating observations. Any departure from uniformity of the PIT histogram can indicate that the predictive distribution F is miscalibrated in some way. A discrete counterpart of the PIT histogram is the so-called verification rank histogram, displaying the histogram of ranks of observations with respect to the corresponding ordered ensemble forecasts (Talagrand et al., 1997). For a calibrated m-member ensemble, the ranks should be uniformly distributed on the set {1, … , m + 1}.
Calibration of a predictive distribution can also be assessed by the coverage of a (1 − α) × 100% central prediction interval for α ∈ (0, 1). The coverage of a (1 − α) × 100% central prediction interval is the proportion of validating observations between the lower and upper α/2 quantiles of the predictive distribution. If a predictive distribution is calibrated, then (1 − α) × 100% of the observations should fall within the central prediction interval. Sharpness of a predictive distribution can be assessed by the width of a (1 − α) × 100% central prediction interval: sharper distributions correspond to narrower prediction intervals. In the case of an m-member forecast ensemble, we consider an [(m − 1)∕(m + 1)] × 100% central prediction interval, which corresponds to the nominal coverage of the raw forecast ensemble, thus allowing a direct comparison of all probabilistic forecasts. The target coverage rate for an m = 51 member ensemble is approximately 96.15%.
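Coverage and width of the central prediction interval are straightforward to compute from the predictive quantiles (a Python sketch with hypothetical function names):

```python
import numpy as np

def coverage_and_width(lower, upper, obs):
    # lower/upper: predictive alpha/2 and 1-alpha/2 quantiles; obs: observations
    inside = (obs >= lower) & (obs <= upper)
    return float(inside.mean()), float((upper - lower).mean())

# nominal coverage matching an m-member raw ensemble: (m - 1) / (m + 1)
m = 51
nominal = (m - 1) / (m + 1)  # approx. 0.9615 for the 51-member ECMWF ensemble
```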
Until now, we have stated tools to analyse the calibration and sharpness of a probabilistic forecast separately. Proper scoring rules assess calibration and sharpness properties simultaneously and, therefore, play important roles in the comparative evaluation and ranking of competing forecasts. A proper scoring rule that we use is the CRPS (Matheson and Winkler, 1976), which is defined as

CRPS(F, y) := ∫_{−∞}^{∞} ( F(z) − 1{y ≤ z} )² dz,

where F is the predictive CDF, y is the true/observed value, and 1 denotes the indicator function. If F has a finite first moment, Gneiting et al. (2008) state that the CRPS can be approximated by

CRPS(F, y) ≈ (1/K) Σ_{k=1}^K |z_k − y| − 1/(2K²) Σ_{k=1}^K Σ_{k'=1}^K |z_k − z_{k'}|,

where z_k := F^{−1}(k∕(K + 1)) and z_{k'} := F^{−1}(k'∕(K + 1)) for k, k' ∈ {1, … , K}, and F^{−1} denotes the quantile function of F. The mean CRPS over a set of forecast cases will be denoted by CRPS̄. Often, a probabilistic forecast is reduced to a point forecast via a statistical summary function such as the mean or median. In this situation, "consistent scoring functions" provide useful tools for forecast evaluation and generate proper scoring rules (Gneiting, 2011). In particular, we use the squared error for the mean forecast mean(F). For n time points we get the root-mean-squared error via

RMSE := √( (1/n) Σ_{t=1}^n ( mean(F_t) − y_t )² ),

which has the advantage of being recorded in the same units as the verifying observations. Furthermore, we employ the absolute error for the median forecast median(F) and denote the mean over the AE values of a set of forecast cases by MAE.
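The quantile-based CRPS approximation reads in code as follows (a Python sketch; for a standard normal forecast and y = 0 the exact CRPS is 2φ(0) − 1/√π ≈ 0.2337, which the approximation reproduces for large K):

```python
import numpy as np
from scipy.stats import norm

def crps_quantile_approx(quantile_fn, y, K=1000):
    # z_k = F^{-1}(k / (K + 1)) for k = 1, ..., K
    z = quantile_fn(np.arange(1, K + 1) / (K + 1))
    term1 = np.mean(np.abs(z - y))                          # approximates E|Z - y|
    term2 = 0.5 * np.mean(np.abs(z[:, None] - z[None, :]))  # approximates 0.5 E|Z - Z'|
    return term1 - term2
```

For a forecast given only by sample quantiles (as produced by the DVQR), `quantile_fn` is simply replaced by the array of predicted quantiles.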
To assess the relative improvement of a forecast with respect to a given reference forecast, one can calculate the continuous ranked probability skill score (CRPSS) via

CRPSS := 1 − CRPS̄ ∕ CRPS̄_ref,

where CRPS̄_ref denotes the mean CRPS of the reference forecast.
To evaluate the statistical significance of the differences in performance between two competing postprocessing methods, we apply a Diebold-Mariano test (Diebold and Mariano, 1995) to the verification score time series of both methods at each station separately. Afterwards, we use the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) suggested by Wilks (2016), which allows us to account for multiple testing across stations and to control the overall probability of a type I error, for which we choose α = 0.05 in the subsequent analysis.
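The Benjamini-Hochberg step applied to the station-wise Diebold-Mariano p-values can be sketched as follows (a Python sketch):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    # reject H0 for all p-values up to the largest k with p_(k) <= alpha * k / m
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_max = int(np.max(np.nonzero(below)[0]))
        reject[order[: k_max + 1]] = True
    return reject
```

In the study, `pvals` would hold one Diebold-Mariano p-value per station, and the returned mask marks the stations with a significant score difference under false-discovery-rate control.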
The verification of the methods applied in Section 5 is carried out using the R-package "eppverification" by Jobst (2021).
In order to assess the relevance of the different predictor variables used in Section 5, we propose an extension of the definition of "variable importance" in van Heiss (2020), assigning each possible predictor variable its corresponding importance. Having p predictor variables X_1, … , X_p for a (d + 1)-dimensional D-vine copula with order Y − X_1 − … − X_d, we calculate the variable importance Imp(X_ℓ) of feature X_ℓ based on its position s_ℓ in the order X_1 − … − X_d, for ℓ = 1, … , p. We write Imp̄ for the mean of Imp within a discrete time interval.
Finally, for visually assessing the dependence between two variables Y and X, a so-called "empirical normalized bivariate contour plot" can be created with the R-package "VineCopula" of Nagler et al. (2020). This plot is obtained by estimating an empirical copula density c and visualizing the contours of the bivariate density function of the normalized variables Z_Y := Φ^{−1}(F_Y(Y)) and Z_X := Φ^{−1}(F_X(X)). For more details concerning these plots, see Czado (2019), for example.

APPLICATION
In this section we compare the local and global DVQR, tEMOS, and tEMOS-GB models for different sets of predictor variables and types of training periods.

Local postprocessing
In the local postprocessing, the observation and forecast data of the station of interest are used to estimate a model at each station individually. tEMOS, tEMOS-GB, and DVQR are estimated locally. As our goal is to construct a parsimonious DVQR, we first estimate the model in the testing period and select the predictor variables and their order according to the variable importance criterion in Equation (21) for each station separately. We allow all copula families in the R-package "vinereg". Afterwards, we select the most relevant copula families in the testing period across all stations by hand according to both the most frequently selected copula families and the types of predictor variables at hand. Additionally, we use empirical normalized bivariate contour plots to confirm that the family choices are appropriate. In the validation period, we then restrict the model fitting procedure to the chosen variables and order, as well as to the set of copulas obtained by the aforementioned procedure. This helps to reduce computational complexity of the DVQR to a large extent and to get insights into the relationships of variables. In addition, we estimated the DVQR based on the set of all possible predictor variables without specified order and all available copula families in the validation period. Although some of these fully greedy models perform slightly better than their parsimonious counterparts, the results are not presented here as they are less useful in practice due to extremely high computation times; for example, 1,709 min for the greedy models in comparison with 26 min for the parsimonious models across all stations using the refined training period type in the validation period.

Model selection
Concerning the set of features, we use the ensemble mean X ws and the control forecast X ctrl ws of the 10 m surface wind speed ensemble for tEMOS. For the D-vine copula postprocessing model we consider two different settings. First, we allow only X ws and X ctrl ws in the predictor set, with the respective model denoted by DVQR − . Second, we choose the ensemble mean X q and control forecast X ctrl q of the variables q ∈ {ws, u, v, wg, temp, p, tcc, sh} from Table 1 as the feature set, resulting in 16 potential predictors in total and denote the model by DVQR + .
For all training period types, DVQR − selects in most cases only a single predictor variable at every station in the testing period. According to the variable importance measure, Equation (21), the variable X ws has the highest predictive power for almost each station and in total. Furthermore, the Gaussian and Gumbel copulas are most frequently selected for all training period types in Figure 4. Regarding the empirical normalized contour plots in Figure 5, the dependencies between the observation Y and forecast variables X ws , X ctrl ws could be modelled by Gaussian copulas for the station Hannover and by Gumbel copulas for the station Munich. Therefore, for the DVQR − we allow only the most important weather quantity at each station as predictor, which is in most cases X ws , and the Gaussian and Gumbel copulas for the dependence structures in the validation period.
Overall, DVQR + selects on average between four and five features for each type of training period in the test data. For all training period types except for the rolling training period, X ws , X wg , and X v are the three most important weather variables in decreasing importance. The strongly wind speed-related variables ws, wg, u, v, and tcc appear among the six most important weather variables in Figure 6. With few exceptions, we observe that mean forecasts are assigned a higher variable importance than control forecasts, highlighting their higher predictive strength. Based on the aforementioned observation, we restrict DVQR + to the five (rather than only four) most important weather quantities for each station in the validation period, in order to allow more variation in the variable selection. Regarding Figure 6, we observe that the independence, TLL, Frank, Clayton, Gaussian, and Gumbel copulas are the six most frequently selected copula families for all training period types except the rolling one. These families also seem to be suitable according to Figure 7, showing empirical normalized contour plots for the observation Y and five frequently selected predictors. For example, the dependence between the forecast X ws and X wg variables could be represented by a Gumbel or (rotated) Clayton copula and the contour shape between X wg and X u indicates a non-parametric copula shape that could be estimated by the TLL copula. Therefore, the DVQR + model is restricted to the previously mentioned six copulas for the dependence modelling in the validation period. With respect to the margin estimation for DVQR − and DVQR + , we investigated whether a log-transform for variables with a domain restricted in one direction and a probit transform for variables with a domain restricted in both directions improve the kernel density estimates. 
We validated these kernel density estimates in the testing period with different scores and detected no meaningful difference from the analysis without a transformation, except for the weather quantity tcc. Consequently, we disregard any transformation except for tcc, for which we use the probit transformation. Furthermore, we investigated whether additional predictor variables calculated from the raw forecast ensemble, namely quantile forecasts at levels α ∈ {0, 0.05, 0.1, 0.2, … , 0.8, 0.9, 0.95, 1}, a random member, the interquartile range, and the standard deviation, can improve the predictive performance of DVQR−. One interesting observation is that quantiles with lower levels α are often assigned a higher variable importance than quantiles with higher levels, which might be due to the fact that lower wind speed values are more likely than higher ones. However, as there is no remarkable difference between the setting with additional predictor variables calculated from the raw ensemble and the basic DVQR−, the results of the extended DVQR− are omitted.
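The probit-transform idea for a doubly bounded margin such as tcc can be sketched as follows: map the sample from (0, 1) to the real line, fit an ordinary Gaussian kernel density estimate there, and back-transform the density with the change-of-variables Jacobian. The sample below is synthetic (a Beta draw standing in for tcc), so this is a sketch of the mechanism, not the estimator fitted in this work.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
# Synthetic stand-in for a doubly bounded quantity like tcc on (0, 1).
tcc = rng.beta(2.0, 5.0, size=2000)

# Probit transform: map the bounded sample to the real line and fit an
# unbounded Gaussian KDE there.
z = norm.ppf(tcc)
kde = gaussian_kde(z)

def density_on_unit_interval(x):
    """Back-transform the KDE to (0, 1) via the Jacobian 1 / phi(Phi^{-1}(x))."""
    zx = norm.ppf(np.asarray(x))
    return kde(zx) / norm.pdf(zx)

# The back-transformed density should carry (numerically) unit mass on (0, 1).
grid = np.linspace(1e-3, 1.0 - 1e-3, 2000)
f = density_on_unit_interval(grid)
mass = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid)))
```

The back-transform guarantees that no probability mass leaks outside the unit interval, which a KDE fitted directly on the bounded data cannot ensure.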
Regarding the tuning parameters for tEMOS-GB described in Section 3.2, we select the optimal specification based on all 17 predictor variables (the 16 predictors outlined earlier in this section plus the sine-transformed day of the year) by maximizing the log-likelihood, resulting in a maximum of 500 boosting iterations, AIC as the stopping criterion, and a step size of 0.05.

Results
Regarding the verification scores in Table 2, we observe that DVQR− shows a performance comparable to tEMOS for all training period types. Both methods improve on the raw ensemble (RAW) by up to about 28-31% with respect to CRPS. Adding further meteorological variables allows DVQR+ to outperform tEMOS by up to 7% and the raw ensemble by up to 35% with respect to CRPS. However, overall, tEMOS-GB outperforms tEMOS, DVQR−, and DVQR+ for every type of training period with respect to CRPS. DVQR− does not yield substantial improvements over tEMOS, indicating that the distributional assumption and the assumed linear relationship in the parameters are reasonable for tEMOS. By adding further predictors, DVQR+ clearly outperforms tEMOS in the categories CRPS, MAE, RMSE, and width for all types of training periods except the rolling one. Furthermore, DVQR+ yields the lowest RMSE using the refined training period. Overall, tEMOS-GB is the best-performing method with regard to CRPS and MAE, but with only minor differences from DVQR+ using the refined type of training period. These improvements of tEMOS-GB over DVQR− and DVQR+ come at the cost of a worse performance with respect to coverage, where DVQR− with the yearly type of training period yields the value closest to the nominal 96.15%. Concerning the width, DVQR+ outperforms tEMOS and tEMOS-GB using the refined type of training period. The PIT histograms in Figure 8 also indicate an improvement in calibration in comparison with the raw ensemble for all four models, as they are close to uniform.
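The percentage improvements quoted here are continuous ranked probability skill scores, i.e., the relative CRPS reduction of a model over a reference forecast. A minimal sketch (the numbers in the example are illustrative, not the paper's values):

```python
def crpss(crps_model, crps_reference):
    """Continuous ranked probability skill score: relative CRPS improvement
    over a reference forecast (1 = perfect, 0 = no improvement, < 0 = worse)."""
    return 1.0 - crps_model / crps_reference

# Illustrative only: a postprocessed CRPS of 0.69 against a raw-ensemble
# CRPS of 1.00 corresponds to a 31% improvement.
skill = crpss(0.69, 1.00)  # → 0.31
```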
When looking at the stations individually, the performance of the methods exhibits much more variation. The station-wise results are shown in Figure 9, which indicates that DVQR+ performs better than tEMOS for 95% of all stations and better than tEMOS-GB for 38% of all stations with respect to CRPS. In addition, the Diebold-Mariano test followed by the Benjamini-Hochberg procedure shows, based on the best-performing training period, that DVQR+ performs significantly better than tEMOS for 75% of all stations and significantly better than tEMOS-GB for 5% of all stations with respect to CRPS. However, it should be noted that DVQR+ is able to clearly outperform tEMOS-GB at specific stations exhibiting nonlinear relationships between the variables. Notable examples in the right-hand map of Figure 9 are the red diamond (11% CRPSS for station Kiel-Holtenau) and the yellow square (7% CRPSS for station Itzehoe), both situated in coastal northern Germany. The empirical normalized contour plot for Itzehoe is shown in Figure 10, where, for example, we can observe nonlinear relationships between Y and X u , between X u and X v , or between X v and X ws . One possible reason for this performance difference between DVQR and tEMOS-GB could be that interactions between the variables are automatically included in the DVQR model by construction, whereas tEMOS-GB needs to account for them explicitly. However, interactions between the variables (e.g., products) are not included in the current tEMOS-GB implementation.

TABLE 2: Verification scores aggregated over all stations and time points in the validation period for the local models.
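The station-wise significance testing combines two standard ingredients: a Diebold-Mariano statistic computed from per-time-point CRPS differences, and the Benjamini-Hochberg step-up procedure applied to the resulting p-values to control the false discovery rate across stations. The Python below is a hedged sketch of both steps, not the exact implementation used here (which may, for instance, apply an autocovariance correction to the DM variance):

```python
import numpy as np

def dm_statistic(crps_a, crps_b):
    """Naive Diebold-Mariano statistic for equal predictive accuracy based on
    score differences d_t = s_a(t) - s_b(t); negative values favour model a.
    (Sketch for one-step-ahead forecasts: no autocovariance correction.)"""
    d = np.asarray(crps_a, dtype=float) - np.asarray(crps_b, dtype=float)
    return d.mean() / np.sqrt(d.var(ddof=1) / d.size)

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask controlling the false discovery rate at level alpha."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, p.size + 1) / p.size
    below = p[order] <= thresh
    reject = np.zeros(p.size, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))  # largest index meeting its threshold
        reject[order[: k + 1]] = True
    return reject
```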
Concerning the type of training period, the refined one yields the lowest CRPS, MAE, and RMSE for tEMOS, DVQR−, and DVQR+, followed by the monthly training period. This is visualized in more detail in the maps in Figure 9: the refined training period is chosen for about 60% of all stations according to CRPS, followed by the monthly, yearly, and seasonal training periods. The model improvements might also be due to the fact that the refined and monthly training periods are able to capture seasonal patterns better than the other types of training periods. Concerning the computation time T, we observe that the refined training period is the most costly and the yearly training period the cheapest across all methods. However, the performance of the monthly period is only slightly inferior to that of the refined training period, while reducing the computation time by a factor of about five for tEMOS, two for DVQR−, and 26 for DVQR+. Therefore, the monthly training period can be seen as a reasonable compromise between good forecasting performance and faster model estimation.
Regarding the left violin plot in Figure 11, we highlight that DVQR+ performs better than tEMOS with respect to the median CRPSS for the refined (3.4%), monthly (3.8%), seasonal (1.8%), and yearly (5.5%) training period types. The higher skill-score improvements for the yearly period in comparison with the other types of training periods might come from the fact that the temporal/seasonal dependence pattern is to some extent automatically captured by the chosen set of weather quantities in the D-vine copula, whereas the other training periods capture this dependence by construction and thus do not leave much room for improvement. With respect to the training-period-specific CRPSS improvement of tEMOS-GB over tEMOS, we observe in the right violin plot of Figure 11 that tEMOS-GB performs better than tEMOS when comparing the median CRPSS for the rolling (7.8%), refined (4.6%), monthly (5.0%), seasonal (6.8%), and yearly (8.2%) training period types. This also indicates that tEMOS-GB captures the seasonality well through its predictor variables.

[Displaced figure caption] Black symbols represent stations where tEMOS outperforms DVQR+ with respect to CRPS. Right: CRPSS with respect to the outperforming method, which is either local DVQR+ for all types of training periods or local tEMOS-GB in the validation period. Numbers in parentheses represent the count.

Global postprocessing
In the global postprocessing, the observation and forecast data from all available stations are combined to estimate a single model valid for all stations. All the methods considered are estimated globally. As in Section 5.1, we construct a parsimonious DVQR. We estimate a single D-vine-copula-based postprocessing model for all stations in the testing period and preselect predictor variables and their order according to the variable importance criterion. Furthermore, we subset the copula families according to the same three criteria as in Section 5.1 and we only allow a probit transformation for the weather quantity tcc. In the validation period, we restrict the model fitting procedure to the chosen variables and order, as well as to the set of copulas obtained by the previously stated procedure. In addition, we investigated a DVQR model based on the set of all predictor variables without preselected order and allowing all copula families in the validation period. As in Section 5.1, these results are omitted for the same reasons.

Model selection
We estimate the global tEMOS and DVQR− models again only with the ensemble mean X ws and the control forecast X ctrl ws . Furthermore, we allow the same 16 predictor variables for the tEMOS-GB and DVQR+ models as in Section 5.1, and additionally include the station-specific variables X lon , X lat , and X elev from Table 1, resulting in a total of 19 potential predictors. For tEMOS-GB, the sine-transformed X doy is added to the predictor set, resulting in 20 potential predictors in total. Moreover, we investigated whether additional predictor variables calculated from the wind speed forecast ensemble, as in Section 5.1, can improve DVQR−, but we did not observe any significant improvements compared with the original DVQR−. This time, DVQR− selected in most cases both predictor variables in the testing period, and the order X ws then X ctrl ws was chosen most frequently for all training period types except the rolling one. The predictor X ws was also assigned the highest variable importance in Figure 12. Additionally, for all training period types except the rolling and seasonal ones, the Student-t, TLL, and Gumbel copulas were selected most frequently, as can be observed in Figures 12 and 13. Therefore, we fixed the predictor order as X ws then X ctrl ws and allowed only the three previously mentioned copula families in the validation period.
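Selecting a pair-copula family from a candidate set amounts to a per-pair information-criterion comparison. As a hedged illustration (not the vinereg implementation), the sketch below compares an independence copula against a Gaussian copula whose parameter is estimated by the normal-scores correlation, using the closed-form Gaussian copula log-density to make the AIC comparison explicit:

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_loglik(u, v, rho):
    """Log-density of the bivariate Gaussian copula summed over the sample."""
    z1, z2 = norm.ppf(u), norm.ppf(v)
    r2 = rho**2
    return np.sum(-0.5 * np.log(1.0 - r2)
                  - (r2 * (z1**2 + z2**2) - 2.0 * rho * z1 * z2) / (2.0 * (1.0 - r2)))

def select_by_aic(u, v):
    """Toy family selection: independence copula (0 parameters, log-likelihood 0)
    versus a Gaussian copula with rho estimated by the normal-scores correlation."""
    rho = np.corrcoef(norm.ppf(u), norm.ppf(v))[0, 1]
    aic_gauss = 2.0 * 1 - 2.0 * gaussian_copula_loglik(u, v, rho)
    aic_indep = 0.0
    return ("gaussian" if aic_gauss < aic_indep else "independence"), rho
```

In practice, the candidate set would also contain, e.g., Student-t, Gumbel, and TLL copulas, each scored the same way on the pseudo-observations of the pair.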
On average, DVQR+ selected between 13 and 17 predictor variables for each type of training period in the test data. For all training period types except the seasonal one, X ws , X elev , and X tcc are the three most important variables, in decreasing order of importance, in Figure 14. It should also be mentioned that the spatial variable X lat appears among the six most important predictor variables, except for the seasonal period, and has a much higher variable importance than X lon . This might be due to the higher absolute correlation between X elev and X lat than between X elev and X lon , reflecting the stronger topographic gradient from north to south than from east to west in Germany (Figures 1 and 15). Finally, to obtain a parsimonious DVQR+ model, we restricted the number of predictor variables to the average number of selected variables over all types of training periods, which results in using the 15 most important ones in the validation period.
With respect to Figures 14 and 15, we observe that the TLL, Clayton, BB8, Frank, Student-t, independence, and Gaussian copulas are the seven most frequently selected copula families over all training period types in the DVQR + model. Therefore, they are used for the validation period.
For the tuning parameters of tEMOS-GB described in Section 3.2, we select the specification based on all 20 predictor variables outlined earlier by minimizing the CRPS, resulting in a maximum of 2,000 boosting iterations, AIC as the stopping criterion, and a step size of 0.2.

Results
With respect to the verification scores (Table 3), we find that DVQR− performs slightly better than tEMOS, whereas DVQR+ yields clearly better results than the raw ensemble and tEMOS for all training period types. We note that DVQR+ outperforms the raw ensemble by up to 24%, tEMOS by up to 8%, and tEMOS-GB by up to 4% with respect to CRPS. According to Table 3, DVQR+ clearly outperforms tEMOS in the categories CRPS, MAE, and RMSE for all training period types. Overall, DVQR+ using the refined period performs best with regard to these scores in comparison with tEMOS and tEMOS-GB. This is in line with the results visualized on the maps in Figure 17, where DVQR+ performs better than tEMOS for 98% of all stations and better than tEMOS-GB for 88% of all stations with respect to CRPS. In addition, the Diebold-Mariano test with the Benjamini-Hochberg procedure shows, based on the best-performing training period, that DVQR+ performs significantly better than tEMOS for 90% of all stations and significantly better than tEMOS-GB for 75% of all stations with respect to CRPS. The difference in performance between tEMOS-GB and DVQR+ could be traced back to the more complex dependence patterns between the weather and spatial variables in the global setting, visible in Figure 15. These highly nonlinear patterns can be explicitly modelled by DVQR+ but not by tEMOS-GB. However, tEMOS yields sharper forecasts than DVQR+ for the rolling, monthly, and seasonal training periods, and the sharpest forecasts using the refined type of training period. Looking at Figure 16, we observe that the PIT histograms of DVQR− and DVQR+ are both close to uniform, indicating good calibration. The PIT histograms of tEMOS and tEMOS-GB show departures from uniformity in the form of a slight hump shape in the middle together with a more pronounced final bin, possibly indicating a tendency towards overdispersion and bias.

FIGURE 13: Global DVQR− model: empirical normalized contour plots (lower triangle), probability integral transform histograms (diagonal), and scatterplots including correlations (upper triangle) for all stations; shown exemplarily for dates between January 2, 2016, and December 31, 2016.
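The PIT diagnostic underlying these histograms can be sketched directly: a PIT value is the predictive CDF evaluated at the verifying observation, and for a calibrated forecast the PIT values are uniform on (0, 1). The toy setup below uses a synthetic, perfectly calibrated Gaussian forecast (illustrative only, not the paper's data):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Toy setup: observations Y_t ~ N(mu_t, 1) and a perfectly calibrated forecast
# that issues exactly this predictive distribution.
mu = rng.normal(size=10_000)
y = mu + rng.normal(size=10_000)

# PIT value = predictive CDF evaluated at the verifying observation.
pit = norm.cdf(y, loc=mu, scale=1.0)

# Under calibration the PIT values are uniform, so each of ten equally wide
# histogram bins should contain roughly 10% of the cases.
counts, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
frac = counts / pit.size
```

Hump-shaped histograms (as seen for tEMOS and tEMOS-GB here) arise when the predictive distributions are too wide, pushing PIT values towards the centre; an overpopulated final bin signals observations that too often exceed the forecast distribution.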
In comparison, the global DVQR−, DVQR+, and tEMOS models exhibit worse results than their local counterparts with respect to certain scores, for example, CRPS and MAE. A possible reason is that local models can account for locally varying forecast errors and therefore achieve better predictive performance (e.g., Gneiting, 2010; Schuhen et al., 2012). Especially for DVQR− and DVQR+, the deterioration in performance might be explained by the fact that the two model components, the marginal distributions and the dependence structure, are estimated globally and thus cannot account for station-specific effects.
With respect to the training periods, we observe results similar to those in Section 5.1. The refined training period again yields the lowest CRPS for tEMOS, DVQR−, and DVQR+, followed by the monthly training period for DVQR− and DVQR+. Furthermore, the improvement from the refined training period for both methods comes at the cost of the highest computation times (tEMOS: <1 min; DVQR−: 15 min; DVQR+: 2,753 min). However, the results of the monthly period show only small differences from the refined training period, while reducing the computation time enormously, by a factor of about 15 for DVQR− and 32 for DVQR+. The performance of the training periods is also represented on the maps in Figure 17, indicating that the refined training period is chosen for 37% of all stations according to CRPS, followed by the seasonal, yearly, monthly, and rolling training period types.
Considering the left violin plot in Figure 18, we find that DVQR+ performs better than tEMOS with respect to the median CRPSS for the rolling (0.6%), refined (3.0%), monthly (2.9%), seasonal (2.7%), and yearly (4.6%) training period types. As before, the highest skill scores are obtained with the yearly type of training period. Looking at the right violin plot in Figure 18, we observe that tEMOS-GB outperforms tEMOS with respect to the median CRPSS for the rolling (1.6%), refined (1.8%), monthly (2.0%), seasonal (2.6%), and yearly (3.6%) training period types.

TABLE 3: Verification scores aggregated over all stations and time points in the validation period for the global models.

CONCLUSION AND OUTLOOK
In this work, we proposed the use of DVQR for the postprocessing of wind speed ensemble forecasts. The method is able to select the most informative predictor variables and to model various dependencies between them in a data-driven way using the R package "vinereg". Furthermore, owing to the graph representation of the D-vine and the pair-copula construction, the user is able to trace the model, which can be hard for certain machine-learning methods. In addition, we discovered that parsimonious DVQR models, where the predictor variables and their order are manually preselected and the copula family set is reduced in advance, yield comparable or sometimes even better results than the greedy DVQR, which automatically selects the variables, their order, and the copulas based on the full set of options. Moreover, the parsimonious model with p predictor variables chosen in advance has complexity O(p^2), whereas the greedy DVQR has complexity O(p^3) with respect to the estimated bivariate copulas. Consequently, the parsimonious model substantially reduces the runtime. The DVQR could be made even more efficient by preselecting the pair-copulas in each tree. Furthermore, the DVQR by Kraus and Czado (2017) could be compared with the recently developed sparse DVQR by Sahin and Czado (2022), which might lead to even more parsimonious models. However, the success of the parsimonious DVQR relies strongly on an appropriate and well-considered variable selection. Therefore, we described a variable importance criterion for D-vines. It would be interesting to compare the results of the introduced variable importance criterion with other criteria, for example, the permutation importance (Breiman, 2001), with respect to specific scores, or with Shapley values (Shapley, 1953; Aas et al., 2021).
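The pair-copula counts behind these complexity statements can be written down directly: a D-vine on d = p + 1 variables contains d(d − 1)/2 pair-copulas, while greedy forward selection additionally refits candidate pair-copulas at every step. The counting sketch below is stylized; actual implementations differ in bookkeeping.

```python
def pair_copulas_fixed_order(p):
    """A D-vine on the response plus p preselected, ordered predictors
    contains d * (d - 1) / 2 pair-copulas with d = p + 1, i.e., O(p^2)."""
    d = p + 1
    return d * (d - 1) // 2

def pair_copula_fits_greedy(p):
    """Stylized count for greedy forward selection: at step k, each of the
    p - k remaining candidates requires fitting k + 1 new pair-copulas,
    which sums to O(p^3) fits overall."""
    return sum((p - k) * (k + 1) for k in range(p))
```

For the five predictors of the local DVQR+, the fixed-order model needs 15 pair-copulas, whereas the stylized greedy count already reaches 35 candidate fits, and the gap widens cubically with p.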
Roughly speaking, for both the local and global postprocessing settings, DVQR performed nearly the same as the tEMOS model if we allowed X ws and X ctrl ws as the only predictors. Adding further weather quantities and station-specific variables to the set of possible predictor variables yielded a substantial improvement for the local and global DVQR models in comparison with tEMOS. In particular, the strongly wind-related weather quantities (e.g., ws, u, v, wg, and tcc) can improve the local DVQR to a great extent. For the global DVQR, station-specific variables such as X lon , X lat , and X elev can be beneficial. However, adding summary predictor variables calculated from the raw ensemble to the DVQR model did not improve performance noticeably. Furthermore, the local DVQR is able to substantially outperform the local tEMOS-GB at certain stations with nonlinear relationships, whereas in the global setting DVQR clearly provides better scores than tEMOS-GB. We plan to analyse this in more detail and for more stations to gain deeper insights. One goal of such an analysis is a more detailed understanding of which weather phenomena can be captured by a D-vine-based approach rather than by standard or machine-learning approaches. All in all, the local DVQR yields better results than the global DVQR, as the margins and the dependence structure of the global model cannot account for station-specific weather effects. Therefore, future research will investigate whether some of this information can be integrated into the marginal distributions and copulas of the DVQR via, for example, generalized additive models (Hastie and Tibshirani, 1986; Hastie and Tibshirani, 1990; Vatter and Nagler, 2016). Furthermore, it could prove beneficial to include other spatial variables (e.g., distances between stations, land use) and/or temporal variables (e.g., season, different lead times) in the DVQR model for wind speed postprocessing.
Model performance with respect to the different training periods can vary greatly. The refined training period performed best for our methods, followed by the monthly period. Both periods adapt well to seasonal patterns, and the latter can be seen as a compromise between good forecasting performance and moderate computational effort. The yearly training period might yield better results if temporal dependencies are explicitly included in the forecasting model. We also plan to investigate how both methods perform for these different training periods using lead times other than 24 hr ahead. Additionally, it might be interesting to compare the analysed types of training periods with the recently introduced weekly type of training period in the benchmark for the postprocessing of 2 m surface temperature by Demaeyer et al. (2023). DVQR showed promising results in this benchmark in comparison with the other methods; therefore, we also plan to investigate DVQR in more detail in upcoming benchmarks. Further research might also be directed towards quantile-regression-based vine copula wind speed postprocessing using C-vines (canonical vines; Tepegjozova, 2019) or restricted R-vines (regular vines; Zhu et al., 2021). The very general and flexible C-/D-/R-vine-copula-based quantile regression can also be examined for other non-Gaussian weather variables exhibiting more complex behaviour, such as total cloud cover or precipitation.