Multivariate Probabilistic CRPS Learning with an Application to Day-Ahead Electricity Prices

This paper presents a new method for combining (or aggregating or ensembling) multivariate probabilistic forecasts, considering dependencies between quantiles and marginals through a smoothing procedure that allows for online learning. We discuss two smoothing methods: dimensionality reduction using Basis matrices and penalized smoothing. The new online learning algorithm generalizes the standard CRPS learning framework into multivariate dimensions. It is based on Bernstein Online Aggregation (BOA) and yields optimal asymptotic learning properties. The procedure uses horizontal aggregation, i.e., aggregation across quantiles. We provide an in-depth discussion on possible extensions of the algorithm and several nested cases related to the existing literature on online forecast combination. We apply the proposed methodology to forecasting day-ahead electricity prices, which are 24-dimensional distributional forecasts. The proposed method yields significant improvements over uniform combination in terms of continuous ranked probability score (CRPS). We discuss the temporal evolution of the weights and hyperparameters and present the results of reduced versions of the preferred model. A fast C++ implementation of the proposed algorithm is provided in the open-source R-Package profoc on CRAN.


Introduction
Forecast combination (sometimes referred to as expert aggregation or ensembling) has recently gained much traction.We know from theory that combination methods work well to combine different but well-performing model classes Cesa-Bianchi & Lugosi (2006).As Gaillard et al. (2016) pointed out, it is always recommended to use different classes of models, e.g., regression and time series type models, neural network models, decision tree learning models, and other machine learning and artificial intelligence methods.This paper proposes a novel online updating scheme for combining the marginals of the corresponding multivariate distribution across quantiles (also referred to as horizontal aggregation).We know from Sklar's theorem that we can decompose any multivariate distribution into the marginals and a copula.That is, we can improve the marginals (i.e., by using a strictly proper scoring rule like the CRPS) while leaving the copula untouched.In consequence, we require only the reporting of the forecasted marginal distribution.The proposed method considers dependencies between the combination weights across quantiles and marginals through a simple but flexible smoothing procedure.We assume a basic metric or spatial structure in the multivariate dimension.Such a metric structure is present when forecasting a univariate time series several steps ahead or predicting one-dimensional spatial data.
Online learning algorithms are particularly attractive for forecasting where frequent short-term forecasts are essential for the application domain (e.g., energy, weather, finance, retail).The proposed algorithm generalizes the probabilistic CRPS learning framework presented in Berrisch & Ziel (2021).It is based on exponential weighted averaging (EWA) and yields optimal asymptotic convergence rates with respect to the best individual forecast and the best convex combination of all forecasts (Wintenberger, 2017).
Considerable research on forecasting combination already exists.Bordignon et al. (2013); Nowotarski et al. (2014); Avci et al. (2018) combine point-forecasts using various batch methods.Marcjasz et al. (2020); Serafin et al. (2019) apply batch methods to probabilistic forecast combination.Some authors also applied online learning algorithms for point forecasting (Nowotarski et al., 2016) and probabilistic forecasting (Gaillard et al., 2016;Gonzalez et al., 2021a).The work above focuses on developing distinct forecasting models and on combination methods.Gaillard & Goude (2015) discuss how model development can be optimized in the framework of aggregation of experts.
In electricity price forecasting, dynamic aggregation techniques, where the combination weights are adjusted based on past performance, tend to perform better than simple constant weight techniques (Gaillard & Goude, 2015;Marcjasz et al., 2018;Maciejowska et al., 2020).However, they consider multivariate updating schemes that use the same weight for all time series.Most other work in energy forecasting considers all time series to be independent and therefore combines forecasts separately (Bordignon et al., 2013;Nowotarski et al., 2016).Neither approach considers possible dependencies of combination weights between marginals.Consequently, we can expect potential improvements by exploiting this metric structure of electricity prices by considering updating schemes that assign different weights to all neighboring price forecasts of the day and considering possible dependencies between combination weights.Of course, the same logic applies to other areas of application.
The contributions of this manuscript are manifold: i) We generalize batch and online CRPS learning to multivariate settings.
ii) We show how the metric or spatial structure of the combination weights for multivariate data can be considered using two smoothing methods.iii) We discuss three possible strategies for optimizing hyperparameters in online learning settings.iv) We provide a fast C++ implementation of the proposed algorithm in connection with this paper as an open-source R-Package on CRAN.v) We empirically apply the proposed methods to multivariate probabilistic dayahead electricity price forecasts.
The remainder of this paper is structured as follows.Section 2 discusses the general multivariate probabilistic combination setting and discusses CRPS learning using quantile regression.Section 3 presents the proposed multivariate generalization of online CRPS learning and summarizes its asymptotic properties.Additionally, we discuss possible extensions of the proposed method.Those extensions to the core algorithm add hyperparameters that have to be specified.Therefore, we elaborate on two possible strategies for hyperparameter tuning in Section 4. Section 5 continues with an empirical application of the proposed algorithm.We apply the methodology to multivariate probabilistic forecasts of Day-Ahead power prices.We discuss the data, elaborate on the specific algorithms we consider, and present a detailed analysis of the obtained results.Section 6 discusses limitations, introduces potential enhancements, and concludes.

The combination setting
In this paper, we consider the combination of multivariate probabilistic forecasts.
In particular, we consider a setting where the forecasts are given as quantiles of all marginals of a multivariate distribution.Berrisch & Ziel (2021) show that pointwise forecast combinations potentially outperform standard methods where weights are constant over all distribution quantiles.We apply this idea to a multivariate setting by computing weights depending on the quantile and the marginals.First, we discuss batch learning methods and propose a dimension reduction technique that bridges the gap between flexible pointwise and robust constant procedures.Afterward, we show how the proposed online learning algorithm of Berrisch & Ziel (2021) can be extended for combining the marginals of multivariate probabilistic forecasts.
Let F t = ( F t,1 , . . ., F t,K ) be a vector of K univariate distributions representing the marginal distribution of the corresponding multivariate distribution, resp.the set of experts that we want to combine.We consider the combination across quantiles (also known as horizontal aggregation): We evaluate the performance using the cumulative CRPS over all marginals.Therefore, the weights shall be chosen to minimize the cumulative CRPS of all marginals.
We can approximate the CRPS by the sum over Quantile Losses (QL) for an equidistant dense grid P = (p 1 , . . ., p P ) with p i < p i+1 and p i+1 − p i = h for all p.Clearly, P → ∞ induces h → 0, p 1 → 0, p P → 1 and the approximation converges to the CRPS (Gneiting, 2011a,b).This relationship enables us to compute pointwise weights based on quantile losses.We can extend this idea by optimizing weights not only depending on the quantile p but also on the marginal d: We are interested in setting w t,k such that the CRPS of F t,d is minimized.

CRPS learning using quantile regression
Pointwise CRPS learning has the potential to outperform standard CRPS learning methods.However, the best pointwise weights in (3) must be estimated.Theoretically, a pointwise approach has to be applied to all probabilities p ∈ (0, 1) and all marginals D = (1, 2, . . ., D) such that the bivariate weight function w t,k can be specified.However, we can never evaluate infinitely many values for p.On the same page, the computation may be infeasible if D is very large.Therefore, we must consider some finite-dimensional representation for the weight functions w t,k .A suitable option is representing the weight functions w t,k using a finite-dimensional representation using splines.Bivariate splines are a suitable option in this scenario.We can express them as follows: This is essentially the same as univariate splines with L-dimensional parameter vector (β 1 , . . ., β L ) ′ .However, the support of f is 2-dimensional.Thus, we need many more basis functions L to have a suitable description of f .A popular way to describe the bivariate basis function φ l in (4) is to assume a tensor structure (McLean et al., 2014;Wood et al., 2017).In the bivariate case, the spline function is a product of two univariate ones.In addition, φ 1,l and φ 2,l are usually chosen such that φ 1,l1 interacts with each of the considered basis functions φ 2,l2 .This, allows to renumerate the problem such that l = (l 1 , l 2 ), and yields This can be used to express the bivariate weight function as a product of the P × D parameter matrix β t,k and the bivariate basis represented by φ mv and φ pr : Given T historic forecasts F −1 t,d,k (p), the cooresponding realizations Y t,d , and the index of marginals D = (1, 2, . . ., D) we can estimate the D × P × K-dimensional parameter tensor β t by minimizing the corresponding CRPS using (2): The second line uses the shift-invariance of the quantile loss and quantile regression notation ρ p (z) = QL p (0, z) = z(p − 1{z < 0}) (Koenker, 2017).
Still, computing (7) requires the evaluation of all distribution forecasts.As discussed, this is often not possible in practice.If we restrict the evaluation to a grid of probabilities P problem (7) simplifies with (2) to In general, quantile regression problems can be solved efficiently using linear programming solvers (Koenker, 2017).However, ( 8) is not a simple quantile regression problem, but a joint quantile regression (Sangnier et al., 2016;Chun et al., 2016).
The parameters β t,j,l,k are active for multiple quantiles.Thus, adequate estimation requires solving the optimization problem for D × P × K parameters, which can be computationally costly if D, P , and K are large.
However, if we choose both basis φ pr and φ mv so that φ mv i = 1{d i } on D for ) and φ pr i = 1{p i } on P for p i ∈ P = (p 1 , . . ., p P ) then (8) can be disentangled into D × P separate quantile regression problems.This is for p ∈ P and d ∈ D where F −1 t,h,k (p) are the experts for the p-quantile and the d-marginal.
Quantile regression (9) will lead to linear optimality on D and P, as long as standard regularity conditions required for the quantile regression are satisfied (Koenker & Hallock, 2001).However, we might assume further restrictions to reduce the estimation risk, e.g., the solution is a convex combination.Taylor & Bunn (1998) discussed many related plausible restrictions for quantile combination concerning bias correction, positivity, and affinity, among others.
A potential issue of pointwise algorithms is quantile crossing.This problem occurs if we have F −1 t,d (p i ) > F −1 t,h (p j ) for some p i , p j ∈ (0, 1) with p i < p j .In this case, we recommend rearranging the predictions as sorting is known to improve the forecasting performance (Chernozhukov et al., 2010).

Multivariate Online CRPS Learning
Batch-learning approaches, like quantile regression, evaluate the entire history for estimating new combination weights, which is computationally costly.Therefore, we suggest to use online learning methods instead.
Online learning is often called prediction under expert advice.In this context, experts refer to the models producing the predictions (or predictive distributions).
The person or model that combines the experts' predictions is called forecaster.A key element of online learning methods is (cumulative) regret.It is defined as: i.e., the cumulative difference between the loss of the expert's predictions X t,k and the prediction of the forecaster X t for a loss function ℓ.X t,k might be a predicted quantile p) of expert k as discussed in the previous section.R t,k is called regret because it indicates how much the forecaster regrets not following the experts' advice (Cesa-Bianchi & Lugosi, 2006).With (10), we can formulate the EWA: where K refers to the number of experts, w 0,k to the initial weights of an expert k, and η to the learning rate, which defines how fast the weights adjust to changes in the regret Cesa-Bianchi & Lugosi (2006).We can express this aggregation rule in terms of past weights and the loss suffered by the experts (right-hand side of 11).This highlights that there is no need for evaluating the entire history when adjusting weights.
EWA yields optimal convergence rates of O(T ) towards the best expert for expconcave loss functions Cesa-Bianchi & Lugosi (2006).It means that the algorithm's performance (in terms of risk) is asymptotically not worse than the performance of the best expert.A more ambitious property that can also be satisfied is the convex aggregation property.It ensures that the risk of the algorithm is not worse than the risk of the best convex combination of the experts.For an algorithm to satisfy this property, the gradient trick is needed (Devaine et al., 2013).For exp-concave losses, this gives optimal convergence rate O( √ T ) with respect to the best convex combination of the experts (Cesa-Bianchi & Lugosi, 2006).This property also holds for losses that satisfy some Bernstein condition, such as the MAE, when algorithms like Bernstein Online Aggregation (BOA) are used.These algorithms use regularized updating techniques to improve convergence and stability properties (Wintenberger, 2017).
Berrisch & Ziel (2021) adapted BOA to probabilistic problems.The new algorithm is called CRPS learning because it optimizes the CRPS of the target distribution using pointwise optimization on a grid of quantiles.The weights can vary over time and in different parts of the distribution.CRPS learning still maintains the fast convergence of BOA.This algorithm for combining X t = ( X t,1 , . . ., X t,K ) to X t = w ′ t−1 X t can be summarized as follows: where x + and x − denote the elementwise positive and negative parts of x and ⊙ the elementwise product (Hadamard product).The learning rate η t determines the weight adjustment speed.It depends on the bound estimator E t and V t , which is an estimator for the variance.The algorithm describes how weights are calculated on a full quantile grid P. First, the instantaneous regret is calculated (12a).Then the learning rate (12b-12d) is adjusted.In (12e) the cumulative regret is calculated.
Afterward, we calculate the weights (12f).Finally, the forecaster uses w t to calculate X t+1 and starts over with (12a).
Several extensions of online learning algorithms were proposed in the literature.
However, they can also be applied in standard Batch-Learning settings.The benefits of these extensions have been confirmed in empirical studies.Some extensions, like shrinkage operators, are also valuable to nest specific weighting strategies into the learning algorithm.

Smoothing
As mentioned, we apply the general CRPS learning idea to multivariate data.
Therefore, we adopt the two weight-smoothing methods of the original CRPS Learning algorithm.The first consists of the dimension reduction method using basis matrices.
The approach is analogous to the idea discussed in Section 2.2.Using a bivariate basis, we can reduce the dimensionality of the instantaneous regret from D × P to D × P : As usual, we can use this reduced regret to carry out the online learning algorithm.
After obtaining weights (we refer to them as β t,k ) on this condensed version of the regret, we can utilize the basis matrices once again to obtain weights in our original This relatively simple method yields a powerful property: It bridges the gap between purely pointwise weight optimization based on quantiles and the constant approach where a single weight is optimized with respect to the CRPS.That means we can move from a setting with low flexibility (i.e., a few parameters) and low estimation risk to a very flexible one (with many parameters) at the price of high estimation risk.
Another option is to smooth the weights using penalized smoothing.This method can be applied after the estimation, i.e., after the updating step.Hereby we consider two sets of bounded basis functions ψ mv = (ψ 1 , . . ., ψ D ) and ψ pr = (ψ 1 , . . ., ψ P ) on (0, 1) that we will use for penalized smoothing.
Then the weights can be represented by with parameter matix b t,k .We estimate b t,k by penalized L 1 -and L 2 -smoothing which minimizes for each k given β t,k with differential operator D q of order q.The differential order characterizes the smoothing penalty, and λ ≥ 0 characterizes the roughness penalty.
Typically, q = 2 is considered along with cubic B-Splines to penalize for roughness (Wang, 2011;Wood, 2017).However, we prefer using q < 2 here.This smoothes towards constant weights over P for λ → ∞ and not towards a linear relationship between weights and probabilities as for q = 2.As pointed out in Berrisch & Ziel (2021) no argument supports shrinkage towards a linear relationship.In contrast, shrinkage towards constant weights yields the non-pointwise CRPS-learning theory of constant weight functions.However, let us remark that the penalized smoothing approach with λ → ∞ yields a different result than the simple basis smoothing approach mentioned before with φ = φ 1 ≡ 1.
In applications, we only apply this function bases approach on finite grids of probabilities P = (p 1 , . . ., p P ) and a finite number of marginals D = (1, . . ., D).If we consider B-Spline basis functions ψ mv and ψ pr , then an explicit solution based on ordinary least squares exists for (15).This explicit solution has a ridge regression representation.The smoothed weights matrix w t,k is then given by with basis matrices B mv = ψ mv (D) and B pr = ψ pr (P), penalty matrices D mv q and D pr q .We can easily compute the penalty matrix if the b-spline basis has equidistant knots.Hereinafter, we distinguish D S q and D G q , which refer to the equidistant case and the general case where knot placement does not have to be equidistant, respectively.
Let ∆ denote the matrix difference operator: Now, D S q can be easily computed as D S q = ∆ q I.The computation of D G q is more intricate since non-equidistant knots are permitted.The calculation involves an additional weighting step with respect to the non-equidistant distribution of the knots.
We elaborate on this topic briefly since the literature is surprisingly scarce (Li & Cao, 2022, Section 2.2).The required difference matrix can be computed as where W q are weighting matrices that depend on the order of the B-Spline basis, denoted as o, and the knots.Let J denote the number of inner knots.The dimension of the difference matrices ∆ in (18) depend on W q are (J + o − q) × (J + o − q + 1).
The total number of knots will be J + 2o.We can specify the weighting matrices as: The quantity o − q represents the lag used to differentiate the knots.If the knots are equidistant, then W q will be proportional to the identity matrix I. Therefore, it nets the standard P-Spline, which uses D S q = ∆ q I.This gives rise to the general P-Spline estimator.However, W q is only proportional to I rather than equal to it.Therefore, scaling needs to be applied for the results of both estimators to coincide.
The scaling can be applied to lambda, the penalty matrix, or the difference matrix.
To state this formally, let P S q = D S q ′ D S q and P G q = D G q ′ D G q denote the penalty terms of the standard and general P-Spline estimators.The scaling factor with respect to the penalty P G q is (Tr (W q ) /(J + o − q)) 2q so the following relation holds: This is only valid for equidistant B-Splines.For non-equidistant B-Splines, the generalized P-Spline is the only appropriate estimator.However, P G q should always be scaled to ensure the comparability between lambda values in equidistant and nonequidistant situations.
For notational brevity, we denote the first and last part of ( 16) as H mv and H pr , respectively, the so-called hat matrices.Fortunately, they do not depend on timevarying components; therefore, we can compute them prior to the main online learning task, which yields a great reduction in the algorithm's computational complexity.

Knot placement for Smoothing Splines
For both smoothing approaches discussed above, the knots of the B-Spline basis must be placed.A well-established approach is placing plenty of equidistant knots.
However, as discussed above, non-equidistant knot placement is valid if the penalty is defined accordingly.We consider equidistant and non-equidistant knots.Thereby, the non-central beta distribution with the following parameterization is used for distributing the knots: Result: Sequence of knots with length J + 2 (deg + 1) = J + 2o placements for the inputs of Algorithm 1.

Shrinkage operators and Forgetting
Shrinkage operators are well-known in statistical learning theory.They help to reduce the overfitting problem by shrinking a solution.The P-Spline smoothing discussed above can also be interpreted as a shrinkage operator.However, simple shrinkage operators can also be applied to β t .We consider three additional shrink- hard-thresholding operator H.They are defined as for some ϕ ∈ [0, 1], ν ≥ 0 and κ ≥ 0. The fixed share operator shrinks towards the naive combination.This is preferable if no prior information on the experts' performance is available.For some shrinkage problems, there are theoretical guarantees for improvements Tu & Zhou (2011);Cesa-Bianchi et al. (2012).Applications in the context of online learning include, e.g., Cesa-Bianchi et al. (2012); Gonzalez et al. (2021b).The thresholding operators S and H were also considered in online learning contexts previously (Dalalyan et al., 2012;Gaillard & Wintenberger, 2017).Applying thresholds leads to sparse solutions.Both appear in several situations for specific linear model estimators.Most notably, soft-thresholding is the key operator in the coordinate descent algorithm for estimating the lasso (Friedman et al., 2007).Applying any threshold operator potentially violates affinity constraints (incl.the convexity constraint).Therefore, projections to the desired solution space should be applied.
As mentioned, cumulative regret is a key element in online learning.However, in settings with a long history, there might be structural breaks in the data.These breaks motivate the introduction of the forgetting factor.It means that only a limited amount of the old cumulative regret is considered for adjusting the weights.In other words, the algorithm forgets about some part of the past performance.In online learning, usually, exponential forgetting is chosen Guo et al. (2018); Messner & Pinson (2019); Ziel (2022).The Regret with a forgetting factor θ ∈ [0, 1] is formally defined as where θ = 0 correspons to no forgetting.Optimal values for the forgetting factor θ are usually close to 0. The forget should be applied to all hidden state variables in sophisticated online learning procedures like BOA.

Full Model and Hyperparameter Optimization
Algorithm 2 shows the multivariate online CRPS-Learning algorithm.This includes all extensions discussed in Subsections 3.1 and 3.3.Considering all extensions, this algorithm contains five general hyperparameters (the forget rate ϕ, the parameters of the shrinkage operators θ, κ, ν) as well as 30 hyperparameters concerning the design of basis and hat matrices.
The algorithm is versatile as it handles several special cases discussed in the literature.One such case is the uniform combination, also known as the naive combination.
This can be calculated using the Fixed-Share operator F ϕ with ϕ = 1, resulting in uniform weights.There are more efficient ways of calculating uniform weights.However, adjusting the value of ϕ allows a smooth transition from the uniform solution to the  Setting only one of the basis matrices to the unity vector will result in weights that are constant over either marginals (Constant Mv) or probabilities (Constant Pr).
Additionally, setting all smoothing matrices to identity produces pointwise weights, and optimizing λ in the hat matrices concerning the predictive CRPS results in possibly smoothed weights.These cases are illustrated in Figures 2a-2d.The extensions discussed in Subsections 3.1 and 3.3 require the specification of various hyperparameters.There are many possible hyperparameters to choose from, and we do not have  any prior information on the best values.This means that it is impossible to test all combinations of these parameters in each iteration of the forecasting task.The latter would be ideal, but it is impractical due to the required computational resources.As a result, we need to use other, less demanding methods for tuning these hyperparameters.In this paper, we will utilize three approaches for tuning.
The first approach to hyperparameter tuning is using a sophisticated search algorithm based on random forest and optimizing towards the lowest CRPS on a subset of our observations (i.e., a training set).We utilize the R-Package mlrMBO to execute this optimization (Bischl et al., 2017).This approach brings one significant advantage: the search algorithm can efficiently search the considered space by repeatedly reevaluating the objective function.However, once the final set of hyperparameters is selected, it will remain constant throughout the forecasting task.Additionally, the tuning is only executed using a small subset of the dataset.This could be a problem as the chosen set of parameters may not be optimal for the rest of the dataset, especially if there are structural breaks.Hereinafter, we will refer to this approach as Bayesian fix as it fixes the hyperparameters after utilizing a Bayesian search algorithm.
The second approach uses a function included in the R-Package that will be made available in connection with this paper.It implements the proposed algorithm and an online tuning strategy for the hyperparameters.This strategy considers a random sample of all possible hyperparameter sets, and for each iteration, the combination with the lowest aggregate CRPS is chosen.In contrast to the Bayesian fix approach, we define all possible parameter sets before the learning task.However, this method dynamically selects the parameter set based on past performance, allowing for dynamic adjustments if underlying properties change.The most significant drawback of this approach is that only a random sample of the hyperparameter space is considered.However, this approach has the advantage of adjusting dynamically to changes in the data.Therefore, we will refer to this approach as Sampling Online.
It is also possible to combine both approaches.In this case, mlrMBO optimizes on a subset of the data.Afterward, online uses the parameter combinations that got proposed in the mlrMBO optimization.This has the potential to profit from efficient exploration of the hyperparameter space and the ability to adjust to changes in the data dynamically.After this, we will refer to this approach as Bayesian Online.

Application to Multivariate Probabilistic Day-Ahead Power Prices
In day-ahead electricity price forecasting, we consider the price Y t,h at day t and product h = 1, . . ., H of the day.For hourly electricity prices, we have H = 24, and therefore h is often referred to as hour, see Ziel & Weron (2018).We consider the forecasts of Marcjasz et al. (2022), which range from 27th Dec. 2018 to 31st Dec. 2020.
The performance of combinations is mainly determined by two factors: the performance of the considered experts and the diversity between them.The first should naturally be high as an expert can only be beneficial if it provides valuable information; the latter is equally important since there is close to no benefit in combining very similar forecasts.Figure 3 shows the correlation between the experts.We show Pearson's correlation on the lower triangle, which takes values in [−1, 1].In the upper triangle, we show the distance correlation.The distance correlation is a non-linear dependency measure that takes values in [0, 1] and characterizes stochastic independence Székely et al. (2007).Unsurprisingly, we observe positive values for both dependence measures as all time series forecast the same target.However, all values are clearly below 1.This indicates diversity between experts, which is beneficial for the combination task.The simulations of Berrisch & Ziel (2021) show superior performance for penalized smoothing compared to the basis smoothing approach.Therefore, we solely use the penalized smoothing approach for our learning task.We use 99 knots, i.e., one on each quantile.
We consider the knot placement and the other extensions discussed in Subsections 3.1 and 3.3.Table 1 summarizes the considered hyperparameters.That is, we have a total of 15 tuning parameters to optimize.
We conduct the forecasting task using the three tuning strategies Bayesian fix, Sampling Online, and Bayesian Online discussed in Section 4.
Table 1: Considered hyperparameters with their ranges and respective transformation functions and three nested specifications (nested in the Full specification, which considers all of the above hyperparameters).
observations.However, we do not evaluate the forecasts of the first 50 observations due to the elevated estimation uncertainty early in the learning process.We utilize the Krigin learner of mlrMBO to propose eight new points until the budget of 1000 points is exhausted.This is done in parallel.Then, the best hyperparameter set is used to conduct the forecast combination task with all 736 observations.For our final evaluation, we follow Marcjasz et al. (2022) again by excluding the first 182 observations.Appendix B.2 presents a detailed overview of the computation times on our Intel i5-12600K CPU.
For sampling online, we first divide the range of each hyperparameter into 16 equidistant values, apply the transformation function (see Table 1), and then randomly sample up to 2500 points from the resulting multivariate hyperparameter space.
The online optimization process is then carried out as described in Section 4. As with Bayesian fix, the first 182 observations are excluded from the evaluation.
For Bayesian Online, we run Bayesian fix analogous to the above but with a reduced budget of 750 points to propose.These points are fed into the Sampling Online optimization.
In addition to tuning all 15 hyperparameters (Full), we examine three subsets of these hyperparameters.The first subset only includes penalized smoothing and forget (Smooth.Forget), the second subset only includes penalized smoothing (Smooth), and the last subset only includes forget (Forget).Note that the time required for computing Smooth Forget is reduced when using Sampling Online as the number of possible parameter combinations does not exceed 2500.The specifications are summarized in Table 1.We also report the performance of the naive, the performance of each expert, and the four special cases shown in Figure 2. We also report the performance of ML-Poly and EWA weighting schemes as they are popular in the forecast combination literature (Gaillard et al., 2014;Jore et al., 2010;Dalalyan et al., 2012;Opschoor et al., 2017).V 'yugin & Trunov (2022) and Zamo et al. (2021) use EWA together with the CRPS to receive constant weights across the whole distribution.This corresponds to the B Constant Pr scheme using EWA.The ML-Poly algorithm with the CRPS is used in Thorey et al. (2018); this corresponds to the B Constant Pr scheme using ML-Poly.However, ML-Poly and EWA have inferior convergence properties, compared to BOA (Berrisch & Ziel, 2021).That is, we do expect them to perform worse.As an additional benchmark, we report the performance of the Follow-The-Leader (FTL) strategy.This strategy selects the expert who had the smallest loss in the previous iteration (Huang et al., 2017).We always apply the gradient trick (see.Section 3).
We tested the hypothesis of equal accuracy in forecast performance between the naive model and the more sophisticated forecast combinations using the Diebold Mariano (DM) test (Diebold & Mariano, 2002).We apply this DM test with the small sample adjustment of Harvey et al. (1997, eq. ( 9)).Thereby we use the following , where L • t denotes the 24-dimensional vector of CRPS losses on day t for the respective model and ∥L∥ 1 denotes the L 1 norm of the former.The table cells are colored according to the resulting test statistic of the Diebold Mariano test.
We see that all individual experts perform worse compared to naive.The best results were obtained by the Bayesian Online approach and using equidistant knots, JSU1 JSU2 JSU3 JSU4 Norm1 Norm2 Norm3 Norm4 Naive  penalized smoothing, and a forget rate (Smooth.Forget).This solution yields a significant improvement over the naive combination.Comparing this solution with the smaller models Smooth and Forget, we conclude that forgetting contributes to most of the observed improvement.The importance of the forgetting factor indicates structural changes in the data.Further evidence comes from the fact that the dynamic Bayesian Online optimization generally outperforms parameter optimization using Bayesian fix.We did analyze the performance of the combination schemes on subsets of the data.However, the patterns are very similar to the ones observed in Table 2.
Therefore, we do not report them here.
Further, BOA performs best compared to the other considered weighting schemes ML-Poly and EWA.Overall, forgetting and smoothing play a crucial role in the performance of the combination.Finally, regarding the hyperparameter tuning, we conclude that the dynamic optimization Bayesian Online and Sampling Online should be preferred to the static Bayesian optimization Bayesian Fix.
We also analyzed the issue of quantile crossing for all considered combination schemes.Quantile Crossing happened in at least one marginal on 67 of the 554 testset days for the best performing scheme Bayesian Online Smooth.Forget using BOA.For brevity, we transferred the detailed discussion to Appendix B.3. of experts (Kupiec et al., 1995).This table corresponds to the first row, and the BOA column of We performed the Kupiec and Christoffersen tests for unconditional coverage (Kupiec et al., 1995;Christoffersen, 1998) Bayesian Online).They depict the temporal evolution of the weights for each expert.Thereby, Figure 5 presents the temporal evolution across probabilities for hour 16 of the day.After a brief initial burn-in period, the weights of the eight experts stabilized.There is a higher degree of variability in the center of the distribution.
The weights are close to the uniform solution at the tails, with only a few exceptions.
JSU4 strongly influences the combined value in the center of the distribution until around April 2020.After that point, the weight of JSU4 decreases, and JSU3 becomes more prominent.Further, there seem to be noticeable changes in the weights around June 2019 and April 2020, suggesting possible structural changes.These structural changes potentially lead to the dynamic hyperparameter tuning Sampling Online and Bayesian Online performing better than Bayesian fix due to the ability of the hyperparameters to adapt the changing data.
Figure 6 shows the temporal evolution of the weights at the median across all 24 hours.However, the high weights for JSU4 (see Figure 5) are only present in the afternoon and evening.Additionally, the plot reveals structural changes around June 2019 (regarding JSU4 and NORM4) and March 2020 (concerning JSU3, JSU4, NORM3, and NORM4).The latter coincides with the German government's introduction of strict COVID-19 measures.So, we suspect that these changes are due to changes in the power market due to the adjusted behavior of the market participants.
Both graphs indicate that the weights vary with time, hours, and quantiles.Thus, a flexible approach like the one proposed in this study seems appropriate.Lastly, the weights show less smoothing across hours than across quantiles.
Figure 7 presents the parameters used by the proposed online.sm.fr specification.
This approach optimizes the forgetting rate and the two smoothing penalties.All parameters need some time to stabilize.However, the chosen burn-in period (marked in grey) seems to suffice for the most part.Further, we observe an increasing forgetting as the learning progresses and a consistent smoothing level across both dimensions.
Lastly, the weights are getting more smoothed across probabilities than across hours (see also Figure 2d, which presents the most recent weights of this solution across hours and probabilities).Note, however, that the parameters show more persistence as time progresses.This is because hyperparameters are selected based on the cumulative past performance.For larger time series, it is, therefore, advisable to introduce a forgetting factor to the cumulative past performance to ensure reasonably fast adjustment of the hyperparameters to structural changes in the data.Our implementation, which will be made available in connection with this paper, includes this setting.However, we did not use it in this paper as the time series is relatively short.

Conclusion
This paper proposes a novel method for combining multivariate probabilistic forecasts, considering dependencies between quantiles and marginals through a smoothing procedure.The first discussed smoothing method reduces the dimensionality of the regret using Basis matrices.It bridges the gap between pointwise and constant weight optimization.The second method involves smoothing the weights by penalized smoothing.The proposed method extends the (online) CRPS learning algorithm and is based on gradient-based EWA, which yields fast convergence rates.We also discuss possible extensions to the algorithm, such as forgetting and shrinkage operators.We also discuss how non-equidistant knots can be used in penalized smoothing, i.e., how the penalty term needs to be adjusted.
We apply the proposed methodology to multivariate probabilistic forecasts for dayahead electricity prices.For hyperparameter tuning of Algorithm 2, we compare three strategies Bayesian fix and Sampling Online and a mixture of both Bayesian Online.
The best performance is obtained by optimizing Bayesian Online with penalized smoothing and a forget rate.This specification significantly improves over the naive combination.The forget rate contributes to most of the improvement, indicating structural changes in the data.The combination weights stabilize quickly after a short burn-in period.However, we observed some structural change in the weights around November 2019, which remains puzzling.The BOA weighting scheme obtained the best results out of the three considered online learning algorithms.
The smoothing methods used in this paper assume a simple metric structure.This assumption is reasonable for time series that are combined over the forecast horizon, as the order is naturally determined by time.In other cases, the adjacency structure may be more complex.For example, in spatial data, the first and last coordinates are often adjacent (e.g., forecasting a meteorological feature along the equator).Other spatial structures (e.g., across the globe) may be even more complex.We can account for these structures using the Laplacian matrix (graph Laplacian, admittance matrix, or Kirchhoff matrix) as the smoothing penalty.The Laplacian can be calculated from the so-called adjacency matrix, which describes the metric structure of the data.This procedure generalizes our methodology to more complex adjacency structures.This is a promising topic for future research.
The approach proposed in this paper is based on the CRPS as it combines the marginals of a multivariate distribution.This means that the experts only need to report the marginals.The copula is not needed and remains untouched.An interesting question is if and how the copula can be incorporated into the CRPS learning framework.The Energy Score (ES) would be one option to base the weight updates on.However, a decomposition similar to 2 does not exist.Therefore, the ES is suitable only for describing the performance of a joint distribution as a whole.We can use the ES to derive constant combination weights across the entire distribution.
However, this will ignore any variation in the experts' performance across the joint distribution (e.g., across quantiles).Limited research exists on so-called α-projection quantiles (Hallin et al., 2010;Kong & Mizera, 2012).Meng et al. (2020) show a disintegration of the ES using α-projection quantiles in their preprint.This may be a promising starting point for incorporating the copula into the flexible CRPS learning framework.For now, this remains a challenging topic for the future.
Overall, the empirical results approve the proposed combination algorithm, which uses adaptive weights across time, quantiles, and marginals.The algorithm surpasses the performance of simpler weighting schemes such as the naive in terms of the CRPS.
A fast C++ implementation of the proposed algorithm will be made available in )Where I x is the incomplete beta function, a and b are shape parameters, and c is the non-centrality parameterJohnson et al. (1995).Algorithm 1 describes the knot placement in detail.It returns equidistant knots if µ = 0.5, σ = 1, c = 0 and the tailweight parameter τ = 1.Figure1shows B-Spline Basis' for different knot Algorithm 1: Knot placement for B-Splines Data: Let B denote the CDF of the beta distribution and µ, σ, c, τ, deg be parameters for adjusting the knot placement.
Most recent weights JSU1 (a) Constant weights w.r.t.time (hours) Most recent weights JSU1 (b) Constant weights w.r.t.probabilities Most recent weights JSU1 (c) Optimized pointwise weights Most recent weights JSU1 (d) Optimized smoothed weights

Figure 2 :
Figure 2: Most recent weighs of JSU1 calculated using different specifications of Algorithm 2

Figure 3 :
Figure 3: Correlation plot with Pearson's correlation on the lower triangle and distance correlation on the upper triangle.

Figure 4 :
Figure 4: P values and test statistics of the unconditional coverage test of Kupiec for the full set

Figure 5 :
Figure 5: Temporal evolution of the weights of Smooth.Forget Bayesian Online at hour 16 across all 99 probabilities

Figure 6 :Figure 7 :
Figure 6: Temporal evolution of the weights of Smooth.Forget Bayesian Online at the median (50% quantile) across all 24 hours Marcjasz et al.
(2022) used about half a year of data, i.e., the first (182) observations, as a burn-in period for hyperparameters to stabilize.With Bayesian fix, we use these first 182

Table 2
summarizes the results.It reports the CRPS of each expert and the naive combination in the top row and the performance of different specifications of Algorithm 2. The combination schemes in Table 2 consider the full set of experts.

Table 2 :
CRPS scores (lower = better) of the individual experts and the naive combination (top), and different specifications (see Table1) of Algorithm 2 (bottom).The cells are colored according to the test statistic of the DM-Test comparing the model in question to Naive using (greener means lower test statistic, i.e., better performance compared to naive).

Table 2 .
The facets present 50% (top) and 90% (bottom) intervals, the symbols indicate significance, and the cells are colored w.r.t. the test statistics.