Covariance Structures for High-dimensional Energy Forecasting

Abstract—Forecasts of various quantities over multiple time periods and/or spatial expanses are required to operate modern power systems. Furthermore, probabilistic forecasts are necessary to facilitate economic decision-making and risk management. This gives rise to the challenge of producing forecasts that capture the dependency between variables, over time, and between locations. The Gaussian copula has been widely used for multivariate energy forecasting and is scalable because the entire dependency structure is captured by a covariance matrix; however, estimating this covariance matrix in high-dimensional problems remains a research challenge. Here we focus on parameterising the covariance matrix, as a step towards more robust estimation and to enable conditioning on explanatory variables. We present a range of parametric structures and estimation strategies suitable for multivariate energy forecasting.


I. INTRODUCTION
When planning and operating weather-dependent energy systems one must consider patterns of spatial and temporal variation of supply and demand. Will solar generation coincide with high demand for space cooling? Will a wind drought exacerbate high winter demand? Is there sufficient network capacity to transmit renewable energy to demand centres? To answer these questions it is necessary to consider interdependency on the relevant spatial and temporal scales. On operational time scales, this means considering the spatial and temporal structure of forecast uncertainty, or, loosely speaking, describing whether the impacts of forecast errors are likely to compound or alleviate one another. For instance, if a forecast error persists at one wind farm, will we see a similar error with the same sign at a neighbouring wind farm as well?
Copulas provide a mathematical framework for modelling dependency between random variables and have been widely applied to multivariate probabilistic energy forecasting [1]-[7]. The Gaussian copula in particular lends itself to this task in high-dimensional settings involving 10s or 100s of dimensions, but demands calibrated density forecasts as margins of the copula in order to accurately estimate covariance parameters from data. Even with calibrated density forecasts, the parameters of the Gaussian copula can be challenging to estimate, which is the subject of this paper. The empirical covariance matrix can be calculated from training data but in the high-dimensional setting may be close to singular with finite training data. Several parametrisations have been proposed for wind power forecasting based on covariance functions, which guarantee positive-definiteness of the resulting covariance matrix, or parametric precision matrices [4]-[6]. However, these are generally limited to isotropic structures (covariance depends on separation only, not specific time/location) and do not consider possible dynamics arising from dependence on time-varying covariates. Slowly-varying temporal dependency in wind power forecasts is considered in [1] using recursive estimation of the empirical covariance, and motivates parametric modelling to capture faster dynamics and avoid the lag associated with recursive estimation.
Submitted to the 22nd Power Systems Computation Conference (PSCC 2022).
Covariance estimation plays an important role in many statistical sciences, including genetics, ecology, finance, and machine learning more generally. The main challenges arise from the constraint that covariance matrices must be positive-definite, and from the fact that the number of parameters grows quadratically with the dimension p. Positive definiteness can be ensured either through constrained optimisation or by adopting an unconstrained re-parametrisation of the covariance matrix [8]. Such re-parametrisations may be based on matrix decompositions [8] or covariance functions [9]-[12]. Additional properties such as sparsity (of the precision matrix) or smoothness may also be desirable, and regularisation may be used to encourage these [13], [14], perhaps the most well-known example being the graphical LASSO [15].
In a static covariance context we have a single covariance matrix which, in the Gaussian copula case, can be estimated via the empirical covariance, provided we have enough data to do so reliably. If not, it may be advantageous to impose a structure with fewer parameters in order to obtain a more robust estimate. Furthermore, if we want to let the covariance matrix change with one or more covariates capturing weather or date/time effects, then the number of unknown parameters grows by at least one order of magnitude relative to the static case. In particular, non-parametric modelling (e.g., using splines) of each covariate's effect on each unique element of the covariance matrix leads to an overly complex model that will over-fit the available data. The solution is to make the covariance matrix vary with the covariates only in a limited set of 'directions', for example by modelling only some of its elements or by modelling the covariance matrix via a few global parameters, which are then allowed to vary with the covariates. In this work we consider some parametrisations intended to enable the latter approach. Re-parametrisation using covariance functions can drastically reduce the number of parameters to be estimated, from p(p + 1)/2 to just a few, provided that a suitable function can be found.
Two limitations of the Gaussian copula are its symmetric nature and the fact that joint extreme events are unlikely under it, which may not be appropriate for some data or applications [16]. Vine copulas, which are a series of linked bivariate copula families, offer a more flexible framework for modelling high-dimensional dependency [2], [17]. However, estimation and sampling of the vine structure and bivariate families is computationally intensive even in a static context, making vines unattractive for developing dynamic correlation models, which are the focus of this work. Furthermore, the results in [2] show that the Gaussian approach with an exponential covariance matrix outperformed the vine copula for two out of three wind farms in terms of the energy score and the p-variogram score. Alternatives to copulas include forecasting multivariate regions [18], [19], using stochastic differential equations [20], [21], and a rank-reordering method which preserves spatio-temporal characteristics, known as the Schaake shuffle [22], [23]. To the best of the authors' knowledge, none of these methods has been demonstrated in very high-dimensional energy forecasting problems.
In this work we propose a framework for generalising covariance functions by allowing their parameters to take the form of additive functions of explanatory variables. We refer to these as Generalised Additive Covariance (GAC) models and focus on the choice of loss function and on model evaluation. We begin with the necessary preliminaries in Section II before introducing GAC models and their estimation in Section III. Before applying the proposed methods we describe a suite of evaluation metrics in Section IV. We proceed with two simulation-based examples where the true covariance matrix used to generate synthetic data is known, considering dynamic isotropic covariance in Section V-A and static non-stationary covariance in Section V-B, followed by an example using real energy forecasts in Section VI. We conclude with a discussion and suggestions for future developments.

II. COVARIANCE FUNCTIONS
Spatio-temporal datasets are widespread in the environmental sciences, climatology, meteorology, and related areas, including the energy sector, which is increasingly weather-dependent. Covariance models play an important role in these fields, and here we focus on functional models widely used in geostatistics when considering a random process Z(s, t) observed at location s and time t. A subtlety of the present setting is that the temporal dimension of interest is the forecast lead-time l for a given issue time or origin t, hence we adopt the notation Z_t(s, l).
A covariance function C_t is stationary if the covariance cov(Z_t(s, l), Z_t(s + h, l + u)) = C_t(h, u) depends only on the separation (h, u), and isotropic if the further condition of invariance to the direction of h and u, cov(Z_t(s, l), Z_t(s + h, l + u)) = C_t(‖h‖, |u|), applies. Stationarity and isotropy may apply to only one of the spatial or temporal components. Whether a process is spatially and/or temporally stationary and isotropic must be determined on a case-by-case basis. However, a wide range of parametric isotropic covariance functions exists, and these serve as a good starting point for many applications. Some isotropic covariance functions are listed in Table I. Anisotropy accounting for direction may be included by replacing the distance r with the Mahalanobis distance between locations, and non-stationary extensions to the Powered Exponential and Matérn covariance functions are described in [11] and [24], respectively. We propose a new flexible approach that encompasses non-stationary and anisotropic functions in the next section.
In the remainder of this paper we focus on covariance functions of a single distance measure only (i.e. spatial or temporal), although the methods are easily extendable to multiple distances (i.e. spatio-temporal). We consider random vectors z_t with elements i = 1, ..., p being realisations of Z_t(l_i) and define the separation matrix R with elements R_{i,j} = |l_j − l_i|. The dynamic (time-dependent) covariance matrix Σ_t may then be formed by evaluating the covariance function at every pairwise separation, [Σ_t]_{i,j} = C_t(R_{i,j}). The time index t is dropped if the covariance function under consideration is static and does not depend on t.
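As a concrete sketch of this construction, the snippet below forms a covariance matrix by evaluating an isotropic exponential covariance function at every pairwise lead-time separation; the lead-times and parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def exponential_cov(r, theta, sigma2=1.0):
    # Isotropic exponential covariance: C(r) = sigma2 * exp(-r / theta)
    return sigma2 * np.exp(-r / theta)

# Illustrative lead-times for a 6-dimensional forecast
lead_times = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])

# Separation matrix R with elements R[i, j] = |l_j - l_i|
R = np.abs(lead_times[:, None] - lead_times[None, :])

# Covariance matrix formed by evaluating C at every pairwise separation
Sigma = exponential_cov(R, theta=0.5)
```

Because the exponential is a valid covariance function, the resulting matrix is symmetric and positive definite by construction, which is the key benefit of this parametrisation.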

III. GENERALISED ADDITIVE COVARIANCE MODELLING
The classes of covariance functions discussed above provide a framework which we extend to capture the complex covariance structures observed in practice, with a special focus on energy forecasting applications. In particular, let C(r; ξ) be a covariance function parametrised by the m-dimensional parameter vector ξ (e.g., ξ = {θ, σ, β} for the Powered Exponential, see Table I) and let x_t be a d-dimensional vector of explanatory variables, which could include t itself. We propose to let each element of ξ vary with x_t via a semi-parametric generalised additive model, which provides much modelling flexibility while retaining interpretability.

A. Formulation
The elements of ξ are modelled via additive models of the form

g_j(ξ_j) = β_{j,0} + ∑_i f_{j,i}(x_{t,S_{j,i}}),

where g_j is a link function and x_{t,S_{j,i}} is the subvector of x_t indexed by the set S_{j,i}, e.g., a vector including the second and fourth elements of x_t. Each f_{j,i} is a smooth function of the form

f_{j,i}(x) = ∑_k b_k^{ji}(x) γ_k^{ji},

where the b_k^{ji} are spline basis functions of dimension |S_{j,i}|, while the γ_k^{ji} are regression coefficients. Henceforth, we indicate the vector of regression coefficients used to model all the elements of ξ with γ. Note that, while ξ depends on both γ and x_t, in the following we use the simpler notation ξ_t.
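To make the additive formulation concrete, the following sketch maps a scalar covariate to one positive covariance-function parameter through a hand-rolled truncated-power cubic spline basis and a log link; the basis, knots, and coefficient values are illustrative assumptions, not the penalised regression splines used in the paper.

```python
import numpy as np

def spline_basis(x, knots):
    # Truncated-power cubic spline basis: 1, x, x^2, x^3, (x - k)_+^3
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 200)                 # covariate values
B = spline_basis(x, knots=[0.25, 0.5, 0.75])   # 7 basis functions
gamma = np.array([0.3, 0.5, -0.2, 0.1, 0.4, -0.3, 0.2])  # regression coefficients

# Additive model on the log scale (log link) guarantees theta > 0
theta = np.exp(B @ gamma)
```

Here the log link plays the role of g_j, ensuring the range parameter stays positive for any coefficient vector.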

B. Estimation
The above models may be fitted to data by minimising an appropriate loss function with respect to γ. In the static case, model (4) contains only an intercept (i.e., g_j(ξ_j) = β_{j,0}) and one might consider a loss that quantifies the difference between the modelled covariance function Ĉ(r; ξ) and the empirical covariance Ĉ_emp(r). Ordinary least squares is generally suboptimal for this purpose, though, because variogram estimates at different lags are heteroscedastic and correlated. Therefore, following [25], [26], in the case where x_t = x does not depend on time t we apply weighted least squares,

L^S_WLS(γ) = ∑_{i<j} [Ĉ_emp(R_{i,j}) − C(R_{i,j}; ξ)]² / [1 − C_cor(R_{i,j}; ξ)²]²,

where C_cor(·) is the correlation function corresponding to C(·), and we recall that ξ = ξ(γ) is a function of γ. If ξ does depend on t, however, it is necessary to express the loss in terms of the corresponding covariance matrix Σ_t = Σ(ξ_t), as

L^D_WLS(γ) = ∑_t ∑_{i<j} [(z_t ⊗ z_t)_{i,j} − Σ(ξ_t)_{i,j}]² / [1 − (Σ_cor(ξ_t)_{i,j})²]²,

where z_t ⊗ z_t is the Kronecker (outer) product of realisations at time t and Σ_cor is the correlation matrix corresponding to Σ. However, this approach is correlation-focused in that it does not evaluate the fit of the variances (i = j), so in cases where we cannot assume unit or otherwise known variances, we consider full weighted least squares losses L^S_WLSf and L^D_WLSf, which also include the i = j terms and may be similarly adapted for the dynamic case.
We may estimate γ by numerically minimising L^S_WLS or L^D_WLS (or L^S_WLSf or L^D_WLSf), noting that L^D_WLS is equivalent to L^S_WLS in the static case but is more computationally demanding to evaluate. However, this would lead to results that strongly depend on the number of spline basis functions used. To address this issue, we use the penalised objective

γ̂ = argmin_γ { L(γ) + ∑_k λ_k J_k(γ) },

where the second term contains penalties of the form J_k(γ) = γᵀ S_k γ, with S_k being a positive semi-definite matrix, while the λ_k are positive tuning parameters. The purpose of the penalties is to control the wiggliness of the smooth effects, their strength being controlled by the λ_k's, which we select by cross-validation. See [27, Ch. 5] for details.
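As a concrete sketch of the static estimation procedure, the snippet below fits an exponential correlation function to an empirical correlation matrix by weighted least squares, using an unconstrained log-parametrisation of the range θ and a toy ridge penalty standing in for the smoothness penalties above; the weight formula and the synthetic set-up are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic training data from a known exponential correlation structure
l = np.linspace(0, 1, 6)
R = np.abs(l[:, None] - l[None, :])          # separation matrix
Sigma_true = np.exp(-R / 0.4)                # true range theta = 0.4
Z = rng.multivariate_normal(np.zeros(6), Sigma_true, size=2000)
C_emp = np.corrcoef(Z, rowvar=False)         # empirical correlation

iu = np.triu_indices(6, k=1)                 # off-diagonal (i < j) terms only

def loss(params, lam=1e-4):
    theta = np.exp(params[0])                # unconstrained -> positive range
    C_mod = np.exp(-R / theta)
    resid = C_emp[iu] - C_mod[iu]
    w = 1.0 / (1.0 - C_mod[iu] ** 2) ** 2    # correlation-based WLS weights
    return np.sum(w * resid ** 2) + lam * np.sum(params ** 2)

fit = minimize(loss, x0=np.array([0.0]), method="Nelder-Mead")
theta_hat = np.exp(fit.x[0])                 # should land near the true 0.4
```

In a full GAC model the single log-range parameter would be replaced by the spline coefficient vector γ, with the ridge term replaced by the quadratic penalties γᵀ S_k γ.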

IV. EVALUATION FRAMEWORK
A range of scoring rules is available for evaluating multivariate probabilistic forecasts in the setting where the 'true' distribution is unknown and only a single realisation is available for a given predictive distribution. However, since here we focus on the dependency structure in a Gaussian copula framework, we may evaluate forecast performance in the Gaussian domain following the probability integral transformation of the observations. By applying conventional multivariate scoring rules in this setting we are essentially calculating 'copula scores', as coined by Ziel and Berk in [28]. Data in the original domain y are transformed element-wise through the corresponding predictive distribution F_i(·), which serves as the margin of the Gaussian copula, and the standard Gaussian distribution function Φ(·), to yield z_i = Φ⁻¹(F_i(y_i)) with zero mean and unit variance. The vector z of transformed data is considered a sample from a multivariate Gaussian with covariance matrix Σ, the estimation of which is the focus of this work.
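A minimal sketch of this transform, assuming, purely for illustration, Gaussian predictive margins (in which case the transform reduces to standardisation):

```python
import numpy as np
from scipy import stats

# Hypothetical predictive margins: Normal(mu_i, s_i) for each dimension
mu = np.array([0.3, 0.5, 0.4])
s = np.array([0.10, 0.20, 0.15])
y = np.array([0.35, 0.42, 0.55])  # observations in the original domain

# Probability integral transform through each margin F_i ...
u = stats.norm.cdf(y, loc=mu, scale=s)
# ... then map to the Gaussian domain: z_i = Phi^{-1}(F_i(y_i))
z = stats.norm.ppf(u)
```

For non-Gaussian margins, e.g. density forecasts from quantile regression, the same two steps apply with the forecast CDF evaluated in place of `stats.norm.cdf`.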
The Energy Score (ES) is a multivariate generalisation of the continuous ranked probability score, and a strictly proper scoring rule [29], [30]. The ES for a single forecast-observation pair, approximated using K forecast scenarios, is given by

ES = (1/K) ∑_{k=1}^K ‖z − ẑ^(k)‖ − 1/(2K²) ∑_{k=1}^K ∑_{j=1}^K ‖ẑ^(k) − ẑ^(j)‖.

The p-Variogram Score (VS-p) [31] is designed to provide greater discrimination between forecasts with different dependency structures than the ES. For a single issue time it is

VS-p = ∑_{i,j} ( |z_i − z_j|^p − (1/K) ∑_{k=1}^K |ẑ_i^(k) − ẑ_j^(k)|^p )²,

where p is the order of the variogram and ẑ^(k) is the k-th forecast scenario. Note that this score is proper, but not strictly proper, so typically both the ES and VS-p are reported for forecast verification. Moreover, even a small relative change in ES skill scores may be sufficient to discriminate between dependency models when coupled with significance tests [28].
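Both scores can be estimated from an ensemble of K scenarios; the sketch below implements the standard sample-based estimators, with unit variogram weights as an illustrative simplification.

```python
import numpy as np

def energy_score(z, samples):
    # ES for one observation z of shape (p,) and forecast scenarios of shape (K, p)
    K = samples.shape[0]
    term1 = np.mean(np.linalg.norm(samples - z, axis=1))
    pair_dists = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)
    return term1 - pair_dists.sum() / (2 * K**2)

def variogram_score(z, samples, p=0.5):
    # VS-p for one observation, summing over unordered pairs (i, j) with unit weights
    obs = np.abs(z[:, None] - z[None, :]) ** p
    fcst = np.mean(np.abs(samples[:, :, None] - samples[:, None, :]) ** p, axis=0)
    iu = np.triu_indices(len(z), k=1)
    return np.sum((obs[iu] - fcst[iu]) ** 2)
```

The pairwise-distance term of the ES is O(K²) in the number of scenarios, which is one practical reason to keep ensembles moderate in size when scoring long evaluation periods.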
The first two scores measure forecast 'errors' while the following two can be motivated by statistical or information theory. In particular, the log or 'ignorance' score [32] for a Gaussian model with zero mean and estimated covariance Σ̂ is

L_LS = log det(Σ̂) + zᵀ Σ̂⁻¹ z.

The log score is equivalent, up to an additive constant, to the negative log-likelihood of z under the model, hence it evaluates the ability of the fitted model to generate the observed data. Its expected value under z ∼ N(0, Σ) is

E(L_LS) = log det(Σ̂) + tr(Σ̂⁻¹ Σ).

By subtracting the constant log det(Σ) + p from E(L_LS) we obtain the Kullback-Leibler divergence L_KL(Σ, Σ̂), or 'relative entropy', between two Gaussians with mean zero but different covariances. L_KL(Σ, Σ̂) ≥ 0 and it is a strictly proper scoring rule, that is, L_KL(Σ, Σ̂) = 0 if and only if Σ̂ = Σ (and μ̂ = μ, beyond the zero-mean setting). Of course, if z is not Gaussian, then (12) is only proper, because distributional features beyond the mean and covariance are ignored.
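Both quantities are straightforward to compute for zero-mean Gaussians; note that the sketch below uses the conventional definition of the KL divergence, which carries a factor of 1/2 relative to the unscaled log-score expressions above.

```python
import numpy as np

def log_score(z, Sigma_hat):
    # log det(Sigma_hat) + z' Sigma_hat^{-1} z, with additive constants dropped
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    return logdet + z @ np.linalg.solve(Sigma_hat, z)

def gaussian_kl(Sigma, Sigma_hat):
    # KL( N(0, Sigma) || N(0, Sigma_hat) ), conventional 1/2-scaled form
    p = Sigma.shape[0]
    M = np.linalg.solve(Sigma_hat, Sigma)     # Sigma_hat^{-1} Sigma
    sign, logdetM = np.linalg.slogdet(M)
    return 0.5 * (np.trace(M) - p - logdetM)
```

Using `slogdet` and `solve` rather than `det` and explicit inverses avoids the numerical overflow that makes determinants of large covariance matrices unreliable, the very issue behind the infinite log score reported in Section VI.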
To compare model performance, and to assess the significance of any apparent difference in performance, it is useful to define skill scores. Skill scores may be calculated for any metric using

S = (M − M_ref) / (M_perf − M_ref),

where M is the metric's value for the method being considered, M_ref is the metric's value for a reference method, and M_perf is the metric's value for the 'perfect' method, usually zero. We will use bootstrap re-sampling of skill scores to determine whether apparent differences in forecast performance (i.e. positive or negative skill) are significantly different from zero at the 0.05 level.
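The skill score and the bootstrap check can be sketched as follows, assuming negatively-oriented metrics with a perfect value of zero; testing whether the bootstrap confidence interval excludes zero is one simple way to operationalise the zero-skill null hypothesis.

```python
import numpy as np

def skill_score(M, M_ref, M_perf=0.0):
    # Positive values indicate improvement over the reference method
    return (M - M_ref) / (M_perf - M_ref)

def bootstrap_skill(scores, scores_ref, n_boot=1000, seed=0):
    # Bootstrap distribution of the skill score of mean scores over forecast cases
    rng = np.random.default_rng(seed)
    n = len(scores)
    out = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        out[b] = skill_score(scores[idx].mean(), scores_ref[idx].mean())
    return out

# Skill differs significantly from zero at the 0.05 level if the
# (2.5%, 97.5%) percentile interval of the bootstrap draws excludes zero
```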

V. EXAMPLES WITH SYNTHETIC DATA
Here we test the proposed covariance modelling approach using synthetic data generated by sampling from multivariate normal distributions with known covariance matrices. In both cases, we simulate 5000 samples from the 'true' model and attempt to estimate this model using GACs. We then simulate a further 5000 samples to use for out-of-sample testing and evaluation. The log and relative entropy scores are calculated using the estimated covariance matrix Σ̂ directly; however, the energy and variogram scores require trajectory forecasts to be produced. Therefore, for each sample in the test data, we draw trajectories ẑ^(i), i = 1, ..., 1000, from N(0, Σ̂) for evaluation.

A. Dynamic isotropic covariance
First we consider the process Z_t(l) with covariance function C_t(r) that generates random vectors z_t ∈ R^6 with l = 0, 0.2, ..., 1. The covariance function is of the Powered Exponential form with a range parameter θ(x_t) depending on the covariate x_t, which is a realisation of X_t ∼ U(0, 1). We attempt to estimate a covariance model of the form

θ̂_GAC-CR(x_t) = β_0 + f_cr(x_t),

where f_cr(·) is a cubic regression spline with five basis functions. The model is fitted by minimising (9), with λ = 5 × 10⁻⁵. For comparison, we also consider the empirical covariance matrix of the 5000 training samples, and a GAC model with a simple linear model for θ, specifically

θ̂_GAC-Linear(x_t) = β_0 + β_1 x_t.

The true and GAC covariance structures are dynamic through their dependence on x_t, but are guaranteed to be positive-definite as the resulting covariance functions are always of the Powered Exponential form for a given realisation of x_t. Estimating the relationship between θ and x_t is the estimation problem we are addressing. Of course, data generated from a real-world process is unlikely to follow an exact, known covariance function, but the purpose of this exercise is to verify that the GAC modelling approach is able to recover something like the true relationship (16) from samples of Z_t(l).
The true and estimated functions θ(x_t), θ̂_GAC-Linear(x_t), and θ̂_GAC-CR(x_t) are plotted in Fig. 1. The linear fit is unable to replicate the shape of θ(x_t) but is an improvement on the static empirical covariance, as verified by the evaluation metrics listed in Table II. Visually, GAC-CR is better able to reproduce the shape of θ(x_t), although even with 5000 samples in the training set, we found a non-zero smoothness penalty to be necessary to achieve a good fit (not shown).
The GAC model successfully captures the dynamics of Z_t(l) and has a clear interpretation through the estimated smooth function θ̂_GAC-CR(x_t). However, a real-world process is unlikely to be governed by a known parametric form, so model specification may be challenging given the large number of candidate covariance functions and of possible additive models for selected parameters, the latter associated with several hyper-parameters (basis functions, smoothness penalties) to tune.

B. Non-stationary static covariance
Motivated by a structure observed in energy forecasting, we consider a non-stationary (and therefore also anisotropic) covariance function cov(Z(l_1), Z(l_2)) = C(l_1, l_2), based on an exponential covariance function whose range parameter θ(l_1, l_2) increases with the sum of the coordinates, with the two remaining constant parameters equal to 1 and 0.8. This structure exhibits a slower decay rate, and hence longer-range correlation, for greater values of the coordinates l_1 and l_2, and is visualised in Fig. 2. This feature resembles the observation that forecast errors tend to persist for longer the further into the forecast horizon we look.
For the simulation experiment, we consider a 51 × 51 covariance matrix with l_1 and l_2 taking values 0, 0.02, ..., 1, illustrated in Fig. 2, and aim to estimate the two constant parameters along with the smooth function θ(d) = β_0 + f_cr(d), d = l_1 + l_2, with 10 basis functions, using the full weighted least squares loss (8) within (9), and with λ = 10⁻⁴. A drawback of this approach is that there is no guarantee that the fitted covariance function will produce positive-definite covariance matrices when evaluated at all values of d. Here we apply the algorithm described in [33] to find the nearest covariance matrix, though this is not entirely satisfactory, as discussed in Section VII.
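For context, the core of such a repair step can be sketched via eigenvalue clipping, which yields the nearest positive semi-definite matrix in the Frobenius norm; the algorithm of [33] additionally restores the unit diagonal of correlation matrices by alternating projections, which this simple stand-in omits.

```python
import numpy as np

def nearest_pd(A, eps=1e-8):
    # Symmetrise, then clip non-positive eigenvalues to a small positive value
    B = (A + A.T) / 2.0
    w, V = np.linalg.eigh(B)
    w = np.clip(w, eps, None)
    return (V * w) @ V.T  # reassemble V diag(w) V'

# An infeasible 'correlation' matrix: symmetric but indefinite
A = np.array([[ 1.00, 0.99, -0.99],
              [ 0.99, 1.00,  0.99],
              [-0.99, 0.99,  1.00]])
A_pd = nearest_pd(A)
```

As noted in the text, such a perturbation repairs positive definiteness but can disturb the interpretation of the fitted parametric structure.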
For reference, we also evaluate the empirical covariance estimated on the training data, and a stationary exponential covariance function with a constant estimated parameter θ = θ̂.
Evaluation metrics for the resulting models are presented in Table III, and the significance of apparent differences in these scores has been tested using bootstrap re-sampling of skill scores. The resulting GAC model, with constant parameter estimates of 0.992 and 0.803 and smooth estimate θ̂(d), outperforms both the empirical and stationary (constant θ) reference models in terms of the log and relative entropy (KL) scores. Energy and variogram skill scores are, however, not significantly different from zero when comparing the GAC and empirical covariance models. In fact, the energy score is not able to discriminate between the performance of any of the models.

VI. WIND POWER CASE STUDY
Multivariate wind power forecasting has many use cases in power system operation and energy trading. Spatial dependency is important for managing network constraints, and temporal dependency for scheduling storage and conventional generation. Furthermore, with regions now containing 10s or 100s of wind farms, and lead-times from 0 to 120 hours ahead required for operational planning, multiple years of historical forecast data are needed to estimate a positive definite (PD) empirical covariance matrix. That said, practically speaking older years are less relevant as wind farm development is ongoing, and missing data due to curtailments and so on can pollute historic datasets. In this high dimensional setting, parametrisation of the covariance matrix is essential.
Here we consider the temporal dependency structure of short-term wind power forecasts for the total of Scotland's approximately 10 GW of wind capacity. The dependency is modelled in a Gaussian copula framework where the marginals of the copula are density forecasts of wind power and the temporal dependencies are described by a covariance matrix. We use a non-stationary GAC parametrisation to capture the temporal covariance structures observed in the data.
The case study is based on short-term (0-48h ahead) forecasting of the metered Scottish wind fleet during 2018-2019. Periods where curtailment is over 10% of the estimated total capacity are excluded. Density forecasts for each half-hour period are generated using multiple quantile regression with inputs based on 10m and 100m wind speed forecasts from ECMWF, with parametric estimates for the tails of the distributions [6]. The first 18 months of the dataset are used for model training and tuning via cross-validation, and the last 6 months are used for out-of-sample evaluation.
Fig. 3a shows the empirical temporal dependency structure of the forecasts estimated on the training data, which is non-singular. The correlation matrix has a 'funnel' structure along the diagonal describing how errors tend to persist for longer further into the forecast horizon. Furthermore, the funnel is not monotonically increasing but exhibits some smooth, perhaps periodic, variation along its length, and there is also the suggestion of additional off-diagonal structures.
We model the funnel structure using the Powered Exponential correlation function (with σ = 1, as appropriate for a correlation matrix), initially with constant parameters to serve as a reference, and then using the proposed GAC approach. In the latter, we allow θ to be a smooth function of the distance d = s_1 + s_2 along the diagonal, with a constant shape parameter to be estimated. We model θ flexibly, rather than the shape parameter, in the first instance as it is the simpler effect in the model. Since this is a correlation matrix, the model is estimated using L^S_WLS(γ), and a smoothing parameter of λ = 0.1 is chosen. As in the example in Section V-B, this non-stationary structure does not guarantee positive definiteness, so as before we find the nearest PD matrix to the fitted GAC model where necessary.
The resulting stationary and GAC correlation matrices are plotted in Fig. 3b and 3c, respectively, and clearly highlight how the stationary model is unable to capture the structure observed in the empirical correlation matrix. The GAC model successfully captures the 'wavy funnel' diagonal structure observed in the empirical correlation matrix. Visually, the GAC matrix resembles a smoothed representation of the empirical one, which is desirable, while the stationary matrix appears deficient in comparison. This is verified by the evaluation scores in Table IV.
The three matrices are evaluated using the forecast metrics introduced previously; however, with six months of testing data, only the large improvement in log score is significant at the 0.05 level. The log score for the empirical correlation matrix is returned as infinity due to the precision limits of the computation, which highlights the challenge of working with high-dimensional probabilistic forecasts.

VII. DISCUSSION AND CONCLUSIONS
We have proposed a modelling framework, Generalised Additive Covariance, for covariance (or correlation) functions and matrices that depend on explanatory variables, as is the case in energy forecasting and other applications. By modelling the parameters of covariance functions using additive models, it is possible to describe high-dimensional covariance matrices with a small number of parameters in an interpretable way. The proposed method has been verified in two synthetic examples, of time-varying isotropic covariance and static non-stationary covariance, and on a further example using real wind power forecast data. In all cases the proposed method outperforms benchmarks, including the empirical covariance and conventional parametric covariance functions, in terms of the ignorance score; the difference is present but not always statistically significant in terms of the other evaluation metrics. However, much work remains to determine the theoretical properties of these models and to improve their estimation.
The use of a parsimonious parametrisation is key to enabling modelling of the covariance matrix as a function of explanatory variables, such as date/time or meteorological conditions, which are common in energy forecasting. In particular, this approach limits the number of parameters that need to be estimated and allows the user to focus on modelling a few interpretable parameters as functions of the covariates. For example, if we consider the Powered Exponential function, it is quite simple to understand the effect of varying each parameter on the resulting covariance structure. Hence, a modeller should be able to use their expertise to develop a sensible additive model for each parameter, and methods for automatic feature selection could be explored in future work.
However, the resulting covariance function is generally not guaranteed to be positive definite. The lack of such a guarantee is problematic from a model-fitting perspective. For instance, likelihood-based fitting of a Gaussian copula requires a PD covariance at each optimisation step, otherwise the likelihood corresponding to the proposed parameters is not defined. Assuming that the current parameter vector leads to a PD matrix, the simplest solution is to backtrack from the proposed towards the current parameters until the resulting matrix is positive definite, as done by [4] and [12]. But such a "brute-force" approach is inelegant and might not scale well with the number of parameters. Other fitting criteria, such as the variations on least squares used here, do not rely on the proposed matrix being PD at each optimisation step, but the final matrix might not be positive definite. Hence, it must be perturbed to produce a PD matrix, although this might disrupt the interpretation of the underlying matrix parametrisation.

Fig. 3: Correlation matrices describing the temporal dependency structure of wind power forecasts from 0 to 48 hours ahead, issued at midnight, using the same colour scale as Fig. 2: (a) empirical, (b) stationary, (c) GAC with smooth θ̂_cr(d). The forecasts have a visible non-stationary structure. The width of the diagonal ridge indicates how long forecast errors are likely to persist in time, which grows with lead-time but also appears to depend on the time of day.
A possible solution to this problem is to penalise the smallest eigenvalue in the optimisation routine to maintain positive definiteness, as in [13], although the implications of this for the quality of the resulting fit are unclear. Alternatively, the positive definiteness problem can be solved by adopting a parametrisation under which the resulting matrix is PD for any parameter value. For example, positive definiteness can be guaranteed by modelling the elements of the Cholesky decomposition of a covariance matrix; see [8] for other unconstrained parametrisations. The issues now are the lack of an intuitive understanding of the covariance matrix's Cholesky decomposition, and that modelling all the elements of the decomposition would likely lead to an over-parametrised model.
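A minimal sketch of such a parametrisation, using the common log-Cholesky convention in which the factor's diagonal is exponentiated so that any real parameter vector yields a valid PD matrix:

```python
import numpy as np

def chol_to_cov(params, p):
    # First p entries: log of the Cholesky diagonal; rest: strict lower triangle
    L = np.zeros((p, p))
    L[np.diag_indices(p)] = np.exp(params[:p])   # positive pivots => PD result
    L[np.tril_indices(p, k=-1)] = params[p:]
    return L @ L.T

# Any unconstrained vector of length p + p*(p-1)/2 gives a PD matrix
params = np.random.default_rng(0).normal(size=3 + 3)
Sigma = chol_to_cov(params, p=3)
```

The trade-off noted in the text is visible here: the optimisation is unconstrained, but individual entries of `params` have no direct interpretation in terms of ranges or decay rates.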
In summary, parsimonious covariance modelling requires the use of an interpretable parametrisation which enables users to focus on modelling a few parameters as functions of the covariates, but fulfilling the PD constraint requires the adoption of an unconstrained, less interpretable, parametrisation. How to fulfil the positive definiteness constraint while retaining interpretability is an open research question; see for example [14] for an overview.