Cholesky-based multivariate Gaussian regression

Distributional regression is extended to Gaussian response vectors of dimension greater than two by parameterizing the covariance matrix $\Sigma$ of the response distribution using the entries of its Cholesky decomposition. The more common variance-correlation parameterization limits such regressions to bivariate responses -- higher dimensions require complicated constraints among the correlations to ensure positive definite $\Sigma$ and a well-defined probability density function. In contrast, Cholesky-based parameterizations ensure positive definiteness for all distributional dimensions no matter what values the parameters take, enabling estimation and regularization as for other distributional regression models. In cases where components of the response vector are assumed to be conditionally independent beyond a certain lag $r$, model complexity can be further reduced by setting Cholesky parameters beyond this lag to zero a priori. Cholesky-based multivariate Gaussian regression is first illustrated and assessed on artificial data and subsequently applied to a real-world 10-dimensional weather forecasting problem. There the regression is used to obtain reliable joint probabilities of temperature across ten future times, leveraging temporal correlations over the prediction period to obtain more precise and meteorologically consistent probabilistic forecasts.


Introduction
Distributional regression models (Stasinopoulos et al., 2018) -also called generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos, 2005) -extend generalized additive models (GAM, Hastie and Tibshirani, 1990) to allow any parametric distribution for the response.Each parameter of the response distribution -not just the mean -can therefore be linked to an additive predictor.Many different univariate response distributions have been employed in such additive distributional regressions, ranging from zero-inflated and overdispersed count data (Klein et al., 2015b;Simon et al., 2019) to survival analysis (Köhler et al., 2017;Burke et al., 2019) or geoadditive hazards regression (Kneib and Fahrmeir, 2007).
Much fewer applications exist for multivariate response distributions.A notable exception is (Klein et al., 2015a) where a bivariate response for childhood undernutrition in India is modeled with a bivariate Gaussian distribution based on the two means, variances and the correlation, all with suitable link functions.However, an extension to higher dimensions is not straightforward because linking individual pairwise correlations would not assure that the corresponding prediction of the covariance matrix Σ is positive definite -which in turn is necessary for a well-defined probability density function.Moreover, the number of parameters for Σ increases quadratically with the dimension of the response, thus necessitating some form of regularization for the high model complexity.
We embed multivariate Gaussian regression into the general distributional regression or GAMLSS framework by parameterizing Σ through the entries of its basic or modified Cholesky decomposition (Pourahmadi, 1999), respectively.The resulting parameterizations are unconstrained, meaning that regardless of the values the additive predictors take, the corresponding covariance matrix Σ is guaranteed to be positive definite.This facilitates regularization through penalized maximum likelihood or Bayesian estimation of the regression coefficients because the additive predictors can be regularized separately.Furthermore, the Cholesky parameterizations allow model complexity to be restricted a priori in cases where the response variables are ordered (for example with respect to time or one dimension in space).Namely, a covariance with an r-order antedependence structure (AD-r, Gabriel, 1962;Zimmerman et al., 1998) can be adopted when a maximum lag may be assumed for the autocorrelations.
The remainder of this paper is structured as follows: A brief overview of methods for covariance matrix estimation (without dependence on regressors) in Sec. 2 motivates leveraging the basic and modified Cholesky decompositions of Σ for a distributional regression (see Sec. 3) with a multivariate Gaussian response (Sec.4).Multivariate Gaussian regression is first illustrated and assessed on artificial data in Sec. 5. Subsequently, in Sec.6 the model is applied to a ten-dimensional weather forecasting application and different parameterizations are compared.A discussion of strengths and limitations of the new Cholesky-based multivariate Gaussian regression framework is found in Sec. 7. Summarizing remarks conclude the paper in Sec. 8.

Parameterizations of the covariance matrix
For addressing the challenges in multivariate Gaussian regression described above in Sec. 1, the important first step is to adopt an unconstrained parameterization of the covariance matrix Σ.This not only facilitates estimation of the parameters with standard optimizers and without complicated constraints, it also enables different forms of regularizations or restrictions of the model complexity.Hence, we review different parameterizations of the covariance matrix proposed in the literature, especially with respect to their suitability in multivariate Gaussian regression.An overview is provided in Table 1.

Positive definiteness of the covariance matrix
The covariance Σ of a k-dimensional random variable y from a multivariate Gaussian distribution is a symmetric k ×k matrix, containing k •(k +1)/2 unique variances and covariances.However, these parameters cannot be chosen freely when defining Σ, but must satisfy z Σ z > 0 for all z = 0. (1) to ensure Σ is positive definite.Only in this case does the corresponding probability density function f (y | µ, Σ) exist: where µ = E(y) is the expectation of y.
To ensure positive definiteness, joint restrictions for the elements of Σ are necessary.The same is true for two other natural parameterizations, namely the precision matrix Σ −1 and the variance-correlation decomposition of Σ.Similarly, a parameterization using the spectral decomposition is interpretable with respect to the eigenstructure of Σ, but constraints enter through the orthogonality of the corresponding eigenvectors (Pourahmadi, 2013).When estimating a fixed covariance matrix from empirical observations, some techniques ensure positive definiteness -e.g., glasso (Friedman et al., 2008) and tapering (Furrer et al., 2006) -while others do not -e.g., hard thresholding (Bickel and Levina, 2008).
Ensuring a positive definite Σ becomes even more difficult in the context of a distributional regression -where parameters underlying the covariance matrix should be linked to regressor variables.Here, it is particularly beneficial to employ a parameterization which ensures positive definiteness without requiring joint constraints and then to combine this with link functions mapping the parameters to the real line.The simplest illustration for this is the case of a univariate Gaussian distribution (i.e., k = 1) with variance σ 2 .To assure positivity, a log link is typically used, mapping the set of positive real numbers to an unrestricted  1: Possible parameterizations of the covariance matrix in multivariate Gaussian distributions, along with properties that are crucial for linking the covariance structure to regressor variables in a distributional regression setup.The (modified) Cholesky decomposition is particularly appealing as its derivation is less burdensome than the matrix logarithm.
predictor (Stasinopoulos et al., 2018).Another notable case is the bivariate Gaussian distribution (i.e., k = 2) where the variance-correlation decomposition can be adopted with log links for the two variances and a suitable link for the correlation parameter restricted to the interval (−1, 1) (Klein et al., 2015a).It is also possible to extend the log-link approach to k > 2 dimensions by using the matrix logarithm, which maps positive definite symmetric matrices Σ to symmetric matrices A = log Σ with unconstrained entries (Pourahmadi, 2013).However, the disadvantages are that (i) the parameters in A have no natural interpretation and (ii) the matrix logarithm involves a Taylor series expansion that is rather burdensome to compute.

Cholesky-based parameterizations
A mathematically and computationally more appealing approach that also yields an unconstrained parameterization is based on the Cholesky decomposition of Σ.Any Σ can be uniquely decomposed as the product of a positive-diagonal lower triangular matrix L with its transpose L : Subsequently the precision matrix Σ −1 results from a product based on the inverse Cholesky factor L −1 .Both L and L −1 offer unconstrained parameterizations of Σ.Although neither the individual parameters in L nor those in L −1 are easily interpretable, the latter matrix as a whole has an elegant interpretation.If y ∼ N (µ, Σ), multiplication with L −1 can be used to uncorrelate y: L −1 (y − µ) ∼ N (0, I).
To obtain parameters that are not only unconstrained but individually interpretable, Pourahmadi (1999) suggests a modified Cholesky decomposition that diagonalizes L −1 : In setups where the k components of y have a natural order (e.g., longitudinal data), the entries of the matrices T and D are related to the autoregressive structure of y ∼ N (µ, Σ).The elements of the lower triangular matrix T are denoted −φ ij (i < j) -where φ ij are the coefficients of an autoregression on y -and the elements of the diagonal matrix D are denoted as ψ i (i = 1, . . ., k) -corresponding to the innovation variances: These intuitive interpretations of the parameters φ ij and ψ i facilitate regularization of the high model complexity, particularly when k is large.Suggestions from the literature include: using lasso penalties on φ ij (Levina et al., 2008); approximating the elements of T and D by low-order polynomials (Pourahmadi, 1999(Pourahmadi, , 2000;;Pan and Pan, 2017); or cutting off the autocorrelation coefficients at a maximum lag of r (Wu and Pourahmadi, 2003), i.e., setting φ ij = 0 for higher lags j − i > r.The latter approach thus yields a banded T matrix, corresponding to a so-called order-r antedependence (AD-r, Gabriel, 1962;Zimmerman et al., 1998).Note that since T and L −1 (Eq. 3) share the same pattern of zeros, AD-r covariances can be modeled using both the modified and basic Cholesky parameterizations, although the individual elements of L −1 are not directly interpretable as autocorrelation coefficients.
In summary, both Cholesky-based parameterizations are appealing candidates for a distributional multivariate Gaussian regression approach.They are relatively easy to compute, yield an unconstrained parameterization that still ensures positive definite covariances, can be regularized using frequentist or Bayesian techniques, and can additionally be restricted to an AD-r antedependence, if the k components are autocorrelated with plausible maximum lag of r.The modified Cholesky decomposition has the advantage that individual parameters are interpretable while the basic Cholesky is slightly easier to compute.

Distributional regression
In this section we briefly introduce the general distributional regression framework into which we embed Cholesky-based multivariate Gaussian regression in the next Sec.4. Specifically, the model specification will be a special case of the general setup from Sec. 3.1 so that the corresponding estimation techniques -both frequentist and Bayesian -from Sec.3.2 can be leveraged.The software that can be used to estimate the models is presented in Sec.3.3.

Model specification
The idea in distributional regression (e.g., Rigby and Stasinopoulos, 2005;Klein et al., 2015b;Umlauf et al., 2018) is to adopt some K-parametric distribution D for the response variable y, linking each of the distributional parameters θ k , k = 1, . . ., K, to separate flexible additive predictors η k typically using known monotonic and twice differentiable link functions h k (•), mapping the support of each parameter to the unrestricted real values of the predictors.The predictors combine additively effects of regressor variable(s) x jk , j = 1, . . ., J k , with where functions f jk (•) can be, e.g., linear terms, but also nonlinear effects, varying coefficients, random intercepts, or spatial effects.Rather than explicitly listing all common types of model terms here, we refer to the literature on GAM (Hastie and Tibshirani, 1990;Wood, 2017), GAMLSS (Rigby and Stasinopoulos, 2005), or Bayesian versions thereof (Umlauf et al., 2018).In this framework, although functions f jk (•) may be nonlinear, they can be represented by a linear combination of so-called basis functions and regression coefficients f jk (x jk ) = d jk l=1 β ljk B ljk (x jk ).For example, functions f jk (•) could be represented by P-splines (Eilers and Marx, 1996) or thin-plate regression splines (Wood, 2003).Hence, this representation of functions makes this model class very flexible and well suited for modeling complex relationships.

Model estimation
In a frequentist setting, distributional regression models are commonly estimated using Newton-Raphson type algorithms maximizing the (penalized) log-likelihood, where parameter updates are usually obtained by zig-zag iterations over distribution parameters θ k and model terms f jk (•) (see, e.g., Rigby and Stasinopoulos, 2005).Moreover, to avoid overfitting, nonlinear terms are estimated using penalization techniques as developed for GAMs (Wood, 2017), i.e., the wiggliness of each model term is controlled by separate smoothing parameters, which can be selected by techniques such as the Akaike information criterion (AIC).The resulting updating equations are known as penalized iteratively weighted least squares (IWLS, Gamerman, 1997).The great benefit of the generic IWLS representation using a basis function approach is that in most cases only first and second order derivatives of the log-likelihood with respect to the predictors are needed to implement a new distribution.This is taken advantage of in Sec. 4 for setting up the estimating equations for the new Cholesky-based multivariate Gaussian regression model.
In addition to this classical GAM-style penalized estimation, the problem of overfitting can also be addressed by boosting algorithms developed for distributional regression (Mayr et al., 2012) or by Lassotype penalization including factor fusion (Groll et al., 2019).
However, in the frequentist framework smoothing parameter optimization for complex distributional regression models can be problematic and computing valid inferential statistics is sometimes difficult or even impossible.The fully Bayesian approach using Markov chain Monte Carlo (MCMC) simulation techniques is particularly attractive in such cases.Here, the model parameters are considered as random rather than as fixed, meaning that the parameters β ljk in a Bayesian model each follow a prior distribution and the estimates are computed using the joint posterior distribution, which is proportional to the product of likelihood and prior.A common choice is to use multivariate normal priors for the regression coefficients and inverse Gamma (usually the default for spline based models) or half-Cauchy priors for smoothing variances (can be advantageous with random effects) that enforce regularization (inverse smoothing parameter in the frequentist approach).For details see, e.g., Umlauf et al. (2018).
For efficiency, MCMC algorithms usually draw parameters from the posterior in blocks from full conditional distributions, i.e., for each model term f jk (•).The full conditionals are available in closed form only in rare cases, however, a very efficient approximation can be constructed by a second order Taylor series expansion of the log-posterior centered at the last parameter state (Gamerman, 1997), which leads to an IWLS-based Metropolis-Hastings algorithm with an acceptance step.Thus, also for full Bayesian inference of distributional regression models only first and second order derivatives are needed.For a more detailed introduction to Bayesian estimation of distributional regression models see Umlauf and Kneib (2018).

Software implementation
A general flexible implementation of distributional regression with particular emphasis on Bayesian estimation is provided in the R package bamlss (Umlauf et al., 2021).The multivariate Gaussian regression models with basic and modified Cholesky parameterizations, as introduced in the next section, are implemented as families for bamlss.For now, these families are made available in a separate package mvnchol, available from the Gitlab server of Universität Innsbruck at https://git.uibk.ac.at/c4031039/mvnchol.In the future, we plan to integrate the families into bamlss.

Cholesky-based multivariate Gaussian regression
This section introduces the novel multivariate Gaussian regression approach we have developed by blending powerful results from the literature on Cholesky-based parameterizations with the framework of distributional regression, briefly reviewed in the previous Sec. 2 and 3, respectively.The multivariate Gaussian regression setup is introduced in Sec.4.1 and subsequently combined with either the basic (Sec.4.2) or the modified (Sec.4.3) Cholesky parameterization to guarantee a positive definite covariance matrix Σ.In order to leverage the typical strategies for estimation and regularization of distributional regression models, the log-likelihood of the multivariate Gaussian regression model is provided in Sec.4.4 along with the first and second derivatives with respect to the predictors.

Multivariate Gaussian regression
In multivariate Gaussian regression the response y is a length-k vector assumed to follow a k-dimensional with probability density function provided in Eq. 2.
All parameters of N -the k components of µ and the k • (k + 1)/2 parameters specifying Σ -may be linked to predictors.For the k means in µ this is straightforward as these parameters are unconstrained and may take any real value.Therefore, we simply link them to the corresponding additive flexible predictors using the identity function.
In contrast, as already argued in Sec. 2, it is not possible to simply link the k • (k + 1)/2 upper-triangular elements of Σ to respective additive predictors.This would not ensure that Σ is positive definite.Instead we propose to link either the elements of the basic or modified Cholesky parameterization of Σ to additive predictors.While parameterizations based on the Cholesky decomposition have been widely used to estimate fixed covariances from sparse observations (Pourahmadi, 1999(Pourahmadi, , 2013)), we exploit them here for estimating covariances that depend on further covariates.

Basic Cholesky parameterization
In the basic Cholesky parameterization, Σ is defined through the k diagonal elements λ ii > 0 and k • (k − 1)/2 off-diagonal elements λ ij of the inverse Cholesky factor: , where Σ = LL .
Restricting the diagonal elements to be positive ensures a unique decomposition, motivating the use of a log link on these parameters while the off-diagonal elements may take any real value so that an identity link can be used: , where i = 1, . . ., k − 1, and j = i + 1, . . ., k.
Modeling the elements of the inverse Cholesky factor L −1 is motivated by the following considerations: (i) Unlike for the parameterization based on L, no computationally intensive matrix inversions are required during model estimation.(ii) There is an autoregressive interpretation for parameter values equal to zero.Hence, in some situations, where it is not be necessary to model all k • (k − 1)/2 off-diagonal elements, some elements may be restricted to zero.Namely, when the components of y have a natural order (e.g., longitudinal data) and a maximum lag in the autocorrelations is reasonable, then an order-r antedependence (AD-r, Gabriel, 1962;Zimmerman et al., 1998) model can be employed.This sets all λ ij = 0 with j − i > r.
For large k and small r this yields a significant reduction in model complexity.

Modified Cholesky parameterization
Alternatively, the modified Cholesky decomposition of Pourahmadi (1999) diagonalizes the inverse Cholesky factor (L −1 ) from Eq. 11, yielding Σ −1 = T D −1 T .The new parameters are those contained in the diagonal matrix D and the upper unitriangular T .
The ψ i in D and the φ ij in T are called the innovation variances and generalized autoregressive parameters of y, respectively.They have meaningful interpretations when the components of y have a natural order.
Analogously to the basic Cholesky parameterization, a log link is used for the innovation variances to ensure positive definiteness while the generalized autoregressive parameters may take any real values: , where i = 1, . . ., k − 1, and Again, it is possible to reduce model complexity when an AD-r model can be assumed.Similar to the basic Cholesky parameterization, this sets all φ ij = 0 with j − i > r.

The log-likelihood and its derivatives
By rearranging the probability density function of the multivariate Gaussian distribution (Eq.2) we obtain the likelihood of distributional parameters for an observation vector y.For mathematical ease, we work with the log-transformed likelihood.
Likelihood-based model estimation maximizes the sum of the individual log-likelihoods (Eq.17) over all n observation vectors contained in the dataset.For computationally efficient estimation, be it frequentist or Bayesian, this requires derivatives of the log-likelihood with respect to the additive predictors.We derive analytical solutions (Appendix A and Appendix B) for the first and second partial derivatives of with respect to all η * .The first derivatives in the basic parameterization are found to be where ỹ = y − µ and ς ij = (Σ −1 ) ij .The corresponding second derivatives are These are always negative, which means likelihood-based estimation of the proposed regression model is a convex optimization problem.The same is true for the modified Cholesky parameterization (Appendix B).

Simulation study
To investigate the finite-sample empirical performance of the novel multivariate Gaussian regression proposed in Sec. 4, this section conducts a systematic simulation study with more supplementary results provided in Appendix C. Specifically, we consider a setup where all distributional parameters (µ, ψ, and φ from a modified Cholesky parameterization) of the response variable y depend on a covariate x, either in a linear or nonlinear way (Sec.5.1).Using a flexible regression model (Sec.5.2) with additive splinebased predictors (i.e., capable of capturing the true effects) it is investigated how quickly recovery of the true distributional parameters improves as the sample size increases (5.3).These results are supplemented in Appendix C by investigating model misspecifications and effects of increasing the dimension of the multivariate response variable.

Data generation
Data sets are constructed by simulating n values of x from a uniform distribution on the interval (−1, 1).Then for each value of x a 3-dimensional vector y = (y 1 , y 2 , y 3 ) is simulated from a trivariate Gaussian distribution whose parameters depend on x.For each type of parameter a mixture of constant, linear and quadratic dependencies is used.The exact equations are given below and visualized by solid black lines in Fig. 1 along with corresponding estimated dependencies for two simulated data sets of n = 500 and n = 5000, respectively.
While Fig. 1 emphasizes the dependency of the distributional parameters (means, innovation variances, and autoregressive parameters) on the covariate x, Fig. 2 brings out how the corresponding means, variances, and correlations (see Eq. 14) relate across the components of the response y.Three setups are shown, namely, when computing the parameters for x = −1, 0, and 1, respectively.The particular choices for the model specification in Eq. 20 are made so that the corresponding covariance matrix is of first-order antedependence (AD-1) type.Specifically, the first variance -that is always equal to the first innovation variance -is kept constant (independent of x) at σ 2 1 = ψ 1 = exp(−2) ≈ 0.14.Similarly, a constant φ 13 = 0 is used so that the first and third components of y are conditionally independent (i.e., AD-1).As shown in Fig. 2, this does not result in a zero correlation ρ 13 , but rather one determined by the remaining correlations, i.e., ρ 13 = ρ 12 • ρ 23 .

Regression model specification
Multivariate Gaussian regression models employing the modified Cholesky parameterization are used to estimate the distribution of y conditionally on x.The three means and six modified Cholesky parameters are all modeled by thin-plate splines s j (x) each composed of 10 basis functions: Bayesian MCMC estimation is employed via the IWLS-based Metropolis-Hastings algorithm (see Sec. 3).
Convergence can be checked using trace and autocorrelation plots for the regression coefficients β ljk from the spline basis functions (see Sec. 3).Fig. 3 shows these diagnostics for the same simulated data set with n = 5000 also employed in Fig. 1.Credible intervals for distributional parameters derived from the MCMC samples are used to test whether or not a predicted effect is significant.Overfitting of the complex model is avoided by choosing prior distributions for the smoothing variances that enforce regularization of the splines.

Results
Fig. 1 already conveys that the true distributional parameters from Eq. 20 are recovered well by the regression model from Eq. 21.The 95% credible intervals from the MCMC simulations almost always contain the true values and become much more narrow as the sample size is increased from n = 500 to 5000.Estimates of the mean parameters are generally more certain -i.e., have narrower credible intervals -than estimates of the covariance parameters.
However, the results in Fig. 1 are based on only a single draw for each of the considered sample sizes.To investigate the increasing predictive skill more thoroughly, we consider 100 replications for each n = 100, 500, 1000, 5000, 10000 and assess the root-mean-squared errors (RMSE) between the true distributional parameters and their corresponding estimates (see Fig. 4).The RMSE is obtained by averaging the errors at 10000 randomly sampled x from the interval (−1, 1).
For all distributional parameters the RMSE decreases with increasing sample size n.Also, RMSE increases with increasing complexity of the dependency on x.For a given parameter type (i.e., µ, ψ, or φ) constant parameters are associated with the lowest RMSE, followed by those with linear and quadratic dependencies.
In addition to these intuitive and reassuring results, Appendix C shows that virtually the same properties also hold for higher-dimensional responses.Moreover, it is shown there that the flexible spline specifications are not very costly in terms of predictive performance.More parsimonious linear specifications perform only slightly better even when the true effects are linear, but considerably worse when misspecified.
Finally, to show that the multivariate Gaussian regression model can not only deal with a single covariate, Sec.6 presents a 10-dimensional weather forecasting problem with 21 covariates.

Application to probabilistic weather forecasting
To illustrate multivariate Gaussian regression in practice, the following multivariate weather forecasting problem is considered: predicting the temperature for ten future time points, so-called lead times, simultaneously.For reliable and meteorologically consistent forecasts, it is crucial to not only accurately predict the marginal distribution of temperature for each individual time point, but also the joint distribution of temperature across all times of interest.This section provides information on numerical weather predictions (Sec.6.1), the data construction (Sec.6.2), multivariate Gaussian regression specification (Sec.6.3), estimated effects and covariance predictions (Sec.6.4), and evaluation of the model performance (Sec.6.5).More general discussions and comparisons follow in Sec. 7.

Background
Numerical weather prediction (NWP) models predict the future state of the atmosphere at multiple lead times by numerically integrating the governing physical differential equations.The numerical integration begins with a best guess of the current state of the atmosphere obtained from in-situ and remote observations around the world.It is typically performed on a discrete grid approximating the earth-atmosphere system that is several kilometers wide horizontally and several hundred meters thick vertically (Bauer et al., 2015).
To account for errors from the initialization and unresolved processes due to the discretization, typically an ensemble of NWP forecasts is generated using slightly different approximations (Bauer et al., 2015;Leutbecher and Palmer, 2008).In a final postprocessing step, statistical regression models are often used for linking actual weather observations to output from the NWP ensemble outputs in order to improve forecast accuracy and better calibrate the uncertainty of the predictions (Gneiting et al., 2007).
Distributional regression has become a popular method to postprocess NWP ensembles, but its state of the art is limited to univariate (following the seminal work of Gneiting et al., 2005) or bivariate responses (Pinson, 2012;Schuhen et al., 2012;Lang et al., 2019).However, some meteorological applications require higher-dimensional joint probability forecasts across several quantities, locations, or lead times (Feldmann et al., 2015;Worsnop et al., 2018;Schoenach et al., 2020).Typically, the prediction errors in such multivariate forecasting problems are correlated, but rather than estimating this correlation structure as part of the distributional regression model, it is usually reconstructed from the empirical NWP ensemble or from historical observations (e.g., Schefzik et al., 2013).
Here, we leverage the novel Cholesky-based multivariate Gaussian regression model to predict a full ten-dimensional temperature distribution based on covariates from an NWP ensemble.

Data
Two-meter temperature forecasts from the Global Ensemble Forecast System (GEFS, Hamill et al., 2013) for Innsbruck, Austria, are postprocessed simultaneously for ten lead times between 186 hours (+7.75 days) and 240 hours (+10 days).The 11 NWP ensemble members of the GEFS have spatial resolutions of approximately 70 km and temporal resolutions of 6 hours.The forecasts are initialized at 00 UTC of 1798 distinct dates over 5 years and bilinearly interpolated to the spatial coordinates of Innsbruck.
Following Gneiting et al. (2005) and Gebetsberger et al. ( 2019) the means mean i and log-transformed standard deviations logsd i of the GEFS ensemble members for each leadtime i are used as covariates for the statistical postprocessing model.Additionally, the day of the year yday of the GEFS run initialization is included to account for seasonal variations in the postprocessing.See Tab. 2 for an overview.

Variable Description mean i
Ensemble mean temperature forecast for lead time i logsd i Logarithm of ensemble standard deviation for lead time i yday Day of year (to capture seasonal variations) Each row of the resulting dataset is associated with a single initialization of the GEFS on one specific date.There 10 lead times of interest, so it contains 10 ensemble means and 10 log-transformed standard deviations of the ensemble temperatures.The 10-dimenstional response variable is composed of the corresponding observed temperatures obs i from the weather station at Innsbruck Airport.Since the lead times are spaced 6 hours apart and there is a model initialization for every day, some observations appear in the response vector for multiple initializations but as different components.This is not the case for predictors because NWP forecasts always change from one model initialization to the next.

Model specifications
The observed temperature obs at Innsbruck is modeled for ten sequential lead times -every 6 hours between 7.75 and 10 days in the future -by a 10-dimensional Gaussian distribution y = (obs +7.75d , . . ., obs +10d ) ∼ N (µ, Σ). ( Distributional parameters of N are linked to flexible additive predictors containing mean i , logsd i and yday.In all regressions, the ten mean parameters are modeled in the same way: These are linear models of the ensemble mean forecasts, but with seasonally varying intercepts and slopes estimated by nonlinear cyclical splines s 0,i and s 1,i , respectively.The regressions differ in how the covariance matrix is parameterized -based on its Cholesky or variance-correlation decomposition -and subsequently how flexibly it may be modeled (Tab.3).

Cholesky parameterizations with fully flexible Σ
Both the basic and modified Cholesky parameterizations permit all 55 covariance-specifying parameters to be linked to covariates.The modified Cholesky parameterization employs the following setup: (24) Again, g 0,i and g 1,i are nonlinear cyclical functions of the year day, but this time the linear models relate the log-transformed ensemble standard deviations with the innovation variances ψ i .Seasonal variations are permitted for the generalized autoregressive parameters φ ij and approximated by cyclical splines h ij .
The corresponding basic Cholesky parameterization employs the analogous setup, simply replacing the innovation variances ψ i with λ ii and the generalized autoregressive parameters φ ij with λ ij .6.3.2.Cholesky parameterizations with assumed structure for Σ Motivated by a seasonal autoregressive model, an antedependence model of order 5 (AD-5) may be adopted for the covariance structure.Namely, combining autocorrelations for lag 1 (previous lead time, hours ago) and lag 4 (previous day, 24 hours ago) in a multiplicative way would lead to autocorrelations up to lag 5 and these are captured here by the AD-5 specificiation.Thus, only the innovation variances and autoregressive parameters with lags at most 5 are modeled for the covariance as in Eq. 24 and higher lag parameters are fixed at zero.This model is referred to as modified Cholesky AD5 : Assuming a covariance of type AD-r has the advantage that the number of covariance parameters increases linearly with the dimension k rather than quadratically, as with unstructured covariances.
For the corresponding basic Cholesky AD5 parameterization, the ψ i are again replaced by λ ii and the φ ij with λ ij .

Reference methods using a variance-correlation parameterization
The Cholesky-based multivariate Gaussian regression models are compared to two reference methods which parameterize Σ through its standard deviations σ i and correlations ρ ij .In both reference models, standard deviations are linked to the same additive predictors as were the diagonal elements of the basic and modified Cholesky decompositions.Again a log-link is required to ensure the estimated parameters are positive: log The problem with a variance-correlation parameterization is that positive definiteness is not generally guaranteed when linking all correlations to predictors (Sec.2).It is possible though to estimate a model where the correlation matrix P is assumed to conditionally follow a first order autoregressive structure.
As a result, P is determined by one single parameter ρ instead of k • (k − 1)/2 correlations.Σ is positive definite if |ρ| < 1, which allows us to model seasonally varying P, but comes at the cost of very inflexible assumptions about the covariance across the k = 10 lead times.This model is referred to as AR1 and can be denoted by where r = ρ/ 1 − ρ 2 is the link function mapping the range of the parameter (−1, 1) to the unrestricted predictor as for the correlation in the bivariate Gaussian regression of Klein et al. (2015a).Finally, we compare the Cholesky-based parameterizations against another alternative model with constant correlation for each element i, j.In terms of the distributional regression model this means that every correlation parameter is modeled as an intercept only.
Thus, unlike all previous specifications considered above, the correlation structure remains fixed and does not change across the days of the year.

Estimated effects and predictions
To highlight the flexibility of the Cholesky-based regression models, a selection of the nonlinear effects estimated by the modified Cholesky model are presented in Fig. 5.The functions s 0,i and s 1,i in Eqs.23-24 can be thought of as seasonally varying intercepts and slopes in linear models relating ensemble and distributional means.Slopes s 1,i are significantly less than the value of 1 expected for a perfect NWP model (where the ensemble means/variances would directly correspond to the observed means/variances).This means that the GEFS forecast mean i contains limited additional information about the true temperature compared to that inherent in yday.Therefore intercepts s 0,i begin to approximate a temperature climatology, with summer maxima approximately 15 degrees higher than winter minima (Fig. 5).
For the innovation variances, g 0,i and g 1,i can again be thought of as seasonally varying intercepts and slopes in linear models.This time though, they relate the log-transformed standard deviations of the ensemble logsd i to the log-transformed innovation variances ψ i .Since the GEFS means did not contain much information about the distributional means, it comes as no surprise that logsd i are even less valuable predictors.The slopes g 1,i average close to zero throughout the year.Intercepts g 0,i have significant seasonal variations for nighttimes (blue) but not for daytime (red).
The effects h ij = φ ij can be directly interpreted as seasonal variations of the generalized autoregressive parameters and paint a complex picture.For some combinations of i and j the estimated seasonal variations are significant and for others they are not, with no simple dependency on lag j − i or index i.
Once all s , g and h have been estimated (see runtimes in Tab.4), predictions for the mean μ and covariance Σ can be computed from the NWP-derived variables yday, mean and logsd .Fig. 6 visualizes forecasts for two days in 2015: one in winter (top) and one in fall (bottom).In the left panels, simulated temperature vectors across all ten lead times are shown in gray along with the actual observations in black.The mean pattern is approximated reasonably well, albeit with relatively large variance due to the long lead   times.The estimated correlation matrices P for the two days are included in the right panels.Clearly the correlation is not constant throughout the year -as assumed in the constant correlation model -but differs substantially between winter and fall.For one, correlations are generally higher in winter, but the pattern of correlations is also much more complex in fall.In winter a first-order autoregressive process -as assumed in the AR1 model -might fit reasonably well.However, in the fall, this is not the case and instead there are large diurnal variations in correlations for a given lag.For example, forecast errors at 6 UTC in the morning (e.g., +8.25d, +9.25d) have little influence on the subsequent daytime predictions.This is not the case in wintertime, where correlations are less variable for a given lag.

Model performance
It is evident that Cholesky-based regressions allow Σ to be modeled flexibly based on the additive predictors.Another question is whether this increased flexibility improves the quality of the postprocessed joint probability forecasts.As the true distributions are unknown, the quality of the predicted distributions is evaluated using the Dawid-Sebastiani score (DSS, Dawid and Sebastiani, 1999;Gneiting and Raftery, 2007).The DSS is a popular multivariate score in postprocessing and linearly related to the log-likelihood of the predicted distributional parameters for a given observation vector.Scores are evaluated out of sample using five-fold cross-validation.
Scores for each method are aggregated by year and month and differences calculated relative to the reference constant correlation model (Fig. 7).All Cholesky models perform better than the constant correlation model (vertical line at zero) and much better than the AR1 model.The models employing the basic parameterization (basic Cholesky and basic Cholesky AD5 ) are better than the constant correlation model in 75% of months.The modified Cholesky models are comparable to the corresponding basic Cholesky models, only very slightly worse.

Discussion
The results from the Cholesky-based multivariate Gaussian regression are discussed further here, in particular regarding the suitability of the novel method for postprocessing multivariate NWP forecasts (Sec.7.1), its sensitivity to ordering of the response vector (Sec.7.2), and practical limitations to its application along with potential remedies (Sec.7.3).

Perspectives for multivariate NWP postprocessing
In state of the art NWP postprocessing joint probability forecasts typically do not take the form of joint probability density functions, but are rather ensembles obtained through ensemble copula coupling (ECC, Schefzik et al., 2013).In ECC the margins of an NWP ensemble are calibrated through univariate postprocessing while retaining the ensemble's order statistics.
In our application, ECC performed much worse than all other models according to the DSS (Fig. 7, ECC median difference of 6 compared to the constant correlation model and not shown).According to another popular multivariate score -the variogram score -ECC again performed worse than all other postprocessing methods (Fig. 8, ECC not shown).Whereas Cholesky-based regression models outperformed all reference methods according to the DSS, the variogram scores show no significant improvement compared to the constant correlation model.This may be due to the variogram score being much more sensitive to the mean and variance than to potential misspecifications of the correlations (Lang et al., 2019).
The poor performance of ECC is likely due to the overall poor predictive skill for GEFS forecasts more than a week in advance.ECC may perform more favorably at shorter lead times, but even here there is a limit to how well tens of ensemble members can possibly capture multivariate dependencies with dimensions of the same order.Additionally it is quite a strong assumption that the ensemble order statistics should reflect error dependencies across the postprocessed univariate forecasts.Cholesky-based multivariate Gaussian regression do not rely on these assumptions and can also be applied when only a direct forecast (and no ensemble) is available.

Sensitivity to ordering of the response
A known limitation of the modified Cholesky decomposition for fixed covariance estimation is that an ordering of the response components needs to be available or be assumed (Pourahmadi, 2013).Many regularization techniques impose an assumed structure on the parameters which would be changed by rearranging the components.Somewhat surprisingly, we find that for our application the unstructured Cholesky models are quite insensitive to random permutations of the variables (Fig. 9).One probable explanation for this is that the individual regression equations for all distributional parameters are regularized separately, while in the fixed covariance estimation of Pourahmadi (2013) the ordering is explicitly exploited for imposing restrictions on the parameters.

Future work
Model complexity is still manageable for our 10-dimensional application, but even here there are 65 distributional parameters and more than 500 model parameters to estimate from data with a sample size of n = 1798 with runtimes on the order of an hour or two (Tab.4).A fully flexible parameterization becomes computationally demanding long before k = 100, where 5150 distributional parameters would need to be modeled.For very large k it is also not sufficient to reduce complexity just by assuming Σ is AD-r.
When there is a natural order to the variables, very parsimonious covariance parameterizations can be obtained by enforcing structure among the Cholesky parameters.Pourahmadi (1999) for example assumes polynomial dependencies among the innovation variances and autoregressive parameters.Σ is then subsequently defined through the coefficients of these polynomials.The polynomial coefficents could be modeled on predictors in place of the Cholesky parameters.Alternatively, smooth nonlinear functions may be used to approximate the parameter structure.Reparameterizations of this sort would extend the applicability of multivariate Gaussian regression to much higher dimensions.

Conclusions
We have developed regression models for a multivariate Gaussian response, where all distributional parameters may be linked to flexible additive predictors.Modeling the mean components of the multivariate dependent variable in such cases is no different from the univariate case, but it becomes difficult to ensure the covariance matrix is positive definite for dimensions greater than two.Common parameterizations such as variances and a correlation matrix require joint constraints among parameters to guarantee this property.Such constraints are difficult to ensure in the context of a regression.
These challenges are addressed by adopting a parameterization of Σ based on its basic and modified Cholesky decomposition, respectively.These parameterizations are unconstrained, ensuring positive definite Σ for any predictors.Subsequently all parameters of the distribution -the means and those specifying the covariance -may be flexibly modeled.
The ability to model k • (k + 3)/2 distributional parameters comes at the cost of high complexity.Regression models can be regularized through penalized likelihood maximization (frequentist approach) or by choosing appropriate prior distributions (Bayesian).Furthermore, when the dependent response variable has a natural order, the degrees of freedom of the covariance matrix may be restricted by assuming a maximum lag for dependencies among the response components.The triangular matrices in the basic and modified Cholesky parameterizations of such a covariance are banded.Subsequently, a large class of parsimonious covariance matrices may be modeled through a priori restrictions on the parameter spacesetting parameters to zero a priori.This limits model complexity by decreasing the number of distributional parameters that are linked to predictors.

Appendix A. Basic Cholesky parameterization
The log-likelihood of the distributional parameters for an observation vector y is given by In terms of the individual matrix entries Eq.A.1 can be expressed as where z is the vector The mean parameters and off-diagonal Cholesky entries only influence the log-likelihood through this third term containing z. Partial derivatives with respect to the means are given by .4)where ς ij refers to the corresponding element of Σ −1 = (L −1 ) L −1 .For the off-diagonal Cholesky entries, Derivatives with respect to the diagonal entries of L −1 also involve the second likelihood term and are given by The log-link on λ ii means ∂λ ii /∂η λ,ii = λ ii , and so Second derivatives for parameters with identity link are found to be The log-link on diagonal entries results in (A.10)

Appendix B. Modified Cholesky parameterization
The modified Cholesky parameters are related to the basic parameters by The log-likelihood in Eq.A.2 can be rewritten with respect to the new parameters: For notational simplicity we define φ ii = −1.
The partial derivatives of the log-likelihood with respect to µ, λ ij and λ ii (Eqs.A.4, A.5, A.6) can be related to derivatives with respect to the modified Cholesky parameters using Eq.B.1: Substituting the partial derivatives of with respect to λ ii and λ ij in the basic Cholesky parameterization, one obtains Since ψ i are estimated using a log-link (log(ψ i ) = η ψ,i ), derivatives with respect to predictors become The remaining parameters use an identity link so ∂ /∂η φ,ij = ∂ /∂φ ij and ∂ /∂η µ,i = ∂ /∂µ i .Continuing with the second derivatives, where ς ii is the i-th diagonal entry of Σ −1 .Multivariate Gaussian regression is performed using (i) splines for all distributional parameters as in Sec. 5 and (ii) linear models for all distributional parameters.When α = 0, all of the true dependencies are constant or linear, which means linear models for the distributional parameters are correctly specified.When α is increased, though, the linear models for the parameters with quadratic dependencies on x (i.e., µ 3 , log ψ 3 , φ 12 ) are misspecified.The dependencies used in the simulation study of Sec. 5 correspond to α = 1.
For the case of true linear dependencies (i.e., α = 0), linear predictors for the distributional parameters perform slightly better than using splines (Fig. C.10).However, for larger α linear predictors perform much worse.The RMSE of the misspecified mean parameter µ 3 triples just by increasing α to 0.1.The increase is more gradual for the misspecified innovation variance ψ 3 and even more so for the autoregressive parameter φ 12 .For small α only these terms perform poorly in the linear specification, but for larger α other terms deteriorate as well.Splines in comparison are much more robust to nonlinearity in the distributional parameter dependencies.

Appendix C.2. Distributional dimension
For a trivariate response, multivariate Gaussian regression is already quite complex.It involves modeling nine distributional parameters by one or more predictor variables.This complexity increases quadratically with the dimension.To investigate how an increase in model complexity influences the predictive skill of multivariate Gaussian regression models, the trivariate simulation of Sec. 5 is extended to higher dimensions k = 5, 10 and 15, where there are 20, 65 and 135 distributional parameters, respectively.
The constant, linear, and quadratic dependencies for the means in Eq. 20 are repeated so that µ i+3 = µ i (i.e., µ 11 = µ 7 = µ 4 = µ 1 ).Similarly, for the log-transformed innovation variances log ψ i+3 = log ψ i .The autoregressive parameters are slightly different since they have two indices i and j.In the simulations of Sec. 5, only the lag-1 autoregressive parameters φ 12 and φ 23 depend on x; the higher-lag parameter φ 13 is set to 0. This pattern is retained for the higher dimensional simulations so that φ ij = 0 for all j − i > 1.The effects for the lag-1 parameters φ 12 and φ 23 are repeated, which results in φ (i+2)(i+3) = φ i(i+1) .The means, variances and correlation matrices of the higher dimensional distributions are visualized for x = −1, 0 and -1 in The predictive skill of multivariate Gaussian regression models for individual distributional parameters does not suffer when the dimension is increased (Fig. C.12.The RMSEs for the nine parameters of the trivariate distribution (µ 1 , µ 2 , µ 3 , ψ 1 , ψ 2 , ψ 3 , φ 12 , φ 23 , φ 13 ) are nearly identical for k = 3, 5, 10 and 15.The same is true for higher dimensional means and innovation variances (grey lines in Fig. C.12).For the generalized autoregressive parameters, the average RMSE is even lower at higher k because a larger fraction of the true paramters are constants (φ ij = 0 for j − i > 1).

Figure 1 :
Figure 1: Effects for the three means (left) and modified Cholesky parameters (center and right) estimated from datasets of size n = 500 (top) and 5000 (bottom).Credible intervals obtained through MCMC sampling (Fig. 3) are indicated by color shading.The true dependencies (Eq.20) are depicted by solid black lines.

Figure 4 :
Figure4: Root-mean-square error (RMSE) of estimates for mean components (left), innovation variances (center), and generalized autoregressive parameters (right) for increasing sample sizes n.True dependencies of the parameters on x (i.e., constant, linear or quadratic) are included in parenthesis.

Figure 5 :
Figure 5: Selected nonlinear effects estimated by the modified Cholesky model.Shaded regions indicate 95% credible intervals obtained from MCMC sampling.Upper row: Seasonally-varying intercepts for the means (left) and log-variances (center) and autoregressive parameters for lag 1 (6h, right).Lower row: Seasonally-varying slopes for the means (left) and log-variances (center) and autoregressive parameters for lag 4 (24h, right).Red colors indicate forecasts valid for daytime, blue ones for nighttime.

Figure 6 :
Figure 6: Predictions from the modified Cholesky model for µ and Σ given values for yday, mean, and logsd for two specific days: 2015-01-03 (in winter, top) and 2015-10-10 (in fall, bottom).Left: Vectors containing forecast scenarios for the ten lead times are depicted by thin grey lines.These are simulated from the predicted 10-dimensional Gaussian distributions.The means of these distributions are shown as dashed black lines.The true observations are thick black circles connected by lines.Right: Heat maps depicting the corresponding estimated correlation matrices P.

Figure 7 :
Figure 7: Differences in Dawid-Sebastiani Score (DSS) to the reference constant correlation model, aggregated by year and month.Positive values mean the given model outperforms the reference.The half circle at the left figure edge indicates that not the entire boxplot for AR1 is shown (minimum value of −4.6).

Figure 8 :Figure 9 :
Figure 8: Variogram skill score (%) relative to the reference constant correlation model, based on scores aggregated by year and month.Positive values mean the given model outperforms the reference.

Figure
Figure C.10: The RMSE of distributional parameter estimates obtained using multivariate Gaussian regression with splines (circles, solid lines) or linear models (squares, dashed lines) as a function of the degree of true nonlinearity α (Eq.C.1).

Figure
Figure C.11: As in Fig. 2, but extended to include means, variances and correlation matrices for the simulating distributions at higher dimensions k = 5, 10 and 15.The first three means and variances are colored to match Fig. 2.

Figure C. 12 :
Figure C.12: RMSE of estimates for mean components (left column), innovation variances (center) and generalized autoregressive parameters (right) for different dimensions k.Parameters of the trivariate distribution are colored and their true dependency on x included in parenthesis (i.e., constant, linear or quadratic).Higher dimensional parameters are indicated by grey.

Table 2 :
Covariates used as predictor variables to model multivariate Gaussian parameters for postprocessing application.The placeholder i stands for one of ten lead times (+7.75d,+8d, . . ., +10d).

Table 3 :
A 10-dimensional error covariance matrix Σ has 55 degrees of freedom.The regression models compared employ different parameterizations of Σ. Depending on the parameterization, parameters may either be modeled on predictors, estimated as intercepts or are restricted to zero a priori.

Table 4 :
Runtime (in minutes) of 10-dimensional Gaussian regression models for temperature forecasting.