Abstract
In this paper, we analyse the computational advantages of the spherical parametrisation for correlation matrices in the context of Maximum Likelihood estimation via numerical optimisation. By exploiting the special structure of correlation matrices, it is possible to define a bijective transformation of an \(n \times n\) correlation matrix \(R\) into a vector of \(n(n-1)/2\) angles between 0 and \(\pi\). After discussing the algebraic aspects of the problem, we provide examples of the use of the proposed technique in popular econometric models: the multivariate DCC-GARCH model, widely used in applied finance for large-scale problems, and the multivariate probit model, for which the likelihood is typically computed by simulated Maximum Likelihood. Our analysis reveals the conditions under which the spherical parametrisation is advantageous: numerical optimisation algorithms are often more robust and efficient, especially when \(R\) is large and near-singular.
1 Introduction
Correlation matrices play a central role in a myriad of statistical models and in many cases their elements are functions of parameters to be estimated numerically. In all these cases, Maximum Likelihood (ML) estimation entails the use of numerical maximisation algorithms, which can be rather demanding, either because of the size of the problem or because of the intrinsic difficulty in computing the log-likelihood function: these algorithms, in fact, may run into difficulties when the admissible parameter space is a (possibly complicated) subset of \({\mathbb {R}}^m\), where m is the size of the parameter vector. As we will show, this is precisely the case with correlation matrices. Moreover, when ill-conditioned matrices are involved, the algorithm may easily get stuck in sub-optimal regions. For this reason, an efficient way to parametrise correlation matrices, one that maximises computational speed while preserving accuracy, is potentially very useful.
In this paper, we investigate the role of the spherical coordinates parametrisation (SCP from here on) as an efficient method to tackle this issue: starting from the original contribution by Hoffman et al. (1972), who show how an \(n \times 1\) unit-norm vector can be unambiguously expressed in terms of \(n - 1\) angles, several articles have applied this idea to the Cholesky factor of the correlation matrix. In particular, Rapisarda et al. (2007) and Rebonato & Jäckel (2011), building on previous work by Pinheiro & Bates (1996), proposed the SCP in risk management analysis; Pourahmadi & Wang (2015), instead, study the distributional properties of this transformation. Creal et al. (2011) successfully employ spherical coordinates in Generalised Autoregressive Score (GAS) models to study dynamic volatilities and correlations in time series with heavy-tailed distributions. More recently, Loaiza-Maya & Nibbering (2022a) introduce the spherical parametrisation for Bayesian multinomial probit models with a factor structure in the error covariance and trace restrictions, while Loaiza-Maya & Nibbering (2022b) extend the framework to multivariate multinomial probit models. As we will argue, the advantage of the SCP comes from its ability to turn a constrained problem into an unconstrained one.
However, the conditions under which the SCP offers a significant edge in computational statistics remain mostly unexplored: in this article, we investigate this issue both in terms of CPU time and of convergence quality using popular models in the econometric literature.
We propose one example from financial econometrics, where the correlation matrix for asset returns is an object of primary interest: in the multivariate GARCH literature, two widely used methodologies, i.e., the constant conditional correlation (CCC) model by Bollerslev (1990) and the dynamic conditional correlation (DCC) model by Engle (2002), require the estimation of a correlation matrix. Specifically, we consider the latter given its prominent application in modelling co-volatilities, as for example in Ghosh et al. (2021) and Ni & Xu (2023).
Another case we consider draws from microeconometrics, where correlation matrices arise quite naturally in modelling multiple choice problems with multivariate probit specifications (Ashford & Sowden, 1970; Amemiya, 1974; Chib & Greenberg, 1998).
The rest of the paper is organised as follows: Sect. 2 provides the mathematical background of the parametrisation. We then give practical examples in Sect. 3, starting with a simple exercise based on the Normal and Student’s t distributions (Sect. 3.1), followed by more realistic econometric applications, i.e., the DCC-GARCH model in Sect. 3.2 and the multivariate probit model in Sect. 3.3. Section 4 concludes.
2 Mathematical Background
2.1 Representation of Correlation Matrices via Spherical Coordinates
Given a set of n variables, a correlation matrix \(R\) can be generally described as a positive semidefinite symmetric matrix with ones along the main diagonal and its ij-th element (\(\rho _{i,j}\)) representing the Pearson correlation coefficient between the i-th and the j-th variable. As a consequence, it can be defined in terms of \(m = n (n-1)/2\) parameters.
The positive semi-definiteness requirement, however, implies that the admissible values for the m off-diagonal elements of R form a highly nonlinear subset of \({\mathbb {R}}^m\). Consider for example the case when \(n = 3\):
It can be proven that, for a given value of \(\rho _{1,2}\) between \(-1\) and 1, the set of values of \(\rho _{1,3}\) and \(\rho _{2,3}\) that ensure that R is positive semidefinite defines an ellipse in \({\mathbb {R}}^2\), that is a circle when \(\rho _{1,2} = 0\) and collapses to a straight line when \(\rho _{1,2} = \pm 1\).
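As a quick numerical check (a Python sketch of ours, not part of the original analysis), the admissible region can be explored through the closed-form determinant of the \(3 \times 3\) correlation matrix, \(\vert R\vert = 1 + 2\rho _{1,2}\rho _{1,3}\rho _{2,3} - \rho _{1,2}^2 - \rho _{1,3}^2 - \rho _{2,3}^2\):

```python
import numpy as np

def det_R3(r12, r13, r23):
    # closed-form determinant of the 3 x 3 correlation matrix
    return 1.0 + 2.0 * r12 * r13 * r23 - r12**2 - r13**2 - r23**2

# for rho_{1,2} = 0 the admissible (rho_{1,3}, rho_{2,3}) region is the unit disc
assert det_R3(0.0, 0.6, 0.6) > 0   # inside:  0.6^2 + 0.6^2 < 1
assert det_R3(0.0, 0.8, 0.8) < 0   # outside: 0.8^2 + 0.8^2 > 1
```

Points where the determinant is negative are inadmissible even though every pairwise correlation is individually within \((-1, 1)\).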
Evidently, numerical procedures such as numerical optimisation or grid search are bound to be quite inefficient and computationally costly in cases such as this. Therefore, in the context of estimation via numerical techniques, it is often preferable to re-express \(R\) as a function of an unconstrained vector \(\varvec{\theta }\in \varvec{\Theta }\), where \(\varvec{\Theta } = {\mathbb {R}}^m\) and each element of \(\varvec{\theta }\) is free to vary.
In the special case when \(n=2\), this is almost universally accomplished by using the hyperbolic tangent function, and \(\rho _{1,2}\) is modelled as \(\rho _{1,2} = \tanh {(\theta )}\). In the general case \(n>2\), however, this solution is not directly viable, since the mere fact that off-diagonal elements of \(R\) are less than 1 in modulus does not ensure, per se, positive semi-definiteness of \(R\).
Following Pelletier (2006), since a correlation matrix is positive semi-definite it can always be written as
where \(C\) is lower-triangular, with non-negative diagonal entries so as to ensure uniqueness. If \(R\) has full rank, then \(C\) will correspond to the Cholesky decomposition, but in the semi-definite case some of the columns of \(C\) may be zero vectors.
Let \(\textbf{c}_i'\) denote the i-th row of \(C\) which is here defined as
where \(\textbf{a}_i\) is a unit vector of dimension i with positive last element. Hoffman et al. (1972)’s intuition was to treat the elements of \(\textbf{a}_i\) as a point on an i-dimensional unit sphere, thus guaranteeing the existence of a bijective transformation from \(\textbf{a}_i\) to its \(i-1\) angles \(\varvec{\omega }_i\). Following Pinheiro & Bates (1996) and Rapisarda et al. (2007), we express this transformation as:
where \(a_{i,j}\) is the j-th component of \(\textbf{a}_i\) and \(\omega _{i,j} \in [0; \pi ]\). In matrix notation,
where the matrix C is characterised via \(n(n-1)/2\) angles. By virtue of the bijection, Eq. (3) is invertible, so \(\omega _{i,j}\) can be recovered from R via a recursive rule:
By stacking all the \(\omega _{i,j}\) into an m-element vector \(\varvec{\omega }\), one can therefore define the transformation
in which the correlation matrix is expressed as a function of the parameter vector \(\varvec{\omega }\). Note that the formulation above makes it extremely simple to compute the determinant of \(R\) via that of its Cholesky factor C (Footnote 1): since C is triangular and \(C_{1,1}\) is identically 1,
therefore, since \(\vert R\vert = \vert CC'\vert = \vert C \vert ^2\), one can compute the determinant of \(R\) as
In this way, the admissible parameter space is simply a hypercube in \({\mathbb {R}}^m\) with side length \(\pi\). Each point corresponds to a valid positive semi-definite matrix R; by virtue of Eq. (7), points on the edge of the hypercube map to singular R matrices. As we will show in the rest of the paper, the SCP is particularly advantageous in cases when \(R\) is near-singular, since in those cases numerical algorithms benefit from the much more regular shape of the parameter space. We call this representation the “\(\omega\)-variant” of the SCP.
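The mapping and its inverse are short enough to sketch in full. The Python fragment below (our illustration; the paper's computations use gretl) builds C from the angles as in Eq. (3), recovers the angles via the recursion of Eq. (4), and checks the determinant identity of Eq. (7):

```python
import numpy as np

def chol_from_angles(omega, n):
    """Row-by-row construction of the Cholesky factor C of an n x n
    correlation matrix from its n(n-1)/2 angles in (0, pi), as in Eq. (3)."""
    C = np.zeros((n, n))
    C[0, 0] = 1.0
    k = 0
    for i in range(1, n):
        s = 1.0                      # running product of sines
        for j in range(i):
            C[i, j] = np.cos(omega[k]) * s
            s *= np.sin(omega[k])
            k += 1
        C[i, i] = s                  # last element: product of sines only
    return C

def angles_from_chol(C):
    """Inverse mapping: recover the angles from C via the recursion of Eq. (4)."""
    omega = []
    for i in range(1, C.shape[0]):
        s = 1.0
        for j in range(i):
            omega.append(np.arccos(np.clip(C[i, j] / s, -1.0, 1.0)))
            s *= np.sin(omega[-1])
    return np.array(omega)

rng = np.random.default_rng(1)
omega = rng.uniform(0.1, np.pi - 0.1, size=6)   # n = 4, m = 6 angles
C = chol_from_angles(omega, 4)
R = C @ C.T
```

Any angle vector in the open hypercube \((0, \pi)^m\) yields a valid, nonsingular correlation matrix; no positive-definiteness check is ever needed during optimisation.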
In some cases, however, it may be helpful for numerical optimisation procedures to re-express \(\omega _{i,j}\) via a sigmoid transformation of an unconstrained parameter \(\theta _{i,j}\): a very appealing choice is given by
where \(\Lambda (x)\) is the logistic function defined as \(\exp (x)/(1 + \exp (x))\). By using Eq. (8), the parameter space gets mapped to the whole \({\mathbb {R}}^m\) space. We call this representation the “\(\theta\)-variant” of the SCP.
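The \(\theta\)-variant is a one-line mapping; a minimal sketch (ours, for illustration) together with its logit inverse:

```python
import numpy as np

def omega_from_theta(theta):
    """theta-variant: map unconstrained parameters to angles in (0, pi), Eq. (8)."""
    return np.pi / (1.0 + np.exp(-np.asarray(theta, dtype=float)))

def theta_from_omega(omega):
    """Inverse (logit) mapping back to the unconstrained space."""
    p = np.asarray(omega, dtype=float) / np.pi
    return np.log(p / (1.0 - p))
```

Since the logistic function is strictly increasing, the composition with the angle construction remains a bijection onto the set of nonsingular correlation matrices.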
At this point, the advantages of the SCP over the direct use of \(R\) should be evident. Moreover, the \(\theta\)-variant, which lets each parameter vary freely between \(-\infty\) and \(\infty\), should deliver better performance in numerical optimisation with ill-conditioned correlation matrices.
2.2 The Spherical Parametrisation in a Nutshell
In order to clarify the meaning of the SCP we will briefly go through the parametrisation steps for a \(2 \times 2\) nonsingular correlation matrix \(R\),
in which the only unknown parameter is \(-1< \rho _{1,2} < 1\). The \(\omega\)-mapping starts from the Cholesky factor of \(R\), here defined as,
whose rows, following Eq. (2), are further decomposed as:
Using Eq. (3), we turn \(\textbf{a}_2\) into:
where \(0< \omega _{2,1} < \pi\). Clearly, this establishes the desired bijection:
The \(\theta\)-parametrisation adds an extra step, that is the bijection between \(\omega\) and the unbounded parameter \(\theta\) as in Eq. (8). As a consequence,
Therefore, \(C\) is parametrised in terms of \(\theta\) as
and therefore
Extending the procedure to a \(3 \times 3\) correlation matrix is also rather simple:
where the Cholesky factor is now defined as
The second and the third rows can be written as:
So the Cholesky factor \(C\), as a function of the unconstrained parameter \(\theta\), becomes
This parametrisation ensures that \(\theta\) can be picked from any point in \({\mathbb {R}}^3\) and still gives rise to a proper Cholesky factor \(C\) for a nonsingular correlation matrix \(R\). Since the transformation is invertible, any nonsingular correlation matrix \(R\) corresponds to one and only one point in \({\mathbb {R}}^3\).
2.3 Derivatives
Closed-form derivatives enhance the computational speed and accuracy of optimisation algorithms, and in the spherical coordinate case they are easily obtained: the bijections defined via \(\omega\) and \(\theta\) are differentiable transformations between the new parameters and the correlation matrix \(R\).
The derivative of \(\textrm{vec}(R)\) with respect to \(\omega\) is
where I is an \(n \times n\) identity matrix and \(\otimes\) denotes the Kronecker product. Equation (10) can be rewritten more compactly via the commutation matrix \(K_n\) as (Footnote 2)
To compute \(\frac{\partial \textrm{vec}(C')}{\partial \omega }\), we use the decomposition into unit vectors of Eq. (2); it therefore suffices to derive \(\frac{\partial \textbf{a}_i}{\partial \omega }\). This can be obtained quite efficiently via recursion, noting that Eq. (3) corresponds to:
As a consequence,
where \(\textbf{e}'_j\) is a standard basis vector, that is the j-th row of the identity matrix. \(\frac{\partial q_{i,j}}{\partial \omega }\) is derived iteratively as,
Finally, to get the derivative with respect to \(\theta\), it is possible to invoke the chain rule and multiply \(\frac{\partial R}{\partial \omega }\) by \(\frac{\partial \omega }{\partial \theta }\). This last quantity follows from Eq. (8) as
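From Eq. (8), \(\frac{\partial \omega }{\partial \theta }\) is diagonal with entries \(\pi \Lambda (\theta )(1-\Lambda (\theta ))\); a quick finite-difference check of this chain-rule factor (our sketch):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def domega_dtheta(theta):
    """Diagonal entries of d(omega)/d(theta) for omega = pi * Lambda(theta)."""
    lam = logistic(theta)
    return np.pi * lam * (1.0 - lam)

# central finite difference at an arbitrary point
theta, h = 0.3, 1e-6
fd = (np.pi * logistic(theta + h) - np.pi * logistic(theta - h)) / (2 * h)
```

The same check applies component-wise when \(\theta\) is a vector, since the Jacobian is diagonal.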
3 Simulation Evidence
In this section, we present three examples of estimation via numerical ML where we compare the three parametrisations under analysis: the traditional one for \(R\), in which the likelihood is directly parametrised in terms of \(\textbf{r} = \textrm{vech}(R)\), and the \(\theta\)- and \(\omega\)-variants described in Sect. 2.1.
3.1 Normal Versus Student t Log-Likelihood
As a first example of the practical consequences of using the SCP, we analyse a very simple case, where the elements of the correlation matrix are the main estimated parameters.
Suppose we have a sample \(v_1, v_2, \ldots , v_T\) of independent and identically distributed (iid) observations from an n-variate standardised normal distribution: the total log-likelihood has the following form:
The expression above can be rewritten as
where \(\hat{R}\) is the sample correlation matrix and \(\textrm{tr}(\cdot )\) is the trace operator. Naturally, \(\hat{R}\) is a sufficient statistic, and maximising the expression above shows that the ML estimator is simply the sample correlation matrix \(\hat{R}\).
Suppose, however, that one wants to employ numerical methods for maximising (11): if the chosen starting point for \(R\) is the identity matrix, any gradient-based procedure will converge immediately to the maximum in one iteration, on account of the fact that the score in \(R=I\) is exactly proportional to \(\textrm{vec}\left( I - \hat{R}\right)\).
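This claim about the score can be verified numerically. The sketch below (Python, our illustration) differentiates the log-likelihood kernel \(\log \vert R\vert + \textrm{tr}(R^{-1}\hat{R})\) entry by entry at \(R=I\) and recovers \(I - \hat{R}\), to which the score is proportional:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((500, 4))
Rhat = np.corrcoef(X, rowvar=False)          # sample correlation matrix

def kernel(R):
    # log-likelihood kernel (up to constants and the factor -T/2)
    _, logdet = np.linalg.slogdet(R)
    return logdet + np.trace(np.linalg.solve(R, Rhat))

# finite-difference gradient at R = I, perturbing each entry independently
n, h = 4, 1e-6
G = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        Rp, Rm = np.eye(n), np.eye(n)
        Rp[i, j] += h
        Rm[i, j] -= h
        G[i, j] = (kernel(Rp) - kernel(Rm)) / (2 * h)
```

Since the quasi-Newton update is built from exactly this score, the first step from \(R=I\) lands on \(\hat{R}\).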
If the log-likelihood is written in terms of \(\varvec{\omega }\) using Eq. (7),
the above no longer holds, and the number of iterations needed to reach convergence depends on the data and is therefore unpredictable (Footnote 3). As a consequence, in this case the SCP should be inferior to the traditional parametrisation in terms of both CPU time and the number of iterations employed by the optimiser.
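An end-to-end version of this exercise fits in a few lines. The sketch below (ours, assuming NumPy and SciPy; the paper's estimations use gretl's BFGS) maximises the normal log-likelihood over the unconstrained \(\theta\) and recovers the sample correlation matrix:

```python
import numpy as np
from scipy.optimize import minimize

def corr_from_theta(theta, n):
    """theta -> angles -> Cholesky factor -> R (theta-variant of the SCP)."""
    omega = np.pi / (1.0 + np.exp(-theta))
    C = np.zeros((n, n))
    C[0, 0] = 1.0
    k = 0
    for i in range(1, n):
        s = 1.0
        for j in range(i):
            C[i, j] = np.cos(omega[k]) * s
            s *= np.sin(omega[k])
            k += 1
        C[i, i] = s
    return C @ C.T

rng = np.random.default_rng(7)
R_true = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]])
X = rng.standard_normal((500, 3)) @ np.linalg.cholesky(R_true).T
Rhat = np.corrcoef(X, rowvar=False)

def negloglik(theta):
    # normal log-likelihood kernel, sign-flipped for minimisation
    R = corr_from_theta(theta, 3)
    _, logdet = np.linalg.slogdet(R)
    return logdet + np.trace(np.linalg.solve(R, Rhat))

# theta = 0 corresponds to R = I, the starting point used in the text
res = minimize(negloglik, np.zeros(3), method="BFGS")
R_ml = corr_from_theta(res.x, 3)
```

Note that the optimiser never needs to know about positive definiteness: every candidate \(\theta\) it tries maps to a valid correlation matrix.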
Now consider the same setup, with the normal distribution replaced by a multivariate Student’s t with \(\nu\) degrees of freedom. The log-likelihood to optimise is
where \(\gamma (\cdot )\) is the log of Euler’s Gamma function. In this case, anticipating the behaviour of gradient-based methods is impossible and the SCP may benefit from the more regular parameter space induced by \(\omega\) or \(\theta\).
In order to shed light on the relative merit of the SCP compared to the traditional parametrisation, we simulate the \(v_t\) process for both distributions in several different scenarios: we set \(T=500\) observations and three different sizes for the vector \(v_t\), that is \(n = \{5, 10, 20 \}\). For the simulated t distribution, we use \(\nu = 5\) so as to make the density markedly different from a Gaussian one. As for the correlation matrix, we use
with \(\varphi = \{ 0, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99 \}\) to investigate cases closer and closer to singularity. For each design we simulate 100 Monte Carlo iterations, keeping track of the elapsed CPU time and the number of maximiser iterations used for convergence; the maximiser is BFGS with numerical derivatives (Footnote 4). In all cases, the starting point was \(R= I\) (Footnote 5).
Table 1 shows the average CPU time and number of BFGS iterations for the normal log-likelihood maximisation: as predicted, the traditional parametrisation achieves the best results. The BFGS iterations needed to converge amount to 2 in almost all cases, save the most complex ones, where the size of the parameter space and the near-singularity of \(R\) probably contaminated the precision of the numerical derivatives (Footnote 6). As for CPU time, both the \(\omega\)- and \(\theta\)-variants exhibit worse performance, although the penalty in terms of CPU time is much less severe than that in the number of iterations.
The results for the t distribution (Table 2) show a much different picture: the SCP uniformly dominates the traditional parametrisation both in terms of BFGS iterations and of actual CPU time. The relative advantage of the SCP seems to increase with \(\varphi\), thus suggesting that the main factor is the different shape of the parameter space (an ellipsoid for the traditional parametrisation and a hypercube for the SCP—see Sect. 2.1); the size of the problem also seems to affect results, although to a lesser degree.
3.2 Example II: The DCC Model
In financial econometrics the analysis of conditional covariance matrices is the key ingredient for studying asset volatility. In particular, the DCC specification as originally proposed by Engle (2002) aims primarily to model time-varying conditional correlations starting from standardised returns: let \(\textbf{x}_t = [x_{1,t}, x_{2,t},\ldots , x_{n,t}]\) be the vector of asset returns, with \(t=0, 1,\ldots , T\), and define the excess return as \(\textbf{u}_t = \textbf{x}_t - \textrm{E}(\textbf{x}_t \mid \mathcal {I}_{t-1})\), where \(\mathcal {I}_{t-1}\) denotes the information set at time \(t-1\). The covariance matrix of \(\textbf{u}_t\) is
where \(R_t\) is the \(n \times n\) dynamic correlation matrix and \(V_t = \textrm{diag}(h_{1,t}, h_{2,t},\ldots , h_{n,t} )\) is the diagonal matrix of conditional variances.
Usually, the \(h_{i,t}\) are modelled via univariate techniques. The dynamic correlation is further rewritten as \(R_t = \tilde{Q}_{t}^{-1/2}Q_t\tilde{Q}_{t}^{-1/2}\) where:
where \(\Gamma , A, B\) are \(n \times n\) parameter matrices, \(\eta _t\) is the standardised return vector with elements \(\eta _{it} = u_{i,t}/\sqrt{h_{i,t}}\) and \(\odot\) denotes the Hadamard (element-by-element) product (Footnote 7). The \(\Gamma\) parameter, in particular, corresponds to the unconditional correlation matrix \(\textrm{E}(\eta _t\eta _t')\) which is here reparametrised in terms of spherical coordinates.
In this section, we provide some evidence on the computational advantages of the SCP in a DCC-GARCH model when highly correlated series are used. We will compare the performance of both the \(\omega\)- and the \(\theta\)-variants [Eqs. (4) and (8), respectively] with respect to the traditional case.
The Data Generating Process (DGP) follows Eq. (15), where we opt for the so-called “scalar” specification of the parameter matrices A and B,
where \(a = 0.2\) and \(b = 0.7\). We do so for three reasons: first, the fact that \(a + b = 0.9\) implies that correlations are highly persistent, but the DGP is quite far from values that would imply nonstationarity; moreover, since we focus on the performance of the SCP for modelling \(\Gamma\), we want the DGP to contain as few parameters as possible, apart from \(\Gamma\) itself. Finally, the scalar specification is widely used in empirical practice.
The correlation matrix \(\Gamma\) we use is
We simulated \(n = 10\) asset returns over a time horizon of \(t = 1, \ldots , 1024\). The standardised returns \(\eta _t\) are here defined as
where \(C_t\) is the Cholesky factor of \(R_t\) and \(\eta _t^* = (\eta _{1,t}^*,\ldots , \eta _{10,t}^*)\) with \(\eta _{it}^* \sim t(6)\), where t(6) denotes a Student t distribution with 6 degrees of freedom. We choose this distribution so as to have a degree of “fat tails” in our simulated data similar to that typically found in daily financial time series.
The conditional variances are modelled as univariate GARCH(1, 1) as follows,
where we set \(\omega = 0.02\), \(\alpha = 0.03\), \(\beta = 0.95\). The return \(u_{it}\) is defined via \(\eta _{it}^*\) as \(u_{it} = \sqrt{h_{it}}\eta _{it}^*\). Again, these parameters are meant to resemble the persistence feature typically observed in real data.
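The DGP of this subsection can be sketched compactly; in the scalar case the correlation recursion reduces to \(Q_t = (1-a-b)\Gamma + a\,\eta _{t-1}\eta _{t-1}' + b\,Q_{t-1}\), which is what the following Python illustration of ours codes (a smaller \(n=3\) equicorrelated \(\Gamma\) is used here purely for brevity):

```python
import numpy as np

def simulate_scalar_dcc(Gamma, a, b, w, alpha, beta, T, rng):
    """Simulate returns from a scalar DCC-GARCH(1,1) DGP (sketch):
    Q_t = (1-a-b)*Gamma + a*eta_{t-1} eta_{t-1}' + b*Q_{t-1}."""
    n = Gamma.shape[0]
    Q = Gamma.copy()
    h = np.full(n, w / (1.0 - alpha - beta))   # start at unconditional variances
    u = np.empty((T, n))
    for t in range(T):
        d = 1.0 / np.sqrt(np.diag(Q))
        R = d[:, None] * Q * d[None, :]        # rescale Q to a correlation matrix
        z = rng.standard_t(df=6, size=n) / np.sqrt(1.5)  # unit-variance t(6) shocks
        eta = np.linalg.cholesky(R) @ z        # correlated standardised returns
        u[t] = np.sqrt(h) * eta
        Q = (1.0 - a - b) * Gamma + a * np.outer(eta, eta) + b * Q
        h = w + alpha * u[t] ** 2 + beta * h   # univariate GARCH(1,1) update
    return u

Gamma = np.full((3, 3), 0.5) + 0.5 * np.eye(3)   # equicorrelation 0.5 (illustrative)
u = simulate_scalar_dcc(Gamma, a=0.2, b=0.7, w=0.02,
                        alpha=0.03, beta=0.95, T=200,
                        rng=np.random.default_rng(3))
```

Since \(a + b < 1\), the recursion is a convex combination of positive (semi-)definite matrices, so \(Q_t\) stays positive definite along the whole simulated path.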
We generated 100 datasets from the DGP above (Footnote 8) and estimated a DCC model on each, using the three different parametrisations. Given that the likelihood for this model may often have multiple local maxima, we first checked that each technique yielded the same maximum. We then compared CPU time for the DCC model with the traditional parametrisation and for the \(\omega\)- and \(\theta\)-variants of the SCP. ML estimation is performed via the BFGS algorithm, using both numerical and analytical derivatives, as per Caporin et al. (2020). All three methods reach exactly the same log-likelihood maximum in both the numerical and analytical case. As for CPU times, Table 3 collects the main summary statistics (Footnote 9).
Table 3 shows that, with numerical derivatives, the SCP leads to an evident speed-up with respect to the traditional case: the average gain is about 30 s, corresponding to a 6–\(7\%\) relative change for the \(\theta\)- and \(\omega\)-variants with respect to the standard parametrisation, denoted respectively as \(\Delta\)(\(\theta\)) and \(\Delta\)(\(\omega\)). Moreover, standard deviations are much smaller for the SCP. This may not seem a great improvement in itself, but the advantages of the SCP emerge more clearly from the distributions of CPU times for the different scenarios in Figs. 1 and 2.
Both figures show the boxplots for elapsed CPU time for the three different alternatives (Fig. 1 with numerical and Fig. 2 with analytical derivatives): the distributions for both SCP versions are much less dispersed, with thinner right tails, especially in the \(\theta\) case. In other words, in the DCC case the SCP is not only faster in most cases, but also offers more predictable computing time. As for the relative performance of the two SCP variants, no clear winner emerges: the \(\omega\)-version uses slightly less CPU time on average, while the results for the \(\theta\)-variant are more concentrated and predictable. All these results are qualitatively identical across the numerical and analytical score scenarios.
A more formal way to assess the computational superiority of the SCP is via simple t-tests. We compare log CPU times, so that results can be read as relative speed differences. Results are reported in Table 4: both the \(\theta\)-variant and the \(\omega\)-variant show significant differences with respect to the traditional parametrisation. Interestingly, in the third and sixth rows we also report the difference between the two SCP variants: the \(\omega\)-variant appears to be superior, with a statistically significant difference.
We also experimented with other maximisation algorithms in common use in computational econometrics, the Newton–Raphson method and the limited-memory variant of BFGS, L-BFGS (Morales & Nocedal, 2011), by running a small Monte Carlo simulation with 10 iterations: the Newton method converged in all cases but was almost 6 times more costly in terms of CPU time with respect to the standard BFGS method. The differences between parametrisations, instead, were rather similar to the benchmark case: the \(\omega\)-variant was \(11\%\) faster than the traditional one in the numerical derivative case, while the \(\theta\)-variant was \(8\%\) faster. The L-BFGS, in this scenario, exhibited convergence issues. We report the result for the Newton case in Table 5.
3.3 Example III: The Multivariate Probit Model
The multivariate probit model (Ashford & Sowden, 1970) arises as a natural extension of the probit model to the case of dependent binary outcomes. More formally, let \(\textbf{y}_i = [y_{i,1}, \ldots , y_{i,l}]\) denote a set of l binary variables for the i-th individual, with \(i=1, \ldots , N\).
The multivariate probit model can be written as a generalisation of the ordinary probit model as
where \(\mathbb {1}(\cdot )\) is the indicator function, \(\textbf{z}_{i}\) is a vector of k covariates (that we assume common to all binary responses for the sake of simplicity and without loss of generality) and \(\varepsilon _{i,j}\) is the disturbance term. As in the univariate probit model, the scale of \(\varepsilon _{i,j}\) is unidentified, so to achieve identification the covariance matrix of \(\varepsilon\) must be normalised to a correlation matrix R, as illustrated by Chib & Greenberg (1998). Estimation of the unknown \(\beta\) and \(R\) parameters is performed via ML: the computation of the multivariate normal probability is generally handled via simulation methods such as the GHK algorithm (Geweke, 1989; Hajivassiliou & McFadden, 1998; Keane, 1994). The computational burden, however, can be far from negligible, especially when the correlation matrix is near-singular. For this reason, we propose to use spherical coordinates in the estimation of \(R\).
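To fix ideas, here is a minimal GHK simulator for the probability of a negative orthant (a Python sketch of ours using only NumPy and the standard library; production implementations add antithetic, Halton or other quasi-random draws and an analytic score):

```python
import numpy as np
from statistics import NormalDist

def ghk(b, R, draws, rng):
    """Minimal GHK estimate of P(eps_1 < b_1, ..., eps_n < b_n), eps ~ N(0, R)."""
    nd = NormalDist()
    L = np.linalg.cholesky(R)
    n = len(b)
    total = 0.0
    for _ in range(draws):
        w, e = 1.0, np.zeros(n)
        for i in range(n):
            # conditional upper truncation point for the i-th recursive draw
            ub = nd.cdf((b[i] - L[i, :i] @ e[:i]) / L[i, i])
            w *= ub                                # sequential probability weight
            u = rng.uniform(0.0, 1.0)
            e[i] = nd.inv_cdf(max(u * ub, 1e-12))  # draw from the truncated normal
        total += w
    return total / draws
```

When \(R = I\) each weight is the exact product of univariate probabilities, so the simulator is exact regardless of the draws; the simulation error only enters through the correlated terms.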
In order to assess the numerical properties of the various parametrisations, we set up a simulation experiment as follows: we consider a medium-sized multivariate probit model, with \(l=5\) binary outcomes over \(N = 1024\) observations. The scenarios we consider for the correlation matrix \(R\) are of the form
where we set \(\rho\) over the following grid: \(\rho = \left\{ 0.5, 0.75, 0.9, 0.95 \right\}\) so as to consider progressively more ill-conditioned cases.
For the covariates, we chose a simulation design with a twofold purpose: on the one hand, we wanted to mimic a real-life, albeit simple, problem; on the other, we wanted the covariates to give the log-likelihood enough curvature to overcome possible identification problems. For these reasons, we define \(\textbf{z}_{i}\) as a four-variate vector, whose first element is an intercept and the others random entries from a standard Gaussian distribution. The coefficients \(\beta _j\) are random draws from a Uniform \(U(-1, 1)\) for all variables except the constant term (Footnote 10), which takes values over the grid \(\beta _{1,j} = \{-0.5; -0.25; 0; 0.25; 0.5 \}\) with \(j = 1, \ldots , 5\). Finally, for each correlation matrix \(R\) we simulate the data generating process 100 times, allowing the random uniform coefficients \(\beta\) to vary between iterations, so as to avoid results that depend on a particular \(\beta\) realisation.
As for the estimation part, ML is performed via simulated Maximum Likelihood using the GHK algorithm. Again, to explore the sensitivity of the results to different choices, we use four different specifications for the pseudo-random draws used in the GHK simulation, i.e., two uniform sequences (denoted by U) and two Halton sequences (denoted by H) of length, respectively, 100 and 500. The maximisation algorithm is the stock version of BFGS as provided by the gretl package. Initial values are calculated by running individual probit models for each dependent variable and using the estimated \(\beta\) vectors. The sample correlation matrix for the generalised residuals (see Gourieroux et al., 1987) is used as a starting point for \(R\).
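The Halton sequences mentioned above are easy to generate from the base-b radical inverse (a pure-Python sketch of ours; gretl provides them natively):

```python
def halton(i, base):
    """i-th element (i >= 1) of the Halton sequence in the given prime base,
    i.e. the base-b radical inverse of i."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

# first points of the base-2 sequence: 1/2, 1/4, 3/4, 1/8, ...
seq2 = [halton(i, 2) for i in range(1, 5)]
```

Using a distinct prime base per dimension yields the multidimensional low-discrepancy points typically fed to the GHK recursion in place of plain uniform draws.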
The first main result is reported in Table 6: for the traditional parametrisation and the \(\omega\)-variant the numerical optimiser often failed to converge when \(\rho\) is large and the U/H sequences for the GHK are short. Conversely, the \(\theta\)-variant seems more robust to the GHK initialisation conditions and to the possible ill-conditioning of \(R\), since it converges in most cases, apart from a few when \(\rho =0.95\): we conjecture that this may be explained by the unboundedness of the \(\theta\) space, which the optimisation procedure can exploit more fully the closer the true parameter \(\rho\) is to its extremum.
Tables 7 and 8 show summary statistics on the number of BFGS iterations needed for convergence, conditional on convergence having taken place at the same maximum (Footnote 11): the \(\omega\)- and especially the \(\theta\)-variant provide only marginally better averages (the means are comparable), but results that are much more regular and more robust to outliers than the traditional parametrisation, as shown by the standard deviations and the order statistics. This is very much in line with the experiments performed in the two previous subsections. Again, the advantage that the SCP offers appears to increase with the degree of correlation.
Considering the CPU timings, reported in Tables 9 and 10, we find very similar results: on average, the \(\omega\)- and the \(\theta\)-variants yield a slight advantage, but offer much more predictable performance, with thinner-tailed distributions. This happens uniformly across the four different GHK setups we chose for the uniform sequences.
Similarly to the DCC experiment, we considered the other optimisation algorithms (Newton–Raphson and L-BFGS) in a small Monte Carlo experiment with 10 replications. The Newton method showed convergence rates similar to the BFGS ones; on the contrary, the L-BFGS encountered several failures for the traditional parametrisation when high correlations were involved. In particular, the traditional parametrisation always failed to converge with \(\rho =0.90\) or \(\rho =0.95\). CPU time deteriorated dramatically for the Newton method, with the spherical parametrisations (especially the \(\theta\)-variant) faster than the traditional one. The L-BFGS method was also slower than the standard BFGS, especially for high values of \(\rho\): when it converged under the traditional parametrisation, it delivered better CPU performance than the SCP variants; we note, however, that the SCP granted much better convergence rates. For the sake of brevity, Table 11 reports the average CPU time in seconds for the \(\rho = \{0.50, 0.75, 0.90\}\) cases (owing to their overall null or small convergence failure rate), with \(U = 100\).
To summarise the whole experiment, the SCP in the form of the \(\theta\)-variant achieves the best results for high values of \(\rho\), especially in terms of ensuring that convergence takes place at all: in those cases where comparison is possible, it guarantees the best numerical performance.
4 Conclusion and Directions for Future Research
Efficient parametrisations can make numerical optimisation problems considerably easier. In the context of statistical and econometric models, the estimation of correlation matrices poses several issues, given the constrained nature of the correlation matrix itself.
In this paper, we analysed the usage of the spherical coordinates representation: following the previous contributions by Pinheiro & Bates (1996) and Rapisarda et al. (2007) we have applied the SCP to some widely used econometric models.
We provide three examples, ranging across different use cases: apart from a few exceptions, the SCP enhances the numerical optimisation of log-likelihood functions, both in terms of stability and of numerical performance, especially when the correlation matrix is badly conditioned. Both SCP variants proposed here improve on the traditional approach, but the relative gain depends on the context.
Clearly, the scope of the SCP could be extended to models not considered here. For example, one could consider estimating the DCC model with fat-tailed innovations. It would also be interesting to extend the results presented here for the multivariate probit model to the multinomial probit model, especially in the light of its potential for computational complexity (see e.g., Fares et al., 2018). That case, however, would be considerably more complex as the natural parametrisation for the multinomial probit involves a matrix in which diagonal elements may be different from 1 (see e.g., Train, 2009, Section 5.2).
In conclusion, the SCP appears to be a powerful tool for estimation procedures involving correlation matrices, especially in ill-conditioned cases: numerical performance is better, and estimation remains feasible in limiting cases that would otherwise be intractable.
5 Computational Details
All experiments have been run in the econometric software gretl. In particular, we have used the 2022c release, installed on a Linux machine.
Data Availability
All data used in the manuscript come from simulation exercises. The replication material is available upon request.
Change history
18 May 2024
A Correction to this paper has been published: https://doi.org/10.1007/s10614-024-10614-4
Notes
The determinant of the correlation matrix has been used in the context of collinearity diagnostics by Halkos & Tsilika (2018).
The commutation matrix \(K_n\) is an \(n^2 \times n^2\) matrix satisfying \(\textrm{vec}(C') = K_n\textrm{vec}(C)\), where C is an \(n \times n\) matrix. See for example Magnus & Neudecker (1999), Section 3.7.
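As a numerical sanity check of this definition (again in Python/NumPy, for illustration only), \(K_n\) can be built explicitly and the defining identity verified:

```python
import numpy as np

def commutation_matrix(n):
    """Build the n^2 x n^2 commutation matrix K_n with
    vec(C') = K_n vec(C) for any n x n matrix C."""
    K = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            # vec stacks columns: entry (i, j) of C sits at position j*n + i
            K[i * n + j, j * n + i] = 1.0
    return K

n = 3
C = np.arange(n * n, dtype=float).reshape(n, n)
K = commutation_matrix(n)
vec = lambda M: M.reshape(-1, order="F")   # column-major vectorisation
print(np.allclose(K @ vec(C), vec(C.T)))   # True
```

Note that \(K_n\) is a permutation matrix, so \(K_n' K_n = I_{n^2}\).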
The same applies to the \(\theta\)-variant.
BFGS is one of the most widely used numerical optimisation methods in econometrics. See Nocedal & Wright (2006), Section 6.1, for a formal description of the method and an analysis of its properties. We used the BFGS implementation provided in the free software package gretl using all the default settings.
In each simulation, both the traditional parametrisation and the \(\omega\)/\(\theta\)-variants reach exactly the same maximum.
In theory, one iteration should suffice. However, gretl reports two function evaluations: the initial one for computing the score and a second one for ascertaining convergence.
During the Monte Carlo exercise some of the univariate GARCH models could not be estimated because of numerical issues. We skipped these occurrences in order to ensure 100 valid replications.
Running the same simulations on different hardware, we obtained slightly different but qualitatively equivalent results. This is a consequence of the fact that OpenBLAS (Wang et al. 2013), the linear algebra library we used, is heavily optimised for different processor architectures, and results may be subject to slight changes depending on the hardware on which the simulation is performed.
For simplicity and convenience, the final values are rounded to one decimal place.
In some cases, the exact maximum point found in the three cases considered differed up to numerical noise. As a criterion for judging this aspect, we consider a maximum "qualitatively similar" across the three parametrisations when either all three alternatives reach exactly the same maximum or, in case of different results, the maximised log-likelihood values differ by less than 0.25.
References
Aielli, G. P. (2013). Dynamic conditional correlation: On properties and estimation. Journal of Business & Economic Statistics, 31(3), 282–299.
Amemiya, T. (1974). Bivariate probit analysis: Minimum chi-square methods. Journal of the American Statistical Association, 69(348), 940–944.
Ashford, J., & Sowden, R. (1970). Multi-variate probit analysis. Biometrics, 26, 535–546.
Bollerslev, T. (1990). Modelling the coherence in short-run nominal exchange rates: A multivariate generalized arch model. The Review of Economics and Statistics, 72(3), 498–505.
Caporin, M., Lucchetti, R., & Palomba, G. (2020). Analytical gradients of dynamic conditional correlation models. Journal of Risk and Financial Management, 13(3), 49.
Cappiello, L., Engle, R. F., & Sheppard, K. (2006). Asymmetric dynamics in the correlations of global equity and bond returns. Journal of Financial Econometrics, 4(4), 537–572.
Chib, S., & Greenberg, E. (1998). Analysis of multivariate probit models. Biometrika, 85(2), 347–361.
Creal, D., Koopman, S. J., & Lucas, A. (2011). A dynamic multivariate heavy-tailed model for time-varying volatilities and correlations. Journal of Business & Economic Statistics, 29(4), 552–563.
Engle, R. (2002). Dynamic conditional correlation: A simple class of multivariate generalized autoregressive conditional heteroskedasticity models. Journal of Business & Economic Statistics, 20(3), 339–350.
Fares, M., Raza, S., & Thomas, A. (2018). Is there complementarity between certified labels and brands? Evidence from small French cooperatives. Review of Industrial Organization, 53, 367–395.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6), 1317–1339.
Ghosh, I., Sanyal, M., & Jana, R. (2021). Co-movement and dynamic correlation of financial and energy markets: An integrated framework of nonlinear dynamics, wavelet analysis and DCC-GARCH. Computational Economics, 57, 503–527.
Gourieroux, C., Monfort, A., Renault, E., & Trognon, A. (1987). Generalized residuals. Journal of Econometrics, 34, 5–32.
Hajivassiliou, V. A., & McFadden, D. L. (1998). The method of simulated scores for the estimation of LDV models. Econometrica, 66(4), 863–896.
Halkos, G. E., & Tsilika, K. D. (2018). Programming correlation criteria with free CAS software. Computational Economics, 52, 299–311.
Hoffman, D. K., Raffenetti, R. C., & Ruedenberg, K. (1972). Generalization of Euler angles to n-dimensional orthogonal matrices. Journal of Mathematical Physics, 13(4), 528–533.
Keane, M. P. (1994). A computationally practical simulation estimator for panel data. Econometrica, 62, 95–116.
Loaiza-Maya, R., & Nibbering, D. (2022a). Fast variational inference for multinomial probit models. arXiv:2202.12495.
Loaiza-Maya, R., & Nibbering, D. (2022b). Scalable Bayesian estimation in the multinomial probit model. Journal of Business & Economic Statistics, 40(4), 1678–1690.
Magnus, J. R., & Neudecker, H. (1999). Matrix differential calculus with applications in statistics and econometrics (2nd ed.). Berlin: Wiley.
Morales, J. L., & Nocedal, J. (2011). Remark on algorithm 778: L-BFGS-B: Fortran routines for large-scale bound constrained optimization. ACM Transactions on Mathematical Software, 38(1), 1–4.
Ni, J., & Xu, Y. (2023). Forecasting the dynamic correlation of stock indices based on deep learning method. Computational Economics, 61, 1–21.
Nocedal, J., & Wright, S. J. (2006). Numerical optimization (2nd ed.). Berlin: Springer.
Pelletier, D. (2006). Regime switching for dynamic correlations. Journal of Econometrics, 131(1–2), 445–473.
Pinheiro, J. C., & Bates, D. M. (1996). Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6(3), 289–296.
Pourahmadi, M., & Wang, X. (2015). Distribution of random correlation matrices: Hyperspherical parameterization of the Cholesky factor. Statistics & Probability Letters, 106, 5–12.
Rapisarda, F., Brigo, D., & Mercurio, F. (2007). Parameterizing correlations: A geometric interpretation. IMA Journal of Management Mathematics, 18(1), 55–73.
Rebonato, R., & Jäckel, P. (2011). The most general methodology to create a valid correlation matrix for risk management and option pricing purposes. Available at SSRN 1969689.
Train, K. E. (2009). Discrete choice methods with simulation. Cambridge: Cambridge University Press.
Wang, Q., Zhang, X., Zhang, Y., & Yi, Q. (2013). AUGEM: Automatically generate high performance dense linear algebra kernels on x86 CPUs. In SC'13: Proceedings of the international conference on high performance computing, networking, storage and analysis (pp. 1–12).
Acknowledgements
We wish to thank Allin Cottrell, Marcin Błażejowski, Giulio Palomba and two anonymous referees for their useful suggestions and comments.
Funding
Open access funding provided by Università Politecnica delle Marche within the CRUI-CARE Agreement. The authors declare that no funds, grants or other support were received during the preparation of this manuscript.
Author information
Contributions
All authors contributed equally to this work.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lucchetti, R., Pedini, L. The Spherical Parametrisation for Correlation Matrices and its Computational Advantages. Comput Econ (2023). https://doi.org/10.1007/s10614-023-10467-3