1 Introduction

Correlation matrices play a central role in a myriad of statistical models, and in many cases their elements are functions of parameters to be estimated numerically. In all these cases, Maximum Likelihood (ML) estimation entails the use of numerical maximisation algorithms, which can be rather demanding, either because of the size of the problem or because of the intrinsic difficulty of computing the log-likelihood function: these algorithms may run into trouble when the admissible parameter space is a (possibly complicated) subset of \({\mathbb {R}}^m\), where m is the size of the parameter vector. As we will show, this is precisely the case with correlation matrices. Moreover, when ill-conditioned matrices are involved, the algorithm may easily get stuck in sub-optimal regions. For this reason, a parametrisation of correlation matrices that maximises computational speed while preserving accuracy is potentially very useful.

In this paper, we investigate the role of the spherical coordinates parametrisation (SCP from here on) as an efficient method to tackle this issue. Starting from the original contribution by Hoffman et al. (1972), who show how an \(n \times 1\) unit-norm vector can be unambiguously expressed in terms of \(n - 1\) angles, several articles have applied this idea to the Cholesky factor of the correlation matrix. In particular, Rapisarda et al. (2007) and Rebonato & Jäckel (2011), building on previous work by Pinheiro & Bates (1996), propose the SCP in risk management analysis; Pourahmadi & Wang (2015), instead, study the distributional properties of such a transformation. Creal et al. (2011) successfully employ spherical coordinates in Generalised Autoregressive Score (GAS) models to study dynamic volatilities and correlations in time series with heavy-tailed distributions. More recently, Loaiza-Maya & Nibbering (2022a) introduce the spherical parametrisation for Bayesian multinomial probit models with a factor structure in the error covariance and trace restrictions, while Loaiza-Maya & Nibbering (2022b) extend the framework to multivariate multinomial probit models. As we will argue, the advantage of the SCP comes from its ability to turn a constrained problem into an unconstrained one.

However, the conditions under which the SCP offers a significant edge in computational statistics remain mostly unexplored: in this article, we investigate this issue in terms of both CPU time and convergence quality, using popular models from the econometric literature.

We propose one example from financial econometrics, where the correlation matrix of asset returns is an object of primary interest: in the multivariate GARCH literature, two widely used methodologies, namely the constant conditional correlation (CCC) model by Bollerslev (1990) and the dynamic conditional correlation (DCC) model by Engle (2002), require the estimation of a correlation matrix. Specifically, we consider the latter, given its prominent application in modelling co-volatilities, as for example in Ghosh et al. (2021) and Ni & Xu (2023).

Another case we consider draws from microeconometrics, where correlation matrices arise quite naturally in modelling multiple choice problems with multivariate probit specifications (Ashford & Sowden, 1970; Amemiya, 1974; Chib & Greenberg, 1998).

The rest of the paper is organised as follows: Sect. 2 provides the mathematical background of the parametrisation. We then give practical examples in Sect. 3, starting with a simple exercise based on the Normal and Student’s t distributions (Sect. 3.1), followed by more realistic econometric applications, i.e., the DCC-GARCH model in Sect. 3.2 and the multivariate probit model in Sect. 3.3. Section 4 concludes.

2 Mathematical Background

2.1 Representation of Correlation Matrices via Spherical Coordinates

Given a set of n variables, a correlation matrix \(R\) can be generally described as a positive semi-definite symmetric matrix with ones along the main diagonal, whose ij-th element (\(\rho _{i,j}\)) represents the Pearson correlation coefficient between the i-th and the j-th variable. As a consequence, it can be defined in terms of \(m = n (n-1)/2\) free parameters.

The positive semi-definiteness requirement, however, implies that the admissible values for the m off-diagonal elements of R form a highly nonlinear subset of \({\mathbb {R}}^m\). Consider for example the case when \(n = 3\):

$$\begin{aligned} R= \begin{bmatrix} 1 &{}\quad \rho _{1,2} &{}\quad \rho _{1,3} \\ \rho _{1,2} &{}\quad 1 &{}\quad \rho _{2,3} \\ \rho _{1,3} &{}\quad \rho _{2,3} &{}\quad 1 \end{bmatrix} \end{aligned}$$

It can be proven that, for a given value of \(\rho _{1,2}\) between \(-1\) and 1, the set of values of \(\rho _{1,3}\) and \(\rho _{2,3}\) that ensure that R is positive semi-definite is the region enclosed by an ellipse in \({\mathbb {R}}^2\), which becomes a circle when \(\rho _{1,2} = 0\) and collapses to a line segment when \(\rho _{1,2} = \pm 1\).

Evidently, numerical procedures such as numerical optimisation or grid search are bound to be quite inefficient and computationally costly in cases such as this. Therefore, in the context of estimation via numerical techniques, it is often preferable to re-express \(R\) as a function of a vector of unconstrained parameters \(\varvec{\theta }\in \varvec{\Theta }= {\mathbb {R}}^m\), in which each element of \(\varvec{\theta }\) is freely varying.

In the special case when \(n=2\), this is almost universally accomplished by using the hyperbolic tangent function, and \(\rho _{1,2}\) is modelled as \(\rho _{1,2} = \tanh {(\theta )}\). In the general case \(n>2\), however, this solution is not directly viable, since the mere fact that off-diagonal elements of \(R\) are less than 1 in modulus does not ensure, per se, positive semi-definiteness of \(R\).

Following Pelletier (2006), since a correlation matrix is positive semi-definite it can always be written as

$$\begin{aligned} R = CC', \end{aligned}$$
(1)

where \(C\) is lower-triangular, with non-negative diagonal entries so as to ensure uniqueness. If \(R\) has full rank, then \(C\) will correspond to the Cholesky decomposition, but in the semi-definite case some of the columns of \(C\) may be zero vectors.

Let \(\textbf{c}_i'\) denote the i-th row of \(C\), which is here defined as

$$\begin{aligned} \textbf{c}_i' = [ \textbf{a}_i' \quad \textbf{0}_{n-i}' ] \end{aligned}$$
(2)

where \(\textbf{a}_i\) is a unit-norm vector of dimension i with a positive last element. Hoffman et al. (1972)’s intuition was to consider the elements of \(\textbf{a}_i\) as a point on the i-dimensional unit sphere, thus guaranteeing the existence of a bijective transformation between \(\textbf{a}_i\) and its \(i-1\) angles \(\varvec{\omega }_i\). Following Pinheiro & Bates (1996) and Rapisarda et al. (2007), we express such transformation as:

$$\begin{aligned} a_{i,j} = {\left\{ \begin{array}{ll} \cos (\omega _{i,j}) &{}\quad j = 1;\\ \cos (\omega _{i,j}) \prod _{k=1}^{j-1} \sin (\omega _{i,k}) &{}\quad 2 \le j \le i - 1;\\ \prod _{k = 1}^{j-1}\sin (\omega _{i,k}) &{}\quad j = i \end{array}\right. } \end{aligned}$$
(3)

where \(a_{i,j}\) is the j-th component of \(\textbf{a}_i\) and \(\omega _{i,j} \in [0, \pi ]\). In matrix notation,

$$\begin{aligned} C = \begin{bmatrix} 1 &{} 0 &{} 0 &{} \ldots &{} 0\\ \cos (\omega _{2,1}) &{}\quad \sin (\omega _{2,1}) &{} 0 &{} \ldots &{} 0 \\ \cos (\omega _{3,1}) &{}\quad \cos (\omega _{3,2}) \sin (\omega _{3,1}) &{}\quad \sin (\omega _{3,2}) \sin (\omega _{3,1}) &{} \ldots &{} 0 \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \end{bmatrix} \end{aligned}$$

where the matrix C is characterised via \(n(n-1)/2\) angles. By virtue of the bijection, Eq. (3) is invertible, so \(\omega _{i,j}\) can be recovered from R via a recursive rule:

$$\begin{aligned} \omega _{i,j} = {\left\{ \begin{array}{ll} \arccos (a_{i,j}) \qquad &{}j = 1;\\ \arccos \left( a_{i,j} \big / \prod _{k=1}^{j-1} \sin (\omega _{i,k})\right) \qquad &{}2 \le j \le i - 1, \ 2 \le i \le n.\\ \end{array}\right. } \end{aligned}$$
(4)

By stacking all the \(\omega _{i,j}\) into an m-element vector \(\varvec{\omega }\), one can therefore define the transformation

$$\begin{aligned} R= SCP_{\omega }(\varvec{\omega }) \end{aligned}$$
(5)

in which the correlation matrix is expressed as a function of the parameter vector \(\varvec{\omega }\). Note that the formulation above makes it extremely simple to compute the determinant of \(R\) via that of its Cholesky factor C: since C is triangular and \(C_{1,1}\) is identically 1,

$$\begin{aligned} \vert C \vert = \prod _{i=2}^n \prod _{j=1}^{i-1} \sin \left( \omega _{i,j}\right) ; \end{aligned}$$
(6)

therefore, since \(\vert R\vert = \vert CC'\vert = \vert C \vert ^2\), one can compute the determinant of \(R\) as

$$\begin{aligned} \vert R\vert = \prod _{i=1}^m \sin (\varvec{\omega }_i)^2. \end{aligned}$$
(7)

In this way, the admissible parameter space is simply a hypercube in \({\mathbb {R}}^m\) with side length equal to \(\pi\). Each point corresponds to a valid positive semi-definite matrix R; by virtue of Eq. (7), points on the boundary of the hypercube map to singular R matrices. As we will show in the rest of the paper, the SCP is particularly advantageous in cases when \(R\) is near-singular, since in those cases numerical algorithms benefit from the much more regular shape of the parameter space. We call this representation the “\(\omega\)-variant” of the SCP.
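To fix ideas, a minimal Python sketch of the \(\omega\)-variant is given below (our own illustrative code, not the gretl implementation used for the experiments in Sect. 3; the function names are ours). It builds \(C\) row by row via Eq. (3) and recovers the angles via the recursion of Eq. (4):

```python
import numpy as np

def scp_omega(omega, n):
    """omega-variant, Eq. (5): map the m = n(n-1)/2 stacked angles
    to a correlation matrix R = CC' via Eq. (3)."""
    C = np.zeros((n, n))
    C[0, 0] = 1.0
    pos = 0
    for i in range(1, n):               # the (i+1)-th row of C has i angles
        w = omega[pos:pos + i]
        pos += i
        q = 1.0                         # running product of sines
        for j in range(i):
            C[i, j] = q * np.cos(w[j])
            q *= np.sin(w[j])
        C[i, i] = q                     # last element, non-negative by construction
    return C @ C.T

def omega_from_r(R):
    """Inverse map, Eq. (4); assumes R is nonsingular."""
    C = np.linalg.cholesky(R)
    n = R.shape[0]
    omega = np.empty(n * (n - 1) // 2)
    pos = 0
    for i in range(1, n):
        q = 1.0                         # product of sines recovered so far
        for j in range(i):
            omega[pos] = np.arccos(np.clip(C[i, j] / q, -1.0, 1.0))
            q *= np.sin(omega[pos])
            pos += 1
    return omega

# Sanity check of Eq. (7): |R| equals the product of the squared sines
omega = np.random.default_rng(0).uniform(0.5, np.pi - 0.5, size=6)  # n = 4
R = scp_omega(omega, 4)
assert np.isclose(np.linalg.det(R), np.prod(np.sin(omega) ** 2))
```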

In some cases, however, it may be helpful for numerical optimisation procedures to re-express \(\omega _{i,j}\) via a sigmoid transformation of an unconstrained parameter \(\theta _{i,j}\): a very appealing choice is given by

$$\begin{aligned} \theta _{i,j} = \log \left( \omega _{i,j}\right) - \log \left( \pi - \omega _{i,j}\right) \longleftrightarrow \omega _{i,j} = \pi \, \Lambda (\theta _{i,j}) \end{aligned}$$
(8)

where \(\Lambda (x)\) is the logistic function, defined as \(\exp (x)/(1 + \exp (x))\). By using Eq. (8), the parameter space is mapped onto the whole of \({\mathbb {R}}^m\):

$$\begin{aligned} R= SCP_{\theta }(\varvec{\theta }) \end{aligned}$$
(9)

We call this representation the “\(\theta\)-variant” of the SCP. At this point, the advantages of the SCP over the direct use of \(R\) should be evident. Moreover, since the \(\theta\)-variant allows each parameter to vary freely between \(-\infty\) and \(\infty\), it should guarantee better performance of numerical optimisation methods when ill-conditioned correlation matrices are involved.
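In code, the \(\theta\)-variant is a thin wrapper around the sketch above (again, our own illustrative Python):

```python
def scp_theta(theta, n):
    """theta-variant, Eq. (9): R^m -> (0, pi)^m -> R, via Eq. (8)."""
    omega = np.pi / (1.0 + np.exp(-theta))   # pi * logistic(theta)
    return scp_omega(omega, n)

def theta_from_r(R):
    """Inverse map: the logit of omega / pi, cf. Eq. (8)."""
    omega = omega_from_r(R)
    return np.log(omega) - np.log(np.pi - omega)
```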

2.2 The Spherical Parametrisation in a Nutshell

In order to clarify the meaning of the SCP we will briefly go through the parametrisation steps for a \(2 \times 2\) nonsingular correlation matrix \(R\),

$$\begin{aligned} R= \begin{bmatrix} 1 &{} \rho _{1,2} \\ \rho _{1,2} &{} 1 \end{bmatrix} \end{aligned}$$

in which the only unknown parameter is \(-1< \rho _{1,2} < 1\). The \(\omega\)-mapping starts from the Cholesky factor of \(R\), here defined as,

$$\begin{aligned} C= \begin{bmatrix} 1 &{} 0 \\ \rho _{1,2} &{} \sqrt{1 - \rho _{1,2}^2} \end{bmatrix} \end{aligned}$$

whose rows, following Eq. (2), are further decomposed as:

$$\begin{aligned} \textbf{a}_1 = 1 \qquad \textbf{a}_2 = [\rho _{1,2}, \sqrt{1 - \rho _{1,2}^2}]; \end{aligned}$$

Using Eq. (3), we turn \(\textbf{a}_2\) into:

$$\begin{aligned} \textbf{a}_2 = [ \cos {(\omega _{2,1})} \quad \sin {(\omega _{2,1})} ] \end{aligned}$$

where \(0< \omega _{2,1} < \pi\). Clearly, this establishes the desired bijection:

$$\begin{aligned} \rho _{1,2} = \cos {(\omega _{2,1})} \longleftrightarrow \omega _{2,1} = \arccos {(\rho _{1,2})} \end{aligned}$$

The \(\theta\)-parametrisation adds an extra step, namely the bijection between \(\omega\) and the unbounded parameter \(\theta\) given in Eq. (8). As a consequence,

$$\begin{aligned} \textbf{a}_2 = [ \cos {(\pi \Lambda (\theta _{2,1}) )} \quad \sin {(\pi \Lambda (\theta _{2,1}))} ] \end{aligned}$$

Therefore, \(C\) is parametrised in terms of \(\theta\) as

$$\begin{aligned} C= \begin{bmatrix} 1 &{} 0 \\ \cos \left[ \pi \Lambda (\theta _{2,1})\right] &{} \sin \left[ \pi \Lambda (\theta _{2,1})\right] \end{bmatrix} \end{aligned}$$

and, consequently,

$$\begin{aligned} R = CC' = \begin{bmatrix} 1 &{} \cos \left[ \pi \Lambda (\theta _{2,1})\right] \\ \cos \left[ \pi \Lambda (\theta _{2,1})\right] &{} 1 \end{bmatrix} \end{aligned}$$

Extending the procedure to a \(3 \times 3\) correlation matrix is also rather simple:

$$\begin{aligned} R= \begin{bmatrix} 1 &{} \rho _{1,2} &{} \rho _{1,3} \\ \rho _{1,2} &{} 1 &{} \rho _{2,3} \\ \rho _{1,3} &{} \rho _{2,3} &{} 1 \end{bmatrix} \end{aligned}$$

where the Cholesky factor is now defined as

$$\begin{aligned} C= \begin{bmatrix} 1 &{}\quad 0 &{}\quad 0 \\ \rho _{1,2} &{} \quad \sqrt{1 - \rho _{1,2}^2} &{}\quad 0 \\ \rho _{1,3} &{}\quad \frac{\rho _{2,3} - \rho _{1,2} \rho _{1,3}}{\sqrt{1 - \rho _{1,2}^2}} &{}\quad \sqrt{1 - \rho _{1,3}^2 - \frac{(\rho _{2,3} - \rho _{1,2} \rho _{1,3})^2}{1 - \rho _{1,2}^2}} \end{bmatrix} \end{aligned}$$

The second and the third rows can be written as:

$$\begin{aligned} \textbf{a}_2&= [ \cos (\omega _{2,1}) \quad \sin (\omega _{2,1}) ]\\ \textbf{a}_3&= [ \cos (\omega _{3,1}) \quad \cos (\omega _{3,2})\sin (\omega _{3,1}) \quad \sin (\omega _{3,1})\sin (\omega _{3,2})] \end{aligned}$$

So the Cholesky factor \(C\), as a function of the unconstrained parameter \(\theta\), becomes

$$\begin{aligned} C= \begin{bmatrix} 1 &{} \quad 0 &{}\quad 0 \\ \cos \left[ \pi \Lambda (\theta _{2,1})\right] &{}\quad \sin \left[ \pi \Lambda (\theta _{2,1})\right] &{}\quad 0 \\ \cos \left[ \pi \Lambda (\theta _{3,1})\right] &{}\quad \cos \left[ \pi \Lambda (\theta _{3,2})\right] \sin \left[ \pi \Lambda (\theta _{3,1})\right] &{}\quad \sin \left[ \pi \Lambda (\theta _{3,1})\right] \sin \left[ \pi \Lambda (\theta _{3,2})\right] \end{bmatrix} \end{aligned}$$

This parametrisation ensures that \(\theta\) can be picked as any point in \({\mathbb {R}}^3\) and still give rise to a proper Cholesky factor \(C\) of a nonsingular correlation matrix \(R\). Since the transformation is invertible, any nonsingular correlation matrix \(R\) corresponds to one and only one point in \({\mathbb {R}}^3\).
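Continuing the Python sketch of Sect. 2.1, a quick numerical round trip (on an arbitrary example matrix of ours) confirms the bijection:

```python
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
theta = theta_from_r(R)                      # the unique point in R^3
print(np.allclose(scp_theta(theta, 3), R))   # True: the round trip is exact
```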

2.3 Derivatives

The availability of closed-form derivatives enhances both the computational speed and the accuracy of optimisation algorithms, and in the spherical coordinate case they are easily obtained: the bijections defined via \(\omega\) and \(\theta\) establish differentiable transformations between the new parameters and the correlation matrix \(R\).

The derivative of \(\textrm{vec}(R)\) with respect to \(\omega\) is

$$\begin{aligned} \frac{\partial \textrm{vec}(R)}{\partial \omega } = \frac{\partial \textrm{vec}(CC')}{\partial \omega } = (C \otimes I) \frac{\partial \textrm{vec}(C)}{\partial \omega } + ( I \otimes C) \frac{\partial \textrm{vec}(C')}{\partial \omega } \end{aligned}$$
(10)

where I is an \(n \times n\) identity matrix and \(\otimes\) denotes the Kronecker product; the two terms follow from the identity \(\textrm{vec}(AXB) = (B' \otimes A)\,\textrm{vec}(X)\). Equation (10) can be rewritten more compactly via the commutation matrix \(K_n\) as

$$\begin{aligned} \frac{\partial \textrm{vec}(R)}{\partial \omega } = (I + K_n) (I \otimes C) \frac{\partial \textrm{vec}(C')}{\partial \omega } \end{aligned}$$

To compute \(\frac{\partial \textrm{vec}(C')}{\partial \omega }\), we use the decomposition into unit vectors of Eq. (2); it therefore suffices to derive \(\frac{\partial \textbf{a}_i}{\partial \omega }\). This can be obtained quite efficiently via a recursion, by noting that Eq. (3) corresponds to:

$$\begin{aligned} a_{i,j}&= q_{i,j} \cos (\omega _{i,j}) \quad j< i\\ a_{i,j}&= q_{i,j} \quad j = i \\ q_{i, j+1}&= q_{i,j} \sin (\omega _{i,j}) \quad j < i\\ q_{i, 1}&= 1 \end{aligned}$$

As a consequence,

$$\begin{aligned} \frac{\partial a_{i,j}}{\partial \omega }&= \frac{\partial q_{i,j}}{\partial \omega } \cos (\omega _{i,j}) + q_{i,j} \frac{\partial \cos (\omega _{i,j})}{\partial \omega }\\&= \frac{\partial q_{i,j}}{\partial \omega } \cos (\omega _{i,j}) - q_{i,j} \sin (\omega _{i,j})\textbf{e}'_j\\&= \frac{\partial q_{i,j}}{\partial \omega } \cos (\omega _{i,j}) - q_{i,j+1}\textbf{e}'_j\\ \end{aligned}$$

where \(\textbf{e}'_j\) is a standard basis vector, that is the j-th row of the identity matrix. \(\frac{\partial q_{i,j}}{\partial \omega }\) is derived iteratively as,

$$\begin{aligned} \frac{\partial q_{i,j+1}}{\partial \omega }&= \frac{\partial q_{i,j}}{\partial \omega } \sin (\omega _{i,j}) + q_{i,j}\cos (\omega _{i,j})\textbf{e}'_j\\ \frac{\partial q_{i,1}}{\partial \omega }&= \textbf{0}' \end{aligned}$$

Finally, to get the derivative with respect to \(\theta\), one can invoke the chain rule and multiply \(\frac{\partial \textrm{vec}(R)}{\partial \omega }\) by \(\frac{\partial \omega }{\partial \theta }\). This last quantity follows from Eq. (8) as

$$\begin{aligned} \frac{\partial \omega _{i,j}}{\partial \theta _{i,j}} = \pi \Lambda (\theta _{i,j}) [1 - \Lambda (\theta _{i,j})] \end{aligned}$$
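These recursions translate directly into code. The sketch below (ours, unoptimised) assembles \(\partial \textrm{vec}(R)/\partial \omega\) column by column as \(\textrm{d}C\,C' + C\,\textrm{d}C'\), thereby avoiding the explicit Kronecker products, and validates the result against finite differences:

```python
def scp_omega_jacobian(omega, n):
    """d vec(R) / d omega', an (n^2 x m) matrix, via the recursions above."""
    m = omega.size
    C = np.zeros((n, n))
    dC = np.zeros((n, n, m))            # dC[i, j, k] = d C[i, j] / d omega_k
    C[0, 0] = 1.0
    pos = 0
    for i in range(1, n):
        w = omega[pos:pos + i]
        q, dq = 1.0, np.zeros(m)        # q_{i,1} = 1, with zero derivative
        for j in range(i):
            k = pos + j                 # position of omega_{i,j} in the stack
            C[i, j] = q * np.cos(w[j])
            dC[i, j] = dq * np.cos(w[j])
            dC[i, j, k] -= q * np.sin(w[j])
            dq = dq * np.sin(w[j])      # product rule for q_{i,j+1}
            dq[k] += q * np.cos(w[j])
            q *= np.sin(w[j])
        C[i, i], dC[i, i] = q, dq
        pos += i
    J = np.empty((n * n, m))
    for k in range(m):
        dR = dC[:, :, k] @ C.T + C @ dC[:, :, k].T   # dR = dC C' + C dC'
        J[:, k] = dR.ravel(order="F")                # column-major vec()
    return J

# Finite-difference check on arbitrary angles (n = 3, m = 3)
omega = np.array([0.7, 1.1, 2.0])
J = scp_omega_jacobian(omega, 3)
eps = 1e-7
for k in range(3):
    step = np.zeros(3)
    step[k] = eps
    Jk = (scp_omega(omega + step, 3) - scp_omega(omega, 3)).ravel(order="F") / eps
    assert np.max(np.abs(J[:, k] - Jk)) < 1e-5
```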

3 Simulation Evidence

In this section, we present three examples of estimation via numerical ML where we compare the three parametrisations under analysis: the traditional one for \(R\), in which the likelihood is directly parametrised in terms of \(\textbf{r} = \textrm{vech}(R)\), and the \(\theta\)- and \(\omega\)-variants described in Sect. 2.1.

3.1 Normal Versus Student t Log-Likelihood

As a first example of the practical consequences of using the SCP, we analyse a very simple case, where the elements of the correlation matrix are the main estimated parameters.

Suppose we have a sample \(v_1, v_2, \ldots , v_T\) of independent and identically distributed (iid) observations from an n-variate standardised normal distribution: the total log-likelihood has the following form:

$$\begin{aligned} l(R) = \textrm{const} -\frac{T}{2}\log {\vert R\vert } - \frac{1}{2}\sum _{t=1}^T v_t'R^{-1}v_t \end{aligned}$$

The expression above can be rewritten as

$$\begin{aligned} l(R) = \textrm{const} -\frac{T}{2}\bigl [\log {\vert R\vert } + \textrm{tr}(\hat{R}R^{-1})\bigr ] \end{aligned}$$
(11)

where \(\hat{R}\) is the sample correlation matrix and \(\textrm{tr}(\cdot )\) is the trace operator. Naturally, \(\hat{R}\) is a sufficient statistic, and maximising the expression above shows that the ML estimator is simply the sample correlation matrix \(\hat{R}\).

Suppose, however, that one wants to maximise (11) by numerical methods: if the chosen starting point for \(R\) is the identity matrix, any gradient-based procedure will converge to the maximum in one iteration, on account of the fact that the score at \(R=I\) is exactly proportional to \(\textrm{vec}\left( I - \hat{R}\right)\).

If the log-likelihood is written in terms of \(\varvec{\omega }\) using Eq. (7),

$$\begin{aligned} l(\varvec{\omega }) = \textrm{const} -\frac{T}{2} \left\{ \sum _{i=1}^m \log \left[ \sin ^2(\varvec{\omega }_i)\right] + \textrm{tr}\left[ \hat{R}\cdot \textrm{SCP}_{\omega }(\varvec{\omega })^{-1}\right] \right\} \end{aligned}$$
(12)

this property no longer holds, and the number of iterations needed to reach convergence depends on the data and is therefore unpredictable. As a consequence, in this case the SCP should be inferior to the traditional parametrisation in terms of both CPU time and the number of iterations employed by the optimiser.

Consider now the same setup, with the normal distribution replaced by a multivariate Student’s t with \(\nu\) degrees of freedom. The log-likelihood to optimise is

$$\begin{aligned} l(R, \nu )&= T \left[ \gamma \left( \frac{\nu + n}{2}\right) - \gamma \left( \frac{\nu }{2}\right) -\frac{n}{2}\log \left[ \pi (\nu -2)\right] - \frac{1}{2}\log {\vert R\vert }\right] +\\&\quad - \frac{\nu + n}{2} \sum _{t=1}^T \log \left[ 1 + \frac{1}{\nu - 2} v_t'R^{-1}v_t \right] \end{aligned}$$

where \(\gamma (\cdot )\) is the log of Euler’s Gamma function. In this case, anticipating the behaviour of gradient-based methods is impossible and the SCP may benefit from the more regular parameter space induced by \(\omega\) or \(\theta\).

In order to shed light on the relative merits of the SCP compared to the traditional parametrisation, we simulate the \(v_t\) process for both distributions under several different scenarios: we set \(T=500\) observations and consider three different sizes for the vector \(v_t\), that is \(n \in \{5, 10, 20 \}\). For the simulated t distribution, we use \(\nu = 5\), so as to make the density markedly different from a Gaussian one. As for the correlation matrix, we use

$$\begin{aligned} \rho _{ij} = \varphi ^{\vert i - j \vert } \end{aligned}$$

with \(\varphi \in \{ 0, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99 \}\), so as to investigate cases closer and closer to singularity. For each design we run 100 Monte Carlo replications, keeping track of the elapsed CPU time and of the number of iterations the maximiser (BFGS with numerical derivatives) needs to converge; in all cases, the starting point is \(R= I\).
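For concreteness, a single replication of the Student’s t design can be sketched as follows (our Python transcription, relying on the scp_omega() function of Sect. 2; the actual experiments were run in gretl, and \(\nu\) is held fixed at its true value here for brevity):

```python
from scipy.linalg import toeplitz
from scipy.optimize import minimize
from scipy.special import gammaln

NU = 5.0                                         # true degrees of freedom

def negloglik_t(omega, V, n):
    """Negative standardised-t log-likelihood in the omega-variant."""
    T = V.shape[0]
    R = scp_omega(omega, n)
    quad = np.einsum('ti,ti->t', V, np.linalg.solve(R, V.T).T)
    logdet = np.sum(np.log(np.sin(omega) ** 2))  # log|R|, Eq. (7)
    ll = T * (gammaln((NU + n) / 2) - gammaln(NU / 2)
              - 0.5 * n * np.log(np.pi * (NU - 2)) - 0.5 * logdet) \
         - 0.5 * (NU + n) * np.sum(np.log1p(quad / (NU - 2)))
    return -ll

T, n, phi = 500, 5, 0.9                          # one of the designs above
rng = np.random.default_rng(42)
R0 = toeplitz(phi ** np.arange(n))               # rho_ij = phi^|i - j|
Z = rng.standard_normal((T, n)) @ np.linalg.cholesky(R0).T
V = Z / np.sqrt(rng.chisquare(NU, size=T)[:, None] / (NU - 2))
w0 = np.full(n * (n - 1) // 2, np.pi / 2)        # angles corresponding to R = I
res = minimize(negloglik_t, w0, args=(V, n), method="BFGS")
```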

Table 1 Average CPU time and average BFGS iterations in ML estimation: Normal distribution case

Table 1 shows the average CPU time and number of BFGS iterations for the normal log-likelihood maximisation: as predicted, the traditional parametrisation achieves the best results. The number of BFGS iterations needed to converge is 2 in almost all cases, save the most complex ones, where the size of the parameter space and the near-singularity of \(R\) probably contaminated the precision of the numerical derivatives. As for CPU time, both the \(\omega\)- and the \(\theta\)-variants perform worse, although the penalty in terms of actual CPU time is much less severe than that in the number of iterations.

Table 2 Average CPU time and average BFGS iterations in ML estimation: Student’s t distribution case

The results for the t distribution (Table 2) show a much different picture: the SCP uniformly dominates the traditional parametrisation, both in terms of BFGS iterations and of actual CPU time. The relative advantage of the SCP appears to increase with the correlation parameter \(\varphi\), thus suggesting that the main factor is the different shape of the parameter space (an ellipsoid for the traditional parametrisation versus a hypercube for the SCP; see Sect. 2.1); the size of the problem also affects the results, although to a lesser degree.

3.2 Example II: The DCC Model

In financial econometrics, the analysis of conditional covariance matrices is a key ingredient in the study of asset volatility. In particular, the DCC specification originally proposed by Engle (2002) aims primarily to model time-varying conditional correlations starting from standardised returns: let \(\textbf{x}_t = [x_{1,t}, x_{2,t},\ldots , x_{n,t}]'\) be the vector of asset returns, with \(t=0, 1,\ldots , T\), and define the excess return as \(\textbf{u}_t = \textbf{x}_t - \textrm{E}(\textbf{x}_t \mid \mathcal {I}_{t-1})\), where \(\mathcal {I}_{t-1}\) denotes the information set at time \(t-1\). The conditional covariance matrix of \(\textbf{u}_t\) is

$$\begin{aligned} \textrm{E}(\textbf{u}_{t}\textbf{u}_{t}' \mid \mathcal {I}_{t-1}) = V_t^{1/2}R_tV_t^{1/2} \end{aligned}$$
(13)

where \(R_t\) is the \(n \times n\) dynamic correlation matrix and \(V_t = \textrm{diag}(h_{1,t}, h_{2,t},\ldots , h_{n,t} )\) is the diagonal matrix of conditional variances.

Usually, the \(h_{i,t}\) are modelled via univariate techniques. The dynamic correlation is further rewritten as \(R_t = \tilde{Q}_{t}^{-1/2}Q_t\tilde{Q}_{t}^{-1/2}\) where:

$$\begin{aligned} Q_t&= \Gamma + A \odot (\eta _{t-1}\eta _{t-1}' - \Gamma ) + B \odot (Q_{t-1} - \Gamma ) \end{aligned}$$
(14)
$$\begin{aligned} \tilde{Q}_t&= \textrm{diag}(Q_{11, t}, Q_{22, t}, \ldots , Q_{nn, t}) \end{aligned}$$
(15)

where \(\Gamma , A, B\) are \(n \times n\) parameter matrices, \(\eta _t\) is the standardised return vector, with elements \(\eta _{i,t} = u_{i,t}/\sqrt{h_{i,t}}\), and \(\odot\) denotes the Hadamard (element-by-element) product. The \(\Gamma\) parameter, in particular, corresponds to the unconditional correlation matrix \(\textrm{E}(\eta _t\eta _t')\), which is here reparametrised in terms of spherical coordinates.

In this section, we provide some evidence on the computational advantages of the SCP in a DCC-GARCH model when highly correlated series are used. We compare the performance of both the \(\omega\)- and the \(\theta\)-variants [Eqs. (5) and (9), respectively] with that of the traditional parametrisation.

The Data Generating Process (DGP) follows Eq. (14), where we opt for the so-called “scalar” specification of the parameter matrices A and B,

$$\begin{aligned} Q_t = \Gamma + a (\eta _{t-1}\eta _{t-1}' - \Gamma ) + b (Q_{t-1} - \Gamma ) \end{aligned}$$
(16)

where \(a = 0.2\) and \(b = 0.7\). We do so for three reasons: first, the fact that \(a + b = 0.9\) implies that correlations are highly persistent, while the DGP remains quite far from values that would imply nonstationarity; second, since we focus on the performance of the SCP for modelling \(\Gamma\), we want the DGP to contain as few parameters as possible apart from \(\Gamma\) itself; finally, the scalar specification is widely used in empirical practice.

The correlation matrix \(\Gamma\) we use is

$$\begin{aligned} \Gamma _{ij} = \left\{ \begin{array}{ll} 0.9 &{} \quad i\ne j\\ 1 &{}\quad i = j \end{array} \right. \end{aligned}$$

We simulated \(n = 10\) asset returns over a time horizon of \(t = 1, \ldots , 1024\). The standardised returns \(\eta _t\) are here defined as

$$\begin{aligned} \eta _t = C_t\eta _t^* \end{aligned}$$
(17)

where \(C_t\) is the Cholesky factor of \(R_t\) and \(\eta _t^* = (\eta _{1,t}^*,\ldots , \eta _{10,t}^*)\) with \(\eta _{it}^* \sim t(6)\), where t(6) denotes a Student t distribution with 6 degrees of freedom. We choose this distribution so as to have a degree of “fat tails” in our simulated data similar to that typically found in daily financial time series.

The conditional variances follow univariate GARCH(1, 1) processes,

$$\begin{aligned} h_{it} = \omega + \alpha u_{i, t-1}^2 + \beta h_{i, t-1} \end{aligned}$$
(18)

where we set \(\omega = 0.02\), \(\alpha = 0.03\) and \(\beta = 0.95\). The return \(u_{i,t}\) is defined via \(\eta _{i,t}^*\) as \(u_{i,t} = \sqrt{h_{i,t}}\,\eta _{i,t}^*\). Again, these parameter values are meant to mimic the persistence typically observed in real data.
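Putting the pieces together, the DGP can be sketched as follows (our own Python transcription under the parameter values above; the published experiments were run in gretl):

```python
def simulate_dcc(T=1024, n=10, a=0.2, b=0.7, nu=6, seed=0):
    """Scalar-DCC DGP of Eqs. (16)-(18), with t(6) innovations."""
    rng = np.random.default_rng(seed)
    Gamma = np.full((n, n), 0.9)                 # unconditional correlations
    np.fill_diagonal(Gamma, 1.0)
    omg, alpha, beta = 0.02, 0.03, 0.95          # GARCH(1, 1) parameters
    Q = Gamma.copy()
    h = np.full(n, omg / (1 - alpha - beta))     # unconditional variances
    U = np.empty((T, n))
    for t in range(T):
        d = 1.0 / np.sqrt(np.diag(Q))
        R = Q * np.outer(d, d)                   # R_t = Q~_t^{-1/2} Q_t Q~_t^{-1/2}
        eta_star = rng.standard_t(nu, size=n)
        eta = np.linalg.cholesky(R) @ eta_star   # Eq. (17)
        U[t] = np.sqrt(h) * eta_star             # u_it = sqrt(h_it) * eta*_it
        Q = Gamma + a * (np.outer(eta, eta) - Gamma) + b * (Q - Gamma)  # Eq. (16)
        h = omg + alpha * U[t] ** 2 + beta * h   # Eq. (18)
    return U
```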

We generated 100 datasets from the DGP above and then estimated a DCC model on each, using the three different parametrisations. Given that the likelihood for this model may have multiple local maxima, we first checked that each technique yielded the same maximum. We then compared the CPU time for the DCC model under the traditional parametrisation and under the \(\omega\)- and \(\theta\)-variants of the SCP. ML estimation is performed via the BFGS algorithm, using both numerical and analytical derivatives, as per Caporin et al. (2020). All three methods reach exactly the same log-likelihood maximum in both the numerical and the analytical case. As for CPU times, Table 3 collects the main summary statistics.

Table 3 Main summary statistics for the CPU time for the DCC model
Fig. 1 Boxplots for the time performances of the three methodologies. Numerical derivatives

Table 3 shows that the SCP leads to an evident speed-up over the traditional parametrisation in the numerical-derivative case: the average gain is about 30 s, corresponding to a 6–\(7\%\) relative change for the \(\theta\)- and \(\omega\)-variants with respect to the standard parametrisation, denoted respectively as \(\Delta\)(\(\theta\)) and \(\Delta\)(\(\omega\)). Moreover, standard deviations are much smaller for the SCP. This may not seem a great improvement in itself, but the advantages of the SCP emerge more clearly from the distributions of CPU times for the different scenarios in Figs. 1 and 2.

Fig. 2 Boxplots for the time performances of the three methodologies. Analytical derivatives

Both figures show the boxplots of elapsed CPU time for the three different alternatives (Fig. 1 with numerical and Fig. 2 with analytical derivatives): the distributions for both SCP versions are much less dispersed, with thinner right tails, especially in the \(\theta\) case. In other words, in the DCC case the SCP is not only faster in most cases, but also offers more predictable computing times. As for the relative performance of the two SCP variants, no clear winner emerges: the \(\omega\)-version uses slightly less CPU time on average, while the results for the \(\theta\)-variant are more concentrated and predictable. All these results are qualitatively identical across the numerical and analytical score scenarios.

A more formal way to assess the computational superiority of the SCP is via simple t-tests. We compare the log CPU times for the various methods, so that results can be read as relative speed differences. Results are reported in Table 4: both the \(\theta\)- and the \(\omega\)-variant differ significantly from the traditional parametrisation. Interestingly, the third and sixth rows also report the difference between the two SCP variants: the \(\omega\)-variant appears to be superior, with a statistically significant difference.

Table 4 Mean difference tests for CPU time

We also experimented with other maximisation algorithms in common use in computational econometrics, namely the Newton–Raphson method and the limited-memory variant of BFGS, L-BFGS (Morales & Nocedal, 2011), by running a small Monte Carlo simulation with 10 replications: the Newton method converged in all cases, but was almost 6 times more costly in terms of CPU time than the standard BFGS method. The differences between parametrisations, instead, were rather similar to the benchmark case: the \(\omega\)-variant was \(11\%\) faster than the traditional one in the numerical-derivative case, while the \(\theta\)-variant was \(8\%\) faster. L-BFGS, in this scenario, exhibited convergence issues. We report the results for the Newton case in Table 5.

Table 5 Average CPU time (in seconds) for the three parametrisations in the DCC example: Newton–Raphson case

3.3 Example III: The Multivariate Probit Model

The multivariate probit model (Ashford & Sowden, 1970) arises as a natural extension of the probit model to the case of dependent binary outcomes. More formally, let \(\textbf{y}_i = [y_{i,1}, \ldots , y_{i,l}]\) denote a set of l binary variables for the i-th individual, with \(i=1, \ldots , N\).

The multivariate probit model can be written as a generalisation of the ordinary probit model as

$$\begin{aligned} y^*_{i,j}&= \textbf{z}_{i}'\beta _j + \varepsilon _{i,j} \end{aligned}$$
(19)
$$\begin{aligned} y_{i,j}&= \mathbb {1}\left[ y^*_{i,j} > 0\right] \end{aligned}$$
(20)
$$\begin{aligned} \varvec{\varepsilon }_i&\sim N(0, R) \end{aligned}$$
(21)

where \(\mathbb {1}(\cdot )\) is the indicator function, \(\textbf{z}_{i}\) is a vector of k covariates (assumed common to all binary responses for simplicity, without loss of generality) and \(\varepsilon _{i,j}\) is the disturbance term. As in the univariate probit model, the scale of \(\varepsilon _{i,j}\) is not identified, so to achieve identification the covariance matrix of \(\varvec{\varepsilon }_i\) must be normalised to a correlation matrix R, as illustrated by Chib & Greenberg (1998). Estimation of the unknown \(\beta\) and \(R\) parameters is performed via ML, with the multivariate normal probabilities generally computed via simulation methods such as the GHK algorithm (Geweke, 1989; Hajivassiliou & McFadden, 1998; Keane, 1994). The computational burden, however, can be far from negligible, especially when the correlation matrix is near-singular. For this reason, we propose to use spherical coordinates in the estimation of \(R\).
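For reference, the GHK recursion can be sketched in a few lines (a textbook version in our own Python notation, not the gretl implementation used below; `u` collects the uniform or Halton draws):

```python
from scipy.stats import norm

def ghk(mu, R, y, u):
    """GHK estimate of P(y | mu, R), where y_j = 1[mu_j + eps_j > 0]
    and eps ~ N(0, R); u is an (S, l) array of draws in (0, 1)."""
    l = mu.size
    s = 2.0 * y - 1.0                    # sign flips: s_j * (mu_j + eps_j) > 0
    C = np.linalg.cholesky(R) * np.outer(s, s)   # Cholesky factor of DRD
    w = np.ones(u.shape[0])              # simulated probability weights
    e = np.zeros((u.shape[0], l))        # truncated standard normal draws
    for j in range(l):
        lo = -(s[j] * mu[j] + e[:, :j] @ C[j, :j]) / C[j, j]
        p = norm.sf(lo)                  # P(e_j > lo)
        w *= p
        e[:, j] = norm.ppf(norm.cdf(lo) + u[:, j] * p)
    return w.mean()
```

The simulated log-likelihood is then \(\sum _i \log \textrm{ghk}(\cdot )\), with \(R\) replaced by \(SCP_{\omega }(\varvec{\omega })\) or \(SCP_{\theta }(\varvec{\theta })\) in the spherical variants.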

In order to assess the numerical properties of the various parametrisations, we set up a simulation experiment as follows: we consider a medium-sized multivariate probit model, with \(l=5\) binary outcomes over \(N = 1024\) observations. The scenarios we consider for the correlation matrix \(R\) are of the form

$$\begin{aligned} \rho _{ij} = \left\{ \begin{array}{ll} \rho &{}\quad i\ne j\\ 1 &{}\quad i = j \end{array} \right. \end{aligned}$$

where \(\rho\) takes values over the grid \(\rho \in \left\{ 0.5, 0.75, 0.9, 0.95 \right\}\), so as to consider progressively more ill-conditioned cases.

For the covariates, we chose a simulation design with a twofold purpose: on the one hand, we want to mimic a real-life, albeit simple, problem; on the other hand, we want the covariates to give the log-likelihood enough curvature to overcome possible identification problems. For these reasons, we define \(\textbf{z}_{i}\) as a four-element vector, whose first element corresponds to an intercept while the others are random draws from a standard Gaussian distribution. The coefficients \(\beta _j\) are random draws from a uniform \(U(-1, 1)\) distribution for all variables except the constant terms, which take values over the grid \(\beta _{1,j} = \{-0.5, -0.25, 0, 0.25, 0.5 \}\) for \(j = 1, \ldots , 5\). Finally, for each correlation matrix \(R\) we simulate the data generating process 100 times, letting the uniform coefficients \(\beta\) vary between replications, so as to avoid results that depend on a particular \(\beta\) realisation.
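A single draw from this design can be sketched as follows (our Python notation, building on the earlier imports; the seed is arbitrary):

```python
l, N, k, rho = 5, 1024, 4, 0.9
rng = np.random.default_rng(7)
R = np.full((l, l), rho)                     # equicorrelated errors
np.fill_diagonal(R, 1.0)
Z = np.column_stack([np.ones(N), rng.standard_normal((N, k - 1))])
B = rng.uniform(-1.0, 1.0, size=(k, l))      # slopes drawn from U(-1, 1)
B[0, :] = [-0.5, -0.25, 0.0, 0.25, 0.5]      # intercepts on the stated grid
E = rng.standard_normal((N, l)) @ np.linalg.cholesky(R).T
Y = (Z @ B + E > 0).astype(int)              # Eqs. (19)-(20)
```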

Table 6 BFGS convergence failures

As for the estimation part, ML is carried out via simulated maximum likelihood, using the GHK algorithm. Again, to explore the sensitivity of the results to different choices, we use four different specifications for the pseudo-random draws in the GHK simulator, i.e., two uniform sequences (denoted by U) and two Halton sequences (denoted by H), of length 100 and 500, respectively. The maximisation algorithm is the stock version of BFGS as provided by the gretl package. Initial values are calculated by running individual probit models for each dependent variable and using the estimated \(\beta\) vectors. The sample correlation matrix of the generalised residuals (see Gourieroux et al., 1987) is used as the starting point for \(R\).

The first main result is reported in Table 6: under the traditional parametrisation and the \(\omega\)-variant, the numerical optimiser often failed to converge when \(\rho\) is large and the U/H sequences used in the GHK are short. Conversely, the \(\theta\)-variant appears more robust to the GHK initialisation conditions and to the possible ill-conditioning of \(R\), since it converges in most cases, apart from a few when \(\rho =0.95\): we conjecture that this may be explained by the unboundedness of the \(\theta\) space, which the optimisation procedure exploits more fully the closer the true parameter \(\rho\) is to its extremum.

Table 7 BFGS iterations summary statistics with \(\{U,H\} = 100\)
Table 8 BFGS iterations summary statistics with \(\{U,H\} = 500\)
Table 9 CPU time summary statistics with \(\{U,H\} = 100\)
Table 10 CPU time summary statistics with \(\{U,H\} = 500\)

Tables 7 and 8 show summary statistics on the number of BFGS iterations needed for convergence, conditional on convergence having taken place at the same maximum: the \(\omega\)- and especially the \(\theta\)-variant provide marginally better averages, but their distributions are much more regular and more robust to outliers than under the traditional parametrisation, as shown by the standard deviations and the order statistics. This is very much in line with the experiments performed in the two previous subsections. Again, the advantage that the SCP offers appears to increase with the degree of correlation.

Considering the CPU timings, reported in Tables 9 and 10, we find very similar results: on average, the \(\omega\)- and the \(\theta\)-variants yield a slight advantage, and they offer much more predictable performance, with thinner-tailed distributions. This happens uniformly across the four different GHK setups.

Similarly to the DCC experiment, we considered the other optimisation algorithms (Newton–Raphson and L-BFGS) in a small Monte Carlo experiment with 10 replications. The Newton method showed convergence rates similar to those of BFGS; on the contrary, L-BFGS encountered several failures under the traditional parametrisation when high correlations were involved: in particular, the traditional parametrisation always failed to converge with \(\rho =0.90\) or \(\rho =0.95\). CPU-time performance deteriorated dramatically for the Newton method, with the spherical parametrisations (especially the \(\theta\)-variant) faster than the traditional one. The L-BFGS method was also slower than the standard BFGS, especially for high values of \(\rho\): when convergence was achieved under the traditional parametrisation, it reported better CPU performance than the SCP variants; we remark, however, that the SCP granted much better convergence results. For the sake of brevity, Table 11 reports the average CPU time in seconds for the \(\rho \in \{0.50, 0.75, 0.90\}\) cases (owing to their null or small convergence-failure rates), with \(U = 100\).

Table 11 Average CPU time (in seconds) for the parametrisations in the multivariate probit example: Newton method and L-BFGS

To summarise the whole experiment: the SCP, in the form of the \(\theta\)-variant, achieves the best results for high values of \(\rho\), especially in terms of ensuring that convergence takes place at all; in those cases where a comparison is possible, it also guarantees the best numerical performance.

4 Conclusion and Directions for Future Research

Efficient parametrisations can make numerical optimisation problems considerably easier. In the context of statistical and econometric models, the estimation of correlation matrices poses several issues, given the constrained nature of the correlation matrix itself.

In this paper, we analysed the use of the spherical coordinates representation: following the previous contributions by Pinheiro & Bates (1996) and Rapisarda et al. (2007), we applied the SCP to some widely used econometric models.

We provide three examples, ranging across different use cases: apart from a few exceptions, the SCP enhances the numerical optimisation of log-likelihood functions, in terms of both stability and numerical performance, especially when the correlation matrix is badly conditioned. Both SCP variants proposed here improve on the traditional approach, but the relative gain depends on the context.

Clearly, the scope of the SCP could be extended to models not considered here. For example, one could consider estimating the DCC model with fat-tailed innovations. It would also be interesting to extend the results presented here for the multivariate probit model to the multinomial probit model, especially in light of its computational complexity (see e.g., Fares et al., 2018). That case, however, would be considerably more complex, as the natural parametrisation of the multinomial probit involves a matrix whose diagonal elements may differ from 1 (see e.g., Train, 2009, Section 5.2).

In conclusion, the SCP appears to be a powerful tool for dealing with estimation procedures involving correlation matrices, especially in ill-conditioned cases: numerical performance is better, and estimation remains feasible in limiting cases that would otherwise be impossible to handle.

5 Computational Details

All experiments have been run in the econometric software gretl. In particular, we have used the 2022c release, installed on a Linux machine.