
Modeling financial returns is challenging because the correlations and variances of returns are time-varying and the covariance matrices can be high-dimensional. In this paper, we develop a Bayesian shrinkage approach with modified Cholesky decomposition to model correlations between financial returns. We reparameterize the correlation parameters to meet their positive definite constraint for Bayesian analysis. To implement an efficient posterior sampling scheme, a hierarchical representation of the Bayesian lasso is used to shrink the unknown coefficients in the linear regressions. Simulation results show good sampling properties, with the Markov chain Monte Carlo iterates converging quickly. Using a real data example, we illustrate the application of the proposed Bayesian shrinkage method in modeling stock returns in Hong Kong.


Introduction
Financial return modeling is an important topic in financial econometric research.
The covariance matrices of financial returns are known to vary over time. When modeling financial time series, studies including Jebabli et al. (2014), Billio et al. (2016), Dyhrberg (2016), Laurent et al. (2016) and Hendrych and Cipra (2016) have proposed models with changing covariance matrices. To model the changing covariance matrix of financial returns, on top of accounting for the time-varying feature of variances and correlations, we propose a Bayesian shrinkage method that models the covariances of financial returns while imposing sparsity on the dynamic covariances. An efficient shrinkage method is helpful because the number of returns in the data set can be quite large, resulting in a large number of parameters to estimate.
To start with, the model for time-varying variances has to be determined. Engle (1982) proposed the autoregressive conditional heteroscedastic (ARCH) model and described its maximum likelihood estimators. Bollerslev (1986) generalized the ARCH model and proposed the GARCH($p$,$q$) model. Let $\mathcal{F}_t$ denote the information up to time $t$. Franses and Ghijsels (1999) noted that certain return observations are additive outliers and cannot be fitted by GARCH models. For extensions and applications of the GARCH model, we refer to Billio et al. (2016), Dyhrberg (2016), Laurent et al. (2016), Corrêa et al. (2016), Narayan et al. (2016), Smetanina (2017) and Valera et al. (2018).
The GARCH model is commonly used for modeling time series, and its interpretative advantage of expressing the current variance as a weighted average of past squared returns and variances has sparked discussion of extensions and applications. In our model, the GARCH(1,1) structure is applied, so the current variance is a weighted average of the squared return and the variance in the previous period. To enable changing correlations between the returns, we allow the standardized residuals for different stocks from the GARCH(1,1) model to be correlated at any time $t$.
An important research question is then how to model the correlations between returns using the standardized residuals from the GARCH(1,1) model. To impose the positive definite condition on correlation matrices, we first link the correlation matrix to the covariance matrix. Assume that $y$ is a random vector with covariance matrix $\Sigma = \mathrm{Cov}(y)$ and that $D$ is a diagonal matrix in which the diagonal elements are the square roots of the diagonal elements of $\Sigma$. We then have
$$R = D^{-1} \Sigma D^{-1}.$$
With this link between the correlation and covariance matrices, we may consider various kinds of covariance modeling. A popular covariance modeling approach is to apply reparametrization by modified Cholesky decomposition. Pourahmadi (1999) showed that, as covariance matrices are positive definite and symmetric, there is a unique lower triangular matrix $T$ with diagonal entries equal to 1 and a unique diagonal matrix $D^*$ such that
$$T \Sigma T' = D^*.$$
Suppose that $j > k$; the $(j,k)$th off-diagonal entry of $T$ is $-\phi_{j,k}$.
The reparametrization allows us to write the regression equation
$$y_j = \sum_{k=1}^{j-1} \phi_{j,k}\, y_k + e_j,$$
that is, it is equivalent to regressing $y_j$ on $y_{j-1}, \ldots, y_1$; it provides a statistical interpretation, and the coefficients are unconstrained. Pourahmadi (2000) studied the asymptotic distribution of the maximum likelihood estimators of the parameters. For applications and extensions of modified Cholesky decomposition, we refer to Matsuo et al. (2011), Xu and Mackenzie (2012), Zhang and Leng (2012) and Hendrych and Cipra (2016). Modified Cholesky decomposition is widely discussed and often applied because it provides a linear regression interpretation and avoids the positive definite constraint on covariance matrices.
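To make the decomposition concrete, the following minimal Python sketch (our own illustration, not code from the paper; the function and variable names are ours) computes the unit lower triangular matrix $T$ and the diagonal matrix $D^*$ for an arbitrary positive definite matrix.

```python
import numpy as np

def modified_cholesky(sigma):
    """Return (T, D) with T unit lower triangular and D diagonal such
    that T @ sigma @ T.T equals D (up to rounding)."""
    L = np.linalg.cholesky(sigma)    # sigma = L @ L.T, L lower triangular
    d = np.diag(L)
    M = L / d                        # unit lower triangular factor of sigma
    T = np.linalg.inv(M)             # also unit lower triangular
    D = np.diag(d ** 2)
    return T, D

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
sigma = A @ A.T + 4 * np.eye(4)      # an arbitrary positive definite matrix

T, D = modified_cholesky(sigma)
# The (j, k)th off-diagonal entry of T is -phi_{j,k}, the coefficient of
# regressing variable j on variables 1, ..., j-1.
print(np.allclose(T @ sigma @ T.T, D))   # True
```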
Owing to these advantages, we apply modified Cholesky decomposition. In later sections, when implementing our model, no constraints other than those in the updating equations are imposed. To achieve good convergence rates when sampling the parameters, the least absolute shrinkage and selection operator (lasso), which shrinks the parameters towards their means, is applied to our model because it improves estimation accuracy and allows easier interpretation of results.
The remainder of this paper is organized as follows. Section 2 describes the shrinkage method in our model. Section 3 introduces the dependence between stocks, and the incorporation of modified Cholesky decomposition. Section 4 introduces the sampling scheme. Section 5 presents and interprets the simulation results. Section 6 discusses the real sampling results.
Park and Casella (2008) proposed the Bayesian lasso with a hierarchical prior representation. They suggested that by integrating out $\tau_1^2, \ldots, \tau_p^2$, we can obtain double exponential priors, and that the hierarchical representation allows a simple and practical implementation of Gibbs samplers. For the lasso parameter $\lambda$, they considered using empirical Bayes via the marginal likelihood or imposing gamma priors. For extensions and applications of lasso, we refer to Chen et al. (2011), Guo et al. (2012a), Schmidt and Makalic (2013), Benoit et al. (2013) and Zhang and Biswas (2015).
Group lasso and adaptive lasso are extensions of lasso. Yuan and Lin (2006) proposed the group lasso, which selects grouped variables together. Zou (2006) showed that lasso estimates can be inconsistent and proposed the adaptive lasso. For extensions of the adaptive lasso, we refer to Mutshinda and Sillanpää (2010) and Mutshinda and Sillanpää (2012). Zou and Hastie (2005) proposed the elastic net, which generalizes lasso. With standardized predictors, the elastic net estimator can be expressed as
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2.$$
They explained that when $\lambda_2$ is equal to zero, the elastic net is equivalent to lasso. For extensions of the elastic net, we refer to Zou and Zhang (2009), Kyung et al. (2010), Gefang (2014) and Zhang et al. (2017).
Other shrinkage methods include the smoothly clipped absolute deviation (SCAD) and group SCAD. Fan and Li (2001) proposed SCAD, a quadratic spline function, and proved its oracle property. Kim et al. (2008) developed an optimization algorithm and proved the asymptotic properties of the estimator in high dimensions. Wang et al. (2016) described the asymptotic properties of SCAD estimators in the quadratic approximation of the objective function. For group SCAD and its oracle properties, we refer to Wang et al. (2007), Hu et al. (2014) and Guo et al. (2015). Lasso- and SCAD-related shrinkage methods were considered for our sampling scheme. First, lasso-related shrinkage methods, such as adaptive lasso, group lasso, and the elastic net, have been widely discussed because lasso shrinks the parameters towards their means, which improves estimation accuracy and allows easier interpretation of results. For these advantages, lasso is applied to our model in a Bayesian framework. As the number of time points can be large, group lasso is used to group the correlation coefficients at each time point. Second, SCAD-related shrinkage methods are widely discussed because of their oracle properties. However, SCAD-related shrinkage methods have not been extended for application in a Bayesian framework, so they cannot be used in our model.

Model
The GARCH model has been widely used for modeling financial returns, as it incorporates time-varying conditional volatility. In our model, $r_{i,t}$, the financial return of the $i$th stock at time $t$, is modeled as
$$r_{i,t} = \sqrt{h_{i,t}}\,\tilde{\epsilon}_{i,t},$$
where $\tilde{\epsilon}_{i,t}$ follows a standard normal distribution and $h_{i,t}$, the conditional variance of $r_{i,t}$ given $\mathcal{F}_{t-1}$, evolves as
$$h_{i,t} = \alpha_0 + \alpha_1 r_{i,t-1}^2 + \beta_1 h_{i,t-1}. \quad (1)$$
The return of one stock can depend on those of other stocks at any time $t$, and thus one objective of this paper is to develop a model for capturing changing dependence structures between returns. Let $\tilde{\epsilon}_t = (\tilde{\epsilon}_{1,t}, \tilde{\epsilon}_{2,t}, \ldots, \tilde{\epsilon}_{m,t})'$, where $m$ is the total number of stocks. To model the changing correlation matrix of $\tilde{\epsilon}_t$, we specify
$$\tilde{\epsilon}_t = D_t^{-1} \epsilon_t, \quad (2)$$
where $D_t$ is a diagonal matrix with the $i$th diagonal element given by the standard deviation of $\epsilon_{i,t}$.
From Equation (2), the correlation matrix of $\tilde{\epsilon}_t$ is $R_t = D_t^{-1} \Sigma_t D_t^{-1}$. Our goal is therefore to estimate $\Sigma_t$ while imposing possible sparsity on it. We write
$$\epsilon_t \mid \mathcal{F}_{t-1} \sim N(0, \Sigma_t),$$
where $\mathcal{F}_{t-1}$ is the information up to time $t-1$, so that $\mathrm{Cov}(\epsilon_t \mid \mathcal{F}_{t-1})$, the conditional covariance matrix of $\epsilon_t$, is given by $\Sigma_t$. Following Pourahmadi (1999), we reparameterize $\Sigma_t$ using a modified Cholesky decomposition:
$$T_t \Sigma_t T_t' = I,$$
where $I$ is the identity matrix and $T_t$ is a lower triangular matrix with diagonal entries equal to one at time $t$. The off-diagonal entry at the $j$th row and $k$th column of $T_t$ is $-\phi_{t,j,k}$ when $j > k$. The coefficients $\phi_{t,j,k}$ are uniquely defined because there are only $m(m-1)/2$ off-diagonal entries in the Cholesky decomposition. Our model covers different covariance matrices as special cases: $\Sigma_t$ can take various forms according to the values of $\phi_{t,j,k}$. If $\phi_{t,j,k}$ for all $t$, $j$ and $k$ are equal to zero, $\Sigma_t$ will be an identity matrix for all $t$ and the model reduces to an independent multivariate GARCH model.
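To illustrate how $\Sigma_t$ and the implied correlation matrix arise from the coefficients, the following sketch (our own, with illustrative $\phi$ values rather than estimates) builds the unit lower triangular matrix, inverts it, and forms the correlation matrix.

```python
import numpy as np

def corr_from_phi(phi):
    """Build Sigma = T^{-1} (T^{-1})' and the implied correlation matrix
    from an (m, m) array phi holding the coefficients phi[j, k] below the
    diagonal and zeros elsewhere. A sketch in our own notation."""
    m = phi.shape[0]
    T = np.eye(m) - np.tril(phi, k=-1)   # off-diagonal entries are -phi_{j,k}
    T_inv = np.linalg.inv(T)
    sigma = T_inv @ T_inv.T              # from T Sigma T' = I
    d = np.sqrt(np.diag(sigma))
    R = sigma / np.outer(d, d)           # R = D^{-1} Sigma D^{-1}
    return sigma, R

phi = np.zeros((3, 3))
phi[1, 0], phi[2, 0], phi[2, 1] = 0.5, 0.2, -0.3
sigma, R = corr_from_phi(phi)
print(np.round(R, 3))                    # a valid correlation matrix
# With all phi equal to zero, R reduces to the identity matrix.
```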
By applying the Cholesky decomposition to the covariance matrix, we obtain an unconstrained reparametrization with a regression interpretation: $\Sigma_t$ is constrained to be positive definite, but the $\phi_{t,j,k}$ are unconstrained. For the regression interpretation, the central idea is to write
$$\epsilon_{j,t} = \sum_{k=1}^{j-1} \phi_{t,j,k}\, \epsilon_{k,t} + e_{j,t}, \quad (3)$$
where $e_{j,t}$ is iid N(0,1). The variance of $e_{j,t}$ is set to 1 for normalization. As Equation (3) defines linear regressions, a Bayesian lasso prior can be imposed on the coefficients; previous work applying Bayesian lasso priors in such regressions found that the shrinkage effects lead to a correct model. Therefore, our goal is to shrink $\phi_{t,j,k}$ towards its prior mean and achieve sampling accuracy by applying a Bayesian lasso. Following Park and Casella (2008), we choose the Bayesian lasso prior described in the next section.

Bayesian Shrinkage Estimation
We adopt the hierarchical representation
$$\phi_{t,j,k} \mid \tau_{t,j,k}^2 \sim N(\mu_{t,j,k}, \tau_{t,j,k}^2), \qquad \tau_{t,j,k}^2 \sim \mathrm{Exp}(\lambda^2/2),$$
where $\mathrm{Exp}(\eta)$ denotes the exponential distribution with mean $1/\eta$. The idea is that by integrating out the $\tau_{t,j,k}^2$, the prior becomes a product of independent Laplace$(\mu_{t,j,k}, \lambda^{-1})$ densities, which can perform shrinkage by adjusting $\lambda$. As the prior mean of $\phi_{t,j,k}$ is $\mu_{t,j,k}$, the iterates of $\phi_{t,j,k}$ are expected to approach $\mu_{t,j,k}$ when the prior variance $2\lambda^{-2}$ is small, that is, when $\lambda$ is large. To enable dynamic shrinkage of $\phi_{t,j,k}$, the prior mean is written as
$$\mu_{t,j,k} = \gamma_0 + \gamma_1 \mu_{t-1,j,k} + \gamma_2 \phi_{t-1,j,k},$$
a weighted average of $\mu_{t-1,j,k}$ and $\phi_{t-1,j,k}$, implying that $\mu_{t,j,k}$ is an exponentially weighted moving average of $\phi_{1,j,k}, \ldots, \phi_{t-1,j,k}$. The above indicates that $\mu_{t,j,k}$, the prior mean of $\phi_{t,j,k}$, varies with $t$. We assume stationarity in $\mu_{t,j,k}$, implying that its mean is $\gamma_0/(1 - \gamma_1 - \gamma_2)$.
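The recursion for the prior mean can be illustrated directly. The sketch below (our own, with arbitrary $\gamma$ and $\lambda$ values rather than estimates) iterates the recursion for one coefficient and checks the stationary mean.

```python
import numpy as np

# Sketch of the dynamic shrinkage recursion
#   mu_t = gamma0 + gamma1 * mu_{t-1} + gamma2 * phi_{t-1},
# an exponentially weighted moving average of past phi values.
gamma0, gamma1, gamma2 = 0.01, 0.6, 0.3
lam = 1.0                                  # lasso parameter lambda
T = 200
rng = np.random.default_rng(1)

mu, phi = np.empty(T), np.empty(T)
mu[0] = gamma0 / (1.0 - gamma1 - gamma2)   # stationary mean
phi[0] = mu[0]
for t in range(1, T):
    mu[t] = gamma0 + gamma1 * mu[t - 1] + gamma2 * phi[t - 1]
    # phi_t drawn around its prior mean, Laplace(mu_t, 1/lambda)
    phi[t] = rng.laplace(loc=mu[t], scale=1.0 / lam)

print(mu.mean(), gamma0 / (1 - gamma1 - gamma2))   # close for long series
```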

i. Sampling $\tau_{t,j,k}^2$
We apply Gibbs sampling to $\tau_{t,j,k}^2$ because its posterior distribution follows a generalized inverse Gaussian distribution; equivalently, following the standard Park and Casella (2008) result, the full conditional of $1/\tau_{t,j,k}^2$ is inverse Gaussian with mean $\lambda/|\phi_{t,j,k} - \mu_{t,j,k}|$ and shape $\lambda^2$.
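One Gibbs draw can therefore be sketched as follows (our own code under the inverse Gaussian form stated above; the guard against $\phi = \mu$ is a numerical convenience, not part of the model).

```python
import numpy as np

def sample_tau2(phi, mu, lam, rng):
    """One Gibbs draw of tau^2 in the Bayesian lasso hierarchy:
    1/tau^2 given the rest is inverse Gaussian with mean
    lam/|phi - mu| and shape lam^2. A sketch in our notation."""
    mean = lam / max(abs(phi - mu), 1e-10)   # guard against phi == mu
    inv_tau2 = rng.wald(mean, lam ** 2)      # inverse Gaussian draw
    return 1.0 / inv_tau2

rng = np.random.default_rng(2)
draws = [sample_tau2(phi=0.4, mu=0.1, lam=1.0, rng=rng) for _ in range(5)]
print(draws)
```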

ii. Sampling $(\alpha_0, \alpha_1, \beta_1)$
The posterior distribution of $(\alpha_0, \alpha_1, \beta_1)$ is given in Equation (5). As this posterior is nonstandard, the Metropolis–Hastings (MH) algorithm is applied: at each iteration, a candidate is drawn from a random walk kernel centered at the current iterate and accepted with the usual MH probability.
These parameters are expected to be strongly correlated because they are the GARCH parameters in Equation (1). Therefore, we follow the approach of Chen and So (2006). During an initial block of iterations, the proposal covariance is selected to be a diagonal matrix, and a scalar step size is tuned to achieve an acceptance rate between 25% and 50%. Afterwards, the proposal covariance is revised to be the sample covariance matrix of the earlier iterates to improve the convergence rate.
An advantage of the approach of Chen and So (2006) is that the covariance between the parameters is taken into account in the sampling scheme, which speeds up convergence. The scalar step size is selected to obtain an acceptance rate between 25% and 50%, which is considered desirable.
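A minimal sketch of this two-stage tuning, assuming a generic log-posterior function and with all names and default values ours, is given below.

```python
import numpy as np

def adaptive_rw_mh(log_post, theta0, n_iter=5000, n_adapt=1000, c=0.1,
                   rng=None):
    """Random walk MH in the spirit of Chen and So (2006): a diagonal
    proposal covariance during the first n_adapt iterations, then the
    sample covariance of the early iterates."""
    rng = rng or np.random.default_rng()
    d = len(theta0)
    cov = np.eye(d)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = np.empty((n_iter, d))
    for i in range(n_iter):
        if i == n_adapt:                    # switch to sample covariance
            cov = np.cov(chain[:n_adapt].T) + 1e-8 * np.eye(d)
        prop = rng.multivariate_normal(theta, c * cov)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop       # accept the candidate
        chain[i] = theta
    return chain

# Toy target; in the model, log_post would be the GARCH-parameter
# posterior of Equation (5).
chain = adaptive_rw_mh(lambda th: -0.5 * np.sum(th ** 2), [0.5, 0.5, 0.5])
print(chain[-5:])
```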

iii. Sampling $(\gamma_0, \gamma_1, \gamma_2)$
The posterior distribution of $(\gamma_0, \gamma_1, \gamma_2)$ is nonstandard, and we also observed in preliminary tests that its sampling suffers from a multi-modal problem. Therefore, a multiple-try Metropolis algorithm with a random walk kernel is applied, so that the iterates settle in global rather than local regions. The sampling procedure is as follows: at each iteration, draw $K$ trial points from the random walk kernel centered at the current iterate, where $K$ is the number of trials; one trial is then selected with probability proportional to its posterior weight, and the move is accepted or rejected using the multiple-try acceptance ratio.

Liu et al. (2000) proposed the above multiple-try Metropolis sampler and showed that while standard Metropolis chains were stuck in local maxima, most multiple-try Metropolis chains settled in modal regions. So (2006) presented two examples showing that the multiple-try Metropolis sampler enables the iterates to move from one mode to another, speeding up convergence. Pandolfi et al. (2014) proposed a generalized multiple-try reversible jump algorithm and found it more efficient than the reversible jump algorithm. Therefore, our hope is that the multiple-try algorithm resolves the multi-modal problem by settling the iterates in global regions.
To account for the correlations between $\gamma_0$, $\gamma_1$ and $\gamma_2$, we again follow the approach of Chen and So (2006): during an initial block of iterations the proposal covariance is a diagonal matrix and the step size is tuned so that we have an acceptance rate between 25% and 50%; afterwards, the sample covariance matrix of the earlier iterates is used. A sketch of one multiple-try step is given below.
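The following sketch (our own illustration on a toy target, not the paper's code) implements one multiple-try Metropolis move with a symmetric random walk kernel, using the weight choice under which the trial weights reduce to the posterior density.

```python
import numpy as np

def mtm_step(log_post, x, step, K, rng):
    """One multiple-try Metropolis move (Liu et al., 2000) with a
    symmetric random walk kernel; in the model this would target the
    posterior of (gamma0, gamma1, gamma2)."""
    d = len(x)
    trials = x + step * rng.standard_normal((K, d))
    lw = np.array([log_post(y) for y in trials])
    w = np.exp(lw - lw.max())
    y = trials[rng.choice(K, p=w / w.sum())]        # select one trial
    # reference points drawn around the selected trial; the last is x
    refs = y + step * rng.standard_normal((K - 1, d))
    lw_ref = np.array([log_post(z) for z in refs] + [log_post(x)])
    log_ratio = np.logaddexp.reduce(lw) - np.logaddexp.reduce(lw_ref)
    return y if np.log(rng.uniform()) < log_ratio else x

rng = np.random.default_rng(3)
x = np.zeros(2)
for _ in range(100):
    x = mtm_step(lambda th: -0.5 * np.sum((th - 1.0) ** 2), x, 0.5, 5, rng)
print(x)   # iterates move toward the mode at (1, 1)
```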

iv. Sampling $\phi_{t,j,k}$
A major challenge in sampling $\phi_{t,j,k}$ is that the number of time points and the dimension of the returns can be very large, so the number of parameters to estimate is potentially huge. With low sampling efficiency for $\phi_{t,j,k}$ we would get a low overall convergence rate, because these are the key regression parameters for modeling correlations between stocks. Because the posterior distribution of $\phi_{t,j,k}$, given in Equation (6), is nonstandard, we use the MH algorithm, which may increase the program run time substantially when a large number of parameters are sampled. The posterior reduces to a Gaussian distribution if the terms outside the exponential functions in Equation (6) are equal to 1. This reduced form is used as the independence kernel: it has a similar shape to the posterior and follows a normal distribution, which increases the efficiency of sampling $\phi_{t,j,k}$; writing $\mathbf{1}$ for a vector containing only ones, the independence kernel can be written in closed form as a normal density. For a detailed derivation of the posterior distribution, please refer to Appendix 8. The sampling procedure is a standard MH algorithm: (1) draw a candidate $\phi^*$ from the independence kernel; (2) accept $\phi^*$ with probability $\min\{1, [p(\phi^* \mid \cdot)\, q(\phi^{(i-1)})] / [p(\phi^{(i-1)} \mid \cdot)\, q(\phi^*)]\}$, where $q$ denotes the kernel density.
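A generic independence-kernel MH sketch (our own, on a toy target; in the model, the kernel would be the Gaussian reduced form of Equation (6)) is:

```python
import numpy as np

def independence_mh(log_post, log_kernel, sample_kernel, x0, n_iter, rng):
    """Independence-kernel MH: candidates come from a fixed approximation
    of the posterior, and the kernel density enters the acceptance ratio."""
    x, chain = x0, []
    lp, lq = log_post(x0), log_kernel(x0)
    for _ in range(n_iter):
        y = sample_kernel()
        lp_y, lq_y = log_post(y), log_kernel(y)
        # accept with min{1, p(y) q(x) / (p(x) q(y))}
        if np.log(rng.uniform()) < (lp_y - lp) - (lq_y - lq):
            x, lp, lq = y, lp_y, lq_y
        chain.append(x)
    return np.array(chain)

# Toy example: target N(2, 1), kernel N(0, 3^2) as the approximation.
rng = np.random.default_rng(4)
chain = independence_mh(
    log_post=lambda x: -0.5 * (x - 2.0) ** 2,
    log_kernel=lambda x: -0.5 * (x / 3.0) ** 2,
    sample_kernel=lambda: 3.0 * rng.standard_normal(),
    x0=0.0, n_iter=2000, rng=rng)
print(chain.mean())   # close to 2
```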

Good statistical properties and faster convergence can be obtained by using an independence kernel: Mengersen and Tweedie (1996) showed that independence kernels have geometric and uniform ergodicity properties; see also Robert and Casella (2013). The summations appearing in the posterior can be calculated recursively: the value needed at time $t$ is obtained by updating the value at time $t-1$, so each summation requires only a constant number of operations per time step. At each iteration, we compute the summations through these recursive relationships right before sampling $\phi_{t,j,k}$, for $t = 1, \ldots, T$ and $j = 2, \ldots, m-1$. Without the recursive relationships, it takes $O(T)$ summations for each $t$ and $j$; as there are $T$ time points and $m$ dimensions, one iteration needs $O(mT^2)$ time. By applying the recursive relationships in our calculations, the time needed for one iteration is reduced to $O(mT)$.
For each $j$, our implementation needs only $O(T)$ summations because of the recursive relationship. As the dimension is $m$, the amount of time needed for our implementation is $O(mT)$. Hence, our implementation allows the sampling of larger amounts of time series data within a limited time and reduces the number of computations required.
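The following toy sketch (ours; the paper's actual summations are model-specific) contrasts the naive $O(T^2)$ computation of running sums with the $O(T)$ recursive update.

```python
import numpy as np

# Suppose each t needs a running sum S_t = sum_{s<=t} a_s. Recomputing
# the sum from scratch for every t costs O(T^2); the recursion
# S_t = S_{t-1} + a_t costs O(T). Applied per dimension, this reduces
# one sweep from O(m T^2) to O(m T).
rng = np.random.default_rng(5)
a = rng.standard_normal(10_000)

# naive: O(T^2)
naive = np.array([a[: t + 1].sum() for t in range(len(a))])

# recursive: O(T)
rec = np.empty_like(a)
rec[0] = a[0]
for t in range(1, len(a)):
    rec[t] = rec[t - 1] + a[t]

print(np.allclose(naive, rec))   # True
```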

Simulation Experiment
In order to investigate the efficiency of the Bayesian method in Section 4, we implement the proposed MCMC algorithm on simulated data. The dimension in the simulation is five, the lasso parameter $\lambda$ is set to 1, and the length of each time series is 2,000. The goal of this simulation is to examine whether the sampling scheme is effective and efficient.
Similar to the second setting, $\alpha_0$, $\alpha_1$ and $\beta_1$ are set to 0.05, 0.1 and 0.85, respectively. The parameters $\gamma_1$ and $\gamma_2$ are chosen to be different from the second setting. The third and fourth parameter settings are presented in Table 2.
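A data-generating sketch for one such setting, using the GARCH parameters above and illustrative constant $\phi$ values of our own (the true simulation settings are given in Tables 1 and 2), could look like:

```python
import numpy as np

rng = np.random.default_rng(6)
m, T = 5, 2000
a0, a1, b1 = 0.05, 0.10, 0.85                # alpha0, alpha1, beta1

phi = 0.3 * np.tril(np.ones((m, m)), k=-1)   # illustrative, constant phi
Tmat = np.eye(m) - phi                       # unit lower triangular T
L = np.linalg.inv(Tmat)                      # epsilon = L @ e, e ~ N(0, I)
sigma = L @ L.T                              # Sigma with T Sigma T' = I
d = np.sqrt(np.diag(sigma))

h = np.full(m, a0 / (1 - a1 - b1))           # start at the long-run variance
r = np.empty((T, m))
for t in range(T):
    eps_std = (L @ rng.standard_normal(m)) / d   # correlated, unit variance
    r[t] = np.sqrt(h) * eps_std                  # returns at time t
    h = a0 + a1 * r[t] ** 2 + b1 * h             # GARCH(1,1) update

print(np.corrcoef(r.T).round(2))             # nonzero cross-correlations
```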
We generated 100 replications for the four parameter settings to study the performance of the Bayesian estimation method. Table 1 and Table 2 show the mean, median and standard deviation of the posterior mean estimators based on the replications that were successfully run. In general, the means of the posterior mean estimators are within one standard deviation of the respective true values. Most of the standard deviations of the posterior mean estimators are reasonably small. The above experiment indicates that the Bayesian method is able to reliably estimate the unknown parameters of the model with dynamic sparsity.

Real Data
In order to study the performance of the model on real data, we collected five constituent stocks of the Hang Seng Index in Hong Kong. The real data set consists of 2,461 percentage returns for each of 0001 HK Equity, 0101 HK Equity, 0011 HK Equity, 1044 HK Equity and 1088 HK Equity between 31 July 2007 and 21 July 2017, obtained from Yahoo Finance. The data span ten years and contain no missing values. Table 3 shows the mean, maximum, median and minimum of the percentage returns for the five stocks.
The MCMC algorithm was implemented on the real data set. The total number of iterations is 30,000 and the number of burn-in iterations is 15,000, meaning that we kept the last 15,000 iterates for posterior analysis. We adopted $\lambda = 1$ throughout to observe whether this leads to satisfactory results. When $\lambda$ is larger, the deviation of $\phi_{t,j,k}$ from its prior mean is penalized more heavily and the movement of the iterates is smaller. When $\lambda$ is small, the shrinkage effect is small and $\phi_{t,j,k}$ may not be shrunk towards the prior mean $\mu_{t,j,k}$. Figure 1 and Figure 2 report the trace plots, histograms, and autocorrelation plots for the real data experiment. The trace plots show that the iterates quickly move away from their initial values and converge. As the histograms show only one mode, there is no multi-modal problem.
The autocorrelations drop rapidly with the number of lags, indicating that the current iterates are nearly independent of previous iterates. Table 4 summarizes the estimation results, with $h_{i,1}$ set at the long-run variance $\alpha_0/(1 - \alpha_1 - \beta_1)$. In Figure 3(c), we observe an obvious peak in 2008 around the period of global financial instability due to the "financial tsunami". Interestingly, the estimated correlations are relatively low in mid 2008 and started to climb after the financial tsunami.

Discussion
The proposed Bayesian shrinkage method allows the covariance matrix of multiple return variables to change over time. The main idea is to express the prior mean $\mu_{t,j,k}$ of $\phi_{t,j,k}$ as a weighted average of $\mu_{t-1,j,k}$, the prior mean at time $t-1$, and $\phi_{t-1,j,k}$.
The real data sampling achieved good convergence properties. In Table 4, the standard deviations are small after disregarding the initial burn-in period, indicating that the iterates have converged after the burn-in period. From the trace plots in Figure 1 and Figure 2, we can see that the MCMC iterates converge quite quickly. An efficient Bayesian sampling strategy for modeling possible sparsity in $\Sigma_t$ via $\phi_{t,j,k}$ is essential for effective modeling of time-varying correlations in financial markets.