Sparse Bayesian time-varying covariance estimation in many dimensions

We address the curse of dimensionality in dynamic covariance estimation by modeling the underlying co-volatility dynamics of a time series vector through latent time-varying stochastic factors. The use of a global-local shrinkage prior for the elements of the factor loadings matrix pulls loadings on superfluous factors towards zero. To demonstrate the merits of the proposed framework, the model is applied to simulated data as well as to daily log-returns of 300 S&P 500 members. Our approach yields precise correlation estimates, strong implied minimum variance portfolio performance and superior forecasting accuracy in terms of log predictive scores when compared to typical benchmarks.


Introduction
The joint analysis of hundreds or even thousands of time series exhibiting a potentially timevarying variance-covariance structure has been on numerous research agendas for well over a decade. In the present paper we aim to strike the indispensable balance between the necessary flexibility and parameter parsimony by using a factor stochastic volatility (SV) model in combination with a global-local shrinkage prior. Our contribution is threefold. First, the proposed * Department of Finance, Accounting and Statistics, WU Vienna University of Economics and Business, Welthandelsplatz 1/D4/4, 1020 Vienna, Austria, +43 1 31336-5593, gregor.kastner@wu.ac.at approach offers a hybrid cure to the curse of dimensionality by combining parsimony (through imposing a factor structure) with sparsity (through employing computationally efficient absolutely continuous shrinkage priors on the factor loadings). Second, the efficient construction of posterior simulators allows for conducting Bayesian inference and prediction in very high dimensions via carefully crafted Markov chain Monte Carlo (MCMC) methods made available to end-users through the R (R Core Team 2017) package factorstochvol (Kastner 2017).
Third, we show that the proposed method is capable of accurately predicting covariance and precision matrices which we asses via statistical and economic forecast evaluation in several simulation studies and an extensive real-world example.
Concerning factor SV modeling, early key references include Harvey et al. (1994), Pitt and Shephard (1999), and Aguilar and West (2000) which were later picked up and extended by e.g. Philipov and Glickman (2006), Chib et al. (2006), Han (2006), Lopes and Carvalho (2007), Nakajima and West (2013), Zhou et al. (2014), and Ishihara and Omori (2017). While reducing the dimensionality of the problem at hand, models with many factors are still rather rich in parameters. Thus, we further shrink unimportant elements of the factor loadings matrix to zero in an automatic way within a Bayesian framework. This approach is inspired by highdimensional regression problems where the number of parameters frequently exceeds the size of the data. In particular, we adopt the approach brought forward by Caron and Doucet (2008) and Griffin and Brown (2010) who suggest to use a special continuous prior structure -the Normal-Gamma prior -on the regression parameters (in our case the factor loadings matrix).
This shrinkage prior is a generalization of the Bayesian Lasso (Park and Casella 2008) and has recently received attention in the econometrics literature (Bitto and Frühwirth-Schnatter 2016;. Another major issue for such high-dimensional problems is the computational burden that goes along with statistical inference, in particular when joint modeling is attempted instead of multi-step approaches or rolling-window-like estimates. Suggested solutions include Engle and Kelly (2012) who propose an estimator assuming that pairwise correlations are equal at every point in time, Pakel et al. (2014) who consider composite likelihood estimation, Gruber and West (2016) who use a decoupling-recoupling strategy to parallelize estimation (executed on graphical processors), Lopes et al. (2016) who treat the Cholesky-decomposed covariance matrix within the framework of Bayesian time-varying parameter models, and Oh and Patton (2017) who choose a copula-based approach to link separately estimated univariate models. We propose to use a Gibbs-type sampler which allows to jointly take into account both parameter as well as sampling uncertainty in a finite-sample setup through fully Bayesian inference, thereby enabling inherent uncertainty quantification. Additionally, this approach allows for fully probabilistic in-and out-of-sample density predictions.
For related work on sparse Bayesian prior distributions in high dimensions, see e.g. Kaufmann and Schumacher (2013) who use a point mass prior specification for factor loadings in dynamic factor models or Ahelegbey et al. (2016) who use a graphical representation of vector autoregressive models to select sparse graphs. From a mathematical point of view, Pati et al. (2014) investigate posterior contraction rates for a related class of continuous shrinkage priors for static factor models and show excellent performance in terms of posterior rates of convergence with respect to the minimax rate. All of these works, however, assume homoskedasticity and are thus potentially misspecified when applied to financial or economic data. For related methods that take into account heteroskedasticity, see e.g. Nakajima and West (2013) and Nakajima and West (2017) who employ a latent thresholding process to enforce time-varying sparsity. Moreover, Zhao et al. (2016) approach this issue via dependence networks, Loddo et al. (2011) use stochastic search for model selection, and Basturk et al. (2016) use time-varying combinations of dynamic models and equity momentum strategies. These methods are typically very flexible in terms of the dynamics they can capture but are applied to moderate dimensional data only.
We illustrate the merits of our approach through extensive simulation studies and an indepth financial application using 300 S&P 500 members. In simulations, we find considerable evidence that the Normal-Gamma shrinkage prior leads to substantially sparser factor loadings matrices which in turn translate into more precise correlation estimates when compared to the usual Gaussian prior on the loadings. 1 In the real-world application, we evaluate our model against a wide range of alternative specifications via log predictive scores and minimum variance portfolio returns. Factor SV models with sufficiently many factors turn out to imply extremely competitive portfolios in relation to well-established methods which typically have been specifically tailored for such applications. Concerning density forecasts, we find that our approach outperforms all included competitors by a large margin.
The remainder of this paper is structured as follows. In Section 2, the factor SV model is specified and the choice of prior distributions is discussed. Section 3 treats statistical inference via MCMC methods and sheds light on computational aspects concerning out-of-sample density predictions for this model class. Extensive simulation studies are presented in Section 4, where the effect of the Normal-Gamma prior on correlation estimates is investigated in detail. In Section 5, the model is applied to 300 S&P 500 members. Section 6 wraps up and points out possible directions for future research.

Factor SV Model
To reduce dimensionality, factor SV models utilize a decomposition of the m × m covariance matrix Σ t with m(m + 1)/2 free elements into a factor loadings matrix Λ of size m × r, an r-dimensional diagonal matrix V t and an m-dimensional diagonal matrix U t in the following fashion: This reduces the number of free elements to mr + m + r. Because r is typically chosen to be much smaller than m, this specification constrains the parameter space substantially, thereby inducing parameter parsimony. For the paper at hand, Λ is considered to be time invariant whereas the elements of both V t and U t are allowed to evolve over time through parametric stochastic volatility models, i.e. U t = diag(exp(h 1t ), . . . , exp(h mt )) and h m+j,t ∼ N φ m+j h m+j,t−1 , σ 2 m+j , j = 1, . . . , r.
More specifically, U t describes the idiosyncratic (series-specific) variances while V t contains the variances of underlying orthogonal factors f t ∼ N r (0, V t ) that govern the contemporaneous dependence. The autoregressive process in (3) is assumed to have mean zero to identify the unconditional scaling of the factors.
This setup is commonly written in the following hierarchical form (e.g. Chib et al. 2006): where the distributions are assumed to be conditionally independent for all points in time. To make further exposition clearer, let y = (y 1 · · · y T ) denote the m×T matrix of all observations, is referred to as the vector of parameters where µ i is the level, φ i the persistence, and σ 2 i the innovation variance of h i . To denote specific rows and columns of matrices, we use the "dot" notation, i.e. X i· refers to the ith row and X ·j to the jth column of X. The proportions of variances explained through the common factors for each component series, C it = 1 − U ii,t /Σ ii,t for i = 1, . . . , m, are referred to as the communalities. Here, U ii,t and Σ ii,t denote the ith diagonal element of U t and Σ t , respectively. As by construction 0 ≤ U ii,t ≤ Σ ii,t , the communality for each component series and for all points in time lies between zero and one. The joint (overall) communality C t = m −1 m i=1 C it is simply defined as the arithmetic mean over all series.
Three comments are in order. First, the variance-covariance decomposition in (1) can be rewritten as Σ t = Λ t Λ t + U t with Λ t := ΛV 1/2 t . An essential assumption within the factor framework is that both V t as well as U t are diagonal matrices. This implies that the factor loadings Λ t are dynamic but can only vary column-wise over time. Consequently, the time-variability of Σ t 's off-diagonal elements are cross-sectionally restricted while its diagonal elements are allowed to move independently across series. Hence, the "strength" of a factor, i.e. its cross-sectional explanatory power, varies jointly for all series loading on it. Consequently, it is likely that more factors are needed to properly explain the co-volatility dynamics of a multivariate time series than in models which allow for completely unrestricted timevarying factor loadings (Lopes and Carvalho 2007), correlated factors (V t not diagonal, see Zhou et al. 2014), or approximate factor models (U t not diagonal, see Bai and Ng 2002). Our specification, however, is less prone to overfitting and has the significant advantage of vastly simplified computations.
Second, identifying loadings for latent factor models is a long-standing issue that goes back to at least Anderson and Rubin (1956) who discuss identification of factor loadings. Even though this problem is alleviated somewhat when factors are allowed to exhibit conditional heteroskedasticity (Sentana and Fiorentini 2001;Rigobon 2003), most authors have chosen an upper triangular constraint of the loadings matrix with unit diagonal elements, thereby introducing dependence on the ordering of the data (see Frühwirth-Schnatter and Lopes 2017).
However, when estimation of the actual factor loadings is not the primary concern (but rather a means to estimate and predict the covariance structure), this issue is less striking because a unique identification of the loadings matrix is not necessary. 2 This allows leaving the factor loadings matrix completely unrestricted, thus rendering the method invariant with respect to the ordering of the series.
Third, note that even though the joint distribution of the data is conditionally Gaussian, its stationary distribution has thicker tails. Nevertheless, generalizations of the univariate SV model to cater for even more leptokurtic distributions (e.g. Liesenfeld and Jung 2000) or asymmetry (e.g. Yu 2005) can straightforwardly be incorporated in the current framework.
All of these extensions, however, tend to increase both sampling inefficiency as well as running time considerably and could thus preclude inference in very high dimensions.

Prior Distributions
The usual prior for each (unrestricted) element of the factor loadings matrix is a zero-mean Gaussian distribution, i.e. Λ ij ∼ N 0, τ 2 ij independently for each i and j, where τ 2 ij ≡ τ 2 is a constant specified a priori (e.g. Pitt and Shephard 1999;Aguilar and West 2000;Chib et al. 2006;Ishihara and Omori 2017;. To achieve more shrinkage, we model this variance hierarchically by placing a hyperprior on τ 2 ij . This approach is related to Bhattacharya and Dunson (2011) and Pati et al. (2014) who investigate a similar class of priors for homoskedastic factor models. More specifically, let Intuitively, each prior variance τ 2 ij provides element-wise shrinkage governed independently for each row by λ 2 i . Integrating out τ 2 ij yields a density for Λ ij |λ 2 i of the form p(Λ ij |λ 2 where K is the modified Bessel function of the second kind.
This implies that the conditional variance of Λ ij |λ 2 i is 2/λ 2 i and the excess kurtosis of Λ ij is 3/a i . The hyperparameters a i , c i , and d i are fixed a priori, whereas a i in particular plays a crucial role for the amount of shrinkage this prior implies. Choosing a i small enforces strong 2 The conditional covariance matrix Σ t = ΛV t Λ + U t involves a rotation-invariant transformation of Λ.
shrinkage towards zero, while choosing a i large imposes little shrinkage. For more elaborate discussions on Bayesian shrinkage in general and the effect of a i specifically, see Griffin and Brown (2010) and Polson and Scott (2011). Note that the Bayesian Lasso prior (Park and Casella 2008) arises as a special case when a i = 1.
One can see prior (4) as row-wise shrinkage with element-wise adaption in the sense that all variances in row i can be thought of as "random effects" from the same underlying distribution.
In other words, each series has high and a priori independent mass not to load on any factors and thus can be thought of as series-specific shrinkage. For further aspects on introducing hierarchical prior structure via the Normal-Gamma distribution, see Griffin and Brown (2017) and . Analogously, it turns out to be fruitful to also consider column-wise shrinkage with element-wise adaption, i.e.
This means that each factor has high and a priori independent mass not to be loaded on by any series and thus can be thought of as factor-specific shrinkage.

Statistical Inference
There are a number of methods to estimate factor SV models such as quasi-maximum likelihood (e.g. Harvey et al. 1994), simulated maximum likelihood (e.g. Liesenfeld and Richard 2006;Jungbacker and Koopman 2006), and Bayesian MCMC simulation (e.g. Pitt and Shephard 1999;Aguilar and West 2000;Chib et al. 2006;Han 2006). For high dimensional problems of this kind, Bayesian MCMC estimation proves to be a very efficient estimation method because it allows simulating from the high dimensional joint posterior by drawing from lower dimensional conditional posteriors.

MCMC Estimation
One substantial advantage of MCMC methods over other ways of learning about the posterior distribution is that it constitutes a modular approach due to the conditional nature of the sampling steps. Consequently, conditionally on the matrix of variances τ = (τ ij ) 1≤i≤m; 1≤j≤r , we can adapt the sampling steps of . For obtaining draws for τ , we follow Griffin and Brown (2010). The MCMC sampling steps for the factor SV model are: 1. For factors and idiosyncratic variances, obtain m conditionally independent draws of the idiosyncratic log-volatilities from h i |y i· , Λ i· , f , µ i , φ i , σ i and their parameters from Similarly, perform r updates for the factor logvolatilities from h m+j |f m+j,· , φ m+j , σ m+j and their parameters from φ m+j , σ m+j |f m+j,· , h m+j for j = 1, . . . , r. This amounts to m + r univariate SV updates. 3 2a. Row-wise shrinkage only: For i = 1, . . . , m, sample from wherer = min(i, r) if the loadings matrix is restricted to have zeros above the diagonal andr = r in the case of an unrestricted loadings matrix. For i = 1, . . . , m and j = 1, . . . ,r, draw from τ 2 2b. Column-wise shrinkage only: For j = 1, . . . , r, sample from wherej = j if the loadings matrix is restricted to have zeros above the diagonal and j = 1 otherwise. For j = 1, . . . , r and i =j, There is a vast body of literature on efficiently sampling univariate SV models. For the paper at hand, we use R package stochvol (Kastner 2016). 4 The Generalized Inverse Gaussian distribution GIG(m, k, l) has a density proportional to x m−1 exp − 1 2 (kx + l/x) . To draw from this distribution, we use the algorithm described in Hörmann and Leydold (2013) which is implemented in the R package GIGrvg (Leydold and Hörmann 2017). r = 0 r = 1 r = 5 r = 10 r = 20 r = 50 plain R, m the ith normalized observation vector and is the T ×r design matrix. This constitutes a standard Bayesian regression update.
3*. When inference on the factor loadings matrix is sought, optionally redraw Λ using deep interweaving ) to speed up mixing. This step is of less importance if one is interested in the (predictive) covariance matrix only.

Draw the factors from
Hereby,ỹ t = (y 1t e −h 1t /2 , . . . , y mt e −hmt/2 ) denotes the normalized observation vector at time t and is the m × r design matrix. This constitutes a standard Bayesian regression update.
The above sampling steps are implemented in an efficient way within the R package factorstochvol (Kastner 2017). Table 1 displays the empirical run time in milliseconds per MCMC iteration. Note that using more efficient linear algebra routines such as Intel MKL leads to substantial speed gains only for models with many factors. To a certain extent, computation can further be sped up by computing the individual steps of the posterior sampler in parallel.
In practice, however, doing so is only useful in shared memory environments (e.g. through multithreading/multiprocessing) as the increased communication overhead in distributed memory environments easily outweighs the speed gains.

Prediction
Given draws of the joint posterior distribution of parameters and latent variables, it is in principle straightforward to predict future covariances and consequently also future observations.
This gives rise to the predictive density (Geweke and Amisano 2010), defined as As with most quantities of interest in Bayesian analysis, computing the predictive density can be challenging because it constitutes an extremely high-dimensional integral which cannot be solved analytically. However, it may be approximated at a given "future" point y f through Monte Carlo integration, where κ (k) [1:t] denotes the kth draw from the posterior distribution up to time t. If (6) is evaluated at y f = y o t+1 , it is commonly referred to as the (one-step-ahead) predictive likelihood at time t + 1, denoted PL t+1 . Also, draws from (5) can straightforwardly be obtained by generating values y (k) t+1 from the distribution given through the (in our case multivariate Gaussian) density For the model at hand, two ways of evaluating the predictive likelihood particularly stand out.
First, one could average over k = 1, . . . , K densities of To obtain PL t+1 , we suggest to average over k = 1, . . . , K densities of [1:t] . This form of the predictive likelihood is obtained by analytically performing integration in (5) with respect to f t+1, [1:t] . Consequently, it is numerically more stable, irrespectively of the number of factors r. However, it requires a full m-variate Gaussian density evaluation for each k and is thus computationally much more expensive. To a certain extent, the computational burden can be mitigated by using the Woodbury matrix identity, . This substantially speeds up the repetitive evaluation of the multivariate Gaussian distribution if r m.
We apply these results for comparing competing models A and B between time points t 1 and t 2 and consider cumulative log predictive Bayes factors defined through log BF t 1 , where PL t (A) and PL t (B) denote the predictive likelihood of model A and B at time t, respectively. When the cumulative log predictive Bayes factor is greater than 0 at a given point in time, there is evidence in favor of model A, and vice versa.
Thereby, data up to time t 1 is regarded as prior information and out-of-sample evaluation starts at time t 1 + 1.

Simulation Studies
The aim of this section is to apply the model to a simulated data set in order to illustrate the shrinkage properties of the Normal-Gamma prior for the factor loadings matrix elements. For this purpose, we first illustrate several scenarios on a single ten dimensional data set. Second, we investigate the performance of our model in a full Monte Carlo simulation based on 100 simulated data sets. Third, and finally, we investigate to what extend these results carry over to higher dimensions.
In what follows, we compare five specific prior settings. Setting 1 refers to the usual standard Gaussian prior with variance τ 2 ij ≡ τ 2 = 1 and constitutes the benchmark. Setting 2 is the row-wise Bayesian Lasso where a i = 1 for all i. Setting 3 is the column-wise Bayesian Lasso where a j = 1 for all j. Setting 4 is the Normal-Gamma prior with row-wise shrinkage where    Turning towards the zero loadings, the strongest shrinkage is introduced by both variants of the Normal-Gamma prior, followed by the different variants of the Bayesian Lasso and the standard Gaussian prior. This is particularly striking for the loadings on the superfluous third factor. The difference between row-and column-wise shrinkage for the Lasso variants can most clearly be seen in row 9 and column 3, respectively. The row-wise Lasso captures the "zero-row" 9 better, while the column-wise Lasso captures the "zero-column" 3 better.
Because of the increased element-wise shrinkage of the Normal-Gamma prior, the difference between the row-wise and the column-wise variant are minimal.
In the context of covariance modeling, however, factor loadings can be viewed upon as a mere means to parsimony, not the actual quantity of interest. Thus, Figure 2 Table 2: Estimated log Bayes factors at t = 1500 against the 2-factor model using a standard Gaussian prior, where data up to t = 1000 is treated as the training sample. Lines 1 to 5 correspond to cumulative 1-day-ahead Bayes factors, lines 6 to 10 correspond to 10-days-ahead predictive Bayes factors.
intervals are tighter.
To conclude, we briefly examine predictive performance by investigating cumulative log predictive Bayes factors. Thereby, the first 1000 points in time are treated as prior information, then 1-day-and 10-days-ahead predictive likelihoods are recursively evaluated until t = 1500.

Medium Dimensional Monte Carlo Study
For a more comprehensive understanding of the shrinkage effect, the above study is repeated for 100 different data sets where all latent variables are generated randomly for each realization. In Table 3, the medians of the respective relative RMSEs (root mean squared errors, averaged over time) between the true and the estimated pairwise correlations are depicted.
The part above the diagonal represents the relative performance of the row-wise Lasso prior (setting 2) with respect to the baseline prior (setting 1), the part below the diagonal represents the relative performance of the row-wise Normal-Gamma prior (setting 4) with respect  Table 3: Relative RMSEs of pairwise correlations. Above the diagonal: Row-wise Lasso (a i = 1) vs. benchmark standard Gaussian prior with geometric means (first row). Below the diagonal: Row-wise Normal-Gamma (a i = 0.1) vs. row-wise Lasso prior (a i = 1) with geometric means (last row). Numbers greater than one mean that the former prior performs better than the latter. Hyperhyperparameters are set to c i = d i = 0.001. All values reported are medians of 100 repetitions.
to the row-wise Lasso prior (setting 2). Clearly, gains are highest for series 9 which is by construction completely uncorrelated to the other series. Additionally, geometric averages of these performance indicators are displayed in the first row (setting 2 vs. baseline) and in the last row (setting 4 vs. baseline). They can be seen as the average relative performance of one specific series' correlation estimates with all other series.
To illustrate the fact that extreme choices of c i and d i are crucial for the shrinkage effect of the Bayesian Lasso, Table 4 displays relative RMSEs for moderate hyperparameter choices c i = d i = 1. Note that the performance of the Bayesian Lasso deteriorates substantially while performance of the Normal-Gamma prior is relatively robust with regard to these choices. This indicates that the shrinkage effect of the Bayesian Lasso is strongly dependent on the particular choice of these hyperparameters (governing row-wise shrinkage) while the Normal-Gamma can adapt better through increased element-wise shrinkage. Table 5 which lists RMSEs and MAEs for all prior settings, averaged over the non-trivial correlation matrix entries as well as time. Note again that results under the Lasso prior are sensitive to the particular choices of the global shrinkage hyperparameters as well as the choice of row-or column-wise shrinkage, which is hardly the case for the Norma-Gamma prior. Interestingly, the performance gains achieved through shrinkage prior usage are higher when absolute errors are considered. This is coherent with the extremely high kurtosis of Normal-Gamma-type 1 2 3 4 5 6 7 8 9 10 Average 1 1.00 1.00 1.01 1.01 1.01 1.01 1.01 1.00 1.02 1.00 1 1.00 1.00 1.01 1.00 1.00 1.01 1.00 1.01 1.00 2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.01 1.00 3 1.00 1.01 1.01 1.00 1.00 1.02 1.00 1.03 1.00 4 3.06 1.00 2.99 1.02 1.02 1.00 1.00 1.01 1.02 5 1.00 1.01 1.00 3.02 1.00 1.02 1.00 1.03 1.00 6 1.00 1.01 1.00 2.93 1.00 1.02 1.00 1.04 1.00 7 3.25 1.00 2.98 1.00 2.90 2.72 1.00 1.02 1.02 8 1.00 1.00 1.00 1.01 1.01 1.00 1.01 1.02 1.00 9 3.51 3.07 3.28 3.19 3.40 3.37 3.28 3.03 1.02 10 1.00 1.01 1.00 2.92 1.00 1.00 2.98 1.01 3.32 Average 2 1.49 1.14 1.43 2.11 1.43 1.42 2.07 1.14 3.37 1.44 Table 4: Relative RMSEs of pairwise correlations. Above the diagonal: Row-wise Lasso (a i = 1) vs. benchmark standard Gaussian prior with geometric means (first row). Below the diagonal: Row-wise Normal-Gamma (a i = 0.1) vs. row-wise Lasso prior (a i = 1) with geometric means (last row). Numbers greater than one mean that the former prior performs better than the latter. Hyperhyperparameters are set to c i = d i = 1. All values reported are medians of 100 repetitions.

An overall comparison of the errors under different priors is provided in
priors which, while placing most mass around zero, allow for large values.

High Dimensional Monte Carlo Study
The findings are similar if dimensionality is increased; in analogy to above, we report overall RMSEs and MAEs for 495 000 pairwise correlations, resulting from m = 100 component series at T = 1000 points in time. The factor loadings for the r = 10 factors are again randomly sampled with 43.8% of the loadings being equal to zero, resulting in about 2.6% of the pairwise correlations being zero. Using this setting, 100 data sets are generated; for each of these, a separate (overfitting) factor SV model using r = 11 factors without any prior restrictions on the factor loadings matrix is fit. The error measures are computed and aggregated. Table 6 reports the medians thereof. In this setting, the shrinkage priors outperform the standard Gaussian prior by a relatively large margin; the effect of the specific choice of the global shrinkage hyperparameters is less pronounced.

Application to S&P 500 Data
In this section we apply the SV factor model to stock prices listed in the Standard & Poor's 500 index. We only consider firms which have been continuously included in the index from   The presentation consists of two parts. First, we exemplify inference using a multivariate stochastic volatility model and discuss the outcome. Second, we perform out-of-sample predictive evaluation and compare different models. To facilitate interpretation of the results discussed in this section, we consider the GICS 5 classification into 10 sectors listed in Table 7.  We run our sampler employing the Normal-Gamma prior with row-wise shrinkage for 110 000 draws and discard the first 10 000 draws as burn-in. 6 Of the remaining 100 000 draws every 10th draw is kept, resulting in 10 000 draws used for posterior inference. Hyperparameters are set as follows: a i ≡ a = 0.1, c i ≡ c = 1, d i ≡ d = 1, b µ = 0, B µ = 100, a 0 = 20, b 0 = 1.5, B σ = 1, B m+j = 1 for j = 1, . . . , r. To prevent factor switching, we set all elements above the diagonal to zero. The leading series are chosen manually after a preliminary unidentified run such that series with high loadings on that particular factor (but low loadings on the other factors) become leaders. Note that this intervention (which introduces an order dependency) is only necessary for interpreting the factor loadings matrix but not for covariance estimation or prediction. Concerning MCMC convergence, we observe excellent mixing for both the covariance as well as the correlation matrix draws. To exemplify, trace plots of the first 1000 draws after burn-in and thinning for posterior draws of the log determinant distribution of the covariance and correlation matrices at t = T are displayed in Figure 3.   The corresponding factor log variances are displayed in Figure 6. Apart from featuring similar low-to medium frequency properties, each process exhibits specific characteristics. First, notice the sharp increase of volatility in early 2010 which is mainly visible for the "overall" factor 1. The second factor (Utilities) displays a pre-crisis volatility peak during early 2008.
The third factor, driven by Energy and Materials, shows relatively smooth volatility behavior while the fourth factor, governed by the Financial, exhibits a comparably "nervous" volatility evolution. Higher correlation can be spotted throughout, both within sectors but also between sectors.
There are only few companies that show little and virtually no companies that show no correlation with others. Another two years later, we again see a different overall picture.
Lower correlations throughout become apparent with moderate correlations remaining within the sectors Energy, Utilities, and in particular Financials.  The matrix has been rearranged to reflect the different GICS sectors in alphabetical order, cf. Table 7.

Predictive Likelihoods for Model Selection
Even for univariate volatility models evaluating in-or out-of-sample fit is not straightforward because the quantity of interest (the conditional standard deviation) is not directly observable.
While in lower dimensions this issue can be circumvented to a certain extent using intraday data and computing realized measures of volatility, the difficulty becomes more striking when the dimension increases. Thus, we focus on iteratively predicting the observation density out-of-sample which is then evaluated at the actually observed values. Because this approach involves re-estimating the model for each point in time, it is computationally costly but can be parallelized in a trivial fashion on multi-core computers.
For the S&P 500 data set, we begin by using the first 3000 data points (until 5/2/2006) to estimate the one-day-ahead predictive likelihood for day 3001 as well as the ten-day-ahead predictive likelihood for day 3010. In a separate estimation procedure, the first 1001 data points (until 5/3/2006) are used to estimate the one-day-ahead predictive likelihood for day 3002 and the corresponding ten-day-ahead predictive likelihood for day 3011, etc. This procedure is repeated for 1990 days until the end of the sample is reached.
We Accumulated log predictive likelihoods for the entire period are displayed in Figure 8. Gains in predictive power are substantial up to around 8 factors with little difference for the two priors.
After this point, the benefit of adding even more factors turns out to be less pronounced. On the contrary, the effect of the priors becomes more pronounced. Again, while differences in scores tend to be muted for models with fewer factors, the benefit of shrinkage grows when r gets larger. Estimated marginal log predictive likelihoods for industry "overall" (n = 300, method = "new") q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 1−day−ahead, normal prior 1−day−ahead, NG prior 10−days−ahead, normal prior 10−days−ahead, NG prior Figure 8: Accumulated 1-day-and 10-days-ahead log predictive likelihoods for models with 0, 1, . . . , 20 factors. Data until t = 3000 is treated as training data.
While joint models with r > 0 outperform the marginal (no-factor) model for all points in time, days of particular turbulence particularly stand out. To illustrate this, we display average and top three log predictive gains over the no-factor model in worldwide. It appears that joint modeling of stock prices is particularly important on days of extreme events when conditional correlations are often higher.

Using Observable Instead of Latent Factors
An alternative to estimating latent factors from data is to use observed factors instead (cf. Wang et al. 2011  of the excess return on the market, a size factor SMB, and a book-to-market factor HML (Fama and French 1993); the momentum factor MOM (Carhart 1997) captures the empirically observed tendency for falling asset prices to fall further, and rising prices to keep rising. For estimation of this model we proceed exactly as before, except that we omit the last step of our posterior sampler and keep f fixed at the observed values.
Without presenting qualitative results in detail due to space constraints, we note that both the loadings on and the volatilities of the market excess returns show a remarkably close resemblance to those corresponding to the first latent factor displayed in Figures 5 and 6. To a certain extent (although much less pronounced), this is also true for SMB and the second latent factor as well as HML and the fourth latent factor. However, most of the loadings on MOM are shrunk towards zero and there is no recognizable similarity to the remaining latent factor from the original model. Log predictive scores for this model are very close to those for the SV model with two latent factors. In what follows, we term this approach FF+MOM.

Comparison to Other Models
We now turn to investigating the statistical performance of the factor SV model via out-ofsample predictive measures as well as its suitability for optimal asset allocation. The competi- where ι denotes an m-variate vector of ones. Using these weights, we compute the corresponding realized portfolio returns r t+1 for t = 3000, 3001, . . . , 3999, effectively covering an evaluation period from 5/3/2006 to 3/1/2010. In the first three columns of Table 9, we report annualized empirical standard deviations, annualized average excess returns over those obtained from the equal weight portfolio, and the quotient of these two measures, the Sharpe ratio (Sharpe 1966).
Considering the portfolio standard deviation presented in the first column of Table 9, it turns out that the Ledoit-Wolf shrinkage estimator implies an annualized standard deviation of about 12.5 which is only matched by factor SV models with many factors. Lower-dimensional factor SV models, including FF+MOM, as well as simple MAs and highly persistent EWMAs do not perform quite as well but are typically well below 20. Less persistent EWMAs, the no-factor SV model and the naïve equal weight portfolio exhibit standard deviations higher than 20. Considering average returns, FF+MOM and factor SV models with around 10 to 20 factors tend to do well for the given time span. In the third column, we list Sharpe ratios, where factor SV models with 10 to 20 factors show superior performance, in particular when the Normal-Gamma prior is employed. Note however that column two and three have to be interpreted with some care, as average asset and portfolio returns generally have a high standard error.
Second, we use what we coin pseudo log predictive scores (PLPSs), i.e. Gaussian approximations to the actual log predictive scores. This simplification is necessary because most of the above-mentioned methods only deliver point estimates of the forecast covariance matrix and it is not clear how to properly account for estimation uncertainty. Moreover, the PLPS is simpler to evaluate as there is no need to numerically solve a high-dimensional integral. Consequently, it is frequently used instead of the actual LPS in high dimensions while still allowing for evaluation of the covariance accuracy (Adolfson et al. 2007;Carriero et al. 2016;Huber 2016). More specifically, we use data up to time t to determine a point estimateΣ t+1 for Σ t+1 and compute the logarithm of the multivariate Gaussian density N m 0,Σ t+1 evaluated at the actually observed value y o t+1 to obtain the one-day-ahead PLPS for time t + 1.
In terms of average PLPSs (the last column in Table 9), factor SV models clearly outperform all other models under consideration. In particular, even if r is chosen as small as r = 4 to match the number of factors, the model with latent factors outperforms FF+MOM. Using latent factors is generally preferable; note however that the 4-factor FF+MOM does better than the single-and no-factor SV models. Generally speaking, many factors appear to be needed for accurately representing the underlying data structure, irrespectively of the prior choice. Considering the computational simplicity of the Ledoit-Wolf estimator, its prediction accuracy is quite remarkable. It clearly outperforms the no-factor SV model which, in turn, beats simple MAs and EWMAs.

Conclusion and Outlook
The aim of this paper was to present an efficient and parsimonious method of estimating highdimensional time-varying covariance matrices through factor stochastic volatility models. We did so by proposing an efficient Bayesian MCMC algorithm that incorporates parsimony by modeling the covariance structure through common latent factors which themselves follow univariate SV processes. Moreover, we added additional sparsity by utilizing a hierarchical shrinkage prior, the Normal-Gamma prior, on the factor loadings. We showed the effectiveness of our approach through simulation studies and illustrated the effect of different shrinkage specifications. We applied the algorithm to a high-dimensional data set consisting of stock returns of 300 S&P 500 members and conducted an out-of-sample predictive study to compare different prior settings and investigate the choice of the number of factors. Moreover, we discussed the out-of-sample performance of a minimum variance portfolio constructed from the model-implied weights and related it to a number of competitors often used in practice.
Because the algorithm scales linearly in both the series length T as well as the number of component series m, applying it to even higher dimensions is straightforward. We have experimented with simulated data in thousands of dimensions for thousands of points in time and successfully recaptured the time-varying covariance matrix.
Further research could be directed towards incorporating prior knowledge into building the hierarchical structure of the Normal-Gamma prior, e.g. by choosing the global shrinkage parameters according to industry sectors. Alternatively, Villani et al. (2009) propose a mixture of experts model to cater for smoothly changing regression densities. It might be fruitful to adopt this idea in the context of covariance matrix estimation by including either observed (Fama-French) or latent factors as predictors and allowing for other mixture types than the ones discussed there. While not being the focus of this work, it is easy to extend the proposed method by exploiting the modular nature of Markov chain Monte Carlo methods. In particular, it is straightforward to combine it with mean models such as (sparse) vector autoregressions (e.g., Bańbura et al. 2010;Kastner and Huber 2017), dynamic regressions (e.g., Korobilis 2013), or time-varying parameter models (e.g., Koop and Korobilis 2013;).