Estimating the Ordering of Variables in a VAR using a Plackett-Luce Prior ∗

Estimating Bayesian Vector Autoregressions (VARs) involving the Cholesky decomposition is sensitive to the ordering of variables. We treat the ordering as unknown, develop a prior over variable orderings and Markov Chain Monte Carlo (MCMC) methods for posterior sampling over orderings.


Introduction
Bayesian VARs have traditionally been estimated in reduced form, where the left hand side of the VAR equation involves an n×1 vector of dependent variables, y t , and the error covariance matrix for the VAR, Σ t , is unrestricted (apart from being positive definite). However, as VARs have become larger and larger, researchers have increasingly been estimating VARs in a structural form where the left hand side of the VAR is B 0 y t and the error covariance matrix is diagonal. B 0 is a lower triangular matrix based on the Cholesky decomposition of Σ t . The fact that the error covariance matrix of the structural VAR is diagonal means that Bayesian estimation can proceed one equation at a time. As shown, e.g., in Carriero et al. (2019) the MCMC algorithm based on the reduced form VAR requires O(n 6 ) elementary operations to take one draw of the VAR coefficients. This is reduced to O(n 4 ) with the structural VAR. Thus, in larger VARs, it can be computationally impractical to work in the reduced form, but feasible when working in the structural form. 1 However, the use of the Cholesky decomposition of the reduced form error covariance matrix leads to order dependence. That is, posterior and predictive results will vary depending on the way that the variables are ordered in the VAR. To clarify the precise nature of this order dependence, we highlight the discussion of sub-section 3.1 of Carriero et al. (2019) who demonstrate that the posterior of the structural form VAR coefficients, conditional on Σ t is invariant to ordering. The lack of order invariance arises due to the fact that the implied prior on Σ t is not order invariant. Bognanni (2018) and Arias et al. (2021) demonstrate the importance of the ordering issue in VARs with stochastic volatility (SV) both theoretically and empirically. Arias et al. (2021) shows that, although point forecasts are not sensitive to the way variables are ordered, predictive standard deviations can be substantially affected by the way that the variables are ordered. The VARs considered in Arias et al. (2021) are all low dimensional.  consider ordering issues in high dimensional VARs and demonstrate that the theoretical and empirical findings of Arias et al. (2021) hold with additional force in higher dimensions. Thus, there is growing theoretical and empirical evidence that ordering issues are important, particularly in the large VARs that cannot easily be estimated in reduced form due to the computational burden.
These considerations have stimulated interest in order invariant approaches. Reduced form estimation of VARs is order invariant with commonly-used priors, but is not scalable to large VARs.  critiques various order invariant approaches and proposes a new order invariant approach which avoids the use of the Cholesky decomposition, relies on stochastic volatility to identify the model and is scalable. Wu and Koop (2022) develop an order-invariant approach based on Eigendecomposition. The present paper adopts a different strategy to address the ordering issue. We retain the Cholesky decomposition of the reduced form error covariance but develop methods for finding the optimal ordering of the variables. In this, it shares some similarities to Levy and Lopes (2021) which also uses the Cholesky decomposition for the error covariance matrix for a multivariate time series model and develops Bayesian methods for searching over variable orderings. But the model class they use is different from ours and they use approximate discounting methods which contrasts with our use of MCMC methods. In addition, they do not use a prior over variable orderings.
We treat the ordering of variables in the VAR as unknown and estimate it using Bayesian methods. We do so by developing a prior over variable orderings. The Plackett-Luce distribution is commonly used for the likelihood function for rank ordered data. We use it, not for the likelihood function, but rather as a prior. We develop MCMC methods for the posterior that results. Our model is a Bayesian VAR with SV and Placket-Luce prior over variable orderings which we abbreviate to BVARSV-PL. In a macroeconomic forecasting exercise involving a VAR with 20 variables we compare our BVARSV-PL to a BVARSV which is identical in all respects except that a single ordering is used. We choose this model to be that of Chan (2021) and, for BVARSV, use the same ordering as in this paper. This can be interpreted as a model which is a special case of ours, but has a prior which dogmatically imposes a particular variable ordering. We find that our BVARSV-PL forecasts better than this BVARSV, thus demonstrating the importance of ordering choice in VARs for macroeconomic forecasting.

Bayesian Inference on Orderings in VARs
In this section, we develop Bayesian methods for carrying out inference on ways of ordering the variables in VARs. After defining the likelihood function for our VAR, we develop our Plackett-Luce prior on variable ordering. Subsequently, we develop MCMC methods which allow for posterior inference and prediction.
Our VAR with stochastic volatility is: where a is an n×1 vector of intercepts, A 1 , . . . , A p are n×n matrices of VAR coefficients, B 0 is an n × n matrix, and D t = diag e h 1,t , . . . , e hn,t is diagonal. Each of the logvolatilities follows a stationary AR(1) process: where N (·, ·) denotes the Gaussian distribution. 2 2 It is simple to extend our methods to the homoskedastic VAR or to allow for A i for i = 1, . . . , p to A common assumption, based on taking the Cholesky decomposition of the reduced form error covariance matrix, is that B 0 is a lower triangular matrix with ones on the diagonal. It is this which leads to order dependence as shown, e.g., in . But the lower triangularity assumption has the advantage that it leads to simple and fast Bayesian computation. Hence, in this paper, we retain the lower triangularity assumption but express uncertainty over which lower triangular form is appropriate. To be precise, we use priors which rule out all forms for B 0 except those that can be, via permutations of the columns of the matrix, put into a lower triangular form with ones on the diagonal.
Before discussing our prior for B 0 , we emphasize that we make standard assumptions for the other parameters in this model (i.e. the VAR coefficients and those in the SV processess). There are a range of possible priors for the VAR with SV which could be used. In our empirical work, we use the prior of Chan (2021). 3 To describe the prior for the orderings, we introduce a discrete random 1 × n vector ρ which can take on realizations ρ l for l = 1, .., L. Each ρ l defines an ordering of the variables with ρ l 1 being the variable which takes the first place under ordering l, ρ l 2 the second, etc. For instance, in a VAR with 3 variables there are 6 possible orderings (e.g. 1,2,3; 1,3,2; 2,1,3; etc.) and, thus, L = 6. If ρ 2 = (1, 3, 2), then the prior for B 0 conditional on ρ 2 will imply a lower triangular form for B 0 consistent with the variables being ordered as (1,3,2).
We stress that the variables in y t are ordered in a particular way (from variable 1 through variable n) and that they always appear in this order in the model. When we refer to variable orderings and lower triangular forms this relates to B 0 with the idea being that, through appropriate permutations of its columns, it becomes lower triangular. That is, formally the variables in y t never change order, nor is B 0 itself necessarily lower triangular. However, we consider a prior which only allows for choices of B 0 which can be transformed into lower triangular form after appropriate switching of its columns. This be time varying. Allowing for B 0 to be time varying would be a greater challenge, but could be done by replacing our Plackett-Luce prior with a dynamic Plackett-Luce prior.
3 A word is in order about identification. The homoskedastic version of our model is unidentified, although adding SV identifies the model as shown in Bertsche and Braun (2022). It is well-known that identification is not necessary to carry out Bayesian prediction in models which have proper priors. Even for the homoskedastic version of our model, updating will occur in the sense that the posterior will differ from the prior. This is an example of what Giacomini et al. (2022) call uncertain identification. is equivalent to considering all possible structural VARs with lower triangular triangular impact matrices for different variable orderings. When we use phrases below which refer to different orderings of the variables, they should be interpreted in this context.
For each choice of l we restrict B 0 to lower triangular form (after permutations) through a Dirac spike-and-slab prior. Lower triangularity is obtained if, for i < j, i, j = 1, . . . , n, variable ρ l i is ordered before variable ρ l j , i = j, then it is zero, otherwise it is non-zero. Thus, We set c = 0.0001 and V b = 1.
This defines the prior for B 0 conditional on a specific ordering.
We now need a prior over ρ. When dealing with ordered or ranked data, the Plackett-Luce model, (Luce, 1959;Plackett, 1975), is a popular one. 4 The Plackett-Luce distribution is parameterized by the parameter λ j > 0 for j = 1 . . . n which represents the skill rating or ability of each variable (i.e. λ j is the probability that variable y j,t is ordered first in the VAR). Our prior is It is a hierarchical prior in that λ is treated as unknown and estimated.
and calculate 4 The Plackett-Luce model assumes the ranking is complete, an assumption we maintain in this paper. However, it is worth noting that Bayesian methods for the extended Plackett-Luce model are available (Johnson et al., 2021) and could easily be used in our approach. This model allows for the ranking to be incomplete which, in some cases, could be useful for VAR ordering problems. For instance, a researcher may wish to know whether a block of macroeconomic variables is ordered before or after a block of financial variables but does not care about the ordering of variables within each block.
it can be shown to be the same prior as in (4).
Finally, we require a prior for λ, which is the vector of abilities for the variables. For these, we adopt independent Gamma distributions: As explained in Caron and Doucet (2012), the hyperparameters b λ,i are just scaling parameters on λ i . As the likelihood is invariant to a rescaling of the λ i , these hyperparameters can be fixed without influencing inference. Following Henderson and Kirrane (2018) we set b λ,i = 1. Following Caron and Doucet (2012) we estimate the other set of prior hyperparameters, a λ . We begin by setting a λ,i = a λ for i = 1, . . . , n which leads to an exchangeable prior specification in which we believe that some variables are better than others but we have no beliefs about which the stronger and the weaker variables are (Henderson and Kirrane, 2018). Thus, we assume λ i ∼ G(a λ , 1) and treat a λ as an unknown parameter and give it a noninformative prior (i.e. Uniform over the positive real line).
Posterior inference is carried out using an MCMC algorithm. The MCMC algorithm for the VAR coefficients, B 0 and the SV processes, conditional on a given ordering, is standard, see  and will not be given here. The algorithm for drawing z, which can then be used to provide draws of ρ using (6), is provided in the Appendix. Intuitively, it will first search over all n variables to draw the variable to be ordered first, then search over the remaining n − 1 variables to find the variable ordered second, etc. This we call the forward step and it leads to a draw we label z 1 j for j = 1, . . . , n. We then repeat the process in reverse ordering (i.e. first drawing the variable ordered last, then the variable ordered second last, etc.), leading to z 2 j for j = 1, . . . , n. This is the backward step. Theoretically, we could use only the forward or the backward step in a valid MCMC algorithm, but we find MCMC convergence to be much better by including both steps. These appear in the conditional posterior for λ. In particular, for i = 1, . . . , n, we draw the ability parameters from: where w i denotes the total number of wins of variable i (i.e. cases where variable i is ordered before another variable). If we let w ij denote the number of cases where i beats j, then w i = n j=1,j =i w ij .
For a λ , we use the Metropolis-Hastings step proposed in Caron and Doucet (2012). It proceeds by taking candidate draws, denoted log (a c λ ), from N (ln (a λ ) , ω 2 ), where ω 2 is a tuning parameter and we set ω 2 = 0.1 2 in our application. The acceptance probability for a c λ is given by where Γ(·) denotes the Gamma function.

Empirical Application
We use the first twenty variables, transformed and ordered in the same way, from the data set used in Chan (2021) but we update the data set so that it runs from 1960Q1 through 2021Q3. For brevity, we do not provide details here but note that four important variables are GDP (GDPC1), industrial production (INDPRO), the unemployment rate (UNRATE) and the consumer price index (CPIAUCSL). The benchmark model is the BVARSV with variables in the same order as in Chan (2021). We refer the reader to the the label on the Y-axis of Figure 1 which lists the variables in the order which they enter the BVARSV. All other modelling choices (i.e. prior and lag length) are the same as in Chan (2021) for both BVARSV and BVARSV-PL. For BVARSV-PL we additionally require the prior over the orderings which is described in the preceding section. The forecasting exercise uses an evaluation period beginning in 1988Q1. Forecasts are iterative for horizons h = 1 and h = 4. To assess forecasting accuracy, we use root mean square forecast errors (RMSFEs) for point forecasts and average log predictive likelihoods (ALPLs) for density forecasts. These are benchmarked so that they represent percentage improvements in forecast performance of the BVARSV-PL relative to the BVARSV.
MCMC convergence was checked by analyzing the difference between multiple Markov chains obtained from different starting values. Twenty Monte Carlo Markov chains, each initiated at a random draw of ρ l , l = 1, . . . , 20 are taken. Initial values for B 0 are either set to 0 or 0.5 depending on the initial value for ρ l . Each Markov chain was run for 20,000 iterations. Convergence of each parameter in ρ was checked with the Gelman and Rubin diagnostics in Gelman and Rubin (1992). Table 1 reports the 10 largest diagnostics. As Brooks and Gelman (1998) have suggested, if diagnostics are less than 1.2, one can be confident that convergence has been reached. Evidence on posterior ordering is presented in Figure 1. This presents the probability (obtained by averaging over MCMC draws) that each variable appears in each possible order. For instance, the fact that box in the row labeled FEDFUNDS and column labelled 1 is dark blue means there is a probability near one that the FEDFUNDs should be ordered first. It turns out that there are many probabilities near one or near zero, with few values in intermediate regions and is choosing an ordering of variables that is very different from that of Chan (2021). Note that the ordering used in Chan (2021), which puts macroeconomic variables first and financial variables last, is commonly used in many large VAR forecasting papers. But we are finding strong evidence that the optimal ordering involves financial variables to be ordered first.  Table 2 shows that estimating the ordering leads to substantial forecast improvement relative to a strategy of following a single conventional ordering of variables. We find that there is not much difference between BVARSV-PL and BVARSV in terms of their point forecasts (i.e. RMSFEs for one model do not consistently beat the other), but that the density forecasts produced by BVARSV-PL are substantially better (i.e. ALPLs are much higher for all variables and for both forecast horizons).

Conclusions
Bayesians commonly work with structural VARs which involve the Cholesky decomposition of the error covariance matrix. But posterior and predictive results can depend on the way the variables are ordered in the VAR. To overcome this problem, while retaining the Cholesky-transformed VAR, this paper has developed an algorithm for finding the optimal ordering of variables. We find the algorithm to lead to forecast improvements relative to a strategy of choosing a particular ordering a priori.

Appendix: MCMC algorithm for ρ
This appendix describes how draws from the orderings are taken from their conditional posterior density: p (ρ | y, A, D, B 0 , z, λ, a λ ).
Our algorithm consists of two steps: the forward step and the backward step.
The forward step: 1) we first decide which variable will take the first place. There are n variables in total, so we compare the n variables. The winner needs to beat all other n − 1 variables. The success of variable i means all other variables will not appear in this equation, that is, the i-th row in B 0 will be zero and the prior will come from the Dirac spike component. The success of variable i also means this variable will appear in all other equations, that is, the i-th column in B 0 will be non-zero and the prior will come from the slab component (except itself, because it always appears in its own equation. This is not influenced by the ordering).
Let b i· denote the i-th row in B 0 and delete the i-th element (which is the variable itself), b ·i denote the i-th column in B 0 and delete the i-th element. Let p i denote the success probability of variable i. Then: where φ denotes the pdf of the Normal distribution. Hence, after normalization, we obtain p i = p i n k=1 p k .
We can sample from the multinomial distribution as follows: a) Create an array p containing the cumulative probabilities of p i , i = 1, 2, · · · , n; b) Generate U , a uniform(0,1) random value; c) Select the first index such that p i > U .
2) conditioning on the first place, we can decide which variable will take the second place. There are n − 1 variables in total, so we compare the n − 1 variables. The winner needs to beat all other n − 2 variables. The success of variable j means all other variables will not appear in this equation, that is, the j-th row in B 0 will be zero and the prior will come from the Dirac spike component (except the variable which takes the first place). The success of variable j also means this variable will appear in all other equations, that is, the j-th column in B 0 will be non-zero and the prior will come from the slab component (also, except the variable which takes the first place).