Automated Parameter Blocking for Efficient Markov-Chain Monte Carlo Sampling

Markov chain Monte Carlo (MCMC) sampling is an important and commonly used tool for the analysis of hierarchical models. Nevertheless, practitioners generally have two options for MCMC: utilize existing software that generates a black-box "one size fits all" algorithm, or undertake the challenging (and time-consuming) task of implementing a problem-specific MCMC algorithm. Either choice may result in inefficient sampling, and hence researchers have become accustomed to MCMC runtimes on the order of days (or longer) for large models. We propose an automated procedure to determine an efficient MCMC algorithm for a given model and computing platform. Our procedure dynamically determines blocks of parameters for joint sampling that result in efficient sampling of the entire model. We test this procedure using a diverse suite of example models, and observe non-trivial improvements in MCMC efficiency for many of them. Our procedure is the first of its kind, and may be generalized to a broader space of MCMC algorithms. Our results suggest that substantive improvements in MCMC efficiency may be practically realized using our automated blocking procedure, or variants thereof, which warrants additional study and application.


Introduction
Markov chain Monte Carlo (MCMC) has become a core computational method for many statistical analyses. These include routine Bayesian analyses, but also hybrid algorithms that use MCMC as one component, such as Monte Carlo Expectation Maximization (MCEM; Caffo, Jank, and Jones, 2005) or data cloning (Lele, Dennis, and Lutscher, 2007). Nevertheless, the automated generation of black-box MCMC algorithms, as occurs in generally available software, does not necessarily result in efficient MCMC sampling. Analysts are thereby accustomed to MCMC run times measured in minutes, hours, or even days for large hierarchical models. Computation time is frequently the limiting factor, either limiting the range of models considered, or limiting the potential for performing diagnostics and comparisons using methods such as bootstrapping (Efron and Tibshirani, 1994), cross validation (Gneiting and Raftery, 2007), or calibration of posterior predictive p-values (Hjort, Dahl, and Steinbakk, 2006), among others. Therefore, any widely applicable improvements to MCMC performance may greatly improve the practical analyses of large hierarchical models.
Among the many MCMC sampling algorithms developed to improve MCMC efficiency, one of the most basic approaches has been block sampling: jointly updating multiple dimensions of a target distribution simultaneously (Roberts and Sahu, 1997; Sargent, Hodges, and Carlin, 2000).
When one or more dimensions of the posterior distribution are correlated, joint sampling of these dimensions (with any variety of block samplers) can increase sampling performance relative to updating each dimension independently (e.g., Liu, Wong, and Kong, 1994). Despite wide recognition of the usefulness of this basic idea for designing efficient MCMC algorithms, there has been no automated method for choosing blocks to optimize, or at least greatly improve, performance.
Here we develop such a method.
Existing theoretical work comparing block samplers to univariate samplers (Mengersen and Tweedie, 1996; Roberts and Tweedie, 1996; Roberts, Gelman, and Gilks, 1997, among others) has provided many insights but falls short of providing a complete assessment of MCMC efficiency for several reasons. First, it uses MCMC convergence rates as the metric for comparison, without consideration of the computational demands of block sampling. Instead, our viewpoint is that any measure of MCMC efficiency must incorporate both the convergence rate and the computational requirements of achieving improvements in convergence rate. This may give a different picture of the actual efficiency of a sampling algorithm. Accelerated convergence at an extreme computational cost is obviously not optimal. Second, the computational requirements of different steps will vary greatly across platforms, depending on such factors as processor, memory architecture, use of efficient linear algebra packages, etc. Therefore, even if theoretical comparisons were extended to incorporate aspects of computation, the best block sampling scheme would remain platform-dependent. It is important to recognize that computational costs affect not only the proposal step, such as the cost of generating a multivariate normal proposal, but also model computations and density evaluations. Some parts of hierarchical models may inherently involve expensive computations, which can impact the relative efficiency of different blocking schemes. Third, existing theories and methods presume that some wise, manual selection of blocks may be feasible, based for example on an understanding of the model structure, which leads to understanding which posterior dimensions may be correlated.
In general, however, it is difficult to know a priori which dimensions will be correlated, which is one purpose of automating a procedure like MCMC in the first place.
Here, we present a procedure for the automated exploration of MCMC blocking schemes, seeking a highly efficient MCMC algorithm specific to the hierarchical model and computing environment at hand. This represents a higher level of automated algorithm generation than is provided by existing software, which serves to produce "one size fits all" MCMC algorithms. The family of BUGS packages (WinBUGS, JAGS, and OpenBUGS; Lunn et al., 2000; Plummer, 2011; Lunn et al., 2012) assigns samplers based on local characteristics of each model parameter, using a combination of Gibbs sampling, adaptive rejection sampling, slice sampling, and, in limited cases, block sampling.
Other MCMC packages, including ADMB (Skaug and Fournier, 2006) and Stan (Stan Development Team, 2014), use Hamiltonian MCMC sampling (Neal, 2011), which may generally be more efficient but nevertheless represents a static approach to MCMC algorithm generation. Yet other promising methods, such as Langevin sampling (Marshall and Roberts, 2012), are not incorporated into software commonly used by practitioners. For simplicity, we restrict our attention to univariate and blocked adaptive random walk sampling. However, the main concept of exploring the space of parameter blocks to improve MCMC efficiency generalizes to allow the use of other sampling methods.
In section 2, we examine the pros and cons of univariate versus block random walk sampling, in terms of both algorithmic and computational efficiencies. From these considerations we conclude that the combination of samplers that yields optimal MCMC efficiency (defined as an MCMC algorithm's rate of generating effectively independent samples) will be model- and platform-specific. Section 3 introduces a procedure for automated blocking of hierarchical model parameters, designed to maximize the resulting MCMC algorithm efficiency. The main idea of this procedure is to iteratively cluster model parameters based upon empirical posterior correlations, then intelligently subdivide the hierarchical clustering tree (a dendrogram) to determine blockings of parameters that result in efficient MCMC sampling. A series of simulated and real-data examples given in section 4 demonstrate that this procedure can improve MCMC efficiency many-fold over statically defined MCMC algorithms, and can dynamically generate algorithms comparable in performance to model-specific, hand-tuned algorithms. We close with a discussion in section 5.

MCMC Algorithms: Definitions and Efficiency
In this section, we first define a space of valid MCMC algorithms. Then, we examine two dominant contributors to the efficiency of an MCMC algorithm: the algorithmic capability to produce effectively independent samples from the target distribution, and the computational demands of the algorithm in generating MCMC samples; these are composed to form our metric of overall MCMC efficiency. Drawing upon existing asymptotic theory of MCMC sampling, the scaling of computational costs of different sampling schemes, and several illustrative examples, we argue that achieving an optimally efficient MCMC algorithm for a specific model by pure theory is prohibitively difficult. That conclusion motivates our approach of computationally optimizing, or at least greatly improving, MCMC performance through automated exploration of a space of valid MCMC algorithms.

MCMC Definition
We assume a given, fixed, hierarchical model M, which may be represented as a directed acyclic graph where vertices represent top-level model parameters, latent states, or fixed observations (data), and edges represent dependencies between these components.Denote the set of all top-level model parameters and latent states (the unknown components for which we may seek inferences) as Θ = {θ 1 , . . ., θ d }, which will be referred to as the parameters of M.
An MCMC algorithm may be defined in terms of its sampling scheme over Θ. Let b be any non-empty subset ("block") of Θ, and u ∈ U be any valid MCMC sampling (or "updating") method such as slice sampling or conjugate Gibbs sampling (see Gilks, 2005, for a broad overview of MCMC sampling methods). We define a valid MCMC sampler ψ = u(b) as the application of u to b, where b satisfies any assumptions of u (e.g., conjugacy). In addition to satisfying standard properties of Markov chains (see, for example, Robert and Casella, 2004), we define a valid MCMC algorithm as any set of samplers Ψ = {ψ_1, . . ., ψ_k}, where ψ_i = u_i(b_i) and b_1 ∪ · · · ∪ b_k = Θ; that is, the MCMC algorithm updates each model parameter in at least one sampler. We represent the chain of samples generated from successive applications of Ψ as X^(0), X^(1), . . ., where sample X^(t) = (X_1^(t), . . ., X_d^(t)), and {X_k^(0), X_k^(1), . . .} is the scalar chain of samples of θ_k (for k = 1, . . ., d). This paper focuses attention on the restricted set of sampling methods U_0 = {u_scalar, u_block}, where u_scalar denotes univariate adaptive random walk Metropolis-Hastings sampling (hereafter, scalar sampling; Metropolis et al., 1953; Hastings, 1970), and u_block denotes the multivariate generalization of this algorithm (hereafter, block sampling; Haario, Saksman, and Tamminen, 1999), with the practical restriction that any ψ = u_block(b) satisfies |b| > 1. The u_scalar algorithm adaptively tunes the proposal scale, while u_block additionally tunes the proposal covariance (Roberts and Rosenthal, 2009). All scalar and block samplers asymptotically achieve the theoretically optimal scaling of proposal distributions (and therefore acceptance rates, and mixing) as derived in Roberts, Gelman, and Gilks (1997), and implement adaptation routines as set out in Shaby and Wells (2011).
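The adaptive scalar sampler u_scalar can be sketched in a few lines. The Python sketch below is illustrative only (the paper's implementations use NIMBLE/R, and the adaptation constants here are assumptions, not the paper's): it tunes the proposal scale toward the asymptotically optimal acceptance rate of roughly 0.44 for one-dimensional targets.

```python
import numpy as np

def u_scalar(logpdf, x0, n_iter=5000, adapt_interval=200, seed=0):
    """Univariate adaptive random-walk Metropolis-Hastings (a sketch of u_scalar).

    The proposal scale is periodically adjusted toward a ~0.44
    acceptance rate, the optimum for one-dimensional targets.
    """
    rng = np.random.default_rng(seed)
    x, scale = x0, 1.0
    chain, accepts = [], 0
    for t in range(1, n_iter + 1):
        prop = x + scale * rng.standard_normal()
        # Metropolis-Hastings decision for a symmetric random-walk proposal
        if np.log(rng.random()) < logpdf(prop) - logpdf(x):
            x, accepts = prop, accepts + 1
        chain.append(x)
        if t % adapt_interval == 0:
            # simple adaptation step: grow/shrink the proposal scale
            rate = accepts / adapt_interval
            scale *= np.exp(0.1 if rate > 0.44 else -0.1)
            accepts = 0
    return np.array(chain), scale
```

For instance, `u_scalar(lambda x: -0.5 * x * x, 0.0)` samples a standard normal target; u_block generalizes this by proposing from a multivariate normal with an adapted covariance.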
For hierarchical model M with parameters Θ, our studies focus almost exclusively on the set of MCMC algorithms Ψ M , which contains all algorithms of the form Ψ = {ψ 1 , . . ., ψ k }, where ψ i = u i (b i ), each u i ∈ U 0 , and the sets b i form a partition of Θ.We specifically name two algorithms in Ψ M which are boundary cases.The first consists of d scalar samplers: Ψ scalar = {ψ 1 , . . ., ψ d }, where each ψ i = u scalar (θ i ).The second has a single block sampler for all parameters: Ψ block = {u block (Θ)}.
Implicit is the assumption that each θ i is continuous-valued, which is the case throughout this paper.

Algorithmic Efficiency
We first consider MCMC algorithmic efficiency, independent of any computational requirements. This measure of efficiency solely represents the best mixing, or equivalently the least autocorrelation, or the highest effective sample size, without consideration for the computational (time) requirements of generating a set of samples. After reviewing the definition of MCMC algorithmic efficiency, which is based upon integrated autocorrelation time, we study the use of Ψ_scalar or Ψ_block for particular choices of M, and quantify the effects on this measure of efficiency.
As in Roberts and Rosenthal (2001), we define MCMC algorithmic efficiency as the effective sample size divided by the chain length. This represents the rate of production of effectively independent samples per MCMC sample. The effective sample size (ESS) of an MCMC chain is defined as ESS = N/τ, where N is the chain length and τ is the integrated autocorrelation time.
For a scalar chain of samples X_0, X_1, . . ., which is assumed to have converged to its stationary distribution, Straatsma, Berendsen, and Stam (1986) define the integrated autocorrelation time as τ = 1 + 2 Σ_{k=1}^∞ ρ_k, where ρ_k is the lag-k autocorrelation of the chain. τ may be interpreted as the number of MCMC samples required, on average, for an independent sample to be drawn. Our measure of algorithmic efficiency is thus τ^(-1), the number of effective samples per actual sample (Thompson, 2010). τ^(-1) also characterizes the speed at which expectations of arbitrary functions of the sample values approach their stationary values (Roberts and Sahu, 1997), and moreover satisfies the natural intuition that larger values indicate better performance.
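These quantities are straightforward to estimate empirically. The Python sketch below (hypothetical helper names; the paper itself uses coda's effectiveSize) approximates τ by summing empirical autocorrelations until they become negligible, then derives ESS = N/τ. The 0.05 truncation threshold is an illustrative assumption.

```python
import numpy as np

def integrated_autocorr_time(x, cutoff=0.05, max_lag=None):
    """Estimate tau = 1 + 2 * sum_k rho_k by summing empirical lag-k
    autocorrelations until they first drop below `cutoff` (a simple
    truncation rule; coda's effectiveSize instead fits an AR model)."""
    x = np.asarray(x, float)
    x = x - x.mean()
    denom = x @ x
    tau = 1.0
    for k in range(1, max_lag or len(x) // 2):
        rho_k = (x[:-k] @ x[k:]) / denom   # empirical lag-k autocorrelation
        if rho_k < cutoff:                 # truncate once correlations are negligible
            break
        tau += 2.0 * rho_k
    return tau

def effective_sample_size(x):
    """ESS = N / tau: effective number of independent samples in the chain."""
    return len(x) / integrated_autocorr_time(x)
```

For an i.i.d. chain τ ≈ 1 and ESS ≈ N; for a strongly autocorrelated chain τ grows and ESS shrinks accordingly.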
For MCMC algorithm Ψ acting on model M with parameters Θ, we define the algorithmic efficiency of each θ ∈ Θ as A(Ψ, θ) = τ^(-1), where τ is the integrated autocorrelation time of the samples of θ generated from repeated application of Ψ. Overloading notation, we define the algorithmic efficiency of MCMC algorithm Ψ as A(Ψ) = min_{θ∈Θ} A(Ψ, θ). This definition is motivated by noting that often an MCMC produces seemingly good mixing of many model dimensions but poor mixing of just a few dimensions. In this case, the poorly mixing dimensions will limit the validity of the entire posterior sample (although this is not universally true of all model structures).
Therefore, we take the conservative approach, and our general aim is to maximize the algorithmic efficiency for the parameter exhibiting the slowest mixing.
We now study the potential for algorithmic inefficiency that may result from scalar or block sampling, by examining situations in which each is particularly ill-suited.

Efficiency loss from block sampling
Consider MCMC algorithm Ψ_block ∈ Ψ_M. Application of Ψ_block generates a random proposal vector X* ∈ R^d, then jointly accepts or rejects X*. In the framework of the underlying sampling method, proposals are drawn as X* ~ N_d(µ, σ_d^2 Σ), where µ and Σ vary according to the current state of the MCMC chain and properties of the target stationary distribution, but not proportionally to d. Roberts, Gelman, and Gilks (1997) show that in order to achieve the asymptotically optimal acceptance rate, and therefore sample chain mixing, σ_d^2 ∝ d^(-1). As a consequence of this attenuation in the proposal variance, the rate at which Ψ_block explores R^d, and accordingly its algorithmic efficiency, is inversely proportional to d. This result applies equivalently to block samplers ψ = u_block(b) acting on subsets b ⊂ Θ, where the algorithmic efficiency (for the elements of b) achieved by application of ψ is inversely proportional to the number of elements in b.
This result has an important implication for block sampling. All other factors being equal (e.g., the effect of posterior correlations), a block sampler of dimension k must generate k times more samples to achieve the same effective sample size as samples produced through scalar sampling (Roberts and Rosenthal, 2001). This inefficiency is inherent to block sampling and scales with the dimension of any block sampler.

Efficiency loss from scalar sampling
To study the potential loss of algorithmic efficiency which may result from scalar sampling under Ψ_scalar ∈ Ψ_M, we consider correlated posterior distributions. It is well understood that strong posterior correlations can retard the speed of convergence of MCMC sampling (e.g., Roberts and Sahu, 1997; Gilks, 2005), and that block sampling can alleviate this. However, the nature of this inefficiency is not characterized, in particular as a function of the degree of correlation and the number of dimensions exhibiting correlation. We undertake a simulation study to gauge how these factors affect algorithmic efficiency. Consider target distribution N_d(0, Σ), where the covariance (equivalently, correlation) matrix Σ consists of 1s on the main diagonal and ρ elsewhere, |ρ| < 1.
We consider the algorithmic efficiencies of individual model parameters under scalar sampling, A(Ψ_scalar, θ). Empirically, we observe that each A(Ψ_scalar, θ) tends toward zero as ρ approaches one, or as d diverges (for ρ > 0). The nature of these relationships is characterized on a log-log scale in Figure 1 (left), where the horizontal axis plots −log(1 − ρ), such that positive horizontal shifts represent ρ exponentially approaching the boundary ρ = 1, or perfect correlation between parameters. As one would expect, when ρ = 0 all values of d yield identical algorithmic efficiency. However, when ρ > 0 we enter a linear regime where each A(Ψ_scalar, θ) exponentially decays towards zero. Even for fixed d, algorithmic efficiency under Ψ_scalar can be arbitrarily poor as ρ approaches unity. It may be extreme to assume a target distribution with arbitrarily large blocks of parameters that exhibit arbitrarily high pairwise correlation. As an alternative, we consider the same multivariate normal form, but with elements of Σ given as σ_{i,j} = ρ^|i−j|, |ρ| < 1, as might occur in the covariance structure of a spatial model (Banerjee, Carlin, and Gelfand, 2003, p. 27). The algorithmic efficiency of the first model parameter (since elements of Θ are no longer exchangeable) is shown in Figure 1 (right), where it displays similar attenuation as in the prior example. Most notably, we now observe the incremental effect of d diminishing as d increases, consistent with the covariance structure.
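The scalar-sampling inefficiency described above can be reproduced with a small simulation. The Python sketch below is illustrative, not the paper's code: it applies one univariate random-walk Metropolis step per coordinate of N_d(0, Σ) (a non-adaptive stand-in for Ψ_scalar; the fixed proposal scale 2.4 is an assumption) and records the chain of θ_1, whose autocorrelation worsens as ρ grows.

```python
import numpy as np

def scalar_rwm_on_mvn(rho, d=2, n_iter=4000, scale=2.4, seed=0):
    """Update each coordinate of N_d(0, Sigma) in turn with a univariate
    random-walk Metropolis step, where Sigma has 1 on the diagonal and
    rho elsewhere. Returns the sample chain of the first coordinate."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), rho) + (1 - rho) * np.eye(d)
    P = np.linalg.inv(Sigma)                  # precision matrix
    logpdf = lambda x: -0.5 * x @ P @ x
    x = np.zeros(d)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        for i in range(d):                    # one scalar MH step per coordinate
            prop = x.copy()
            prop[i] += scale * rng.standard_normal()
            if np.log(rng.random()) < logpdf(prop) - logpdf(x):
                x = prop
        chain[t] = x[0]
    return chain

def lag1_autocorr(c):
    """Empirical lag-1 autocorrelation of a scalar chain."""
    c = c - c.mean()
    return (c[:-1] @ c[1:]) / (c @ c)
```

Comparing `lag1_autocorr` for ρ = 0 versus ρ close to 1 exhibits, in miniature, the efficiency decay plotted in Figure 1.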
Through a combination of theory and simulated examples, we observe that the algorithmic efficiency achieved under Ψ_scalar or Ψ_block will depend non-trivially upon the model dimension, and the extent and structure of the posterior correlation. We have only considered two simple, highly modular and systematic posterior correlation structures. In practice, any model of interest will demonstrate a substantially more complex correlation structure (which is unknown, anyway), making a full analytical study of MCMC algorithmic efficiency difficult. Moreover, we have only considered the two boundary-case algorithms Ψ_scalar and Ψ_block in Ψ_M. Our desire to derive general results for algorithmic efficiency will be complicated substantially further when we consider the complete set Ψ_M.

Computational Efficiency
While section 2.2 considered the efficiency of MCMC algorithm Ψ at producing independent samples without regard for computation time, we now consider the computational requirements of Ψ, measured in units of algorithm runtime per MCMC iteration. We focus on computations for model density evaluations. There are also book-keeping and loop-iteration costs, which we label as algorithm overhead. These overhead terms comprise a small fraction of total computation, and we safely disregard them. Denote the computational requirement of Ψ as C(Ψ), and again overload notation to define C(ψ) as the computational requirement of a single sampler ψ. For Ψ = {ψ_1, . . ., ψ_k}, C(Ψ) = Σ_{i=1}^k C(ψ_i). As far as we are aware, an analysis of MCMC efficiency which incorporates C(Ψ) has not been carried out to date. Literature which does address MCMC efficiency typically recognizes that a computational aspect exists, but then focuses solely on A(Ψ), e.g., Roberts and Sahu (1997).
We now present an accounting of the main contributions to C(Ψ scalar ) and C(Ψ block ), general for any M.To support our accounting, we denote the set of all fixed and known components of M (e.g., observations, other data) as Y , which is disjoint from the unknown set of parameters Θ = {θ 1 , . . ., θ d }.For each θ i ∈ Θ, let x i ⊂ Θ∪Y be the set of model components which immediately depend on θ i , or the set of direct descendants of θ i in the model graph introduced in section 2.1.
Finally, we denote the computational cost of evaluating the density functions corresponding to any subset x ⊂ Θ ∪ Y as dens(x).

Scalar Sampling Computation
On each iteration of Ψ_scalar = {ψ_1, . . ., ψ_d}, each sampler ψ_i will incur computational cost C(ψ_i) = dens(θ_i) + dens(x_i) + O(1). The trailing constant term represents generation of the proposal value and making the MH decision (generation from normal and uniform distributions, respectively). The total computational requirement of Ψ_scalar is thus C(Ψ_scalar) = Σ_{i=1}^d [dens(θ_i) + dens(x_i)] + O(d). Note that under Ψ_scalar, each density evaluation dens(θ_i) must occur independently. This is true even when the evaluation of a particular dens(θ_i) term necessitates the calculation of a subsuming multivariate density, in the most extreme case dens(Θ). Thus, in the worst case, a single MCMC iteration of Ψ_scalar could incur d evaluations of the entire joint density of Θ. A similar computational explosion can result from the calculation of each dens(x_i) term.

Block Sampling Computation
We now consider the components of C(Ψ_block). On each iteration of Ψ_block, the sole sampler u_block(Θ) requires evaluation of the complete prior and likelihood densities, dens(Θ ∪ Y). This is notably different from the density evaluation terms appearing in C(Ψ_scalar), in that it incurs only a single evaluation of the complete joint model density. In addition, the adaptation routine of u_block(Θ) requires a Cholesky factorization of the adapted covariance matrix, which requires O(d^3) operations to calculate in full generality (Trefethen and Bau, 1997, p. 176). This factorization occurs every AI iterations, where AI is the adaptation interval of u_block, and therefore has an amortized computational cost of O(d^3/AI). Each iteration of u_block requires generating a d-dimensional multivariate normal proposal, which requires O(d^2) operations, and performing a single constant-time MH decision, which is dropped as a lower-order term. The total amortized computational requirement of Ψ_block may be written as C(Ψ_block) = dens(Θ ∪ Y) + O(d^2) + O(d^3/AI).
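To make the competing cost terms concrete, the toy Python cost model below encodes C(Ψ_scalar) as a sum of d per-sampler costs and C(Ψ_block) as one joint density evaluation plus the O(d^2) proposal and amortized O(d^3/AI) Cholesky terms. All unit costs here are hypothetical placeholders, not measurements; real costs are model- and platform-specific, which is precisely the paper's point.

```python
def cost_scalar(d, dens_theta=1.0, dens_desc=1.0):
    """Illustrative per-iteration cost of Psi_scalar: each of the d scalar
    samplers evaluates dens(theta_i) and dens(x_i), plus O(1) overhead.
    The unit costs are hypothetical placeholders."""
    return d * (dens_theta + dens_desc + 1.0)

def cost_block(d, dens_joint=None, adapt_interval=200):
    """Illustrative amortized per-iteration cost of Psi_block: one joint
    density evaluation, an O(d^2) multivariate normal proposal, and an
    amortized O(d^3 / AI) Cholesky factorization."""
    if dens_joint is None:
        dens_joint = 2.0 * d   # assume joint density cost scales linearly in d
    return dens_joint + d ** 2 + d ** 3 / adapt_interval
```

Under these toy assumptions the block cost grows superlinearly in d while the scalar cost grows linearly, so their relative magnitude flips depending on d and the true density costs.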

Timing Comparison
We have seen that C(Ψ_block) is at least quadratic in d, and technically cubic but with a small leading coefficient. Depending on the distributional structure of Θ, the density evaluations comprising C(Ψ_scalar) may be unwieldy. The relative magnitude of these competing terms is difficult to gauge intuitively, so to gain practical insight, we perform a timing study of the Ψ_scalar (All Scalar) and Ψ_block (All Blocked) algorithms. Three models involving no likelihood components are considered, with prior structures on Θ given as:

Overall Efficiency
We have examined both the algorithmic and computational efficiency of MCMC algorithms, each of which fundamentally affects overall MCMC efficiency. We define the overall efficiency of Ψ simply as E(Ψ) = A(Ψ)/C(Ψ). We consider this to be a sensible measure of the overall efficiency of an MCMC algorithm, since E(Ψ) may be interpreted as the number of effective samples produced per unit of MCMC algorithm runtime, for the slowest mixing model parameter. If we can construct Ψ to maximize E(Ψ), then Ψ is the most time-efficient MCMC algorithm for generating effectively independent samples to approximate the true, joint posterior distribution of Θ.
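Selecting among candidate algorithms under this definition is simply an argmax of A/C. A minimal Python sketch, where the candidate labels and (A, C) pairs are hypothetical measured values, not results from the paper:

```python
def overall_efficiency(A, C):
    """E(Psi) = A(Psi) / C(Psi): effective samples per unit runtime,
    for the slowest-mixing model parameter."""
    return A / C

def best_algorithm(candidates):
    """Pick the candidate maximizing overall efficiency. `candidates`
    maps a label to a pair (A, C) of algorithmic efficiency and
    per-iteration computational cost."""
    return max(candidates, key=lambda k: overall_efficiency(*candidates[k]))
```

For example, a blocking with modest mixing but cheap iterations can dominate one with excellent mixing bought at extreme computational cost.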
That being said, from our examination of algorithmic and computational efficiency, it is not immediately clear how to balance tradeoffs between A(Ψ) and C(Ψ) to maximize E(Ψ). We have generally considered the two boundary-case algorithms Ψ_scalar and Ψ_block, but a huge number of intermediate algorithms exist. For Ψ ∈ Ψ_M, we may gain useful insights regarding the factors affecting E(Ψ) in terms of the properties of each ψ_i: C(Ψ) = Σ_{i=1}^k C(ψ_i), and the values A(Ψ, θ_i) which determine A(Ψ) each result from a single application of scalar or block sampling, although this neglects the phenomenon whereby improving A(Ψ, θ_i) may affect A(Ψ, θ_j), i ≠ j. However, finding a global optimum Ψ_opt = argmax_{Ψ∈Ψ_M} E(Ψ) poses a combinatorial challenge. Instead of seeking Ψ_opt, we now propose an iterative procedure to navigate Ψ_M, with the aim of maximizing E(Ψ) to the degree possible.

Automated Blocking
In this section, we propose an iterative, self-tuning procedure for automated blocking of hierarchical model parameters to produce an efficient MCMC algorithm. This procedure uses the empirical posterior correlation to cluster groups of correlated parameters into sampling blocks. A hierarchical clustering tree of model parameters is constructed, and subsequently cut at some height (selected using a finite search) to produce parameter groups each exhibiting a minimal intra-group posterior correlation. This procedure is iterated, so that as MCMC efficiency improves, the empirical posterior correlations become more accurate, and the resulting tree and parameter groups stabilize. The end result is a partition of the model parameters, which uniquely specifies an MCMC algorithm Ψ ∈ Ψ_M employing scalar and block sampling, for which the overall efficiency E(Ψ) (section 2.4) is increased to the degree possible. We also discuss more sophisticated approaches, but our heuristic approach allows huge efficiency gains in some cases and establishes the basic procedure.
The procedure begins with the initial MCMC algorithm Ψ_0 = Ψ_scalar, or scalar sampling of all model parameters; lacking prior information, this is used as the starting point for gaining insight about the posterior correlation structure. Subsequent iterations are based upon the empirical posterior correlation produced in the previous iteration, and, to a varying degree, will institute blocks for parameter groups exhibiting sufficiently high intra-group correlations.
Prior to calculating the empirical correlation terms ρ_{j,k}, we discard the seemingly excessive and somewhat arbitrary initial 50% of all samples. This should not be confused with a traditional "burn-in," whose purpose is to "forget" the initial state and ensure convergence to the target distribution.

Algorithm 1 Automated Blocking
1: i ← 0
2: Ψ_0 ← Ψ_scalar
3: loop:
   Generate samples of Θ under Ψ_{i−1}, where X_j represents the sample chain of θ_j
6:   Discard initial 50% of each chain X_j
7:   Construct distance matrix D from elements d_{j,k}
   . . .
10:  Construct hierarchical clustering tree T from D
   . . .
Instead, discarding these initial samples allows all adaptive scalar and block samplers ample time to self-tune, and thereby achieve their theoretically optimal algorithmic efficiency. The choice of 50% is largely arbitrary, and excessive in most cases; it could almost certainly be relaxed without affecting algorithm performance.
Empirical correlations are transformed into distances using the transformation d j,k = 1 − |ρ j,k |.
The form of this transformation is selected to induce several properties for elements of the distance matrix D: the main diagonal consists of zeros; strong correlation results in d ≈ 0; weak or zero correlation results in d ≈ 1; and correlations of ρ and −ρ result in the same distance.
We use the R function hclust to create the hierarchical tree T from the distance matrix D. The default "complete linkage" clustering (Everitt, 2011, chapter 4) is appropriate, since this ensures that all parameters within each cluster have a minimum absolute pairwise correlation.At height h ∈ [0, 1] in T , the absolute correlation between parameter pairs (within clusters) is at least 1 − h.
We use the R function cutree for cutting the hierarchical clustering tree T at a specified height h ∈ [0, 1] to produce disjoint parameter groupings, which may be used to define parameter blocks for the purpose of MCMC sampling. This means of generating parameter sampling blocks is justified because, to increase algorithmic efficiency, we strive to group correlated parameters into sampling blocks, which is exactly the effect of cutting T at any particular height.
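The hclust/cutree steps have a direct Python analogue, sketched below using SciPy. This is an illustrative mirror of the paper's R workflow, not its implementation: correlations become distances d_{j,k} = 1 − |ρ_{j,k}|, a complete-linkage tree is built, and cutting at height h yields blocks whose intra-block absolute pairwise correlations are at least 1 − h.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def blocks_from_correlation(R, h):
    """From an empirical posterior correlation matrix R, return parameter
    index blocks obtained by cutting a complete-linkage tree at height h."""
    D = 1.0 - np.abs(np.asarray(R, float))   # d_jk = 1 - |rho_jk|
    np.fill_diagonal(D, 0.0)
    # condensed distance vector -> complete-linkage hierarchical tree
    T = linkage(squareform(D, checks=False), method="complete")
    # flat clusters whose cophenetic distance is at most h (cutree analogue)
    labels = fcluster(T, t=h, criterion="distance")
    return [sorted(int(i) for i in np.where(labels == g)[0])
            for g in np.unique(labels)]
```

Cutting at h = 0 recovers all-singleton blocks (Ψ_scalar) and cutting at h = 1 merges everything into one block (Ψ_block), matching the boundary cases in the text.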
We define the MCMC algorithm Ψ(T_cut=h) ∈ Ψ_M as the unique MCMC algorithm defined by scalar and/or block sampling applied to the parameter blocks that result from cutting T at height h. We note that for all T, Ψ(T_cut=0) = Ψ_scalar, and Ψ(T_cut=1) = Ψ_block.
There is no universally optimal value of the cut height h, as our findings in section 2 imply that any h ∈ [0, 1] may maximize the efficiency E(Ψ(T cut=h )) for a particular model M. We recognize that a combination of distinct cut heights applied to different branches of T may produce the maximal efficiency, but we do not consider such strategies herein.
Rather than attempting to infer what might be an appropriate cut height for model M, we consider a range of potential cut heights, and the resulting MCMC algorithms. These comprise Ψ_cand, the candidate set of MCMC algorithms for a particular iteration. This approach allows the blocking procedure to adjust according to the model, the posterior correlation structure, and the underlying computational architecture. The MCMC algorithm for each iteration (i ≥ 1) is selected from among Ψ_cand as that which produces the maximal overall efficiency.
To estimate the integrated autocorrelation time, and hence the algorithmic efficiency, of a chain of MCMC samples, we use the effectiveSize function in the R coda package (Plummer et al., 2006). The approach underlying this function, using a fitted autoregressive model to estimate the spectral density at frequency zero, has been seen to converge fastest among several methods to the true integrated autocorrelation time (Thompson, 2010).
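A rough Python analogue of this spectral approach, deliberately simplified to an AR(1) fit (coda's effectiveSize fits a higher-order autoregressive model, so this is only a sketch under that assumption): for an AR(1) process with coefficient φ, the ratio of the spectral density at frequency zero to the variance gives τ = (1 + φ)/(1 − φ).

```python
import numpy as np

def ess_ar1(x):
    """Crude analogue of coda's effectiveSize: fit an AR(1) model via the
    lag-1 autocorrelation, estimate tau = (1 + phi) / (1 - phi) from the
    spectral density at frequency zero, and return N / tau."""
    x = np.asarray(x, float)
    x = x - x.mean()
    phi = (x[:-1] @ x[1:]) / (x @ x)   # AR(1) coefficient estimate
    tau = (1 + phi) / (1 - phi)        # integrated autocorrelation time
    return len(x) / max(tau, 1e-12)
```

On an AR(1) chain with φ = 0.8 this recovers τ ≈ 9, i.e., roughly N/9 effective samples.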
As E(Ψ_i) increases through the course of iterations, improved estimates of the posterior correlation are produced, giving the potential for more refined parameter blockings, and thus progressive increases in E(Ψ_i) in subsequent iterations. This iterative procedure continues until either (1) the same MCMC algorithm is selected in consecutive iterations (the parameter blocking has stabilized), or (2) E(Ψ_i) < E(Ψ_{i−1}) (efficiency decreases between iterations). In practice, our procedure typically halts with terminating condition (1). This may be concurrent with terminating condition (2), on account of stochastic variation in sampling and/or runtime.
We select the output from our automated blocking procedure as Ψ_AutoBlock, the MCMC algorithm selected in the final iteration. In our experience, Ψ_AutoBlock is typically identical to the MCMC algorithm of the second-to-last iteration; that is, the procedure has converged to a stationary state. If a situation arises where the final iteration produces a different MCMC algorithm with efficiency inferior to that of the previous iteration, then prudence would suggest a thoughtful examination of the posterior samples, empirical correlation matrices, properties of the adapted samplers, convergence diagnostics, etc.

Automated Blocking Performance
We now compare the performance of MCMC algorithms produced using the automated blocking procedure of section 3 against various static MCMC algorithms. First, we describe the computing environment in which our analyses are performed. We then describe a broadly representative suite of example models, and present the performance results of automated and static MCMC algorithms for each. A public GitHub repository containing scripts for reproducing our results may be found at https://github.com/danielturek/automated-blocking-examples.

Computing Environment
Since one of our points is that optimal design of MCMC algorithms depends on the computing environment, we briefly summarize the software tools and computing platform used. All statistical models and MCMC algorithms were built using the NIMBLE package (NIMBLE Development Team, 2014) for R (R Core Team, 2014). NIMBLE allows hierarchical models to be defined within R using the BUGS model declaration syntax introduced by the BUGS project (Lunn et al., 2000; Lunn et al., 2012). MCMC algorithms in NIMBLE are written using NIMBLE's domain-specific language for specifying hierarchical model algorithms. This language is an enhanced subset of R (interfaced through an R session) which is compiled into C++ code, which is subsequently compiled and run.
As a result, the examples here use highly efficient code generated automatically for each model and algorithm. Of particular importance is that matrix operations are done via the highly optimized C++ Eigen library (Guennebaud and Jacob, 2010). Finally, the high-level programmability provided by R facilitated the dynamic exploration of MCMC algorithms. Examples were run using R version 3.1.2, using the BLAS (Basic Linear Algebra Subprograms) provided by R for multivariate density calculations and simulation, and running under Macintosh OSX version 10.9.5 on a 2.5 GHz Intel Core i7 processor.

Model Descriptions
We tested the automated blocking procedure on seven examples, including real-data and toy models, and compared the results against standard MCMC algorithms. When there are obvious "hand-tuned" algorithms that a seasoned MCMC practitioner would consider for a particular model, we included those as well. For the toy models, the goal was to construct posterior distributions with specific correlation structures, as described below. In these cases the models are simply prior distributions without any likelihood component.

Varying Size Blocks of Fixed Correlation
This model structure demonstrates parameter groups of varying size, where each group exhibits a fixed intra-group pairwise correlation. The model contains N = 64 parameters, half of which are grouped to have pairwise posterior correlation ρ. This is accomplished using a multivariate normal prior distribution with the appropriate covariance (equivalently, correlation) matrix, which in the absence of a likelihood term fully determines the posterior distribution for these 32 parameters.
Similarly, additional disjoint groups of correlated parameters are constructed of sizes 16, 8, 4, and 2. The remaining two parameters are uncorrelated with all others, specified using univariate normal prior distributions. We consider three values of the intra-group correlation: ρ = 0.2, 0.5, and 0.8. As these models have no likelihood, we are using MCMC to sample from the prior distribution. We note that the dependence structure is the same as the block-diagonal covariance structure (with the blocks having compound symmetry) obtained when analytically integrating over the exchangeable prior means of clustered random effects. This example thereby mimics the structure found in basic multilevel hierarchical models, albeit without the explicit computational expense of a likelihood calculation.
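The block-diagonal, compound-symmetry correlation structure described above is straightforward to construct. The following is an illustrative Python/NumPy sketch (the paper's models are written in BUGS/NIMBLE, not Python); `block_corr_matrix` is a hypothetical helper name.

```python
import numpy as np

def block_corr_matrix(rho, sizes=(32, 16, 8, 4, 2), n_indep=2):
    """Correlation matrix with disjoint compound-symmetry blocks of the
    given sizes (common intra-block correlation rho) plus independent
    trailing parameters: the toy model's N = 64 prior structure."""
    n = sum(sizes) + n_indep
    R = np.eye(n)
    start = 0
    for s in sizes:
        # compound symmetry: rho off the diagonal, 1 on the diagonal
        R[start:start + s, start:start + s] = rho
        np.fill_diagonal(R[start:start + s, start:start + s], 1.0)
        start += s
    return R

R = block_corr_matrix(0.5)
# draw from the prior (= posterior, since there is no likelihood term)
samples = np.random.default_rng(1).multivariate_normal(np.zeros(64), R, size=1000)
```

Compound-symmetry blocks with 0 < ρ < 1 are guaranteed positive definite (eigenvalues 1 − ρ and 1 + (s − 1)ρ), so the matrix is a valid multivariate normal covariance for all three ρ values considered.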
Generalized Linear Mixed Model
(1997) and follow the analysis of Zipunnikov and Booth (2006). The dataset contains 968 counts of senior-citizen clinical visits, which are modeled as Poisson counts. The linear predictor contains fixed and random effects, using a variety of covariates and including several interaction terms.

Performance Results
We present three quantities to gauge the performance of MCMC algorithm Ψ. Rather than algorithmic efficiency A(Ψ), for convenience of interpretation we present the proportional quantity ESS = 10,000 · A(Ψ), where ESS denotes effective sample size. This scaling of A(Ψ) has a natural interpretation as the number of effective samples (for the slowest mixing parameter) which result from a chain of 10,000 MCMC samples. Similarly, to represent the computational requirement C(Ψ), we present the proportional quantity Runtime = 10,000 · C(Ψ), interpretable as the time (in seconds) required to generate 10,000 MCMC samples. We directly present the overall MCMC efficiency as Efficiency = ESS / Runtime = A(Ψ)/C(Ψ) = E(Ψ), which is independent of any scaling, and maintains the intuitive interpretation as the number of effective samples generated per second of algorithm runtime (again, for the slowest mixing parameter). MCMC sampling is performed using a fixed random number seed and identical initial values for each model, so identical MCMC algorithms will produce identical sample chains, and hence identical ESS, but not necessarily identical Runtime or Efficiency, on account of run-to-run variation in algorithm runtime. We observe the automated procedure producing the same MCMC algorithm across repeated experiments, with numerical results for Runtime and Efficiency varying less than 5% from those presented herein.
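The performance quantities above can be computed from a sample chain and a measured runtime. The paper does not specify which ESS estimator it uses; the sketch below uses one standard choice (truncating the autocorrelation sum at the first non-positive term), and the function names are illustrative.

```python
import numpy as np

def ess(x):
    """Effective sample size via the initial-positive-sequence
    autocorrelation estimator (one standard choice; an assumption here)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    # sample autocorrelation function at lags 0, 1, ..., n-1
    acf = np.correlate(x, x, mode='full')[n - 1:] / (x @ x)
    s = 0.0
    for k in range(1, n):
        if acf[k] <= 0:       # truncate at the first non-positive lag
            break
        s += acf[k]
    return n / (1.0 + 2.0 * s)

def mcmc_efficiency(chains, runtime_seconds):
    """E(Psi): effective samples per second for the slowest-mixing
    parameter; chains has shape (n_samples, n_params)."""
    worst = min(ess(chains[:, j]) for j in range(chains.shape[1]))
    return worst / runtime_seconds
```

Scaling `ess` by 10,000/n and the measured runtime by 10,000/n recovers the ESS and Runtime summaries reported in the tables.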
For each example model M, we present results for MCMC algorithm Ψ block, denoted "All Blocked," and for Ψ scalar, denoted "All Scalar," noting that Ψ scalar also represents the initial state (0th iteration) of the automated blocking procedure. The maximally efficient algorithm generated via automated blocking is presented as "Auto Blocking," which will generally represent a dynamically determined blocking scheme. We also present a third static MCMC algorithm, which is not necessarily a member of Ψ M on account of the possible use of conjugate sampling. This algorithm assigns block samplers to groups of parameters arising from multivariate distributions, scalar samplers to parameters arising from univariate distributions, and conjugate samplers whenever the structure of M permits; this static algorithm may be more representative of the default MCMC algorithms provided by software packages, and is denoted "Default." Finally, for several example models we include an informed blocking of the model parameters, based upon expert or prior knowledge, which is referred to as "Informed Blocking." Results for the random effects model also include the "Informed Cross-Level" MCMC algorithm, which makes use of cross-level sampling and is not in Ψ M.

Varying Size Blocks of Fixed Correlation
The left pane of Figure 3 displays the Efficiency performance for the model structures containing varying sized blocks of fixed correlation. For ρ = 0.2, the Auto Blocking algorithm selects cut height h = 0, which corresponds to re-selecting the All Scalar algorithm. Since this MCMC algorithm is identical to the initial state, the automated procedure terminates there. The All Blocked scheme actually runs faster, but the algorithmic efficiency loss inherent to large block sampling dominates, resulting in Efficiency approximately four times lower. For larger values of ρ, the All Scalar algorithm suffers progressively more, since it fails to institute any blocking in the presence of increasing correlations. For ρ = 0.5 and 0.8, the Auto Blocking algorithm selects cut heights h = 0.6 and h = 0.3, respectively, each of which places exactly the correlated terms into sampling blocks. In every case, the slowest mixing parameter is from among the largest correlated group of 32 parameters.

Fixed Size Blocks of Varying Correlation
The right pane of Figure 3 presents results for the model structure containing fixed-size parameter groupings with correlations between 0 and 0.9. For each model size, the automated blocking procedure selects a particular cut height (and hence, MCMC algorithm) twice consecutively, thus terminating on the third iteration. The cut heights selected for models N = 20, 50, and 100 are h = 0.5, 0.8, and 0.9, respectively (not shown), progressively pushing more of the correlated parameter groupings into sampling blocks. The Auto Blocking algorithm increases Efficiency by factors of 4.5, 7, and 21 in the three models, relative to the static All Scalar and All Blocked algorithms.

Random Effects Model
In the random effects model (Table 1), automated blocking generates an MCMC algorithm identical to the Informed Blocking algorithm (blocking each α i , β i pair), which produces a tenfold improvement in Efficiency over the most efficient static algorithm, which for this model is All Scalar sampling. The selected cut height h = 0.1 indicates that only the α i , β i pairs exhibit posterior correlations above 0.9. The Informed Cross-Level algorithm requires a substantially longer Runtime but produces a high ESS, resulting in nearly identical Efficiency to the efficiently blocked Auto Blocking algorithm.

Auto-Regressive Model
In the auto-regressive model (Table 1), an AR process value exhibited the slowest mixing under All Scalar sampling. When all 24 model parameters (AR process values, fixed effects, and one hyper-parameter) are blocked, the algorithm Runtime is nearly halved. This decrease in Runtime is largely due to the dependency structure inherent to the AR process: since it is a second-order AR process, scalar sampling of the AR process values requires nearly a three-fold increase in density evaluations of the process values relative to All Blocked sampling. In addition to the improved Runtime, the All Blocked sampling of the correlated AR process values increases their individual algorithmic efficiencies, and the slowest mixing parameter is instead among the fixed effects. The Efficiency under All Blocked sampling is over double that of All Scalar sampling. The automated blocking procedure identifies a blocking scheme which blocks together all AR process values and fixed effects (23 total; cut height h = 0.4), and performs univariate sampling of the single hyper-parameter. This has a similar Runtime to All Blocked sampling, but increases algorithmic efficiency for all parameters.
The resulting overall Efficiency under the Auto Blocking MCMC algorithm is over three times that of All Scalar sampling.
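The roughly three-fold difference in density evaluations for the AR(2) process can be made concrete with a simple counting argument: a scalar Metropolis update of θ_t must re-evaluate the density of θ_t and of each of its dependents θ_{t+1} and θ_{t+2}, while one joint block proposal evaluates each conditional density exactly once. The sketch below is a simplified accounting only (the paper's actual costs also include the fixed effects and hyper-parameter).

```python
def scalar_sweep_evals(T, order=2):
    """Density evaluations for one full scalar-sampling sweep over an
    AR(order) process of length T (simplified counting sketch)."""
    total = 0
    for t in range(T):
        # updating theta_t touches its own density plus that of each
        # dependent theta_{t+1}, ..., theta_{t+order} (truncated at the end)
        dependents = min(order, T - 1 - t)
        total += 1 + dependents
    return total

def block_update_evals(T):
    # one joint proposal: every conditional density evaluated exactly once
    return T
```

For large T the ratio scalar/block approaches (1 + order) : 1, i.e. about 3x for a second-order AR process, consistent with the Runtime difference reported above.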

Efficiency Gains from Automated Blocking
In Figure 4, we present the overall Efficiencies achieved for our suite of example models (excluding the two contrived model structures). The Auto Blocking algorithm consistently outperforms every static algorithm in terms of Efficiency, with gains ranging from roughly a 50% increase to several orders of magnitude. The exception is the GLMM example, in which Auto Blocking exactly matches the All Scalar algorithm. We observe variation in the relative Efficiencies among the static algorithms, reinforcing our notion that overall MCMC efficiency is highly dependent upon hierarchical model structure, and that attempting to infer what might be an efficient MCMC algorithm for a particular problem is, in general, difficult.

Discussion
We have presented a general automated procedure for determining an "efficient" MCMC algorithm for hierarchical models. Our procedure is a greedy, iterative algorithm, which traverses a finite and well-defined set of MCMC algorithms. So far as we are aware, this is the first automated MCMC-generating procedure of its kind. Using a suite of example models, we have observed that our automated procedure generates improvements in efficiency (relative to static MCMC algorithms) ranging between one and three orders of magnitude. In each case, the automated procedure produced an MCMC algorithm at least as efficient as any model-specific MCMC algorithm making use of prior knowledge or expert opinion. In all examples, our iterative procedure terminated within four iterations, although it is plausible that for more complex models it would proceed longer.
Our study has been confined to a single dimension of a much broader problem. We have strictly considered combinations of scalar and blocked adaptive Metropolis-Hastings sampling, with a small number of exceptions only for the purpose of comparison (e.g., the use of conjugate sampling). Further, we have restricted ourselves to non-overlapping sampling: each model parameter may only be sampled by a single MCMC sampler function. We may instead view the domain of our problem (automated determination of an efficient MCMC algorithm) as a broader space of MCMC algorithms.
This space may permit a wide range of sampling algorithms not considered herein: auxiliary variable algorithms such as slice sampling (Neal, 2003), or derivative-based sampling algorithms such as Hamiltonian Monte Carlo (Duane et al., 1987), among many possibilities. The resulting combinatorial explosion in the space of MCMC algorithms makes any process of trial-and-error, or any attempt at comprehensive exploration, futile. It is for this reason that we seek an automated procedure for determining an efficient MCMC algorithm: one which may not be globally maximally efficient, but which nonetheless provides non-trivial improvements in efficiency.
It should be noted that aspects of the problem addressed herein superficially resemble, but are fundamentally different from, hierarchical clustering and sparse covariance estimation. Granted, our automated procedure begins from an empirical covariance matrix generated from MCMC sampling chains. However, whereas sparse covariance estimation seeks to estimate the nonzero elements of the underlying covariance structure (Cai and Liu, 2011), our procedure concerns the non-trivial (correlated) elements, with little concern for the smaller entries. Our blocking algorithm also makes use of the complete linkage clustering algorithm for determining groupings of correlated model parameters. Clustering algorithms have been applied to a wide variety of problems (Xu and Wunsch, 2005), but not to the parameters of hierarchical models with the specific aim of accounting for the trade-off between MCMC algorithmic efficiency and computational requirements, so as to produce a computationally efficient MCMC algorithm. This is a fundamentally different goal than merely producing groupings of "similar" parameters (given some measure of similarity), as is generally the case in clustering applications. The existing literature on these subjects is therefore related, but not directly applicable to our problem at hand. A deeper consideration of these topics may be worthwhile, but we consider it beyond the scope of this paper.
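The complete-linkage blocking step referred to above, with distance d(i, j) = 1 − |ρ(i, j)| and a cut at height h, can be sketched in a few lines. This is an illustrative, self-contained Python re-implementation under those stated assumptions, not the authors' R code; `blocks_from_corr` is a hypothetical name.

```python
import numpy as np

def blocks_from_corr(corr, h):
    """Agglomerative complete-linkage clustering over d(i, j) = 1 - |corr|,
    stopping (i.e. cutting the tree) at height h; each resulting cluster
    is a candidate sampling block."""
    dist = 1.0 - np.abs(np.asarray(corr, dtype=float))
    clusters = [[i] for i in range(dist.shape[0])]

    def linkage_dist(a, b):
        # complete linkage: the largest pairwise distance between members
        return max(dist[i, j] for i in a for j in b)

    while len(clusters) > 1:
        # find the closest pair of clusters under complete linkage
        (p, q), d_min = min(
            (((p, q), linkage_dist(clusters[p], clusters[q]))
             for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda t: t[1])
        if d_min > h:          # next merge would exceed the cut height
            break
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return sorted(clusters)
```

A cut height of h = 0.1, for example, groups only parameters whose empirical posterior correlations exceed 0.9 in absolute value, matching the interpretation of the cut heights reported in the results.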
Reasonably straightforward improvements could be made to our automated blocking procedure, which is presented as a sensible first approach to the factors affecting MCMC algorithm efficiency. By design, our procedure natively accounts for differences in system platform or architecture that may affect the relative efficiencies of MCMC algorithms. We can envision a wide variety of possible extensions to our algorithm, ranging from re-blocking only the slowest mixing parameter on each iteration, to permitting cuts at different heights on distinct branches of the hierarchical clustering tree. Our procedure is intended as a proof-of-concept for the automated generation of efficient MCMC algorithms, and as a starting point for subsequent research.

Figure 1 :
Figure 1: MCMC algorithmic efficiencies for different values of model dimension (d) and intra-group correlation (ρ). The quantity − log(1 − ρ) is plotted on the horizontal axis. Model structures include constant off-diagonal elements (equal to ρ) in the induced correlation matrix (left), and exponentially decaying correlations (right).

Figure 2
Figure 2 presents timing results measured in seconds per 10,000 iterations, plotted against dimension d, without consideration of algorithmic efficiency (section 2.2). There are a number of interesting features, which we briefly summarize. C(Ψ scalar) is O(d) when each θ i independently follows a univariate distribution, and therefore Σ_{i=1}^{d} dens(θ i) = dens(Θ) ≤ d·K, where K = max_{θ∈Θ} dens(θ). For all practical purposes, it appears that C(Ψ block) is O(d²), as the cubic coefficient 1/AI = 1/200 is relatively small. C(Ψ block) is largely resilient to changes in the underlying distribution of Θ; univariate gamma densities are more costly than normal densities, as we would expect (as is also true for C(Ψ scalar)), and the multivariate normal structure even slightly more so. Perhaps most striking, C(Ψ scalar) is O(d³) when the underlying distribution of Θ is multivariate, since each multivariate normal density evaluation is O(d²), and this occurs d times for each iteration of Ψ scalar. Although both are cubic in d, C(Ψ scalar) dwarfs C(Ψ block) due to the difference in the leading cubic coefficients.
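The asymptotic cost comparison for the multivariate case can be captured in a toy operation-count model. The coefficients below are illustrative assumptions (taking the adaptation interval AI = 200, as in the text), not measured constants.

```python
def cost_scalar_mvn(d):
    """Per-iteration cost of all-scalar sampling under a d-dimensional
    multivariate normal: d scalar updates, each triggering an O(d^2)
    density evaluation -> O(d^3) with leading coefficient ~1."""
    return d * d**2

def cost_block_mvn(d, AI=200):
    """Per-iteration cost of block sampling: one O(d^2) density
    evaluation, plus an O(d^3) empirical covariance / Cholesky update
    amortized over the adaptation interval AI."""
    return d**2 + d**3 / AI
```

Both costs are cubic in d, but the block sampler's cubic term carries the small coefficient 1/AI, so in practice it behaves quadratically until d becomes very large, matching the behavior seen in Figure 2.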

Figure 2 :
Figure 2: MCMC runtimes for the All Scalar and All Blocked algorithms, for univariate normal, univariate gamma, and multivariate normal model structures.

Figure 3 :
Figure 3: Efficiency results for two contrived model structures: varying sized blocks of fixed correlation (left), and fixed sized blocks of varying correlation (right).

Figure 4 :
Figure 4: Efficiencies of MCMC algorithms for the suite of example models.

Table 1:
MCMC performance results for the suite of example models. Effective sample size (ESS) is measured in effective samples per 10,000 iterations, Runtime is presented as seconds per 10,000 iterations, and Efficiency is in units of effective samples produced per second of algorithm runtime.

Table 1 presents results for both parameterizations of the state space model. In the Independent parameterization, the observation noise parameter is the slowest mixing, in all except the All Blocked algorithm. The All Blocked algorithm runs quickly, but is limited by the extremely low algorithmic efficiency of the AR process intercept parameter. The Default algorithm assigns conjugate normal samplers to each latent state, resulting in high algorithmic efficiency but a substantially longer Runtime, which diminishes the overall Efficiency. Auto Blocking (cut height h = 0.8) creates a block of six parameters containing five latent states and the observation noise, and a disjoint block of the two AR process parameters. This combination, unlikely to be discovered through any combination of prior knowledge or trial and error, produces a 40% increase in Efficiency over All Scalar sampling, which is the most efficient static MCMC algorithm.

We suspect the intercept and autocorrelation parameters of the AR process to be correlated in the naïve parameterization of the state space model. The All Blocked algorithm once again runs quickly, but is limited by the ESS of the AR process intercept. The Default algorithm is again slow due to conjugate sampling and, similar to the All Scalar algorithm, produces low algorithmic efficiency for the correlated AR process parameters. The Auto Blocking algorithm (cut height h = 0.1) selects the same parameter block as in the Informed Blocking algorithm (AR process intercept and autocorrelation), and additionally a block containing the observation noise and a latent state.

For this model (Table 1), the automated procedure converges on the All Scalar algorithm, which is the same as its initial state, and which produces an overall MCMC Efficiency of about 3. In hindsight this result may not surprise us, since the fully exchangeable nature of the random effects in this model does not induce correlations among the sampled parameters for this particular dataset. Correspondingly, for a large number of uncorrelated random effects, and in the absence of multivariate distributions, univariate sampling produces the highest Efficiency. We also note that the All Blocked algorithm, which consists of a single block sampler of dimension 858, has Runtime approximately twice that of the All Scalar algorithm, and produces an overall Efficiency of approximately 0.05.
Spatial Model
MCMC performance results for the spatial model (Table 1) display several interesting trade-offs in MCMC efficiency. The spatial model contains 148 latent parameters jointly following a multivariate calculations.