MultiBUGS: A Parallel Implementation of the BUGS Modeling Framework for Faster Bayesian Inference

MultiBUGS is a new version of the general-purpose Bayesian modeling software BUGS that implements a generic algorithm for parallelizing Markov chain Monte Carlo (MCMC) algorithms to speed up posterior inference of Bayesian models. The algorithm parallelizes evaluation of the product-form likelihoods formed when a parameter has many children in the directed acyclic graph (DAG) representation; and parallelizes sampling of conditionally-independent sets of parameters. A heuristic algorithm is used to decide which approach to use for each parameter and to apportion computation across computational cores. This enables MultiBUGS to automatically parallelize the broad range of statistical models that can be fitted using BUGS-language software, making the dramatic speed-ups of modern multi-core computing accessible to applied statisticians, without requiring any experience of parallel programming. We demonstrate the use of MultiBUGS on simulated data designed to mimic a hierarchical e-health linked-data study of methadone prescriptions including 425,112 observations and 20,426 random effects. Posterior inference for the e-health model takes several hours in existing software, but MultiBUGS can perform inference in only 28 minutes using 48 computational cores.


Introduction
BUGS is a long running project that makes easy to use Bayesian modeling software available to the statistics community. The software has evolved through three main versions since 1989: first ClassicBUGS (Spiegelhalter, Thomas, Best, and Gilks 1996), then WinBUGS (Lunn, A different approach is to use multiple CPUs or cores for a single MCMC chain, with the aim of shortening the time taken for the MCMC chain to converge and to mix. One way to do this is to identify tasks within standard MCMC algorithms that can be calculated in parallel, without altering the underlying Markov chain. A task that is often, in principle, straightforward to parallelize, and is fundamental in several MCMC algorithms, such as the Metropolis-Hastings algorithm, is evaluation of the likelihood (e.g., Whiley and Wilson 2004;Jewell, Kypraios, Neal, and Roberts 2009;Bottolo et al. 2013). Another task that can be parallelized is sampling of conditionally-independent components, as suggested by, for example, Wilkinson (2006).
MultiBUGS implements all of the above strategies for parallelization of MCMC. There are thus two levels of parallelization: Multiple MCMC chains are run in parallel, with the computation required by each chain also parallelized by identifying both complex parallelizable likelihoods and conditionally-independent components that can be sampled in parallel.
There are numerous other approaches to MCMC parallelization. Several authors have proposed running parts of the model on separate cores and then combining results (Scott, Blocker, Bonassi, Chipman, George, and McCulloch 2016) using either somewhat ad hoc procedures or sequential Monte Carlo-inspired methods (Goudie, Presanis, Lunn, De Angelis, and Wernisch 2018). This approach has the advantage of being able to reuse already written MCMC software and, in this sense, is similar to the approach used in MultiBUGS. A separate body of work (Brockwell 2006;Angelino, Kohler, Waterland, Seltzer, and Adams 2014) proposes using a modified version of the Metropolis-Hastings algorithm which speculatively considers a possible sequence of MCMC steps and evaluates the likelihood at each proposal on a separate core. The time saving tends to scale logarithmically in the number of cores for this class of algorithms. A final group of approaches modifies the Metropolis-Hastings algorithm by proposing a sequence of candidate points in parallel (Calderhead 2014). This approach can reduce autocorrelations in the MCMC chain and so speed up MCMC convergence.

MultiBUGS software
MultiBUGS is available as free software, under the GNU General Public License version 3, and can be downloaded from https://www.multibugs.org/. MultiBUGS currently requires Microsoft Windows, and version 8.1 or newer of the Microsoft MPI (MS-MPI) parallel programming framework 1 . Note that the Windows Firewall may require you to give MultiBUGS permission to communicate between cores. The source code for MultiBUGS can be downloaded from https://github.com/MultiBUGS/MultiBUGS. The data and model files to replicate all the results presented in this paper can be found within MultiBUGS, as we describe later in the paper, or can be downloaded from https://github.com/MultiBUGS/multibugs-examples.
The paper is organized as follows: In Section 2 we introduce the class of models we consider and the parallelization strategy adopted in MultiBUGS; implementation details are provided in Section 3; Section 4 summarizes the basic process of fitting models in MultiBUGS; Section 5 demonstrates MultiBUGS for analyzing a large hierarchical dataset; and we conclude with a discussion in Section 6.

Models and notation
MultiBUGS performs inference for Bayesian models that can be represented by a directed acyclic graph (DAG), with each component of the model associated with a node in the DAG. A DAG G = (V G , E G ) consists of a set of nodes or vertices V G joined by directed edges are the nodes with an edge pointing to node v. The children ch G (v) = {u : (v, u) ∈ E G } of a node v are the nodes pointed to by edges emanating from node v. We omit G subscripts here, and throughout the paper, wherever there is no ambiguity.
DAGs can be presented graphically (see Figures 1 and 3), with stochastic nodes shown in ovals, and constant and observed quantities in rectangles. Stochastic dependencies are represented by arrows. Repeated nodes are enclosed by a rounded rectangle (plate), with the range of repetition indicated by the label.
To establish ideas, consider a simple random effects logistic regression model (called "seeds") for the number r i of seeds that germinated out of n i planted, in each of i = 1, . . . , N = 21 experiments, with binary indicators of seed type X 1i and root extract type X 2i (Crowder  1978;Breslow and Clayton 1993).
We choose normal priors for the regression parameters α 0 , α 1 , α 2 , α 12 , with mean µ α = 0 and standard deviation σ α = 1000. We fix µ β = 0, and choose a uniform prior on the range σ min = 0 to σ max = 10 for the standard deviation σ β of the random effects β i . Figure 1 shows a DAG representation of the "seeds" model. The data are presented in Crowder (1978).
For ease of exposition of the parallelization methods used by MultiBUGS, we assume throughout this paper that the set of nodes V G includes all stochastic parameters S G ⊆ V G and constant quantities (including observations and hyperparameters) in the model, but excludes parameters that are entirely determined by other parameters. As a consequence, the DAG for the seeds example ( Figure 1) includes as nodes the stochastic parameters S G = {α 0 , α 1 , α 2 , α 12 , β 1 , . . . , β 21 , σ β }, the observations {r i , X 1i , X 2i , n i : i = 1, . . . , 21} and the constant hyperparameters {µ α , σ α , µ β , σ min , σ max }, but not the parameters that are deterministic functions of other parameters (the germination probabilities p i ), which have been assimilated into the definition of the distribution of r i before forming the DAG. Arbitrary DAG models can nevertheless be considered by assimilating deterministic intermediary quantities, such as linear predictors in generalized linear models, into the definition of the conditional distribution of the appropriate descendant stochastic parameter; and considering deterministic prediction separately from the main MCMC computation. For example, in the seeds example, the random effect precision τ β = σ −2 β is deterministically related to the standard deviation σ β , so it would not be considered part of the graph if it were of interest: Posterior inference for τ β could instead be made either by updating its value in the usual (serial) manner after each MCMC iteration, or by post-processing the MCMC samples for σ β .
In DAG models, the conditional independence assumptions represented by the DAG mean that the full joint distribution of all quantities V has a simple factorization in terms of the conditional distribution p(v | pa(v)) of each node v ∈ V given its parents pa(v): Posterior inference is performed in MultiBUGS by an MCMC algorithm, constructed by associating each node with a suitable updating algorithm, chosen automatically by the program according to the structure of the model. Most MCMC algorithms involve evaluation of the conditional distribution of the stochastic parameters S ⊆ V (at particular values of its arguments). The conditional distribution p(v | V −v ) of a node v ∈ S, given the remaining nodes where p(v | pa(v)) is the prior factor and L(v) = u∈ch(v) p(u | pa(u)) is the likelihood factor.

Parallelization methods in MultiBUGS
MultiBUGS performs in parallel both multiple chains and the computation required for a single MCMC chain. In this section, we describe how the computation for a single MCMC chain can be performed in parallel.

Parallelization strategies
MCMC entails sampling, which often requires evaluation of the conditional distribution of the stochastic parameters S in the model. MultiBUGS parallelizes these computations for a single MCMC chain via two distinct approaches.
First, when a parameter has many children, evaluation of the conditional distribution is computationally expensive, since Equation 1 is the product of many terms. However, the evaluation of the likelihood factor L(v) can easily be split between C cores by calculating a partial product involving every Cth child on each core. With a partition {ch (1) (v), . . . , ch (C) (v)} of the set of children ch(v), we can evaluate u∈ch (c) (v) p(u | pa(u)) on the cth core, c = 1, . . . , C.
The prior factor p(v | pa(v)) and these partial products can be multiplied together to recover the complete conditional distribution.
Second, when a model includes a large number of parameters then computation may be slow in aggregate, even if sampling of each individual parameter is fast. However, parameters can clearly be sampled in parallel if they are conditionally independent. Specifically, all parameters in a set W ⊆ S can be sampled in parallel whenever the parameters in W are mutually conditionally-independent; i.e., all w 1 ∈ W and w 2 ∈ W (w 1 = w 2 ) are conditionally independent given V \ W . If C cores are available and |W | denotes the number of elements in the set W , then in a parallel scheme at most |W |/C parameters need to be sampled on a core (where x denotes the ceiling function), rather than |W | in the standard serial scheme.
To identify sets of conditionally-independent parameters, MultiBUGS first partitions the stochastic parameters S into depth sets D h G = {v ∈ S : d G (v) = h}, defined as the set of stochastic nodes with topological depth d G (v) = h, where topological depth of a node v ∈ V is defined recursively, starting from the nodes with no parents.
Note that stochastic nodes v ∈ S have topological depth d G (v) ≥ 1, since the constant hyperparameters of stochastic nodes are included in the DAG.
Sets of conditionally-independent parameters within a depth set can be identified by noting that all parameters in a set W ⊆ D h G are mutually conditionally-independent, given the other nodes V \ W , if the parameters in W have no child node in common. This follows from the d-separation criterion (Definition 1.2.3, Pearl 2009): All such pairs of parameters w 1 ∈ W and w 2 ∈ W (w 1 = w 2 ) are d-separated by V \ W because no "chain path" can exist between w 1 and w 2 because these nodes have the same topological depth; and all "fork paths" are blocked by V \ W , as are all "collider paths", except those involving a common child of w 1 and w 2 , which are prevented by definition of W .

Heuristic for determining parallelization strategy
A heuristic criterion is used by MultiBUGS to decide which type of parallelism to exploit for each parameter in the model. The heuristic aims to parallelize the evaluation of conditional distributions of "fixed effect"-like parameters, and parallelize the sampling of "random effect"like parameters. The former tend to have a large number of children, whereas the latter tend to have a small number of children. Each depth set is considered in turn, starting . The computation of the parameter's conditional distribution is parallelized if a parameter has more children than double the mean number of children ch = mean v∈S | ch G (v)|, or if all parameters in the graph have topological depth h = 1; otherwise the sampling of conditionally independent sets of parameters is parallelized whenever this is permitted. The special case for h = 1 ensures that evaluation of the conditional distribution of parameters is parallelized in "flat" models in which all parameters have identical topological depth. When a group of parameters is sampled in parallel we would like the time taken to sample each one to be similar, so MultiBUGS assigns parameters to cores in order of the number of children that each parameter has.
MultiBUGS creates a C-column computation schedule table T , which specifies the parallelization scheme: Where different parameters appear in a row, the corresponding parameters are sampled in parallel; where a single parameter is repeated across a full row, the evaluation of the conditional distribution for that parameter is split into partial products across the C cores. A single MCMC iteration consists of evaluating updates as specified by each row of the computation schedule in turn. The computation schedule includes blanks whenever a set W of mutually conditionally-independent parameters does not divide equally across the C cores; that is, when |W | mod C = 0, where mod denotes the modulo operator. The corresponding cores are idle when a blank occurs. Appendix A describes the algorithms used to create the C-column computation schedule table T in detail.

Block samplers
MultiBUGS is able to use a block MCMC sampler when appropriate: that is, algorithms that sample a block of nodes jointly, rather than just a single node at a time. Block samplers are particularly beneficial when parameters in the model are highly correlated a posteriori (see, e.g., Roberts and Sahu 1997). The conditional distribution for a block B ⊆ S of nodes, given the rest of nodes Block samplers can be parallelized in a straightforward manner: If we consider a block B as a single node, and define ch(B) = ∪ b∈B ch(b), then the approach introduced above is immediately applicable, and we can exploit both opportunities for parallelization for block updates. A mixture of single node and block updaters can be used without complication.
In the seeds example it is possible to block together α 0 , α 1 , α 2 , α 12 . The block then has 21 children, and so our algorithm chooses to spread evaluation of their likelihood over multiple cores. The computation schedule remains identical to Table 1, but the block sampler waits until all the likelihoods corresponding to rows 7 to 10 of Table 1 are evaluated before determining each update for the {α 0 , α 1 , α 2 , α 12 } block.

Implementation details
BUGS represents statistical models internally using a dynamic object-oriented data structure (Warford 2002) that is analogous to a DAG. The nodes of the graph are objects and the edges of the graph are pointers contained in these objects. Although the graph is specified in terms of the parents of each node, BUGS identifies the children of each node and stores this as a list embedded in each parameter node. Each node object has a value and a method to calculate its probability density function. For observations and fixed hyperparameters the value is fixed and is read in from a data file; for parameters the value is variable and is sampled within a MCMC algorithm. Each MCMC sampling algorithm is represented by a class (Warford 2002) and a new sampling object of an appropriate class is created for each parameter in the statistical model. Each sampling object contains a link to the node (or block of nodes) in the graphical model that represents the parameter (or block of parameters) being sampled. One complete MCMC update of the model involves a traversal of a list of all these sampling objects, with each object's sampling method called in turn. Lunn et al. (2000) provide further background on the internal design of BUGS.
The MultiBUGS software consists of two distinct computer programs: a user interface and a computational engine. The computational engine is a small program assembled by linking together some modules of the OpenBUGS software plus a few additional modules to implement our parallelization algorithm. Copies of the computational engine run on multiple cores and communicate with each other using the message passing interface (MPI) protocol (Pacheco 1997), version 2.0. The user interface program is a slight modification (and extension) of the OpenBUGS software. The user interface program compiles an executable "worker program" that contains the computational engine required for a particular statistical model. It also writes out a file containing a representation of the data structures that specify the statistical model. It then starts a number of copies of the computational engine on separate computer cores. These worker programs then read in the model representation file to rebuild the graphical model and start generating MCMC samples using our distributed algorithms. The worker programs communicate with the user interface program via an MPI intercommunicator object. The user interface is responsible for calculating summary statistics of interest.
Both sources of parallelism described in Section 2.2 require only simple modifications of the data structures and algorithms used in the BUGS software. Each core keeps a copy of the current state of the MCMC, as well as two pseudo-random number generation (PRNG) streams (Wilkinson 2006): a "core-specific" stream, initialized with a different seed for each core; and "common" stream, initialized using the same seed on all cores. Initially, each core loads the sampling algorithm, the computation schedule, and the complete DAG, which is then altered as follows so that the overall computation yields the computation required for the original, complete DAG.
When the calculation of a parameter's likelihood is parallelized across cores, the list of children associated with a parameter on each core is thinned (pruned) so that it contains only the children in the corresponding partition component of ch(v). The BUGS MCMC sampling algorithm implementations then require only minor changes so that the partial likelihoods are communicated between cores. For example, a random walk Metropolis algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller 1953) is performed as follows: First, on each core, the prior factor and a partial likelihood contribution to the conditional distribution are calculated for the current value of the parameter. Each core then samples a candidate value. These candidates will be identical across cores, since the "common" PRNG stream is used. The prior and partial likelihood contributions are then calculated for the candidate value, and the difference between the two partial log-likelihood contributions can be combined across cores using the MPI function Allreduce. The usual Metropolis test can then be applied on each core in parallel using the "common" PRNG stream, after which the state of Markov Figure 2: The specification tool in MultiBUGS, including the "distribute" button, which is used to initialize the parallelization.
chain is identical across cores. Computation of the prior factor and the Metropolis test is intentionally duplicated on every core because we found that the time taken to evaluate these quantities is usually shorter than the time taken to propagate their result across cores.
When a set of parameters W is sampled in parallel over the worker cores, the list of MCMC sampling objects is thinned on each core so that only parameters specified by the corresponding column of the computation schedule are updated on each core. The existing MCMC sampling algorithm implementations used in OpenBUGS can then be used without modification with each "core-specific" PRNG stream. The MPI function Allgather is used to send newly sampled parameters to each core. Note we need to run Allgather only after each core has sampled all of its assigned components in W , rather than after each component in W is sampled. For example, in the seeds example, we use Allgather after row 6. This considerably reduces message-passing overheads when the number of elements in W is large.
Running multiple chains is handled via standard MPI methods. If we have, say, two chains and eight cores, we partition the cores into two sets of four cores and set up separate MPI collective communicators (Pacheco 1997) for each set of cores for Allreduce and Allgather to use. Requests can be sent from the master to the workers using the intercommunicator and results returned. We find it useful to designate a special "lead worker" for each chain that we simulate. Each of these lead workers sends back sampled values to the master, where summary statistics can be collected. Only sampled values corresponding to quantities that the user is monitoring need to be returned to the master. This can considerably reduce the amount of communication between the workers and the master.

Basic usage of MultiBUGS
The procedure for running a model in MultiBUGS is largely the same as in WinBUGS or OpenBUGS. MultiBUGS adopts the standard BUGS language for specifying models, the core of which is common also to WinBUGS, OpenBUGS, JAGS and NIMBLE. A detailed tutorial on the use of BUGS can be found in, for example, Lunn, Jackson, Best, Thomas, and Spiegelhalter (2013).
An analysis is specified in MultiBUGS using the Specification Tool ( Model Specification... ) by checking the syntax of a model ( check model ), loading the data ( load data ), compiling ( compile ) and setting up initial values ( load inits and gen inits ).
We then specify the total number of cores to distribute computation across by entering a number in the box labeled num cores ( Figure 2) and then clicking distribute . This should be set at a value less than or equal to the number of processing cores available on your computer (the default is 2). If multiple chains are run, the cores will be divided equally across chains. We recommend that users experiment with different numbers of cores, since the setting that leads to fastest computation depends both on the specific model and data being analyzed and on the computing hardware being used. While increased parallelization will often result in faster computation, in some cases communication overheads will balloon to the point where parallelization gains are overturned. Furthermore, Amdahl's bound (Amdahl 1967) on the speed-up that is theoretically obtainable with increased parallelization may also be hit in some settings. Note that changing the number of cores will alter the exact samples obtained, since this affects the PRNG stream used to draw each sample (as described in Section 3).
Samples are drawn using the Update Tool ( Model Update... ). The use of the Sample Monitor Tool ( Inference Samples... ) to monitor parameters; to assess MCMC convergence, using, for example, the Brooks-Gelman-Rubin diagnostic (Gelman and Rubin 1992;Brooks and Gelman 1998); and to obtain results is the same as in WinBUGS and OpenBUGS. Analyses can be automated in MultiBUGS using the same simple procedural scripting language that is available in OpenBUGS. The new command modelDistribute(C) can be used to specify that parallelization should be across C cores; for details see Manuals MultiBUGS User Manual Scripts and Batch-mode .

Seeds example
The model, data and initial conditions for the seeds examples can be found within MultiBUGS in Examples Examples Vol I Seeds: random effects logistic regression . This is a simple model involving a small number of parameters and observations, so computation is already fast in OpenBUGS and is no faster in MultiBUGS (both take less than a second to do 1000 MCMC updates) because the benefit of parallelization is canceled out by communication overheads. However, for some more complicated models, MultiBUGS will be dramatically faster than OpenBUGS. We illustrate this with an example based on e-health data.

Illustration of usage with hierarchical e-health data
Our e-health example is based on a large linked database of methadone prescriptions given to opioid dependent patients in Scotland, which was used to examine the influence of patient characteristics on doses prescribed (Gao, Dimitropoulou, Robertson, McTaggart, Bennie, and Bird 2016;Dimitropoulou et al. 2017). This example is typical of many databases of linked health information drawn from primary care records, hospital records, prescription data and disease/death registries. Each data source often has a hierarchical structure, arising from regions, institutions and repeated measurements within individuals. Here, since we are unable to share the original dataset, we analyze a synthetic dataset, simulated to match the key traits of the original dataset.
The model includes 20,426 random effects in total, and was fitted to 425,112 observations. It is possible to fit this model using standard MCMC simulation in OpenBUGS but, unsurprisingly, the model runs extremely slowly and it takes 32 hours to perform a sufficient number of iterations (15,000) to satisfy standard convergence assessment diagnostics. In such data sets it can be tempting to choose a much simpler and faster method of analysis, but this may not allow appropriately for the hierarchical structure or enable exploration of sources of variation. Instead it is preferable to fit the desired hierarchical model using MCMC simulation, while speeding up computation as much as possible by exploiting parallel processing.
The model code, data and initial conditions can be found within MultiBUGS in Examples Examples Vol IV Methadone: an E-health model .

E-health data
The data have a hierarchical structure, with multiple prescriptions nested within patients within regions. For some of the outcome measurements, person identifiers and person-level covariates are available (240,776 observations). These outcome measurements y ijk represent the quantity of methadone prescribed on occasion k for person j in region i (i = 1, . . . , 8; j = 1, . . . , J i ; k = 1, . . . , K ij ), and are recorded in the file ehealth_data_id_available.txt. Each of these measurements is associated with a binary covariate r ijk (called source.indexed) that indicates the source of prescription on occasion k for person j in region i, with r ijk = 1 indicating that the prescription was from a General Practitioner (family physician). No person identifiers or person-level covariates are available for the remaining outcome measurements (184,336 observations). We denote by z il the outcome measurement for the lth prescription without person identifiers in region i (i = 1, . . . , 8; l = 1, . . . , L i ). These data are in the file ehealth_data_id_missing.txt. A binary covariate s il (called source.nonindexed) indicates the source of the lth prescription without person identifiers in region i, with s il = 1 indicating that the prescription was from a General Practitioner (family physician). The final data file, ehealth_data_n.txt, contains several totals used in the BUGS code.

E-health model
We model the effect of the covariates with a regression model, with regression parameter β m corresponding to the mth covariate x mij (m = 1, . . . , 4), while allowing for within-region correlation via region-level random effects u i , and within-person correlation via person-level random effects w ij ; source effects v i are assumed random across regions.
The means µ w and µ ε are both fixed to 0.
The outcome measurements z il contribute only to estimation of regional effects u i and source effects v i .

Parallelization in MultiBUGS
After setting the number of cores, the computation schedule chosen by MultiBUGS can be viewed in Info Show distribution . MultiBUGS parallelizes sampling of all the personlevel random effects w ij , except for the component corresponding to the person with the most observations (176 observations); MultiBUGS parallelizes likelihood computation of this component instead. The likelihood computation of all the other parameters in the model is also parallelized, except for the mutually conditionally-independent sets {µ u , µ v } and {σ 2 u , σ 2 v }, which are sampled in parallel in turn.

Run time comparisons across BUGS implementations
To demonstrate the speed-up possible in MultiBUGS using a range of number of cores, we ran two chains for 15,000 updates for the e-health example. This run length was chosen to mimic realistic statistical practice, since, after discarding the first 5,000 iterations as burn-in, visual inspection of chain-history plots and the Brooks-Gelman-Rubin diagnostic (Gelman and Rubin 1992;Brooks and Gelman 1998) indicated convergence. We ran the simulations (each replicated three times) on a 64-core machine consisting of four sixteen-core 2.4Ghz AMD Operon 6378 processors with 128GB shared RAM. Figure 4 shows the run time against the number of cores on a log-log scale. Substantial time savings are achieved using MultiBUGS: On average using one core took 8 hours 10 minutes; using two cores took 4 hours and 8 minutes; and using 48 cores took only 28 minutes. In contrast, these simulations took 32 hours in standard single-core OpenBUGS 3.2.3; and 9 hours using JAGS 4.0.0 via R 3.3.1.  The scaling of performance with increasing number of cores is good up to sixteen cores and then displays diminishing gains. This may be due to inter core communication being much faster within each processor of 16 cores compared to across processors, or the diminishing returns anticipated by Amdahl's law (Amdahl 1967). Running only one chain approximately halved the run time for two chains.

Results
The posterior summary table we obtained is shown in Table 2.

Discussion
MultiBUGS makes Bayesian inference using multi-core processing accessible for the first time to applied statisticians working with the broad class of statistical models available in the BUGS language software. It adopts a pragmatic algorithm for parallelizing MCMC sampling, which we have demonstrated speeds up inference in a random-effects logistic regression model involving a large number of random effects and observations. While a large literature has developed proposing methods for parallelizing MCMC algorithms (see Section 1), a generic, easy-to-use implementation of these ideas has been heretofore lacking. Almost all users of BUGS language software will have a multi-core computer available, since desktop computers typically now have a moderate number (up to ten) of cores, and laptops typically have 2-4 cores. However, workstations with an even larger number of cores are now becoming available: For example, Intel's Xeon Phi x200 processor contains between 64 and 72 cores.
The magnitude of speed-up provided by MultiBUGS depends on the model and data being analyzed and the computer hardware being used. Two levels of parallelization can be used in MultiBUGS: Independent MCMC chains can be parallelized, and then computation within a single MCMC chain can be parallelized. The first level of parallelization will almost always be advantageous whenever sufficient cores are available, since no communication across cores is needed. The gain from second level of parallelization is problem specific: The gain will be largest for models involving parameters with a large number of likelihood terms and/or a large number of conditionally independent parameters. For example, MultiBUGS is able to parallelize inference for many standard regression-type models involving both fixed and random effects, especially with a large number of observations, since fixed effect regression parameters will have a large number of children (the observations), and random effects will typically be conditionally independent. For models without these features, the overheads of the second level of parallelization may outweigh the gains on some computing hardware, meaning only the first level of parallelization is beneficial.
The mixing properties of the simulated MCMC chains are the same in OpenBUGS and MultiBUGS, because they use the same collection of underlying MCMC sampling algorithms. Models with severe MCMC mixing problems in OpenBUGS are thus not resolved in Multi-BUGS. However, since MultiBUGS can speed-up MCMC simulation, it may be practicable to circumvent milder mixing issues by simply increasing the run length.
Several extensions and developments are planned for MultiBUGS in the future. First, at present MultiBUGS requires the Microsoft Windows operating system. However, most large computational clusters use the Linux operating system, so a version of MultiBUGS running on Linux is under preparation. Second, MultiBUGS currently loads data and builds its internal graph representation of a model on a single core. This process will need to be rethought for extremely large datasets and graphical models.
for h in h to 1 do {T, U, r} ← find_partial_product_parallel(G, C, h, h , T, r) {W 1 , . . . , W l } ← find_conditionally_independent(G, U ) for i in 1 to l do c ← 0 for j in max w∈W i | ch G (w)| to 1 do for x in {w ∈ W i : | ch G (w)| = j} do if c mod C = 0 then r ← r + 1 c ← 0 end if T r(c+1) ← x c ← c + 1 end for end for end for end for Output: T