Gradient-free MCMC methods for dynamic causal modelling

In this technical note we compare the performance of four gradient-free MCMC samplers (random walk Metropolis sampling, slice-sampling, adaptive MCMC sampling and population-based MCMC sampling with tempering) in terms of the number of independent samples they can produce per unit computational time. For the Bayesian inversion of a single-node neural mass model, both adaptive and population-based samplers are more efficient compared with random walk Metropolis sampler or slice-sampling; yet adaptive MCMC sampling is more promising in terms of compute time. Slice-sampling yields the highest number of independent samples from the target density — albeit at almost 1000% increase in computational time, in comparison to the most efficient algorithm (i.e., the adaptive MCMC sampler).


Introduction
This technical note reports comparative evaluations of common gradient-free sampling schemes that can be used for Bayesian inference in dynamic causal modelling. It is the first of a series of technical reports that hopes to provide a comprehensive survey of the various sampling schemes availableboth gradient-free, and (first and second order) gradient-based schemes. These schemes provide a gold standard against which the performance of fixed form (e.g., variational) approximate Bayesian inference can be compared. Furthermore, with advances in computer science, the computational costs usually associated with sampling schemes may be sufficiently reduced to allow their routine use in applications like dynamic causal modelling.
Dynamical Causal Models (DCMs) are used routinely in neuroimaging as generative models of neurophysiological signals (Friston et al., 2003). Inference on their parameters usually proceeds using a parameterised probability density and maximising the (variational free energy) evidence lower bound . Typically a Laplace approximation (Tierney and Kadane, 1986) is used for inference in DCMs because it does not require algebraically involved updates, unlike variational Bayesand is guaranteed to converge as a result of the central limit theorem (Wang and Titterington, 2004). But such deterministic algorithms have their limitations. For example, they underestimate the variability in the posterior density; they get locked in local minima and are unable to approximate multi-modal posteriors (MacKay, 2002). Markov Chain Monte Carlo (MCMC) schemes are stochastic sampling algorithms (Gelfand and Smith, 1990;Geman and Geman, 1984) that eschew these problems. The basic idea behind MCMC is to simulate a Markov chain with the posterior density as its invariant probability density (see Appendix for definitions). After the chain has converged, resulting samples are an approximation of the posterior density.
MCMC methods come in two flavoursgradient-free schemes and gradient-based schemes. Gradient-free methods typically take the form of a Gibbs sampler or some variant of the random walk Metropolis-Hastings algorithm; whilst gradient-based methods use the gradient of the joint log-likelihood function to simulate diffusion (a Langevin algorithm) (Roberts and Tweedie, 1996) or optimise auxiliary variables as in Hamiltonian Monte-Carlo (HMC) algorithm (Neal, 2010). Despite the progress in numerical analysis, gradient-based methods are expensive; however, they avoid the naïve random walk inherent in gradient-free samplers. For both classes of samplers, there exists a natural trade-offbetween rapid (Markov) chain mixing versus compute time efficiency. A computationally efficient sampler would reach the invariant probability density quickly (rapid mixing) using least floating-point cycles (computational efficiency). Since the inception of stochastic sampling methods (Gelfand and Smith, 1990;Geman and Geman, 1984) extensive measure theoretic analyses (Meyn and Tweedie, 2009) have been conducted to gauge the mixing properties of Markov chains but these are at worst problem dependent.
In this note, we evaluate the suitability of gradient-free MCMC methods in terms of unit computation required for producing an independent sample from the posterior distribution. For this, we implemented three variants of the Metropolis-Hastings algorithm; along with a slice sampling algorithm. In addition to the standard random walk Metropolis algorithm, we implemented a sampling algorithm that tunes the properties of the proposal distribution; whilst another scheme makes non-local moves across multiple Markov chains at different temperaturesfacilitating cross-over between chains. In what follows, the parameters of a (single-node) neural mass model (NMM) were estimated using these inference schemes and their computational efficiency benchmarked. We found that adaptive Monte Carlo methods based on stochastic approximations were the most efficient, followed by MCMC methods based on tempered chains. Discounting computation efficiency, the slice-sampler emerged as a clear winnerif performance was restricted to the number of independent samples they could produce.

Methods
In this section, we briefly review the generative (dynamic causal) model used to simulate data that was subject to subsequent inference. These models are used to fit observed electrophysiological data and contain between 10 and 100 parameters. We then review the schemes (random walk Metropolis Hastings sampling, slice sampling, adaptive MCMC sampling and population MCMC sampling) that are subject to a comparative evaluation in the Results section.
Custom code was written in Matlab 2014a (The MathWorks Inc., USA) to simulate the Markov chains. For population MCMC sampling Parallel Computing Toolbox (The MathWorks Inc., USA) was used. Unless stated otherwise, out of the 2000 samples that were collected, the initial 600 samples were discarded as burn-in (see Appendix for definitions). In adaptive MCMC, out of the 600 burn-in samples, the initial 300 samples were used for proposal adaptation. All computations were performed on a 2011 Macbook Pro laptop.

Neural mass models
To test the inference schemes under known parameters, we used a single node neural mass model (NMM) based on David et al. (2006) to create synthetic data (Fig. 1). This model comprises ten parameters ({δ, g, h, τ, u} ⊆ θ with δ (intrinsic delay), {g 1 … 4 } (connection strengths), h e/i (maximum amplitude of post-synaptic potential), τ e/i (rate-constant of the membrane) and u (input to the neural population); for detailed description refer to David et al. (2006)) and nine ordinary differential equations (ODEs) that are a first-order approximation of delaydifferential equations (DDEs) representing three distinct neural populations; namely, inhibitory interneurons (x 7 ), spiny-stellate (x 1 ) and pyramidal neurons (x 9 ), These differential equations simulated for T time-points provide the predicted response (for some known experimental input) which, assuming additive Gaussian noise with covariance Σ, provide a likelihood model of observed data (y) with corresponding log joint density, Priors on all parameters (Table 1) conform to a Gamma distribution with shape k 1 and scale k 2 , whereby constructionapproximately 46-50% of the parameters sampled from this prior result in unstable dynamics, marked by positive real eigenvalues of the Jacobian matrix. This ensured that the inference scheme can recover from dynamical instability. The shape and scale of the Gamma distribution were determined numerically by integrating 200,000 NMMs and performing stability analysis. The shape and scale parameters of the Gamma prior distribution were then chosen, such that 46-50% of the sampled parameters produced unstable dynamics. The fixed-point equations were solved using a Trust-Region Dogleg method (Nocedal and Wright, 2006).
Contrary to David et al. (2006) where experimental input was modelled as a combination of a Gamma density function and a discrete cosine set, we used a simpler Heaviside step function to perturb the spiny-stellate cells. Differential equations were integrated using CVODES (Hindmarsh and Serban, 2002) using implicit backward-differentiation formulas (BDFs). The resulting non-linear equations were solved using Newton's method. Initial simulations established that direct solvers based on dense matrices were computationally more efficient than the three preconditioned Krylov (iterative) solvers (GMRES, Bi-CGStab, and TFQMR) (Golub and Van Loan, 2012). We anticipate that for larger dynamical systems (e.g., a 20-node NMM) iterative solvers may be more efficient. The absolute and relative tolerances of the integrators were both fixed at 10 −3 .

Algorithm Aslice sampler
Slice-sampling is a type of MCMC based on the fact that sampling a random variable can be attained by sampling uniformly under its probability density function and rejecting those that are outside (Neal, 2003). First of all we initialise our parameters to θ 0 so that the target density π(θ 0 ) N 0. Given this previous sample θ i we sample a position n i + 1 uniformly on [0, π(θ i )]. Conceptually, the next step comprises of drawing a horizontal line across the curve at this position. This hypothetical line is nothing but a 'slice' of our target distribution. Consequently, we sample θ i + 1 along the slice so that π(θ i + 1 ) ≥ n i + 1 .
Numerically, to operationalise the inequality, a bracket is first constructed as θ min ≤ θ i + 1 ≤ θ max and tested to see whether each end point lies within the slice. If it does, the endpoint is extended in that direction until it is outside the slice. This process is called Table 1 Model parameters used for dynamic causal modelling. Parameters describing the prior Gamma distribution. Also shown are the parameters for generating the ground truth ( Fig. 1).

Parameter
Shape ( "stepping out". A candidate value e θ is then selected uniformly from the regionand is accepted as the next sample if it lies within the slice i.e., θ iþ1 ¼ e θ. If not, the slice shrinks, such that e θ forms one end of the slice containing θ i . The process is repeated until a sample is accepted. For multivariate distributions, we introduce an auxiliary position (n) for each dimension. Unlike a Gibbs sampler, sampling using slices of the distribution does not require specification of the full conditionals. Similarly in contrast to a Metropolis-Hastings sampling algorithm, a slicesampler does not require specification of a proposal distribution.

Algorithm Brandom walk metropolis sampler
The random walk Metropolis (RWM) is the most common MCMC algorithm for Bayesian inference. Given a current value θ i of a d-dimensional Markov chain, the next value is chosen according to a proposal distribution e θ $ π e θ θ i j . We choose this to be a multi-variate Gaussian. The sample is then accepted with probability, ∧ denotes minimum between the left and the right arguments. If z ≤ α where z~U(0, 1) we setθ iþ1 ¼ e θ. Otherwise, we set θ i + 1 = θ i . The above formula embodies the notion that any proposal that takes the chain closer to a local mode is always accepted, whilst any other proposal is accepted with the probability equal to the relative densities of the posterior at the proposed and the current values.

Algorithm Cadaptive MCMC sampler
The random walk Metropolis (RWM) scheme generally has a slow convergence to the target density because of the inherent random walks (Chumbley et al., 2007). Using the history of samples that are generated from a Markov chain, the adaptive MCMC algorithm (Andrieu and Thoms, 2008;Haario et al., 2001) adapts the expectation and covariance matrix of the proposal distribution using stochastic approximations (Kushner and Yin, 2003). Stochastic approximation is an iterative algorithm that finds extrema (roots) of cost-functions using noisy samples. In adaptive MCMC this cost-function is based on the empirical mean (μ) and covariance (Σ) of the target density as well as the pre-determined acceptance rate (to update the scalar scale parameter λ).
Specifically, a Robbins-Monro algorithm is used (Robbins and Monro, 1951); wherein given current parameters (θ 0 ), mean (μ 0 ) and covariance (Σ 0 ) of the proposal distribution we first sample θ iþ1 $ N μ i ; λ i Σ i ð Þ where λ 0 is initialised to 1. Similarly, Σ 0 was initialised to an identity matrix. Secondly, using the Metropolis-Hastings criteria (Eq. (3)) we set μ i + 1 = θ i + 1 . If not, we reject the sample and set μ i + 1 = μ i . The current and target acceptance probabilities are α i → i + 1 and α target , respectively. It may be easy to understand such a scheme as a stochastic realisation of a deterministic prediction-error learning rule, guided by the Metropolis-Hastings acceptance ratio.
For the subsequent iteration, we adapt the mean (μ), the covariance (Σ) and the global scale of the covariance matrix (λ) with an iteration dependent step-size (γ) as follows, Algorithm Dpopulation MCMC sampler In the two preceding MCMC schemes, one simulates a single Markov chain, where the posterior sample density is said to have converged if multiple starting points yield identical invariant distributions. Slow chain mixing results from non-convexity of the posterior density. In order to promote chain mixing, one can run multiple chains with varying temperatures and implement non-local proposal swaps between chains (Geyer, 1992a). These exchanges also make the algorithm a candidate for sampling from multimodal densities (Frantz et al., 1990). Such an algorithm is known as the population MCMC sampler (Geyer, 1992a). It has been re-invented under various guises (Replica Exchange (Swendsen and Wang, 1986), Metropolis-Coupled MCMC (Geyer, 1992a), population MCMC (Laskey and Myers, 2003), Parallel Tempering (Earl and Deem, 2005), among others) but the standard approach is to initiate multiple Markov chains (indexed by i) totalling N such that the inverse temperature (β i ) of each chain is distributed according to where p = 5 and N = 4 chains, with the posterior density specified as, π θjy ð Þ∝ π yjθ ð Þ β π θ ð Þ: ð7Þ With a β of 1, the chain represents the joint log-likelihood, whilst a β of 0 represents a Markov chain that samples from the priorthe lower the inverse-temperature, the smoother the posterior density. One can visualize the multi-modal target distribution melting with an increase in temperature. At each temperature, the resulting distribution is explored using a single Markov chain whilst a product distribution is considered when moving between individual chains. This enables the sampler to take into account all of the chains, at different temperature levels.
Operationally, each chain uses a local (to that chain) Metropolis-Hastings acceptance criterion to either accept or reject the sample i.e., z b (1 ∧ exp( H old − H new )) where z~U(0, 1) and H is the unnormalised joint log-likelihood. After the kth sample is collected from the local chain, an additional Metropolis-Hastings acceptance criteria is imposedwhereby a pair of chains (t i and t j ) is selected randomly and their samples are swapped with the following acceptance ratio, L is the log-likelihood of the predicted response. It is customary to use a uniform tempering schedule; i.e., β i ¼ i N as discussed in Jasra et al.(2007). For linear regression models Calderhead and Girolami (2009) showed that a power law distribution for the tempering schedule is most optimal. They showed that using uniformly spaced temperature schedule on the other-hand produced worse results, even if the number of Markov chains is increased ten-fold. Therefore, in this paper, we evaluate the sampling efficiency as that obtained by both uniform and power-law temperature schedules.
Population MCMC is reminiscent of an embarrassingly parallel problem; wherein communication between multi-threaded processes is minimised by performing proposal exchange only after k (fixed at 10) iterations.
This concludes our brief description of the gradient-free sampling schemes considered in this paper.

Results
We used a single node neural mass model (NMM, Fig. 1) to characterise the computational performance of four different MCMC sampling algorithms, for parameter inference. Inference was performed under the assumption that the (neuronal) system is partially observable i.e., only the pyramidal cell voltage (x 9 in Eq. (1)) was available.
The efficiency of a MCMC sampler is defined as the ratio of the computation time and the number of effective samples produced in that time. The effective sample size (ESS) for each parameter is calcu- where R is the number of posterior samples post-burn-in and ∑ q γ q ð Þ is the sum of Q monotonic autocorrelations. This auto-correlation is estimated using the initial monotone sequence estimator (Theorem 3.1 in Geyer (1992b)). The minimum ESS reports the number of samples that is effectively uncorrelated over all parameters. Similarly, the time normalised (wall-time/ minimum ESS) ESS tells us how much time we spend sampling a single uncorrelated sample, providing us with a measure of the worst-case behaviour (Cormen et al., 2001) of the sampling algorithm. In short, an efficient sampler would produce a large ESS at the shortest possible time.
We used a normal symmetric random walk Metropolis (RWM) schemeas used in previous work on sampling schemes for DCM by Chumbley et al. (2007). The l 2 error-norm was among the highest over all schemes considered (10.8) ( Fig. 2A, E). The RWM algorithm resulted in the lowest ESS (Table 2). This is because chain mixing is confounded by the inherent random walk displayed by this class of algorithm. The slice-sampler on the other hand had the highest ESS, albeit at increased computational cost (Fig. 2B, F). Also, it had the lowest l 2 error (Table 2).
In terms of computational time, adaptive Metropolis (Fig. 2C, G) with stochastic approximations of the mean, covariance and the scale of the proposal distribution emerged as the clear winner (Table 2). It had lower ESS in comparison to slice-sampling but took 90% less time to produce a single independent sample. The increased ESS reflects the fact thatunlike the RWMthe proposal distribution has been adapted to guarantee a prescribed acceptance rate of 23%.
So far we have only considered a single Markov chain. Multiple chains running at a variety of temperatures can be used to not only facilitate rapid chain mixing but also to sample from multi-modal posterior densities (Fig. 2D, H). At about 150% increase in compute time (with respect to adaptive MCMC), the population Metropolis method (with 4 chains running at 4 different temperatures with proposal swaps every 10 iterations) has the highest ESS after slicesampling, but with a 6-fold decrease in compute time per independent sample. This is dependent upon the temperature spacing of the parallel chains; where uniform spacing of inverse temperature performs poorly (Table 2).
Adaptive Metropolis and population based Markov chains appear to be equally efficient; although the latter requires expert intervention in choosing the number of chains, the form of inverse temperature ladder and the selection of proposal exchange partners. Adaptive Metropolis on the other hand did not have any parameters that require tuning. In summary, for inversion of these sorts of DCMs, our (gradient-free) sampler of choice is the single chain adaptive MCMC algorithm.

Discussion
In this note, we compared four gradient-free MCMC methodsrandom-walk Metropolis sampler, slice-sampler, adaptive MCMC sampler and population-based MCMC sampler in terms of their effective sample size (ESS). Both adaptive and population MCMC take between 0.4 and 0.6 min (on a 2011 Macbook Pro laptop) to generate a single uncorrelated sample. Adaptive MCMC does this by matching the proposal density to the required target density, whilst proposal-exchanges enable neighbouring chains to mix more quickly in population MCMC sampling. This is particularly useful for DCMs, where such population of Markov chains enable the inference algorithm to ameliorate issues like local minima and multi-modal posterior densities that are characteristic of many variational algorithms. The population MCMC that we have used is similar to a genetic algorithm (GA), where the samples from different chains interact to mimic natural selection. The key difference is that GAs find a single optimum point, whilst population MCMC furnishes a probability density. The temperature ladder can be seen as a cross-over process where fitter samples move to a lower temperature.
Poor-mixing results when the Markov chain is confined to isolated modes or mix poorly along samples with strong correlations. Indeed using multiple chains allows for a mode hopping characteristic for the underlying Markov chain. This is especially useful when sampling from multimodal posteriors. This is because non-local moves are made that result in crossing the barrier imposed by a local potential well. In addition to the population estimator that we have evaluated, a multitude of mode-hopping MCMC samplers exist for tortuous posterior densities. This ranges from Jump-walking (J-Walking) estimator (Frantz et al., 1990) with a potential problem of not satisfying detailed balance to Smart-darting (S-darting) where detailed balance of population walkers is maintained (Andricioaei et al., 2001).
In neuroimaging, the practitioner is not only interested in estimating the distribution of parameters that explain the EEG, MEG or fMRI signal but also build a variety of models to test competing hypothesis about the same observed data. Population MCMCin the form of power posteriorsnot only helps in parameter inference but also facilitates estimation of the partition function or model evidence. Via thermodynamic integration (Friel and Pettitt, 2008;Lartillot and Philippe, 2006) the model evidence can be obtained by running multiple Markov chains in parallel and numerically integrating samples over a variety of temperatures (using quadratures) to compute the model evidence (Eqn. 15 in Calderhead and Girolami (2009)). This becomes important when one has to choose between models using model comparison (Claeskens and Lid Hjort, 2008). Standard MCMC technology that relies on unnormalised probability densities infers the model evidence from samples generated by independent chains. Unlike thermodynamic integration, such estimates of the model evidence are highly variableeven in the high sample-size limit (Calderhead and Girolami, 2009;Jasra et al., 2007), rendering Bayesian model comparison useless. In population MCMC, since multiple chains take about the same amount of time as single chains due to the inherent parallelism of the scheme, one can evaluate model evidence without additional cost.
Unlike Lartillot and Philippe (2006) who suggest a uniform schedule for the temperature ladder, we verified that using a temperature ladder with power-law characteristics was beneficial (in terms of ESS) as pointed out by Calderhead and Girolami (2009). Such a schedule is not only beneficial from the point of the ESS but also in reducing the variance of the model evidence (Calderhead and Girolami, 2009). This is because geometric schedulesthat model power-law densitiesare the extremal solutions of the Monte Carlo variance (Gelman and Meng, 1998).
Our inference algorithm based on adaptive MCMC sampler adapted the proposal distribution only during the burn-in iterations. This was done to avoid using past information infinitely often, preserving the Markov property of the transition kernel. An alternate methodology adopted by Gelfand and Sahu (1994) is to run several chains in parallel and use sampling-importance-resampling (SIR) (Rubin, 1998) to form kernels that have higher ESS whilst suppressing those chains that do not, using the approximation to the marginal distribution of the chain as a proposal distribution. It is vital to keep in mind that continued adaptation can disturb the invariant distribution of the chain. Although computationally inefficient, adaptation using delayed rejection (Tierney and Mira, 1999) or regeneration (Gilks et al., 1998) can be helpful.
An important issuewhen using MCMC for Bayesian inferenceis determining when the chain has converged. This criterion is crucial and therefore forms a large part of ongoing research that ascertains rapid convergence. Running an ergodic sampler for an infinite amount of time will result in convergence on the ground truth, per definitiontougher convergence criteria can necessitate longer runtimes. Measure-theoretic analysis of most MCMC samplers gives an estimate of the number of samples required to ensure convergence, according to a total variation distance (with a specified tolerance bound) to the true posterior density. For empirical problems this is seldom possible. A simpler but computationally wasteful strategy involves running multipleyet independentchains and ensuring that the posterior density obtained by each chain is identical in terms of its lower moments. A more cogent diagnostics to estimate convergence of the Markov chain uses the normal theory approximations of Gelman and Rubin (1992). This introduces a shrink factor that tends to 1 (depending on between chain and within chain convergence) as the Markov chain converges. For a discussion of convergence estimators, see Table 1 in Cowles and Carlin (1996).
There is no one sampler that is suitable for all inference problems; MCMC samplers that are based on geometric formulation of gradients, adaptation and tempering/annealing have over the years reduced concerns about local minima and sampling of multi-modal posteriors typically faced by deterministic algorithms. An important  future development would be in terms of combining gradient-free MCMC estimators with their gradient-based counterparts. For NMMs, we notice that the gradient based manifold Langevin samplers are computationally efficient (Sengupta et al., under review); yet the ESS is not as high as the Hamiltonian Monte Carlo (HMC) sampler. Thus, a gradient-based algorithm can be used as a starting algorithm followed with a population gradient-free MCMC sampler. The proposal distribution of each chain could be individually adapted, forming an adequate trade-off between computational time and the number of independent samples.