Designing Simple and Efficient Markov Chain Monte Carlo Proposal Kernels

We discuss a few principles to guide the design of efficient Metropolis– Hastings proposals for well-behaved target distributions without deeply divided modes. We illustrate them by developing and evaluating novel proposal kernels using a variety of target distributions. Here, efficiency is measured by the variance ratio relative to the independent sampler. The first principle is to introduce negative correlation in the MCMC sample or to reduce positive correlation: to propose something new, propose something different. This explains why singlemoded proposals such as the Gaussian random-walk is poorer than the uniform random walk, which is in turn poorer than the bimodal proposals that avoid values very close to the current value. We evaluate three new bimodal proposals called Box, Airplane and StrawHat, and find that they have similar performance to the earlier Bactrian kernels, suggesting that the general shape of the proposal matters, but not the specific distributional form. We propose the “Mirror” kernel, which generates new values around the mirror image of the current value on the other side of the target distribution (effectively the “opposite” of the current value). This introduces negative correlations, leading in many cases to efficiency of > 100%. The second principle, applicable to multidimensional targets, is that a sequence of well-designed one-dimensional proposals can be more efficient than a single d-dimensional proposal. Thirdly, we suggest that variable transformation be explored as a general strategy for designing efficient MCMC kernels. We apply these principles to a high-dimensional Gaussian target with strong correlations, a logistic regression problem and a molecular clock dating problem to illustrate their practical utility.


Introduction
Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a probability distribution π on X ⊂ R d , where π may be known up to a normalizing constant. It is widely used to simulate samples from the posterior distribution in Bayesian inference where the normalizing constant is usually intractable. The Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953;Hastings, 1970) simulates a discrete-time Markov chain on X with stationary distribution π as follows. Given the chain is currently at x, a potential next state x is generated from a proposal kernel Q on X, with density q(x |x). The chain then moves to x with probability α(x, x ) := min 1, Otherwise it stays at x. The resulting Markov chain has transition kernel P with density By construction, this Markov chain is reversible with respect to π, with π(x)p(x |x) = π(x )p(x|x ) for almost every x, x ∈ X. If the proposal kernel Q is irreducible and aperiodic, the transition kernel P will also be irreducible, aperiodic and is π-invariant.
A simulated path (x n ) N n=1 of the chain can be used to estimate an expectation under π. Let f : X → R be an absolutely integrable function and let π(f ) := E π f (x) = X π(x)f (x) dx denote the expected value of f under π. Then π(f ) can be estimated bŷ Provided the Markov chain is ergodic with stationary distribution π, this estimatorπ(f ) converges to π(f ) almost surely as N → ∞ (see e.g. Theorem 3 in Tierney (1994)). Moreover, under certain ergodicity assumptions, the central limit theorem holds for √ Nπ(f ) (see e.g. Theorems 4 and 5 in Tierney (1994)), with the asymptotic variance where V f := Var π (f (x)) = E π (f (x) − E π f (x)) 2 is the variance of f (x) under π, and ρ k := Cor(f (x n ), f(x n+k )) is the lag-k autocorrelation.
When the state space X is discrete, Peskun (1973) showed that given a transition kernel Q, the choice of the acceptance probability α in (1) is optimal in terms of minimising the asymptotic variance ofπ(f ). The analogous result for continuous state spaces was provided by Tierney (1998). However, what features the proposal kernel Q should have to minimise the asymptotic variance is not clear. The most common choice of Q is based on the random walk x = x + u where u has a Gaussian or uniform distribution with variance σ 2 . The Langevin proposal x = x + σ 2 2 ∇ x log π(x) + u with u ∼ N (0, σ 2 ) makes use of gradient information to bias the proposal towards a local mode of the target. All these proposals involve a step-size parameter σ yet to be specified.
It is well known that a poor choice of σ can adversely affect the mixing and convergence properties of the algorithm. Determining the optimal choice of the step-size parameter σ is an active area of research, known as optimal scaling. Gelman et al. (1996) estimated the optimal σ (that minimises ν) for the Gaussian random walk q(x |x) = N (x |x, σ 2 ) for estimating the mean of the N (0, 1) target to be about 2.38, with the corresponding expected acceptance probability to be about 0.44. When the target distribution π is d-dimensional with identically distributed components, Roberts et al. (1997) showed that for the Gaussian kernel q(x |x) = N (x |x, σ 2 I d ), the optimal step size σ that minimises ν is the one that leads to P jump ≈ 0.234 as d → ∞. Recent work on optimal scaling covers more complex algorithms such as Multiple-try Metropolis (Bédard et al., 2012), delayed rejection MH (Bédard et al., 2014), Hamiltonian Monte Carlo (HMC) (Beskos et al., 2013), as well as the MH and Metropolis-adjusted Langevin algorithm (MALA) in the infinitedimensional setting (Beskos et al., 2009;Pillai et al., 2012). However, optimal scaling analysis has been mostly limited to the Gaussian random walk and the Langevin proposal (Roberts and Rosenthal, 1998). Beyond this small collection of proposal kernels, there is no general theory available.
In this work, we address the problem of designing efficient proposal kernels (Q) for the MH algorithm, with each kernel implemented at a nearly optimal scale. Recently, Yang and Rodríguez (2013) empirically demonstrated that using the so-called Bactrian kernels can substantially improve the asymptotic efficiency for a range of univariate target distributions, compared with the uniform random walk, which is in turn more efficient than the Gaussian random walk. We extend this work by proposing several new proposal kernels and evaluate their statistical efficiency at the optimal step size.
In Section 2, we describe the calculation of efficiency and automatic adjustment of the proposal step size (σ). We present new kernels for one-dimensional targets in Section 3, and consider multidimensional targets in Section 4. In Section 5, we apply the new kernels to the Bayesian molecular clock dating problem in molecular phylogenetics. We discuss limitations of our work as well as connections to previous work in Section 6.

Target and proposal distributions
We consider the following five target distributions (Figure 1): (a) Standard normal distribution N (0, 1), with mean 0; (b) Mixture of two normal distributions 1 , with mean 0. Each of the five targets has variance 1. Note that even though the density in (b) has two modes, we focus in this paper on simple targets with a single mode; we do not expect the proposals discussed here to work well when the target has multiple peaks separated by deep valleys.
Note that targets (d) and (e) have a constrained support. Sampling from targets with constrained support is often dealt with using rejection or truncated proposals (or truncated full conditionals in the context of Gibbs sampling) (Gelfand et al., 1992;Browne, 2006). We note that rejection can be very inefficient if a large proportion of proposed values are discarded, while the truncated variables can be expensive to simulate, often based on the inverse transform method (Devroye, 1986, p. 38). It is simpler and typically more efficient (in terms of the amount of computation involved as well as the asymptotic variance of the estimator) to use reflection (e.g. Yang and Rodríguez (2013)). For example, if x has support on the interval [a, ∞) and if the kernel The proposal ratio is 1.

Efficiency calculation
As in Gelman et al. (1996), we define statistical efficiency of the estimatorπ(f ) (2) as the ratio of the asymptotic variance ofπ(f ) for an iid sample to the variance for an MCMC sample of the same size We use the identity function f = 1 in (2) and estimate the mean of π. We used two methods to estimate E, one based on discretization of the target distribution, and another based on the MCMC sample.
Method 1: Using the transition matrix on a discretized state space The state space For each bin k = 1, . . . , K, we use the midpoint x k := x L + (k − 1/2)Δ as its representative. Then we compute the transition matrix P on the discretized space, and calculate E using a closed form expression from Kemeny and Snell (1960) or Peskun (1973). This method requires an analytic expression of the proposal density q(x |x). The calculation details are described in Gelman et al. (1996) and Yang and Rodríguez (2013).
Here, we use x L = −5, x U = 5 and K = 500 for all targets, except for the mixture of two t 4 , where we use x L = −10, x U = 10 and K = 1000.
While our focus is on the mixing property of the Markov chain at stationarity, we also consider two measures of the convergence rate of the Markov chain, namely the absolute value of the second largest eigenvalue of P , denoted |λ| 2 , and the largest total variation distance to the target distribution after n steps among all possible starting points, denoted δ n . We use n = 8. These measures are computed on the discretized space. See Yang and Rodríguez (2013).
Method 2: Using MCMC samples The initial positive sequence method of Geyer (1992) is used, which truncates the infinite sum of (4) as soon as ρ k−1 + ρ k < 0. This uses an MCMC run and requires the evaluation of the proposal ratio, but not the proposal density. Our experience suggests that small sample sizes (say, N < 10 5 ) do not allow reliable estimation, especially when E is small. We typically use N = 10 7 to 10 8 , after a burn-in of 10 4 iterations.
We also use the following efficiency measure in the case of one-dimensional targets , where the expectation is over the joint distribution of x n−1 and x n . This has been used for optimal scaling (Sherlock and Roberts, 2009) and adaptive MCMC (Pasarica and Gelman, 2010). Maximising E 2 π is equivalent to minimising the first-order autocorrelation ρ 1 .

Tuning of the proposal step size
Ideally, the proposal step size σ should be set to give the optimal efficiency E. A computationally intensive approach is to run the algorithm for a range of σ values and choose the one that gives the highest efficiency, referred to as grid evaluation. This is expensive and may not be practical in real applications. It is used for one-dimensional targets in Section 3. In Section 4, we employ an approach of automatic scale adjustment (Yang and Rodríguez, 2013), where we monitor P jump and use it to adjust σ for a onedimensional proposal. Note that there is a monotonic decreasing relationship between σ and P jump (larger σ meaning smaller P jump ). Specifically, we use the relationship P jump (σ) = 2 π tan −1 (2/σ), for the N (0, 1) target and x |x ∼ N (x, σ 2 ) kernel (Gelman et al., 1996), to derive the update formula where σ is the current step size, P jump is the observed acceptance proportion, while σ * and P * jump are the optimal values. The optimal P jump is around 0.4 for unimodal kernels (including Gaussian and uniform kernels) and 0.3 for bimodal kernels (including Bactrian, Box, Airplane and StrawHat kernels); see Section 3 and Yang and Rodríguez (2013). We update σ several times during the burn-in.
Choice of the step size σ for the Mirror moves is discussed below.

New one-dimensional proposals
These proposals attempt to reduce the autocorrelation of the Markov chain, thereby improving the precision of the resulting MCMC estimates. One simple approach is to use a bimodal distribution with two modes on both sides of the current position. We describe three such proposals, called Box, Airplane and StrawHat ( Figure 2). They have a bimodal shape similar to the Bactrian-type kernels given in Yang and Rodríguez (2013), and are symmetric, with q(x |x) = q(x|x ). We then present a novel family of non-symmetric kernels that induces negative correlations in the Markov chain, called the Mirror kernel ( Figure 3).
For each of the proposal kernels described below, we first introduce a standard distribution version with zero mean and unit variance. Then given a current point x of the Markov chain, we give the proposal density with mean x and variance σ 2 .

Box
Given x, we generate x uniformly from two intervals, one on each side of x (Figure 2a). The standard box distribution is p(y; a) := , and a is a parameter taking values in the interval [0, 1). When a = 0, this is U (− √ 3, √ 3), which is the uniform kernel. In the proposal, we set x := x + σy, where y has the standard box distribution, with density q(x |x) = 1 2σ(b−a) , σa ≤ |x − x| ≤ σb. To sample from q(x |x), draw y ∼ U (a, b) and u ∼ U (0, 1). If u < 1 2 , set y ← −y. Then set x ← x + σy.

Airplane
The standard Airplane distribution p(y; a) √ 2) is a parameter (Figure 2b). The proposal density with mean x and variance σ 2 is q(

StrawHat
The standard StrawHat distribution p(y; a) is Then set x ← x+σy. In all three moves (Box, Airplane, StrawHat), when a = 0, the move reduces to the uniform kernel. We note that if a is too close to the upper limit, efficiency tends to drop off quickly as σ becomes too large ( Figure S1). In practice, we fix a at a = 0.5 (b = 1.43) for Box, a = 1 (b = 1.47) for Airplane and a = 1 (b = 1.35) for StrawHat. Each kernel then has a step size (σ) which can be adjusted to achieve good mixing.

Mirror
In the Mirror kernel, we generate values around a point on "the other side" of the target distribution that is the "mirror image" of the current point x. Specifically, let μ * be an estimate of the location of the target such as the mean or median. The proposal kernel is centred at x * := 2μ * − x, the point with the same distance from μ * as the current point x (Figure 3). We consider two variants, using either the uniform or Gaussian distribution. In the MirrorU kernel, we have and in the MirrorN kernel, we have Both have mean 2μ * − x and variance σ 2 .
For example, consider the N (0, 1) target. If μ * is the true mean (μ) of the target, optimal asymptotic efficiency (for estimating μ) is achieved by having σ = 0, in which case E = ∞ with P jump = 1. However, in that case, the chain does not sample from the target, and E for estimating other functions may be 0. In general, if μ * is close to the true mean, one would prefer a small σ to achieve a high efficiency (E), but a small σ may lead to slow convergence to the target distribution. On balance, we suggest two choices of the step size: σ =ŝ or σ = 1 2ŝ , whereŝ is the estimated target standard deviation. Both μ * andŝ are estimated during the burn-in of the MCMC.
If the target support is not the whole real line, the proposed value may lie outside the target support. While one could reject such values, rejection is not workable if all possible proposed values are outside the target support (i.e. when the proposal and target supports do not overlap). Reflection is another possibility but there are two problems. First, reflection would defeat the purpose of moving to the other side of the target. Second, with a small step size, the reverse move x → x of the MirrorU move with reflection may not be possible, thus breaking the detailed balance condition. As an example, consider a target with support [0, ∞). For MirrorU (6) with μ * = 1.5 and step size σ = 1, suppose the current value is x = 5. Then Instead of rejection or reflection, we transform the target support X onto the real line R before applying the Mirror move. For instance, if X = [a, ∞), we apply the Mirror move on the transformed variable y : . With these log transformations, the original variable in the X space is multiplied by a random factor, and the Mirror proposal is referred to as Mirror Multiplier.

Results
Figures 4 and 5 show the performance of eight proposal kernels applied to five targets plotted against the proposal step size σ. We observe large variations in efficiency as σ changes, which emphasises the importance of choosing σ to achieve high efficiency. We also note that for the uniform and Gaussian kernels, optimal σ for convergence rate (δ 8 and |λ| 2 ) is larger than that for mixing, while the opposite is true for the bimodal kernels.
The Box, Airplane and StrawHat kernels have similar efficiency to the Bactrian-type kernels from Yang and Rodríguez (2013), with Box and StrawHat generally performing slightly better than the BactrianTriangle kernel (Table 1). In addition, all these bimodal kernels are better than the unimodal Gaussian and uniform kernels. The detail of the distributional form appears to be less important. Among the bimodal kernels, we prefer the StrawHat as it tends to achieve high efficiency and it is not very sensitive to the choice of step size.
For the MirrorU and MirrorN kernels, we fixed μ * to 0.1 for all targets in this Section, except the gamma target, where we used μ * = 1.5. Using a fixed μ * allowed us  i.e. we apply the Mirror kernel to log x, with mean 0.563 and log μ * = 0.405. For uniform target (mean 0), we use μ * = 0.1; i.e. we apply the Mirror kernel to log to optimise the step length σ and obtain smooth efficiency curves ( Figure 4) without averaging over many simulation replicates. The two Mirror kernels generally achieve several-fold improvements in efficiency, and are "super-efficient", with E > 1, in most cases (Table 1). In practical applications, we suggest setting μ * =μ and σ =ŝ or σ = 1 2ŝ , with both the target mean μ and standard deviation s estimated during the burn-in (Section 3.4). If the estimated mean is closer to the true mean than the fixed μ * used in our experiments, performance will be better as well. For the N (0, 1) target, the efficiency, averaged over 10 replicates, is 1.290 for σ =ŝ, and 2.815 for σ = 1 2ŝ for MirrorN, compared with E = 1.82 when μ * = 0.1 is fixed and σ is optimised in Table 1.
We note that the ranking of the proposal kernels is largely the same across these five targets ( Table 1), suggesting that this pattern may hold for fairly arbitrary targets. For the Box, Airplane and StrawHat kernels, the optimal P * jump is reasonably stable across targets evaluated, and we suggest using the automatic scale adjustment (5) for setting the proposal step size σ, with P * jump = 0.3. Finally, to assess whether the efficiency ordering of the kernels depends on the specific function estimated, we consider estimating a tail probability of the normal target N (0, 1). For estimating the probability P(x > 2.3263) = 0.01, the same ordering of the kernels holds as for estimation of the target mean, with comparable optimal σ (Figure 6a). The highest efficiency is E ≈ 0.4, achieved by the two Mirror kernels. Similar results were obtained for estimating the probability P(x > 1.2815) = 0.1 (Figure 6b), but with a generally narrower range of σ achieving the maximum efficiency. The MirrorU kernel was the most efficient, with E ≈ 0.5.

Multivariate Gaussian target using multidimensional uniform and Mirror kernels
We extend the one-dimensional uniform and MirrorN kernels to multiple dimensions for the N d (0, I) target, obtaining optimal scaling, optimal efficiency and P jump (Table 2). For the uniform kernel, we consider the Cube and Sphere extensions in multi-dimensions. For MirrorN, we consider two variants, MirrorN1 with x |x ∼ N (x * ,Σ) and MirrorN 1 2 with x |x ∼ N (x * , 1 4Σ ), where x * = 2μ * − x, with μ * andΣ to be the estimated target mean and variance. Efficiency is calculated by averaging over 10 replicates.
We find that the Cube and Sphere kernels are more efficient than the Gaussian kernel for d = 1, 2, 3, 4, but both are very similar to the Gaussian when d > 4. The MirrorN1 and MirrorN 1 2 kernels are several times more efficient than Gaussian, Cube and Sphere kernels for d ≤ 10, with MirrorN 1 2 being over twice more efficient than MirrorN1. Note that these MirrorN moves evaluated in Table 2 are d-dimensional moves. In comparison, the efficiency of one-dimensional MirrorU and MirrorN is higher than 100% whatever the dimension of the target is (Table S1).
In the supplementary material (Thawornwattana et al., 2017), we also perform detailed simulations for the two-dimensional case, comparing the two-dimensional extensions of the uniform kernel (the Square and Disc), with one-dimensional Gaussian, uniform and MirrorU kernels.  Table 2: Optimal step size (σ) and asymptotic efficiency (E) for the Gaussian target N d (0, I) and five proposal kernels. The results for the Gaussian kernel are from Table 1 in Gelman et al. (1996). For MirrorN1 and MirrorN 1 2 , the proposal covariance isΣ and 1 4Σ , respectively, whereΣ is the estimated target covariance from the burn-in. The results are averaged over 10 replications.

Hundred-dimensional Gaussian distribution
To demonstrate the scalability of our approach to high dimensions, we consider the N 100 (0, Σ) target where Σ −1 is generated from a Wishart distribution with identity scale matrix and 100 degrees of freedom. The target distribution used has many strong correlations, with 1627 out of 4950 pairs of variables having correlations with magnitude greater than 0.99.
We compare the one-dimensional Gaussian and MirrorU kernels. For MirrorU, the parameter μ * is set to the target mean estimated during the burn-in, and the componentspecific proposal step size σ is set to eitherŝ or 1 2ŝ whereŝ is the estimated standard deviation of the component in the target. These two proposals are referred to as Mir-rorU1 and MirrorU 1 2 , respectively. We use the whitening transformation to remove correlations among the components and rescale all the components to have variance 1. This transformation requires the target's covariance matrix Σ, which can be estimated during the burn-in.
We use 10 5 iterations of burn-in and 10 7 iterations of the main chain. If estimation of Σ is required, we initialise Σ with the identity matrix and update every 10 4 iterations (thus ten rounds of update in total). The final covariance matrix used by the sampler is based on the last 10 4 burn-in samples. For the Gaussian kernel, we use automatic tuning of proposal step size (5) with optimal P jump = 0.4. For this problem, the MirrorU1 and MirrorU 1 2 kernels give about four-fold and tenfold increase in efficiency compared with the Gaussian kernel (Table 3). Efficiency is similar whether the true or estimated variances are used, illustrating that the approach of estimating the variance is practical. Stan does not perform well and takes about 100 times longer than the Gaussian and MirrorU kernels.

Bayesian logistic regression
Next, we apply the MirrorU kernel to the Bayesian logistic regression analysis of the German credit dataset. The same dataset was used by Girolami and Calderhead (2011) to demonstrate several state-of-the-art MCMC algorithms, namely MALA, HMC and their Riemannian manifold versions. We also include for comparison the Stan algorithm (version 2.15.1) (Carpenter et al., 2017), which implements HMC with automatic tuning. Note that MALA and HMC require the first derivatives, while their manifold versions additionally require the Fisher information matrix as well as its derivatives. The target distribution is where θ is a vector of an intercept term and 24 regression coefficients, x n is a vector of 24 normalized predictors (with zero mean and unit variance), y n ∈ {0, 1} is an indicator for a good credit risk, and N = 1000. We give each component of θ an independent Gaussian prior N (0, α) with α = 100, following Girolami and Calderhead (2011). Each chain is run for 10 7 iterations after 10 4 burn-in iterations. For MALA, MCMC and the manifold versions, we use the Matlab implementation of Girolami and Calderhead (2011) and run for 10 6 iterations.
From Table 4, the multidimensional MALA and HMC proposals are worse than the simple one-dimensional 1DUniform and are comparable to the 1DGaussian kernel. The manifold versions of MALA and HMC are much better than all those four, and Stan performs the best. The MirrorU 1 2 kernel has comparable efficiency to manifold HMC and Stan, achieving super-efficiency (E > 1) for most of the 25 parameters, while taking less time. We note that the Mirror kernel requires estimation of the target mean and variance, but is otherwise very simple to implement, and does not require any finetuning. MALA and HMC require estimation of the target variance, and the manifold versions in addition need higher derivatives or Fisher information. In complex models where analytic expressions of the required derivatives are not available, automatic differentiation may be used to evaluate derivatives at machine precision, but with increasing running time. In addition, MALA, HMC and their manifold versions all have at least one parameter that requires tuning. Thus the Mirror kernel is simpler to implement.
However, manifold MALA, HMC and manifold HMC give consistent efficiency across dimensions, while for the Mirror kernel, some components can have much lower efficiency than the rest.

Application to phylogenetics
We apply the proposal kernels studied above to a Bayesian inference problem of estimating species divergence time and evolutionary rate using molecular sequence data from two species. The dataset is the 12S rRNA gene from the mitochondrial genome of human and orangutan from Horai et al. (1995), summarized as x = 90 differences out of n = 948 sites. Running time (s), C 936 936 937 n/a n/a n/a n/a 2263 Running time (s), Matlab n/a n/a 1981 299 4649 7069 25039 n/a Table 4: Efficiency for estimating the mean of the posterior distribution for the logistic regression problem. Running time (in seconds) is for 10 6 iterations for all kernels. The 1D Gaussian kernel is implemented in both C and Matlab, and indicates a 2-fold difference in running time between the two languages.

Model
The evolutionary process at each site is modelled as a continuous time Markov process on the four nucleotides (T, C, A, and G) with the transition rate matrix Q = {q ij }, with q ij = λ for any i = j (Jukes and Cantor, 1969). The substitution rate for each nucleotide is thus r = 3λ per time unit, which is defined as one million years (Myrs). The transition probability matrix is Given the data of x differences at n sites, the likelihood is This is a function of the genetic distance θ := 2tr, but not of t and r individually. The maximum likelihood estimate of θ isθ = 3 4 log( 3n 3n−4x ) = 0.1015, and the 95% likelihood interval is (0.0817, 0.1245).

MCMC algorithms for posterior inference
Since the uniform proposal is generally more efficient than the Gaussian proposal, we consider seven proposal kernels (A1-7) based on the uniform and MirrorU kernels and five state-of-the-art MCMC algorithms: MALA, HMC, HMC (Stan), manifold MALA, manifold HMC (A8-A12), which are based on a multivariate Gaussian proposal. The derivatives and Fisher information matrices required by algorithms A8-A12 are derived using the unnormalized posterior (9); these quantities are tractable but tedious (see supplementary material). We use variable transformations to deal with correlations and/or scale differences of the target variables (Figure 7b). Depending on the transformation used, each algorithm has component-specific scaling parameters. Specifically, σ t and σ r are standard deviations of proposals on t and r; σ w and σ z are for w := log t and z := log r ( Figure 7c); σ x and σ y are for x := log(tr) and y := log(t/r) (Figure 7d). The details for tuning these step-size parameters are summarised in Table 5, and explicit steps for each algorithm are provided in the supplementary material.   Table 5: Efficiency of twelve kernels for the molecular clock dating problem. The scaling factor c = σ/s is the ratio of the proposal standard deviation σ over the target standard deviation s. NOTE: s w and s z are the standard deviations of w := log t and z := log r; s x and s y are the standard deviations of x = log(tr) and y = log(t/r),μ denotes the estimate of the true mean μ, andŝ denotes the estimate of the true standard deviation s. The running time is an average from 10 replications. The scaling factors for A6a and A6b are not exactly 1 and 0.5 because the variances estimated during burn-in involve inaccuracies. For manifold MALA and manifold HMC, the scaling factor depends on the current position of the Markov chain.
Algorithm A9 HMC on w, z.
Algorithm A11 Manifold MALA on w, z.
Algorithm A12 Manifold HMC on w, z.
Note that A4 and A7 use generic logarithm and whitening transformations to deal with correlations and scale differences, while A5 and A6 use certain features of the model (namely the fact that the likelihood depends on tr only) to design efficient transformations or search direction.
For each kernel, we simulate a Markov chain for 5 × 10 7 iterations, after a burn-in of 8×10 4 iterations. The estimates of the two marginal posterior means (and the 2.5th and 97.5th percentiles) are identical for all algorithms: 14.58 (10.5, 19.4) for t and 0.00361 (0.0025, 0.0051) for r, while the efficiency of the algorithms varies by nearly 40 folds (Table 5).
When the target's covariance structure is not taken into account, the efficiency achieved is less than 10%. The one-dimensional uniform proposals on t and r and on log t and log r (A1 and A2, respectively) are very inefficient, with E ≈ 5%, even less efficient than the two-dimensional uniform kernel (A3). This is not surprising as both pairs (t, r) and (log t, log r) are highly correlated (correlation about −0.8), as is expected from the fact that the likelihood depends on the product tr only. Removing the correlation and adjusting for the scale differences between the target variables via the whitening transformation (8) (A4) improves the efficiency significantly. An alternative and computationally cheaper way to reduce the correlation is to use the transformation x = log(tr) and y = log(t/r) (A5), based on our knowledge of the model. This reduces the correlation to −0.28, and yielded a similar efficiency boost as A4.
The MirrorU kernels A6 and A7 show a superior performance to the uniform kernel using the same transformation (A4 and A5) with no extra computational cost. Independent simulations with different estimates of the target mean and variance suggest that efficiency is stable around the values given in Table 5 (supplementary material). Both MALA (A8) and manifold MALA (A11) perform better than the uniform kernel (A4) (E ≈ 60%), but do not beat the MirrorU kernel. HMC (A9) and manifold HMC (A12) also give super-efficient estimates, but at greater computational and implementation cost. Stan (A10) does not perform as well as other variants of HMC (A9, A12). In terms of efficiency per second, all variants of the MirrorU kernel outperform manifold MALA, manifold HMC and Stan by a substantial margin (Table 5). Finally, although well-tuned MALA and HMC also give good efficiency-per-time results, the need for high-order derivatives and manual tuning of the step-size parameters makes them challenging to implement.

Measures of performance
We have compared the mixing efficiencies of different MH proposals, measured by the asymptotic variance for estimating a function defined on the target distribution (such as the mean or tail probability). As the efficiency of the kernel may depend on the function or target (Mira, 2001), we have included several targets in our evaluation. We note that the rankings of the proposal kernels are largely the same for all the targets we evaluated, suggesting the existence of some general principles that may apply to fairly arbitrary targets.
Besides the mixing efficiency, another useful measure is the rate of convergence of the Markov chain to the stationary distribution (such as δ 8 in Table 1). This rate should affect the desired length of the burn-in. We consider the convergence rate to be less important than the mixing efficiency because the burn-in is typically a small fraction of the MCMC run, and because a kernel efficient for mixing tends to also be good for convergence. For example, the uniform kernel converges faster and mixes more efficiently than the Gaussian kernel (Table 1). It is also cheaper to simulate than the Gaussian kernel. For the Mirror kernel, a small step size gives an estimate with a lower asymptotic variance, but causes slower convergence. It is thus preferable to use large steps during the burn-in for fast convergence, and small steps afterwards for fast mixing.
In practical MCMC applications, the computational and implementational costs are of major concern. We note that the computational cost may depend on hardware and software implementation details, as well as the specific inference problem. For example, certain one-dimensional moves may not change the likelihood and are thus computationally efficient, such as the change to y = t/r when x = tr is fixed in the molecular clocking dating example. Our analyses of the logistic regression and the molecular clock dating examples suggest that the Mirror moves are simpler to implement and run faster than the manifold MALA and HMC kernels. We leave it to the algorithm developer to assess the computational cost of different proposals in their specific applications.

Comparison with other MCMC algorithms
Several MCMC algorithms have been proposed to improve mixing by suppressing the diffusive behaviour of random walk proposals in which every iteration tends to take a small step in a random direction. We discuss a few that are related to our work.
The idea of proposing values on the other side of the distribution appeared in the literature before. For instance, the overrelaxation method (Adler, 1981;Barone and Frigessi, 1990) is a Gibbs sampler for Gaussian conditionals that makes a move to the other side of each component's full conditional. The update for the component i is where μ i|−i and σ 2 i|−i are the conditional mean and variance of x i given all other variables x −i , and α ∈ (−1, 1) is a user-specified parameter. Choosing α ∈ (−1, 0) will make a Figure 8: Sample path from a few steps of four algorithms for sampling from N 2 (0, Σ), with Σ = ( 1 9 9 100 ): (a) standard Gibbs sampler, (b) overrelaxed Gibbs sampler (α = −0.98), (c) MH using 1D TransfMirrorN 1 2 kernel, and (d) MH using 2D MirrorN 1 2 kernel. The first three consist of a sequence of two one-dimensional moves, while the last one is a single two-dimensional move. The 1D TransfMirrorN 1 2 kernel applies the MirrorN kernel y i |y i ∼ N (2(Σ −1/2 μ * ) i − y i , 1 4 ), i = 1, 2, on y =Σ −1/2 x, where x = (x 1 , x 2 ) is the target variable, and μ * andΣ are estimated mean and covariance matrix of the target from the burn-in as described in Section 4.2. The 2D MirrorN 1 2 kernel proposes x |x ∼ N (2μ * − x, 1 4Σ ). Triangle = starting point (−1, 4); filled circle = state of the Markov chain; empty circle = intermediate step (for the one-dimensional moves). Two ellipses enclose the 50% and 90% probability mass of the target. move to the other side of the full conditional distribution of x i . The Markov chain does not move to the other side of the target in one step, but instead moves along the density contour (Figure 8b), with higher-order autocorrelations oscillating between positive and negative signs (Figure 9). This results in cancellations of autocorrelations in (3), yielding a lower asymptotic variance than the standard Gibbs sampler in certain cases. In contrast, the Mirror is a general MH proposal kernel that moves to the other side of the target in one step, giving a negative first-order autocorrelation (Figure 9). In addition, its implementation does not require the knowledge of the full conditionals. The mirror reflection of the current state through a centre point as an MH proposal kernel to induce negative correlations has been suggested by Tierney (1994, Section 4.3.3), who referred to it as an antithetic variate method, but theoretical analysis and empirical comparisons have been lacking.
In the antithetic coupling method (Hammersley and Morton, 1956;Frigessi et al., 2000), two Markov chains are constructed with one to be the mirror reflection of the other. Combining the two chains yields a low-variance estimate. In contrast, the Mirror kernel introduces negative correlations within a chain rather than between chains.
HMC is another method that aims to propose a value away from the current position, in the direction of the peak of the target. A proposal is generated by simulating a trajectory of the so-called Hamiltonian dynamics. It requires computation of the first derivative of the log target density, and the tuning of its parameters is currently a topic of research (Wang et al., 2013;Hoffman and Gelman, 2014). MALA is an MH algorithm that uses the Langevin proposal (see Section 1) and can be viewed as a special case of HMC. For the N (μ, s 2 ) target, choosing step size σ = 2s gives the MALA update x |x ∼ N (2μ − x, 4s 2 ), which is equivalent to the MirrorN kernel with a fixed step size of 2s.

Parametrisation, variable transformation and efficiency for different functions
Parametrisation of the target distribution or variable transformation is a useful approach to designing efficient MCMC samplers. We have illustrated this with several transformations that deal with correlations and/or scales of the variables. We note that using different functions f in (2) to evaluate MCMC mixing efficiency for the same target π is equivalent to using different transformations or target densities but the same function (such as the mean). Given that the ranking of kernels does not appear to be sensitive to the target used or the function to be estimated, a useful approach may be to transform the variable into a density for which efficient proposal kernels are known, and design proposals for the original variables accordingly.
To find a good proposal q(x |x) for the target π X (x), we use a one-to-one transformation y = T (x) so that the resulting density π Y (y) resembles a simple density for which an efficient proposal q(y |y) is known. The X-and Y -chains are then coupled, in the sense that if the initial states are the same with y 0 = T (x 0 ) and if the same sequence of random numbers is used to run the two chains, then y n = T (x n ) for all n ≥ 1. Estimating E π X (f (x)) using the X-chain sample (x n ) N n=1 is then the same as estimating E π Y (f (T −1 (y))) using the Y -chain sample (y n ) N n=1 . Thus finding an efficient proposal kernel for a given target is equivalent to finding a good variable transformation or parametrisation. It is then profitable to study the mixing efficiency for estimating various functions for simple targets such as the uniform. Viewed in this light, our early  Table 6: Efficiency for estimating the mean of three target distributions. The transformation y = e −x is used for Exp(1) and folded Gaussian, and the transformations y ∼ t 2 and y ∼ logistic are used for N (0, 1).
observation that different proposal kernels with the same general shape have similar performances is equivalent to the observation that the ranking of proposals is insensitive to the target or function used.
Consider the target x ∼ Exp(1/μ) with mean μ. Then y = e −x/μ ∼ U (0, 1). From Table 1, the uniform kernel y |y ∼ U (y − w 2 , y + w 2 ) with reflection at w = 2.8 achieves E = 1.537 for estimating E(y). Transformed onto the original variable x, the move is as follows. Set y = e −x/μ , sample y |y ∼ U (y − w 2 , y + w 2 ) and reflect so that y ∈ (0, 1). Then set x = −μ log y . The acceptance probability is which equals 1. This algorithm gives E = 1.298 for estimating E(x) = E(−μ log y) ( Table 6). This is good performance since w was optimized for estimating E(y) instead of E(x). Even higher efficiency is achieved by using bimodal kernels such as Bactrian-Triangle or the new StrawHat on y (Table 6).
Next, we use the same transformation y = e −x/μ to sample from the folded Gaussian π(x) ∝ exp(− 1 2 x 2 ), x > 0, to estimate E(x) = 0.7979. The acceptance probability is given by (10) although this does not equal 1. The uniform kernel on y gives E = 1.075 (Table 6). This is good because Exp(1) has only a passing resemblance to the folded Gaussian. Again bimodal kernels such as BactrianTriangle and StrawHat give even higher efficiency (Table 6).
Lastly, we consider two generic transformations for targets with support on the real line. We sample from x ∼ N (0, 1) using uniform, BactrianTriangle or StrawHat kernel on y = h((x −μ)/ŝ) where h is the CDF of the t 2 or logistic distribution, andμ andŝ are estimates of the mean and standard deviation of the target from the burn-in. For both transformations, the uniform kernel gives E close to 1, for estimating E(x) = 0, whereas BactrianTriangle and StrawHat kernels give E > 1 (Table 6).