Asymptotically exact inference in differentiable generative models

Many generative models can be expressed as a differentiable function of random inputs drawn from some simple probability density. This framework includes both deep generative architectures such as Variational Autoencoders and a large class of procedurally defined simulator models. We present a method for performing efficient MCMC inference in such models when conditioning on observations of the model output. For some models this offers an asymptotically exact inference method where Approximate Bayesian Computation might otherwise be employed. We use the intuition that inference corresponds to integrating a density across the manifold corresponding to the set of inputs consistent with the observed outputs. This motivates the use of a constrained variant of Hamiltonian Monte Carlo which leverages the smooth geometry of the manifold to coherently move between inputs exactly consistent with observations. We validate the method by performing inference tasks in a diverse set of models.


Introduction
Developments in generative modelling with deep architectures such as Variational Auto-Encoders (VAEs) [15,31] and Generative Adversarial Nets (GANs) [12] have made it possible to learn probabilistic models of increasingly complex, high-dimensional data-sets. The generator of these models can be formulated as a differentiable¹ function g ∶ U → X which maps random vector inputs u ∈ U = ℝ^D, drawn from a probability distribution with known density ρ(u), to an implicitly defined distribution over random vector outputs x = g(u) ∈ X = ℝ^N. We will refer here to a model of this form as a differentiable generative model (DGM). ¹ Differentiable here means that the Jacobian ∂g/∂u is defined almost everywhere.
As well as parametric models learnt from data, the DGM framework also encapsulates a wide class of simulator models where g is defined procedurally, e.g. by numerical integration of a system of stochastic differential equations. Here the random inputs u are the series of draws from a random number generator during the simulator's execution. The operations used in many simulations are differentiable, and automatic differentiation provides a computationally efficient framework for calculating the exact derivatives of a differentiable simulator's outputs with respect to its random inputs, given just the code used to define the model.
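As a concrete illustration, a stochastic simulator can be rewritten as a deterministic function of the standard-normal draws it would otherwise consume. The following NumPy sketch uses a toy Euler–Maruyama integration of dx = −a·x dt + s dW; the model, names and parameter values are our own illustrative assumptions, not taken from the paper:

```python
import numpy as np

def generator(u, x0=1.0, a=0.5, s=0.2, dt=0.1):
    """Euler-Maruyama simulation of dx = -a*x dt + s dW, written as a
    deterministic function of the vector u of standard-normal draws
    the simulator would otherwise consume internally."""
    x = x0
    path = []
    for u_t in u:
        # each update is smooth in both x and u_t, so the Jacobian of
        # the output path with respect to u is defined everywhere
        x = x + (-a * x) * dt + s * np.sqrt(dt) * u_t
        path.append(x)
    return np.array(path)

rng = np.random.default_rng(0)
u = rng.standard_normal(50)
x = generator(u)  # fixing u makes the whole simulation deterministic
```

With the randomness exposed as inputs in this way, reverse-mode automatic differentiation (e.g. in Theano) gives exact Jacobians of the simulated outputs with respect to u.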
Often we will be interested in using a generative model to make inferences about the modelled variables given observations related to outputs of the model. For example, given a generative model of images we may wish to infer plausible in-paintings of an image region given knowledge of the surrounding pixel values. Similarly, given a simulator of a physical process and a generator of the parameters of the process which we believe are reasonable a priori, we may wish to infer our posterior beliefs about the parameters under the model given observations of the physical process.
Concretely, we consider the generated output as being partitioned into observed variables y ∈ Y = ℝ^{N_y} and latent variables z ∈ Z = ℝ^{N_z} to be inferred, i.e. x = [y; z] and X = Y × Z. Likewise we partition the generator such that y = g_y(u) and z = g_z(u). Inference is then the task of computing expectations of z conditioned on an observed y.
A common special case is where z is generated from a subset u_z of the random inputs, and y is then generated as a function of z and the remaining random inputs u_y; for example, when z is sampled from a prior and y sampled from a likelihood given z. We will refer to this special case as a directed model, in reference to the fact that it has a natural interpretation as a directed graphical model, as illustrated in Figure 1.
For many DGMs we cannot explicitly compute the joint density on the generator outputs p_x[x] = p_{y,z}[y, z], and so neither the conditional density p_{z|y}[z | y] we wish to compute expectations with respect to. In directed models this usually corresponds to not being able to evaluate the likelihood p_{y|z}[y | z]. The lack of a closed-form density impedes the direct use of approximate inference methods such as variational inference and Markov chain Monte Carlo (MCMC).

Approximate Bayesian Computation
A lack of explicit likelihoods is the motivation for likelihood-free inference methods such as Approximate Bayesian Computation (ABC) [4,20]. In ABC the simulated observed outputs y are decoupled from the observed data ȳ by a noise model, or kernel, k_ε(ȳ | y) with tolerance parameter ε, e.g. a Gaussian kernel k_ε(ȳ | y) = N(ȳ; y, ε²I). The ABC likelihood is then defined as p_ε[ȳ | z] = ∫_Y k_ε(ȳ | y) p_{y|z}[y | z] dy, and the ABC posterior is likewise defined as p_ε[z | ȳ] ∝ p_ε[ȳ | z] p_z[z] (2), and we can express expectations of functions of the latent variables with respect to this approximate posterior as E_ε[f(z) | ȳ] = ∫_Z f(z) p_ε[z | ȳ] dz (3). Various Monte Carlo inference schemes can be used to estimate this expectation. The simplest is to generate a set of independent samples {(y⁽ˢ⁾, z⁽ˢ⁾)}_{s=1}^S from the joint p_{y,z}[y, z]² and use an importance sampling estimator with weights proportional to k_ε(ȳ | y⁽ˢ⁾). ² ABC is usually applied to directed models, hence this is usually described as generating z from a prior then simulating y given z; more generally we can just sample from the joint.
In the case of a uniform ball kernel this corresponds to the standard ABC reject algorithm, with expectations being estimated as averages over the latent variable samples for which the corresponding simulated outputs are within a (Euclidean) distance ε of the observations. Alternatively an MCMC scheme can be used to estimate the ABC posterior expectation in (3) [21]. A Markov chain with stationary distribution p_ε[z, y | ȳ] ∝ k_ε(ȳ | y) p_{y|z}[y | z] p_z[z] is constructed by proposing a new state (z′, y′) given the current state (z, y): a z′ ∼ q(⋅ | z) is sampled from some perturbative proposal distribution on Z and a new y′ ∼ p_{y|z}[⋅ | z′] is then generated, the proposal being accepted with probability min{1, [k_ε(ȳ | y′) p_z[z′] q(z | z′)] / [k_ε(ȳ | y) p_z[z] q(z′ | z)]}, the overall transition operator leaving the stationary distribution invariant. The samples of the chain state can then be used to compute consistent estimators of (3). This scheme relies on being able to evaluate p_z[z] and so is only applicable to directed models where the generator g_z is of a tractable form such that p_z[z] is known.
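The ABC reject scheme for a directed model can be sketched as follows. The toy model (z from a standard-normal prior, y | z Gaussian) and all the names are hypothetical, chosen only to make the uniform-ball acceptance step concrete:

```python
import numpy as np

def abc_reject(simulate, prior_sample, y_obs, eps, n_sim, rng):
    """Uniform-ball ABC rejection: keep latent samples z whose simulated
    outputs y lie within Euclidean distance eps of the observations."""
    accepted = []
    for _ in range(n_sim):
        z = prior_sample(rng)            # z ~ prior
        y = simulate(z, rng)             # y ~ likelihood given z
        if np.linalg.norm(y - y_obs) <= eps:
            accepted.append(z)
    return np.array(accepted)

# toy directed model: z ~ N(0, 1), y | z ~ N(z, 0.1^2)
rng = np.random.default_rng(1)
samples = abc_reject(
    simulate=lambda z, rng: z + 0.1 * rng.standard_normal(),
    prior_sample=lambda rng: rng.standard_normal(),
    y_obs=0.5, eps=0.3, n_sim=2000, rng=rng)
```

As ε shrinks toward zero the acceptance probability vanishes, which is exactly the failure mode discussed below.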
In the limit ε → 0, valid ABC kernels k_ε(ȳ | y) collapse to a Dirac delta distribution δ(ȳ − y), and so the ABC posterior in (2) converges to the exact posterior p_{z|y}[z | ȳ]; similarly the ABC posterior expectation in (3) converges to the exact posterior expectation. For both the ABC reject and ABC MCMC schemes just described, simulated observations y are independently generated from the conditional p_{y|z}[y | z] given the current latent variables z. The observed values ȳ are a zero-measure set in Y under a non-degenerate p_{y|z}[y | z], and so as ε → 0 the probability of accepting a sample / proposed move becomes zero. Applying ABC with a non-zero ε can therefore be seen as a practically motivated relaxation of the constraint that true and simulated data exactly match, hence the 'approximate' in Approximate Bayesian Computation.
Alternatively the kernel can be given a modelling interpretation, as representing uncertainty introduced by noise in the observations or mismatch between the unknown generative process by which the observed data was produced and the generative model [30,37]. In practice however the kernel and choice of ε seem more often to be motivated on computational efficiency grounds [32].

Inference in the input space
Using the Law of the Unconscious Statistician we can reparametrise (3) in terms of the random inputs u to the generative model, replacing y and z by g_y(u) and g_z(u) and integrating against the base density ρ(u). This immediately indicates we can estimate ABC expectations by applying any MCMC method of choice in the input space, to construct a chain with stationary density π_ε(u) ∝ k_ε(ȳ | g_y(u)) ρ(u). We can however again consider the limit ε → 0, as derived in the output space in (7). We have that the kernel term k_ε(ȳ; g_y(u)) → δ(ȳ − g_y(u)), and so E[f(z) | ȳ] ∝ ∫_U f(g_z(u)) δ(ȳ − g_y(u)) ρ(u) du (9). The Dirac delta term restricts the integral across the input space U to an embedded, (D − N_y)-dimensional, implicitly defined manifold ℳ = {u ∈ U ∶ g_y(u) − ȳ = 0}. By applying the Co-Area Formula [11, §3.2.12] the integral with respect to the Lebesgue measure across U in (9) can be rewritten as an integral across the embedded manifold ℳ with respect to the Hausdorff measure for the manifold (10). Therefore if we generate a set of input MCMC samples {u⁽ˢ⁾}_{s=1}^S restricted to ℳ and with invariant density

  π(u) ∝ ρ(u) |J_{g_y}(u) J_{g_y}(u)ᵀ|^{−1/2}  (11)

with respect to the Hausdorff measure for the manifold, where J_{g_y} = ∂g_y/∂u, then (1/S) Σ_s f(g_z(u⁽ˢ⁾)) is a consistent estimator of the posterior expectation. Intuitively the determinant term in (11) adjusts for the change in the infinitesimal 'thickness' (extent in directions orthogonal to the tangent space) of the manifold when mapping through the generator function g. The result (10) is given in a slightly different form in [9]. A derivation is provided in Appendix A of the Supplementary Material.
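Given the constraint Jacobian, the density (11) can be evaluated directly. Below is a minimal NumPy sketch for a standard-normal base density, using a toy generator with an analytic Jacobian; both the toy generator and all names are our own illustrative assumptions:

```python
import numpy as np

def log_target(u, jac_g_y):
    """Unnormalised log of the density (11) on the manifold g_y(u) = y_obs:
    log rho(u) - 0.5 * log |J J^T| for a standard-normal base density."""
    J = jac_g_y(u)                           # N_y x D constraint Jacobian
    log_rho = -0.5 * u @ u                   # log N(u; 0, I) up to a constant
    _, logdet = np.linalg.slogdet(J @ J.T)   # Gram matrix log-determinant
    return log_rho - 0.5 * logdet

# toy constraint g_y(u) = u[0] + u[1]**2 with Jacobian [1, 2*u[1], 0]
jac = lambda u: np.array([[1.0, 2.0 * u[1], 0.0]])
u = np.array([0.3, -1.2, 0.7])
lp = log_target(u, jac)
```

In practice the Jacobian would come from automatic differentiation of the generator rather than a hand-derived formula.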
A general framework for performing asymptotically exact inference in differentiable generative models is therefore to define MCMC updates which explore the target density (11) on the constraint manifold ℳ. We propose here to use a method which simulates the dynamics of a constrained mechanical system.

Constrained Hamiltonian Monte Carlo
In Hamiltonian Monte Carlo (HMC) [10,27] the vector variable of interest u is augmented with a momentum variable p ∈ ℝ^D. The momentum is taken to be independently Gaussian distributed with zero mean and covariance M, often called the mass matrix. The new joint target density is then proportional to exp[−H(u, p)], where H(u, p) = −log π(u) + ½ pᵀ M⁻¹ p is termed the Hamiltonian.
The canonical Hamiltonian dynamic is described by the system of ordinary differential equations du/dt = ∂H/∂p = M⁻¹p and dp/dt = −∂H/∂u = ∇ log π(u). This dynamic is time-reversible, measure-preserving and exactly conserves the Hamiltonian. Symplectic integrators allow approximate integration of the Hamiltonian flow while maintaining the time-reversibility and measure-preservation properties. Subject to stability bounds on the time-step, such integrators will exactly conserve some 'nearby' Hamiltonian, and so the change in the Hamiltonian will tend to remain small even over long simulated trajectories [17].
These properties make simulated Hamiltonian dynamics an ideal proposal mechanism for a Metropolis MCMC method. The Metropolis accept probability for a proposal (u′, p′), generated by simulating the dynamic some number of time steps forward from (u, p) with a symplectic integrator and then negating the momentum, is simply min{1, exp(H(u, p) − H(u′, p′))}. Typically the change in the Hamiltonian will be small and so the probability of acceptance high. To ensure ergodicity, dynamic moves can be interspersed with updates independently resampling the momentum from N(0, M).
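A standard (unconstrained) HMC transition with identity mass matrix can be sketched as below. This is the generic textbook leapfrog scheme, not the paper's constrained algorithm, and all names and settings are our own:

```python
import numpy as np

def hmc_step(q, neg_log_p, grad_neg_log_p, step, n_steps, rng):
    """One HMC transition: resample the momentum, leapfrog-integrate the
    dynamic, then Metropolis accept/reject on the change in H."""
    p = rng.standard_normal(q.size)
    H0 = neg_log_p(q) + 0.5 * p @ p
    q_new = q.copy()
    p_new = p - 0.5 * step * grad_neg_log_p(q_new)   # first half step
    for i in range(n_steps):
        q_new = q_new + step * p_new                 # full position step
        g = grad_neg_log_p(q_new)
        if i < n_steps - 1:
            p_new = p_new - step * g                 # full momentum step
    p_new = p_new - 0.5 * step * g                   # final half step
    H1 = neg_log_p(q_new) + 0.5 * p_new @ p_new
    # accept with probability exp(H0 - H1); H is near-conserved by the
    # symplectic integrator so this is usually close to one
    return q_new if np.log(rng.random()) < H0 - H1 else q

# sample a 1D standard normal as a sanity check
rng = np.random.default_rng(2)
q = np.zeros(1)
samples = []
for _ in range(2000):
    q = hmc_step(q, lambda q: 0.5 * q @ q, lambda q: q, 0.5, 5, rng)
    samples.append(q[0])
samples = np.array(samples[500:])
```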
In our case the system is subject to a constraint of the form c(u) = g_y(u) − ȳ = 0. By introducing a Lagrange multiplier for each of the constraints, the Hamiltonian for the constrained system can be written as H(u, p) = −log π(u) + ½ pᵀ M⁻¹ p + λᵀ c(u), with a corresponding constrained Hamiltonian dynamic du/dt = M⁻¹p, dp/dt = ∇ log π(u) − J_c(u)ᵀ λ, subject to the conditions c(u) = 0 and J_c(u) M⁻¹ p = 0. A popular numerical integrator for simulating constrained Hamiltonian dynamics is RATTLE [2] (and the algebraically equivalent SHAKE [33] scheme). This is a natural generalisation of the Störmer–Verlet (leapfrog) integrator typically used in standard HMC, with additional projection steps in which the Lagrange multipliers are solved for to satisfy the two conditions above. RATTLE and SHAKE maintain the properties of being time-reversible, measure-preserving and symplectic [18].
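A minimal sketch of one RATTLE step is given below for a toy constraint (the unit circle, with a flat target density so only the constraint forces act). The simple Newton solve for the multipliers and all function names are our own illustrative choices, not the paper's Algorithm 1:

```python
import numpy as np

def rattle_step(q, p, h, c, jac, grad_U, tol=1e-12, max_iter=50):
    """One RATTLE step for Hamiltonian dynamics subject to c(q) = 0."""
    J = jac(q)
    p_half = p - 0.5 * h * grad_U(q)
    lam = np.zeros(c(q).size)
    # position projection: solve for multipliers so that c(q_new) = 0
    for _ in range(max_iter):
        q_new = q + h * (p_half - J.T @ lam)
        r = c(q_new)
        if np.max(np.abs(r)) < tol:
            break
        lam = lam + np.linalg.solve(h * jac(q_new) @ J.T, r)  # Newton
    p_half = p_half - J.T @ lam
    q_new = q + h * p_half
    # momentum projection on to the tangent space: J(q_new) p_new = 0
    Jn = jac(q_new)
    p_new = p_half - 0.5 * h * grad_U(q_new)
    p_new = p_new - Jn.T @ np.linalg.solve(Jn @ Jn.T, Jn @ p_new)
    return q_new, p_new

# unit-circle constraint c(q) = |q|^2 - 1, flat potential U = 0
c = lambda q: np.array([q @ q - 1.0])
jac = lambda q: 2.0 * q[None, :]
grad_U = lambda q: np.zeros_like(q)

q, p = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # p tangent to circle
for _ in range(20):
    q, p = rattle_step(q, p, 0.1, c, jac, grad_U)
```

After each step the state remains exactly on the constraint manifold (to solver tolerance) and the momentum exactly in its tangent space.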
The use of constrained dynamics in HMC has been proposed several times. In the molecular dynamics literature, both [14] and [19] suggest using a simulated constrained dynamic within an HMC framework to estimate free-energy profiles.
Most relevantly here, [6] proposes using a constrained HMC variant to perform inference in distributions defined on implicitly defined embedded non-linear manifolds. This work gives sufficient conditions on π, ℳ and c for the scheme to satisfy detailed balance and be ergodic: that π is C² continuous, ℳ is a connected smooth and differentiable manifold, and J_c has full row-rank everywhere.

Method
Our constrained HMC implementation is shown in Algorithm 1. We use a generalisation of the RATTLE scheme to simulate the dynamic. The inner updates of the state which solve for the geodesic motion on the constraint manifold are split into multiple smaller steps, which can be considered a special case of the scheme described in [16]. This allows more flexibility: an appropriately small step size can be chosen to ensure convergence of the iterative solution of the equations projecting on to the constraint manifold, while still allowing a more efficient, larger step size for the updates to the momentum due to the negative log-density gradient.
We have assumed M = I here; other mass matrix choices can be equivalently implemented by adding an initial linear transformation stage in the generator.
Each inner geodesic time-step involves making an unconstrained update u → ũ and then projecting ũ back on to ℳ by solving for the multipliers λ which satisfy c(ũ − J_c(u)ᵀ λ) = 0. This is performed in the function PROJECTPOSITION in Algorithm 1.
Here we use a quasi-Newton method for solving the system of equations in the projection step. The true Newton update would require recalculating the Jacobian and solving a dense linear system within the optimisation loop. Instead we use a symmetric quasi-Newton update as proposed in [3], with the Jacobian Gram matrix J_c J_cᵀ evaluated at the previous state used to condition the moves. This matrix is positive-definite, and a Cholesky decomposition can be calculated outside the optimisation loop, allowing cheaper quadratic-cost solves within the loop. In the rare cases where the quasi-Newton iteration fails we fall back to a MINPACK [24] implementation of the robust Powell's Hybrid method [29].
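The projection step can be sketched as follows. The Cholesky factor of the Gram matrix is computed once outside the loop and reused for every solve; we use `np.linalg.solve` on the triangular factors for brevity where a dedicated triangular solve would be used in practice, and the toy constraint and all names are illustrative assumptions:

```python
import numpy as np

def project_position(u_tilde, J_prev, constr, tol=1e-9, max_iter=100):
    """Quasi-Newton projection of u_tilde on to c(u) = 0, conditioning
    every move on the Jacobian Gram matrix at the previous state."""
    gram_chol = np.linalg.cholesky(J_prev @ J_prev.T)  # once, outside loop
    u = u_tilde.copy()
    for _ in range(max_iter):
        r = constr(u)
        if np.max(np.abs(r)) < tol:
            return u
        # solve (J_prev J_prev^T) lam = r via the cached factorisation
        lam = np.linalg.solve(gram_chol.T, np.linalg.solve(gram_chol, r))
        u = u - J_prev.T @ lam
    raise RuntimeError("projection did not converge")

# toy constraint c(u) = u[0] + u[1]^2 - 1 with Jacobian [1, 2*u[1]]
constr = lambda u: np.array([u[0] + u[1] ** 2 - 1.0])
u_tilde = np.array([0.5, 0.8])
J_prev = np.array([[1.0, 2.0 * u_tilde[1]]])
u_proj = project_position(u_tilde, J_prev, constr)
```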
For larger systems, the Cholesky decomposition of the constraint Jacobian Gram matrix (line 36) will become a dominant cost, generally scaling cubically with N_y. The element-wise or autoregressive noise structure of many models however allows a significantly reduced, quadratically scaling computational cost, as explained in Appendix C of the Supplementary Material.
The momentum updates in the SIMULATEDYNAMIC routine require evaluating the gradient of the logarithm of the target density (11). This can be achieved by using automatic differentiation to calculate the gradient directly from the expression in (11); however, both the log-density and its gradient can be more efficiently calculated by reusing the Cholesky decomposition of the constraint Jacobian Gram matrix computed in line 36. Details are given in Appendix B.
In the PROJECTPOSITION routine, convergence is signalled when the element-wise maximum absolute value of the constraint function is below some tolerance δ. This acts analogously to the ε parameter in ABC methods; however, here we typically set this parameter to some multiple of machine precision, and so the approximation introduced is comparable to that otherwise incurred by using non-exact arithmetic.
A final implementation detail is the requirement to find some initial u satisfying c(u) = 0. In some cases structure in the generator function, such as that described in Appendix C, can be leveraged to come up with an efficient non-iterative scheme for finding a constraint-satisfying u. For general generators, we can choose a subset of the inputs (or linear projections of the inputs) of dimensionality N_y and plug the resulting system of equations into a black-box solver.
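This initialisation can be sketched as sampling all inputs from the base density and then solving for a subset of them. In the toy below a scalar-output generator and a one-dimensional Newton iteration stand in for the black-box solver; everything here (model, names, tolerances) is an illustrative assumption:

```python
import numpy as np

def find_initial_input(g_y, y_obs, n_inputs, solve_idx, rng,
                       n_tries=20, tol=1e-8):
    """Sample all inputs from the base density, then adjust the single
    input u[solve_idx] by Newton iteration until g_y(u) = y_obs."""
    for _ in range(n_tries):
        u = rng.standard_normal(n_inputs)
        for _ in range(50):
            r = g_y(u) - y_obs
            if abs(r) < tol:
                return u
            eps = 1e-6                  # finite-difference derivative
            u_eps = u.copy()
            u_eps[solve_idx] += eps
            d = (g_y(u_eps) - g_y(u)) / eps
            if d == 0.0:
                break                   # restart from a fresh sample
            u[solve_idx] -= r / d
    raise RuntimeError("no constraint-satisfying input found")

# toy scalar generator, solving for u[0]
g_y = lambda u: u[0] + 0.1 * u[1] ** 2 + 0.5 * u[2]
rng = np.random.default_rng(3)
u0 = find_initial_input(g_y, 0.7, 3, 0, rng)
```

For vector-valued constraints a general root-finder (e.g. Powell's hybrid method, as used for the projection fall-back) would replace the scalar Newton loop.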

Related work
Closely related is the constrained HMC method of [6]. This demonstrates the validity of the constrained HMC framework theoretically and experimentally, and we do not claim any original contribution in this respect. The focus in [6] is on performing inference in distributions inherently defined on a fixed non-Euclidean manifold, such as the unit sphere or the space of orthogonal matrices.
Our work builds on [6] by highlighting that conditioning on the output of a generator imposes a constraint on its inputs, and so defines a density in input space restricted to some manifold. Unlike the cases considered in [6], our constraints are therefore data-driven, with the target density on the manifold implicitly defined by a generator function and base density. [7] also considers applying an HMC scheme to sample from non-linear manifolds embedded in a Euclidean space. Similarly to [6], however, the motivation is performing inference with respect to distributions explicitly defined on a manifold, as arise in directional statistics.

The manifolds considered in [7] admit an analytic solution for the geodesic flow on the manifold. Our use of constrained Hamiltonian dynamics, and in particular the geodesic integration scheme of [16], can be considered an extension to cases where an exact geodesic solution is not available. Instead the geodesic flow is approximately simulated, while still maintaining the measure-preservation and reversibility properties required for validity of the overall HMC scheme.
Hamiltonian ABC [22] also proposes applying HMC to perform inference in simulator models as considered here.
An ABC set-up is used with a Gaussian synthetic-likelihood formed by estimating moments from simulated data.
Rather than using automatic differentiation to exactly calculate gradients of the generator function, Hamiltonian ABC uses a stochastic gradient estimator, building on previous work on using stochastic gradients within HMC [8,36]. It has been suggested however that the use of stochastic gradients can destroy the favourable properties of Hamiltonian dynamics which enable coherent exploration of high-dimensional state spaces [5].
In Hamiltonian ABC it is also observed that representing the generative model as a deterministic function, by fixing the random inputs to the generator, is a useful method for improving exploration of the state space. This is achieved by including the seed of the pseudo-random number generator in the chain state rather than the set of random inputs.
Also related is Optimisation Monte Carlo [23]. The authors propose using an optimiser to find parameters of a simulator model consistent with observed data (to within some tolerance ε) given fixed random inputs sampled independently. The optimisation is not measure-preserving, and so the Jacobian of the map is approximated with finite differences to weight the samples. Our method also uses an optimiser to find inputs consistent with the observations; however, by using a measure-preserving dynamic we avoid having to reweight samples, which can scale poorly with dimensionality.
Our method also differs in treating all inputs to the generator equivalently; while the Optimisation Monte Carlo authors similarly identify simulator models as deterministic functions, they distinguish between parameters and random inputs, optimising the first and independently sampling the latter. This can lead to random inputs being sampled for which no parameters can be found consistent with the observations (even with a 'soft' within-ε constraint). Although optimisation failure is also potentially an issue for our method, as we jointly optimise all inputs and are approximating an exact continuous-time constraint-satisfying dynamic, we found this occurred rarely in practice if an appropriate step size is chosen. Our method can also be applied in cases where the parameter dimension is greater than the dimension of the observations, unlike Optimisation Monte Carlo.

Experiments
To illustrate the general applicability of our method we performed inference tasks in three diverse settings: parameter inference in a stochastic Lotka-Volterra predator-prey model simulation, 3D human pose and camera parameter inference given 2D joint position information, and finally in-painting of missing regions of digit images using a generative model trained on MNIST. In all three experiments Theano [35] was used to specify the generator function and calculate the required derivatives. All experiments were run on an Intel Core i5-2400 quad-core CPU. Python code for the experiments is available at https://git.io/dgm.

Lotka-Volterra parameter inference
As a first demonstration we considered a stochastic, continuous-state variant of the Lotka-Volterra model, a common example problem for ABC methods, e.g. [23]. In particular we consider parameter inference given a simulated solution of the stochastic differential equations dx₁/dt = θ₁x₁ − θ₂x₁x₂ + n₁ and dx₂/dt = −θ₃x₂ + θ₄x₁x₂ + n₂, where x₁ represents the prey population, x₂ the predator population, {θᵢ}ᵢ₌₁⁴ the system parameters, and n₁ and n₂ zero-mean, unit-variance white-noise processes.
We compared our method to various ABC approaches (§2) using a uniform ball kernel with radius ε. ABC rejection failed catastrophically, with no acceptances in 10⁶ samples even with a large ε = 1000. ABC MCMC with a Gaussian proposal distribution also performed very poorly, with the dynamic having zero acceptances over multiple runs of 10⁵ updates for ε = 100, and getting stuck at points in parameter space over many updates for a larger ε = 1000, even with small proposal steps. Based on a method proposed in the pseudo-marginal literature [26], we tried using alternating elliptical slice sampling updates of the random inputs used to generate the parameters and the remaining random inputs used to generate the simulated observations given the parameters. The slice sampling updates locally adapt the size of steps made to ensure a move can always be made. Using this method we were able to obtain reasonable convergence over long runs for both ε = 100 and ε = 10.
The results are summarised in Figure 2. Figure 2a shows the simulated data used as observations and ABC sample trajectories for ε = 10 and ε = 100. Though both samples follow the general trends of the observed data, there are large discrepancies, particularly for ε = 100. Our method in contrast samples parameters generating trajectories exactly matching the observations at all points. Figure 2b shows the marginal histograms for the parameters. The inferred posteriors on the parameters are significantly more tightly distributed about the true values used to generate the observations for our approach and the ε = 10 case than for ε = 100; even in the ε = 10 case, however, it can be seen that there are spurious peaks in the distributions for θ₃ and θ₄.
Figure 2c shows the relative sampling efficiency of our approach against the ABC methods, as measured by the effective sample sizes (ESS) (computed with R-CODA [28]) normalised by run time, averaged across 10 sampling runs for each method. Despite the significantly higher per-update cost of our method, the coherent movement about the state space afforded by the Hamiltonian dynamic gave significantly better performance, even over the very approximate ε = 100 case.

Human pose and camera model inference
For our next experiment we considered inferring a 3D human pose and camera model from 2D projection(s) of joint positions. We used a 19-joint skeleton model, with a prior density over poses parametrised by 47 local joint angles, learnt from the PosePrior motion capture data-set [1] with a VAE with a 30-dimensional hidden representation. For the bone lengths, a log-normal model was fitted to data from the ANSUR anthropometric data-set [13], with 13 independent lengths being specified due to symmetry. A simple pin-hole projective camera model with 3 position parameters and fixed focal length was used. A log-normal prior was placed on the depth co-ordinate to enforce positivity, with normal priors on the other two co-ordinates. A small amount of Gaussian noise was added to the projected positions to give the observed 2D joint positions. This ensured the constraint Jacobian was full row-rank everywhere, and gave a known hierarchical joint density over all the variables, allowing comparison with HMC as a baseline.
We first considered binocular pose estimation, with the 3D scene information inferred given 2D projections from two cameras with a known offset between them. In this stereo vision case, the disparity between the projections gives information about the depth direction, and so we would expect the posterior distribution on the 3D pose to be tightly distributed around the true values used to generate the observations. We compared our constrained HMC method to running standard HMC on the conditional density of the pose, bone-length and camera variables given the observed projections. Figure 3a shows the root mean squared error (RMSE) between the posterior mean estimate of the 3D joint positions and the true positions used to generate the observations, as the number of samples included in the estimate increases, for three test scenes. For both methods the horizontal axis has been scaled by run time.
The constrained HMC method (blue curves) tends to give significantly more accurate estimates, particularly over longer periods. Visually inspecting the sampled poses and individual run traces (not shown), it seems that the HMC runs often got stuck in local modes corresponding to a subset of joints being 'incorrectly' positioned while still broadly matching the (noisy) projections. The complex dependencies of the joint positions on the angle parameters mean the dynamic struggles to find an update which brings the 'incorrect' joints closer to their true positions without moving other joints out of line. The constrained HMC method seemed to be less susceptible to this issue.
We also considered inferring 3D scene information from a single 2D projection. Monocular projection is inherently information-destroying, with significant uncertainty about the true pose and camera parameters which generated the observations. Figure 3b shows pairs of orthographic projections of 3D poses: the left-most column is the pose used to generate the projection conditioned on, and the right three columns are poses sampled using constrained HMC consistent with the observations. The top row shows front views, corresponding to the camera view though with an orthographic rather than perspective projection; the bottom row shows side views, with the depth from the camera on the remaining axis.
The dynamic is able to move between a range of plausible poses consistent with the observations while reflecting the inherent depth ambiguity from the monocular projection.

MNIST in-painting
As a final task we considered inferring in-paintings for a missing region of a digit image given knowledge of the remaining pixel values. A Gaussian VAE trained on MNIST was used as the generative model, with a 50-dimensional hidden code. We compared our method to running HMC in the known conditional density on the hidden code given the observed pixels (the missing pixels can then be directly sampled from their Gaussian conditional given the code). Example samples are shown in Figure 4. In this case the constrained and standard HMC approaches appear to perform similarly, with both able to find a range of plausible in-paintings given the observed pixels. Without cost adjustment the standard HMC samples show greater correlation between subsequent updates; for a fairer comparison, the samples shown were thinned to account for the approximately 40× larger run time per constrained HMC sample. Although the constrained dynamic does not improve efficiency here, neither does it seem to hurt it.

Discussion
We have presented a generally applicable framework for performing inference in differentiable generative models. Though simulating the constrained Hamiltonian dynamic is computationally costly, the resulting coherent exploration of the state space can lead to significantly improved sampling efficiency over alternative methods.
Further, our approach allows asymptotically exact inference in differentiable generative models where ABC methods might otherwise be used. We suggest an approach for dealing with two of the key issues in ABC methods: enabling inference in continuous spaces as ε collapses to zero, and allowing efficient inference when conditioning on high-dimensional observations, without the need for dimensionality reduction with summary statistics (and the resulting task of choosing appropriate summary statistics). As well as being of practical importance in itself, this approach should be useful in providing 'ground truth' inferences in more complex models to assess the effect of the approximations used in ABC methods on the quality of the inferences.
In molecular simulations, constrained dynamics are often used to improve efficiency. Intra-molecular motion is removed by fixing bond lengths; this allows a larger time-step to be used due to the removal of high-frequency bond oscillations [16]. An analogous effect is present when performing inference in an ABC setting with a kernel 'soft constraint' enforcing consistency between the inputs and observed outputs. As ε → 0, the scales over which the input density changes value in directions orthogonal to the constraint manifold and along directions tangential to the manifold increasingly differ. To stay within the soft constraint a very small step size needs to be used. Using a constrained dynamic decouples the motion on the constraint manifold from the steps projecting on to it, allowing more efficient, larger steps to be used for moving on the manifold.
A limitation of our method is the requirement that the generator be differentiable. This prevents applying our approach to generative models which use discontinuous operations or discrete random inputs. In some cases, conditioned on fixed values of the discrete random inputs the generator may still be differentiable, and so the proposed method can be used to update the continuous random inputs given values of the discrete inputs. This would need to be alternated with updates to the discrete inputs, which would require devising methods for updating the discrete inputs to the generator while constraining its output to exactly match observations.
A common technique in ABC methods is to define a kernel or distance measure in terms of summary statistics of the observed data [20]. This is necessary in standard ABC approaches to cope with the 'curse of dimensionality': the probability of accepting samples / moves for a fixed tolerance ε decreases exponentially as the dimensionality of the observations conditioned on increases. Although, as already noted, the proposed method is better able to cope with high observation dimensions, if appropriate informative statistics are available (e.g. based on expert knowledge) and these are differentiable functions of the generator outputs, they can be easily integrated into the proposed method by absorbing the statistic computation into the definition of g_y.

B Evaluating the target density and gradient
The base density ρ(u) will typically be of a simple form, e.g. standard Gaussian, therefore we can evaluate the logarithm of the target density up to an additive constant at a marginal computational cost that scales linearly with dimensionality. For the Gram matrix term, as part of the constrained dynamics updates the lower-triangular Cholesky factor L of J_c J_cᵀ is already calculated, and using basic properties of the matrix determinant log |J_c J_cᵀ| = 2 Σᵢ log Lᵢᵢ.
For the gradient we can use reverse-mode automatic differentiation to calculate the gradient of (38) with respect to u. This requires propagating partial derivatives through the Cholesky decomposition [25]; efficient implementations for this are present in many automatic differentiation frameworks, including Theano.
Alternatively, the gradient of (34) can be explicitly derived. The gradient of log ρ(u) will generally be trivial, and for the Gram matrix log-determinant term we have ∂/∂u_k {−½ log |J_c J_cᵀ|} = −tr[ (J_c J_cᵀ)⁻¹ J_c (∂J_cᵀ/∂u_k) ]. The matrix inside the square brackets is independent of k and can be computed once, by solving the corresponding system of equations by forward and backward substitution. The matrix of second partial derivatives ∂J_c/∂u_k can either be manually derived for the specific generator function or calculated using automatic differentiation. The trace of the matrix product is then just the sum over all indices of the element-wise product of the pair of matrices.

C Exploiting structure in the generator
Often the generator inputs can be split into two distinct groups: global inputs u_g which affect all of the observed outputs (e.g. inputs which map to model parameters), and local 'noise' inputs u_n, each element of which affects only a subset of the outputs.
The decomposition of the Gram matrix J J ᵀ = J_g J_gᵀ + J_n J_nᵀ can then be computed by low-rank Cholesky updates of the triangular / diagonal matrix J_n with each of the columns of J_g. As dim(u_g) = G is often significantly less than, and independent of, the number of outputs N_y conditioned on, the resulting O(G N_y²) cost of the Cholesky updates is a significant improvement over the original O(N_y³).
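A rank-one Cholesky update has the following standard form; applying one such O(N_y²) update per column of the global-input Jacobian gives the quoted O(G N_y²) total cost. This is our own generic implementation for illustration, not the paper's code:

```python
import numpy as np

def chol_rank1_update(L, v):
    """Return the Cholesky factor of L @ L.T + v v^T in O(n^2) operations
    rather than the O(n^3) cost of refactorising from scratch."""
    L, v = L.copy(), v.copy()
    n = v.size
    for k in range(n):
        # Givens-style rotation eliminating v[k] into column k of L
        r = np.hypot(L[k, k], v[k])
        c, s = r / L[k, k], v[k] / L[k, k]
        L[k, k] = r
        L[k+1:, k] = (L[k+1:, k] + s * v[k+1:]) / c
        v[k+1:] = c * v[k+1:] - s * L[k+1:, k]
    return L

rng = np.random.default_rng(4)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)      # a positive-definite Gram-like matrix
L = np.linalg.cholesky(A)
v = rng.standard_normal(6)
L_up = chol_rank1_update(L, v)   # factor of A + v v^T without refactorising
```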
Many learnt differentiable generative models have an element-wise noise structure, including the Gaussian VAE. The autoregressive noise structure commonly occurs in stochastic dynamical simulations where the outputs are a time sequence of states, with noise being added each time-step, for example the Lotka-Volterra model considered in the experiments in Section 7.


Figure 1: (a) and (b) Stochastic computation graphs (SCGs) [34] visualising the models considered in this paper. Circular nodes represent stochastic variables sampled from a conditional distribution given their parent nodes, or from a marginal for root nodes. Square nodes represent variables that are deterministic functions of their parents. Shaded nodes are observed. (a) SCG of the general case in which y and z are jointly generated from u. (b) SCG of the directed case in which we first generate z from u_z then generate y from z and u_y. (c) Undirected graphical model corresponding to (a) when integrating out u. (d) Directed graphical model which is a natural representation of (b) when integrating out u_y and u_z.

Figure 2: Lotka-Volterra results. (a) Observed predator-prey populations (solid) and ABC sample trajectories with ε = 10 (dashed) and ε = 100 (dot-dashed). (b) Marginal empirical histograms for the (logarithm of the) four parameters (columns) from constrained HMC samples (top) and ABC samples with ε = 10 (middle) and ε = 100 (bottom). Horizontal axes shared across columns. Red arrows indicate true parameter values. (c) Mean ESS normalised by compute time for each of the four parameters for ABC with ε = 10 (red), ε = 100 (green) and our method (blue). Error bars show ±3 standard errors of the mean.
(a) Constrained HMC samples. (b) HMC in hierarchical model samples.

Figure 4: MNIST in-painting samples. The top black-on-white quarter of each image is the fixed observed region and the remaining white-on-black region the proposed in-painting. In left-right scan order, the images in (a) are consecutive samples from a run; in (b) the images are every 40th sample, to account for the quicker updates.