Efficient acquisition rules for model-based approximate Bayesian computation

Approximate Bayesian computation (ABC) is a method for Bayesian inference when the likelihood is unavailable but simulating from the model is possible. However, many ABC algorithms require a large number of simulations, which can be costly. To reduce the computational cost, Bayesian optimisation (BO) and surrogate models such as Gaussian processes have been proposed. Bayesian optimisation enables one to intelligently decide where to evaluate the model next but common BO strategies are not designed for the goal of estimating the posterior distribution. Our paper addresses this gap in the literature. We propose to compute the uncertainty in the ABC posterior density, which is due to a lack of simulations to estimate this quantity accurately, and define a loss function that measures this uncertainty. We then propose to select the next evaluation location to minimise the expected loss. Experiments show that the proposed method often produces the most accurate approximations as compared to common BO strategies.


Introduction
We consider the problem of Bayesian inference of some unknown parameter θ ∈ Θ ⊂ R p of a simulation model. Such models are typically not amenable to any analytical treatment but they can be simulated with any parameter θ ∈ Θ to produce data x θ ∈ X . Simulation models are also called simulator-based or implicit models [Diggle and Gratton, 1984]. Our prior knowledge about the unknown parameter θ is represented by the prior probability density π(θ) and the goal of the analysis is to update our knowledge about the parameters θ after we have observed data x obs ∈ X .
However, if one can only simulate from the model, that is, draw samples x θ ∼ π(· | θ), and not evaluate the likelihood function π(x obs | θ), the standard Bayesian approach cannot be used. Approximate Bayesian computation (ABC) replaces likelihood evaluations with model simulations, see e.g. Marin et al. [2012], Turner and Van Zandt [2012], Lintusaari et al. [2017a] for an overview. Such approaches are also called likelihood-free inference in the literature. The main idea of the basic ABC algorithm is to draw a parameter value from the prior distribution, simulate a data set with the given parameter value, and accept the value as a draw from the (approximate) posterior if the discrepancy between the simulated and observed data is small enough. This algorithm produces samples from the approximate posterior distribution π ABC (θ | x obs ) ∝ π(θ) π ε (x obs | x)π(x | θ) dx, (2) where π ε (x obs | x) ∝ 1 ∆(x obs ,x)≤ε , although other choices of π ε are also possible [Wilkinson, 2013]. The function ∆ : X × X → R + is the discrepancy that tells how different the simulated and observed data sets are, and it is often formed by combining a set of summary statistics. Choosing the discrepancy and the corresponding summary are active research topics [Blum et al., 2013, Fearnhead and Prangle, 2012, but in this article we assume a suitable discrepancy is provided to us. The threshold ε controls the trade-off between the accuracy of the approximation and computational cost: a small ε yields accurate approximations but requires more simulations, see e.g. Marin et al. [2012]. Given t samples from the model for some θ, that is, x (i) θ ∼ π(· | θ) for i = 1, . . . , t, the value of the ABC posterior in Equation (2) can be estimated as where " ∝ ∼ " means that the left-hand side is approximately proportional to the right-hand side. Algorithms based on Markov Chain Monte Carlo and sequential Monte Carlo have been used to improve the efficiency of ABC as compared to the basic rejection sampler [Marjoram et al., 2003, Sisson et al., 2007, Beaumont et al., 2009, Toni et al., 2009, Marin et al., 2012, Lenormand et al., 2013. Unfortunately, the sampling based methods still require a very large number of simulations. In this paper we focus on the challenging scenario where the number of available simulations is limited, e.g. to fewer than a thousand, rendering these sampling-based ABC methods infeasible. Different modelling approaches have also been proposed to reduce the number of simulations required. For example, in the synthetic likelihood method summary statistics are assumed to follow the Gaussian density [Wood, 2010, Price et al., 2017 and the resulting likelihood approximation can be used together with MCMC. Wilkinson [2014], Meeds and Welling [2014], Jabot et al. [2014], Kandasamy et al. [2015], Drovandi et al. [2015], Gutmann and Corander [2016], Järvenpää et al. [2017] all use Gaussian processes (GP) to accelerate ABC in various ways. Some other alternative approaches are considered by Fan et al. [2013], Papamakarios and Murray [2016]. Also, Beaumont et al. [2002], Blum [2010], Blum and François [2010] have used modelling as a post-processing step to correct the approximation error of the nonzero threshold.
While probabilistic modelling has been used to accelerate ABC inference, and strategies have been proposed for selecting which parameter to simulate next, little work has focused on trying to quantify the amount of uncertainty in the estimator of the ABC posterior density under the chosen modelling assumptions. This uncertainty is due to a finite computational budget to perform the inference and could be thus also called as "computational uncertainty". Consequently, little has been done to design strategies that directly aim to minimise this uncertainty. To our knowledge, only Kandasamy et al. [2015] have used the uncertainty in the likelihood function to propose new simulation locations in a query-efficient way but they assumed that the likelihood can be evaluated, although with high computational cost. Also, Wilkinson [2014] modelled the uncertainty in the likelihood to rule out regions with negligible posterior probability. Rasmussen [2003] used GP regression to accelerate Hybrid Monte Carlo but did not consider ABC. Osborne et al. [2012] developed an active learning scheme to select evaluations to estimate integrals such as the model evidence under GP modelling assumptions, however, their approach is designed for estimating this particular scalar value. Finally, Gutmann and Corander [2016] proposed Bayesian optimisation (BO) to efficiently select new evaluation locations, but the BO strategies they used to illustrate the framework are originally designed for optimisation and not for ABC.
In this article we propose an acquisition function for selecting the next evaluation location tailored specifically for ABC. The acquisition function measures the expected uncertainty in the estimate of the (unnormalised) ABC posterior density function over the future evaluation of the simulation model, and proposes the next simulation location so that this expected uncertainty is minimised. We also consider some other variants of this strategy. In Section 2 we formulate our probabilistic approach on a general level. In Section 3 we propose a particular algorithm, based on modelling the discrepancy with a GP. Section 4 contains experiments. Some additional details are discussed in Section 5 and Section 6 concludes the article. Technical details and additional experiments are presented in the supplementary material.

Problem formulation
We start by presenting the main idea of the probabilistic framework for query-efficient ABC inference. Suppose we have training data D 1: of simulation model outputs x i ∈ X with corresponding parameters θ i . Suppose also that we have a Bayesian model that describes our uncertainty about the future simulation output x * ∈ X with parameter θ * conditional on the training data D 1:t . This uncertainty is represented by a probability measure π(x * | θ * , D 1:t ). Instead of modelling the full data x * ∈ X , we note that in practice it might be enough to model only some summary statistics s(x * ) ∈ S ⊂ R r . Alternatively, the discrepancy between the observed data and simulator output can be modelled as is done later in this article. Now, our estimate for the ABC posterior density π ABC depends on the training data as is clearly the case if, e.g. Equation (3) is used, and can therefore also be considered a random quantity. Given the training data D 1:t , we assume that we can represent the uncertainty in π ABC using a probability measure π(π ABC | D 1:t ) over the space of (suitable smooth) density functions π ABC : Θ → R + .
If the amount of available simulations is limited due to high computational cost, we may have considerable uncertainty of the ABC posterior π ABC . Let L(π(π ABC )) denote the loss due to our uncertain knowledge of the ABC posterior density. This loss function could measure, for example, the overall uncertainty in the probability density π ABC , the uncertainty in a particular point estimate of interest such as posterior mean, or the maximum uncertainty over all possible parameter values. We consider the sequential setting where, at each iteration, we need to decide the next evaluation location(s). After each iteration, we can compute the uncertainty in the ABC posterior and the corresponding loss function, and fit a model that predicts the next simulation output, given all data available at the time. If m future simulations can be performed in the current iteration, our aim is to choose evaluation locations θ * = {θ t+1 , θ t+2 , . . . , θ t+m } such that the expected loss, after simulating the model at these locations, is minimised. That is, we want to minimise E x * | θ * ,D1:t (L(π(π ABC | x * , θ * , D 1:t ))) = L(π(π ABC | x * , θ * , D 1:t ))π(x * | θ * , D 1:t ) dx * , with respect to θ * , where we need to average over the unknown simulator outputs x * = {x t+1 , . . . , x t+m } at parameters θ * using our model for the new simulator output π(x * | θ * , D 1:t ). If the loss function measures the uncertainty of the ABC posterior density, then this approach is a Bayes optimal solution to a decision problem of minimising the expected uncertainty when m simulations can be done in parallel and offers thus a query-efficient approach to determine the evaluation locations. This approach resembles the entropy search (ES) method Schuler, 2012, Hernández-Lobato et al., 2014]. Other related approaches have been proposed by Wang et al. [2016], Bijl et al. [2016], Wang and Jegelka [2017]. Different from these approaches, our main goal is to select the parameter for a future run of the costly simulation model so that the uncertainty in the approximate posterior is minimised. ES, in contrast, is designed for query-efficient global optimisation and it aims to find a parameter value that maximises the objective function, and to minimise the uncertainty related to this maximiser. We note that the rationale of our approach is essentially the same as in probabilistic numerics literature (see e.g. Hennig et al. [2015]) or in sequential Bayesian experimental design (see Ryan et al. [2016] for a recent survey). However, different from these approaches, our interest is to design the evaluations to minimise the uncertainty in a quantity that itself describes the uncertainty of the parameters of a costly simulation model. The uncertainty in the former is due to limited budget for model simulations that we can control, while the uncertainty in the latter is caused by noisy observations that have already been provided to us and are considered here as fixed.
The framework outlined above requires some modelling choices and can lead to computational challenges as the selection of the future evaluation location itself can require costly evaluations (as is the case with the ES). In the following section we propose an efficient algorithm based on a loss function that measures the uncertainty in the (unnormalised) ABC posterior over the parameter space and a GP surrogate model. We also consider some alternative strategies that are more heuristic but easier to evaluate. Furthermore, we restrict our discussion to a sequential setting, i.e. m = 1, and leave extensions to multiple simultaneous acquisitions for future work. We note that the outlined strategy is "myopic", meaning that the expected uncertainty after the next batch of evaluations is considered only, and the number of simulations left in a limited budget is not taken into account, see e.g. González et al. [2016] for some discussion in BO context; non-myopic strategies are beyond the scope of this work.
Details of our approach appear in the next section, but the main idea is shown in Figure 1. We model the discrepancy ∆ θ = ∆(x obs , x θ ) with GP regression (Fig. 1b). The ABC posterior is proportional to the prior times the probability of obtaining a discrepancy realisation that is below the threshold when the model is simulated. However, because the GP is fitted with limited training data, this probability cannot be estimated exactly, causing uncertainty in the ABC posterior density function (Figure 1a). We propose an acquisition function that selects the next evaluation location to minimise the expected variance of the (unnormalised) ABC posterior density over the parameter space.

Nonparametric modelling and parameter acquisition
This section contains the details of our algorithms. Section 3.1 describes the GP model for the discrepancy, which permits closed-form equations for many required quantities of the ABC posterior estimate which are derived in Section 3.2. In Sections 3.3 and 3.4 we formulate the proposed acquisition functions and handling uncertainty in GP hyperparameters is briefly discussed in Section 3.5.

Quantifying the uncertainty of the ABC posterior estimate
As in Gutmann and Corander [2016], one can compute the posterior predictive density for new discrepancy value at θ using ∆ θ | D 1:t , θ, φ ∼ N (m 1:t (θ), v 2 1:t (θ) + σ 2 n ) and obtain a model-based estimate for the acceptance probability given the modelling assumptions and training data D 1:t , as P is the cdf of the standard normal distribution. This probability is approximately proportional to the likelihood and yields a useful point estimate of the likelihood function. We here take a different approach and explicitly exploit the fact that part of the probability mass of ∆ θ is due to our uncertainty in latent function f and GP hyperparameters φ. For simplicity, we first assume that the GP hyperparameters φ are known and discuss relaxing this assumption later in this article. Indeed, if we knew f , the (unnormalised) ABC posteriorπ ABC (θ) and the acceptance probability p a (θ) 2 could be computed using the formulas With a limited number of discrepancy-parameter pairs in D 1:t there is uncertainty in the values of the function f (and in GP hyperparameters φ) which we propose to quantify and attempt to minimise in order to accurately estimate the ABC posterior. The following result (whose proof is found in the supplementary material) allows us to compute the expectation and variance of the unnormalised ABC posterior.
Lemma 3.1. Under the GP model described in Section 3.1, the pointwise expectation and variance ofπ ABC with respect to π(f | D 1:t ) are respectively, where m 1:t (θ) and v 2 1:t (θ) are given by Equations (5) and (6), respectively, and T (·, ·) is Owen's t-function which satisfies for h, a ∈ R.
We note that Equation (8) equals the point estimate of the (unnormalised) posterior density that was used in Gutmann and Corander [2016]. The variance of the unnormalised ABC posterior in Equation (9) depends on Owen's t-function that needs to be computed numerically. However, there exists an efficient algorithm to evaluate its values by Patefield and Tandy [2000].
It is of interest to examine when the variance in Equation (9) is large. If a parameter θ satisfies m 1:t (θ) = ε, then the first term of Equation (9) is maximised, and in this case the second term is maximised when v 2 1:t (θ) is maximised. On the other hand, if m 1:t (θ) ε but v 1:t (θ) |m 1:t (θ) − ε|, the first term in Equation (9) is approximately maximised and the second term is also close to its maximum value, especially if also v 1:t (θ) σ n . Because the ABC threshold ε is usually chosen very small, we thus conclude that the variance in Equation (9) tends to be high in regions where the mean of the discrepancy m 1:t (θ) is small and/or the variance of the latent function v 2 1:t (θ) is large relative to the mean function. Some further insight to Equation (9) is obtained by using the approximation ). This approximation, also known as the delta method, produces where the last term is constant and can be dropped. This equation has some similarity with the lower confidence bound (LCB) criteria used in bandit problems and Bayesian optimisation where β t is a tradeoff parameter. Both equations produce small values if the mean of the discrepancy m 1:t (θ) is small (assuming also that ε ≤ m 1:t (θ)) or the variance v 2 1:t (θ) is high relative to the mean m 1:t (θ). However, the LCB tradeoff parameter β t typically depends on the iteration t and the dimension of the parameter space (see Srinivas et al. [2010] for theoretical analysis) while in the posterior variance (Equation (11)) this tradeoff is determined automatically in a nonlinear fashion and, unlike for LCB, the variance formula depends on the prior density π(θ).
Oher useful facts forπ ABC can also be obtained. Some of these formulas are not required to apply our methodology and are thus included as supplementary material. For instance, the cdf for the acceptance probability p a (θ) is if z ∈ (0, 1), and zero if z ≤ 0, and 1 if z ≥ 1. This formula enables the computation of quantiles which can be used for assessing the uncertainty via credible intervals. Setting α = F pa(θ) (z), where α ∈ (0, 1) and solving for z yields the α-quantile that was already used in Figure 1a, From the above equation we see, e.g., that the median is given by Φ((ε − m 1:t (θ))/σ n ).
Above we assumed that the GP hyperparameters φ are known but in practice these need to be estimated. One can use the MAP-estimate in the place of the fixed values in the previous formulae. The MAP-estimate is computed by maximising the logarithm of the marginal posterior φ MAP where π(φ) is the prior density for GP hyperparameters and where the covariance function in K(θ 1:t ) = k(θ 1:t , θ 1:t )+σ 2 n I depends naturally also on φ. For the rest of the paper, we assume that the MAP estimate is used for GP hyperparameters, however, we also briefly discuss how one could integrate over them in Section 3.5.

Efficient parameter acquisition
We define our loss function for model-based ABC inference as whereπ ABC (θ) = π(θ)p a (θ) is the unnormalised ABC posterior and the variance is taken with respect to the unknown latent function f conditioned on the training data D 1:t . We call the function in Equation (16) as the integrated variance loss function (abbreviated as "expintvar"). It measures the uncertainty in the unnormalised ABC posterior density averaged over the parameter space Θ. The loss function is defined in terms of the unnormalised ABC posterior because we are here interested in minimising the uncertainty in the posterior shape. Also, this choice allows tractable computations unlike some other potential choices such as defining the integrated variance over the normalised ABC posterior density function. However, in principle, other loss functions, suitable for particular problem at hand, can be defined. We obtain the following formula for computing the expected integrated variance loss function L 1:t (θ * ) when the new candidate evaluation location is θ * . The proof is rather technical and can be found in the supplementary.
A future evaluation at θ * causes a deterministic reduction of the GP variance that is given by the Equation (19). However, the variance of the unnormalised ABC posterior depends on the discrepancy realisations and we need to average over π(∆ * | θ * , D 1:t ). It is easy to see that if τ 2 1:t (θ, θ * ) → v 2 1:t (θ), then the integrand at the corresponding parameter θ approaches zero. It can also be shown (using Owen [1980, Eq. 2.3]) that if τ 2 1:t (θ, θ * ) = 0, then the integrand equals the current variance given by Equation (9). While some of the derivations could be done analytically, computing the expected integrated variance requires integration over the parameter space Θ. This can be done with Monte Carlo or quasi-Monte Carlo methods; we here use importance sampling (IS) to approximate the integral when p > 2. Using the IS estimator [Robert and Casella, 2004, Eq. 3.10, p. 95], we obtain where w 1:t+1 (θ, θ * ) is the term inside the square brackets in Equation (18) and the importance weights are given by and θ (i) ∼ π q (·) for i = 1, . . . , s. The importance distribution π q (θ) is proportional to the current variance of the unnormalised ABC posterior i.e. π q (θ) ∝ π 2 (θ)V(p a (θ) | D 1:t ). This importance distribution is a reasonable choice, because one evaluation is unlikely to change the variance surface much and the expected variance thus has similar shape as the current variance surface. It is easy to see that if the prior is bounded and proper i.e. π(θ) < ∞ and Θ π(θ) dθ = 1, then π q defines a valid probability density (up to normalisation). Because the normalising constant of π q is unavailable, we need to normalise the weights in Equation (21). Generating samples from the importance distribution π q is not straightforward but can be done using (e.g. adaptive) Metropolis algorithm or using sequential Monte Carlo methods. As outlined in Section 2, the new evaluation location is chosen to minimise the expected loss, that is where the right hand side is a set of parameters because the minimiser may not be unique. We call this new strategy 'an acquisition rule' according to the nomenclature in the Bayesian optimisation literature. Unlike in BO, however, our aim is not to optimise the discrepancy but to minimise our uncertainty in the ABC posterior approximation. The second term in Equation (18) does not depend on θ * and its value can be computed just once (or omitted completely) and the normalisation of the prior density π(θ) does not affect the solution of (22) as it only scales the objective function. Gradient-based optimisation with multiple starting points can be used for solving (22) and the gradient is derived in the supplementary material. The resulting algorithm for estimating the ABC posterior is outlined as Algorithm 1.

Alternative acquisition rules
We briefly discuss some alternative acquisition functions, also motivated by the goal of ABC inference. The required derivations follow directly from our previous analysis and we include these strategies to our experiments in Section 4. One such alternative to the expected integrated variance strategy is to evaluate where the current uncertainty of the unnormalised ABC posterior is highest. This approach is similar to Kandasamy et al. [2015]. This strategy is a reasonable heuristic in the sense that it computes next in the point where improvement in estimation accuracy is needed most, although it does not account for how large an improvement can be expected in the point (or overall). This approach requires solving the optimisation problem Algorithm 1 GP-based ABC inference using the expected integrated variance acquisition function.
To encourage further exploration, similarly to Gutmann and Corander [2016], we also consider a stochastic variant of the maxvar acquisition function in Equation (24). Specifically, we generate the evaluation point randomly according to the variance surface π q (θ) ∝ π 2 (θ)V(p a (θ)) which can also be used as an importance distribution for the expected integrated variance acquisition function as discussed earlier. That is, instead of finding the maximiser, we generate θ t+1 ∼ π q (θ). This strategy requires generating random samples from π q (θ) but sampling (and optimising) the variance function can be done fast compared to the time required to run the simulation model. We call this method "rand_maxvar".
The stochastic acquisition rule is reminiscent of Thompson sampling, but it is actually quite different. In our method, acquisitions are drawn at random from the probability distribution which is proportional to the (point-wise) variance of the approximate posterior density. In Thompson sampling, instead, one generates a posterior density realisation from the model, and chooses the next point as the maximiser of this realisation.
The maxvar and rand_maxvar strategies avoid the integration over the parameter space that is necessary for the expintvar method. However, one could replace the integration in expintvar by only a single evaluation at the candidate point. In other words, this method chooses a location with the highest expected reduction in the uncertainty of the unnormalised ABC posterior in that particular location. We call this variant as "expdiffvar" from now on.
A comparison of the acquisition functions in a one-dimensional toy problem is shown in Figure 2. A simulation model has been run eight times and the acquisition functions for selecting the ninth evaluation location are plotted for comparison. The current variance surface (maxvar) and the expected integrated variance (expintvar) function are plotted with three values of the threshold ε. Unlike the current variance surface, the expected integrated variance appears insensitive to the value of the threshold. Figure 2b shows also that using the MAP-estimate for the GP hyperparameters causes underestimation of the variance of the unnormalised ABC posterior. In the next section we show how the uncertainty in GP hyperparameters is (approximately) taken into account.

Uncertainty in hyperparameters
Above we assumed that either the GP hyperparameters φ are known or MAP estimate is used. Here we briefly discuss how the uncertainty in the GP hyperparameters could be also taken into account. Integrating over the uncertainty in the GP hyperparameters requires Monte Carlo sampling as in Murray and Adams [2010]. An alternative approach is to use central composite design [Rue et al., 2009, Vanhatalo et al., 2010. Briefly, in central composite design (CCD) certain design points φ i are chosen and each of them is given where γ i is a design weight. This approach has the advantage that the amount of design points grows only moderately with increased dimension and has been shown to yield good accuracy in practice. Further details on choosing the design points and their weights are given in Vanhatalo et al. [2010].
Integrating over the uncertainty in the GP hyperparameters φ in Lemma 3.1 leads to the following calculations. Using the law of total expectation yields where the grid points and the corresponding weights are φ i and ω i , respectively, and where a(θ, If Monte Carlo sampling is used, then ω i = 1/s for all i = 1, . . . , s, where s is the number of samples. Similarly, for the variance we obtain where This formula with CCD integration was already used in Figure 2b. One can also take into account the uncertainty in GP hyperparameters in the expected integrated variance acquisition function. The posterior predictive distribution for a future simulation is then approximated by a Gaussian mixture and one can make the simplification by (incorrectly) assuming that the future evaluation will not affect the GP hyperparameters but only the latent function f . Evaluating the resulting acquisition function requires a high number of calls to Owen's t-function and GP formulas and is computationally more costly and thus possibly impractical. Alternatively, one can define an integrated 3 acquisition function as in Snoek et al. [2012], Hernández-Lobato et al. [2014], Wang and Jegelka [2017] which only requires computing Equation (18) for each sampled GP hyperparameter φ i . The integrated acquisition function is then averaged over these values. Alternatively, one could simply use the posterior mean of the hyperparameters in the place of the MAP estimate. However, we leave a detailed analysis for future work.

Experiments
We compare the proposed expected integrated variance acquisition rule (expintvar) to commonly used BO strategies: expected improvement (EI) and lower confidence bound (LCB) criterion, see e.g. Shahriari et al. [2015]. We use the same trade-off parameter for LCB as Gutmann and Corander [2016], but unlike them, we consider the deterministic LCB rule. As a simple baseline, we also draw points sequentially from the uniform distribution, abbreviated as "unif". We included also the probability of improvement (PI) strategy in preliminary experiments, but it resulted in poor estimates and was therefore excluded from the comparisons.
In addition to expintvar, we also include the maxvar, rand_maxvar and expdiffvar strategies, which were briefly described in Section 3.4, to our list of methods to be compared. The MAP estimate for the GP hyperparameters φ is used in all the experiments. We use MATLAB and GPstuff 4.6 [Vanhatalo et al., 2013] for GP fitting. For fast and accurate computation of Owen's t-function, we use a C-implementation of the algorithm by Patefield and Tandy [2000]. The algorithms in this article are also made available in the ELFI (engine for likelihood-free inference) Python software package by Lintusaari et al. [2017b].
The total variation (TV) distance is used for assessing the accuracy of the posterior approximation when the ground-truth is available. It is defined as TV = 1/2 Θ | π ABC (θ) − π true (θ)| dθ, where π ABC is the estimated ABC posterior and π true is the true baseline. As the baseline we use the true ABC posterior with the same threshold as used for the approximations but in some scenarios, however, the baseline is the true posterior for computational convenience and to see the overall approximation quality. The point estimate of the ABC posterior density function for the comparisons is always computed using Equation (8).

Synthetic 2D simulation models
To compare the different acquisition strategies first without the need to actually handle different simulation models, we construct "synthetic" discrepancies by adding Gaussian noise to certain parametric curves, and use these to simulate the discrepancy realisations directly. The true baseline is computed using the posterior density given by Equation (7) with a small predefined threshold ε. As test cases we consider 1) a unimodal density with two correlated variables, 2) a bimodal density, 3) a density where the first parameter is (almost) unidentifiable, and 4) a banana shaped density, all with a uniform prior. The resulting posterior densities are illustrated in Figure 3. (See supplementary material for additional details). The integration and sampling steps required by expintvar and rand_maxvar strategies are performed in a grid of 50 2 evaluation locations. The initial training set size is t 0 = 10 and the initial training sets for the repeated experiments are generated randomly from the uniform prior as is done in the other test cases as well.
The threshold is fixed so that differences in approximation quality between the acquisition methods are solely caused by the selection of the evaluation locations. However, because selecting a reasonable threshold can be challenging in practice, we also examine how updating this value adaptively during the acquisitions affects the results. In the supplementary we show corresponding results when the threshold is constantly updated so that it matches either the 0.01th or the 0.05th quantile of the realised discrepancies. The results in Figure 4 indicate that the expintvar is the best method overall but also rand_maxvar produces good results. Of the common alternatives, LCB is clearly the best and produces results with similar accuracy as rand_maxvar. The performance of the EI strategy is poor because it tends to focus evaluations greedily around the mode and sample insufficiently in the tail areas which often results in poor estimates to the ABC posterior density and high variability between different experiments. Interestingly, the uniform strategy produces the best estimates in the case of the unimodal example. Most of the acquisitions are not focused on the modal region but because the modelling assumptions hold everywhere and the parameter space is rather small, the extrapolation seems to work well in this case.

Gaussian simulation model
A simple Gaussian simulation model is used to study the effect of prior strength and the dimension of the parameter space. Data points are generated independently from x i ∼ N (· | θ, Σ), i = 1, . . . , n, where θ ∈ Θ = [0, 8] p needs to be estimated and the covariance matrix Σ is known. If θ ∼ N (a, B) truncated to Θ, the true posterior is x i is the sample mean. As discrepancy, we use the Mahalanobis distance ∆ θ = ((x obs −x θ ) T Σ −1 (x obs −x θ )) 1/2 .

Strength of the prior
In the first experiment we set p = 2, n = 5, Σ ii = 1, and Σ ij = 0.5 for i = j. The initial training set size is t 0 = 10 and the threshold is fixed to ε = 0.1. Integration and sampling for expintvar and rand_maxvar are done as in Section 4.1. The true data mean of θ is [2,2] T . The mean of the (truncated) Gaussian prior is a = [5, 5] T and the covariance matrix is B = b 2 I. We vary b, allowing us to study the impact of prior strength relative to the likelihood. Figure 5 shows the results, and we see that the proposed acquisition rules perform consistently well regardless the strength of prior, and focus the evaluations on the posterior modal region. On the other hand, LCB samples where the discrepancies are small, i.e. in areas of high likelihood, leading to sub-optimal posterior estimation whenever the prior is also informative. Comparing Figures 5a and 5b shows that using the expintvar strategy also avoids unnecessary evaluations on the boundary, which is often undesired also in the Bayesian optimisation methods, see Siivola et al. [2017] for some discussion. Curiously, the unif rule works well when prior information is strong.

High-dimensional test cases
Next we investigate the effect of the dimension of the parameter space p. The settings are as before, except that now we use uniform priors supported on Θ = [0, 8] p and the threshold is set adaptively to the 0.01th quantile as described in Section 4.1. Further, n = 15, and the initial training set sizes are t 0 = 20 (3d) and t 0 = 30 (6d and 10d). Adaptive MCMC (with multiple chains) is used to sample from the posterior estimates and, in the case of expintvar and rand_maxvar, from the probability density π q (θ). For expintvar we use s = 500 importance samples in 3d and s = 200 in 6d and 10d. Unlike in the other test problems, the TV distance measures here the average of TVs between all the marginal densities. Figure 6 shows the results. With p ≤ 6, the rand_maxvar is the most accurate and slightly better than expintvar strategy. However, in 10d it suffers from instability in MCMC convergence. Detailed examination shows that the method often produces multimodal posterior estimates which makes the sampling difficult. Such densities are likely a result of the random acquisitions. Namely, even if the uncertainty is high in some region, it can happen that no evaluations occur there during the available iterations, due to the randomness and the curse of dimensionality. EI also tends to produce multimodal difficult-to-sample posterior estimates but similar issues were only rarely observed with other strategies. The results suggest that in high dimensions the strategies that select the acquisition locations deterministically should be preferred over the stochastic ones.

Realistic simulation models
We consider the Lotka-Volterra model and a model of bacterial infections in day care centers to illustrate the proposed acquisition methods in practical modelling situations.

Lotka-Volterra model
The Lotka-Volterra (LV) model [Toni et al., 2009] is described by differential equations , where x 1 (t) and x 2 (t) describe the evolution of prey and predator populations as a function of time t, respectively, and θ = (θ 1 , θ 2 ) is the unknown parameter to be estimated. We use a similar experiment design as in Toni et al. [2009] but with discrepancy ∆ θ = log ij |x obs , 2} denote the noisy observations at times t i , and x mod j (t i , θ) are the corresponding predictions. In comparisons we use the uniform prior with support on [0, 5] 2 , and the baseline is the exact posterior distribution that can be computed analytically. We set t 0 = 10. The threshold is set to the 0.01th quantile of the observed discrepancy realisations and the integration and sampling required by expintvar and rand_maxvar are done as in Section 4.1.
The results are presented in Figure 7. We see that the expintvar strategy produces the best posterior mean estimates (Figure 7a), while the best posterior variance estimates are obtained by LCB and rand_maxvar ( Figure 7b). However, the expintvar strategy clearly produces the most accurate posterior approximations in terms of TV distance followed by rand_maxvar and the LCB strategy ( Figure 7c).

Bacterial infections model
This model describes transmission dynamics of bacterial infections in day care centers. The model has three parameters: an internal infection parameter β ∈ [0, 11], an external infection parameter Λ ∈ [0, 2] and a co-infection parameter θ ∈ [0, 1]. Details of the model and data are described in Numminen et al. [2013]. The true posterior is not available and thus an approximate posterior computed using PMC-ABC algorithm with over two million simulations is used as the ground-truth [Numminen et al., 2013]. We use the same experiment design and discrepancy as Gutmann and Corander [2016], who used the model to illustrate their approach. Specifically, the initial training data size is t 0 = 20 and the uniform prior is used. Adaptive MCMC is again used to sample from the posterior estimates and from the probability density π q (θ). For expintvar we use s = 500 importance samples. Figure 8 shows the results. Unlike in other test cases, the additional exploration of expintvar and the stochastic rand_maxvar causes their credible intervals to be slightly wider than those obtained by other methods. Similarly, Gutmann and Corander [2016] obtained conservative estimates with their stochastic variant of the LCB acquisition rule. However, we notice that our estimates, using both the proposed acquisition methods as well as the LCB strategy, are more accurate than those reported in their paper. Running high number of additional bacterial model simulations indicates that the discrepancy is well approximated with a Gaussian in the modal area but, on the other hand, the variance of the discrepancy σ 2 n is not exactly constant as is assumed in the GP model but grows towards the tail areas. This explains why the expintvar and rand_maxvar strategies, that tend to explore more than the other proposed methods, have tendency to overestimate the variance σ 2 n and thus produce slightly overestimated posterior credible intervals in this particular case. Nevertheless, the expintvar strategy is the most stable method in the sense that it produces more consistent results over different realisations of the initial training data sets and simulator outputs than the other methods.
The maxvar and (deterministic) LCB produce estimates that are closest to the estimated ground-truth. However, the credible region for the parameter Λ is often underestimated possibly due to excess exploitation producing too small variance σ 2 n . Using input-dependent GP model as in Järvenpää et al. [2017] would likely improve the approximation quality but we do not investigate this possibility here. While the EI strategy appears to work well on average in this example, it actually has high variability and occasionally produces too narrow posterior estimates and thus performs poorly overall.

Discussion
In this section we offer guidelines for potential users and discuss some additional details of our algorithms. While the developed methods worked well, it may not be clear to the end user which method to use in practice. The expintvar strategy has a sound Bayesian decision theoretic basis and it performed the best overall producing consistent approximation quality in different scenarios. We thus recommend this strategy. However, expintvar required up to 1 minute of computation time for selecting the next evaluation location in our 6d Gaussian test problem while this optimisation step took up to 4s for UCB and EI, and at most 8s for maxvar. The corresponding time for rand_maxvar was 30s. In our 2d Gaussian case, the expintvar strategy required at most 10s and all the other methods less than a second. These values are, however, suggestive since the computational time depends on various settings and the amount of training data. All these times are also negligible compared to the run time of many realistic simulation models which can be hours or days. In the supplementary material we consider an alternative approach where a non-uniform acceptance threshold is used that allows slightly faster expintvar computations than the uniform case. However, in high dimensions, when p 10, we recommend maxvar because we expect the approximation error be quite large then anyway. While not designed for ABC, the LCB criterion still worked surprisingly well overall and offers another reasonable choice in practice. However, standard LCB is not suitable if the prior is informative and its accuracy deteriorated in the high-dimensional experiment. EI (and PI) performed poorly and we see little reason to use them unless the goal is to learn only the maximiser of the discrepancy. Furthermore, unlike the standard BO strategies such as LCB (with the exception of Shahriari et al. [2016]), the developed acquisition rules do not necessarily require box-constrained domain. Namely, if the prior is a bounded and proper density (to ensure that the acquisition functions are bounded and the ABC posterior defines valid pdf), the requirement of the bounded support can be relaxed.
In addition to the acquisition strategy, the posterior approximation quality also depends on the GP model and some other choices. For example, the proposed acquisition strategies and the final posterior estimate depend on the threshold. We either assumed this value to be known or used a heuristic approach and set the threshold to the 0.01th quantile of the realised discrepancies. We also considered other choices but this approach worked well. In principle, the strategy for selecting the threshold could also vary during the iterations. While some ABC methods bypass selecting the threshold, they may not be applicable when the budget for simulations is very small. Our framework is also applicable for model-based ABC methods that do not require the threshold.
We used the zero mean GP model in our experiments. While Wilkinson [2014], Drovandi et al. [2015], Gutmann and Corander [2016] considered certain parametric mean functions which might help focusing the simulations on the modal area, our choice is a safe option. Namely, if there is a large region containing no simulations, the discrepancy tends to zero there. Thus, the uncertainty will be high in the region, attracting future simulations. Futhermore, even though in some studies (e.g. [Snoek et al., 2012]), the Matern kernel has been empirically shown to perform slightly better than squared exponential, we expect the discrepancy in many ABC modelling applications to be smooth and, consequently, used the squared exponential kernel.
To demonstrate the framework, we chose to model the discrepancy with a GP. However, this approach may not be optimal if the Gaussian assumption is violated Corander, 2016, Järvenpää et al., 2017]. Non-Gaussian measurement models can be used but the acquisition criteria in Equation (9) or (18) may become costly to evaluate. One could also model the log-likelihood directly with a GP and select the evaluations at the maximiser of the variance of the likelihood function [Kandasamy et al., 2015], which is similar to our maxvar criterion. One could also model the individual summaries with independent GPs as in Jabot et al. [2014], Meeds and Welling [2014]. In both of these cases the evaluation locations could be chosen based on the ideas in Section 2.
An alternative to the proposed stochastic acquisition rule is to sample new evaluation locations from the current ABC posterior estimate. This approach seems to work well in some scenarios but no systematic comparison was done. However, the posterior estimate could get stuck to a poor region due to an "unlucky" discrepancy realisation, after which new evaluations would be focused on this seemingly good region only and the method has little chance to recover from the local optimum.
While our approach is designed for fitting costly simulation models, we note that it can be useful even when the simulation model is relatively cheap to run. For example, for a developer of a simulation model, it may be useful to first obtain rough estimates for the model parameters before using costly computations for final and accurate results. Our method is also applicable for estimating the tail probabilities of Gaussian processes over some parameter domain. An approach similar to ours has also been applied to the problem of estimating an excursion set by Chevalier et al. [2014].

Conclusions
We considered the challenging problem of performing Bayesian inference when the likelihood function cannot be evaluated and simulating data from the statistical model is costly. We proposed to use another instance of Bayesian inference to quantify the uncertainty in the approximate posterior due to the limited budget of simulations and to design the simulations to minimise the expected uncertainty in the posterior approximation. Such computations can be costly themselves but we chose a loss function that measures such uncertainty and allowed developing a tractable and practical algorithm for selecting the next evaluation location to run the simulation model. Notably, compared to many realistic simulation models, the run time of which can be hours or days, the computational overhead introduced by our approach is negligible. Experiments demonstrated that the proposed method performs better or similarly as the commonly used Bayesian optimisation strategies and other, more heuristic approaches obtained as a side-product of our derivations. Our approach also takes prior density into account, does not necessarily require box-constrained parameter space and has a sound decision-theoretic basis that extends to other ABC surrogate modelling scenarios beyond those considered in this article.
As future work, other surrogate models and principled approaches for selecting the threshold could be investigated. We here focused on single acquisitions but our approach in principle extends to batch acquisitions as well. This enables parallelised inference, which is particularly useful when computationally very costly simulation models need to be fitted.

A Proofs
We provide the derivations of Lemma 3.1 and Proposition 3.2 below.
A formula for the variance ofπ ABC (θ) can be obtained similarly. First we see that The first term of Equation (29) can be further written as where we have used Fubini-Tonelli theorem to change the order of integration. The integrand can be now written as an unnormalised Gaussian pdf for f and, after integrating over f and some further calculations, the resulting formula can be recognised as where 1 = [1, 1] T and the function Φ 2 denotes the bivariate Gaussian cdf with mean m 1:t (θ)1 and covariance matrix which is clearly symmetric and positive definite since σ 2 n > 0. Denoting the correlation coefficient ρ(θ) = v 2 1:t (θ)/(σ 2 n +v 2 1:t (θ)), we see that Equation (31) can be further written as where 0 = [0, 0] T . The equality follows from the connection between bivariate Gaussian cdf and Owen's t-function, see Owen [1956] for this fact and its proof. Also, simple calculations show that The rest of the result now follows from the derivations above, Equation (8) and the fact V(π(θ)p a (θ)) = π 2 (θ)V(p a (θ)).

Similar computations as for the mean function show that
This formula further shows that the change in the GP variance is deterministic and depends only on the chosen evaluation location θ * . Changing the order of integration using Tonelli theorem we obtain To simplify the notation in the following formulas, we define m 0 := m 1:t (θ), m 1 := m 1:t+1 (θ) and similarly for the variance terms v 2 0 and v 2 1 . We also define τ 2 0 (θ * ) := τ 2 1:t (θ, θ * ). Using these conventions and Equations (37) and (38) we see that The integral of Equation (44) can be computed similarly as in the proof of Lemma 3.1 and after some straightforward computations one obtains To simplify Equation (45), we use again the Fubini-Tonelli theorem to change the order of integration and some straightforward (but tedious) manipulations to obtain where we have defined c(x) := (σ 2 n + v 2 1 − τ 2 0 (θ * ))/(1 + x 2 ) and used again the well-known product rule for Gaussian pdfs.

B.1 Additional derivations
We start by deriving the cdf for the random variable p a (θ) when the uncertainty in the GP hyperparameters φ is taken into account. The corresponding results forπ ABC (θ) follow easily. The cdf of p a (θ) evaluated at z ∈ (0, 1) is it is zero if z ≤ 0, and one if z ≥ 1. In above, the density π(φ) describes our knowledge about the GP hyperparameters φ given the training data D 1:t (conditioning on data is ignored to simplify notation). The integral in the equations above is taken over the domain of the GP hyperparameters φ. The integral can be approximated using e.g. CCD as discussed in the main text. If the hyperparameters are fixed and π φ (φ) is replaced with a point mass, one obtains Equation (13).
A formula for the pdf can be obtained by differentiating the cdf. We first realise that z = Φ(Φ −1 (z)) which implies 1 = dz dz = d dz Φ(Φ −1 (z)) = Φ (Φ −1 (z))(Φ −1 ) (z) for z ∈ (0, 1). This fact is used to further show that Using the formula (52) allows to compute for z ∈ (0, 1) and it is zero elsewhere. Finally, the pdf is obtained by marginalising the GP hyperparameters, that is The mean and variance of p a (θ) were already presented in Sections 3.2 and 3.5. The quantiles can be computed as in Equation (14). If the uncertainty in the GP hyperparameters is taken into account, then numerical root finding such as bisection search is required for inverting the cdf.

B.3 Gradient of the maxvar acquisition function
We compute the gradient of the maxvar acquisition function with respect to the parameter vector θ. We take into account the uncertainty in the GP hyperparameters but if a point estimate is used instead, the formulae can be simplified by ignoring the summations, setting ω i = ω 1 = 1 and replacing φ i with the point estimate. First we denote where so that π 2 (θ)V(p a (θ)) ≈ π 2 (θ)(I 1 (θ) − I 2 (θ)).
Differentiating Equation (67) with respect to the parameter vector θ yields Computing the derivatives of I 1 produces Using the Leibniz integration rule, the gradient of I 2 can be written as where the integration is applied elementwise. The second term in Equation (70) can be further simplified as b(θ,φ i ) where on the third line we have recognised the integrand as an unnormalised Gaussian pdf. Finally, straightforward calculations show that Gradients of the GP mean and variance functions m and v 2 depend on the chosen covariance function and are not shown here.

B.4 Gradient of the ABC posterior approximation
The gradient of the posterior approximation with respect to parameter θ is useful if e.g. Hamiltonian Monte Carlo algorithm is used to sample from this density. The unnormalised approximate posterior is π ABC (θ | x obs ) = π(θ) Φ(a(θ)) where, as earlier, π(θ) denotes the prior density and a is defined as in Equation (66). Differentiating the logarithm of Equation (74) yields where the gradient of the term a can be computed as in Equation (72).
We present additional results for the 2d experiments in Section 4.1. The settings are the same except that the threshold is not predefined but is set to the 0.01th quantile of the realised discrepancies, and updated constantly during the acquisitions. Consequently, the selection of the evaluation locations also affects how the threshold is chosen. These results are shown in Figure 9. Figure 10 shows the corresponding results when the threshold is determined using the 0.05th quantile.

D Non-uniform acceptance threshold
Instead of using the uniform (i.e. "0-1") threshold π ε (x obs | x) ∝ 1 ∆(x obs ,x)≤ε , other choices are possible. For instance, one can use "Gaussian threshold" π ε (x obs | x) ∝ N (∆(x obs , x) | m ε , σ 2 ε ), where the threshold ε is replaced by two new parameters m ε and σ 2 ε that control the quality of the ABC approximation. The unnormalised ABC posterior approximation at θ is then given bỹ This approach can be seen as an approximation to the uniform threshold but it could be interpreted also as additional Gaussian measurement (or modelling) error as described by Wilkinson [2013]. Proceeding similarly as in the proof of Lemma 3.1 and using Gaussian identities in the appendix of Rasmussen and Williams [2006] to compute the required integrals, the expectation and variance ofπ N ABC with respect to π(f | D 1:t ) can be shown to be E(π N ABC (θ)) = π(θ) N (m ε | m 1:t (θ), σ 2 + v 2 1:t (θ)), These experiments are as in Figure 9 except that the 0.05th quantile is used. Larger threshold as in Figure  9 generally produces slightly worse posterior estimates.