Distilling Importance Sampling for Likelihood Free Inference

Abstract Likelihood-free inference involves inferring parameter values given observed data and a simulator model. The simulator is computer code which takes parameters, performs stochastic calculations, and outputs simulated data. In this work, we view the simulator as a function whose inputs are (1) the parameters and (2) a vector of pseudo-random draws. We attempt to infer all these inputs conditional on the observations. This is challenging as the resulting posterior can be high dimensional and involves strong dependence. We approximate the posterior using normalizing flows, a flexible parametric family of densities. Training data is generated by likelihood-free importance sampling with a large bandwidth value ϵ , which makes the target similar to the prior. The training data is “distilled” by using it to train an updated normalizing flow. The process is iterated, using the updated flow as the importance sampling proposal, and slowly reducing ϵ so the target becomes closer to the posterior. Unlike most other likelihood-free methods, we avoid the need to reduce data to low-dimensional summary statistics, and hence can achieve more accurate results. We illustrate our method in two challenging examples, on queuing and epidemiology. Supplementary materials for this article are available online.


Introduction
Many statistical models are specified by simulators, which can be used to generate data under the model given values of parameters. Probability calculations are often not possible for such models. In particular, it can be infeasible to evaluate the likelihood function L(θ): the probability (or density) of the observed data under parameters θ. This makes it difficult to use standard methods of statistical inference such as maximum likelihood and many Monte Carlo methods for Bayesian inference.
Likelihood-free inference (LFI) -also known as simulation-based inference -describes methods to perform inference using simulators without evaluating the likelihood function. One popular class of LFI methods is approximate Bayesian computation (ABC) . Here one runs the simulator under many θ values, and each time calculates a distance between the simulated data y and the observed data y 0 . The distances are used to produce a sample (or weighted sample) of θs from an approximation to the posterior e.g. by selecting the θs with the smallest distances.
A drawback of ABC is that is suffers from a curse of dimensionality: the cost of producing an accurate posterior approximation rise rapidly with dim(y). An intuitive explanation is that, even under the best θ choices, close matches of y and y 0 are rare unless dim(y) is low. Therefore it is common to use dimension reduction, replacing y with a vector of low dimensional summary statistics s(y) when calculating distances. However, using summary statistics typically results in a loss of posterior accuracy.
An alternative class of LFI methods use conditional density estimation (Grazian and Fan, 2019). These estimate a density based on simulated pairs of parameters and data. Then one can condition on the observed data to approximate its posterior distribution. These methods avoid the ABC curse of dimensionality, but typically still require dimension reduction of the data to summary statistics. So the associated loss of information remains a problem.
We aim to improve the efficiency of LFI algorithms by parameter augmentation. The idea is to infer not only the model parameters θ, but also all the random variables the simulator samples during its operation. Effectively this learns to control the simulator to produce output similar to the observations. Now it is no longer the case that outputting close matches will always be rare. In principle this can avoid the ABC curse of dimensionality without the need for summary statistics.
However, inference under parameter augmentation is challenging. The posterior is high dimensional and may have strong dependencies e.g. lying on a lower dimensional manifold. This makes it hard to design effective proposals for Monte Carlo methods. We use approximate versions of the posterior, which correspond to inference under noisy observation of the data, with bandwidth parameter controlling the variance of the noise. The resulting distribution is the prior for → ∞, and the posterior for → 0. Thus large produces a less challenging inference problem. We approximate these distributions using normalizing flows (Papamakarios et al., 2021), a family of flexible and computationally tractable distributions. They transform a simple base random distribution to a complex distribution, by applying a sequence of learnable transformations. Normalizing flows can be trained using batches of data sampled from a target distribution.
We propose a method which alternates two steps. The first is importance sampling, using the current normalizing flow as the proposal. The target distribution is an approximate posterior with large , chosen so that importance sampling produces reasonably accurate results. The second step is to use the resulting weighted sampled to train the normalizing flow further. Following Li et al. (2017), we refer to this as distilling the importance sampling results. By iteratively distilling importance sampling results, we can target increasingly accurate posterior approximations i.e. reduce .
We demonstrate our algorithm on two challenging examples. In a queuing example, we obtain approximate inference results which are much more accurate than two ABC baselines: with and without summary statistics. In an epidemiology example, we are able to target the exact posterior.
In the remainder of the paper, Section 2 presents background material. Section 3 discusses LFI parameter augmentation. Sections 4 and 5 describe our method. Section 6 illustrates it on a simple two dimensional inference task. Sections 7 and 8 present our main examples. Section 9 concludes with a discussion, including limitations and opportunities for extensions. Further details are available in the supplement, including a discussion of recommended tuning choices (Section E). Code for the examples can be found at https://github.com/dennisprangle/DistillingImportanceSampling, written using PyTorch (Paszke et al., 2019). All examples were run on a 16-core desktop PC.

Related Work and Novelty
Several recent papers (Müller et al., 2019;Arbel et al., 2021;Duan, 2021;Naesseth et al., 2021) train a normalizing flow to use as an importance sampling proposal. Variations include training a Gaussian mixture (Jerfel et al., 2021) or a distribution transformed using polynomials (Cotter et al., 2020), rather than the usual neural network approaches in normalizing flows literature (see Section 2.3). The novelty of our work is an application to likelihoodfree inference. To do so, we sequentially target a sequence of approximate distributions, unlike the prior work listed above.
Parameter augmentation methods for likelihood-free inference have been used previously. Graham and Storkey (2017) propose a method using constrained Hamiltonian Monte Carlo (HMC). This alternates HMC steps with projection steps to move the MCMC state back onto a target manifold. A limitation of this method is that it requires a differentiable simulator. Optimisation Monte Carlo (OMC) (Meeds and Welling, 2015) generates random seeds to use within the simulator then uses optimisation to find the corresponding θ values which minimise the distance of the resulting y to y 0 , and computes appropriate weights. Robust OMC (Ikonomov and Gutmann, 2020) produces multiple θs given a random seed. Rare event ABC (Prangle et al., 2018) is a MCMC algorithm which, whenever a new θ is proposed, uses rare event methods to estimate the probability of the simulator producing a close match.
Let x denote the internal random variables used by the simulator (defined more precisely in Section 3.) Then OMC methods essentially sample x from its prior, which could be inefficient if the posterior for x is concentrated compared to the prior. Rare event ABC infers p(x|θ, y) for each proposed θ, which becomes expensive for long MCMC chains. In contrast to these two methods, our approach aims to infer p(θ, x|y), which we argue is more efficient.
Joint inference of θ and x can also sometimes be performed by sophisticated MCMC algorithms specialised to particular applications. For instance Shestopaloff and Neal (2014) give an MCMC algorithm for the queuing example we consider in Section 7. Our goal in this paper is to provide a more generic approach.
More broadly, our approach has connections to several inference methods. Concentrating on its importance sampling component, it is closely related to adaptive importance sampling (Cornuet et al., 2012). Concentrating on training an approximate density, it is related to the cross-entropy method (Rubinstein, 1999), an estimation of distribution algorithm (Larrañaga and Lozano, 2002), and reweighted wake-sleep (Bornschein and Bengio, 2014).

Likelihood-Free Inference
Suppose we observe data y 0 , assumed to be the output of a probability model with density p(y|θ) under some parameters θ, and we have access to a prior density π(θ). Bayesian inference aims to find the corresponding posterior density p(θ|y 0 ) ∝ p(y 0 |θ)π(θ). When the likelihood function p(y|θ) can be evaluated, many approaches to posterior inference are available.
However, many models are specified using a simulator, typically some computer code. The simulator's inputs are parameters θ, and the output is data y. This effectively defines a distribution 1 for y conditional on θ. However, probability calculations under this distribution are typically not feasible, including evaluation of the likelihood function. So standard likelihoodbased methods of inference are not possible. Likelihood-free inference (LFI) methods attempt to perform approximate inference in this setting without directly evaluating the likelihood function.
One popular LFI approach is approximate Bayesian computation (ABC). The simplest ABC algorithm, rejection ABC, samples (θ, y) pairs from the prior and simulator, and accepts those where a distance between y and y 0 (e.g. the Euclidean distance ||y−y 0 || if data is a vector of fixed length) is below some threshold . The accepted θs form a sample from an approximation to the posterior. More efficient ABC algorithms exist, for instance using more sophisticated methods of sampling θ values, or using importance sampling ideas to allow weighted samples.
Despite these improvements, close matches of y and y 0 are rare unless dim(y) is low. As a consequence, ABC suffers from a curse of dimensionality: in the asymptotic case → 0, the cost of ABC rapidly increases with dim(y). So ABC typically uses dimension reduction. That is, acceptance now occurs when ||s(y) − s(y 0 )|| is small, for some function s mapping data to a vector of summary statistics. The loss of information entailed by dimension reduction reduces inference accuracy (as sufficient statistics are rarely available).
See Prangle et al. (2018) (end of Section 2.1) for a brief review of the ABC curse of dimensionality, and Marin et al. (2012) for a review of other relevant aspects of ABC. The supplement (Section C) gives more details of a specific ABC algorithm we use as a baseline in one of our examples.
As mentioned in Section 1, an alternative approach to LFI is through conditional density estimation methods (CDE), as reviewed by Grazian and Fan (2019). CDE methods estimate a density, often a normalizing flow, using simulated (θ, y) pairs. The density estimated could be the joint π(θ, y) or a conditional p(θ|y) or p(y|θ). One can then condition on the observed data to approximate its posterior distribution. The method for conditioning depends on which density was estimated. CDE methods avoid the ABC curse of dimensionality, but typically still require dimension reduction of the data to summary statistics. So the resulting loss of information remains a problem.

Importance Sampling
Let p(ξ) =p(ξ)/Z be a target density where onlyp(ξ) can be evaluated, and the value of the normalizing constant Z = p(ξ)dξ is unknown. Importance sampling (IS) is a Monte Carlo method to estimate expectations of the form I = E ξ∼p [h(ξ)], for some function h. Here we review relevant aspects. For full details see e.g. Robert and Casella (2013) and Rubinstein and Kroese (2016).
IS requires a proposal density λ(ξ) which can easily be sampled from, and must satisfy supp(p) ⊆ supp(λ), where supp denotes support. Then So an unbiased Monte Carlo estimate of I iŝ where ξ (1) , ξ (2) , . . . , ξ (N ) are independent samples from λ, and w i =p(ξ (i) )/λ(ξ (i) ) is an importance weight. Typically Z is estimated as 1 a biased, but consistent, estimate of I. Equivalentlŷ for normalized importance weights s i = w i / N i=1 w i . A drawback of IS is that it can produce estimates with large, or infinite, variance if λ is a poor approximation to p. Hence diagnostics for the quality of the results are useful. A popular diagnostic is effective sample size (ESS), Under various assumptions, Var(Î 2 ) roughly equals the variance of an idealized Monte Carlo estimate based on N ESS independent samples from p(ξ) (Liu, 1996). Elvira et al. (2022), amongst others, note that ESS can be an unreliable diagnostic in practice and research is needed to propose better alternatives.

Normalizing Flows
A normalizing flow represents a random vector ξ with a complicated distribution as an invertible transformation of a random vector z with a simple base distribution, typically N (0, I). Recent research has developed flexible learnable families of normalizing flows. See Papamakarios et al. (2021) for a review. We focus on autoregressive neural spline flows , which we will refer to as "spline flows" as a shorthand. See Section 5.5 for comments on exploratory work with an alternative choice.
Spline Flows Description An autoregressive transformation transforms input vector u to output vector v using v i = τ (u i , h(u <i )) (where u <i refers to the u j values with j < i). Here τ is a monotonic and invertible transformation of its first argument. The particular transformation used depends on the second argument. So v i is a transformation of u i , and the transformation used depends only on u <i . Therefore the overall transformation has a triangular Jacobian matrix, facilitating quick density calculations. Furthermore, a masked autoregressive neural network (Germain et al., 2015;Nash and Durkan, 2019) is typically used for h. This allows h(u <i ) to be computed in parallel for all i values. Durkan et al. (2019) propose using a spline transformation for τ . This transformation is a piecewise function based on partitioning a bounding box [−B, B] into several bins. The type of function used is chosen so it can be fully defined by: the location of knots (bin boundaries); the function values; and (in some cases) derivatives at the knots. All these details should be the output of h(u <i ). We use piecewise rational quadratic transformations, following Durkan et al. (2019). Outside the bounding box the function is defined to be the identity.
Spline Flows Properties Some relevant properties of spline flows are as follows. See Papamakarios et al. (2021) for full discussion of the points made here.
• Universality. An autoregressive flow can represent any distribution meeting some simple conditions, if any τ and h functions can be used. In practice this means the expressiveness of spline flows is only limited by the complexity of the splines and neural networks used.
(Throughout the paper ∇ represents the gradient operator with respect to φ.) Universality means that spline flows can approximate many distributions well. Sampling and gradient calculation are required in our algorithm. Alternative types of flow can allow faster gradient calculation, but are often less expressive.
We note that Gaussian mixture models are an alternative to flows. However their cost of sampling and density evaluation grows rapidly with dim ξ, as it involve a O([dim ξ] 3 ) matrix inversion cost.

Parameter Augmentation
Here we outline an approach to likelihood-free inference which our algorithm will use. The idea is to infer not just the parameters θ, but also all the internal stochastic behaviour of the simulator. To do so, we introduce a random variable x. These represent all outputs of an underlying random number generator used by the simulator. We suppose all the sampled random variables required during simulation can be expressed as transformations of θ and x. So the simulator is now a deterministic function y(ξ), where ξ represents the augmented parameters (θ, x).
Throughout this paper we work with simulators where x can be expressed as a fixed length random vector x ∼ N (0, I). We denote its density as π(x). Variations are possible for more complex settings. One is that x could be an infinite sequence of independent N (0, 1) random variables. This is helpful if the number of random draws required by the simulator is unknown in advance. For instance the simulator could perform a loop for a number of iterations which depends on θ in a complex fashion. Another possible choice is to allow x to be dependent on θ.

Advantages and Challenges
An advantage of parameter augmentation is avoiding the ABC curse of dimensionality. We first give an intuitive explanation. Sampling ξ from the posterior p(ξ|y 0 ), or a close approximation, can be viewed as controlling the simulator through x so that y(ξ) is a close match to y 0 . This avoids the problem of close matches being rare when only θ is controlled, as discussed in Section 2.1. Secondly, we give a more direct explanation. Suppose we can produce a good approximation q(ξ) of the augmented posterior p(ξ|y) 2 , which is suitable as an importance sampling proposal. This allows straightforward inference using standard importance sampling. .
A difficulty of parameter augmentation is that the posterior for ξ often has a challenging form. Firstly, the posterior usually has high dimension.
MacKay (2003) argues that importance sampling is not suitable for high dimensional targets due to the difficulty of producing suitable proposals. Secondly, the posterior may have strong dependencies.
Recent advances in approximate distributions have made progress on these challenges, with Müller et al. (2019) demonstrating that normalizing flows can learn good importance sampling proposals for difficult target distributions. (A theoretical justification for this is the universality property discussed in Section 2.3.) This is the motivation for using flows in this paper.
Posterior dependency under parameter augmention may be extremely strong. Indeed the posterior is often a singular distribution whose mass lies on a lower dimensional manifold (Graham and Storkey, 2017). Monte Carlo inference for such posteriors is challenging due to the difficulty of producing proposals exactly on an unknown manifold. We make progress by performing inference on a sequence of increasingly accurate approximate posteriors, which are all non-singular.

Approximate Posteriors
We define approximate posterior densities for ξ, controlled by a bandwidth value > 0, as p (ξ) =p (ξ)/Z wherẽ Here π(ξ) = π(θ)π(x) and ||y(ξ) − y 0 || is the Euclidean distance between the simulated and observed data. Densities of the form (6) are often used in ABC and correspond to the posterior for data observed with independent N (0, 2 ) additive errors (Wilkinson, 2013). Other similar approximate posteriors can be defined. For instance, the most common ABC approximation replaces the exponential term in (6) with 1[||y(ξ) − y 0 || ≤ ], where 1 denotes an indicator function. We prefer the approximation (6) in this paper since it has the same support as π(ξ). This makes the occurrence of zero importance weights in our algorithm less likely (or impossible if π(ξ) has full support). Such zero weights can cause numerical problems to our algorithm.
It will be useful later to extend the unnormalised density (6) to = 0 by takingp A valid density p 0 results when Z 0 > 0. This is typically the case when the data y is discrete, but not when it is continuous. When Z 0 = 0, the distribution for → 0 is singular with respect to π(ξ).

Objective and Gradient
Our algorithm approximates p using a family of densities q(ξ; φ), typically a normalizing flow. This section introduces an objective function to judge how well q approximates p . It then discusses how to estimate the gradient of this objective with respect to φ. Section 5 presents our algorithm, which uses these gradients to update φ while also reducing .

Objective
Given p , we aim to minimize the inclusive Kullback-Leibler (KL) divergence, This is equivalent to maximizing a scaled negative cross-entropy, which is our objective, (Scaling by Z avoids this intractable constant appearing in our gradient estimates below.) The inclusive KL divergence penalizes φ values which produce small q(ξ; φ) when p (ξ) is large. Hence the optimal φ tends to make q(ξ; φ) nonnegligible where p (ξ) is non-negligible, known as the zero-avoiding property. This is an intuitively attractive feature for importance sampling proposal distributions. Indeed recent theoretical work shows that, under some conditions, the sample size required in importance sampling scales exponentially with the inclusive KL divergence (Chatterjee and Diaconis, 2018).
Our work could be adapted to use the χ 2 divergence (Dieng et al., 2017;Müller et al., 2019), which also has theoretical links to the sample size needed by IS (Agapiou et al., 2017).
Choice of Proposal Density In our main algorithm, defined below in Section 5, when gradient estimates are required we have an existing estimate of φ (i.e. at the start of the outer for loop in Algorithm 1 we have access to φ t−1 .) Therefore is is appealing to take λ(ξ) = q(ξ; φ). Since we use choices of q with full support, (1) is satisfied. This λ is a function of φ, so the ξ (i) terms may have some dependence on φ. Interpreting ∇ as full differentiation could cause terms involving ∇ξ (i) to appear in g 1 , preventing it being an unbiased estimate of ∇J (φ). To avoid this henceforth take ∇ is the gradient operator based on partial differentiation with respect to φ. (Or equivalently take λ(ξ) = q(ξ; ⊥φ), where ⊥ is the "stop-gradient" operator in the notation of Foerster et al., 2018. This matches the implementation in code, as ⊥ corresponds to the torch command detach.)

Improved Gradient Estimates
Here we discuss reducing the variance and cost of g 1 .
Truncating Weights To avoid high variance gradient estimates we apply truncated importance sampling (Ionides, 2008). This truncates the weights at a maximum value ω, giving truncated importance weightsw i = min(w i , ω). The resulting gradient estimate is This typically has lower variance than g 1 , but has some bias. See the supplement (Section A) for more details and discussion, including how we choose ω automatically, and a heuristic argument why truncation bias is not problematic.
One can also think of g 2 as replacingp (ξ (i) ) with min[p (ξ), ωλ(ξ)]. Thus we attempt to optimize q in a local neighbourhood of λ.
A more sophisticated alternative to truncation is Pareto smoothing the largest importance weights (Yao et al., 2018). We did not use this as it is more expensive to implement than truncation, but it would be interesting to investigate in future work.
Resampling Calculating g 2 requires evaluating ∇ log q(ξ (i) ; φ) for 1 ≤ i ≤ N . Each of these has a computational cost, but often many receive small weights and so contribute little to g 2 .
To reduce this cost we can discard many low weight samples, by using importance resampling (Smith and Gelfand, 1992) as follows. We sample n N times, with replacement, from the ξ (i) s with probabilitiess i =w i /S where S = N i=1w i . Denote the resulting samples asξ (j) . Then an unbiased estimate of g 2 is

Algorithm
Algorithm 1 gives our inference algorithm, and this section discusses various details of it.

5:
Select a new value t ≤ t−1 (see Section 5.3 for details).

10:
Calculate φ t,j from φ t,j−1 and g using a step of an optimization algorithm such as Adam (where φ t,0 = φ t−1 .) 11: end for 12:

13:
if = 0 or runtime above prespecified limit then

Algorithm Overview
A summary of the algorithm follows. Steps 4-6 perform importance sampling with target p . Steps 7-11 use the output to train φ, aiming for the resulting density to approximate the importance sampling target. As the algorithm progresses, step 5 reduces , aiming for the importance sampling ESS to equal M (a tuning choice). The goal is to make the importance sampling target closer to the posterior, slowly enough to avoid high variance gradient estimates.
To use the importance sampling output efficiently, we sample from it (with replacement) several times to create training batches, and use each batch for one update of φ. We fix training batch size to n = 100, and number of batches B to M/n . So approximately M importance sampling outputs are used for training. Limiting the number of outputs used aims to avoid overfitting from too much reuse of the same training data.
An update of φ is done by a step of an optimisation algorithm based on gradient estimates, aiming to increase J . The gradient estimate is calculated from the current batch of training data, and is calculated using (11). We use the Adam optimisation algorithm (Kingma and Ba, 2015), and found it worked well, but many alternatives could also be used (Ruder, 2016 reviews many alternatives).
Steps 13-14 terminate the algorithm after a fixed runtime or once = 0 is reached (when this is possible i.e. for discrete data.) Alternatively, the termination decision could be based on approximate inference diagnostics, such as those of Yao et al. (2018) or Huggins et al. (2020).
We investigate the remaining tuning choices empirically in Section 7. For now note that N must be reasonably large since our method to update t relies on making an accurate ESS estimate, as detailed in Section 5.3. A further summary and discussion of tuning choices appears in the supplement (Section E).

Pretraining
Our initial target is the prior π(ξ), since we take 0 = ∞. The initial q should be similar to this. Otherwise the first gradient estimates produced by importance sampling are likely to have high variance. To achieve this we often make use of pretraining. We assume the prior can easily be sampled from. Then pretraining iterates the following steps: 1. Sample (ξ (i) ) 1≤i≤n from π(ξ).
(Also, to avoid numerical errors, we define the ESS to be zero if all weights are zero.) In step 5 of Algorithm 1 we first check whether N ESS ( t−1 ) < M , the target ESS value. If so we set t = t−1 . Otherwise we set t to an estimate of the minimal such that N ESS ( ) ≥ M , computed by a bisection algorithm 3 .
We perform at least 50 iterations of bisection, and then terminate once

Reaching t = 0: Exact Inference
For continuous data, the set of ξ such that y(ξ) = y 0 typically has probability zero 4 under any choice of q(θ; φ). Hence DIS cannot reach = 0 since the unnormalised densityp 0 from (7) will almost surely result in all weights being zero. Instead, like ABC, DIS produces increasingly good posterior approximations as → 0. However for discrete data, it is plausible to reach t = 0. When this happens, the algorithm produces a proposal density for exact importance sampling of the augmented posterior p 0 (ξ).

Choice of q
For all our examples we use an autoregressive neural spline flow for q(ξ; φ) with 5 bins for each variable, and bounding box [−10, 10]. The spline details are output by an autoregressive residual neural network (Nash and Durkan, 2019) with 3 blocks, 20 hidden features and ReLU activation. More comments on the choice of q are in Section E of the supplement.
We use Algorithm 1 with N = 4000 training samples and a target ESS of M = 2000. These values give a clear visual illustration: we investigate efficient tuning choices later. We pretrain so that q(ξ; φ 0 ) approximates the prior. Figure 1 shows our results. The normalizing flow quickly adapts to meet the importance sampling results, and after 30 iterations we reach = 0.008, taking roughly 0.1 minutes. Visually, the samples lie close to the exact posterior support described above.

Model
Consider a M/G/1 queuing model of a single queue of customers. Times between arrivals at the back of the queue are Exp(θ 1 ). On reaching the front of the queue, a customer's service time is U (θ 2 , θ 3 ). All these random variables are independent. We consider a setting where only inter-departure times are observed: times between departures from the queue.

Existing Methods
The M/G/1 model with inter-departure observations is a common benchmark for likelihood-free inference, which can often provide fast approximate inference. See Papamakarios et al. (2019) for a comparison of methods for an M/G/1 example. A potential disadvantage of existing likelihood-free methods is that they typically require using low dimensional summary statistics to be efficient, and this can lose some information from the data. As a likelihood-free inference baseline, we investigate a ABC-PMC algorithm. This is a modification of existing ABC-PMC methods to use the unnormalised target (6). See Section C of the supplement for full details. We also include results from a sophisticated MCMC scheme (Shestopaloff and Neal, 2014) for this model. This provides near-exact samples from the posterior, which is a useful gold standard comparison for the approximate inference methods. A limitation of this MCMC scheme is the difficulty in adapting it to other observation regimes -e.g. observing the queue length at various times (Pickands and Stine, 1997) -which would be relatively easy for likelihood-free methods.

DIS Implementation
We introduce ϑ and x, vectors of length 3 and 2 dim(y), and take ξ as the collection (ϑ, x). Our simulator transforms these inputs to θ(ϑ) and y(ξ), with the property that ξ ∼ N (0, I) outputs a sample from the prior and the model. See the supplement (Section B) for details. We pretrain q to approximate this distribution. This tuning study suggests taking N and M/N as small as possible, while ensuring M is at least a few hundred to avoid high variance in the importance sampling estimates. We use this guidance here and in the next section's example. In both these examples, the cost of a single simulator run is low. For more expensive models efficient tuning choices may differ. Figure 2 (right) shows that the final DIS results with N = 5000 and M = 250 are a reasonably close match to near-exact MCMC output using the algorithm of Shestopaloff and Neal (2014). The DIS results for θ 2 and θ 3 are more diffuse than the MCMC results. Also the θ 2 DIS results lacks the sharp truncation of the right tail present in the MCMC results. (Truncation occurs at the minimum observed inter-departure time of 4.01, as the minimum service time cannot be greater than this.)

Results
We also compare DIS to our ABC-PMC baseline, under a similar runtime. For full details of methods and results see the supplement (Section C). Table 1 summarizes the results. ABC-PMC without summary statistics targets (6), just as DIS does, but cannot reach as low an value in the same time, so it produces inaccurate estimates for all parameters. To improve ABC-PMC performance, we consider replacing the data with summaries: quartiles of the inter-departure times, as used in Papamakarios et al. (2019). This improves results for θ 1 and θ 2 , but still produces severe bias for θ 3 .

Example: Susceptible-Infective Process on a Random Network
This section illustrates how DIS can be used to infer discrete variables. The key idea is to represent them as transformations of continuous latent variables. Such a mapping must be non-injective, and this is supported by the DIS algorithm without a need for any modifications. We demonstrate this in the context of a spreading process model of an infectious disease on a random network, specifically a susceptible-infective (SI) compartmental model on a Erdős-Rényi network. See Dutta et al. (2018) for a discussion of similar models, although not the exact one we consider.

Model
Our model represents a population of m individuals as a network. Each individual corresponds to a node. The individuals are labelled 0, 1, . . . , m − 1. An edge indicates contact between the corresponding individuals. The network is assumed to be an instance of the Erdős-Rényi random network (Erdős and Rényi, 1959), so each pair of nodes form an edge with probability θ 1 .
The spreading process of the infectious disease is modelled as a discretetime process. We consider a compartmental model where at any time each individual is in one of three states: susceptible, infective or immune. Initially there is a single infective node, individual 0. Susceptible neighbours of an infective individual at time t are exposed and become infective at time t + 1 with probability θ 2 . Exposed individuals who are not infected instead become immune. (All infection and edge formation events are assumed independent.) Parameters θ 1 and θ 2 have independent U (0, 1) prior distributions. We assume that the observations are a binary vector indicating infective status for each node/time combination.

Likelihood-based Inference
The likelihood function of the model can be calculated by summing contributions from all possible networks, as detailed in the supplement (Section D). This is computationally feasible for small m, enabling near-exact posterior inference. Below, we use likelihood-based importance sampling as a benchmark for an example with m = 5. However, likelihood calculations become infeasibly expensive for larger m, as the number of networks to sum is 2 m(m−1)/2 . For instance below we also investigate an example with m = 10, giving 2 45 ≈ 3 × 10 13 total networks.
Our simulator transforms these inputs to θ(ϑ) and y(ξ). As in Section 7, the simulator has the property that ξ ∼ N (0, I) produces samples from the prior and model. We pretrain q to approximate this distribution. Full details of the simulator transformation are in the supplement (Section D). In brief, x edge has an entry for each possible edge. An edge occurs when the corresponding entry falls below a threshold controlled by ϑ 1 . Similarly x infect has an entry for each individual. When that individual is exposed, then infection occurs if the corresponding entry falls below a threshold controlled by ϑ 2 . (Only one random variable is needed per individual as they can only be exposed once. Afterwards they are either infective or immune.) To summarize, as well as continuous parameters θ we also wish to infer discrete variables: presence of edges and infection status. As noted at the start of Section 8, the discrete variables are inferred by representing them as a mapping of continuous latent variables ξ.

Results
We first consider a small network with m = 5 nodes, observed for 5 time steps. This has dim(ξ) = 17. Following Section 7 we use N = 5000 and M = 250 in DIS. As discussed in Section 3.2, it is plausible here for DIS to reach = 0, since the data are discrete. Our experiment reaches = 0 after 1.5 minutes. Figure 3 shows posterior histograms for θ match those for likelihood-based importance sampling, which produces 100, 000 samples in only 4 seconds.
Next we consider a larger network with m = 10 nodes, observed for 10 time steps. Now dim(ξ) = 57. As described in Section 8.2, exact likelihood calculation is infeasible here. However DIS with N = 5000, M = 250 can perform exact inference, reaching = 0 after 387 minutes. Figure 4 (left) shows the resulting posterior histograms for θ. DIS also outputs posterior distributions of the latent variables x edge and x infect , controlling the network structure and whether individuals become immune on exposure. These are summarized in Figure 4 (right). In the observed data, only individuals 1 and 7 do not become infected. The figure shows these individuals have a low posterior probability of becoming infected on exposure. Individual 0 has a moderate probability of infection on exposure. This is because they are already infective initially, so there is little information in the data on what would happen to them on exposure. The remaining individuals have a high posterior probability of becoming infected. In the data, individuals 3,6,8 are infected in the first time period. Consequently the edges from these to individual 0 have the highest posterior probabilities, as this is the only possible route of infection. Edges representing plausible routes to individuals infected later have lower probabilities, presumably as there are more possibilities for how this happens. The lowest probability edges are those which would result in individuals being infected before this is observed to happen in the data e.g. the edge (0, 2).

Conclusion
We present distilled importance sampling for likelihood-free inference, and show its application in several examples. An M/G/1 queuing example demonstrates that the method produces approximate results which are much more accurate than ABC, without needing the use of summary statistics. An epidemic model example shows that the method can also produce inference for discrete data. Indeed it demonstrates that for discrete data, it is possible for DIS to target the exact posterior. We also investigate tuning choices, and propose some default choices. The supplement (Section E) has a summary of tuning recommendations, and discussion of some alternative possibilities. The remainder of the section discusses possible extensions of our methodology, as well as limitations.
Likelihood-Based Inference This paper concentrates on likelihood-free inference. DIS can also be applied to some likelihood-based inference problems. However, exploratory work shows that here it perform less well than the competing methods listed at the start of Section 1.1. See the supplement (Section F) for further discussion.
More Efficient Training Every iteration of Algorithm 1 generates new model simulations, and uses them for a few iterations of training. This produces a computationally feasible algorithm for our examples, as the simulators are reasonably fast. However for more expensive simulators, it would be desirable to use simulations more efficiently in the training process. For instance, training could continue with the same training data until convergence -although this risks overtraining. Alternatively, training data from all previous iterations could be reused for training -although ensuring the correct target distribution may be difficult. We plan to explore these approaches in future work.
Amortised Flows Throughout the paper, q(ξ; φ) is a generic normalizing flow producing the vector ξ. We refer to this as a black-box approach, since it involves no knowledge of how x is used by the simulator. This is less practical for large dim(ξ), as an increasingly large amount of simulations are required to learn such a high dimensional distribution.
An alternative is leveraging amortization in the flow, using knowledge about the simulator. Suppose a simulator requests a random variable. An amortized flow would determine the distribution to sample from using knowledge of: previously returned random variables (as for a black-box flow); the current simulator state; the line of code making the request. The complexity of the flow now depends on the number of sampling statements in the code, rather than the total number of samples made. This can be a big reduction in complexity when the simulator makes many iterations over a few sample statements. Similar ideas appear in inference networks (Le et al., 2017).

Overriding Random Number Generators
For the examples in this paper, we wrote custom simulator code which allowed x values sampled from q to be supplied as input. For large simulator codebases, it would be ideal to work with the original code, without needing to develop a custom version. Following Baydin et al. (2019), a possible approach is to override the random number generator that the code uses to instead provide x values proposed by q.
Use with CDE As mentioned in Sections 1 and 2.1, one approach to likelihood-free inference is conditional density estimation (CDE). Unlike DIS, this allows amortized inference: after training it can be used for inference of many different observed datasets. Future work could combine the methods: use CDE to produce an initial approximate posterior estimate then refine further using DIS. This makes use of complementary advantages of the methods: CDE is amortized, while DIS relies less on summary statistics.
Acknowledgments The authors gratefully acknowledge: Alex Shestopaloff for providing MCMC code for the M/G/1 model; Sammy Ragy for spotting several typos; Andrew Golightly and Chris Williams for helpful suggestions.

Supplementary Material A Truncating Importance Weights
Importance sampling estimates can have large or infinite variance if the proposal density λ is a poor approximation to the target density p. The practical manifestation of this is a small number of importance weights being very large relative to the others. To reduce the variance, Ionides (2008) introduced truncated importance sampling. This replaces each importance weight w i withw i = min(w i , ω) given some threshold ω. The truncated weights are then used in estimates (3) or (4) (from the main paper). Truncating in this way typically reduces variance, at the price of increasing bias.
Gradient clipping (Pascanu et al., 2013) in stochastic gradient optimization operates by truncating large gradient estimates. It is common practice to prevent occasional large gradient estimates from destabilizing optimization. Truncating importance weights in Algorithm 1 (the DIS algorithm in the main paper) has a similar effect of reducing the variability of gradient estimates. A potential drawback of either method is that gradients lose the property of unbiasedness, which is theoretically required for convergence to an optimum of the objective. Pascanu et al. (2013) give heuristic arguments for good optimizer performance when using truncated gradients, and we make the following similar case for using truncated weights. Firstly, even after truncation, the gradient is likely to point in a direction increasing the objective. In our approach, it should still increase the q density at ξ (i) values with large w i weights, which is desirable. Secondly, we expect there is a region near the optimum for φ where truncation is extremely rare, and therefore gradient estimates have very low bias once this region is reached. Finally, we also observe good empirical behaviour in our examples, showing that optimization with truncated weights can find good importance sampling proposals.
Note that we could use gradient clipping directly in Algorithm 1. We prefer truncating importance weights as there is an automated way to choose the threshold ω, as follows. We select ω to reduce the maximum normalized importance weight, max iw i N i=1w i , to a prespecified value: throughout we use 0.1. The required ω can easily be calculated e.g. by bisection. (Occasionally no such ω exists i.e. if most w i s are zero. In this case we set ω to the smallest positive w i .)

B M/G/1 Model
This section describes the M/G/1 queuing model, in particular how to simulate from it.
The model involves independent latent variables x i ∼ N (0, 1) for 1 ≤ i ≤ 2m, where m = dim(y) is the number of observations. These generate Inter-departure times can be calculated through the following recursion (Lindley, 1952) where A i = i j=1 a j (arrival times) and D i = i j=1 d j (departure times).
Numerical Stability To avoid occasional numerical stability problems, our implementation replaces the inter-arrival times equation above with We expect this to have negligible impact on the final results.

C ABC Analysis of M/G/1 Example
As a baseline comparison, we perform inference for the M/G/1 example using approximate Bayesian computation (ABC). The algorithm and results are detailed here.

C.1 Algorithm
We implement ABC using Algorithm B below, a modified version of the popular ABC-PMC approach (Toni et al., 2009;Sisson et al., 2009;Beaumont et al., 2009). These algorithms target where || · || is the Euclidean norm, and π(x) is the density of the x variables used in the simulator. A novelty of Algorithm B is that it instead targets p (θ) = π(θ) exp − 1 2 2 ||y(ξ) − y 0 || 2 π(x)dx.
This is the θ marginal of the joint target used by DIS, equation (6) of the main paper. Each iteration of Algorithm B produces a weighted sample (θ t i , w t i ) 1≤i≤N . This targetsp t (θ) in the same way as importance sampling output. Over the course of the algorithm, the value of t is reduced, and the number of simulated datasets required for an iteration tends to increase.

4:
while i < N do

10:
With probability α accept: let θ t i = θ * , d t i = d * , w t i = π(θ * )/λ t (θ * ) and increment i by 1. Calculate t+1 (see below). 14: end for In the first iteration λ t (θ) is the prior. After this, a kernel density estimate is used based on the previous weighted sample. We follow Beaumont et al. (2009) where ϕ is the density of a normal distribution and Σ t−1 is the empirical variance matrix of (θ t−1 i ) 1≤i≤N calculated using weights (w t−1 i ) 1≤i≤N To implement step 14 of the algorithm, we select t+1 so that the acceptance probability of a typical member of the previous weighted sample is reduced by a prespecified factor k. Specifically, we defined as the median of the d t i values, and find t+1 by solving

C.2 Results
We ran Algorithm B on the M/G/1 example with k = 0.7 and N = 250. Runtime is checked every time step 2 is reached, and the algorithm terminates if this exceeds 1 hour: the time used by DIS for this example in the main paper.
Our implementation details give some modest advantages to ABC-PMC. Firstly, under our stopping rule, the final iteration of ABC-PMC typically continues several minutes beyond 1 hour, providing it with more computational resources than DIS. Secondly, the ABC-PMC algorithm can performs more simulations per minute than DIS in this example. This is because later ABC-PMC iterations require large numbers (millions) of simulations, and our implementation was able to perform these in large parallel batches.
Despite these advantages, the final ABC-PMC value is 6.31, which is much larger than the final value for DIS, = 2.30. Recall that the approximate posterior under is the exact posterior for data observed with independent Gaussian errors of scale (Wilkinson, 2013). At = 6.31 this extra error is large compared to the data (whose values range from 4 to 34), hence it seems likely to produce significant approximation error. This is reflected in very wide posterior marginals -see Figure 5. Further, the asymptotic cost per output sample of ABC is O( − dim(y) ) (Prangle et al., 2018), suggesting that reaching = 2.30 using ABC requires a factor of (2.30/6.31) −20 ≈ 5.8 × 10 8 more simulations, which is computationally infeasible.
To reduce the computational cost of ABC, the data are often replaced by low dimensional summary statistics. We investigate using quartiles of the inter-departure times as summaries, following Papamakarios and Murray (2016). This was done by adding a final step to the simulator which converts the raw data to quartiles, and letting y 0 be the observed quartiles. We ran ABC-PMC with quartile statistics using the same tuning details as above. The final value is now 0.17. Figure 5 shows that the posterior marginals are much improved. However the maximum service time marginal is still a poor approximation of the near-exact MCMC results, suggesting that the summaries have lost information from the raw data.

D SI Process Example Further Details
In this section, we describe further details of our analysis of a SI process on a network.

D.1 Simulator
In this model, the parameters θ 1 and θ 2 represent the probability of contact and the probability of infection. They are a priori uniformly distributed on the unit interval. Let ϑ 1 and ϑ 2 be two independent N (0, 1) random variables. We then take θ 1 = Φ(ϑ 1 ) and θ 2 = Φ(ϑ 2 ) where, as earlier, Φ is the N (0, 1) cumulative distribution function.
Recall that m is the number of nodes in the network. The latent variables x infect,1 , . . . , x infect,m determine whether a node becomes infected or immune on its first exposure to infection. The i-th node is infected at the first time period when it has a link to an infective node if Φ(x infect,i ) < θ 2 (or equivalently x infect,i < ϑ 2 ). The latent variables x edge,1 , . . . , x edge,m(m−1)/2 control network creation: there are m(m − 1)/2 possible edges and the j-th edge (according to some enumeration of edges) is added when Φ(x j ) < θ 1 (or equivalently x j < ϑ 1 ).
Clearly when ϑ, x edge , x infect have a N (0, I) distribution, then the simulator produces a sample from the prior and model.

D.2 Likelihood-based Analysis
Here we detail how the likelihood function for this model can be computed, and outline why this is only computationally feasible for small m.
From (13) it follows that with a small number of nodes the likelihood evaluation has an affordable computational cost. However, the cost increases exponentially in m because (13) involves a summation over G having cardinality 2 m(m−1)/2 .

E Tuning Recommendations
Here we summarize recommendations on tuning DIS which appear throughout the paper, and add some further comments.

E.1 Hyperparameters
For the M/G/1 and SI network examples we recommend using: • N = 5000 (importance sampling sample size) • M = 250 (target effective sample size) • n = 100 (training batch size) • B = M/n = 3 (number of batches) The sinusoidal example is much simpler, and we expect a wide range of tuning choices to work well. In the paper we use N = 4000, M = 2000 to allow easy visualization.
In general we expect optimizing these hyperparameters for particular tasks is likely to be useful. In particular, this paper focuses on simulators which are computationally cheap to run. For more expensive models, the optimal tuning choices could be qualitatively different.

E.2 Normalizing Flow Architecture
For all our examples we use an autoregressive neural spline flow for q(ξ; φ) with 5 bins for each variable, and bounding box [−10, 10]. The spline details are output by an autoregressive residual neural network (Nash and Durkan, 2019) with 3 blocks, 20 hidden features and ReLU activation. We recommend this as a default choice of architecture. However we expect more complex and higher dimensional targets to require larger architectures e.g. with more layers and hidden units.
An indication that a larger architecture is needed in the DIS algorithm is that t stops decreasing at some * > 0. This can occur when no density q(ξ; φ) from the current architecture is a sufficiently good approximation of p * . Therefore it becomes extremely unlikely for the target ESS value to be attained, which is required to reduce t further.
In exploratory work we also investigated the real NVP ("Non-Volume Preserving") normalizing flow (Dinh et al., 2016). We found this flow could be much faster to use than a spline flow (i.e. more iterations of Algorithm 1 can be run per unit time). However it sometimes produced numerical errors in the M/G/1 example, and struggled to approximate the target for low in the SI network example. We speculate that the latter issue is because this posterior requires a complex dependency structure -e.g. the presence of some contact edges rule out other edges -that is easier to capture using an autoregressive flow. Due to these issues, we recommend spline flows as a default choice.

E.3 Selecting
The method outlined in the main paper to select is free of tuning parameters. However, alternative methods to select could be used, for instance by using variations on the standard effective sample size (see e.g. Elvira et al., 2022).

F Likelihood-based inference
Here we expand on comments from the main paper's conclusion, Section 9, on applying DIS to likelihood-based inference. This is possible when a full data likelihood L(θ, x) is available given some latent variables x, but integration to L(θ) is infeasibly computationally expensive. In this case the unnormalised DIS targetp would be a tempered form of π(ξ)L(θ, x), for instancep (ξ) = π(ξ)L(θ, x) 1− (taking 0 = 1).
Earlier drafts of this paper included such applications to likelihood-based inference. A reviewer suggested investigating how useful tempering was. Exploratory work found it was often of little use: simply initialising 0 = 0 in our algorithm produced good results. The resulting algorithm has little novelty over other work listed at the start of Section 1.1. Therefore we have concentrated on LFI applications, where initialising 0 = 0 is generally not feasible as it would result in all importance sampling weights being zero (almost surely for continuous data, or with high probability for high dimensional discrete data).
Our exploratory findings about the poor performance of tempering for likelihood-based applications of DIS may not apply to other related methods. For instance, a reviewer commented that there is some similarity between DIS for likelihood-based inference and recent work of Surjanovic et al. (2022). This also uses a tempering scheme and estimates densities by minimising the inclusive Kullback-Leibler divergence (equation (8) from the main paper). Their work attains good empirical results. However its goal is to create an efficient parallel tempering MCMC scheme, rather than an importance sampling proposal as in DIS.