Mining gold from implicit models to improve likelihood-free inference

Significance: In scientific fields as diverse as particle physics, genetics, and epidemiology, complex computer simulations provide the most accurate description of phenomena, but the corresponding likelihood function cannot be computed. This is a major challenge for statistical inference. We present inference techniques for this case that combine the insight that additional latent information can be extracted from the simulator with the power of neural networks in regression and density estimation tasks. In toy problems and a real-life particle physics problem, these techniques lead to better sample efficiency than established methods, demonstrating the potential to enable more precise scientific results from large-scale experiments across many disciplines.

Abstract: Simulators often provide the best description of real-world phenomena. However, the probability density that they implicitly define is often intractable, leading to challenging inverse problems for inference. Recently, a number of techniques have been introduced in which a surrogate for the intractable density is learned, including normalizing flows and density ratio estimators. We show that additional information that characterizes the latent process can often be extracted from simulators and used to augment the training data for these surrogate models. We introduce several loss functions that leverage these augmented data and demonstrate that these techniques can improve sample efficiency and quality of inference.


Introduction
In many areas of science, complicated real-world phenomena are best described through computer simulations. Typically, the simulators implement a stochastic generative process in the "forward" mode based on a well-motivated mechanistic model with parameters θ. While the simulators can generate samples of observations x ∼ p(x|θ), they typically do not admit a tractable likelihood (or density) p(x|θ). Probabilistic models defined only via the samples they produce are often called implicit models. Implicit models lead to intractable inverse problems, which is a barrier for statistical inference of the parameters θ given observed data. These problems arise in fields as diverse as particle physics, epidemiology, and population genetics, which has motivated the development of likelihood-free inference algorithms such as Approximate Bayesian Computation (ABC) [1-4] and neural density estimation (NDE) techniques. While many of these techniques can be exact in the limit of infinite training samples, real-world simulators are computationally expensive, and sample efficiency is immensely important.
We present a suite of new techniques that can dramatically improve the sample efficiency for training neural network surrogates that estimate the likelihood p(x|θ) or likelihood ratio r(x|θ_0, θ_1) = p(x|θ_0)/p(x|θ_1). This provides the key quantity needed for both frequentist and Bayesian inference procedures. Our approach involves extracting additional information that characterizes the latent process from the simulator, as we explain in Sec. 3. In Sec. 4 we introduce the loss functions that utilize this augmented data. In Sec. 5 we demonstrate through a range of experiments that these new techniques provide a significant increase in sample efficiency compared to techniques that do not leverage the augmented data, which ultimately increases the quality of inference.

Likelihood-free inference methods can be grouped into two categories. The first uses the simulator itself during inference, as in ABC and probabilistic programming systems [35, 36]. Here we focus on a second category, in which the simulator is used to generate training data for a tractable surrogate model that is used during inference. There are rich connections between simulator-based inference and learning in implicit generative models such as GANs, with a considerable amount of cross-pollination between these areas [9].
Neural density estimation (NDE). More recently, several methods for conditional density estimation have been proposed, often based on neural networks. They can be used to train a surrogate for the likelihood p(x|θ) [17, 19, 34] or, in a Bayesian setting, the posterior p(θ|x) [29, 33, 37].
Novel contributions. The most important novel contribution that differentiates our work from the existing methods is the observation that additional information can be extracted from the simulator, and the development of loss functions that allow us to use this "augmented" data to more efficiently learn surrogates for the likelihood function. In addition, we show how the augmented data can be used to define locally optimal summary statistics, which can then be used for inference with density estimation techniques or ABC. We playfully introduce the analogy of mining gold, as this augmented data requires work to extract and is very valuable. In experiments we demonstrate that these approaches can dramatically improve sample efficiency and quality of likelihood-free inference.
Concurrently, the application of these methods to a specific class of problems in particle physics has been discussed in Refs. [38, 39]. The present manuscript is meant to serve as the primary reference for these new techniques and is addressed to the broader physical science and machine learning communities. Most importantly, it generalizes the specific particle physics case to almost any scientific simulator, requiring significantly weaker assumptions than those made in the physics context. We also introduce an entirely new algorithm called SCANDAL, for which we provide the first experimental results.
3 Extracting more information from the simulator

We consider a scientific simulator that implements a stochastic generative process that proceeds through a series of latent states z_i ∈ Z_i and finally to an output x ∈ R^{d_x}. The latent space structure Z can involve discrete and continuous components and is derived from the control flow of the (differentiable or non-differentiable) simulation code. Based on the mechanistic model implemented by the simulator, each latent state is sampled from a conditional probability density z_i ∼ p_i(z_i|θ, z_{<i}), and the final output is sampled from x ∼ p_x(x|θ, z). The likelihood is then given by

p(x|θ) = ∫ dz p(x, z|θ) = ∫ dz p_x(x|θ, z) ∏_i p_i(z_i|θ, z_{<i}).

Often the likelihood is intractable precisely because the latent space z is enormous and it is infeasible to calculate this integral explicitly. In real-world scientific simulators, the trajectory for a single observation can involve many millions of latent variables.
In this paper we consider the problem of estimating the likelihood p(x|θ) or the likelihood ratio r(x|θ_0, θ_1), which for the practical purpose of inferring parameter values θ can be used almost interchangeably, based on the data available from N runs of the simulator.
Typically, the setting of likelihood-free inference assumes that the only available outputs from the simulator are samples of observations x ∼ p(x|θ). But in real-life simulators, more information can usually be extracted. We typically have access to the latent variables z, and the distributions of each latent variable, p_i(z_i|θ, z_{<i}) and p_x(x|θ, z), are tractable.
The key observation that is the starting point of our new inference methods is the following: while p(x|θ) is intractable, for each simulated sample we can calculate the joint score

t(x, z|θ_0) ≡ ∇_θ log p(x, z|θ) |_{θ_0}

by accumulating the factors ∇_θ log p_i(z_i|θ, z_{<i}) as the simulation runs forward through its control flow, conditioned on the random trajectory z. It can be insightful to think of the mechanistic model in the simulator as defining a policy π_θ and of t(x, z|θ_0) as analogous to the policy gradient used in REINFORCE [40]. However, instead of trying to optimize θ via a stochastic gradient estimate of some reward function, we simply augment the data generated by the simulator with the joint score.
Similarly, we can extract the joint likelihood ratio

r(x, z|θ_0, θ_1) ≡ p(x, z|θ_0) / p(x, z|θ_1) = [p_x(x|θ_0, z) / p_x(x|θ_1, z)] ∏_i [p_i(z_i|θ_0, z_{<i}) / p_i(z_i|θ_1, z_{<i})].

The joint score and joint likelihood ratio quantify how much more or less likely a particular simulated trajectory through the simulator would be if one changed θ.
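The accumulation of these factors can be made concrete in a few lines of code. The sketch below is a hypothetical toy simulator of our own, a chain of Bernoulli transitions with probability σ(θ) (so x here is actually binomial and its likelihood tractable), but the pattern it illustrates is the general one: alongside the observable x, the run returns the joint likelihood ratio and the joint score accumulated over the latent trajectory.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def simulate_with_gold(theta0, theta1, n_steps=10, rng=random):
    """Run a toy sequential simulator at theta0 and, alongside the observable x,
    accumulate the joint likelihood ratio and the joint score."""
    x = 0                # observable: number of left turns
    log_r = 0.0          # log r(x, z | theta0, theta1)
    score = 0.0          # t(x, z | theta0)
    for _ in range(n_steps):
        p0, p1 = sigmoid(theta0), sigmoid(theta1)
        z = 1 if rng.random() < p0 else 0   # latent step, sampled at theta0
        x += z
        # accumulate log p(z_i | theta0) - log p(z_i | theta1)
        log_r += math.log(p0 / p1) if z else math.log((1 - p0) / (1 - p1))
        # d/dtheta log Bernoulli(z | sigmoid(theta)) = z - sigmoid(theta), at theta0
        score += z - p0
    return x, math.exp(log_r), score
```

Each factor is evaluated on the realized latent value, so the extraction requires only bookkeeping inside the simulator, not any change to its sampling logic.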
As a motivating example, consider the simulation for a generalization of the Galton board, in which a set of balls is dropped through a lattice of nails, ending in one of several bins denoted by x. The Galton board is commonly used to demonstrate the central limit theorem: if the nails are uniformly placed such that the probability of bouncing to the left is p, the sum over the latent space is tractable analytically and the resulting distribution of x is binomial. However, if the nails are not uniformly placed, and the probability of bouncing to the left is an arbitrary function of the nail position and some parameter θ, the resulting distribution requires an explicit sum over the latent paths z that might lead to a particular x. Such a distribution becomes intractable as the size of the lattice increases. Figure 1 shows an example of two latent trajectories that lead to the same x. In this toy example, the probability p(z_h, z_v, θ) of going left is a sigmoid-based function of θ and the nail position, involving f(z_v) = sin(πz_v) and the sigmoid function σ, where z_h and z_v are the horizontal and vertical nail positions normalized to [0, 1]. This leads to a non-trivial p(x|θ), which can even be bimodal. Code for simulation and inference in this problem is available at Ref. [41].
Figure 1 shows that a large number of samples from the simulator is needed to reveal the differences in the distribution of x for small changes in θ: the number of samples needed grows like (p/Δp)². Moreover, this toy simulation is representative of many real-world simulators in that it is composed of non-differentiable control-flow elements. This poses a difficulty, making methods based on ∇_z x [42] and ∇_θ x inapplicable, which previously motivated techniques such as Adversarial Variational Optimization [32]. But the joint score in Eq. (3) can easily be computed by accumulating the factors ∇_θ log p(z_h, z_v|θ_0), and we can calculate the joint likelihood ratio by accumulating the factors p(z_h, z_v|θ_0)/p(z_h, z_v|θ_1). In analogy to the Galton board toy example, even complicated real-world simulators often allow us to accumulate these factors as they run, and to calculate the joint score and joint likelihood ratio conditional on a particular stochastic execution trace z. We will demonstrate this with two more examples in Sec. 5.
For simulators written in an automatic differentiation framework, the calculation of the joint score and joint likelihood ratio can be entirely automatic and does not require any changes to the simulator code and output. As a proof of principle, at Ref. [43] we provide a framework that automates these calculations for any simulator in which all stochastic steps are implemented with the PYRO library [44].
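A minimal sketch of this idea in plain PyTorch (not the PYRO-based interface of the tool in Ref. [43]; the two-step Gaussian simulator here is a hypothetical example of ours): accumulate the joint log-likelihood of the realized trace with differentiable log_prob terms, then let automatic differentiation produce the joint score.

```python
import torch
from torch.distributions import Normal

# hypothetical two-step simulator: z ~ N(theta, 1), then x ~ N(z, 0.5)
theta0 = torch.tensor(0.3, requires_grad=True)
z = Normal(theta0, 1.0).sample()   # latent step (sampled values are detached)
x = Normal(z, 0.5).sample()        # observable

# joint log-likelihood of the realized trace, log p(x, z | theta0)
log_p = Normal(theta0, 1.0).log_prob(z) + Normal(z, 0.5).log_prob(x)
log_p.backward()
joint_score = theta0.grad          # for this model, analytically (z - theta0)
```

Because the sampled values are treated as fixed, the gradient is exactly the accumulated factors ∇_θ log p of the realized trace, with no change to how the simulator draws its randomness.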
4 Learning from augmented data

4.1 Key idea
How can the "augmented data" consisting of simulated observations x_i, the joint likelihood ratio r(x_i, z_i|θ_0, θ_1), and the joint score t(x_i, z_i|θ_0) be used to estimate the likelihood p(x|θ) or likelihood ratio r(x|θ_0, θ_1)? The relation between r(x, z|θ_0, θ_1) and r(x|θ_0, θ_1) is not trivial: the integral of the ratio is not the ratio of the integrals! Similarly, how can the joint score be used to estimate the intractable score t(x|θ_0) ≡ ∇_θ log p(x|θ) |_{θ_0}? The integral of the log is not the log of the integral!

Consider the squared error of a function ĝ(x) that only depends on the observable x, but is trying to approximate a function g(x, z) that also depends on the latent variable z,

L[ĝ] = E_{x,z ∼ p(x,z|θ)} [ (g(x, z) − ĝ(x))² ].

The minimum-mean-squared-error prediction ĝ*(x) is given by the conditional expectation ĝ*(x) = E_{p(z|x,θ)} [ g(x, z) ]. Identifying g(x, z) with the joint likelihood ratio r(x, z|θ_0, θ_1) and setting θ = θ_1, we define

L_r[r̂] = E_{x,z ∼ p(x,z|θ_1)} [ (r(x, z|θ_0, θ_1) − r̂(x))² ]

and find that this functional is minimized by r*(x) = arg min_r̂ L_r = E_{p(z|x,θ_1)} [ r(x, z|θ_0, θ_1) ] = r(x|θ_0, θ_1). Similarly, identifying g(x, z) with the joint score t(x, z|θ_0) and setting θ = θ_0, we define

L_t[t̂] = E_{x,z ∼ p(x,z|θ_0)} [ (t(x, z|θ_0) − t̂(x))² ],

which is minimized by t*(x) = E_{p(z|x,θ_0)} [ t(x, z|θ_0) ] = t(x|θ_0).
These loss functionals are immensely useful because they allow us to transform t(x, z|θ_0) into t(x|θ_0) and r(x, z|θ_0, θ_1) into r(x|θ_0, θ_1): we are able to regress on these two intractable quantities! This is what makes the joint score and joint likelihood ratio the gold worth mining.
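The identity behind these loss functionals can be checked directly on a small discrete model (the two-state model below is our own illustration): the conditional expectation of the joint ratio under p(z|x, θ_1) collapses to the ratio of the marginals.

```python
# Check that E_{p(z|x,theta1)}[ p(x,z|theta0)/p(x,z|theta1) ] = p(x|theta0)/p(x|theta1)
def joint(theta):
    # hypothetical model: z ~ Bernoulli(theta), x ~ p(x|z) with a fixed emission table
    p_x_given_z = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}
    return {(x, z): (theta if z else 1.0 - theta) * p_x_given_z[(x, z)]
            for x in (0, 1) for z in (0, 1)}

theta0, theta1 = 0.4, 0.6
p0, p1 = joint(theta0), joint(theta1)
for x in (0, 1):
    marg0 = sum(p0[(x, z)] for z in (0, 1))   # p(x|theta0)
    marg1 = sum(p1[(x, z)] for z in (0, 1))   # p(x|theta1)
    # conditional expectation of the joint ratio under p(z|x, theta1)
    cond_exp = sum((p1[(x, z)] / marg1) * (p0[(x, z)] / p1[(x, z)]) for z in (0, 1))
    assert abs(cond_exp - marg0 / marg1) < 1e-12
```

The cancellation is exact: each term reduces to p(x, z|θ_0)/p(x|θ_1), and summing over z leaves p(x|θ_0)/p(x|θ_1).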

4.2 Learning the likelihood ratio
Based on this observation we introduce a family of new likelihood-free inference techniques. They fall into two categories. We first discuss a class of algorithms that uses the augmented data to learn a surrogate model for the likelihood p(x|θ) or likelihood ratio r(x|θ_0, θ_1). In Section 4.3 we will define a second class of methods that is based on a local expansion of the model around some reference parameter point.
The simulators we consider in this work do not only implicitly define a single density p(x), but a family of densities p(x|θ). The parameters θ may belong to a high-dimensional parameter space. For inference based on surrogate models, there are two broad strategies to model this dependence. The first is to estimate the likelihood p(x|θ) or the likelihood ratio r(x|θ_0, θ_1) for specific values of θ or pairs (θ_0, θ_1). This may be done via a pre-defined set of θ values or on demand using an active-learning iteration. We follow a second approach, in which we train parameterized estimators for the full model p(x|θ) or r(x|θ_0, θ_1) as a function of both the observables x and the parameters θ [6, 45]. The training data then consist of a number of samples, each generated with different values of θ_0 and θ_1, and the parameter values are used as additional inputs to the surrogate model. When modeling the likelihood ratio, the reference hypothesis θ_1 in the denominator can be kept at a fixed reference value (or replaced by a marginal model with some prior π(θ_1)), and only the θ_0 dependence is modeled by the network. This approach encourages the estimator to learn the typically smooth dependence of the likelihood ratio on the parameters of interest from the training data and to borrow power from neighboring points.
ROLR:

An expressive regressor r̂(x|θ_0, θ_1) (e.g., a neural network) is trained by minimizing the squared error loss

L_ROLR = E_{x,z ∼ p(x,z|θ_1)} [ (r(x, z|θ_0, θ_1) − r̂(x))² ] + E_{x,z ∼ p(x,z|θ_0)} [ (1/r(x, z|θ_0, θ_1) − 1/r̂(x))² ].

Here and in the following, the θ dependence of the estimators is implicit to reduce the notational clutter.
Both terms in this loss function are estimators of Eq. (7) (in the second term we switch θ_0 ↔ θ_1 to reduce the variance by mapping out other regions of x space). As we showed in the previous section, this loss function is, at least in the limit of infinite data, minimized by the true likelihood ratio r(x|θ_0, θ_1). A regressor trained in this way thus provides an estimator for the likelihood ratio and can be used for frequentist or Bayesian inference methods.
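This squared-error ratio loss can be sketched in NumPy as follows (a hedged illustration; the function and argument names are ours, and the predictions would in practice come from a neural network):

```python
import numpy as np

def rolr_loss(r_hat, r_joint, y):
    """Sketch of a ROLR-style squared-error loss.
    y = 1: sample drawn at theta1, regress r_hat onto the joint likelihood ratio;
    y = 0: sample drawn at theta0, regress 1/r_hat onto the inverse joint ratio
           (the theta0 <-> theta1 swap described in the text)."""
    r_hat, r_joint, y = map(np.asarray, (r_hat, r_joint, y))
    return np.mean(y * (r_hat - r_joint) ** 2
                   + (1 - y) * (1.0 / r_hat - 1.0 / r_joint) ** 2)
```

A perfect estimator, r_hat equal to the joint ratio on every sample, drives both terms to zero; in expectation the minimizer is the marginal ratio r(x|θ_0, θ_1).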

RASCAL (Ratio And SCore Approximate Likelihood ratio):
If such a likelihood ratio regressor is differentiable with respect to θ (as is the case for neural networks), we can calculate the predicted score t̂(x|θ_0) = ∇_{θ_0} log r̂(x|θ_0, θ_1). For a perfect likelihood ratio estimator, t̂(x|θ_0) minimizes the squared error with respect to the joint score, see Eq. (8). Turning this argument around, we can improve the training of a likelihood ratio estimator by minimizing the combined ratio and score loss L_RASCAL = L_r + α L_t with a hyperparameter α.

CASCAL (CARL And SCore Approximate Likelihood ratio):

The same trick can be used to improve the likelihood ratio trick and the CARL inference method [6, 8]. Following the discussion around Eq. (1), a calibrated classifier ŝ(x|θ_0, θ_1) trained to discriminate samples {x_i} ∼ p(x|θ_0) from {x_i} ∼ p(x|θ_1) provides a likelihood ratio estimator. For a differentiable parameterized classifier, we can calculate the surrogate score t̂(x|θ_0) = ∇_{θ_0} log[(1 − ŝ(x|θ_0, θ_1))/ŝ(x|θ_0, θ_1)]. This allows us to train an improved classifier (and thus an improved likelihood ratio estimator) by minimizing the combined loss L_CASCAL = L_CARL + α L_t.

SCANDAL (SCore-Augmented Neural Density Approximates Likelihood):

Finally, we can use the same strategy to improve conditional neural density estimators such as density networks or normalizing flows. If a parameterized neural density estimator p̂(x|θ) is differentiable with respect to θ, we can calculate the surrogate score t̂(x|θ) = ∇_θ log p̂(x|θ) and train an improved density estimator by minimizing the combined negative log-likelihood and score loss L_SCANDAL = L_NDE + α L_t. Unlike the methods discussed before, this provides an estimator of the likelihood itself rather than its ratio. Depending on the architecture, the surrogate also provides a generative model.
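The defining feature of all these combined losses is that the score prediction comes from differentiating the surrogate with respect to θ. A minimal PyTorch sketch of a RASCAL-style loss (the tiny architecture and all names here are illustrative assumptions; one-dimensional x and θ are assumed):

```python
import torch
import torch.nn as nn

# hypothetical parameterized estimator: input (x, theta), output log r_hat(x | theta, theta_ref)
net = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))

def rascal_loss(x, theta, r_joint, t_joint, alpha=1.0):
    """Combined ratio + score loss: the score prediction t_hat is obtained by
    differentiating the network output with respect to theta."""
    theta = theta.detach().clone().requires_grad_(True)
    log_r_hat = net(torch.stack([x, theta], dim=-1)).squeeze(-1)
    # per-sample gradient of log r_hat with respect to theta
    t_hat, = torch.autograd.grad(log_r_hat.sum(), theta, create_graph=True)
    ratio_term = ((log_r_hat.exp() - r_joint) ** 2).mean()
    score_term = ((t_hat - t_joint) ** 2).mean()
    return ratio_term + alpha * score_term
```

With create_graph=True, the score term itself remains differentiable, so ordinary gradient descent on the network weights minimizes both terms jointly.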

4.3 Locally sufficient statistics for implicit models
A second class of new likelihood-free inference methods is based on an expansion of the implicit model around a reference parameter point θ_ref.
Up to linear order in θ − θ_ref, we find

p(x|θ) ≈ (1/Z(θ)) p(x|θ_ref) exp[ t(x|θ_ref) · (θ − θ_ref) ]

with some normalization factor Z(θ). This local approximation is in the exponential family, and the components of the score vector t(x|θ_ref), defined in Eq. (5), are its sufficient statistics.
For inference in a sufficiently small neighborhood around a reference point θ_ref, a precise estimator of the score t(x|θ_ref) therefore defines a vector of ideal summary statistics that contains all the information in an observation x on the parameters θ [see also 3, 46, 47]. The joint score, together with a minimization of the loss in Eq. (8), allows us to extract these sufficient statistics from an intractable, non-differentiable simulator, at least in the neighborhood of θ_ref. Moreover, this local model can be estimated by running the simulator at a single value θ_ref: it does not require scanning the θ space, and thus avoids the curse of dimensionality. Based on this observation, we introduce two further inference strategies:

SALLY (Score Approximates Likelihood LocallY):

By minimizing the squared error with respect to the joint score, see Eq. (8), we train a score estimator t̂(x|θ_ref). In a next step, we estimate the density p(t̂(x|θ_ref)|θ) through standard multivariate density estimation techniques. This calibration procedure implicitly includes the effect of the normalizing constant Z(θ).

SALLINO (Score Approximates Likelihood Locally IN One dimension):
The SALLY inference method requires density estimation in the estimated score space, with typically dim t ≡ dim θ ≪ dim x. But in cases with a large number of parameters, it is beneficial to reduce the dimensionality even further. In the local model of Eq. (13), the likelihood ratio r(x|θ_0, θ_1) depends on x only through h(x|θ_0, θ_1) ≡ t(x|θ_ref) · (θ_0 − θ_1), up to an x-independent constant related to Z(θ). Any neural score estimator lets us also estimate this scalar function, which is a sufficient statistic for the one-dimensional parameter space connecting θ_0 and θ_1. We can thus estimate the likelihood ratio through univariate, rather than multivariate, density estimation on ĥ.
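The final density-estimation step on the scalar statistic can be sketched as follows (histograms are used here purely for illustration; any univariate density estimator works, and the function names are ours):

```python
import numpy as np

def sallino_ratio(h_obs, h0, h1, bins=20):
    """Estimate r(x|theta0, theta1) by univariate density estimation on the
    scalar statistic h = t_hat(x|theta_ref) . (theta0 - theta1).
    h0, h1: h-values of samples simulated at theta0 and theta1, respectively."""
    edges = np.histogram_bin_edges(np.concatenate([h0, h1]), bins=bins)
    d0, _ = np.histogram(h0, bins=edges, density=True)
    d1, _ = np.histogram(h1, bins=edges, density=True)
    idx = np.clip(np.digitize(h_obs, edges) - 1, 0, bins - 1)
    return d0[idx] / np.clip(d1[idx], 1e-12, None)
```

Because the density estimation happens in one dimension regardless of dim θ and dim x, this calibration step remains cheap even for high-dimensional problems.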
The SALLY and SALLINO techniques are designed to work very well close to the reference point.
The local model approximation may deteriorate further away, leading to a reduced sensitivity and weaker bounds. These approaches are simple and robust, and in particular the SALLINO algorithm scales exceptionally well to high-dimensional parameter spaces.
For all these inference strategies, the augmented data is particularly powerful for enhancing the power of simulation-based inference for small changes in the parameter θ. When restricted to samples x ∼ p(x|θ), the variance from the simulator is a challenge. The fluctuations in the empirical density scale with the square root of the number of samples, so large numbers of samples are required before small changes in the implicit density can be faithfully distinguished. In contrast, each sample of the joint ratio and joint score provides an exact piece of information even for arbitrarily small changes in θ. On the other hand, the augmented data is less powerful for deciding between model parameter points that are far apart. In this situation the joint probability distributions p(x, z|θ) often do not overlap significantly, and the joint likelihood ratio can have a large variance around the intractable likelihood ratio r(x|θ_0, θ_1). In addition, over large distances in parameter space the local model is not valid and the score no longer characterizes the likelihood ratio, limiting the usefulness of the joint score.

5 Experiments
Generalized Galton board. We return to the motivating example in Sec. 3 and Fig. 1 and try to estimate likelihood ratios for the generalized Galton board. We use the likelihood ratio trick and a neural density estimator as baselines and compare them to the new ROLR, RASCAL, CASCAL, and SCANDAL methods. As the simulator defines a distribution over a discrete x, for the NDE and SCANDAL methods we use a neural network with a softmax output layer over the bins to model p(x|θ). All networks are explicitly parameterized in terms of θ, the parameter of the simulator that defines the position of the nails (i.e., they take θ as an input). We use a simple network architecture with a single hidden layer, 10 hidden units, and tanh activations. The left panel of Fig. 2 shows the mean squared error between the estimated log r̂(x|θ_0, θ_1) and the true log r(x|θ_0, θ_1) (estimated from histograms of 2 · 10^4 simulations at θ_0 = −0.8 and θ_1 = −0.6), summed over x ∈ [5, 15], versus the training sample size (which refers to the total number of x samples, distributed over 10 values of θ ∈ [−1, −0.4]). We find that both SCANDAL and RASCAL are dramatically more sample efficient than pure neural density estimation and the likelihood ratio trick, which do not leverage the joint score. ROLR improves upon pure neural density estimation and achieves the same asymptotic error as SCANDAL, though more slowly.
Lotka-Volterra model. As a second example, we consider the Lotka-Volterra system [48, 49], a common example in the likelihood-free inference literature. This stochastic Markov jump process models the dynamics of a species of predators interacting with a species of prey. Four parameters θ set the rates of predators and prey being born, predators dying, and predators eating prey.
We simulate the Lotka-Volterra model with Gillespie's algorithm [50]. From the time evolution of the predator and prey populations we calculate summary statistics x ∈ R^9. Our model definitions, summary statistics, and initial conditions exactly follow Appendix F of Ref. [29]. In addition to the observations, we extract the joint score as well as the joint likelihood ratio with respect to a reference hypothesis log θ_1 = (−4.61, −0.69, 0.00, −4.61)^T from the simulator. On this augmented data we train different likelihood and likelihood ratio estimators. As baselines we use CARL [6, 8] and a conditional masked autoregressive flow (MAF) [17, 51]. We compare them to the new techniques introduced in Section 4.2, including a SCANDAL likelihood estimator based on a MAF. For MAF and SCANDAL we stack four masked autoregressive distribution estimators (MADEs) [23] on a mixture of MADEs with 10 components [17]. For all other methods, we use three hidden layers. In all cases, the hidden layers have 100 units and tanh activations. Code for simulation and inference is available at Ref. [52].
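For concreteness, the following sketch shows how Gillespie's algorithm can be instrumented to return the joint score alongside the simulated populations. The hazard parameterization is an assumption for illustration (a common predator-prey convention, not necessarily the exact setup of Ref. [29]), and the summary-statistic computation is omitted.

```python
import math
import random

def lotka_volterra_gold(theta, x0=50, y0=100, t_max=5.0, rng=random):
    """Gillespie simulation of predators (x) and prey (y) that also accumulates
    the joint score d/dtheta log p(trace | theta).
    Assumed hazards theta_j * h_j(state): predator birth (x*y), predator death (x),
    prey birth (y), predation (x*y)."""
    x, y, t = x0, y0, 0.0
    score = [0.0, 0.0, 0.0, 0.0]
    while t < t_max:
        h = [x * y, x, y, x * y]                 # state-dependent hazard factors
        lam = [th_j * h_j for th_j, h_j in zip(theta, h)]
        lam_tot = sum(lam)
        if lam_tot == 0.0:                       # both species extinct
            break
        tau = rng.expovariate(lam_tot)           # waiting time ~ Exp(lam_tot)
        if t + tau > t_max:
            # only the survival factor exp(-lam_tot * (t_max - t)) contributes
            for k in range(4):
                score[k] -= (t_max - t) * h[k]
            break
        t += tau
        u, j, acc = rng.random() * lam_tot, 0, lam[0]
        while u > acc:                           # pick reaction j with prob lam_j / lam_tot
            j += 1
            acc += lam[j]
        # step log-likelihood log(lam_j) - lam_tot * tau, differentiated wrt each theta_k
        for k in range(4):
            score[k] -= tau * h[k]
        score[j] += 1.0 / theta[j]
        x += (1 if j == 0 else -1 if j == 1 else 0)
        y += (1 if j == 2 else -1 if j == 3 else 0)
    return (x, y), score
```

Because each hazard is linear in its parameter, the per-step score contributions have the simple closed form used above; more general hazards would replace 1/θ_j and h_k with the appropriate partial derivatives.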
For inference with a wide prior in the parameter space, the different probability densities often do not overlap. As discussed above, the augmented data is then of limited use. Instead, we focus on the regime where we try to discriminate between close parameter points with similar predictions for the observables. We generate training data and evaluate the models in the range (log θ_1)_i − 0.01 ≤ (log θ)_i ≤ (log θ_1)_i + 0.01 with a uniform prior in log space. In the middle panel of Fig. 2 we evaluate the different methods by calculating the mean squared error of estimators trained on small training samples. Since the true likelihood is intractable, we calculate the error with respect to the median predictions of 10 estimators trained on the full data set of 200 000 samples.
Our results indicate a trade-off between the performance in likelihood (density) estimation and likelihood ratio estimation. For density estimation, the MAF performs well. The variance of the score term in the SCANDAL loss degrades the performance, especially for larger values of the hyperparameter α. However, for statistical inference the more relevant quantity is the likelihood ratio. Here the new techniques that use the joint score, in particular SCANDAL, are significantly more sample efficient.
Particle physics. Finally, we consider a real-world problem from particle physics. A simulator describes the production of a Higgs boson at the Large Hadron Collider experiments, followed by the decay into four electrons or muons, subsequent radiation patterns, the interaction with the detector elements, and the reconstruction procedure. Each recorded collision produces a single high-dimensional observable x ∈ R^42, and the dataset consists of multiple iid observations of x. The goal is to infer the likelihood of two parameters θ ∈ [−1, 1]² that characterize the effect of high-energy physics models on these interactions. We consider a synthetic observed dataset with 36 iid simulated observations of x drawn from θ = (0, 0).
The new inference techniques can accommodate state-of-the-art simulators, but in that setting we cannot compare them to the true likelihood function. We therefore present a simplified setup and approximate the detector response such that the true likelihood function is tractable, providing us with a ground truth to compare the inference techniques to. As simulator we use a combination of MADGRAPH 5 [53] and MADMAX [54-56]. The setup and the results of this experiment are described at length in Ref. [39], which is attached as supplementary material.
In the right panel of Fig. 2 we show the expected mean squared error of the approximate log r(x|θ_0, θ_1) for the different techniques as a function of the training sample size. We take the expectation over random values of θ_0, drawn from a Gaussian prior with mean (0, 0) and covariance matrix diag(0.2², 0.2²). We compare the new techniques to the standard practice in particle physics, in which the likelihood is estimated through histograms of two hand-picked summary statistics.
All new inference techniques outperform the traditional histogram method, provided that the training samples are sufficiently large. Using augmented data substantially decreases the amount of training data required for a good performance: the RASCAL method, which uses both the joint ratio and joint score information from the simulator, reduces the amount of training data by two orders of magnitude compared to the CARL technique, which uses only the samples x ∼ p(x|θ). The particularly simple local techniques SALLY and SALLINO need even less data for a good performance. However, their performance eventually plateaus and does not asymptote to zero error. This is because the local model approximation breaks down further away from the reference point θ_ref = (0, 0)^T, and the score is no longer the sufficient statistic. In the supplementary material we show further results and demonstrate how the improved likelihood ratio estimation leads to better inference results.

Conclusions
In this work we have presented a family of new inference techniques for the setting in which the likelihood is only implicitly defined through a stochastic generative simulator. The new methods estimate likelihoods or ratios of likelihoods with data available from the simulator. Most established inference methods, such as ABC and techniques based on neural density estimation, only use samples of observables from the simulator. We pointed out that in many cases the joint likelihood ratio and the joint score, quantities conditioned on the latent variables that characterize the trajectory through the data generation process, can be extracted from the simulator. While these additional quantities often require work to be extracted, they also prove to be very valuable, as they can dramatically improve sample efficiency and quality of inference. Indeed, we have shown that this additional information lets us define loss functionals that are minimized by the likelihood or likelihood ratio, which can in turn be used to efficiently guide the training of neural networks to precisely estimate the likelihood function. A second class of new techniques is motivated by a local approximation of the likelihood function around a reference parameter value, where the score vector is the sufficient statistic. In the case where the simulator provides the joint score, we can use it to train a precise estimator of the score and use it as locally optimal summary statistics.
We have demonstrated in three experiments that the new inference techniques let us precisely estimate likelihood ratios. In turn, this enables parameter measurements with a higher precision and less training data than with established methods. This approach is complementary to many recent advances in likelihood-free inference: the augmented data can be used to improve training for any neural density estimator or likelihood ratio estimator, as we have demonstrated for masked autoregressive flows and CARL. It can also be used to define locally optimal summary statistics that can be used, for instance, in ABC techniques.
Finally, these results motivate the development of tools that provide a nonstandard interpretation of the simulator code and automatically generate the joint score and joint ratio, building on recent developments in probabilistic programming and automatic differentiation [35, 36, 57-60]. We have provided a first proof-of-principle implementation of such a tool based on PYRO [43].

Figure 1 :
Figure 1: A toy simulation generalizing the Galton board, where the transitions are biased left (blue) or right (red) depending on the nail position and the value of θ. Two example latent trajectories z are shown (blue and green), leading to the same observed value of x. Below, the distribution for θ_0 = −0.8 and θ_1 = −0.6. An example empirical distribution from 100 runs at θ_0 shows that the sample variance is much larger than the difference between θ_0 and θ_1.

Figure 2 :
Figure 2: Left: Galton board example. MSE on log r vs. training sample size. We show the mean and its error based on 15 runs. Middle: Lotka-Volterra example. MSE on log p (top) and log r (bottom) vs. training sample size. We show the median and the standard deviation of 10 runs. Right: Particle physics example. MSE on log r vs. training sample size.