Black-box Bayesian inference for economic agent-based models

Simulation models, in particular agent-based models, are gaining popularity in economics. The considerable flexibility they offer, as well as their capacity to reproduce a variety of empirically observed behaviours of complex systems, give them broad appeal, and the increasing availability of cheap computing power has made their use feasible. Yet a widespread adoption in real-world modelling and decision-making scenarios has been hindered by the difficulty of performing parameter estimation for such models. In general, simulation models lack a tractable likelihood function, which precludes a straightforward application of standard statistical inference techniques. Several recent works have sought to address this problem through the application of likelihood-free inference techniques, in which parameter estimates are determined by performing some form of comparison between the observed data and simulation output. However, these approaches are (a) founded on restrictive assumptions, and/or (b) typically require many hundreds of thousands of simulations. These qualities make them unsuitable for large-scale simulations in economics and can cast doubt on the validity of these inference methods in such scenarios. In this paper, we investigate the efficacy of two classes of black-box approximate Bayesian inference methods that have recently drawn significant attention within the probabilistic machine learning community: neural posterior estimation and neural density ratio estimation. We present benchmarking experiments in which we demonstrate that neural network based black-box methods provide state of the art parameter inference for economic simulation models, and crucially are compatible with generic multivariate time-series data. In addition, we suggest appropriate assessment criteria for future benchmarking of approximate Bayesian inference procedures for economic simulation models.


. Introduction
Simulation models are increasingly employed across the sciences.Their primary advantage is the flexibility they afford modellers: models may be specified mechanistically and at the microscopic level without concern for analytic tractability of the full system behaviour.This flexibility is embraced in the agent-based modelling paradigm wherein individual agents can be described by realistic and heterogeneous decision and interaction rules.Examining the macroscopic properties of the system is then done simply through simulation, freeing the modeller from the burden of reasoning about the global behaviour of the system themselves and facilitating the study of radically more complex systems and emergent phenomena than is otherwise possible.
In general, agent-based models (abms) take as input some parameter vector θ, and return a (possibly multivariate) time-series x.Typically, abms are stochastic simulators, with the property that running the simulator repeatedly with a fixed parameter θ will produce different outputs with each run.The probability (density) with which the simulator produces a particular output x, conditional on an input parameter vector θ, is given by its likelihood function, written p(x | θ), which in some sense encapsulates the statistical properties of the abm.We denote simulation of a dataset x from the abm as a draw from its probability density function, x ∼ p(x | θ).
An essential prerequisite for the successful application of abms in real-world decision making scenarios is the ability to perform parameter estimation, that is, to tune θ such that the simulated data are good matches to observed data, which we denote with y.Traditional statistical workhorses like maximum likelihood estimation or Bayesian inference rely on pointwise evaluations of the likelihood function.In practice, p(x | θ) cannot be obtained or evaluated in a reasonable time for arbitrary simulation models, and is thus only implicitly defined by the software specifying the behaviour of the simulator.This presents a barrier to the use of abms in real-world use.
To overcome this, a number of approaches to performing statistical inference have been developed in which exact density evaluations are replaced by evaluations of approximate densities or cost functions that are constructed using simulations from the model.Some of these simulationbased inference (sbi) methods have been explored within the abm community.Perhaps the most prevalent of them is simulated minimum distance (smd), in which an estimate θ is obtained by minimising some loss function f (y, θ) between the observed data y and simulated data x ∼ p (• | θ) over some search space Θ: θ = arg min θ∈Θ f (y, θ) . (1) This general class of estimators includes the maximum likelihood estimator -corresponding to choosing, for example, f (y, θ) = − log p (y | θ) or, where the likelihood is intractable, some estimate thereof (see e.g.Diggle and Gratton, 1984;Kukacka and Barunik, 2017).It also includes the Method of Simulated Moments (msm) (Franke, 2009) -in which f takes the form f (y, θ) = (g(y) − ĝθ ) W (g(y) − ĝθ ) , where g(y) denotes a set of moments derived from y, ĝθ denotes the same set of moments derived from R ≥ 1 simulations at θ, W is a suitably chosen weight matrix, and denotes the transpose.As a final example, it includes Indirect Inference (ii) (Gourieroux et al., 1993), which follows a similar approach to msm but replaces the moments with estimated parameters of a tractable auxiliary model.Some further loss functions have been proposed more recently in the context of abm estimation (see Platt, 2020, for a recent review).
A major drawback of smd and optimisation-based approaches more generally is that only parameter point estimates are produced, while in many contexts it is desirable to obtain meaningful uncertainty quantification regarding appropriate values for θ.Bayesian inference is an alternative inferential paradigm which naturally provides this meaningful notion of uncertainty, and has seen some recent but otherwise limited attention within the abm community (see e.g.Grazzini et al., 2017;Lux, 2018;Platt, 2020Platt, , 2021;;Lux, 2021).Here, the primary object of interest is the parameter posterior distribution p(θ | y), obtained via Bayes theorem: In Equation (3), p (θ) is referred to as the prior distribution over parameters θ ∈ Θ, which encodes the experimenter's initial beliefs regarding appropriate values for θ before data is observed, while the simulation model itself appears via its associated data likelihood function p (y | θ).Importantly, Bayesian inference procedures seek to accurately approximate the entire distribution p (θ | y), rather than any single point estimate derived from this distribution.In this way, accurate and meaningful uncertainty estimates may also be obtained which correctly account for the experimenter's initial beliefs, as well as the evidence provided by the data y.
A well-known investigation into Bayesian estimation methods for abms is Grazzini et al. (2017), which explores various means of constructing a surrogate likelihood p (y | θ) to account for the intractability of the true likelihood p (y | θ).Two immediate drawbacks of the methods Grazzini et al. (2017) discuss are that they (a) are difficult to reconcile with the fact that abms are frequently dynamical models generating time-series output, and (b) entail a huge computational burden, since each likelihood evaluation in the employed posterior sampling algorithm requires at least one simulation from the model, and it is typically necessary to make many hundreds of thousands of such evaluations.Recent work by Platt (2021) suggests estimating the model's transition density via a mixture density network.While the approach offers some improvements through the use of a more flexible density estimator and the incorporation of some temporal dependencies, it suffers from similar drawbacks, assuming time-homogeneity and requiring a computationally expensive estimation step for every single sample from the approximate posterior distribution.In summary, there is a need for parameter inference methods that fulfil two main criteria: 1. they must be simulation-efficient in order to remain applicable to large-scale simulation models such as abms; 2. and they must be able to deal with non-homogenous/non-stationary temporal data, both simulated and observed.
To this end, we seek with this paper to analyse the utility of two classes of parameter inference methods that have seen significant activity within the computational statistics and probabilistic machine learning literature in recent years: neural posterior estimation (Papamakarios and Murray, 2016;Lueckmann et al., 2017;Greenberg et al., 2019), and neural density ratio estimation (Thomas et al., 2021;Hermans et al., 2020;Durkan et al., 2020).Such methods have been employed successfully in a variety of applied domains, including high-energy physics (Brehmer et al., 2018), cosmology (Alsing et al., 2019), and neuroscience (Gonçalves et al., 2020).In addition, as we will demonstrate below, they work flexibly with potentially multivariate time-series data, requiring minimal model assumptions, and typically generate more accurate parameter inferences with a significantly reduced simulation budget in comparison to common alternatives, such as those reported in Grazzini et al. (2017).
We further seek to establish a higher standard of assessment criteria to be used in future benchmarking experiments for approximate Bayesian parameter inference for economic simulation models through the use of common integral probability metrics and inference validation procedures used widely in the statistics and machine learning communities.In Bayesian inference, the object of interest is the posterior density, whose entire shape is important since this captures degrees of beliefs regarding the unknown θ.However, the literature on Bayesian estimation of abms has largely forgone attempts to examine the degree of overlap between the estimated and ground-truth posterior densities or the extent to which the inference procedure has learnt the dynamics of the data, often in favour of a simple comparison of certain point estimates of θ (Grazzini et al., 2017;Platt, 2021;Shiono, 2021).Since the objective of Bayesian inference is to derive a posterior probability density function which naturally quantifies uncertainty, it is important to adopt assessment criteria that quantify the degree of closeness between the estimated and target posterior densities in benchmarking experiments.
In summary, our contributions with this article are to provide: 1. an overview of recent advances in black-box simulation-based Bayesian inference for simulation models, and a motivation for their use in the specific case of agent-based models; 2. a systematic benchmarking of these state-of-the-art methods against popular alternatives within the literature on parameter inference for agent-based economic simulation models; 3. a discussion and demonstration of approaches to validating approximate Bayesian inference algorithms, and their feasibility for agent-based models under different simulation-based inference schemes.

. Review of Bayesian inference methods for agent-based models
Bayesian inference has seen relatively little attention from the economic abm community as a means to parameter estimation.In this section, we summarise the key literature in this space.Importantly, we emphasise throughout that a common feature of these approaches is that the task of inference -that is, of constructing the posterior distribution -is inherently linked to the act of simulating from the abm and building an additional approximate generative model (sometimes also called an emulator ), in the sense that an approximation to the likelihood function is constructed.We will see in later chapters that this contrasts with a new generation of estimation methods in which inference is decoupled from the act of simulating and can be performed in a discriminative manner, typically resulting in more efficient inferences.

. . Approximate Bayesian computation
Bayesian estimation of economic abms has gained popularity with the work of Grazzini et al. (2017), in which the authors discuss three approaches to approximate Bayesian parameter inference.In this section, we briefly outline the three approaches, and show that each method can seen as explicitly or implicitly approximating the unknown likelihood function associated with the abm by where K is a (possibly unnormalised) kernel function, with parameter(s) , providing a measure of "distance" between observed data y and simulated data x.This permits an approximation p (θ | y) of the posterior density as Such approaches -targeting (5) -are comprehensively subsumed under "approximate Bayesian computation (abc)" which, as noted by Grazzini et al. (2017), is a technique that has received significant attention within the computational statistics community over the last two decades as a means to constructing approximate posterior densities for models with intractable likelihood functions (see e.g.Tavaré et al., 1997;Pritchard et al., 1999;Beaumont et al., 2002;Sunnåker et al., 2013;Schmon et al., 2020).The approach has been successfully applied within a variety of fields, most prominently in ecology, epidemiology, and systems biology (Toni et al., 2009;Liepe et al., 2014;Beaumont, 2010;Ju et al., 2021).While the approaches discussed by Grazzini et al. (2017) can be cast in this general form, a judicious choice of a particular kernel K based on reasonable assumptions is crucial.In particular, different kernel choices are called for in different scenarios.Some possibilities are outlined below.

. . . Parametric density estimation
The simplest approach discussed by Grazzini et al. (2017) involves assuming that the abm has entered a statistical equilibrium, such that the observations y t ∈ R d in the time-series y := (y 1 , y 2 , . . ., y T ) ∈ R d×T are fluctuations around some stationary value m * : where (ε t ) are iid noise terms with density g and parameters .Although more complex distributions are available, g may for example be a zero-mean Gaussian distribution, and the elements of the covariance matrix.Under these assumptions, recalling that the elements of the time-series are assumed to be independent, the parameters m * and can be estimated by the means and covariances of the elements of the time-series to give m * and ˆ .More precisely, first fixing θ and drawing a sample x ∼ p(x | θ), one can calculate the estimates m * (x) and ˆ (x) using, for example, maximum likelihood estimation.Adopting as the measure of distance the probability of the true data being observed under this approximated process yields the choice of kernel Finally by taking R ≥ 1 iid simulated datasets1 x (r) ∼ p (x | θ), r = 1, . . ., R, and calculating their associated estimators m * x (r) and ˆ x (r) , the likelihood at θ, i.e.Equation ( 4), is approximated with the Monte Carlo average where we write pθ instead of p(• | θ).Alternatively, the R simulations may be pooled to generate single estimates m * x (1) , . . ., x (R) and ˆ x (1) , . . ., x (R) at θ.This is closely related to synthetic likelihood approaches (Wood, 2010;Price et al., 2018) which use similar Gaussian distributions, but usually rely on summary statistics.In either case, the resulting approximate likelihood pθ from Equation (8) can then be used downstream for parameter estimation, either directly via e.g.maximum likelihood or through the corresponding approximate posterior, simulated e.g. through Markov chain Monte Carlo (mcmc).This approach suffers from clear limitations, however.Firstly, it treats the data points in the observed time-series as independent -that is, as lacking a natural ordering -which destroys important information when the observed y and simulated x are time-series.Secondly, choosing an appropriate and sufficiently flexible family of parametric densities g to construct the likelihood approximation is non-trivial, with poor choices leading to erroneous Bayesian inference.

. . . Non-parametric density estimation
An alternative approach, which partially addresses the second limitation described in Section 2.1.1,is to forgo the assumption of a parametric family of densities and instead use a nonparametric method for density estimation.Grazzini et al. (2017) describe the use of kernel density estimation (kde) for this purpose.Here, the data points in the time-series are once again assumed to be independent and fluctuating about some stationary value m * as in the parametric approach described above.Then, an estimate of the likelihood function is obtained by applying providing an unbiased estimate of the approximated likelihood function in Equation (4) as where p y t | θ, x (r) is the estimate of the conditional density p y t | θ, x (r) obtained via kde: for some probability kernel κ with bandwidth parameter(s) .In particular, the authors employ a Gaussian kernel with bandwidth chosen using Silverman's method (Silverman, 1986), such that Alternatively, as in Section 2.1.1,the R simulations may be pooled and a single kde model fit to the combined dataset to obtain p(y | θ).While these non-parametric approaches to density estimation are arguably more flexible than the assumption of a parametric family, they are well known to suffer from the curse of dimensionality, limiting their applicability to low dimensional data only, even under the unrealistic assumption that the y t are assumed independent.

. . . Classical approximate Bayesian computation
The authors discuss a third common approach to simulation-based Bayesian inference for intractable simulation models, namely classical abc.A typical abc algorithm proceeds by proposing a parameter value θ from some proposal distribution, for example the prior density, and determining whether to accept or reject θ on the basis of some meaningful notion of distance d(x, y) between a simulation x ∼ p(x | θ) and the observed data y.A common choice for the distance d is the Euclidean distance between suitable summary statistics of the data; that is, d(x, y) := s(x) − s(y) where s is a vector of, for example, the mean, standard deviation, and lag-1 autocorrelation of the data.While many versions of abc exist, the simplest -Rejection abc -involves repeatedly proposing candidate parameter values θ ∼ p(θ) and rejecting in the event that d (x, y) > , where x ∼ p(x | θ).Rejection abc can therefore be cast in the form of Equation ( 5) by taking providing an estimate of the (unnormalised) approximate likelihood function in Equation (4) as r) , y ≤ . (13) As discussed by Grazzini et al. (2017), abc enjoys a number of desirable properties, in particular (under mild regularity conditions) consistency with the true posterior distribution as → 0, provided no information loss is incurred by the choice of distance d.However, a major challenge to its use in practice is designing such a distance function: the success or failure of the approach often hinges on the engineering of summary statistics that are close to sufficient2 .A number of works have investigated strategies for summary statistic selection in abc (e.g.Fearnhead and Prangle, 2012;Blum et al., 2013;Wiqvist et al., 2019;Chen et al., 2020) and there has been some success in alternative, summary-free distances (see e.g. Park et al., 2016;Bernton et al., 2019;Dyer et al., 2021a).It is however in general not possible to derive a low-dimensional sufficient statistic for arbitrary simulation models, and a poor choice of summary statistics can lead to an unacceptable loss of information from the original data, resulting in a posterior approximation of little value.

. . . Unifying the approaches
In Algorithm 1, we show how the approximate posterior (5), with any of the three possibilities for K discussed by Grazzini et al. (2017), may be targeted using a variant of the popular Metropolis-Hastings (mh) algorithm (e.g.Hastings, 1970).If the likelihood estimate pθ (y) is an unbiased, non-negative estimate of some desired target p θ (y), then Algorithm 1 is an example Algorithm 1: Metropolis-Hastings sampling scheme for abc Input: Prior distribution p(•), observation y, proposal distribution q(• | θ), initial value θ 0 , number of iterations n; Evaluate pθ (y) using {x (r) } R r=1 according to either Equation ( 8), ( 9), or (13) as desired; (Andrieu and Roberts, 2009).Note that while the sampling procedure in Algorithm 1 exactly targets the posterior distributions described in Grazzini et al. (2017), many alternative posterior sampling algorithms exist (see e.g.Beaumont et al., 2009).
We once again emphasise that a major disadvantage to applying the methods described by Grazzini et al. (2017) to abms is that estimating the posterior involves simulating R ≥ 1 times from the abm at each proposed parameter value (θ i ) i=1,...,n .Since abms are typically expensive to simulate, and n is required to be many hundreds of thousands in order to accurately estimate the targeted posterior, such approaches can rapidly lead to a prohibitively large number of required simulations.

. . Latent variable models
A class of models closely related to approximate Bayesian computation is the class of so-called latent variable models.Here, the real data, y, is assumed to be a noisy observation of an unobserved (latent) process x.Thus, in contrast to previous approaches, some discrepancy between y and x is to be expected.
Under the most basic scenario, the data obtained from the simulator are identically The observed data is then assumed to have an error distribution, g ε , say, such that the contribution made by one observation y t to the likelihood function is thus We draw the attention of the read to the conceptual similarity to (4); however, ( 14) is no longer an approximation but is the exact likelihood under the assumption of measurement error.(Similarly, abc can be seen as exact if K ε describes the observational error, see e.g.Wilkinson 2013.)A simple unbiased estimator of ( 14) is for some R ≥ 13 , which can then also be used in Algorithm 1, for example, to target the associated approximate posterior.The independence assumption underlying the simulator data is, however, not realistic for abms and other time-series models, and incorporating the temporal aspect of data requires further structural assumptions, such as those of hidden Markov models.

. . . Hidden Markov Models and particle filters
Particle filters, an instance of a broader class of sequential Monte Carlo (smc) methods (see Ju et al., 2021, for a recent overview of smc and their application to epidemiological abms) relax the independence assumption imposed on the output of the simulation model and have previously been employed for Bayesian parameter inference for economic abms (Lux, 2018(Lux, , 2021)).However, such methods also involve making assumptions about the structure of the model -specifically, the assumption of a state space structure with a particular error distribution.Furthermore, previous works have noted the significant computational burden required for smc methods (Malleson et al., 2020;Lux, 2021), which can arise due to the fact that a number of particles that grows exponentially with the model dimensions is often required to avoid failure modes such as particle collapse.

. . Neural methods
As an alternative to kernel density estimation, neural density estimators have previously been employed to perform Bayesian estimation of abms.Platt (2021) presents a method which diverges from the approaches of Grazzini et al. (2017) in two main regards.Firstly, in contrast to the previous independence assumption, it assumes an autoregressive structure for the time-series model to better capture temporal correlations in the data.In particular, the approach assumes a time-series model that is Markov of order L, that is, p( where x 1:t = (x 1 , . . ., x t ).In words, the distribution of the state at time t depends only on the previous L states (and θ).In light of this assumption, the full likelihood can be factorised as Secondly, as a consequence of the autoregressive model structure, the method employs a conditional density estimator q φ to approximate the transition density function p(x t | x t−L:t−1 , θ).
The resulting likelihood approximation can be written where the contribution of the first L terms is ignored.A mixture density network (Bishop, 1994) assumes the role of the conditional density estimator q φ(θ) with parameters φ(θ) trained for each θ.Denoting the L states preceding y t as y <t for brevity, the particular form of the mixture density network is where α k are the mixing coefficients, and µ k , Σ k are the mean and covariance matrix respectively, all of which are parameterised by neural networks and trained via maximum likelihood on R iid model samples x (r) ∼ p(x | θ), r = 1, . . ., R. The resulting likelihood estimate can then once again be used also with Algorithm 1, for example, to target the associated approximate posterior distribution.While the model structure described above accounts for the sequential nature of simulations generated by abms to some extent, the computational burden associated with the simulation of sufficient training data and the training of a new conditional density estimator at each θ renders it largely infeasible (Shiono, 2021).In other work, Shiono (2021) examines a particular neural approach, with an invertible architecture and the ability to accommodate time-series data, for the specific purpose of estimating the posterior density for a New-Keyensian abm.In contrast to both Platt (2021) and Shiono (2021), we take in this work a broad perspective on the subject of neural simulation-based inference for abms, by thoroughly contextualising the problem of Bayesian parameter estimation for economic abms and by incorporating both benchmarking tasks and an exploration of suitable assessment criteria.

. A new generation of simulation-based inference methods
In this section, we provide an overview of the two main simulation-based inference (sbi) methods we make use of in this paper: neural posterior estimation (npe) and neural ratio estimation (nre), both of which -as their names suggest -employ neural networks4 to obtain an estimate of the posterior density p(θ | y).We motivate the use of these methods for agent-based models (abms) by framing them as discriminative approaches to simulation-based parameter inference, and contrasting them with the generative approaches described in Section 2 which seek to approximate the model's distribution (likelihood) using some form of probabilistic model.We then present the core elements of these methods, and further provide a discussion on how they may be leveraged flexibly and automatically to accommodate inference tasks involving high-dimensional and potentially multivariate time-series data, as is often the case in economic applications.

. . Motivating black-box discriminative approaches to parameter inference
The methods we endorse in this articlenpe and nre -can be motivated by the fact that they are: 1. simulation-efficient alternatives to more traditional approaches to sbi, such as the approximate Bayesian computation (abc) approaches described in Section 2; 2. generic sbi methods that treat the simulator as a black-box, thus making minimal assumptions about the structure and output of the simulation model; 3. discriminative approaches to sbi, in the sense that inference does not require a probabilistic model for the data x: instead of explaining the mechanisms that create the data, we only need to distinguish between realisations that arise from different parameter values.

Simulation efficiency
The algorithms described in Section 2 share a common pattern.For a fixed parameter θ, iid simulations x (r) ∼ p(x | θ), r = 1, . . ., R, are sampled to produce a proxy likelihood p(x | θ).Then, to generate n approximate posterior samples via Markov chain Monte Carlo (mcmc), this procedure is performed at n different values for θ, where n must typically be a large number -often a few hundred thousand, if not orders or magnitude larger -to ensure a low Monte Carlo error.Consequently, at least nR simulations from the abm are required in total for inference under these algorithms.Since abms can be very expensive to simulate, and the act of simulating from the abm remains the primary bottleneck in Bayesian estimation procedures for abms, this simulation demand can quickly become infeasible.
In contrast, the methods for which we advocate here differ by eliminating the need to simulate when sampling from the posterior.Instead, they decouple the act of simulating from the task of constructing the posterior by employing powerful function approximators to learn -in essence -global posterior density estimators on the basis of a limited number of simulations from across the parameter space of the abm; that is, they learn functions h : Y × Θ → R which approximate the posterior density p(θ | y) across the space of all possible values for θ and y.This has the potential to significantly reduce the simulation burden associated with approximate Bayesian inference procedures, because the pointwise estimates of p(θ | y) can borrow strength from, and share information between, one another; in contrast, the algorithms described in Section 2 consider each pointwise evaluation of p(θ | y) as standalone density estimation tasks, such that no information can be shared between them.In this way, a large number of approximate posterior samples can be generated from an algorithm trained on what can in practice be a far smaller number of model samples than is required for the algorithms described in Section 2.
Finally, the approaches we endorse here are equipped with a further efficiency benefit: amortisation.This phrase captures the fact that the global density estimators can be used to generate samples from an approximate posterior p(θ | y) for any data y without the need to further simulate from the abm, which results from the fact that they are trained on the full space of possible values for θ and y.In contrast, applying the methods described in Section 2 to a new dataset y would entail another nR simulations from the abm, multiplying the already high simulation costs.

Black-box inference methods
We saw in Sections 2.1.1 and 2.1.2that many inference approaches come with restrictive assumptions, for instance, stationarity of the simulated and observed timeseries.In general, it is useful to dispense with these assumptions, since they can be difficult to verify and limit the applicability of these methods for arbitrary simulators.Instead, it can be preferable to employ black-box methods that make minimal assumptions about the model and are therefore generically applicable to arbitrary simulators.Doing so enables the modeller to concentrate resources on model design and implementation, rather than on developing bespoke inference algorithms for each new simulator.Furthermore, assumptions such as stationarity are known to be particularly poorly suited to certain economic simulation models such as abms, since these models are known and are even designed to produce non-equilibrium dynamics.Employing inference procedures that are able to handle such dynamics is therefore essential to the task of estimating generic abms.
Discriminative approaches to parameter inference Discriminative tasks in machine learning are typically simpler than generative tasks, since generative tasks address larger problems than pure discrimination.This is intuitive: for example, it is typically easier for humans and computers alike to distinguish between images of cats and dogs than to generate them.Analogously, it is generally a simpler task to discriminate between complex time-series data than it is to generate such time-series.The approaches described in Section 2 adopt a generative approach: they each -either explicitly or implicitly -seek to derive an approximation to the simulator's likelihood function using a probabilistic model, and thus seek to model the (probability density function of the) simulation output itself.Formally, this may be understood as learning the (stochastic) map θ → x.Npe and nre, in contrast, do not seek to model the simulation output: instead, they map from instances of the simulation output to certain target values which we will describe in more detail below.This can be formally thought of as learning the map x → θ.Such an approach to parameter estimation therefore embodies a fundamental departure from the approaches described in Section 2. It has the potential to be particularly beneficial for abms, which are known to be able to produce complex, non-equilibrium dynamics that are especially difficult to model and generate.

. . Neural posterior estimation
In this section, we provide an introduction to neural conditional density estimators and their use in posterior estimation tasks.The core idea underlying this class of simulation-based Bayesian inference techniques is to use such conditional density estimators to model the parameter posterior density directly.While various conditional density estimators can be used for this purpose, e.g.mixture density networks (Bishop, 1994;Papamakarios and Murray, 2016), we focus here on the case of normalising flows (Tabak and Vanden-Eijnden, 2010;Tabak and Turner, 2013;Rezende and Mohamed, 2015) due to their widespread use in various density estimation tasks, including in the areas of image generation (e.g.Kingma and Dhariwal, 2018) and physics (e.g.Noé et al., 2019).

. . . Normalising flows
Normalising flows involve transforming a simple base distribution into a more complicated one, often with the use of neural networks.Consider a random variable U ∼ p U , where p U is a probability distribution chosen to be a "simple" base distribution, in the sense that it is easy to both generate samples from p U and to evaluate p U (u) for any u.Now consider the transformed random variable X = g(U), where g is a differentiable, invertible function with differentiable inverse f := g −1 .Then, by the change of variables formula, we have where J f is the Jacobian of f .Sampling from p X then simply involves sampling a value u from the base distribution p U and immediately obtaining x = g(u).Evaluating p X (x) also then simply involves evaluation of the right-hand side of Equation ( 17).The above may be extended easily to a composition, or flow, and invertible functions g i with inverses f i , for which we now have that Given samples from p X , the question of how to choose a base distribution p U and a series of functions g 1 , . . ., g n such that the distribution of the samples is modelled accurately arises.The idea behind normalising flows is to learn this sequence of transformations, that is, to have each f i be a flexible transformation f φ i with trainable parameters φ i .The resulting density estimator q φ can be written The simple base distribution is then often taken to be a standard normal distribution5 of the same dimension as the target distribution, and the flexible transformations f φ i are typically neural networks with a structure designed to guarantee the required invertibility and differentiability.
The trainable parameters φ = {φ i : 1 ≤ i ≤ n} are optimised by maximising the log-likelihood of the data x (r) iid ∼ p X , r = 1, . . ., R: Normalizing flows are typically implemented in machine learning libraries that support automatic differentiation such as pytorch (Paszke et al., 2019) or TensorFlow Abadi et al. ( 2016) allowing the effective use of gradient-based optimisation techniques, such as Adam (Kingma and Ba, 2014).We provide a schematic of the sampling (flow, right-pointing arrows) and density evaluation (normalising flow, left-pointing arrows) processes in Figure 1, in which the simple, left-most distribution is morphed over successive steps into the more complex, right-most distribution.

. . . Normalising flows for neural posterior estimation
Normalising flows may also be extended to the case of conditional density estimation (see e.g.Papamakarios and Murray, 2016;Lueckmann et al., 2017;Papamakarios et al., 2019;Greenberg et al., 2019).In this way, we can use a normalising flow to estimate the posterior density p(θ | x) associated with a simulation model by following the same procedure as above and finding the neural network parameters as where (x (r) , θ (r) ) iid ∼ p(x, θ), r = 1, . . ., R. Importantly, this framework allows us to learn a single conditional density estimator for the posterior density function across data x, rather than having to undergo the expensive procedure of training a separate density estimator for each point evaluation of the posterior density as in e.g.Platt (2021).

. . Neural density ratio estimation
In this section, we describe an alternative common approach to sbi termed neural ratio estimation (nre).The core idea underlying this method contrasts with npe in that while npe seeks to model the posterior density directly, nre instead seeks to estimate the following ratio: This ratio is frequently referred to as the likelihood-to-evidence ratio.In the event that Bayesian inference is the goal, this then permits evaluation of the posterior density as In the following, we discuss how this ratio may be efficiently estimated with well-calibrated probabilistic classifiers trained on simulated data-parameter pairs. .

. . Likelihood-to-evidence ratio estimation via probabilistic binary classification
We outline the approach to approximate Bayesian inference via density ratio estimation as described by Hermans et al. (2020).Here, a probabilistic binary classifier is trained to distinguish between two sets of training examples: The function of a probabilistic binary classifier trained on such data is to model the class probability d(x, θ) := p(c = 1 | x, θ) ∈ [0, 1]; hard classification labels (i.e.decisions regarding the predicted value of c) are obtained when the continuous-valued d(x, θ) is combined with a decision rule, for example the rule that c = 1 should be predicted whenever d(x, θ) > 0.5.One can show (see Appendix B of Hermans et al., 2020) that the optimal estimate of d(x, θ) is the value Thus, a good probabilistic classifier trained to distinguish between the two possible types of pairs (x, θ) will learn a good estimate d(x, θ) of this ratio.Such a probabilistic classifier will allow us to evaluate the posterior density in this way: by noticing that one can rearrange Equation ( 24) as where r * (x, θ) is the corresponding estimate of the likelihood-to-evidence ratio p(x | θ)/p(x).
In practice, of course, only an approximation d(x, θ) will be obtained to the ratio in Equation ( 24), yielding a correspondingly imperfect estimate r(x, θ) of the likelihood-to-evidence ratio.Nonetheless, it is known that training expressive, high-capacity neural network classifiers via the cross-entropy loss can yield classifiers with good probability estimates, and thus good estimates of r * (x, θ).

. . . A generalisation to multi-class classification
In more recent work, Durkan et al. (2020) demonstrate that the above approach to density ratio estimation can be generalised to the problem of training a probabilistic classifier to identify the correct (x, θ (i) ) pair from a batch {(x, θ (b) )} B b=1 of B pairs that otherwise contain only "incorrect" pairs, where "correct" and "incorrect", respectively, correspond to having been drawn from the joint distribution p(x, θ) and from the product of the marginal distributions p(x)p(θ).In this case, the optimal estimate of the correct class probability is Thus, by comparison with Equation ( 27), training a neural network f φ on the loss function for each (x, θ) ∼ p(x | θ)p(θ) will induce f φ (x, θ) to learn the value f φ (x, θ) = log(p(θ | x)/p(θ)), thus recovering an estimate of the desired density ratio.This can be extended to a batch of R "correct" data-parameter pairs as where the terms in the denominator are labelled with b r to account for the possibility that the contrasting ("incorrect") set of parameters may be different for different "correct" pairs (x (r) , θ (r) ).The authors further demonstrate that this can yield more accurate density ratio estimators.For this reason, we adopt this approach throughout this article.

. . Sampling from the neural posterior
After successful training, the posterior density estimator q φ(θ | x) obtained from npe is a parametric approximation of the true posterior distribution, which can then be used to generate iid samples θ ∼ q φ(θ | x) or to evaluate the posterior density on a pointwise basis in the ways described in Section 3.2.1.While ratio estimation similarly permits a pointwise evaluation of the posterior distribution as p(θ) exp (f φ(x, θ)), it requires a further inference step using, for example, Metropolis-Hastings to generate samples from the same posterior.However, the upfront training of the ratio estimator eliminates the need for expensive sampling from the abm, reducing the run-time significantly in comparison to alternative approaches such as those described in Section 2.

. . Round-based training
The procedures described in Sections 3.2 and 3.3 are framed as single density (ratio) estimation tasks.This means that a single dataset of R data-parameter pairs (x (r) , θ (r) ), r = 1, . . ., R is created in the following way: θ (r) ∼ p(θ), sample from the prior; x (r) ∼ p(x | θ (r) ), sample from the model using the draws from the prior.
Subsequently, nre and npe algorithms are trained on the whole dataset to find either the likelihood-to-evidence ratio or the posterior directly following, respectively, ( 21) or (29).Density estimators trained in this way are referred to as amortised density estimators, which reflects the fact that they may be used to construct posterior distributions for any data x without further simulation from the abm or the need to train multiple density estimators.
The disadvantage of this one-stage approach is that all training simulations from the abm are generated by parameters drawn from the prior distribution.In many cases, however, interest lies only in the posterior for the particular observation y, which can be concentrated on specific subregions of the entire space covered by the prior density.It can then be preferable to narrow down the search space of good candidate values for the parameter θ to subregions of the parameter space that could most plausibly have generated y.By doing so, the density (ratio) estimator will be presented with a less varied range of dynamics between which it must learn to distinguish, allowing it to develop a more refined approximation of the density (ratio) in regions of high posterior density.This can facilitate more rapid learning and potentially reduce the number of training examples that must be simulated by the abm.
Significant effort has thus been extended towards the constructions of "round"-based approaches to training density (ratio) estimators for sbi (see e.g.Papamakarios et al., 2019;Greenberg et al., 2019;Hermans et al., 2020;Durkan et al., 2020).The idea is to split the total budget of R simulations into subsets of size N 1 , N 2 , . . .such that i N i = R.In the first step, N 1 data points are created as before and a posterior approximation q φ θ | x is constructed.Subsequently, in round i ≥ 2, we create new data (x (r) , θ (r) ), r = 1, . . ., N i using sample from round-(i − 1) posterior; x (r) ∼ p(x | θ (r) ), sample from the model using the draws from the round-(i − 1) posterior.
The posterior q φ θ | x may now be retrained using (a combination of the old and) the new samples, and the process can be repeated for as many rounds as necessary.An identical process can be followed for density ratio estimation, with the exception that parameters in round i are drawn using the likelihood-to-evidence ratio obtained from round i − 1 and, for example, Metropolis-Hastings.
While the round-based approach is appealing, it does not come without additional challenges.In particular, the joint distribution of the example pairs (x (r) , θ (r) ) is no longer p(x, θ) for data sampled after the first round.In practice, this needs to be accounted for during training by the use of additional weighing factors, such as those appearing in Equation (30) (cf.Equation ( 21)).We refer the interested reader to the following papers for further details on this matter: Papamakarios and Murray (2016);Lueckmann et al. (2017);Greenberg et al. (2019).
Training schemes for npe and nre which make use of this round-based training approach are referred to as sequential neural posterior estimation (snpe) and sequential neural ratio estimation (snre), respectively, in which the prefix "sequential" reflects the round-based training design of the now non-amortised density (ratio) estimator.

.6. Summarising the simulation output
As described above, npe and nre take as input either x alone or (x, θ) pairs.Given that x is a (possibly multivariate) time-series for many abms and is thus a high-dimensional object, it can be beneficial to incorporate useful inductive biases6 into the network architecture that account for the sequential nature of x and reduce its dimensionality.To this end, the density (ratio) estimator can be prefixed with a so-called embedding network with trainable parameters ϕ, whose function is to consume the original high-dimensional dataset x and express this as a lower-dimensional summary statistic vector s ϕ (x).The parameters of the embedding network and of the density (ratio) estimator may then be learned concurrently using the same loss function, offering a means to automatically learn descriptive low-dimensional features of the raw input data during the same posterior or density ratio estimation task in an end-to-end fashion.
This quality is a further attractive feature of npe and nre.Finding low-dimensional summary statistics of the data is necessary in many approaches to sbi 7 , and the question of how to summarise data naturally arises in response to this.Popular solutions that have been explored to-date include: the experimenter themselves devising a collection of hand-crafted summary statistics that they believe adequately describes the data, which is an arduous task that can lead to a difficult-to-quantify loss of information; or learning summary statistics by constructing a separate learning task to the existing problem of density (ratio) estimation (see e.g.Fearnhead Algorithm 2: Training snpe (see Greenberg et al., 2019, Algorithm 1).
Input: prior distribution p(θ), simulator p(x | θ), observation y, conditional density estimator q φ , number of rounds M , number of simulations per round N ; Result: Trained conditional density estimator Set p0 (θ) = p(θ), dataset D = {}; for m = 0, . . ., M − 1 do n) , n = 1, . . ., N ; Append the dataset: until convergence do Evaluate the loss function Update trainable parameters φ on the basis of L(φ) end Set pm+1 (θ) := q φ (y, θ).end and Prangle, 2012; Chen et al., 2020), imposing an additional burden on the experimenter.Npe and nre, in contrast, offer a convenient approach to incorporating inductive biases and to learning summary statistics in an end-to-end fashion without the additional complications associated with alternative sbi techniques.

. . Training procedures for density (ratio) estimators
To summarise and conclude this section, we follow Greenberg et al. ( 2019) and Durkan et al. (2020) and provide in Algorithms 2 and 3 procedures for training neural posterior and density ratio estimators, respectively, for sbi over multiple rounds.In both cases, we assume that the density (ratio) estimator includes any embedding network used to learn summary statistics concurrently with the density (ratio) estimate, and thus consider φ to contain the parameters of both the estimator and the embedding network.

. Experiments for tractable examples
In this section, we present experiments in which we compare the ability of npe and nre to estimate parameter posterior distributions for economic simulation models with tractable likelihood functions against the non-parametric density estimation method described in Grazzini et al. (2017), which we term kde and outline in Section 2.1.2.We provide details of the neural network architecture and training hyperparameters in Appendix B.

. . Performance metrics
Each example presented in this section will be equipped with a tractable transition density, and therefore a tractable likelihood function.Such models are thus amenable to Bayesian inference Algorithm 3: Training snre (see Durkan et al., 2020, Algorithm 1).
To assess the accuracy of the estimated posteriors in this case, we compare the full groundtruth posterior density with the full posterior estimated with each of the implemented simulationbased approaches.This contrasts with previous studies of Bayesian parameter estimation for abms, in which point estimates alone are often used to assess the quality of the tested inference procedures.For example, Platt (2021) uses a prior-weighted Euclidean distance between the estimated posterior mean and generating parameters (that is, the θ that generated the pseudoobservation y) as a performance metric.However, such metrics provide a very limited and at times misleading view on the outcome of the inference process, because (a) "closesness" is not measured in the geometry of the target distribution, as visualised in Figure 2; (b) it is possible that an approximate posterior can produce a posterior mean close to the generating parameter but simultaneously under-or over-estimate the width of the distribution or otherwise yield poor uncertainty quantification, and (c) finite datasets are not guaranteed to yield posterior densities that concentrate on the generating parameter.
To compare the ground-truth and simulation-based posteriors, we compute two integral probability metrics which each correspond to a notion of dissimilarity between the ground-truth and estimated posteriors: Figure 2.: Visualisation: why standard Euclidean distances are misleading when gauging the performance of Bayesian inference algorithms.The parameter θ 2 is ostensibly closer to the"true parameter" (grey) in this toy posterior than θ 1 .However, θ 1 has higher posterior density, i.e. it is a more credible parameter given the observed data.
Wasserstein distance This is a distance measure derived from optimal transport theory (Kantorovich, 1960) and used widely in various machine learning contexts, e.g.generative adversarial networks (Arjovsky et al., 2017).For some distance ρ 0 on Θ and two probability measures µ and µ , the p-Wasserstein metric between two sets of samples where Γ n,m is the set of n × m matrices with non-negative entries, columns summing to m −1 , and rows summing to n −1 .Throughout, we use the Euclidean distance for ρ 0 .

Maximum mean discrepancy (MMD)
This metric is once again a metric on probability distributions that draws from the theory of reproducing kernel Hilbert spaces (Gretton et al., 2006(Gretton et al., , 2012) ) and is used widely within the machine learning and simulation-based inference community to assess the dissimilarity between probability distributions (Papamakarios et al., 2019;Lueckmann et al., 2021).Here, under a suitable choice of kernel 9 k chosen by the experimenter, the discrepancy between two probability distributions P and Q is taken to be where H is the reproducing kernel Hilbert space (rkhs) associated with k and E θ∼P [k(θ, •)] is the so-called mean embedding of P via kernel k.When only samples {θ i } n i=1 iid ∼ P and {θ i } m i=1 iid ∼ Q are available, an unbiased estimator of this metric can be computed as Throughout, we use a Gaussian kernel as k, where the scale parameter σ 2 = median{ θi − θj are the samples drawn from the ground-truth posteriors.
9 That is: a symmetric, positive semi-definite function on Θ.

8)
We consider a variant of the model proposed by (Brock and Hommes, 1998), which has previously been used in abm calibration experiments (Platt, 2020).The model dynamics can be expressed as the following system of coupled equations: where R, β, σ are parameters.We follow Platt (2020) and assume that H = 4, R = 1.0, σ = 0.04, g 1 = b 1 = b 4 = 0 and g 4 = 1.01.β will be 120 or 10, depending on the experiment, and we note below which value is used.We consider the task of estimating the posterior p (θ | y), where θ = (g 2 , b 2 , g 3 , b 3 ), y := (y 1 , y 2 , . . ., y T ) ∼ p(x | θ * ) is the pseudo-observation, T = 100, and θ * is the parameter setting used to generate y.We consider two experimental setups which we describe further in the following two subsections.For each experiment, we note that by rewriting the above system of equations, we are able to find the transition density for observation y t+1 as where In this way, we are able to obtain approximate ground truth posteriors with standard mcmc techniques such as Metropolis-Hastings (mh).
For both npe and nre, we use a round-based training approach i.e. snpe and snre: we train over 10 rounds and generate 1000 simulations in each round.For kde, we take R = 1 and sample from the posterior with mh (see Appendix A.1 for further details).
Summarising data As discussed in Section 3.6, there exists a number of approaches to representing high-dimensional time-series data x as a low-dimensional vector s(x) of summary statistics to facilitate sbi.In the experiments for the Brock & Hommes model, we demonstrate npe and nre using summary statistics obtained in two different ways: 1. by summarising them manually as the mean value, variance, maximum, minimum, median, 25th quantile, 75th quantile, and the autocorrelations of the x t to lags 1, 2, and 310 .We denote these summary statistics with s and refer to them as naive or hand-crafted summary statistics; 2. by learning them as part of the training process with an embedding network s ϕ with trainable parameters ϕ (see Section 3.6).Through the experiments we present below, we will demonstrate that npe and nre can be used flexibly with various embedding networks, and that the experimenter is free to choose from the plethora of candidate networks that incorporate useful inductive biases for time-series data (e.g.Wong et al., 2018;Dyer et al., 2021b;Kidger et al., 2020).Below, we refer to summary statistics obtained in this fashion as learned summary statistics.
In Figure 3, we show the posteriors obtained using different posterior estimation methods: Figure 3a shows the approximate ground-truth posterior density obtained with the true likelihood function and mh; Figure 3b shows the posterior obtained with kde and mh; Figures 3c and 3d show the posterior estimated via snpe with naive and learned summary statistics, respectively; and Figures 3e and 3f show the posterior obtained via snre and naive and learned summary statistics, respectively, which were also sampled using mh.Here, we use an embedding network consisting of two stacked Elman recurrent units with hidden state of size 32, followed by a single linear layer of size 16.In each figure, the marginals and joint bivariate densities are located on the diagonal and upper diagonal, respectively, while the red lines/dots locate the mean of the true posterior.
We see from the approximate ground-truth in Figure 3a that the marginal posteriors are sharply peaked on the generating parameters for b 2 , g 3 , and b 3 , while the ground truth marginal for g 2 is shifted towards higher values.While these features are recovered reasonably well for Figures 3c-3f (the most notable exception being the posteriors for g 2 ), we see that they are recovered poorly with kde.This performance gap is also manifested in the Wasserstein and mmd metrics reported in Table 1, where lower values indicate better estimates of the posterior.In summary, kde both requires a far larger simulation budget to estimate a single posterior, while simultaneously generating worse estimates of that posterior.In contrast, snpe and snre is able to achieve superior estimates of the ground-truth posterior distribution, despite the 10-fold reduction in the simulation budget they have been afforded.In Figure 4, we show the posteriors obtained using different posterior estimation methods: Figure 4a shows the approximate ground-truth posterior density obtained with the true likelihood function and mh; Figure 4b shows the posterior obtained with kde and mh; Figures 4c and 4d show the posterior estimated via snpe with naive and learned summary statistics, respectively; and Figures 4e and 4f show the posterior obtained via snre and naive and learned summary statistics, respectively, which were also sampled with mh.In this experiment, we now use an embedding network consisting of two stacked gated recurrent units (grus) with hidden state of size 32, followed by a single linear layer of size 16.This results in an embedding network with approximately 10,000 trainable parameters.
We see from the approximate ground-truth in Figure 4a that the marginal posteriors for b 2 and b 3 remain sharply peaked on the generating parameters, but that the ground truth marginals for g 2 and g 3 are diffuse and sloped.The diffuseness of the posterior with respect to these two dimensions is also visually apparent in the joint density plots in the upper diagonal.Upon inspection of the estimated posteriors, we see that the overall shape is recovered well by snpe and snre with learned summary statistics, but less accurately by kde and snpe and snre when using the naive hand-crafted summary statistics described in Section 4.2.This is once again reflected by the Wasserstein and mmd scores, reported in Table 1, in which we see significantly decreased values for snpe and snre with learned summary statistics with respect to the alternatives, despite a 10-fold decrease in the simulation budget they have been afforded.It is nonetheless noteworthy that even with the naive hand-crafted summary statistics, snpe and snre achieve comparable and slightly favourable performance than kde, again with a 10-fold decrease in the allotted simulation budget.

. . Example : Multivariate Geometric Brownian Motion
In Section 4.2, we demonstrated that npe and nre can be used to generate Bayesian parameter posteriors that better match the ground truth posterior than kde for the univariate Brock & Hommes model, despite the fact that the latter method entailed 10 times as many simulations as the former two methods based on neural networks.In this section, we seek to demonstrate that this remains the case even for models that generate multivariate time-series as output.
To this end, we consider a multivariate geometric Brownian motion (mvgbm), which is used in a variety of applications in financial time-series modelling.The model is a stochastic differential equation in which each of the d components, labelled i ∈ {1, 2, . . ., d}, of the path where the b i are drift coefficients, the σ ij are volatility coefficients, and W i t is a Brownian motion.We consider the case of d = 3 and the task of estimating the posterior for the parameters θ = (b 1 , b 2 , b 3 ) given an observation y ∼ p (x | θ * ) of T = 100 points spaced equally with spacing ∆t = 1/(T − 1), where θ * = (0.2, −0.5, 0.0).We take priors b i ∼ U(−1, 1) for each i = 1, 2, 3. We note that this model once again permits both exact simulations and samples from the exact posterior since the transition density is once again tractable11 and can be written as where In our experiments, we take   In Figure 5a we show the approximate ground truth posterior obtained with mh and the transition density described above, while in Figures 5b, 5c, and 5d we show the posteriors obtained with kde and mh, npe with learned summary statistics, and nre with mh also with learned summary statistics, respectively.The corresponding simulation budgets are 10 6 , 10 3 , and 10 3 , respectively.To learn the summary statistics, we once again use the embedding network described in Section 4.2.2 and learn summary statistics and the density (ratio) estimator concurrently.
We see that the approximate ground-truth posteriors are relatively diffuse, with peaks approximately coinciding with the true posterior mean, shown with red lines/dots.The shape and degree of diffuseness is captured accurately by npe and nre.In contrast, the posterior obtained with kde is insufficiently diffuse and biased, and thus a significantly worse estimate of the ground-truth posterior.These observations are corroborated by the corresponding Wasserstein and mmd metrics, reported in Table 2 along with the corresponding simulation budgets.In summary, npe and nre achieve significantly more accurate posterior estimates here with a 1000-fold decrease in the simulation budget, and are able to flexibly accommodate multivariate time-series as input for the inference problem.

. Validating approximate Bayesian inference
As previously mentioned, a major benefit of the Bayesian inferential paradigm is the fact that uncertainty quantification is a built-in feature captured via the posterior distribution.Its utility relies, however, on the ability to capture the correct degree of diffuseness in the posterior, such that the uncertainty quantification in the recovered posterior is meaningful.This presents a challenge in approximate Bayesian inference, since by definition the experimenter does not know the correct shape of the posterior when exact inference is intractable.In this section, we address this issue by outlining one widely used approach to verifying the accuracy of approximate Bayesian inference pipelines in the absence of any ground-truth posterior densities.

. . Simulation-based calibration
Simulation-based calibration is a general purpose method for validating approximate Bayesian inference pipelines.The core idea is to use the fact that the data-averaged posterior should be identical to the prior distribution p(θ).That is, a perfect Bayesian inference pipeline should uphold the following equality: A deviation from this equality signifies some error in the posterior sampling procedure; thus, it has been proposed by Talts et al. (2020) that testing how close our Bayesian workflow is to furthermore and conversely, particular patterns of deviation from the desired uniformity are interpretable and provide insight into the specific way in which the obtained posteriors are inaccurate (Talts et al., 2020).Performing this procedure, and inspecting the resultant rank histograms, is referred to as performing simulation-based calibration (sbc).While this procedure is in principle possible for kde, the computational burden is immense even for cheap simulation models, and becomes infeasible for large-scale abms.This is because the number of simulations required to generate n samples from the posterior would, in general, be at least12 nR, such that the total number of simulations required to perform sbc for kde increases to P nR.Even for a moderately sized sbc task with P 10 3 and the most optimistic R = 1, the total number of simulations required can easily reach the order of 10 8 , since it is typically necessary to take n to be many hundreds of thousands to obtain a reasonably accurate estimate of the posterior.In fact, larger values of n are likely to be necessary, since abms are usually highly parameterised and thus can have large parameter spaces.
In contrast, such an approach to verifying the accuracy of posteriors derived from npe and nre is feasible due to the fact that such approaches involve the prior training of a global density (ratio) estimator, obviating any simulations during the inference (posterior sampling) phase.Thus the strength of npe and nre derives from not only their enhanced performance and improved simulation-efficiency, but also from the fact that such methods permit accuracy checks such as sbc that are otherwise infeasible for alternative parameter estimation techniques when the simulator is expensive, as is often the case for large-scale abms.

. . . Example: Franke and Westerhoff ( )
To further demonstrate the ability of npe and nre to recover accurate posterior estimates and informative summary statistics in a highly automated manner, we perform sbc using a posterior and density ratio estimator trained on the Franke & Westerhoff model (Franke and Westerhoff, 2012), which has previously been used in benchmarking experiments for abm estimation methods (Platt, 2021).In particular, we consider the model version referred to as the "Wealth & Predisposition" model which, similarly to the Brock & Hommes model (see Section 4.2), may be written as a system of coupled equations which models the asset price dynamics that result from a heterogeneous system of traders: such statements in the case of more traditional techniques such as kde would quickly become prohibitively large.

Conclusion
In this paper, we investigated the use of two recently developed approaches to Bayesian parameter estimation for intractable simulation models: neural posterior estimation (npe) for approximating the posterior density directly, and neural ratio estimation (nre) for approximating the likelihood-to-evidence ratio.We motivated the use of these parameter inference methods as simulation-efficient black-box discriminative approaches to Bayesian estimation for complex time-series simulators such as agent-based models in economics, which contrast with existing methods that (a) entail an often prohibitively large simulation burden, (b) can make restrictive assumptions about the form of the simulator, for example by ignoring important structural features such as temporal dependencies, and (c) are inherently generative, and thus assume a task that is typically more difficult than discrimination alone.We argue that this latter point is likely to be particularly pertinent for the case of agent-based models in the social sciences, since they are known to be able to generate complex and stochastic non-equilibrium dynamics that can be difficult to model and generate.We further reviewed existing alternatives, and benchmarked npe and nre against the most popular of these.In these examples, we see that npe and nre generally achieve superior recovery of the approximate ground-truth posterior distributions, despite requiring simulation budgets that are orders of magnitude lower than traditional alternatives.In addition, we demonstrated that a verification of the accuracy of the density (ratio) estimators is possible with the introduced methods via simulation-based calibration (sbc), due to the amortisation of the estimators.In contrast, such checks would necessitate an unreasonably large number of simulations for existing methods in the literature for Bayesian estimation of economic simulation models, for example the methods discussed by Grazzini et al. (2017).For these reasons, we argue that such simulationefficient black-box Bayesian inference as npe and nre may enable economists and social scientists alike to more readily exploit the potential benefits of the agent-based modelling paradigm.

Figure 1 .:
Figure 1.: Schematic of a normalising flow.Sampling and density evaluation are performed via the processes illustrated at the top and bottom of the figure, respectively.

1
. a set of "genuine" examples drawn from the joint distribution, (x, θ) iid ∼ p(x | θ)p(θ), which are assigned a class label c = 1; 2. a set of "false" examples drawn from the product of the marginals, (x, θ) iid ∼ p(x)p(θ), with class label c = 0.The difference between the two cases is that the x are generated by the θ to which they are paired in the former set of examples, while in the latter set of examples the x bear no relation to the θ they are paired with.
kde posterior, 10 5 simulations (c) Sequential neural posterior estimation with naive summary statistics, 10 4 simulations (d) Sequential neural posterior estimation with learned summary statistics, 10 4 simulations (e) Sequential neural ratio estimation with naive summary statistics, 10 4 simulations (f ) Sequential neural ratio estimation with learned summary statistics, 10 4 simulations

Figure 4 .
Figure 4.: (Brock & Hommes, parameter set 2) Posteriors obtained with each posterior estimation method.The marginal posterior distributions are located on the diagonals, while the bivariate joint distributions for each parameter pair are located on the upper diagonal.Red lines/dots indicate the mean of the true posterior.
Figure 5.: (Multivariate geometric Brownian motion) Posteriors obtained with each estimation method.The marginal posterior distributions are located on the diagonals, while the bivariate joint distributions for each parameter pair are located on the upper diagonal.Red lines/dots indicate the mean of the true posterior.

Table 1 .
:(Brock & Hommes)Discrepancies between the approximate ground-truth posterior and the posteriors estimated with kde, snpe, and snre.Bold and italics indicate best and secondbest, respectively.For the neural methods, * indicates that the naive hand-crafted summary statistics described in the main text were used, otherwise summary statistics are learned from the simulated data.

Table 2 .
: (Multivariate geometric Brownian motion) Discrepancies between the approximate ground-truth posterior and the posteriors estimated with npe, nre, and kde.Smaller values indicate more accurate posteriors; bold and italics indicate best and second-best, respectively.