Bayesian estimation of large-scale simulation models with Gaussian process regression surrogates

Large-scale, computationally expensive simulation models pose a particular challenge when it comes to estimating their parameters from empirical data. Most simulation models do not possess closed-form expressions for their likelihood function, requiring the use of simulation-based inference methods such as simulated method of moments, indirect inference, likelihood-free inference or approximate Bayesian computation. However, given the high computational requirements of large-scale models, it is often difficult to run these estimation methods, as they require more simulated runs than can feasibly be carried out. The aim here is to address this problem by providing a full Bayesian estimation framework in which the true but intractable likelihood function of the simulation model is replaced by one generated by a surrogate model trained on the limited simulated data. This surrogate is a Linear Model of Coregionalization, where each latent variable is a sparse variational Gaussian process, chosen for its desirable convergence and consistency properties. The effectiveness of the approach is tested using both a simulated Bayesian computing analysis on a known data generating process, and an empirical application in which the free parameters of a computationally demanding agent-based model are estimated on US macroeconomic data.


Introduction
The increasing availability of computing power has led over time to simulation methods becoming part of the standard toolbox of researchers. Gilbert and Troitzsch (2005) argue that their appeal for the social sciences lies in their enabling a better understanding and formalization of the non-linearities or emergent phenomena pervasive in social structures. Conditional on the simulations being a valid representation of the social phenomenon of interest, this allows scenario analysis or simulated experiments to be carried out. However, establishing this precondition is challenging for social science simulations, particularly for agent-based models (ABMs), where aggregate properties of observable variables emerge from the bottom-up simulation of individual agent interactions (Fagiolo et al., 2007). Because ABMs form the hardest case of this validation problem, they will form the focus of the empirical application; however, the methodology proposed here is designed to be broadly applicable to any model producing simulated data with a time-series dimension. Gouriéroux and Monfort (1993, 1996) provide an early overview of simulation-based inference methods, while Fagiolo et al. (2019) review how those approaches have been applied in the context of ABM estimation. Existing applications fall into two categories, moment-based and likelihood-based, both of which are special cases of the more general indirect inference framework (Gouriéroux et al., 1993; Smith, 1993). These differ in the metric chosen for the distance between empirical and simulated auxiliary parameters, with the Wald metric leading to moment-based methods, and the likelihood ratio metric to simulated likelihood methods (see Smith, 2016, on this point). The method proposed here is also a form of indirect inference, using GP regression as the auxiliary model. Historically, the first method implemented on ABMs is the method of simulated moments (MSM), in Gilli and Winker (2003), which was extended by the simulated minimum distance (SMD) method of Grazzini and Richiardi (2015). Likelihood-based methods involve simulated likelihood (Kukačka and Baruník, 2017) and Bayesian estimation approaches (Grazzini et al., 2017; Delli Gatti and Grazzini, 2020), with kernel density estimation (KDE) of the simulated data providing a non-parametric estimate of the likelihood. Grazzini et al. (2017) and Lamperti et al. (2018) make the point that these methods have so far been restricted to relatively simple ABMs, because the computational cost of running the simulations generates two problems when applied to large-scale models. First, the computational requirements of larger models impose a practical limit on the number of simulation runs that can be performed, assumed here to be on the order of 1000 runs, enough to run a Monte Carlo analysis and establish the statistical properties of a simulation. Second, their high parameter dimensionality, which usually underpins the computational requirements, also complicates the exploration of the parameter space, through the dilution of the already limited compute budget in the high-dimensional sampling design, as well as the reduction in efficiency when using MCMC to sample from a high-dimensional posterior (Gelman et al., 1996, 1997). An additional issue, particularly acute for social science ABMs, is the compounding of this computational cost brought on by the frequent need to estimate models on multiple empirical datasets (e.g. for multiple countries, time-periods, etc.).
This paper proposes a Bayesian estimation framework with a Gaussian process regression surrogate (BEGRS) that is specifically designed to be tractable on computationally demanding models. It includes a demonstration on the Caiani et al. (2016) model, an ABM known for being computationally expensive. The strategy adopted is to proceed in two sequential stages. First, we generate a surrogate likelihood function for the model, based on Gaussian process (GP) regression, using the limited number of simulated runs to generate training data that spans the parameter space. In the second stage, this surrogate likelihood is used as part of a standard Bayesian framework to estimate the parameters of the simulation model from empirical data.
The use of a GP surrogate to reduce the computational burden of working with simulation models, especially ABMs, is not novel in itself: the seminal work of Kennedy and O'Hagan (2000, 2001) already focuses on reducing the computational expense of emulating expensive computer simulations with GP regression. A large and relatively recent literature on likelihood-free inference (LFI) also advocates the use of GPs as surrogates for models lacking a tractable likelihood, for example Meeds and Welling (2014), Gutmann and Corander (2016), Järvenpää et al. (2021) or Aushev et al. (2024), often based on the synthetic likelihood approach of Wood (2010). Similarly, within the economics ABM literature, Salle and Yıldızoğlu (2014), Bargigli et al. (2020) and Chen and Desiderio (2022) all provide examples of using GP regression as a surrogate for an ABM, typically using the GP to model simulated moments that can enter an MSM estimation.
The main divergence from these previous contributions is the fact that the GP surrogate used here does not rely on discrepancies between empirical and simulated data, which means that the training stage only requires simulated data, and the trained surrogate model can be reused 'as-is' when estimating on distinct empirical datasets. This design choice helps amortize the computational cost in settings where models require estimation on many datasets, at the cost of not being able to use Bayesian optimization to actively acquire training parameter samples from regions with high empirical likelihood. Instead, we show that estimation is possible even when the entire training data is simulated in one preliminary step. Given this design choice, the efficiency improvements come instead from leveraging the time-series nature of the simulated data and including lagged values of the model variables in the GP regression inputs. This turns the surrogate into a one-step-ahead predictor, thus greatly increasing the effective amount of training data available. The second improvement is to rely on variational approximations to the GP, which allows the surrogate to scale more efficiently.
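The construction of these one-step-ahead training pairs can be sketched as follows. This is an illustrative reimplementation, not the paper's code, and the function name `build_training_data` is hypothetical: each input row concatenates the lagged observables with the parameter draw that generated the run, so that $n$ runs of length $T$ yield $n(T-1)$ training observations.

```python
import numpy as np

def build_training_data(sim_runs, thetas):
    """Stack (lagged observations, parameters) -> next-step observations.

    sim_runs: list of n arrays, each T x d (simulated series, centred)
    thetas:   n x D_theta array of parameter draws
    Returns X (N x (d + D_theta)) inputs and Y (N x d) targets, N = n(T-1).
    """
    X_blocks, Y_blocks = [], []
    for sim, theta in zip(sim_runs, thetas):
        lagged = sim[:-1]                          # x_{t-1}, for t = 2..T
        params = np.tile(theta, (len(lagged), 1))  # repeat theta for each transition
        X_blocks.append(np.hstack([lagged, params]))
        Y_blocks.append(sim[1:])                   # x_t targets
    return np.vstack(X_blocks), np.vstack(Y_blocks)

# toy check: n = 3 runs of T = 5 observations on d = 2 variables, 4 parameters
rng = np.random.default_rng(0)
sims = [rng.normal(size=(5, 2)) for _ in range(3)]
thetas = rng.uniform(size=(3, 4))
X, Y = build_training_data(sims, thetas)
print(X.shape, Y.shape)   # (12, 6) (12, 2): N = n(T-1) = 12, D = d + D_theta = 6
```

Every simulated transition thus becomes a training point for the surrogate, rather than each full run contributing a single observation.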
This proposed two-step approach is most closely related to several existing contributions. First is the KDE-based Bayesian estimation framework of Grazzini et al. (2017) and Delli Gatti and Grazzini (2020). The key departure here lies in choosing a different non-parametric framework for generating the surrogate likelihood, in order to increase the efficiency of the methodology with respect to the number of simulations required. The second contribution is the recent work of Hooten et al. (2020), who use a GP surrogate to estimate a wildlife population ABM and an epidemiological ABM. The difference here lies both in the nature and scale of the ABMs involved, with ABMs in the social sciences being more highly parametrised and there being much more uncertainty over the deep rules that govern agent behaviour. Finally, this methodology is conceptually related to the literature using neural networks as the basis of surrogate models in LFI applications, for instance Lueckmann et al. (2019) or Papamakarios et al. (2017, 2019), or even Platt (2021) in the context of ABM estimation. As will be discussed below, these are in a theoretical sense the closest to the methodology proposed here, due to the universal approximation property of neural networks that is shared with GPs.
The remainder of the paper is organized as follows. Section 2 details the variational GP regression framework used to generate the surrogate likelihood, and section 3 presents the wider Bayesian estimation framework for simulation models. Two applications are provided in section 4, and section 5 concludes.

One-step-ahead multivariate GP: setup and properties
The notation below follows the GP regression literature (specifically Hensman et al., 2015; Seeger et al., 2008; Burt et al., 2019) in order to facilitate presentation, but is adapted to deal both with the multivariate nature of the problem as well as the embedding of the GP regression into a wider Bayesian estimation framework. The former involves vectorizing over the observable variables and adopting a block-diagonal structure for the covariance matrices. By casting the multivariate GP as a larger univariate process, one can show that the properties of the quantities of interest in univariate GP regression extend to the multivariate case. To address the latter, and limit the potential for confusion between the Bayesian estimation of the GP surrogate from the model simulations versus the wider Bayesian estimation of the simulation model itself, variables and parameters used in the latter estimation will be labelled with a superscript *. Tables 3-5 in appendix A provide a summary of the notation used.
There are $d$ empirical observable variables, modelled with $q$ latent GP variables. Where necessary, these are indexed respectively using subscripts $i$ and $j$. There are $T > 1$ time-series observations for each variable, indexed with subscript $t$. The simulated data is generated using $n$ samples $\boldsymbol{\theta}_k$ drawn from a parameter space $\Theta$. Lower case bold symbols, e.g. $\mathbf{x}$, $\mathbf{y}$, $\mathbf{f}$, indicate vectors, assumed to be column vectors unless stated otherwise, while upper case bold symbols, e.g. $\mathbf{X}$, $\mathbf{K}$, $\mathbf{B}$, refer to matrices. $\mathbf{K}$ will specifically refer to the variance-covariance matrix produced by a kernel function, and superscripts attached to kernel matrices, e.g. $\mathbf{K}^{x,x}$, $\mathbf{K}^{x,z}$, will refer to the inputs of the kernel function. Braces are used to refer to the full set of vectors or matrices attached to observable and latent variables, e.g. $\{\mathbf{K}_j\} = \{\mathbf{K}_1, \mathbf{K}_2, \ldots, \mathbf{K}_q\}$ is the set of all kernel matrices attached to the latent variables. Two vectorization operations are required to convert these sets into single objects.

$$\mathrm{vec}(\{\mathbf{f}_j\}) = \begin{bmatrix} \mathbf{f}_1 \\ \mathbf{f}_2 \\ \vdots \\ \mathbf{f}_q \end{bmatrix}, \qquad \mathrm{bdiag}(\{\mathbf{K}_j\}) = \begin{bmatrix} \mathbf{K}_1 & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{K}_2 & & \vdots \\ \vdots & & \ddots & \mathbf{0} \\ \mathbf{0} & \cdots & \mathbf{0} & \mathbf{K}_q \end{bmatrix} \tag{1}$$

The first is the standard vectorization operation, which stacks multiple column vectors into a single vector; the second constructs a block diagonal matrix from a set of matrices, with $\mathbf{0}$ being an appropriately sized null matrix. Let $\mathbf{X}^*$ be a $T \times d$ matrix of empirical data and let $\{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n\}$ be a set of $n$ simulated $T \times d$ real-valued matrices, each obtained by simulating the model with a row vector of $D_\Theta$ parameters $\boldsymbol{\theta}_k \in \Theta$. It is assumed w.l.o.g. that both the simulated and empirical data have $T$ time-series observations. The values in the columns of $\mathbf{X}_k$ and $\mathbf{X}^*$ are assumed to be centred. Let $\boldsymbol{\theta}^*$ be a vector of simulation model parameters to be estimated from the empirical data $\mathbf{X}^*$ using Bayesian methods. Then, ignoring the marginal data density $p(\mathbf{X}^*)$, the object of interest is the posterior density of $\boldsymbol{\theta}^*$ given $\mathbf{X}^*$, i.e. $p(\boldsymbol{\theta}^* \mid \mathbf{X}^*) \propto p(\mathbf{X}^* \mid \boldsymbol{\theta}^*)\,p(\boldsymbol{\theta}^*)$. The prior probability $p(\boldsymbol{\theta}^*)$ is known in advance, so the likelihood $p(\mathbf{X}^* \mid \boldsymbol{\theta}^*)$ is the term of interest. One can take advantage of the time-series structure of the data to decompose the log-likelihood into a sum of individual one-step-ahead contributions for each time-series observation in the dataset:

$$\ln p(\mathbf{X}^* \mid \boldsymbol{\theta}^*) = \sum_{t=2}^{T} \ln p\!\left(\mathbf{x}^*_t \mid \mathbf{x}^*_{t-1}, \ldots, \mathbf{x}^*_1, \boldsymbol{\theta}^*\right) \tag{2}$$

$$\ln p(\mathbf{X}^* \mid \boldsymbol{\theta}^*) \approx \sum_{t=2}^{T} \ln p\!\left(\mathbf{x}^*_t \mid \mathbf{x}^*_{t-1}, \boldsymbol{\theta}^*\right) \tag{3}$$

where $\mathbf{x}^*_t$ is the $t$-th observation (row) of $\mathbf{X}^*$. Truncating the conditioning set to the previous observation, as in (3), is known in the literature as a Vecchia approximation, and explains why the methodology requires $T > 1$, to ensure at least one Markov transition is available. In principle, more lags can be included to generate a higher-order Markov approximation, at the cost of a higher dimensionality of the input space.
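For a process that genuinely is first-order Markov, this decomposition is exact rather than approximate. A minimal numerical illustration with a Gaussian AR(1) process (all names and parameter values here are illustrative):

```python
import numpy as np
from scipy.stats import norm

# AR(1): x_t = rho * x_{t-1} + eps_t, so x_t | x_{t-1} ~ N(rho * x_{t-1}, s^2)
rho, s = 0.7, 0.5
rng = np.random.default_rng(1)
x = np.empty(201)
x[0] = 0.0
for t in range(1, 201):
    x[t] = rho * x[t - 1] + s * rng.normal()

# log-likelihood as a sum of one-step-ahead contributions, t = 2..T
loglik = np.sum(norm.logpdf(x[1:], loc=rho * x[:-1], scale=s))
```

For non-Markov processes, truncating the conditioning set to the previous observation in this way is exactly the Vecchia approximation in (3).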
Two notational clarifications are needed. First, because GP predictions are always conditioned on the training inputs, training targets and GP parameters, in line with the GP regression literature, explicit mention of these conditioning variables is dropped and $\tilde{p}$ is used to indicate the use of a GP surrogate. Second, again to ensure consistency with the GP literature, the conditioning variables are packaged into two sets of inputs $\mathbf{X}$, $\mathbf{X}^*$. These are given by the following block matrices, with $\boldsymbol{\iota}_{T-1}$ a $T-1$ length column vector of ones, $\mathbf{X}_k$ the simulated observations generated by the model using parameter setting $\boldsymbol{\theta}_k$, and $L$ the first-order lag operator:

$$\mathbf{X} = \begin{bmatrix} L\mathbf{X}_1 & \boldsymbol{\iota}_{T-1}\boldsymbol{\theta}_1 \\ \vdots & \vdots \\ L\mathbf{X}_n & \boldsymbol{\iota}_{T-1}\boldsymbol{\theta}_n \end{bmatrix}, \qquad \mathbf{X}^* = \begin{bmatrix} L\mathbf{X}^* & \boldsymbol{\iota}_{T-1}\boldsymbol{\theta}^* \end{bmatrix} \tag{4}$$

$\mathbf{X}$ is an $N \times D$ matrix of training inputs, with $N = n(T-1)$ training observations and $D = d + D_\Theta$ dimensions. In the terminology of Kennedy and O'Hagan (2001), the GP surrogate acts as an emulator of the simulation model, with the mean and variance given by the following expressions:

$$\boldsymbol{\mu} = \mathbf{B}\mathbf{f}, \qquad \boldsymbol{\Psi} = \mathbf{B}\mathbf{K}\mathbf{B}^\top = \sum_{j} \mathbf{w}_j\mathbf{w}_j^\top \otimes \mathbf{K}_j \tag{5}$$

where $\mathbf{f} = \mathrm{vec}(\{\mathbf{f}_j\})$, $\mathbf{K} = \mathrm{bdiag}(\{\mathbf{K}_j\})$, and $\mathbf{B} = \mathbf{W} \otimes \mathbf{I}_N$, with $\mathbf{W}$ the $d \times q$ matrix of latent loadings and $\mathbf{w}_j$ its $j$-th column. The second equality for $\boldsymbol{\Psi}$ corresponds to the more traditional LMC representation of Alvarez et al. (2012). The measurement error on each observable $i$ is assumed to be i.i.d. Gaussian with standard deviation $\sigma_i$ and uncorrelated across variables, so that the variance of the prediction error below is

$$\boldsymbol{\Sigma} = \mathrm{diag}\!\left(\sigma_1^2, \ldots, \sigma_d^2\right) \otimes \mathbf{I}_N \tag{6}$$

The prediction components $\mathbf{f}_j$ of the latent variables used in the proposed methodology are each assumed to follow a zero-mean Gaussian distribution, $\mathbf{f}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_j)$, where covariances $[\mathbf{K}_j]_{k,l} = k_j(\mathbf{x}_k, \mathbf{x}_l)$ are modelled using a kernel function $k_j(\cdot, \cdot)$. Specifically, we use the following radial basis function (RBF) kernel, where $\lambda_j$ is the variable-specific kernel length scale:

$$k_j(\mathbf{x}_k, \mathbf{x}_l) = \exp\!\left(-\frac{\|\mathbf{x}_k - \mathbf{x}_l\|^2}{2\lambda_j^2}\right) \tag{7}$$

The LMC increases the flexibility of the model by having $q$ distinct kernels, each capturing correlations in the training data at different length scales $\lambda_j$. The set of length scales $\{\lambda_j\}$ will play an important role in the convergence properties of the GP surrogate, discussed below. In addition, because typically $q < d$, the LMC also reduces the dimensionality of the surrogate model. Given (5), the distribution of the LMC predictions on the training data $\mathbf{X}$ is $\hat{\mathbf{f}} \sim \mathcal{N}(\mathbf{0}, \mathbf{B}\mathbf{K}\mathbf{B}^\top)$. Marginalizing out the vectorized GP latents $\mathbf{f}$ results in the following likelihood of the evidence for the LMC model:

$$p(\mathbf{y} \mid \mathbf{X}) = \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{0},\ \mathbf{B}\mathbf{K}\mathbf{B}^\top + \boldsymbol{\Sigma}\right) \tag{8}$$

$$\ln p(\mathbf{y} \mid \mathbf{X}) = -\frac{Nd}{2}\ln 2\pi - \frac{1}{2}\mathbf{y}^\top\!\left(\mathbf{B}\mathbf{K}\mathbf{B}^\top + \boldsymbol{\Sigma}\right)^{-1}\!\mathbf{y} - \frac{1}{2}\ln\left|\mathbf{B}\mathbf{K}\mathbf{B}^\top + \boldsymbol{\Sigma}\right| \tag{9}$$

The log-likelihood of the evidence is a function of the underlying parameters $\vartheta = \{\mathbf{W}, \{\lambda_j\}, \{\sigma_i\}\}$, which implies that optimal values for $\vartheta$ can be obtained by maximization.
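The LMC covariance structure described above can be illustrated numerically. The sketch below (hypothetical names, arbitrary dimensions) builds the block-diagonal latent covariance, applies the Kronecker-structured loading matrix, and evaluates the resulting Gaussian evidence for a stand-in target vector:

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf(Xa, Xb, length_scale):
    """RBF kernel matrix: k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    sq = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale ** 2)

rng = np.random.default_rng(2)
N, D, d, q = 30, 3, 2, 2          # training points, input dim, observables, latents
X = rng.uniform(size=(N, D))
W = rng.normal(size=(d, q))       # latent loadings
scales = [0.5, 2.0]               # one length scale per latent kernel
sigma = 0.1                       # observation noise s.d. (shared here for brevity)

K = np.zeros((q * N, q * N))      # bdiag({K_j}) over the q latents
for j in range(q):
    K[j * N:(j + 1) * N, j * N:(j + 1) * N] = rbf(X, X, scales[j])
B = np.kron(W, np.eye(N))         # B = W (kron) I_N
Psi = B @ K @ B.T + sigma ** 2 * np.eye(d * N)   # covariance of the evidence

y = rng.normal(size=d * N)        # stand-in for the vectorized training targets
log_evidence = multivariate_normal(mean=np.zeros(d * N), cov=Psi).logpdf(y)
```

Maximizing this quantity over the loadings, length scales and noise levels is what the training stage does, albeit via the variational approximation introduced below rather than the exact dense computation shown here.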
Once the optimal GP parameters $\hat{\vartheta}$ have been obtained from the training data, the LMC surrogate provides multivariate Gaussian predictions $\tilde{p}(\hat{\mathbf{f}}^* \mid \mathbf{X}^*) = \mathcal{N}(\hat{\mathbf{f}}^* \mid \boldsymbol{\mu}^*, \boldsymbol{\Psi}^*)$ for previously unseen inputs $\mathbf{X}^*$, with prediction mean and variance functions given by:

$$\boldsymbol{\mu}^* = \mathbf{K}^{*,x}\left(\mathbf{K}^{x,x} + \boldsymbol{\Sigma}\right)^{-1}\mathbf{y} \tag{10}$$

$$\boldsymbol{\Psi}^* = \mathbf{K}^{*,*} - \mathbf{K}^{*,x}\left(\mathbf{K}^{x,x} + \boldsymbol{\Sigma}\right)^{-1}\mathbf{K}^{x,*} \tag{11}$$

where $\mathbf{K}^{x,x}$, $\mathbf{K}^{*,x}$ and $\mathbf{K}^{*,*}$ denote the LMC covariance $\mathbf{B}\mathbf{K}\mathbf{B}^\top$ evaluated at the training and prediction inputs respectively.
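These are the standard GP predictive equations. A minimal univariate check (illustrative values, a single RBF kernel, no LMC structure) confirms that the predictive mean tracks the underlying function at an unseen input:

```python
import numpy as np

def rbf(Xa, Xb, l=1.0):
    """RBF kernel matrix with length scale l."""
    sq = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / l ** 2)

rng = np.random.default_rng(3)
X = np.linspace(0, 5, 25)[:, None]                 # training inputs
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=25)   # noisy observations
sigma = 0.05                                       # observation noise s.d.

Xs = np.array([[2.5]])                             # unseen input
Kxx = rbf(X, X) + sigma ** 2 * np.eye(25)
Ksx = rbf(Xs, X)
alpha = np.linalg.solve(Kxx, y)
mu_star = Ksx @ alpha                              # predictive mean, as in (10)
var_star = rbf(Xs, Xs) - Ksx @ np.linalg.solve(Kxx, Ksx.T)   # as in (11)
```

With densely sampled training data the predictive mean lands close to the true value sin(2.5), and the predictive variance shrinks accordingly.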

Theoretical properties of the multivariate GP surrogate
The use of GP regression for the surrogate model is motivated by several desirable theoretical properties, the first being universal approximation. Let $\mathcal{H}$ be the reproducing kernel Hilbert space (RKHS) of all functions $f$ defined as linear combinations of the eigenfunctions of kernel (7), let $\mathcal{C}$ be a compact subset of the input space, and let $C(\mathcal{C})$ be the space of continuous functions over $\mathcal{C}$. Given $f_0 \in C(\mathcal{C})$ and $\epsilon > 0$, Micchelli et al. (2006) show that if the kernel is universal, then there exists a function $f \in \mathcal{H}$ that approximates $f_0$ arbitrarily well, i.e. $\|f - f_0\|_\infty < \epsilon$. In such cases, GP regression possesses the universal approximation property and $\overline{\mathcal{H}} = C(\mathcal{C})$.
For the specific methodology proposed here, theorem 17 of Micchelli et al. (2006) proves that the RBF kernel (7) is universal, and theorem 12 of Caponnetto et al. (2008) proves that a multivariate kernel of the form $\mathbf{A}\,k(\mathbf{x}_k, \mathbf{x}_l)$ is universal if and only if $k$ is itself universal and $\mathbf{A}$ is positive definite. This is the case for the LMC model (5) if $q = d$, as $\mathbf{A} = \mathbf{W}\mathbf{W}^\top$ is positive definite by construction.
If $q < d$, then $\mathbf{W}\mathbf{W}^\top$ is only positive semidefinite; however, a Tikhonov regularization $\mathbf{A} = \mathbf{W}\mathbf{W}^\top + \epsilon\mathbf{I}$ can be performed, for small $\epsilon$. This procedure, known as 'adding jitter', ensures universality of $\mathbf{A}\,k(\mathbf{x}_k, \mathbf{x}_l)$, but can change the GP estimates of the (regularized) $\mathbf{W}$. Finally, Wynne et al. (2021) show that a closed bounded interval $\mathcal{C} \subset \mathbb{R}^D$ meets the conditions required for GP regression to be consistent. As a result, the LMC kernel (5) possesses the universal approximation property on a bounded subset of $\mathbb{R}^D$.
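The need for jitter when $q < d$ is easy to verify numerically: the product of a tall loading matrix with its transpose is rank-deficient, with zero eigenvalues that a small diagonal perturbation removes (dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, q = 4, 2
W = rng.normal(size=(d, q))
A = W @ W.T                      # rank q < d: positive semidefinite only
eigs = np.linalg.eigvalsh(A)     # two eigenvalues are (numerically) zero
A_jit = A + 1e-6 * np.eye(d)     # Tikhonov regularization ('jitter')
eigs_jit = np.linalg.eigvalsh(A_jit)   # now strictly positive definite
```

The size of the jitter trades off numerical stability against the distortion of the estimated loadings mentioned above.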
The second property relates to the rate at which the mean GP prediction (10) converges to the true function $f_0$ as the size $N$ of the training data set increases. A wide range of results in the literature establishes asymptotic convergence in the univariate setting, divided between work that uses an integrated mean square error (IMSE) loss, such as Choi and Schervish (2007), Shi and Choi (2011), Le Gratiet and Garnier (2015) or Koepernik and Pfaff (2021), and work that uses a Kullback-Leibler (KL) loss, such as Seeger et al. (2008) or Burt et al. (2019). The two approaches can in fact be reconciled, see Van Der Vaart and Van Zanten (2011). Le Gratiet and Garnier (2015) in particular prove almost sure convergence for non-degenerate Mercer kernels with bounded features, such as (7).
However, given the computationally constrained setting chosen here, the bigger concern is the rate of this convergence. Seeger et al. (2008) argue that in such a setting, the concept of interest is information consistency, where one examines the asymptotic behaviour of the expected KL divergence from the surrogate to the noise distribution (6). This can be bounded using the standard formula for the KL divergence between two Gaussians $\mathcal{N}_0(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$ and $\mathcal{N}_1(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$:

$$D_{KL}\!\left(\mathcal{N}_0 \,\|\, \mathcal{N}_1\right) = \frac{1}{2}\left(\mathrm{tr}\!\left(\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_0\right) + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^\top\boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) - Nd + \ln\frac{|\boldsymbol{\Sigma}_1|}{|\boldsymbol{\Sigma}_0|}\right)$$

The mean predictions $\boldsymbol{\mu}$ are in the RKHS $\mathcal{H}$ of the LMC kernel corresponding to $\mathbf{B}\mathbf{K}\mathbf{B}^\top + \boldsymbol{\Sigma}$, so the first term is simply the Hilbert norm of the prediction $\|\boldsymbol{\mu}\|^2_{\mathcal{H}}$. The second term, known as the regret, is equal up to an additive constant to the last term in the log-likelihood (9):

$$R = \frac{1}{2}\ln\left|\mathbf{I} + \boldsymbol{\Sigma}^{-1}\mathbf{B}\mathbf{K}\mathbf{B}^\top\right| \tag{12}$$

Corollary. Let the observations $\mathbf{y}$ be distributed with density $p(\mathbf{y}) = \mathcal{N}(\mathbf{0}, 4c^2\mathbf{I})$ for a constant $c$. Then the expected regret of the LMC (12) with RBF kernels (7) scales with $N$ at the same rate as the single-kernel bound of Seeger et al. (2008). This is immediate from the fact that the bound in Proposition 1 is the same, up to some multiplicative constants, as the one in Seeger et al. (2008) for the case of a single RBF kernel function. Assuming the same distribution as theirs for the inputs, the LMC regret with $q$ RBF kernels will therefore scale with $N$ at the same rate as that of a single RBF kernel. Note, however, that the regret of the LMC kernel (12) will be larger than in the single kernel case of Seeger et al. (2008), due to the fact that $q$ copies of the worst-performing kernel enter the regret term, which itself is multiplied by $d$.
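The closed-form Gaussian KL divergence used in this bound can be written as a short function; the sketch below is generic and not tied to the LMC quantities above:

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ), closed form for multivariate Gaussians."""
    k = len(mu0)
    S1inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1inv @ S0) + diff @ S1inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

S = np.array([[1.0, 0.3], [0.3, 2.0]])
mu = np.zeros(2)
kl_same = kl_gauss(mu, S, mu, S)            # identical distributions -> 0
kl_diff = kl_gauss(mu, S, mu + 1.0, S)      # mean shift -> strictly positive
```

The trace and mean-shift terms correspond to the Hilbert-norm part of the bound, while the log-determinant term is the regret.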
A third desirable property of GP regression in the context of surrogate modelling is the auto-regularized nature of the log-likelihood (9) maximization, where the minimization of the squared deviation of the GP prediction from the data is penalized by the regret (12), enforcing smoothness of the GP and limiting the risk of overfitting the training data. In fact, Bishop (2006, sections 3.1.4 and 6.4) shows that for linear kernels $k(\mathbf{x}_k, \mathbf{x}_l) = \mathbf{x}_k^\top\mathbf{x}_l$, GP regression is equivalent to ridge regression. This enables a comparison with the neural network approaches mentioned in the introduction, which also possess the universal approximation property, as proven by Hornik et al. (1989). Neal (1996) establishes that, assuming a zero-mean Gaussian prior for the network parameters, a single hidden layer neural network will converge under Bayesian learning to a GP as the number of hidden units goes to infinity (see also Bishop, 2006; Rasmussen and Williams, 2006; Burt et al., 2019). This is functionally equivalent to the RKHS representation of GP regression, where any smooth function can be approximated arbitrarily well by an infinite weighted sum over kernel eigenfunctions. Neal (1996) points out that the width of a neural network is a hyperparameter that needs to be optimized over network designs in order to avoid over-fitting the training data, unlike GP regression. What GPs lose in this trade-off is the extra flexibility available from using multiple hidden layers in a deep network, for instance being able to model discontinuities, which will impose small length scales $\lambda_j$ and high regret (12). Given the computationally constrained setting investigated here, however, the design choice is to prefer a simpler, auto-regularizing model that generalizes well from an ex-ante fixed amount of training data.

Two additional properties are worth mentioning. First, the two-step design of the methodology minimizes the computationally expensive runs of the simulation model when estimating the model on multiple empirical datasets. Because the GP's training only relies on simulated data, and not empirical data, the same surrogate can be used when estimating model parameters on different empirical datasets, thus amortizing the computationally expensive training data. Second, the analytical tractability of the Gaussian predictions ensures differentiability of the surrogate likelihood. This specific aspect facilitates maximization of, and sampling from, the posterior through the use of gradient-based methods.
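The linear-kernel equivalence noted above can be checked directly: the GP posterior mean with kernel $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top\mathbf{x}'$ and noise variance equal to the ridge penalty reproduces the ridge prediction exactly (data and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 40, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
lam = 0.5                                   # ridge penalty = GP noise variance
Xs = rng.normal(size=(5, D))                # test inputs

# ridge regression prediction
beta = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
pred_ridge = Xs @ beta

# GP posterior mean with linear kernel k(x, x') = x . x'
K = X @ X.T
pred_gp = (Xs @ X.T) @ np.linalg.solve(K + lam * np.eye(N), y)
```

The two predictions coincide by the push-through identity $\mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top + \lambda\mathbf{I})^{-1} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top$.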
Finally, two caveats on the universality and convergence properties need to be mentioned. First, the universality property only ensures that the one-step-ahead LMC predictions converge to the first-order Markov process assumed in the Vecchia approximation (3), and offers no guarantee that this approximation to the full likelihood (2) is valid in the first place. Second, a key practical implication of Proposition 1 is that the convergence of the LMC will tend to be determined by the regret of the kernel associated with the smallest length scale $\lambda_{j^*}$. Intuitively, if the output variable changes rapidly over small ranges of the inputs (for example because of a discontinuity), a small length scale will be required to capture this accurately, and for a given amount of training data $N$ randomly spread over $\mathcal{C}$, the prediction error will be larger than that of a smoother model, where the inputs can be captured with larger length scales. The practical implications of both issues are discussed further in section 3.1.

Scaling up: a variational approximation to the LMC
Obtaining exact GP predictions (10) for large $N$ is not feasible in practice, due to the $\mathcal{O}(N^3)$ matrix inversions required. Several approaches aim at improving on this bound, such as the Nyström sampling approach of Rudi et al. (2015) and Lu et al. (2020), or the sparse variational approach of Titsias (2009) and Hensman et al. (2015). Both take advantage of redundant information in the training data, reducing the large number of training points to a smaller set of sufficient statistics that retain critical information, thus improving computational efficiency. Following the suggestion of Gutmann and Corander (2016), we use the latter of the two here, and a small number of inducing points are introduced to augment the training observations. In terms of notation, these inducing points correspond to a set of inducing locations $\{\mathbf{Z}_j\}$ which generate inducing values $\{\mathbf{u}_j\}$, both of which are additional parameters of the GP that need to be estimated during the training stage. The full derivation of the variational approximation to the LMC is provided in online appendix A.
The joint distribution of the training predictions $\mathbf{f}_j$ and inducing values $\mathbf{u}_j$ associated with a given latent variable is now given by the following multivariate Gaussian:

$$\begin{bmatrix} \mathbf{f}_j \\ \mathbf{u}_j \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0}, \begin{bmatrix} \mathbf{K}^{x,x}_j & \mathbf{K}^{x,z}_j \\ \mathbf{K}^{z,x}_j & \mathbf{K}^{z,z}_j \end{bmatrix}\right) \tag{13}$$

The block structure of (13) means that the Schur complement can be used to obtain the latent predictions $\mathbf{f}_j$ conditional on the inducing values $\mathbf{u}_j$:

$$p(\mathbf{f}_j \mid \mathbf{u}_j) = \mathcal{N}\!\left(\mathbf{f}_j \mid \mathbf{A}_j\mathbf{u}_j,\ \mathbf{K}^{x,x}_j - \mathbf{A}_j\mathbf{K}^{z,x}_j\right) \tag{14}$$

with:

$$\mathbf{A}_j = \mathbf{K}^{x,z}_j\left(\mathbf{K}^{z,z}_j\right)^{-1} \tag{15}$$

The key computational gain is that obtaining (14) now only requires inverting $\mathbf{K}^{z,z}_j$, which by design is much smaller than $\mathbf{K}^{x,x}_j$. The marginal distribution of the inducing values is simply:

$$p(\mathbf{u}_j) = \mathcal{N}\!\left(\mathbf{0}, \mathbf{K}^{z,z}_j\right) \tag{16}$$

The conditional (14) and prior (16) distributions are assumed to be independent across latents $j$, which means that the equivalent distributions for the vectorized values $\mathbf{f} = \mathrm{vec}(\{\mathbf{f}_j\})$ and $\mathbf{u} = \mathrm{vec}(\{\mathbf{u}_j\})$ can be expressed as follows, with $\mathbf{A} = \mathrm{bdiag}(\{\mathbf{A}_j\})$:

$$p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\!\left(\mathbf{f} \mid \mathbf{A}\mathbf{u},\ \mathbf{K}^{x,x} - \mathbf{A}\mathbf{K}^{z,x}\right), \qquad p(\mathbf{u}) = \mathcal{N}\!\left(\mathbf{0}, \mathbf{K}^{z,z}\right) \tag{17}$$

The conditional $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$ is approximated by the following variational distribution $q(\mathbf{f}, \mathbf{u})$, obtained by combining (14) with a variational prior on $\mathbf{u}$, with mean $\mathbf{m} = \mathrm{vec}(\{\mathbf{m}_j\})$ and covariance $\mathbf{S} = \mathrm{bdiag}(\{\mathbf{S}_j\})$:

$$q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\,q(\mathbf{u}), \qquad q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S}) \tag{18}$$

Formally, the variational approach aims to minimize the KL divergence from $q(\mathbf{f}, \mathbf{u})$ to $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$. When rearranged, this expression provides the evidence lower bound (ELBO) for the model, which is a function of the variational GP's parameters $\vartheta = \{\mathbf{W}, \{\lambda_j\}, \{\sigma_i\}, \{\mathbf{Z}_j\}, \mathbf{m}, \mathbf{S}\}$:

$$\ln p(\mathbf{y}) \geq \mathcal{L}(\vartheta) = \mathbb{E}_{q(\mathbf{f}, \mathbf{u})}\!\left[\ln p(\mathbf{y} \mid \mathbf{f})\right] - D_{KL}\!\left(q(\mathbf{u}) \,\|\, p(\mathbf{u})\right) \tag{19}$$
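The inducing-point conditional can be sketched as follows for a single latent; only the small $m \times m$ inducing covariance is inverted ($m = 5$ versus $N = 50$ in this illustrative example, with hypothetical names throughout):

```python
import numpy as np

def rbf(Xa, Xb, l=1.0):
    sq = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / l ** 2)

rng = np.random.default_rng(6)
X = rng.uniform(size=(50, 2))      # N = 50 training inputs
Z = X[::10]                        # m = 5 inducing locations (a subset here)
u = rng.normal(size=len(Z))        # stand-in inducing values

Kxx = rbf(X, X)
Kxz = rbf(X, Z)
Kzz = rbf(Z, Z) + 1e-8 * np.eye(len(Z))    # jitter for a stable inversion

A = Kxz @ np.linalg.inv(Kzz)       # only the m x m matrix is inverted
cond_mean = A @ u                  # E[f | u], as in (14)-(15)
cond_cov = Kxx - A @ Kxz.T         # Cov[f | u], the Schur complement
```

At inputs that coincide with inducing locations the conditional variance collapses towards zero, reflecting the fact that the inducing values summarize the process there.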
Given that $\ln p(\mathbf{y})$ does not depend on the variational parameters, maximizing the ELBO (19) is equivalent to minimizing the KL divergence term. In addition, assuming that the variational distribution $q(\mathbf{f}, \mathbf{u})$ successfully approximates $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$, the KL divergence will tend to zero, and maximizing $\mathcal{L}(\vartheta)$ becomes equivalent to maximizing the exact GP likelihood (9). A more tractable expression for $\mathcal{L}(\vartheta)$ can be obtained by evaluating the integral in (19).
where $\boldsymbol{\Phi} = \mathbf{A}\mathbf{S}\mathbf{A}^\top + \mathbf{K}^{x,x} - \mathbf{A}\mathbf{K}^{z,x}$ is the variational variance-covariance matrix of the vectorized latents $\mathbf{f}$. As in Hensman et al. (2015), the ELBO $\mathcal{L}(\vartheta)$ is optimized with stochastic gradient descent on random sub-samples of the training data $\{\mathbf{X}, \mathbf{y}\}$, further improving tractability. The consistency and convergence of variational approximations is proven by Burt et al. (2019, 2020) by extending the analysis of Seeger et al. (2008), which implies that the predictions of the variational GP converge to those of a full GP.
Once the LMC is trained on the simulated data, predictions for unseen inputs $\mathbf{X}^*$ are given by the variational distribution of $\hat{\mathbf{f}}^*$, obtained by marginalizing the inducing values $\mathbf{u}$ out of the joint variational distribution (18). Given the first-order Markov assumption in the surrogate likelihood (3) and the fact that the predictions are already conditioned on lagged observations (4), the contribution of each transition should enter the likelihood independently. As a result, only the main diagonal of the variance-covariance matrix $\boldsymbol{\Phi}$ is required to obtain the variational density of the LMC predictions $\hat{\mathbf{f}}^*$.
$$\tilde{p}(\hat{\mathbf{f}}^* \mid \mathbf{X}^*) = \mathcal{N}\!\left(\hat{\mathbf{f}}^* \mid \boldsymbol{\mu}^*, \boldsymbol{\Psi}^*\right) \tag{21}$$

where, given $\mathrm{diag}(\boldsymbol{\Phi}^*)$, a diagonal matrix containing the main diagonal of $\boldsymbol{\Phi}^*$, the mean and variance-covariance are:

$$\boldsymbol{\mu}^* = \mathbf{B}\mathbf{A}^*\mathbf{m}, \qquad \boldsymbol{\Psi}^* = \mathbf{B}\,\mathrm{diag}(\boldsymbol{\Phi}^*)\,\mathbf{B}^\top + \boldsymbol{\Sigma} \tag{22}$$

Given (21), the surrogate likelihood of the empirical data $\mathbf{X}^*$ using the LMC is:

$$\ln\tilde{p}(\mathbf{X}^* \mid \boldsymbol{\theta}^*) = \sum_{t=2}^{T}\ln\mathcal{N}\!\left(\mathbf{x}^*_t \mid \boldsymbol{\mu}^*_t, \boldsymbol{\Psi}^*_t\right) \tag{23}$$

It is straightforward to derive the gradient of the surrogate likelihood with respect to the underlying simulation model parameters $\boldsymbol{\theta}^*$. Because these form part of the inputs $\mathbf{X}^*$ used in the GP prediction, they only enter the surrogate likelihood (23) through the RBF kernel (7) used to compute $\mathbf{K}^{*,z}_j$.
The gradient of the surrogate likelihood (23) can be obtained by applying the chain rule to the standard derivative of a Gaussian likelihood, combined with the derivatives of the LMC mean and variance functions (22) and, given the definition of $\mathbf{A}^*$ in (15), the derivative of the RBF kernel (7) with respect to its inputs. In practice it is more efficient to use automatic differentiation to obtain the gradient, rather than attempting to implement these derivations directly; however, they establish that the gradient (25) exists and is consistent.
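A minimal illustration of this point: for a toy scalar surrogate mean that depends on the parameter through an RBF term, the chain-rule gradient of the Gaussian log-likelihood matches a finite-difference check (all names and values here are hypothetical, not the paper's quantities):

```python
import numpy as np

# toy surrogate mean: mu(theta) = exp(-(theta - t0)^2 / (2 l^2)), an RBF in theta
t0, l, s2, x = 0.3, 0.8, 0.2, 0.9    # kernel centre, length scale, variance, datum

def loglik(theta):
    mu = np.exp(-0.5 * (theta - t0) ** 2 / l ** 2)
    return -0.5 * np.log(2 * np.pi * s2) - 0.5 * (x - mu) ** 2 / s2

def grad_loglik(theta):
    """Chain rule: d ln p / d theta = ((x - mu) / s2) * d mu / d theta."""
    mu = np.exp(-0.5 * (theta - t0) ** 2 / l ** 2)
    dmu = mu * (-(theta - t0) / l ** 2)
    return (x - mu) / s2 * dmu

theta = 1.1
eps = 1e-6
num_grad = (loglik(theta + eps) - loglik(theta - eps)) / (2 * eps)
```

In practice an automatic differentiation framework performs this composition for the full LMC expressions, but the structure of the calculation is the same.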

Training sample design and GP surrogate training
The generation of the training data $\{\mathbf{X}, \mathbf{y}\}$ from the simulation model needs to be planned carefully. We impose a computationally constrained setting that requires this training data to be simulated prior to training the LMC surrogate, and as established in section 2.2, the latter's convergence depends on the properties of that pre-existing training data. This impacts four decisions: setting bounds $[\underline{\boldsymbol{\theta}}, \overline{\boldsymbol{\theta}}]$ on the simulation model's underlying parameter space $\Theta$, choosing a design for drawing training samples $\boldsymbol{\theta}_k$ from $\Theta$, setting the number of samples $n$ to draw, and setting the length $T$ of the time-series simulation for each $\boldsymbol{\theta}_k$.
The need for a set of bounds on $\Theta$ stems from the fact that the universal approximation property holds on a compact subset $\mathcal{C}$ of the full input space, assumed to be $\mathbb{R}^D$ in our case. Because the inputs contain $\Theta$, picking $\mathcal{C}$ involves picking a bounded interval for $\Theta$. Choosing the bounds will require the researcher's domain knowledge of the simulation model, such as the nature of the parameter involved (a share, a ratio, etc.), or prior estimates from the literature. Similarly, a critical factor in the choice of design for drawing the training samples is knowledge of the variability of the mean one-step-ahead behaviour of the model over $\mathcal{C}$. In practical terms, the convergence of the predictions is mainly controlled by $\lambda_{j^*}$, the smallest length scale from the $q$ latent RBF kernels: if the one-step-ahead model predictions are very sensitive to certain parameter values, a finer sample of points will be required to obtain a reliable surrogate, affecting either the number of samples required, or the chosen design. For instance, if model predictions are discontinuous in a given region of $\Theta$, it might be beneficial to pick an adaptive design that samples those regions more densely. This is where Bayesian optimization, which enables online acquisition of the training samples as the GP is trained, can offer an advantage.
While a large literature already details the available design options and their relative benefits (see for instance Santner et al., 2018), a few aspects need to be detailed. First, the computational constraints on model simulations imply that the chosen design must have good space-filling properties (Lamperti et al., 2018). Santner et al. (2018) recommend the Latin hypercube design (LHD) for this context, while Salle and Yıldızoğlu (2014) provide evidence that the additional orthogonality of the near-orthogonal Latin hypercube (NOLH) design (Cioppa and Lucas, 2007) improves the prediction accuracy of GP surrogates. The drawback of the LHD and NOLH is their fixed size, limiting the ability to increase the size of the training data if required. In addition, even if the samples $\boldsymbol{\theta}_k$ are orthogonal, because the training data (4) contains both $\boldsymbol{\theta}_k$ and lagged time-series data, $\mathbf{X}$ will not be orthogonal in general, limiting the attractiveness of NOLH. Sobol sequences can instead be extended easily, and while Liefvendahl and Stocki (2006) and Santner et al. (2018) show that they lead to less precise predictions compared to LHD, due to a greater range of inter-point distances, the practical difference between the two is minimal, especially as $n$ increases. Finally, a Sobol design allows an initial simulation run that has insufficient $n$ for LMC convergence to be augmented ex-post. This potentially allows for a more active sampling strategy, where additional training samples can be drawn from targeted locations in $\Theta$, using for example Bayesian optimization. This is not implemented in this work, as the aim is to establish whether BEGRS can still perform in the most constrained case, but it is discussed in the conclusion.
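Space-filling parameter designs of this kind are available in `scipy.stats.qmc`. The sketch below draws a scrambled Sobol design over hypothetical parameter bounds, and shows that further points can be drawn from the same engine to extend the design ex-post; a Latin hypercube alternative is included for comparison:

```python
import numpy as np
from scipy.stats import qmc

D_theta = 4
lower = np.array([0.0, 0.5, 0.1, 1.0])   # hypothetical parameter bounds
upper = np.array([1.0, 2.0, 0.9, 5.0])

# Sobol sequence: good space-filling, and extensible after the fact
sobol = qmc.Sobol(d=D_theta, scramble=True, seed=7)
thetas = qmc.scale(sobol.random(n=64), lower, upper)   # initial design
extra = qmc.scale(sobol.random(n=64), lower, upper)    # later augmentation

# Latin hypercube alternative: good inter-point distances, but fixed size
lhs = qmc.LatinHypercube(d=D_theta, seed=7)
thetas_lhs = qmc.scale(lhs.random(n=64), lower, upper)
```

Continuing to draw from the same Sobol engine preserves the low-discrepancy structure of the combined sample, which is what makes ex-post augmentation of the design possible.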
In addition to picking a design, one needs to choose the number of samples $n$ and the length of the simulated time series $T$, which together determine the number of training observations $N = n(T-1)$. First, $T$ should be as large as possible given the computational constraint, in order to improve the convergence of the surrogate. Second, $n$ should also be as large as possible, to ensure good space-filling of $\Theta$. The strategy adopted for the applications in section 4 is to pick $T$ to match the number of observations in the empirical data, and infer $n$ from the feasible number of $T$-length simulations given the computational budget.
Once the training data is available, the next step is to pick the number of latent variables to use in the LMC and the number of inducing points in the variational approximation. The number of latent variables can be set using principal component analysis (PCA) or a factor analysis of the simulated data, to determine the minimal number of latent variables required to summarize it. Equation (6) implies that the training data contains additive variable-specific noise on top of the LMC predictions, suggesting that factor analysis might be preferable to the strict orthogonal decomposition provided by PCA. Note that in either case, the loadings obtained will generally not match the LMC loadings. This is because the latent variables in the LMC capture the correlations in the training inputs at different length scales, a constraint that factor analysis or PCA do not possess.
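A minimal sketch of this selection step, using scikit-learn's PCA on the stacked simulated outputs. The helper name and the 95% variance threshold are illustrative assumptions, not the paper's procedure:

```python
import numpy as np
from sklearn.decomposition import PCA

def suggest_num_latents(Y, threshold=0.95):
    """Suggest the number of LMC latent variables as the smallest
    number of principal components explaining `threshold` of the
    variance in the simulated outputs Y (rows = observations)."""
    pca = PCA().fit(Y)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)
```

As noted above, a factor analysis (e.g. `sklearn.decomposition.FactorAnalysis`) may be preferable when the variable-specific noise is sizeable, since PCA forces a strict orthogonal decomposition.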
Less guidance is available for picking the number of inducing points. The $\mathcal{O}((\log N)^d)$ bound of Burt et al. (2019, 2020) is a scaling guarantee, not a method for calculating the number of inducing points required. In addition, if the $d$-dimensional training inputs fall on a lower-dimensional manifold embedded in the input space, then the Burt et al. (2019, 2020) bound is driven by the dimension of that manifold. This is likely the case here, given that the training input contains lagged values of the observables, themselves assumed to be reducible to a smaller number of latent variables.
The variational approximation will converge once a sufficient number of inducing points is included; the practical approach suggested in the literature is therefore to run the GP regression several times with an increasing number of inducing points, and identify the threshold beyond which adding points no longer improves the ELBO (20).
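The loop over inducing-point counts can be summarized by a small helper that locates the saturation point of the ELBO. The ELBO values below are made up for illustration:

```python
def inducing_point_threshold(elbo_by_m, tol=1.0):
    """Given {num_inducing_points: final ELBO} from repeated GP
    regression runs, return the smallest number of inducing points
    beyond which adding more no longer improves the ELBO by more
    than `tol` nats."""
    ms = sorted(elbo_by_m)
    for m_prev, m_next in zip(ms, ms[1:]):
        if elbo_by_m[m_next] - elbo_by_m[m_prev] < tol:
            return m_prev
    return ms[-1]

# e.g. the ELBO flattens between 125 and 250 inducing points:
inducing_point_threshold({60: -5400.0, 125: -5100.0, 250: -5099.5})  # returns 125
```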
Given that the LMC surrogate will only be useful if it converges to the predictions of the first-order Markov approximation (3), and given the various issues listed here that may affect this convergence, it is recommended that the convergence of the BEGRS posterior be assessed using the simulated Bayesian computation (SBC) approach of Talts et al. (2018), which extends the simulated validation method proposed by Cook et al. (2006). This makes it possible to identify any critical convergence issue in the first (training) stage of BEGRS, prior to carrying out the estimation on empirical data.
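The core of the SBC diagnostic is a rank statistic: for each test parameterization, the rank of the true parameter value among the (thinned) posterior draws, which should be uniform when the posterior is well calibrated. A minimal numpy sketch, with the function name and array layout as assumptions for illustration:

```python
import numpy as np

def sbc_ranks(theta_true, theta_post, thin=1):
    """SBC rank statistics in the spirit of Talts et al. (2018):
    count how many (thinned) posterior draws fall below each true
    parameter value.

    theta_true: (n_test, d) true parameters.
    theta_post: (n_test, n_draws, d) posterior samples per test set.
    Returns an (n_test, d) array of integer ranks.
    """
    draws = theta_post[:, ::thin, :]
    return (draws < theta_true[:, None, :]).sum(axis=1)
```

A uniform rank histogram indicates calibration, while a U-shape (mass piling up at both extremes) indicates an under-dispersed posterior, the pattern discussed in the applications below.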

Minimal prior, posterior estimation and identification
The purpose of the GP surrogate is to provide a low-cost approximation of the simulation model's likelihood, given empirical data and a candidate parameter vector $\boldsymbol\theta^*$, enabling the use of Bayesian methods. The availability of the surrogate likelihood's gradient (25) means that, assuming a differentiable prior, the posterior can be explored with gradient-based methods, such as Hamiltonian Monte Carlo (HMC).
While the specific choice of parameter prior $p(\boldsymbol\theta^*)$ is up to the researcher, the use of a GP regression surrogate imposes a minimal requirement. GP regression converges to the true function only on a compact subset of the input space, set using bounds $[\underline{\boldsymbol\theta}, \overline{\boldsymbol\theta}]$ on the parameter space $\Theta$. The kernel (7), however, is defined over the entire input space and serves as the GP's prior over the space of all continuous functions. This is illustrated in Fig. 1(a), where the GP prior is defined over $\mathbb{R}$, whereas in Fig. 1(d) the GP posterior only converges to the true data-generating process in the bounded interval containing the training data. Outside of that interval, the GP prior entirely determines the GP posterior. This results in a surrogate likelihood defined over a space larger than $\Theta$, but trained only for values of $\boldsymbol\theta \in [\underline{\boldsymbol\theta}, \overline{\boldsymbol\theta}]$.

In principle, a continuous uniform prior over $[\underline{\boldsymbol\theta}, \overline{\boldsymbol\theta}]$ is sufficient to restrict estimates of $\boldsymbol\theta^*$ to those bounds. In practice, however, the discontinuity at the boundary creates problems for gradient-based algorithms. Instead, to ensure that the gradient of the prior is defined at the boundary of the parameter space itself, we recommend using a smooth relaxation of the uniform distribution in the form of a double sigmoid function, where a positive slope parameter controls the steepness at the boundary. This ensures that estimates of $\boldsymbol\theta^*$ remain inside the bounds, while also ensuring that gradient-based algorithms do not stall on the bounds themselves. The corresponding log prior and log prior gradient can be used alongside the log-likelihood and its gradient.
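The exact functional form of the minimal prior (28) is given in the paper; the sketch below implements one standard double-sigmoid relaxation with slope parameter `k`, as an illustration of the idea rather than the paper's exact specification:

```python
import numpy as np

def log_sigmoid(z):
    # numerically stable log(sigmoid(z))
    return -np.logaddexp(0.0, -z)

def minimal_log_prior(theta, lower, upper, k=20.0):
    """Smooth relaxation of a uniform prior on [lower, upper] via a
    double sigmoid with slope parameter k (up to an additive
    normalizing constant). Returns (log prior, gradient)."""
    theta = np.asarray(theta, dtype=float)
    z_lo = k * (theta - np.asarray(lower, dtype=float))
    z_hi = k * (np.asarray(upper, dtype=float) - theta)
    logp = float(np.sum(log_sigmoid(z_lo) + log_sigmoid(z_hi)))
    # d/d(theta) log sigmoid(z_lo) = k * sigmoid(-z_lo)
    # d/d(theta) log sigmoid(z_hi) = -k * sigmoid(-z_hi)
    grad = k / (1.0 + np.exp(z_lo)) - k / (1.0 + np.exp(z_hi))
    return logp, grad
```

The gradient remains finite on the bounds themselves, which is the property required for HMC or BFGS not to stall there.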
Even with a well-defined prior gradient on the parameter space boundaries, the fact that the LMC surrogate is defined on a compact subset of the full input space creates an identification issue specific to BEGRS, in addition to existing identification issues such as flat or multi-modal likelihoods, or a lack of convergence during LMC training. More informative priors can also be used, the only requirement being that they are continuous across the bounds $[\underline{\boldsymbol\theta}, \overline{\boldsymbol\theta}]$ and vanish outside of them. One must take care, however, when integrating prior information, that the resulting prior does not overpower the surrogate likelihood. The latter will potentially be flatter than the true, unobserved, likelihood of the model, because the GP prediction contains a prediction error in addition to the standard noise term. This prediction error might be sizeable when the computational constraint restricts the amount of training data available.

Applications
Two applications are provided to illustrate the Bayesian estimation with Gaussian regression surrogate (BEGRS) framework, both of which were carried out using the companion Python toolbox developed for the methodology. The toolbox uses the GPyTorch implementation of GPs of Gardner et al. (2018) and is available from github.com/Sylvain-Barde/begrs. The code for replicating the VARMA and ABM exercises is available from github.com/Sylvain-Barde/begrs_varma and github.com/Sylvain-Barde/begrs_sfc respectively.

Parameter recovery on known data generating process
This first application aims to verify that the BEGRS framework can indeed produce consistent posteriors with a small number of simulation runs relative to the dimensionality of the parameter space. This is done by running a Monte Carlo exercise where BEGRS is used to estimate the autoregressive and moving average matrices of the VARMA(1,1) specification given in (29). This specification is picked because VARMA models are known to be challenging to estimate in the first place (see for instance Lütkepohl, 2005; Dias and Kapetanios, 2018), and because the Vecchia approximation (3) used in the BEGRS likelihood function will discard information relative to the full likelihood for (29), as this specification is not a first-order Markov process. In addition to the full VARMA(1,1) estimation, in a second setting the moving average matrix is set to a null matrix, reducing (29) to a VAR(1), which does possess the first-order Markov property. This results in two experimental settings, one where BEGRS is expected to do well by construction, and a second that is more challenging, as it is explicitly misspecified.
In order to keep the number of estimated parameters similar in both cases, we set the number of observable variables to 3 in the VARMA(1,1) case and 4 in the VAR(1) case, leading respectively to 18 and 16 parameters to be estimated. Only the autoregressive and moving average matrices are estimated via BEGRS, as the additive noise is captured directly during the training of the GP regression. Instead, the level of correlation $\rho$ between the noise terms of the variables forms part of the setting of the Monte Carlo exercise, allowing verification that the LMC can cope with correlated additive noise. In each case, the training dataset consists of $N = 1000$ series of $T = 200$ observations, resulting in $N(T-1) = 199{,}000$ training observations. The training series were each generated with distinct coefficient matrices, with individual entries drawn from the $[-0.7, 0.7]$ range using Sobol sequences. Each coefficient matrix was checked to ensure that its eigenvalues lie within the unit circle: 1233 and 1166 Sobol draws were required to obtain 1000 stable parametrizations in the VAR(1) and VARMA(1,1) cases respectively.
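The stability screening described above can be sketched as a rejection-sampling loop over the Sobol sequence. This is a simplified illustration, with names and structure not taken from the replication code:

```python
import numpy as np
from scipy.stats import qmc

def draw_stable_varma(n_required, v=3, bound=0.7, seed=0):
    """Rejection-sample VARMA(1,1) coefficient matrices (A, B) from a
    Sobol design, keeping only parameterizations whose eigenvalues lie
    inside the unit circle (stability and invertibility check)."""
    d = 2 * v * v                       # free entries of A and B
    sampler = qmc.Sobol(d=d, scramble=True, seed=seed)
    lo, hi = np.full(d, -bound), np.full(d, bound)
    kept, n_drawn = [], 0
    while len(kept) < n_required:
        draw = qmc.scale(sampler.random(1), lo, hi).reshape(2, v, v)
        n_drawn += 1
        A, B = draw[0], draw[1]
        if (np.max(np.abs(np.linalg.eigvals(A))) < 1.0
                and np.max(np.abs(np.linalg.eigvals(B))) < 1.0):
            kept.append((A, B))
    return kept, n_drawn
```

The ratio of `n_required` to `n_drawn` gives the acceptance rate, which explains why slightly more than 1000 Sobol draws were needed to obtain 1000 stable parametrizations.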
A testing set was generated by drawing an additional 1000 stable Sobol parameterizations and simulating them for $T = 200$ observations. The use of the same Sobol sequence as the training data ensures that the testing samples are located in between the training samples and are therefore distinct from them. In a first analysis, one of these parameterizations was used as empirical data in a parameter recovery exercise, using both BEGRS and a standard benchmark, approximate Bayesian computation with sequential Monte Carlo (ABC-SMC). The ABC-SMC estimations all used 5000 particles iterated over 40 generations. The fastest estimation required 966,596 simulation calls and the slowest 1,561,513, to be compared with the 1000 simulations used by BEGRS. As a second step, an SBC diagnostic was run on the full testing set, in order to provide a more general evaluation. Both analyses used 60, 125 and 250 inducing points as a sensitivity check for the surrogate training.
The main results for both analyses are presented in Figs. 2 and 3 for the $\rho = 0$ case. Results for the $\rho = 0.25$ and $\rho = 0.5$ cases, which are very similar, are provided in online appendix B. In the case of the VAR(1), where the Vecchia approximation (3) to the likelihood is valid, Fig. 2(a) shows that BEGRS can perform on par with ABC-SMC, using a simulation budget that is three orders of magnitude smaller. Performance degrades in the 60 inducing point setting, but in a graceful way: rather than affecting all estimates, consistency fails for the last row of the autoregressive matrix but is maintained for the others. This is symptomatic of the GP surrogate being unable to provide good predictions for the corresponding variable, suggesting that the LMC has not converged for that variable. The SBC analysis in Fig. 2(b) shows that this conclusion holds more generally for the full posterior.
As expected, the VARMA(1,1) setting is more challenging. Several key findings emerge from the posterior distributions in Fig. 3(a) and (b). First, the intrinsic difficulty of estimating the parameters is visible from the much wider distributions produced by ABC-SMC. Second, the BEGRS posterior distributions often recover the parameters and sometimes match those obtained with ABC-SMC. However, they also fail to recover the true parameter values in several cases, seemingly because they are narrower than the ABC-SMC benchmark. Again, the SBC analysis confirms this: the very pronounced U-shape of the rank histograms, which remains even after thinning the posterior samples to eliminate autocorrelation, is characteristic of a posterior that is too narrow.

Empirical application to a large-scale ABM
The second application is an empirical estimation of the free parameters of the Caiani et al. (2016) model, offering a good illustration of the type of computationally constrained setting that BEGRS is designed for. This model was developed following the financial crisis of 2008-2009 to improve understanding of the endogenous emergence of financial shocks and their contagion to the real economy. The model combines a stock-flow consistent approach with fully-fledged commercial and financial networks between households, banks and vertically differentiated firms. While this creates high computational requirements, it enables both the replication on aggregate of the deep recessions that follow severe financial crises, and the analysis at a disaggregated level of the mechanisms that drive them. The model has been used to analyse the effect of fiscal policy design on inequality (Caiani et al., 2019), the effect of fiscal targets in a monetary union (Caiani et al., 2018) and the transmission of monetary policy (Schasfoort et al., 2017).
The model's computational requirements also make it a valuable testbed for validation methodologies aimed at large-scale models, for example simulated model selection (Barde, 2020). Full estimation of the model's free parameters from empirical data has never been carried out; instead the literature has relied on the replication of stylized facts combined with coarse-grid sensitivity analyses on a small subset of the free parameters. By contrast, we show here that BEGRS can successfully estimate all the free parameters from US macroeconomic time-series data, using only 1000 training simulation runs. In order to illustrate the ability of BEGRS to re-use the same surrogate in multiple empirical estimations, the analysis uses two distinct datasets. The first is the Smets and Wouters (2007) dataset, with $T = 160$ quarterly observations from 1965 to 2004 for the deviation of labour hours from trend $L$, the real policy rate $r$, the rate of inflation $\pi$, and the real first log differences of output $\Delta y$, consumption $\Delta c$, investment $\Delta i$ and wages $\Delta w$. The second is a shorter ($T = 82$) dataset covering the 1997-2017 period, containing the 1997, 2001 and 2008 crises and the subsequent period of low inflation and near-zero interest rates.
The model has 12 free parameters, presented in Table 1, that are not set from direct observation or inferred from the steady-state constraints. The bounds on the first 5 parameters are those of the Caiani et al. (2016) sensitivity analysis, while the bounds on the remaining 7 parameters are set to allow reasonable variation around the values used in the original paper. In cases where a parameter is naturally restricted to $[0, 1]$, the bounds were offset from those natural bounds to avoid pathological behaviour in the simulations. Two simulated datasets were generated, one for training the GP surrogate and the other for an SBC diagnostic. Both contain $N = 1000$ series of $T = 200$ observations, each using samples drawn from the same 12-dimensional Sobol sequence. Each simulation run contains 500 observations, including a 300-observation burn-in. The average wall time per run was 44 minutes, and the total wall time for the $N = 1000$ runs using a 36-core HPC node was 21 hours. The LMC surrogate was trained using 4 latent variables and 250 inducing points, and the estimations used the minimal prior (28), with the slope parameter set to 20.
The last four columns of Table 1 show two sets of BEGRS estimates for each of the empirical datasets: the mode of the posterior, obtained using BFGS, and the mean of the posterior, estimated from 10,000 NUTS iterations. The marginal frequencies of the NUTS iterations, shown in Fig. 4(a), confirm that the estimates for several parameters lie very close to the boundaries of the parameter space; one parameter in particular probably suffers from the specific identification problem discussed in section 3.2. Fig. 4(b) presents the corresponding SBC analysis. The rank distributions for all parameters clearly depart from uniformity, with the characteristic U-shape that indicates under-dispersion of the posterior. This deviation is not as severe, however, as the one exhibited by the VARMA estimation in Fig. 4(c) and (d), which suggests that the surrogate, while imperfect, might nevertheless be useful.

Given these potential identification and convergence problems, as well as the relatively wide marginal distributions of some parameters, one may question whether the BEGRS estimates improve on the original calibration. In order to evaluate this, we first replicate the analysis of Barde (2020) and use the multivariate Markov information criterion (MIC) of Barde (2017) for both the original calibration and the BEGRS estimates. The MIC is an extension of the Akaike information criterion (AIC) to simulation models, in that it provides an unbiased estimate of the cross-entropy between a set of simulated and empirical data series. Relative MIC scores across simulations thus provide a measurement of the relative KL divergence between the simulation models on a given empirical dataset. The scores obtained for each calibration, both at the variable and aggregate levels, are provided in Table 2, alongside the (negative) log-posterior provided by the BEGRS surrogate in the last column. The aggregate MIC score is in agreement with the log posterior, and confirms that the BEGRS estimates significantly improve the aggregate goodness-of-fit of the model on both the Smets and Wouters (2007) and crisis datasets compared to the original calibration.
At the variable level, Barde (2020) showed that for the Smets and Wouters (2007) dataset, the original calibration was characterized by reasonable performance on the aggregate quantities ($L$, $\Delta y$, $\Delta c$ and $\Delta i$) but a very poor fit on the aggregate prices ($r$, $\pi$ and $\Delta w$). Both sets of BEGRS estimates improve the fit on these variables, at the cost of a slightly worse performance on $\Delta i$. The situation is slightly different on the crisis dataset, where the fit on $r$ and $\Delta w$ still improves, but worsens on nearly all the other variables, particularly inflation $\pi$.
In order to illustrate the origins of these trade-offs in fit, Fig. 5 plots the unconditional densities of the observable variables produced by the simulation model for the various parameterizations, below box plots of the corresponding empirical data. These reveal that the poor fit of the original ABM calibration on the Smets and Wouters (2007) data stems from the fact that it produces distributions that are narrower than their empirical counterparts, particularly for $r$, $\pi$ and $\Delta w$. By contrast, the BEGRS estimates produce wider distributions on all variables, improving the fit on the SW dataset with the exception of investment $\Delta i$. In this case, the original calibration already provides a relatively wide distribution, and the wider distribution brought on by the BEGRS estimates is detrimental to the fit. Similarly, the poorer fit of certain variables on the crisis dataset is explained by the fact that their empirical distributions are much narrower, to the point that for certain variables (such as $\pi$), the original calibration outperforms the wider BEGRS calibrations. Finally, the multimodality of some distributions reveals that these wider simulated distributions are achieved by the model switching between two regimes, one with higher inflation, growing investment and real wages, and one with falling investment and wages and zero or negative inflation.

Conclusion
This paper presents and tests a Bayesian estimation framework specifically designed for computationally demanding time-series simulation models, in order to make the best use of a potentially limited amount of training data. Core to this are the use of a one-step-ahead predictor, in the form of a first-order Markov approximation, which leverages the time dimension of the simulated data to increase the number of training observations, and the use of GP regression as the surrogate model. This choice is driven by the desirable theoretical properties of the GP surrogate, notably the universal approximation property, proven convergence under minimal assumptions on the underlying model, and the self-regularization property of GP estimation, which limits overfitting on potentially limited training data. The methodology's two-stage nature ensures that the same surrogate can be re-used for different empirical datasets, further amortizing costly simulations. The paper extends a pre-existing result on GP regression to show that the method is consistent, and illustrates its functionality by providing the first empirical estimation of all the free parameters of a large-scale, computationally intensive ABM.

Two limits merit highlighting. First, the BEGRS framework nests two distinct Bayesian estimations, the first estimating the GP parameters from the training data, the second using the resulting GP surrogate to estimate the simulation model parameters from the empirical data. In addition to the standard identification of the model parameters, one also needs to verify that the surrogate likelihood generated by the GP is itself valid. This involves checking that the first-stage estimation of the GP has converged, but also verifying the validity of the first-order Markov approximation. A key recommendation is to always run an SBC analysis as a diagnostic tool on the posterior, to assess whether the BEGRS surrogate accurately approximates the model's behaviour. Second, the BEGRS
estimation framework does not offer a panacea: a large amount of simulation data is still required, increasingly so as the dimensionality of the parameter space $\Theta$ grows beyond the baseline of 10-20 parameters used here. These caveats suggest two directions for future research. In principle, the convergence result obtained here extends to higher-order Markov processes, which means that the methodology could be extended to training data containing longer memories, at the cost of a larger input space for the GP. A first objective, therefore, is to investigate dimensionality reduction methods that can be applied to larger input spaces, caused either by larger parameter spaces or by more lags of memory. A second direction is the inclusion of active acquisition strategies for drawing training samples from the parameter space, along the lines of Bayesian optimisation. In order to maintain the re-usability of the surrogate across multiple empirical datasets, the acquisition function cannot depend on the empirical data, for example by focusing on regions with high likelihood. Instead, an interesting option is the one used in Lamperti et al. (2018), where additional training parameter samples are drawn from regions of the parameter space where the predictive performance of the surrogate on the training data itself is poor. This could further reduce the simulation requirements, particularly when the simulation model's output response is discontinuous across the input space, which slows the convergence of the current approach.

Appendix B. Proof for the bound on LMC regret
Mercer's theorem establishes that a positive semi-definite kernel $k(\mathbf{x}_i, \mathbf{x}_j)$ possesses an eigenfunction expansion (30),
$$k(\mathbf{x}_i, \mathbf{x}_j) = \sum_{h=0}^{\infty} \lambda_h\, \psi_h(\mathbf{x}_i)\,\psi_h(\mathbf{x}_j),$$
where $\lambda_0 \ge \lambda_1 \ge \lambda_2 \ge \dots \ge 0$ are the eigenvalues associated with a set of orthogonal eigenfunctions $\psi_h(\cdot)$. This implies that the variance-covariance matrices produced by evaluating the kernel on the $N$ input observations can be written as the limit of a sequence
$$\mathbf{K}_N = \lim_{H\to\infty} \boldsymbol{\Psi}_{\mathbf{x}}\,\boldsymbol{\Lambda}_H\,\boldsymbol{\Psi}_{\mathbf{x}}^{\top},$$
where $\boldsymbol{\Psi}_{\mathbf{x}}$ is an $N\times H$ matrix whose columns are the values produced by the $h$th eigenfunction on the $N$ data points and $\boldsymbol{\Lambda}_H$ is an $H\times H$ diagonal matrix containing the eigenvalues $\lambda_h$. The kernel is assumed to be a Hilbert-Schmidt operator, therefore $\sum_h \lambda_h^2 < \infty$, ensuring convergence of (30). Let $\mathcal{H}$ be the Hilbert space of all functions $f$ defined as linear combinations of these eigenfunctions using $1\times H$ weight vectors $\boldsymbol{\alpha}$. Given a second function $g \in \mathcal{H}$ with weights $\boldsymbol{\beta}$, the inner product of this Hilbert space is defined as $\langle f, g\rangle_{\mathcal{H}} = \lim_{H\to\infty}\boldsymbol{\alpha}\,\boldsymbol{\Lambda}_H^{-1}\boldsymbol{\beta}^{\top}$. From this it is possible to see that $\mathcal{H}$ meets the two criteria required of a reproducing kernel Hilbert space (RKHS). The relevance of the RKHS to GP regression is that the predicted mean function of the LMC surrogate (10) for an empirical input is a linear combination of the kernel eigenfunctions (31) and is therefore in the RKHS. The eigenexpansion (30) therefore underpins the result of Micchelli et al. (2006) that the RKHS is universal if the kernel is universal, and the following convergence result of Seeger et al. (2008), which bounds the regret of a kernel.

Lemma 1 (Seeger et al., 2008). Let $k$ be a kernel possessing an eigenexpansion (30) and $\mathbf{K}_N$ be the covariance matrix resulting from applying the kernel to a dataset containing $N$ observations drawn from density function $p(\mathbf{x})$. Then, given a constant $\sigma^2 > 0$:
$$\mathbb{E}_{p(\mathbf{x})}\!\left[\ln\left|\mathbf{I} + \sigma^{-2}\mathbf{K}_N\right|\right] \le \sum_{h=0}^{\infty}\ln\!\left(1 + \sigma^{-2} N \lambda_h\right).$$
The resulting bound is the log determinant of a diagonal matrix, which is just the sum of the logarithms of the diagonal entries. □

In Seeger et al.
(2008) this lemma directly provides the proof for the bound on regret: changing the order of summation in the determinant using Sylvester's identity provides a bound expressed as an infinite sum over the eigenvalue spectrum of the kernel. This strategy does not work directly for the LMC kernel, as its eigenexpansion (30) cannot be determined from the spectra of the individual kernels in the sum. The following lemma shows, however, that the regret of the LMC kernel is bounded by a fixed multiple of the regret of the worst-performing kernel in the linear combination.

Lemma 2. Let $\mathbf{K}_N = \sum_{q\in Q} w_q \mathbf{K}_{N,q}$ be a linear combination of $|Q|$ distinct $N\times N$ kernel matrices $\mathbf{K}_{N,q}$ with weights $w_q > 0$, where $Q$ is the set of indices identifying each kernel matrix and weight. Given $\sigma^2 > 0$, the following bound exists:
$$\ln\left|\mathbf{I} + \sigma^{-2}\mathbf{K}_N\right| \le |Q|\,\max_{q\in Q}\,\ln\left|\mathbf{I} + \sigma^{-2}\,|Q|\,w_q\,\mathbf{K}_{N,q}\right|.$$
First, the block diagonal structure of $\mathbf{K}_N$ implies that Fischer's determinant inequality bounds the regret by the sum of the log determinants of the main diagonal blocks, each a linear combination of $|Q|$ kernel matrices. In the second step, Lemma 2 bounds each log determinant by a multiple of the regret of the worst-performing kernel $q^*$ for that variable. Because $q^*$ can differ across the observable variables, the third bound is obtained by identifying the variable with the largest regret term. At this point, the regret is bounded by an expression containing the regret of a single kernel matrix $\mathbf{K}_{N,q^*}$. Lemma 1 provides the bound on the expectation of that regret, completing the proof. □
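The first step relies on Fischer's determinant inequality for positive semi-definite block matrices, $\det(\mathbf{M}) \le \det(\mathbf{A})\det(\mathbf{D})$ for the main diagonal blocks $\mathbf{A}$ and $\mathbf{D}$. A quick numerical sanity check of that inequality, using a generic random PSD matrix rather than the LMC kernel itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random positive semi-definite matrix with a 2x2 block partition
X = rng.standard_normal((8, 8))
M = X @ X.T + 1e-6 * np.eye(8)
A, D = M[:4, :4], M[4:, 4:]

# Fischer: the log determinant of a PSD matrix is bounded by the
# sum of the log determinants of its main diagonal blocks.
lhs = np.linalg.slogdet(M)[1]
rhs = np.linalg.slogdet(A)[1] + np.linalg.slogdet(D)[1]
assert lhs <= rhs + 1e-9
```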

Fig. 1. Illustration of the surrogate GP likelihood. (For interpretation of the colours in the figures, the reader is referred to the web version of this article.)

Figs. 1(b) and (e) illustrate the well-behaved case where an empirical observation is consistent with the range of simulated outputs in the training data. The resulting surrogate likelihood remains non-vanishing outside of the parameter boundaries, but the posterior has two clear modes consistent with the likelihood. In Figs. 1(c) and (f) the empirical observation is instead out of line with the training data. Here the diffuse GP prior provides a better fit than the GP posterior, pulling the mode of the likelihood outside of the bounds $[\underline{\boldsymbol\theta}, \overline{\boldsymbol\theta}]$. The application of the minimal prior (28) results in a distinctive degenerate distribution at the parameter boundary itself, shown in Fig. 1(f).
The expected regret can now be bounded using Jensen's inequality and the concavity of the logarithm. The final expression for the bound relies on the fact that the orthogonality of the kernel eigenfunctions ensures that $\mathbb{E}_{p(\mathbf{x})}\!\left[\boldsymbol{\Psi}_{\mathbf{x}}^{\top}\boldsymbol{\Psi}_{\mathbf{x}}\right] = N\mathbf{I}$, so that the expected log determinant reduces to the sum of the logarithms of the diagonal entries. □

Proof of Lemma 2. As $\mathbf{K}_N$ is a linear combination of $|Q|$ kernels, the log determinant can be decomposed into $|Q|$ factors. In the first bound, the smallest eigenvalue of each factor is used; because each $\mathbf{K}_{N,q}$ is positive semi-definite, the relevant eigenvalues are real-valued and bounded below by $1/|Q|$, which provides the second bound. Finally, setting $q^* = \arg\max_{q\in Q}$ over the resulting log-determinant terms yields the stated bound. □

Lemmas 1 and 2 are sufficient to prove the bound on the expected regret of the LMC.

Proposition 1. The expected regret of the LMC has the following upper bound, where $q^*$ indicates the latent GP variable possessing the largest regret:
$$\mathbb{E}_{p(\mathbf{x})}\!\left[R_N\right] \le v\,|Q|\sum_{h=0}^{\infty}\ln\!\left(1 + \sigma^{-2}\,|Q|\,w_{q^*}^2\,N\,\lambda_{q^*,h}\right).$$

Proof. Rearranging the regret explicitly reveals the block diagonal structure of $\mathbf{K}_N$ induced by the use of the Kronecker product on the latent $\mathbf{K}_{N,q}$; the argument then proceeds as described above, using Fischer's inequality, Lemma 2 and Lemma 1. □

$p(\mathbf{Y}^*_t \mid \mathbf{Y}^*_{t-1}, \boldsymbol{\theta}^*, \mathbf{Y}, \boldsymbol{\theta}, \boldsymbol{\phi})$ is the one-step-ahead density for observation $\mathbf{Y}^*_t$, conditional on $\mathbf{Y}^*_{t-1}$ and $\boldsymbol{\theta}^*$, provided by a surrogate model with internal parameters $\boldsymbol{\phi}$, optimized on a simulated training set $\mathbf{Y}$ generated from parameter samples $\boldsymbol{\theta}$.
Following Conti and O'Hagan (2010), the training inputs consist of calibration inputs (the parameter samples) and variable inputs (the lagged observations). This maximizes the amount of training data available for the GP surrogate from the $N$ simulation runs by treating each individual observation as a training data point, barring those lost by taking a time lag. Where necessary, individual rows of the simulated and empirical data, i.e. individual observations, are drawn from the $d$-dimensional input space. It is also convenient to define the vectorizations of the simulated and empirical data, with the first row removed to allow for the time lag. The multivariate nature of the surrogate is handled using a Linear Model of Coregionalization (LMC) (Alvarez et al., 2012), which builds on the multi-output emulation of Conti and O'Hagan (2010). The LMC generates predictions for the observable variables based on a smaller number of latent variables, each modelled by a standard univariate GP, with the distribution of the LMC prediction obtained by combining the Gaussian latent predictions through a weights matrix.

Information consistency requires that the expected KL divergence (11) per training draw goes to zero as the size of the training set increases. Crucially, because the RKHS norm of the mean function is finite, $\|\boldsymbol{\mu}\|^2_{\mathcal{H}} < \infty$, the first term vanishes asymptotically and the scaling behaviour of the regret term (12) alone determines the information consistency. Several results, including Seeger et al. (2008), have established an upper bound on the expected regret of $\mathcal{O}((\log N)^{d+1})$ for univariate GP regression with a $d$-dimensional training space. Proposition 1, below, shows that this bound extends to the LMC setting.
Proposition 1. The expected regret of the LMC (12) has the following upper bound, where $q^*$ indicates the latent GP variable possessing the largest regret:
$$\mathbb{E}_{p(\mathbf{x})}\!\left[R_N\right] \le v\,|Q|\sum_{h=0}^{\infty}\ln\!\left(1 + \sigma^{-2}\,|Q|\,w_{q^*}^2\,N\,\lambda_{q^*,h}\right).$$
As long as the number of inducing variables scales as $\mathcal{O}((\log N)^d)$, the bound carries over to the sparse variational approximation.

Table 1
Caiani et al. (2016) free parameter estimates. $-\ln p$ is the negative log posterior provided by the surrogate for a parameterization. The aggregate MIC score over a dataset is not the sum of the variable-level scores, as the former discards the mutual information between variables to avoid double-counting, and its scale is not directly comparable to $-\ln p$. Estimates for some parameters, such as the risk aversion, the intensities of choice, the haircut parameter or the unemployment threshold, are close to the values used by the original authors, while others, such as the expectation parameter or the folded normal parameter, diverge clearly.