An Extended Empirical Saddlepoint Approximation for Intractable Likelihoods

The challenges posed by complex stochastic models used in computational ecology, biology and genetics have stimulated the development of approximate approaches to statistical inference. Here we focus on Synthetic Likelihood (SL), a procedure that reduces the observed and simulated data to a set of summary statistics, and quantifies the discrepancy between them through a synthetic likelihood function. SL requires little tuning, but it relies on the approximate normality of the summary statistics. We relax this assumption by proposing a novel, more flexible, density estimator: the Extended Empirical Saddlepoint approximation. In addition to proving the consistency of SL, under either the new or the Gaussian density estimator, we illustrate the method using two examples. One of these is a complex individual-based forest model for which SL offers one of the few practical possibilities for statistical inference. The examples show that the new density estimator is able to capture large departures from normality, while being scalable to high dimensions, and this in turn leads to more accurate parameter estimates, relative to the Gaussian alternative. The new density estimator is implemented by the esaddle R package, which can be found on the Comprehensive R Archive Network (CRAN).


Introduction
Synthetic Likelihood (SL) (Wood, 2010) is a simulation-based inferential procedure similar to Approximate Bayesian Computation (ABC) methods (Beaumont et al., 2002), but with the practical advantage of requiring much less tuning.In Wood (2010) tuning is avoided partly through a Gaussian assumption on the distribution of the statistics used to compare the data and the model output.This assumption can be problematic, as illustrated by the following very simple population dynamic model of an organism subject to boom and bust dynamics with stochastic external recruitment: where ǫ t ∼ Pois(β) is a stochastic arrival process, with rate β > 0, and t = 1, . . ., T. The population N t grows stochastically at rate r > 0, but it crashes if the carrying capacity κ is exceeded.The severity of the crash depends on the survival probability α ∈ (0, 1).Two fairly natural statistics for this model are the population mean and the number of periods during which N t ≤ 1, the latter being useful for identifying β.However, as Figure 1 shows, the distribution of these statistics is far from normal, which could affect the accuracy of the parameter estimates produced by SL.The purpose of this paper is to develop a version of SL that relaxes the normality requirement, while retaining the tuning free advantages of the original method.We do this by replacing the Gaussian assumption with a new density estimator: the Extended Empirical Saddlepoint (EES) estimator.We prove the consistency of the resulting parameter estimator, and illustrate that the method can yield substantial inferential improvements when multivariate Gaussianity is untenable.We also provide examples where ABC methods would require exceedingly low tolerances and low acceptance rates in order to achieve equivalent accuracy.
The most important commonality between SL and ABC methods is that both base statistical inference on a vector of summary statistics, s 0 = S(y 0 ), rather than on the full data, y 0 .However, while ABC methods explicitly aim at sampling from the approximate posterior p(θ|s 0 ), SL provides a parametric approximation to p(s 0 |θ).This synthetic likelihood, which we indicate with p SL (s 0 |θ), can then be used within a Bayesian or a classical context.Wood (2010) used a multivariate Gaussian density to approximate the distribution of the summary statistics.Under this distributional assumption, a pointwise estimate of the synthetic likelihood at θ can be obtained using Algorithm 1.
Algorithm 1 Estimating p SL (s 0 |θ) 1: Simulate datasets Y i , . . ., Y m from the model p(y|θ).2: Transform each dataset Y i to a vector of summary statistics S i = S(Y i ).
3: Calculate sample mean μθ and covariance Σθ of the simulated statistics, possibly robustly.4: Estimate the synthetic likelihood pSL (s where d is the number of summary statistics used. One advantage of SL, over most ABC methods, is that it does not require the user to choose a tolerance or an acceptance threshold and that the summary statistics are scaled automatically and dynamically by Σθ .In addition, Blum (2010) showed that the convergence rate of ABC methods degrades rapidly with d.This curse of dimensionality, brought about by the non-parametric nature of ABC, forces practitioners to use dimension reduction or statistics selection techniques, such as those described by Blum et al. (2013).SL is less sensitive to the number of statistics used, due to the parametric likelihood approximation.
The development of approximate inferential approaches, such as SL and ABC, has been driven by the increasing availability of computational resources and by the challenges to model-based inference emerging in computational biology, ecology and genetics.These methods address the issue that, for many scientifically motivated models, the likelihood function is intractable; it may be too expensive to evaluate, unknown, or too time-consuming to derive analytically.Furthermore, even when sophisticated integration approaches, such as particle filters (Doucet et al., 2000), could provide consistent likelihood estimators, using approximate methods might still be preferable in practice, because of speed, automation and robustness to implementation details.Particle filters often rely on the specific structure of the chosen model, hence their implementation may need sub- Both distributions are highly skewed, and the EES density achieves a much better fit than a Gaussian approximation.
stantial changes if a different model is considered.In contrast, SL and ABC treat the model as a black box, thus allowing practitioners to rapidly explore a variety of models.The performance of SL, ABC and particle filters has been compared in detail by Fasiolo et al. (2016) and Everitt et al. (2015), respectively in the context of parameter estimation for non-linear state space models and of Bayesian model selection for Markov random field models.It is not the purpose of this paper to provide another extensive comparison.Instead, we focus on SL, and we start from the observation that the above-mentioned properties of this method are not without cost.In fact, although the Central Limit Theorem assures asymptotic normality of many classes of statistics, improving the quality of the normal approximation is not easy in a multivariate setting.Finding a suitable normalizing transformation is particularly challenging in the context of parameter estimation, because such transformation would need to ensure normality across the parameter space.This motivates the main contribution of this work: we relax the multivariate normality assumption, while maintaining the ease-of-use and scalability of SL.We achieve this by proposing a flexible density estimator, namely the EES approximation.In addition to illustrating empirically that, when the distribution of the summary statistics is far from normal, the EES-based version of SL leads to more accurate parameter estimates than its Gaussian alternative, we prove that maximizing the synthetic likelihood produces consistent parameter estimators, under either the EES or the Gaussian density estimator.
The paper is organized as follows.We introduce the empirical saddlepoint approximation in Section 2 and we propose its extended version in Section 3. In Section 4 we clarify how the new density estimator can be used within the context of SL and we prove the consistency of the resulting parameter estimator.In Sections 5 and 6 we illustrate the method on model (1) and on another simple example, designed to show the potential limitations of ABC and of the Gaussian version of SL, and in Section 7 we apply the method to inference for a complex individual-based forest model, for which statistical inference is challenging without the use of summary statistics, while the model is sufficiently computationally costly that extensive method tuning is impractical.

Saddlepoint approximations
Recall that we are interested in using saddlepoint methods to closely approximate the statistics-based likelihood, p(s 0 |θ).However, the following discussion is valid beyond the context of SL, hence we temporarily suppress all dependencies on θ.We restore them in Section 4, which describes how the proposed density estimator can be used within SL.
We were led to saddlepoint approximations, among other multivariate density estimators, by the following considerations.While saddlepoint approximations are derived from asymptotic expansions, they are often very accurate even in small samples and, in contrast to Edgeworth approximations, they are strictly positive and do not show polynomial-like waves in the tails.In addition, their empirical version provides a close approximation to the density of widely used statistics, such as M - (Ronchetti and Welsh, 1994) and L-estimators (Easton and Ronchetti, 1986).
Saddlepoint expansions were introduced into the statistical literature by Daniels (1954) and can be used to approximate the density function of a random variable, starting from its moment or cumulant generating function.When S is a continuous d-dimensional random vector, its probability density function, p(s), is associated with the moment generating function while the cumulant generating function is defined as K(λ) = log M (λ).We indicate its gradient and Hessian with K ′ (λ) and K ′′ (λ).In the following we assume that M (λ) exists for λ ∈ I, where I is a non-vanishing subset of R d containing the origin.If S is a discrete random vector, the generating functions are obtained by substituting the integrals with summations over the support of S.
Saddlepoint approximations rely on the one-to-one correspondence between the cumulant generating function and the probability density function of S. For a continuous S, the saddlepoint density is where λ is such that Condition ( 2) is often called the saddlepoint equation.The saddlepoint density is defined only on the interior, J Vs , of the support, V s , of the original density, p(s).Another important property of p(s) is that it is generally improper.A proper density can be obtained through normalization p(s) = p(s) JV s p(s)ds .
For a discrete S analogous results hold and p(s) should be interpreted as an approximation to pr(S = s).For an introduction to saddlepoint approximations, see Butler (2007).

Empirical Saddlepoint approximation
Suppose that the analytic form of K(λ) is unknown, as it generally is for simulation-based methods such as SL.If we can simulate from p(s), then it is possible to estimate K(λ) using the estimator proposed by Davison and Hinkley (1988) where m is the number of simulations used.Derivative estimates of Km (λ) are These can be used to obtain an empirical saddlepoint approximation where λm is the solution of K′ m ( λm ) = s. (5) Notice that K′ m (λ) is a convex combination of the simulated vectors s i , hence (5) has no solution if s falls outside the convex hull of the s i s.This limitation is addressed in Section 3. Feuerverger (1989) provides asymptotic results regarding how well pm (s) approximates p(s) in a univariate setting.In the Supplementary Material we show how these carry over to the current multivariate setting.In particular, pm (s) converges to p(s) at parametric rate O(m −1/2 ) for λ ∈ I/2, where I/2 is the subset of I such that λ ∈ I/2 if 2λ ∈ I, while the convergence is slower outside this region.Regardless of the distribution of S, s = µ = E(S) corresponds to λ = 0 ∈ I/2, hence it might be advantageous to think of K′ m (I/2) as a region approximately centred around µ.In Section 3 we build upon this interpretation.

Extended Empirical Saddlepoint approximation
The aim of this work is to use the flexibility of the empirical saddlepoint approximation to estimate densities for which the normal approximation is poor.The asymptotic results of Feuerverger (1989) suggest that the saddlepoint approximation should perform reasonably well in the central part of the distribution, while its accuracy decreases in the tails.More importantly, as stated in Section 2.1, the empirical saddlepoint equation (5) has a solution only if s lies inside the convex hull of the simulated data, so the resulting empirical saddlepoint density is not defined outside this subset of R d .This is problematic in the context of SL because, whether we wish to estimate the unknown parameters by Maximum Likelihood or Markov chain Monte Carlo (MCMC), we cannot generally expect s 0 to fall inside the convex hull of the simulated statistics in early iterations.In addition, if the model of interest is unable to generate summary statistics that are close to the observed ones, its inadequacy should ideally be quantified by a low, rather than an undefined, value of the synthetic likelihood.Hence, we need a remedy that allows us to solve (5) for any s = s 0 .
To motivate our solution, notice that solving (5) is equivalent to minimizing which would be guaranteed to have a unique minimum, if strong convexity held.That is, if 6) then (5) could be solved for any s.Unfortunately, the following proposition states that this in not the case.
Proof.See Appendix A.
However, the fact that K(λ) is strictly convex assures that tilting this estimator with a strongly convex function will produce a modified estimator that is strongly convex itself, so that (5) could be solved for any s.Therefore, we propose to use a modified estimator where Ĝm (λ) is a strongly convex function, while g(s, γ) is a function of s, parametrized by γ, which determines the mix between the two functions.Furthermore, we require A natural choice for Ĝm (λ) is the following parametric estimator of K(λ) which is unbiased and consistent for multivariate normal random variables.This leads to where η = 1 − g(s, γ)/(m − 1) appears because here Σ is the standard unbiased covariance estimator, while K′′ m (λ = 0) = m Σ/(m − 1).Similarly, evaluating higher derivatives of Km (λ) at λ = 0 produces consistent, but biased, estimators of the corresponding cumulants.Unbiased cumulant estimators are the k-statistics (McCullagh, 1987).
Our solution is related to that of Wang (1992), who modified the truncated estimator of Easton and Ronchetti (1986), and to the proposal of Bartolucci (2007), in the context of Empirical Likelihood (Owen, 2001).We refer to the density obtained by using estimator (7) within (4) as the Extended Empirical Saddlepoint approximation (EES).In Section 3.1 we propose a particular form for g(s, γ).

Choosing and tuning the mixture function g(s, γ)
In Appendix B we derive the MSEs of estimators (3) and (9), under normality of S. We then base our choice of g(s, γ) on the relative MSE performance of the two estimators.In particular, we choose where γ > 0 is a tuning parameter, which determines the rate at which g(s, γ) varies from 1 to 0, as the distance between s and μ increases.Apart from fulfilling requirement (8), function (10) has the desirable property of being invariant under linear transformations.More precisely, if z = a + Bs and Using this fact, it is simple to show that EES is equivariant under such transformations, that is log pz m (z, γ) = log pm (s, γ) − log det(B).In practice, this allows us to normalize s and S 1 , . . ., S m before fitting, which generally enhances numerical stability.
Our choice (10) has two main shortcomings: it is based on a normality assumption for S and, most importantly, it does not take the sample size m into account.In regard to the first issue: using higher moments of the simulated statistics to determine (10) might be attractive, but our experience suggests that this would result in very unstable estimates.The second problem can be addressed by appropriately selecting the tuning parameter γ.Its value is critical for the performance of our method, and at first sight it not clear on what principle this choice should be based.However, saddlepoint approximations are exact for Gaussian densities (Butler, 2007), hence γ is fundamentally a complexitycontrolling parameter, which determines the balance between two density estimators: the empirical saddlepoint, which is characterized by higher flexibility and variance, and the , S * j ∼ q(s), for j = 1, . . ., l.
A reasonably efficient importance density q(s) can be obtained by fitting a multivariate normal density to the vectors in S−t .Notice that ( 8) and ( 9) assure the boundedness of the importance weights, under this choice of q(s).
• Using the normalized EES density, , evaluate the negative log-likelihood of the validation data St .
4: Select the value γ i that minimizes the negative validation log-likelihood, averaged across the k folds.
normal distribution, which generally has higher bias, but lower variance.Hence, we propose to select γ by k-fold cross-validation, as detailed in the Algorithm 2.
In the Supplementary Material we show that, as m and l → ∞, Algorithm 2 consistently selects the value of γ which minimizes the Kullback-Leibler divergence between p(s, γ) and p(s).The Gaussian case is recovered as γ → ∞.

Use within Synthetic Likelihood
We now describe how the proposed density estimator can be used within the context of SL, hence we restore all dependencies on the model parameters, θ.To obtain an initial estimate, θ I , of the unknown parameters it is reasonable to maximize the synthetic likelihood based on the Gaussian approximation, which is less computationally expensive.Then, γ can be selected using Algorithm 2, with p(s) = p(s|θ I ).Given γ, pointwise estimates of the synthetic likelihood can be based on the new density estimator by using a procedure analogous to Algorithm 1, which we describe in the Supplementary Material.
In terms of computational effort, if we assume that m, the number of summary statistics simulated from p(s|θ), is much larger than d, then the cost of evaluating the Gaussian synthetic likelihood is O(md 2 ), which is the cost of obtaining Σθ .Calculating K′′ m (λ) has the same complexity, but solving the empirical saddlepoint equation ( 5) numerically implies that K′′ m (λ) will be evaluated at several values of λ.The proposal described in Section 3 assures that the underlying root finding problem is strongly convex, hence few iterations of a Newton-Raphson algorithm are generally sufficient to solve (5) with high accuracy.The computational cost of a synthetic likelihood estimate is then O(lmd 2 ), if the normalizing constant is estimated using l importance samples.In practice, we have not yet encountered an example where the normalizing constant strongly depended on θ.However, the normalizing constant often varies significantly with γ.Hence, in the examples presented in this paper, we estimate the normalizing constant when selecting γ using Algorithm 2, but we use the unnormalized EES density during parameter estimation.
Before testing the ESS-based version of SL on the examples, we now prove that, under the conditions to be specified shortly, maximizing the synthetic likelihood leads to consistent parameter estimators.Here we denote the Gaussian-based synthetic likelihood with pG (s 0 |θ) and its EES-based version with pS (s 0 |θ).We firstly consider the Gaussian case and we prove identifiability, which means that the (scaled) synthetic likelihood converges to a function which is maximized at the true parameter vector, θ 0 .This is guaranteed under the following assumptions.
Assumption 1.The summary statistics depend on a set of underlying observations Y 1 , . . ., Y n , and have mean and covariance matrix In addition there exists δ > 0 such that, for any θ, we have μn θ → µ θ and n δ Σn θ → Σ θ , in probability, as m and n → ∞.

Proof. See Appendix C.
Here assumption 1 guarantees pointwise convergence to f θ0 (θ), while assumptions 2 and 3 assure identifiability.The fact that f θ0 (θ) is maximal at the true parameter is not itself sufficient to assure weak consistency, which is instead guaranteed under the additional condition that the convergence of the Gaussian synthetic likelihood is uniform ( Van der Vaart, 2000).To assure this, we make the following assumptions.Assumption 7. The derivatives of the asymptotic mean vector, µ θ , and (scaled) covariance matrix, Σ θ , are bounded.In particular, there exist two positive constants, M µ and M Σ , such that for any θ ∈ Θ. Newey (1991) shows that pointwise convergence in probability, proved as part of theorem 1, implies uniform convergence as long as: assumption 4 holds; the derivatives of n −δ log pG (s 0 |θ) are continuous and dominated by an O p (1) random sequence; f θ0 (θ) is equicontinuous.Hence, in the Gaussian case, proving consistency requires only assuring that the last two requirements are met.
Theorem 2. Let θ be the maximizer of the Gaussian synthetic likelihood.If assumptions 1 to 7 hold, then θ converges in probability to θ 0 , as m and n → ∞.

Proof. See Appendix D.
We now consider the EES-based synthetic likelihood, and we focus on the un-normalized density estimator, which is cheaper to compute in practice.To prove identifiability we require the two following conditions to hold, in addition to assumptions 1, 2 and 3. Assumption 8.For every n, the moment generating function of S n exists for λ ∈ I, where I is a non-vanishing subset of R d containing the origin.
Assumption 9. Let γn θI be the chosen tuning parameter, corresponding to simulation effort m and sample size n.As m and n → ∞, there exists a constant c > 0 such that Prob(γ n θI < c) → 0, for any initialization θ I .
Proof.See Appendix E.
Notice that the asymptotic function, f θ0 (θ), mentioned in theorems 1 and 3, is the same under either density estimator.As in the Gaussian case, weak consistency is guaranteed under identifiability and uniform convergence to f θ0 (θ).Given assumption 4, and the fact that the equicontinuity of f θ0 (θ) has already been proven in the proof of theorem 2, uniform convergence is guaranteed as long as the derivatives of n −δ log pS (s 0 |θ) are continuous and dominated by an O p (1) sequence.In the Gaussian case this was assured under assumptions on the derivatives of μn θ and n δ Σn θ , and on the eigenvalues of the latter.Given the complexity of the EES density, under this density estimator we prefer to impose conditions directly on n −δ log pS (s 0 |θ), rather than on K(λ), K ′ (λ) and K ′′ (λ).
Assumption 10.The derivatives of the synthetic log-likelihood based on the EES density are continuous and dominated by an O p (1) random sequence, v n,m , that is for k = 1, . . ., q and for any θ ∈ Θ.

Simple recruitment, boom and bust model
Figure 1a shows two trajectories simulated from model (1), using parameters r = 0.4, κ = 50, α = 0.09 and β = 0.05.To compare ABC with the Gaussian and EES-based version of SL, we simulate 100 pseudo-observed datasets of length T = 300 using the above parameters.Given that we are not interested in estimating N 0 , we discards the first 50 times steps to lose any transient behaviour from the system.We use the remaining 250 steps of each trajectory to compute the following summary statistics: mean and smallest population, number of times the population consists of one or less individuals, number of population peaks and square-root of the minimal time gap between two consecutive peaks (a peak is occurring at time t if N t+1 − N t ≤ 30).Under ABC, we obtain MAP estimates of the model parameters using the approach of Rubio et al. (2013).This consists of sampling the approximate posterior, and then maximizing a kernel density estimate of it.
We perform the sampling step using the MCMC approach of Marjoram et al. (2003), and we maximize the approximate posterior using the mean shift algorithm (Fukunaga and Hostetler, 1975).The ABC tolerance was chosen using the approach of Wegmann et al. ( 2009), which will be described in Section 6, using 10 6 simulations and target acceptance rate 10 −3 .We use the uniform priors r ∈ (0, 1), κ ∈ (10, 80), α ∈ (0, 1) and β ∈ (0, 1), so that MLE and MAP estimates are equivalent.For SL we use m = 5 × 10 3 and estimate γ = 4 × 10 −4 using Algorithm 2, with l = 10 4 .While the Supplementary Material gives further details about the simulation setting, it is important to point out here that we use the same number of simulations from the model under all methods.
Figure 2 compares the estimates obtained using ABC and EES-based SL, while Table 1 reports the true parameters, together with the means and RMSEs for all three methods.ABC struggles to identify the arrival rate, β, and it often grossly overestimates r and underestimates κ, so that its RMSE performances is substantially worse overall than that of either SL method.Comparing the synthetic likelihood methods to each other,  EES-SL has lower RMSE than Gaussian SL for all parameters.The mean for β is also substantially closer to the true parameter for EES-SL, as expected given the shape of the distribution of the most relevant statistic, shown in Figure 1c.This very simple example clearly illustrates that EES-SL offers non-negligible benefits when important statistics are highly non-Gaussian.

Multivariate shifted exponential distribution
Here we consider a toy example, whose purpose is illustrating how the performance of the Gaussian and EES versions of SL compares with that of tolerance-based ABC algorithms, as the dimensionality of the model increases.In particular, let S be a d-dimensional random vector, where each marginal follows a shifted exponential distribution The plot in Figure 3a contains the results of a 10-fold cross-validation run, obtained using d = 10, l = 10 3 , m = 10 4 , β = 0.5 and The cross-validation curve is minimized by γ = 5 × 10 −3 , and the plot in Figure 3b shows the true and approximate marginal densities of one component S k .The EES approximation to the marginal, obtained by marginalizing the d-dimensional fit, is clearly more accurate than a normal density.
To demonstrate the usefulness of EES in the context of SL, we use it to estimate the shifts θ 1 , . . ., θ d , all of which are equal to 0. In particular, we simulate a single vector of observed statistics, s 0 , from (11), and we maximize the resulting synthetic likelihood, using either a Gaussian density or EES.Given that SL and ABC algorithms are often motivated as approximations to full likelihood inference, all the MSEs reported in the remainder of this section quantify deviations from the full MLE, which is s 0 , not from the true parameters.Hence, the bias of Gaussian SL estimates is 1/β.By averaging the squared errors across the 10 dimensions, we obtain MSEs equal to 3.8 and 0.56, using the normal and the EES approximation respectively.In an analogous 20-dimensional run, using m = 5 × 10 4 , the MSE was reduced from 4.1 to 1.26.P-values from t-test for differences in log-absolute errors were around 10 −6 in both runs.It is possible to derive analytically how a tolerance-based ABC approximation would perform under this model.The details are reported in the Supplementary Material.Assume that the likelihood p(s 0 |θ) is approximated by p(||s 0 − s|| ∞ < ǫ|θ), where ǫ > 0 is the tolerance.Given that we are interested in deviations from the full MLE, we can impose s 0 = 0 without loss of generality.If we use independent uniform priors on [ψ, 0] for each parameter, where ψ < −ǫ, the posterior mode is at θ k = −ǫ, for k = 1, . . ., d.Hence, the MSE corresponding to the MAP estimate is equal to ǫ 2 .This implies that, to achieve MSEs equal to those of EES, ǫ would need to be set to √ 0.56 and √ 1.26, respectively in the 10 and 20-dimensional setting.The corresponding acceptance probabilities, obtained by simulating statistic vectors using parameters θ fixed to the MAP, are of order 10 −3 and 10 −4 .In 40 dimensions, obtaining an MSE equal to 2 would lead to an acceptance ratio at the MAP of order 10 −5 .Notice that these are upper bounds, because the acceptance probability is maximal at the MAP.This analysis suggests that, in order to match the MSE achieved by SL, the computational budget of an ABC algorithm would need to be increased rapidly as the number of dimensions grows.Further, the relation between ǫ and the MSE is generally not known in practice.A popular approach (see e.g.Wegmann et al. (2009)) is to simulate a large number of parameter vectors from the prior, then simulate a statistics vector from the model using each of them and select ǫ so that a small percentage of these are accepted.To quantify how the computational cost of this tuning phase depends on the prior, we reverse this process and assume that the values of ǫ are as given above.In 10 dimensions, ψ would need be in [−2.1, −ǫ], in order to achieve an acceptance probability of order 10 −4 , while in 20 dimension, ψ ∈ [−1.9, −ǫ] leads an acceptance probability of order 10 −5 .Hence, to obtain only just tolerable acceptance rates during the tuning phase, very accurate prior information must be available, especially in high dimensions.In an applied setting, prior information is often rather vague, hence ǫ needs to be tuned using more sophisticated approaches, such as the sequential algorithm of Toni et al. (2009).While such methods can alleviate the effort needed to select ǫ, performing extensive ABC tuning runs is still onerous when working with computationally intensive models, such as the one described in the next section.

The model
To test our proposal in a realistic setting, we consider Formind, an individual-based model describing the main natural processes driving forests dynamics.Here we describe its basic features, while we refer to Dislich et al. (2009) and to Fischer et al. (2016) for detailed descriptions of the model and of the scientific questions it can be used to address.
The model describes the growth and population dynamics of tree individuals in a simulation area that is divided in 20 × 20m patches, with individual trees being assigned explicitly to one patch.Tree species with similar characteristics are grouped into Plant Functional Types (PFTs).A constant input of seeds deposits on average s j seeds of the j-th PFT per hectare per year.The main factor determining both seed establishment and growth is the light climate in the patch.For example, pioneer types will establish only in patches relatively free of overshadowing trees, while late successional trees are able to grow below a dense canopy.Trees are subject to a baseline mortality rates m j , which is specific to each PFT.
In the context of Formind, the need for approximate simulation-based methods comes from the complexity of the model.Indeed, Formind was developed with a focus on ecological plausibility, rather than statistical tractability, and most of its submodels describe highly non-linear biological processes, containing one or more sources of randomness.Most importantly, the raw output of Formind is the collection of all the characteristics of individual trees in the simulations area, which obviously do not correspond to individuals present in the actual survey data.Hence, it is necessary to work with summary statistics.
Formind is computationally intensive even when few PFTs are included and, given initial conditions and parameters, the simulated forest needs to be run for several hundred years, before the distribution of the summary statistics reaches equilibrium.This means that, from a practical point of view, it is critical to avoid lengthy tuning runs, such as those needed to select the tolerance ǫ in ABC methods.

Simulation Results
We consider two PFTs, pioneer and late successional, and we reduce the model output to 6 summary statistics.In particular, to verify whether then new density estimator can deal with large departures from normality, we used the following transformed statistics where C jk is the number of trees of the j-th PFT falling in the k-th diameter class, while α jk , ψ jk , and σ jk are constants, whose values are reported in the Supplementary Material.The diameter categories used for each PFT correspond to trees with small, medium or large diameters.We simulated 24 datasets from the model and estimated the baseline mortality rates and seed input intensities of the two PFTs by maximizing the synthetic likelihood, using both the normal and the EES approximations.In both cases we used m = 10  summary statistics and, under EES, γ was fixed to 5.5 × 10 −3 , chosen using Algorithm 2 with l = 10 3 .Table 2 reports the true parameters, together with the means and RMSEs of the estimates, from the normal or the EES approximations.See the Supplementary Material for more details about the optimization setting.Using the EES, rather than the normal approximation, leads to lower MSEs for all model parameters.The plots in Figure 4 compare the marginal distributions of the summary statistics, simulated from the model using the true parameter values, with those obtained by simulating random vectors from EES, fitted to the simulated statistics using the same values of γ and m used during the optimization.EES gives a good fit to the marginal distributions of the summary statistics, all of which are far from normal.
densities for which a normal approximation is clearly inadequate.This in turn can lead to better accuracy in parameter estimation.
Importantly, using EES rather than a Gaussian density, does not add much to the tuning requirements of SL.In fact, the only parameter of EES, γ, can be selected using standard statistical tools, such as cross-validation.In the context of SL, and of approximate methods in general, having little tuning requirements is an important feature, since it allows practitioners to focus on identifying informative summary statistics, rather than on other aspects of the inferential procedure.
We have shown that, if fairly general conditions on the distribution of the summary statistics and on the underlying model hold, maximizing the synthetic likelihood function leads to consistent parameter estimators, under either EES or a Gaussian density estimator.Given the generality of the conditions assumed here, we have treated EES as a non-parametric density estimator, as suggested by Feuerverger (1989).However, the use of empirical saddlepoint approximations has previously been considered for particular classes of statistics, such as M-estimators (Monti and Ronchetti, 1993;Ronchetti and Welsh, 1994) and L-statistics (Easton and Ronchetti, 1986).Hence, it would be interesting to verify whether making additional assumptions on the summary statistics would allow us to assess the asymptotic efficiency of the parameter estimates produced by the EES-based version of SL.
From a practical point of view, the computational efficiency of SL is of critical importance.Gutmann and Corander (2016), Wilkinson (2014) and Meeds and Welling (2014) proposed using Gaussian Processes to increase the computational efficiency of SL and ABC methods.The first two proposals, being based on pointwise likelihood estimates, could be used in conjunction with EES.Meeds and Welling (2014) model only the first two moments of the simulated statistics, hence it is not clear whether their approach could be modified to take higher moments into account, as the new density estimator does.

Appendices A Proof of Proposition 1
Define and notice that K′′ (λ) is positive semi-definite for all z ∈ R d such that ||z|| > 0. In addition, define q i = s i − s and assume that Then K′′ (λ) is positive definite and K(λ) is strictly convex.In fact, suppose that there exists a non-zero vector z such that z T K′′ (λ)z = 0, which implies z T q i = 0 for i = 1, . . ., m.Given that z can be expressed as a linear combination of q 1 , . . ., q m , this would imply that which contradicts the fact that z is a non-zero vector.Now, define J ⊂ 1, . . ., m such that λ T s i = α > 0 for all i ∈ J, λ T s i < α for all i / ∈ J, examination of (12) shows that lim = 0, for all i, j such that j ∈ J, i / ∈ J, , for all i, j such that i, j ∈ J.
Finally, we choose z = λ and obtain which implies that K(λ) is not strongly convex.

B Mean squared errors of the CGF estimators
Firstly notice that, irrespective of the distribution of S, M (λ) is unbiased.If S is normally distributed, e λ T S follows a log-normal distribution and with the saddlepoint equation ( 2) being solved by In order to approximate the MSE of (3) as a function of λ, we firstly approximate its expected value by Taylor expansion around M (λ) Similarly we have that where the last equality holds due to (14).The O(m −2 ) term in (15) derives from where µ 3 (X) is the third central moment of a random variable X and the second equality is justified by independence.

D Proof of Theorem 2
Given assumption 5 and the fact that all the functions involved in n −δ log pG (s 0 |θ) are continuously differentiable, this function is continuously differentiable itself, due to the chain rule.We then have to show that its derivative is dominated by an O p (1) random sequence.Consider the partial derivative of the log-determinant w.r.t. the k-th parameter consider it is clear that the r.h.s. of ( 19) is dominated as well, provided that ||s 0 − μn θ || 2 is.Now where 2 is dominated, because the derivatives of μn θ are dominated by assumption 5 and the parameter space is compact by assumption 4. In addition, for any ǫ > 0, by Markov's inequality where * ψ n θ0 is the largest eigenvalue of n δ Σ n θ0 .The r.h.s. of ( 20) is O(n −δ ) by assumptions 1 and 6, hence ||z n θ0 || 2 2 is o p (1).An identical argument shows that ||µ n θ0 − μn θ0 || 2 2 is o p (1).This proves that the derivatives of n −δ log pG (s 0 |θ) are dominated by an O p (1) sequence.
As explained in Section 4, this implies the uniform convergence of n −δ log pG (s 0 |θ) to f θ0 (θ), provided that f θ0 (θ) is equicontinuous.It is easy to show to that, if the derivatives of f θ0 (θ) are bounded, then equicontinuity follows.But, under assumptions 2 and 7, it is possible to bound the derivatives of f θ0 (θ), as we have just done for n −δ log pG (s 0 |θ).This assures uniform convergence which, together with identifiability, guarantees weak consistency ( Van der Vaart, 2000).

Supplementary material to "An Extended Empirical Saddlepoint
Approximation for Intractable Likelihoods" Matteo Fasiolo, Simon N. Wood, Florian Hartig and Mark V. Bravington 1 Asymptotics of the multivariate empirical saddlepoint approximation Here we follow Feuerverger (1989) but develop the results in a multivariate setting, and with some changes in notation.For λ ∈ I, Mm (λ) converges to M (λ) almost surely.This convergence is uniform and extends to Km (λ): Proof: Due to the Strong Law of Large Numbers Mm (λ) converges to M (λ) almost surely, for all λ in any countable collection {λ i }.In addition Mm (λ) and M (λ) are both convex functions and, for such functions, convergence on dense subsets implies uniform convergence on compact subsets (Roberts and Varberg, 1973).This proves (S1), while (S2) follows by continuity of the logarithm.
For λ in the interior of I, these results extend to derivatives of both Mm (λ) and Km (λ): sup where i = i 1 , . . ., i d and: Proof: D i M (λ) is finite only for λ ∈ int(I).If all the elements of i are even, then D i Mm (λ) and D i M (λ) are convex and (S3) follows as before.Otherwise, indicate with λ o the elements of λ for which the corresponding element of i is odd.If there is an even number of components of λ o which are negative, D i M (λ) is still convex, otherwise −D i M (λ) is.Applying the uniform convergence argument for convex functions to the two sub-cases proves (S3).In addition, D i K(λ) has the form P (λ)/M (λ) 2 k with P (λ) being a polynomial function of D l K(λ), where l belongs to the set of all d-dimensional vector such that: Given that an analogous argument holds for D i Km (λ), (S4) is proved by continuity.After noticing that Mm (λ) and its derivatives are unbiased estimators of M (λ) and its corresponding derivatives, it is straightforward to show that: for λ 1 , λ 2 such that λ 1 + λ 2 ∈ I.This entails that, if we define I/2 to be the subset of I such that λ ∈ I/2 if 2λ ∈ I, than Mm (λ) is a √ m-consistent estimator of M (λ), for λ ∈ I/2.An analogous, but asymptotic, result for Km (λ) is the following: where λ 1 and λ 2 are further restricted to the interior of I/2 if any of the elements of i or j is greater than zero.Finally, after noticing that on I/2: we have that: by Taylor expansions, which are justified by the differentiability of all the functions involved.See Feuerverger (1989) for more details.
2 Optimality of the cross-validated Extended Empirical Saddlepoint Let p(s|θ) be the true density of the statistics and pS (s|θ, γ) be the EES density.Assume that we have a training set of size m, a test set of size n T and that we have used l simulations to normalize the density estimator.In this section we prove that, as m, n T and l → ∞, Algorithm 2 consistently selects the value of γ which minimizes the Kullback-Leibler divergence between pS (s|θ, γ) and p(s|θ).When two folds are used, cross-validation (Algorithm 2) selects γ as follows γ = argmin Hence p S (s|θ, γ) is the member of the p S (s|θ, γ) family with minimal Kullback-Leibler distance from p(s|θ).This result can easily be extended to k-fold cross-validation (k > 2).

Saddlepoint version of Algorithm 1
In this section we illustrate how a pointwise synthetic likelihood estimate can be obtained using the new density estimator, rather than a Gaussian density.
A reasonably efficient importance density q(s) is a Gaussian density with mean vector μθ and covariance Σθ .

Maximizing the synthetic likelihood
To maximize the synthetic likelihood we have used a special case of the Iterated Filtering procedure, firstly proposed by Ionides et al. (2006).Very briefly, suppose that θk is the estimate of the unknown parameters at the k-th step of the optimization routine.This estimate is updated as follows: 1. Simulate N parameter vectors θ 1 , . . ., θ N from a user-defined density p(θ k+1 | θk ) such that where σ 2 k is a cooling schedule and Σ is a covariance matrix.2. For each θ i , obtain an estimate pSL (s 0 |θ i ) of the synthetic likelihood, using either the multivariate normal density or the normalized EES.
The convergence properties of this procedure have been studied, in the context of Hidden Markov Models, firstly by Ionides et al. (2006) and more in details by Ionides et al. (2011).Doucet et al. (2013) explicitly pointed out that it can be used as a general likelihood optimizer.While those papers considered situations where the likelihood (p SL (s 0 |θ) in our context) can be evaluated exactly, we have verified empirically that the algorithm works well also when the likelihood is estimated with Monte Carlo error.For all the examples we used the following cooling schedule σ 2 k = σ 2k 0 , σ 2 0 = 0.95.
In the shifted exponential example we performed 4 separate runs of the optimizer, using either the normal or the EES approximation, in both the 10 and 20-dimensional setting.

Shifted exponential details
In one dimension, the ABC likelihood is where we used the change of variable x = s − θ.Similarly, if θ < −ǫ, the likelihood is Finally, p ǫ (s 0 |θ) = 0 for θ > ǫ.Under a uniform prior on [ψ, 0], where ψ < −ǫ, the MAP is due to the independence between the summary statistics.Hence, the likelihood at the MLE is F (2ǫ) d , which is also the maximal probability that a simulated statistics vector gets accepted.
In ABC the tolerance is often chosen so that a fraction α ∈ (0, 1) of the statistics simulated from the prior falls within the tolerance.In one dimension and for fixed ψ, the overall probability of acceptance during this process is Now, to select ψ so that we obtain an acceptance probability equal to φ, we need to solve p(|S| < ǫ|ψ) = φ, wrt ψ, numerically (e.g. using bisection).Due to the independence between the priors and between the statistics, in d dimensions the above probability becomes p(|S| < ǫ|ψ) d .

Unstable population model details
Under the Gaussian and EES version of SL, we maximize the synthetic likelihood using 100 iterations of Iterated Filtering, with N = 24 synthetic likelihood evaluations at each step.The optimizer is initialized at r = 0.3, κ = 30, α = 0.15 and β = 0.03.Given that m = 5 × 10 3 , estimating the model parameters costs 12 × 10 6 simulations from the model.In ABC we use 10 6 simulation to calibrate the tolerance ǫ, followed by 12 × 10 6 MCMC samples.Of these, we store only a thinned sub-sample of 22 × 10 3 parameter vector.Before obtaining MAP estimates, we discard the first 4 × 10 3 of these as burn-in.Then we maximize the posterior using the mean shift algorithm, where the approximate posterior is estimated using a Gaussian kernel density estimator.Following the rule of thumb of Silverman (1986) the covariance matrix, H, of the kernels is determined as follows where n = 18 × 10 3 , while H 1 2 and Σ 1/2 are matrix square roots of H and of Σ, the estimated covariance of the posterior samples.Given that the kernel density estimate of the posterior might have multiple local modes, we obtain 500 different MAP estimates by initializing the mean shift algorithm at a random posterior sample.We used the estimate corresponding to the highest (estimate) posterior density as our final MAP estimate.

Formind settings
The summary statistic were obtained using the following constants α 1,1 = α 1,3 = α 2,1 = α 2,3 = 1.5, α 1,2 = 2, α 2,2 = 2 while ψ jk and σ jk were estimates of mean and standard deviations of C jk , obtained by simulating tree counts at the true parameters.The 24 datasets were simulated from Formind using the same parameter values as in Table 1 in the supplementary material of Hartig et al. (2014).The chosen tree classes correspond to diameters at breast height d < 0.2m, 0.2m ≤ d < 0.6m, d ≥ 0.6m for pioneer and d < 0.5m, 0.5m ≤ d < 1.4m, d ≥ 1.4m for late successional trees.To generate the datasets the model was run for 10 5 years, and the final statistics vector was selected.The m = 10 4 summary statistics simulated to estimate p SL (s 0 |θ) have been generated by simulating the model for 5.1×10 4 years, where the first 10 3 years of simulation were discarded to avoid the transient, and by storing a vector of statistics every 5 years.
Starting from initial values µ pio = 0.03, µ suc = 0.003, s pio = 120 and s suc = 40, we ran the optimizations using N = 24 and 100 iterations.The estimates reported in Table 1 in the main text were obtained by using the averages of the last 10 iterations of each optimization run as point estimates.The whole experiment took around 10 days on a quad-core Intel i7 3.6 GHz processor.
4 Example: correlated multivariate shifted exponential distribution In the shifted exponential example included in the main text, the elements of the random vector S are independent.To show that EES can cope with correlated random variables we have introduced correlations, without altering the marginal densities, by using a copula model.In particular, we used a Gaussian d-dimensional copula, which has density where R is a d × d correlation matrix, I d is the identity matrix, q is a d-dimensional vector with q i = Φ −1 (u i ), where Φ is the Cumulative Distribution Function of a standard normal.The random vector {u 1 , . . ., u d } has marginals that are uniformly distributed on [0, 1].For an introduction to copulas see Cherubini et al. (2004).
To simulate unstructured, dense correlation matrices R, we have used the method proposed by Joe (2006).To set up the copula model and to simulate random variable, we have used the tools described by Yan et al. (2007).
We compare EES with a Gaussian and a kernel density estimator.In particular, we used m = 10 3 training samples and 5 × 10 3 test samples.The normalizing constant of the saddlepoint was estimated using l = 10 3 simulations and γ was estimated by cross-validation.For the kernel estimator we used a multivariate Gaussian kernel with covariance α Σ, where Σ is the empirical covariance matrix of the random vectors in training set and α is a scaling parameter, whose value was selected by cross-validation.Figure S1 shows how the estimated Kullback-Leibler divergence, between the true density and each density estimate, varies with the number of dimensions.The true density is very skewed in each dimension, hence the Gaussian estimator is highly biased.The kernel estimator does better than the Gaussian, even as the dimensionality increases.This is attributable to the fact that having a single bandwidth α is very helpful in this example, because all the marginal densities are identical.The new density estimator performs uniformly better than the alternatives.
As in the uncorrelated scenario (see the main text) we now estimate the shifts θ 1 , . . ., θ d , using the Gaussian and the new density estimator.We have considered a 10 and a 20dimensional scenario.In both cases γ has been selected by cross-validation.We have used β = 0.2 and θ 1 = • • • = θ d = 0. Given that the shape of the densities does not change with any of the θs we set l = 0, and we have not computed the normalizing constant.We have used m = 10 4 and m = 5 × 10 4 simulated vectors, respectively.By using EES, the Mean Squared Error was reduced from 21.9 to 4.8 in the 10-dimensional setting, and from 22.7 to 3.2 in the 20-dimensional setting.P-values from t-test for differences in log-absolute errors were lower than 10 −9 in both runs.

Figure 1
Figure1: a: Two sample paths, simulated from model (1).b and c: Marginal distributions of the population mean and of the number of periods t during which N t ≤ 1, when T = 250.Both distributions are highly skewed, and the EES density achieves a much better fit than a Gaussian approximation.

Assumption 4 .
The parameter space, Θ ⊂ R p , is compact and convex.Assumption 5.The derivatives of μn θ and n δ Σn θ are continuous and dominated by two O p (1) positive random sequences, a n,m and b n,m .More precisely ∂ μn θ ∂θ k 2 ≤ a n,m , and ∂n δ Σn θ ∂θ k 2 ≤ b n,m , for k = 1, . . ., q and for any θ ∈ Θ. Assumption 6.Let * ψn θ and * ψn θ be, respectively, the smallest and the largest eigenvalue of n δ Σn θ and assume that there exist two O p (1) positive random sequences, c n,m and u n,m , such that * ψn θ −1 ≤ c n,m , and * ψn θ ≤ u n,m , for any θ ∈ Θ.

Figure 2 :
Figure 2: MLE estimates obtained using EES SL versus MAP estimates produced by ABC, under model (1).The dashed lines and the black crosses indicate, respectively, the mean estimates under each method and the true parameter values.

Figure 4 :
Figure 4: Marginal distributions of summary statistics corresponding to small, medium and large pioneers (a, b, c) and successionals (d, e, f), in the Formind model.

Figure S1 :
Figure S1: Empirical Kullback-Leibler divergence between the three density estimators and the true density, as the number of dimensions increases.
2: Simulate m random vectors S 1 , . . ., S m from the true density p(s) and divide them into k folds.For simplicity, assume that m is a multiple of k.Indicate with St the vectors in the t-th fold, and with S−t the remaining r = m(1 − 1/k) vectors.Let p−t r (s, γ) be the EES density based on the vectors in S−t .3:

Table 1 :
True parameters, means and RMSEs (in parentheses) of the estimates using ABC and the two version of SL, for model (1).For each parameter, the lowest RMSE is underlined.

Table 2 :
Formind model: true parameters, means and RMSEs (in parentheses) of the estimates using the normal and the EES estimators.P-values for differences in log-absolute errors have been calculated using t-tests.
Algorithm 3 Estimating p SL (s 0 |θ) using the Extended Empirical Saddlepoint approximation 1: Simulate datasets Y i , . . ., Y m from the model p(Y |θ).2: Transform each dataset Y i to a vector of summary statistics S i = S(Y i ).3: Calculate sample mean μθ and covariance Σθ of the simulated statistics.
4: Estimate the synthetic likelihood pSL