Adapting the ABC distance function

Approximate Bayesian computation performs approximate inference for models where likelihood computations are expensive or impossible. Instead simulations from the model are performed for various parameter values and accepted if they are close enough to the observations. There has been much progress on deciding which summary statistics of the data should be used to judge closeness, but less work on how to weight them. Typically weights are chosen at the start of the algorithm which normalise the summary statistics to vary on similar scales. However these may not be appropriate in iterative ABC algorithms, where the distribution from which the parameters are proposed is updated. This can substantially alter the resulting distribution of summary statistics, so that different weights are needed for normalisation. This paper presents two iterative ABC algorithms which adaptively update their weights and demonstrates improved results on test applications.


Introduction
Approximate Bayesian computation (ABC) is a family of approximate inference methods which can be used when the likelihood function is expensive or impossible to compute but simulation from the model is straightforward. The simplest algorithm is a form of rejection sampling. Here parameter values are drawn from the prior distribution and corresponding datasets simulated. Each simulation is converted to a vector of summary statistics s = (s 1 , s 2 , . . . , s m ) and a distance between this and the summary statistics of the observed data, s obs , is calculated. Parameters producing distances below some threshold are accepted and form a sample from an approximation to the posterior distribution.
The choice of summary statistics has long been recognised as being crucial to the quality of the approximation (Beaumont et al., 2002), but there has been less work on the role of the distance function. A popular distance function is weighted Euclidean distance: where σ i is an estimate of the prior predictive standard deviation of the ith summary statistic.
In ABC rejection sampling a convenient estimate is the empirical standard deviation of the simulated s i values. Scaling by σ i in (1) normalises the summaries so that they vary over roughly the same scale, preventing the distance being dominated by the most variable summary.
This paper concerns the choice of distance in more efficient iterative ABC algorithms, in particular those of Toni et al. (2009), Sisson et al. (2009) and Beaumont et al. (2009). The first iteration of these algorithms is the ABC rejection sampling algorithm outlined above.
The sample of accepted parameters is used to construct an importance density. An ABC version of importance sampling is then performed. This is similar to ABC rejection sampling, except parameters are sampled from the importance density rather than the prior, and the output sample is weighted appropriately to take this change into account. The idea is to concentrate computational resources on performing simulations for parameter values likely to produce good matches. The output of this step is used to produce a new importance density and perform another iteration, and so on. In each iteration the acceptance threshold is reduced, resulting in increasingly accurate approximations. Full details of the Toni et al. (2009) implementation are reviewed later.
Weighted Euclidean distance is commonly used in these algorithms with σ i values determined in the first iteration. However there is no guarantee that these will normalise the summary statistics produced in later iterations, as these are no longer drawn from the prior predictive. This paper proposes two variant iterative ABC algorithms which update their σ i values to appropriate values at each iteration. It is demonstrated that these algorithms provide substantial advantages in applications. Also, they do not require any extra simulations to be performed solely for tuning. Therefore even when a non-adaptive distance performs adequately, there is no major penalty in using the new approach. (Some additional calculations are required -calculating more σ i values and more expensive distance calculationsbut these form a negligible part of the overall computational cost.) One of the proposed algorithms has similarities to the iterative ABC methods of Sedki et al. (2012) and Bonassi and West (2015). These postpone deciding some elements of the tuning of iteration t until during that iteration. Algorithm 5 also uses this strategy but for different tuning decisions: the distance function and the acceptance threshold. Another related paper is Fasiolo and Wood (2015) which contains an illustration of the difficulty of choosing ABC distance weights non-adaptively.
The remainder of the paper is structured as follows. Section 2 reviews ABC algorithms.
This includes some novel material on the convergence of iterative ABC methods. Section 3 discusses weighting summary statistics in a particular ABC distance function. Section 4 details the proposed algorithms. Several examples are given in Section 5. Section 6 summarises the work and discusses potential extensions. Finally Appendix A contains technical material on convergence of ABC algorithms. Computer code to implement the methods of this paper in the Julia programming language (Bezanson et al., 2012) is available at https://github.com/dennisprangle/ABCDistances.jl.
This section sets out the necessary background on ABC algorithms. Several review papers (e.g. Beaumont, 2010;Csilléry et al., 2010;Marin et al., 2012) give detailed descriptions of other aspects of ABC, including tuning choices and further algorithms. Sections 2.1 and 2.2 review ABC versions of rejection sampling and PMC. Section 2.3 contains novel material on the convergence of ABC algorithms.

ABC rejection sampling
Consider Bayesian inference for parameter vector θ under a model with density π(y|θ).
Let π(θ) be the prior density and y obs represent the observed data. It is assumed that π(y|θ) cannot easily be evaluated but that it is straightforward to sample from the model.
ABC rejection sampling (Algorithm 1) exploits this to sample from an approximation to the posterior density π(θ|y). It requires several tuning choices: number of simulations N , a threshold h ≥ 0, a function S(y) mapping data to a vector of summary statistics, and a distance function d(·, ·).
Algorithm 1 ABC-rejection The threshold h may be specified in advance. Alternatively it can be calculated following step 4. For example a common choice is to specify an integer k and take h to be the kth smallest of the d * i values (Biau et al., 2015).

ABC-PMC
Algorithm 2 is an iterative ABC algorithm taken from Toni et al. (2009). Very similar algorithms were also proposed by Sisson et al. (2009) and Beaumont et al. (2009). The latter note that this approach is an ABC version of population Monte Carlo (Cappé et al., 2004), so it is referred to here as ABC-PMC. The algorithm involves a sequence of thresholds, (h t ) t≥1 .
Similarly to h in ABC-rejection, this can be specified in advance or during the algorithm, as discussed below.
Algorithm 2 ABC-PMC (with the option of adaptive h t ) Initialisation 1. Let t = 1.

Main loop
2. Repeat following steps until there are N acceptances.
Denote the accepted parameters as θ t 1 , . . . , θ t N and the corresponding distances as 4. (Optional) Let h t+1 be the α quantile of the d t i values.

End of loop
The algorithm samples parameters from the importance density In the first iteration (and sometimes the second, as discussed shortly) q t (θ) is the prior.
Otherwise (2b) is used, which effectively samples from the previous weighted population and perturbs the result using kernel K t . Beaumont et al. (2009) show that a good choice of the latter is where φ is the density of a normal distribution and Σ t−1 is the empirical variance matrix of (θ t−1 i ) 1≤i≤N calculated using weights (w t−1 i ) 1≤i≤N As mentioned above, the sequence of thresholds can be specified in advance. However it is hard to do this well. A popular alternative (Drovandi and Pettitt, 2011a) is to choose the thresholds adaptively by setting h t at the end of iteration t − 1 to be the α quantile of the accepted distances (n.b. α < 1 is assumed throughout the paper). An optional step, step 4, is included in Algorithm 2 to implement this method. Alternative updating rules for h t have been proposed such as choosing it to reduce an estimate of effective sample size by a prespecified proportion (Del Moral et al., 2012) or using properties of the predicted ABC acceptance rate (Silk et al., 2013).
If step 4 is used this leaves h 1 and α as tuning choices. A simple default for h 1 is ∞, in which case all simulations are accepted when t = 1. In this case (2b) would give q 2 (θ) as simply a modified prior with inflated variance, which is not a sensible importance density.
Therefore (2) takes q 2 (θ) = π(θ) in this case. This is a minor novelty of this presentation of the algorithm.
A practical implementation of Algorithm 2 requires a condition for when to terminate.
In this paper the total number of datasets to simulate is specified as a tuning parameter and the algorithm stops once a further simulation is required. Some alternative are possible, such as stopping once the algorithm falls below a target value for h t or the acceptance rate.
Several variations on Algorithm 2 have been proposed which are briefly discussed in Section 6. Some of these are ABC versions of sequential Monte Carlo (SMC). The phrase "iterative ABC" will be used to cover ABC-PMC and ABC-SMC.

Convergence of ABC-PMC
Conditions C1-C5 ensure that Algorithm 2 converges on the posterior density in an appropriate sense as the number of iterations tends to infinity. This follows from Theorem 1 which is described in Appendix A. Although only finite computational budgets are available in practice, such convergence at least guarantees that the target distribution become arbitrarily accurate as computational resources are increased.
C1. θ ∈ R n , s ∈ R m for some m, n and these random variables have density π(θ, s) with respect to Lebesgue measure.
Bounded eccentricity is defined in Appendix A. Roughly speaking, it requires that under any projection of A t to a lower dimensional space the measure still converges to zero.
Condition C1 is quite strong, ruling out discrete parameters and summary statistics, but makes proof of Theorem 1 straightforward. Condition C2 is a mild technical requirement. Conditions C4 and C5 can be used to check which distance functions are sensible to use in ABC-PMC, usually by investigating whether they hold when h t → 0. For example it is straightforward to show this is the case when d(·, ·) is a metric induced by a norm.

Weighted Euclidean distance in ABC
This paper concentrates on using weighted Euclidean distance in ABC. Section 3.1 discusses this distance and how to choose its weights. Section 3.2 illustrates its usefulness in a simple example.

Definition and usage
Consider the following distance: If ω i = 1 for all i, this is is Euclidean distance. Otherwise it is a form of weighted Euclidean distance.
Many other distance functions can be used in ABC, as discussed in Section 2.3, for To the author's knowledge the only published comparison of distance functions is by McKinley et al. (2009), which found little difference between the alternatives. Owen et al. (2015) report the same conclusion but not the details. This finding is also supported in unpublished work by the author of this paper. Given these empirical results this paper focuses on (3) as it is a simple choice, but no claims are made for its optimality. Some further discussion on this is given in Section 6.
Summary statistics used in ABC may vary on substantially different scales. In the extreme case Euclidean distance will be dominated by the most variable. To avoid this, weighted Euclidean distance is generally used. This usually takes ω i = 1/σ i where σ i is an estimate of the scale of the ith summary statistic. (Using this choice in weighted Euclidean distance gives the distance function (1) discussed in the introduction.) A popular choice (e.g. Beaumont et al., 2002) of σ i is the empirical standard deviation of the ith summary statistic under the prior predictive distribution. Csilléry et al. (2012) suggest using median absolute deviation (MAD) instead since it is more robust to large outliers. MAD is used throughout this paper. For many ABC algorithms these σ i values can be calculated without requiring any extra simulations. For example this can be done between steps 3 and 4 of ABC-rejection. ABC-PMC can be modified similarly, resulting in Algorithm 3, which also updates h t adaptively. (n.b. All of the ABC-PMC convergence discussion in Section 2.3 also applies to this modification.)

Illustration
As an illustration, Figure 1 shows the difference between using Euclidean and weighted Euclidean distance with ω i = 1/σ i within ABC-rejection. Here σ i is calculated using MAD.
For both distances the acceptance threshold is tuned to accept half the simulations. In this example Euclidean distance mainly rejects simulations where s 1 is far from its observed value: it is dominated by this summary. Weighted Euclidean distance also rejects simulations where s 2 is far from its observed value and is less stringent about s 1 .
Which of these distances is preferable depends on the relationship between the summaries and the parameters. For example if s 1 were the only informative summary, then Euclidean distance would preferable. In practice, this relationship may not be known. Weighted Euclidean distance is then a sensible choice as both summary statistics contribute to the acceptance decision.
This heuristic argument supports the use of weighted Euclidean distance in ABC more generally. One particular case is when low dimensional informative summary statistics have Algorithm 3 ABC-PMC with adaptive h t and d(·, ·) Initialisation 1. Let t = 1 and h 1 = ∞.

Main loop
2. Repeat following steps until there are N acceptances.
Denote the accepted parameters as θ t 1 , . . . , θ t N and the corresponding distances as 6. Increment t to t + 1. The observed summary statistics are taken to be (0, 0) (black cross). Acceptance regions are shown for two distance functions, Euclidean (red dashed circle) and weighted Euclidean with MAD reciprocals as weights (blue solid ellipse). These show the sets within which summaries are accepted. The acceptance thresholds have been tuned so that each region contains half the points. been selected, for example by the methods reviewed in Blum et al. (2013). In this situation all summaries are known to be informative and should contribute to the acceptance decision.

End of loop
Note that in Figure 1 the observed summaries s obs lie close to the centre of the set of simulations. When some observed summaries are hard to match by model simulations this is not the case. ABC distances could now be dominated by the summaries which are hardest to match. How to weight summaries in this situation is discussed in Section 6.

Methods: Iterative ABC with an adaptive distance
The previous section discussed normalising ABC summary statistics using estimates of their scale under the prior predictive distribution. This prevents any summary statistic dominating the acceptance decision in ABC-rejection or the first iteration of Algorithm 3, where the simulations are generated from the prior predictive. However in later iterations of Algorithm 3 the simulations may be generated from a very different distribution so that this scaling is no longer appropriate. This section presents two versions of ABC-PMC which avoid this problem by updating the distance function at each iteration. Normalisation is now based on the distribution of summary statistics generated in the previous (Algorithm 4) or current (Algorithm 5) iteration. The proposed algorithms are presented in Sections 4.1 and 4.2.
An approach along these lines has the danger that the summary statistic acceptance regions at each iteration no longer form a nested sequence of subsets converging on the point s = s obs . To avoid this, the proposed algorithms only accept a simulated dataset at iteration t if it also meets the acceptance criteria of every previous iteration. This can be viewed as sometimes modifying the tth distance function to take into account information from previous iterations. Section 4.3 discusses convergence in more depth.

First proposed algorithm
Algorithm 4 is a straightforward modification of Algorithm 3 which updates its distance function at each iteration using scales derived from the previous iteration's simulations. The first iteration accepts everything so no distance function is required. This acts as an initial tuning step. Note that scales are based on both accepted and rejected simulations from the previous iteration. This is because using just the accepted simulations would mean the scales are sometimes mainly determined by the previous acceptance rule, restricting the scope for adaptation.
Storing all simulated s * vectors to calculate scale estimates in step 3 of Algorithm 4 can be impractical. In practice storage is stopped after the first few thousand simulations, and scale estimation is done using this subset. Other tuning details of Algorithm 4 -the choice of perturbation kernel K t and the rule to terminate the algorithm -are implemented as described earlier for ABC-PMC.

Main loop
2. Repeat following steps until there are N acceptances.

Let w
6. Let h t+1 be the α quantile of the d t+1 i values.

Second proposed algorithm
Algorithm 4 normalises simulations in iteration t based on scales derived in the preceding iteration. This could be inappropriate if two consecutive iterations sometimes generate simulations from markedly different distributions. Algorithm 5 addresses this problem.
A naive approach would be to start iteration t by tuning d t (·, ·) using an additional set of simulations based on parameters drawn from the current importance distribution. However this imposes an additional cost. Instead the algorithm makes a single large set of simulations.
These are first used to construct the tth distance function. Then the best N simulations are accepted and used to construct the next importance distribution.
A complication is deciding how many simulations to make for this large set. There must be enough that N of them are accepted. However the distance function defining the acceptance rule is not known until after the simulations are performed. The solution implemented is to continue simulating until M = N/α simulations pass the acceptance rule of the previous iteration. Let A be the set of these simulations and B be the others.
Next the new distance function is constructed (based on A ∪ B) and the N with lowest distances (from A) are accepted. The tuning parameter α has a similar interpretation to the corresponding parameter in Algorithms 3 and 4: the acceptance threshold in iteration t is the α quantile of the realised distances from simulations in A.
Using this approach means that, as well as adapting the distance function, another difference with Algorithms 3 and 4 is that selection of h t is delayed from the end of iteration t − 1 to part-way through iteration t (and therefore h 1 does not need to be specified as a tuning choice.) If desired, this novelty can be used without adapting the distance function.
Such a variant of Algorithm 3 was tried on the examples of this paper, but the results are omitted as performance is very similar to Algorithm 3.
Given the same importance density and acceptance rule, an iteration of Algorithm 5 requires the same expected number of simulations as Algorithms 3 and 4. In this sense their costs are the same. In practice, the algorithms select their importance density and acceptance rules differently so this comparison of their computational costs is limited. Section 5 contains empirical comparisons in terms of the mean squared error for a given number of simulations.

Main loop
2. Repeat following steps until there are M = N/α acceptances.

Calculate
6. Let h t be the N th smallest d * i value.
7. Let (θ t i ) 1≤i≤N be the θ * i vectors with the smallest d * i values (breaking ties randomly).

Let w
9. Increment t to t + 1.

End of loop
The comments at the end of Section 4.1 on tuning details and storing s * vectors also apply to Algorithm 5.

Convergence
This section shows that conditions for the convergence of Algorithms 4 and 5 in practice are essentially those described in Section 2.3 for standard ABC-PMC plus one extra requirement: In more detail, conditions ensuring convergence of Algorithms 4 and 5 can be taken from Theorem 1 in Appendix A. These are the same as those given for other ABC-PMC algorithms in Section 2.3 with the exception that the acceptance region A t is now defined as  (3) this can easily be seen to correspond to e t having an upper bound. This is not guaranteed by Algorithms 4 and 5, but it can be imposed, for example by updating ω t i to ω t i + δ max i ω t i after step 4 for some small δ > 0. However this was not found to be necessary in any of the examples of this paper.

Examples
This section presents three examples comparing the proposed and existing ABC-PMC algorithms: a simple illustrative normal model, the g-and-k distribution and the Lotka-Volterra model.

Normal distribution
Suppose there is a single parameter θ with prior distribution N (0, 100 2 ). Let s 1 ∼ N (θ, 0.1 2 ) and s 2 ∼ N (0, 1 2 ) independently. These are respectively informative and uninformative summary statistics. Let s obs,1 = s obs,2 = 0. Under the prior predictive distribution the MAD for s 1 is in the order of 100 while that for s 2 is in the order of 1. Therefore the first acceptance region in Figure 2 is a wide ellipse.
Under Algorithm 2 the subsequent acceptance regions are smaller ellipses with the same shape and centre. The acceptance regions for Algorithms 4 and 5 are similar for the first few iterations. After this, enough has been learnt about θ that the simulated summary statistics have a different distribution, with a reduced MAD for s 1 . Hence s 1 is given a larger weight, while the MAD and weight of s 2 remain roughly unchanged. Thus the acceptance regions change shape to become narrower ellipses, which results in a more accurate estimation of θ, as shown by the comparison of mean squared errors (MSEs) in Figure 3. Note that Algorithm 5 adapts its weights more quickly than Algorithm 4 and hence achieves a smaller MSE.

g-and-k distribution
The g-and-k distribution is a popular test of ABC methods. It is defined by its quantile function: where z(x) is the quantile function of the standard normal distribution. Following the literature (Rayner and MacGillivray, 2002), c = 0.8 is used throughout. This leaves (A, B, g, k) as unknown parameters.
The g-and-k distribution does not have a closed form density function making likelihoodbased inference difficult. However simulation is straightforward: sample x ∼ Unif(0, 1) and substitute into (4). The following example is taken from Drovandi and Pettitt (2011b).
Suppose a dataset is 10,000 independent identically distributed draws from the g-and-k distribution and the summary statistics are a subset of the order statistics: those with indices (1250, 2500, . . . , 8750). (As in Fearnhead and Prangle, 2012, a fast method is used to simulate these order statistics without sampling an entire dataset.) The parameters are taken to have independent Unif(0, 10) priors.
To use as observations, 100 datasets are simulated from the prior predictive distribution.
Each is analysed using Algorithms 3, 4 and 5. All analyses uses a total of 10 6 simulations and tuning parameters N = 1000 and α = 1/2. Table 1 shows root mean squared errors for the output of the algorithms, averaged over all the observed datasets. These show that the adaptive algorithms, 4 and 5, are more accurate overall for every parameter, and perform very similarly to each other.  Table 1: Root mean squared errors of each parameter in the g-and-k example, averaged over analyses of 100 simulated datasets.
More detail is now given for a particular observed dataset, simulated under parameter values (3, 1, 1.5, 0.5). Figure 4 shows the estimated MSE of each parameter for each iteration of the three algorithms. The adaptive algorithms, 4 and 5, performs better throughout for the g and k parameters. For this dataset all the algorithms perform similarly for the location and scale parameters A and B, which have smaller MSE values. Table 2 demonstrates that the main difference in the final estimated posteriors is that Algorithm 3 has higher variances for the g and k parameters.  Figure 5 shows some of the distance function weights produced by the algorithms. Algorithm 3 places low weights on the most extreme order statistics, as they are highly variable in the prior predictive distribution. This is because the prior places significant weight upon parameter values producing very heavy tails. However by the last iteration of Algorithms 4 and 5 such parameter values have been ruled out. The algorithm therefore assigns larger weights which provide access to the informational content of these statistics.

Lotka-Volterra model
The Lotka-Volterra model describes two interacting populations. In its original ecological setting the populations represent predators and prey. However it is also a simple example of biochemical reaction dynamics of the kind studied in systems biology. This section concentrates on a stochastic Markov jump process version of this model with state (X 1 , X 2 ) ∈ Z 2 representing prey and predator population sizes. Three transitions are possible: These have hazard rates θ 1 X 1 , θ 2 X 1 X 2 and θ 3 X 2 respectively. Simulation is straightforward by the Gillespie method. Following either a transition at time t, or initiation at t = 0, the Algorithm 3 Algorithm 4 (last iteration) Algorithm 5 (last iteration) Figure 5: Summary statistic weights used in Algorithms 3, 4 and 5 for the g-and-k example, rescaled to sum to 1. time to the next transition is exponentially distributed with rate equal to the sum of the hazard rates at time t. The type of the next transition has a multinomial distribution with probabilities proportional to the hazard rates. For more background see for example Owen et al. (2015), from which the following specific inference problem is taken.
The initial conditions are taken to be X 1 = 50, X 2 = 100. A dataset is formed of observations at times 2, 4, 6, . . . , 32. Both X 1 and X 2 are observed plus independent N (0, σ 2 ) errors, where σ is fixed at exp(2.3). The unknown parameters are taken to be log θ 1 , log θ 2 and log θ 3 . These are given independent Unif(−6, 2) priors. The vector of all 32 noisy observations is used as the ABC summary statistics.
A single simulated dataset is analysed (shown in Figure 8.) This is generated from the model with θ 1 = 1, θ 2 = 0.005, θ 3 = 0.6. ABC analysis is performed using Algorithms 3, 4 and 5. A total of 50, 000 simulations are used in each algorithm. The tuning parameters are N = 200 and α = 1/2. Any Lotka-Volterra simulation reaching 100, 000 transitions is terminated and automatically rejected. This avoids extremely long simulations, such as exponential prey growth if predators die out. These incomplete simulations are excluded from the MAD calculations, but this should have little effect as they are rare. Figure 6 shows the MSEs resulting from the analyses. The adaptive algorithms, 4 and 5, have similar outputs. Both produce smaller errors than Algorithm 3 for all parameters after roughly 10,000 simulations. Table 3 demonstrates that the main difference in the final estimated posteriors is that Algorithm 3 has higher variances. Figure 7 shows the weights used throughout Algorithm 3 and those used in the final iteration of the others. Again the adaptive algorithms are similar to each other but different to Algorithm 3. (2013) propose ABC-SMC algorithms which update the population of (θ, s) pairs between iterations in different ways to ABC-PMC. In all of these it seems possible to update distance functions using the strategies of Algorithms 4 and 5. However some of these variations would require additional convergence results to those given in Appendix A.
Several aspects of Algorithms 4 and 5 could be modified. One natural alternative is to use Mahalanobis-style distance functions d t (x, y) = (x − y) T W t (x − y) 1/2 where W t is an estimate of the precision matrix. Scenarios exist in which this performs much better than weighted Euclidean distance, (3) (Sisson, personal communication). However exploratory work found it gave similar or worse performance for the examples in this paper. Distance (3) is preferred here for this reason, and also because its weights are easier to interpret and there are more potential numerical difficulties in estimating a precision matrix. Nonetheless, for other problems it may be worth considering both alternatives.
Another reason it may be desirable to modify the distance function (3)

A Convergence of ABC-PMC algorithms
Algorithm 6 is an ABC importance sampling algorithm. This appendix considers a sequence of these algorithms. Denote the acceptance threshold and distance function in the tth element of this sequence as h t and d t (·, ·). The ABC-PMC algorithms in this paper can be viewed as sequences of this form with specific choices of how h t and d t are selected. Note ABC-rejection is a special case of Algorithm 6 with q(θ) = π(θ), so this framework can also investigate its convergence as h → 0.
5. Calculate w * i = π(θ * i )/q(θ * i ) (where π(θ) is the prior density) The output of importance sampling is a weighted sample (θ i , w i ) 1≤i≤P for some value of P .

A Monte Carlo estimate of E[h(θ)|s obs ] for an arbitrary function h(·) is then
. For large P this asymptotically equals (as shown in Prangle, 2011 for example) the expectation under the following density: π ABC,t (θ|s obs ) ∝ π(s|θ)π(θ)1[d t (s, s obs ) ≤ h t ]ds, known as the ABC posterior.
The conditions are: C1. θ ∈ R n , s ∈ R m for some m, n and these random variables have density π(θ, s) with respect to Lebesgue measure.
C2. The sets A t = {s|d t (s, s obs ) ≤ h t } are Lebesgue measurable.
The definition of bounded eccentricity is that for any A t , there exists a set B t = {s | ||s − s obs || 2 ≤ r t } such that A t ⊆ B t and |A t | ≥ c|B t |, where ||.|| denotes the Euclidean norm and c > 0 is a constant.
The fourth equality follows from the Lebesgue differentiation theorem, which requires conditions C4 and C5. For more details see Stein and Shakarchi (2009) for example.