Selecting Summary Statistics in Approximate Bayesian Computation for Calibrating Stochastic Models

Approximate Bayesian computation (ABC) is an approach for using measurement data to calibrate stochastic computer models, which are common in biology applications. ABC is becoming the “go-to” option when the data and/or parameter dimension is large because it relies on user-chosen summary statistics rather than the full data and is therefore computationally feasible. One technical challenge with ABC is that the quality of the approximation to the posterior distribution of model parameters depends on the user-chosen summary statistics. In this paper, the user requirement to choose effective summary statistics in order to accurately estimate the posterior distribution of model parameters is investigated and illustrated by example, using a model and corresponding real data of mitochondrial DNA population dynamics. We show that for some choices of summary statistics, the posterior distribution of model parameters is closely approximated and for other choices of summary statistics, the posterior distribution is not closely approximated. A strategy to choose effective summary statistics is suggested in cases where the stochastic computer model can be run at many trial parameter settings, as in the example.


Introduction
To advance knowledge of biological systems, bioinformatics includes a wide range of real and modeled data. For a model with parameters and data , a key quantity in Bayesian inference is the posterior distribution of model parameters given by Bayes rule as post ( | ) = ( | ) prior ( )/ ( ), where prior ( ) is the probability distribution for prior to observing data , ( | ) is the likelihood, and ( ) = ∫ ( | ) prior ( ) is the marginal probability of the data, used to normalize the posterior probability post ( | ) to integrate to 1 [1]. The likelihood ( | ) can be regarded as the "data model" for a given value of . Alternatively, when the data is considered fixed, ( | ) is regarded as a function of , and non-Bayesian methods such as maximum likelihood find the value of that maximizes ( | ) [1]. Regarding notation, note, for example, that ( | ) is not the same as ( ), but to keep the notation simple, we assume the distinction is clear from context. In many applications, the data model ( | ) is computationally intractable but instead is implemented in a stochastic model (SM), so many realizations from ( | ) are available by running the model many times at each of many trial values of . In a bioinformatics example, [2] considered the classic problem of inferring the time to the most recent common ancestor of a random sample of DNA sequences. The full likelihood of the data involves the branching order and branch lengths, which is known to be computationally intractable because the number of possible branching orders of a sample of DNA sequences grows approximately as !. Therefore, [2] greatly simplified the analysis by replacing with the number of segregating sites (a segregating site is a site that exhibits variation in the DNA character across the sample) in the sample of sequences. The key simplification exploited in [2] is that the distribution of does not depend on the branching order or individual branch lengths, but only on the total length of the phylogenetic tree, which is the sum of all branch lengths. Of course is a summary statistic that has long been of interest in population genetics. But how effective is for estimating the posterior distribution of the time to the most recent common ancestor of the sample? The main point of this paper is to explore the impact of the choice of summary statistic(s) on the quality of the estimated posterior distribution post ( | ) when using approximate Bayesian computation (ABC), which is defined in Section 2. We investigate the user requirement to choose good summary statistics to effectively estimate the posterior distribution of model parameters by example, using a model and corresponding real data of mitochondrial DNA population dynamics.
In our context, the SM provides the data generation mechanism, so there is no explicit functional form for ( | ). Likelihood-free inference dates to at least [3], but the name approximate Bayesian computation (ABC) originated in [4] while referring to an approach to likelihood-free inference methods. Effective values of input parameters for both deterministic and stochastic computer models are typically chosen by some type of comparison to measured data. Parameter estimation for deterministic models is frequently done by running the model at multiple values of the input parameters, constructing an approximator to the model, and using the approximator inside a numerically intense loop that examines many trail values for the input parameters [5][6][7][8].
The numerically intense loop is often Markov Chain Monte Carlo (MCMC), which is a method to simulate observations from the posterior distribution of model parameters [1,9]. Parameter estimation for stochastic models for which an explicit likelihood is not available has been attempted at least once using MCMC with a model approximator [10], but is far more commonly done using ABC. For examples of ABC applied to calibrate SMs, see [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27] and the many references cited by [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27]. The example in Section 4 is based on the example in [10], but we use ABC instead of a model approximator inside the MCMC loop. The paper is organized as follows. The next section gives background on ABC. Section 3 describes in more detail the challenge in ABC of choosing effective summary statistics. Section 4 is an example, using a model and corresponding lab data of mitochondrial DNA population dynamics. The example shows that for some choices of summary statistics, the posterior distribution of model parameters is closely approximated and for other choices of summary statistics, the posterior distribution is not closely approximated. A strategy to choose effective summary statistics is suggested in cases where the stochastic computer model can be run at many trial parameter settings, as in the example.

ABC Background
Assume that a SM has input parameters and outputs data = ( | ) ( for "model") and that there is corresponding observed real data obs . In this section and the remaining sections we either use the conventional notation for data or the informal used in the Introduction, depending on context. We replace the notation for the data generation mechanism ( | ) with ( | ) to convey the fact that there is no explicit functional form for the likelihood, but only a "black box" SM that outputs data for given values of inputs . That is, traditionally, the notation ( | ) conveys a specific functional form, such as the familiar Gaussian distribution, while the notation ( | ) conveys the black box function encoded by the SM.
The ABC approach uses obs to "calibrate" the SM by choosing effective values for the parameters. If the SM can be run for many trial values of , MCMC can be used, where candidate values are accepted in the chain if the distance ( obs , ( )) between obs and ( ) is reasonably small. Alternatively, for most applications, and for our focus here, it is necessary to reduce the dimension of obs to a relatively small set of summary statistics and instead accept trial values of inside the MCMC loop if ( ( obs ), ( ( ))) < . For example, obs can be a time series of changes in the proportion of mutant species at various time lags, while ( obs ) could be a scalar count of how often successive differences in obs are larger than a multiple of the measurement error. Most applications of ABC have relied on summary statistics that are chosen on the basis of expert opinion or established practice (such as the number of segregating sites in the example in Section 1) rather than for their role in providing a high quality approximation to the posterior distribution post ( | obs ) [4,12,14,18,20].
The goal in nearly all Bayesian inference is to approximate the posterior distribution post ( | obs ) of given the data obs . The ABC approach to do so is to estimate post ( | obs ) = ( obs | ) prior ( )/ ( ) using the so-called partial posterior distribution post ( | obs ) = ( obs | ) prior ( )/ ( obs ). That is, ABC conditions on the value of the observed summary statistic obs rather than on the actual data obs . Because trial values of are accepted if ( ( obs ), ( ( ))) < , an approximation error to the partial posterior distribution arises that several ABC options attempt to mitigate. Such options involve weighting the accepted values by the actual distance ( ( obs ), ( ( ))) [13].
ABC was developed to calibrate a model using summary statistics, but ABC has the potential to choose between candidate models, say models 1 and 2 . When analytical likelihoods are available, one typically evaluates ( | obs ) using the likelihoods 1 ( obs ) and 2 ( obs ) and the prior probabilities of the models 1 and 2 . Bayesian model selection is a large topic [1,4,12,21], and it is currently used in calibrating deterministic models using field data [5][6][7][8]. Using Bayes rule, ( 1 | obs ) = ( ( obs | 1 ) ( 1 ))/ ( obs ) and ( obs | 1 ) are the marginal likelihood for model 1 , defined as ( obs | 1 ) = ∫ ( obs | , 1 ) prior ( ) . In model selection to decide between 1 and 2 , the prior probabilities ( 1 ) and ( 2 ) must also be specified so that ( 1 | obs ) can be compared to ( 2 | obs ) [1,21]. The analogous concept in the case of stochastic models is still the posterior distribution ( 1 | obs ) or ( 2 | obs ), but summary statistics are used to approximate ( 1 | obs ) and ( 2 | obs ). Applications papers have extended ABC to include an option to choose among candidate models that includes different models with possibly different numbers of parameters in a solution space that is explored by simulation [4,12,21]. However, the approximation quality of ABC with or without model selection is a subject of ongoing research [18][19][20][21].
ABC is compelling, when the data and/or parameter dimension is large, and is becoming the "go-to" option for many application areas, particularly whenever the likelihood involves summing probabilities over many unobserved states such as genealogies in biology [2], applications in epidemiology [22], astronomy, and cosmology [23]. However, challenges remain in ensuring that ABC leads to reasonable approximation to the full posterior distribution of SM parameters .

Choosing Summary Statistics for ABC
To obtain samples from the approximate posterior distribution for candidate models and model parameters, ABC invokes MCMC [1,9,17] with summary statistics such as moments of the observed data to those in the simulated data to decide whether to accept each candidate model and set of parameter values inside the MCMC loop. Note that in cases where the likelihood (the probability density function, pdf, viewed as a function of the parameters values) is known except for a normalizing constant, MCMC has been the main option for numerical Bayesian inference since the 1990s [1]. The main challenges with MCMC using a known likelihood function are that efficient sampling methods are sometimes needed to choose candidate parameter values, and in all cases the burden is on the user to check whether the MCMC is actually converging to the correct full posterior distribution. Because ABC simply accepts trial values of the parameters provided ( ( obs ), ( ( ))) < , a common version of ABC uses a very specialized form of MCMC that is called the "rejection" method. Other ABC versions are under investigation [17].
ABC typically consists of three steps: (1) sample from the prior distribution of parameter values prior ( ); (2) simulate data for each simulated value of ; (3) accept a fraction of the samples prior values in (1) by checking whether the summary statistics computed from the data in (2) satisfy ( ( obs ), ( ( ))) < . If desired, adjust the accepted values on the basis of the actual ( ( obs ), ( ( ))) value. Despite the simplicity of ABC, open questions remain regarding to what extent ABC achieves its goal of approximating the full posterior probability. ABC has been shown to work well in some cases [19], but it has also proven not to work well in other cases [21]. There are open questions for ABC regarding the choice of summary statistics [18][19][20][21], whether model selection via ABC is viable (meaning that the user can know whether the estimation quality of the full posterior distribution is adequate to successfully compare candidate models [21]), and regarding error bounds for the estimated posterior distribution. Approximate error bounds are possible by simulation using auxiliary simulations such as in Section 4 and [20].
ABC requires the user to make three choices: the summary statistics, the threshold , and the distance measure . This paper's focus is on the user's choice of summary statistics. Recall from the Introduction that in many applications of ABC, the user chooses summary statistics such as the number of segregating sites in a random sample of DNA sequences simply because such a statistic is heavily used in the application area without considering whether the chosen summary statistic renders the partial posterior to be a good approximation to the full posterior.
A few recent papers have considered summary statistic selection from the viewpoint of aiming for better inference or better approximation to the full posterior probability [18][19][20][21]. ABC makes two approximation steps. First, the full posterior probability is estimated by the partial posterior probability. Second, the partial posterior probability is itself estimated. Recall from Section 2 that some versions of ABC include options to improve the quality of the partial posterior approximation, such as weighting the accepted parameter values in the MCMC [12,13].
To improve the choice of which partial posterior approximation to use, the notion of approximate statistical sufficiency can be invoked to try to choose more effective summary statistics [18]. Suppose there is a list of candidate summary statistics { 1 , 2 , . . . , }. A user then wonders whether adding candidate statistic +1 would improve the approximation of the full posterior post ( | obs ). In [18], ABC must be performed on { 1 , 2 , . . . , } and then on [18] is therefore the same framework as for variable selection in fitting any response, so the full arsenal of possibilities in modern data mining is possible. To date, only relatively simple variable selection that is sensitive to the order with which candidate summary statistics are presented has been assessed, only in the few examples in [18]. In [18], the procedure to decide whether ( ) is statistically significantly different from 1 involves an auxiliary simulation and calculating the maximum and minimum values of ( ) on a user-chosen grid of values. An even more computationally demanding option to decide whether ( ) is statistically significantly different from 1 could invoke some type of density estimation. The simulation approach in [18] makes no judgment whether including candidate summary statistic +1 leads to a better approximation. Instead, the simulation approach aims to infer whether including +1 has a significant impact on the estimated partial posterior distribution.
Alternatively, to choose effective summary statistics [19] aims to make the selection of summary statistics more "automatic" and less user dependent by requiring the user to run pilot simulations of the model. However, the examples in [19] illustrate the potential for poor ABC performance because the three ABC choices that lead to best performance were shown to vary across examples. The suggested strategy in [19] requires pilot runs of the model in order to improve the user choices, particularly of the summary statistics. The pilot runs require a set of input parameter values to generate data that is similar to the real data obs . The goal is then for many realizations of the data from parameter values to help the user choose summary statistics. Specifically, for ABC to lead to good estimation of , [19] shows that the estimated posterior means of the parameters based on the pilot runs are effective summary statistics. There are several options described in [19] to estimate the posterior means of model parameters. The simplest one to describe is to fit in turn each individual parameter in using some type of data transformation, such as the actual data and moments of the data. The fitted coefficients from the fit are obtained from the pilot simulation runs and can then be used in subsequent runs to estimate the parameter means. Reference [20] also aimed for better estimation of and also used auxiliary simulations, but, unlike [19], summary statistics were pursued that minimized the entropy (uncertainty) of the estimated full posterior probability. Of course the user might have other criteria, such as for ABC to lead to a good estimation of the full posterior for as in our example in Section 4 which also relies on auxiliary simulations. To summarize this section, the choice of summary statistics is very important if the partial posterior post ( | obs ) obtained using ABC is to provide an adequate approximation to the full posterior post ( | obs ). A few publications have begun to address the issue of summary statistic selection [18][19][20][21]. And, a debate has begun to what extent the partial posterior post ( | obs ) obtained using ABC is adequate for model selection [21]. Again, summary statistic selection is an important aspect of ABC's ability to provide adequate model selection capability.

Example: Mitochondrial DNA Population Dynamics Model
This section presents an example for which the stochastic computer model is relatively simple so we can generate many observations from the model.

Example.
Neuronal loss in the substantia nigra region of the human brain is associated with Parkinson's disease [10]. Deletion mutations in the mitochondrial DNA (mtDNA) in the substantia nigra region are observed to accumulate with age. A deletion mutation converts a healthy copy of mtDNA to the mutant (unhealthy) variant. The number of mutant copies in cases with Parkinson's disease tends to be higher than in controls without Parkinson's disease. The role that mtDNA deletions play in neuronal loss is not yet fully understood, so better understanding of how mtDNA deletions accumulate is an area of active research. Reference [10] used a simple stochastic model that allowed for any of five reactions, occurring at rates to be estimated. The five reactions are mutation, synthesis, degradation, mutant synthesis, and mutant degradation. Let 1 denote the number of healthy (1) mtDNA copies and 2 denote the number of unhealthy (2) (mutant) mtDNA copies. Following [10] we assume the following five reactions are possible, with the reaction rates as specified. The lower case 1 and 2 refer to an individual cell of type 1 or 2. So, for example, reaction 1 below depicts a single cell of type 1 mutating to type 2 at a rate 1 1 .
3 : 1 → 0 at rate 3 1 4 : 2 → 2 2 at rate 4 2 /( 1 + 2 ) = 1000 3 2 The time between reactions is assumed to have an exponential distribution. The sum of the five rates is the total reaction rate, which determines exponential parameter (the average time between reactions). Given that a reaction occurs at a specific time, the relative rates determine the probabilities with which the five reactions occur. To model the harmful effects of mutation from type 1 to type 2 cells, it is assumed that a cell dies if its proportion of mtDNA for some lethal threshold . This simple model can be simulated from exactly using Gillespie's discrete event simulation [28]. Reference [10] gives more information, including information about measurement error models. To focus on summary statistic selection, we simplify the measurement assumptions and measurement error model used in [10] and assume that measurements of { 1 , 2 } are available at a sequence of times { 1 , 2 , . . . , }. The real measurement data will be assumed to be of this form, although the measurement details and number of neurons sampled multiple times from each of 15 patients of varying ages make the measurement process used in [10] somewhat more complicated. In particular, the model in [10] did not include a between-patient factor, so we simplified the data by aggregating the data over patients and measurements of the same patient at the same age. Figure 1 plots the aggregated real data from Figure 1 of [10] and from one realization of simulated data assuming that cells are measured each day.
Note that rates 2 and 4 are assumed to be known multiples of rate 3 , so the inference goal is to estimate { 1 , 3 , }. A range of possible values for each of { 1 , 3 , } was based in [10] on previous investigations. The prior range for 1 was 10 −6 to 10 −3 per day, for 3 was 3 × 10 −5 to 10 −3 per day, and for was 0.5 to 1. All three of these parameters are of interest not just as model calibration parameters, but for their physical implications. For example, it is not yet known whether neurons can survive with very high levels of mtDNA deletions. As with any Bayesian analysis, an evaluation of the sensitivity to the prior distribution for the model parameters should be included, particularly for informative priors. In our example, we used informative priors, uniform over the accepted ranges. A separate simulation confirmed that the estimated posterior is sensitive to the assumed parameter range. An example comparison using two ranges for the uniform priors is given in Section 4.4.
The approach in [10] to estimate { 1 , 3 , } is based on approximating the computer model using a Gaussian Process, which is a common approach in calibrating deterministic computer models. Reference [10] mentions the possibility of estimating { 1 , 3 , } using ABC. Future work will compare such options. Here, we focus on better understanding of the effect of summary statistic selection on ABC performance.

ABC Approach.
Recall that we assume measurements of { 1 , 2 } are available at a sequence of times { 1 , 2 , . . . , }. The real data we use is in Table 1 of [10], which for simplicity we aggregate over subjects and measurements within subjects to the data shown in Figure 1 (see Section 4.4). Note that the real data is observed much less frequently than once per simulated time step which is one day in our simulation. For completeness, we first assume real data is available once per day and then assume real data is available much less frequently, such as in Figure 1(a).
In any implementation of ABC the user must specify the distance measure, the acceptance threshold , and the summary statistics. In addition, the user chooses the number of model runs at each value of the parameter vector and the number of values of the parameter vector = { 1 , 3 , } presented to the ABC algorithm. The statistical programming language R is among the good choices for ABC implementation; here we use the abc function in the abctools package for R [29]. The abc function also requires the user to decide whether to work with transformed parameter values and to select a method to improve estimation of the partial posterior by adjusting the accepted values according to the distance between the summary statistics and the observed summary statistics [4,13]. The default method is the "unadjusted" method which accepts all values corresponding to ( ( obs ), ( ( ))) < without any weighting. Results given in Section 4.4 are for the unadjusted option and for the option that adjusts accepted values.

Simulation Approach.
Our goal for this mDNA example is to illustrate an approach to making good choices for the summary statistics when the user wants the estimated partial posterior distribution for to be well calibrated. Well calibrated in this example context means, for instance, that the true is contained in approximately 95% of repeated constructions of 95% predictive intervals for . That is, the actual coverage is very close to the nominal coverage.

Simulation Procedure
Step A. Simulate data from the SM at many parameter values = { 1 , 3 , }. Specifically, (1) select each of { 1 , 3 , } from their respective uniform prior distributions (the ranges are given in Section 4.1) for sim = 1000 simulations, (2) for each selected value of { 1 , 3 , }, simulate up to 100 years of 1-day step sizes of the five reaction rates. If 2 /( 1 + 2 ) > at any step, terminate. Some variations of ABC will repeatedly simulate in step (2) for the chosen { 1 , 3 , } values in step (1).
Step B. Real data (or simulated, but with the simulated playing the role of real data): (1) Real data: Either use real measurement data obs or mimic one realization of real measurement data by repeating Step A once. Here we use simulated measurement data to mimic real data, so that we can know the true value of = { 1 , 3 , }. (2) Using the sim = 1000 simulations from Step A, accept the trial = { 1 , 3 , } values from 100 (10%) of the sim = 1000 simulations on the basis of ( ( obs ), ( ( ))) in each of the sim = 1000 simulations, resulting in an approximation of the partial posterior probability post ( | obs ) which serves to estimate the full posterior probability post ( | obs ). The accepted trial values can be used "as is" to approximate post ( | obs ) or adjusted to account for the actual distance ( ( obs ), ( ( ))), for example, as in [4,13].
Step C. (1) Use the 100 accepted trial values of in Step B to tally whether the true parameter values are contained within the estimated 95%, 90%, 80%, 60%, 50%, 40%, 20%, 10%, and 5% posterior intervals. This 3-step simulation procedure is repeated for rep = 1000 replications. Following [10], each simulation began with 6 BioMed Research International = 1000 cells. We depart slightly from [10] in that each simulation began with 2 = 600 mutant cells (rather than 0 mutant cells), so that each run of up to 100 years tended to conclude in a modest number of years from the starting point due to 2 /( 1 + 2 ) exceeding the lethal maximum , so that run times are shorter; this mimics starting with older subjects, nearly all of which do not live close to 100 years beyond their fictitious starting age defined by having 600 mutant cells at the start of the simulation. Such a choice will impact our inference results, analogous to choosing data ranges in calibration experiments. However, our topic is the choice of summary statistics rather than experimental design for choosing effective data ranges (the data ranges are the subjects' ages in our example).
Also following [10] we simulated the effects of measurement error, but for simplicity we assumed there was only one measurement method rather than two. To mimic measurement errors due to finite number of observations and the actual measurement process itself, we assume that only 300 of the 1000 cells were observed and that 1 /( 1 + 2 ) fraction was measured with a relative random error standard deviation of 0.20 on the log 10 scale. These two effects (observing 300 of 1000 and 0.1% relative error standard deviation on log 10 ( 1 /( 1 + 2 )) result in a root mean squared error of approximately 0.13 in the measured relative frequency 1 /( 1 + 2 ) on average across the range of 1 /( 1 + 2 ) values. In comparison, [10] assumes an absolute random error standard deviation of 0.25 on the log 2 scale. Because two measurement techniques are combined in [10], which complicated the analysis beyond our needs here, we do not attempt to exactly mimic their approach, but only to use reasonable measurement error assumptions for illustration.

The Summary
Statistics. Let 1 denote the measured value of 1 , and 2 denote the measured value of 2 . The first candidate set of summary statistics is the following three: the average rate of change of 1 /( 1 + 2 ), coefficients 1 and 2 in a linear model relating the change in 1 (the response) to predictors consisting of the current 1 , and the current ratio 1 /( 1 + 2 ). The second candidate set of summary statistics is the same as the first, but also includes the maximum of the observed ratio 1 /( 1 + 2 ) and the number of steps until cell death. The third candidate set of summary statistics is the same as the first, but also includes coefficients 1 and 2 in a linear model relating the change in 2 (the response) to predictors consisting of the current 2 , and the current ratio 1 /( 1 + 2 ). All three candidate sets of summary statistics were computed for sets of simulated data that was observed at each time step (day), and also much less frequently as in the real data. To mimic the real data, we sampled the simulated data at 13 random times over the duration of each simulation.
Concerning the choice of summary statistics, these three candidate sets are arbitrary but reasonable statistics that clearly relate to the SM and so are informative for the SM parameters. For example, the average rate of change of 1 /( 1 + 2 ) relates directly to parameter 3 .

Example Results.
Here we present results for three candidate sets of summary statistics. Our strategy involves two criteria. First, retain for consideration any set of summary statistics that leads to a well-calibrated estimate of post ( | obs ) on the basis of the 3-step simulation procedure. Here, the term "well calibrated" means that actual coverage is very close to the nominal coverage. Second, among all sets of such summary statistics, choose the set that has the smallest estimation error for . The second criterion is similar to that suggested in [19,20]. The strategy in [18] described in Section 3 to decide whether adding an additional statistic will impact the posterior could of course also be used to confirm that the three candidate sets of summary statistics do lead to meaningfully different estimates of the partial posterior distribution post ( | obs ). Figure 2 is a plot of the actual (estimated to within ± 0.03 on the basis of 1000 replications of the simulation approach) coverage versus the nominal coverage for 95%, 90%, 80%, 60%, 50%, 40%, 20%, 10%, and 5% posterior intervals for set one of the three sets of summary statistics, using the ridge-based adjustment of the accepted values in abc or not [4,13]. Ridge-based adjustment is a form of local ridge regression (a modification of ordinary regression to adjust for collinearity of the predictors) that uses the actual distance ( ( obs ), ( ( ))) rather than the simple rejection criterion. Figure 3 is the same as Figure 2 but for summary statistics set 2. Figure 4 is the same as Figure 2 but for summary statistics set 3. Notice from Figures 2-4 that the unadjusted values lead to better calibration than the adjusted values, with the actual probabilities being closer to the nominal probabilities. Apparently, although it is reasonable to adjust accepted parameter values by using the actual distance ( ( obs ), ( ( ))) [4,13], whether such adjustment improves the approximation to the posterior depends on the specifics of each data set, including the adequacy of the chosen summary statistics. It is for that reason that available software such as abc allows the user to compute both adjusted or unadjusted values.
To quantify the results shown in Figures 2-4 we compute the root mean squared error (RMSE) between the observed coverage probability and the nominal coverage probability for the nine posterior intervals (95%, 90%, 80%, 60%, 50%, 40%, 20%, 10%, and 5%) for each parameter estimate for each of the three sets of summary statistics, Informally, we can choose the candidate set of summary statistics that has the smallest RMSE 1 . More formally, to determine whether the smallest RMSE 1 among the three (or any number of) candidate sets of summary statistics is significantly smaller than the second smallest RMSE, we can repeat the entire simulation procedure approximately 100 times and rank the candidate sets of summary statistics on the basis of their RMSE 1 values across the 100 repetitions of the 3-step simulation. In this example, candidate set 3 has the smallest RMSEs among the three sets of candidate summary statistics. For any set of candidate summary statistics that are acceptable on the basis of RMSE 1 , the second version of the RMSE defined as RMSE 2 = √ ∑ rep =1 (̂− ) 2 / rep in estimating 1 , 3 ,  and should be evaluated. In RMSE 2 , is either 1 , 3 , or , and̂is the corresponding estimate. As the corresponding estimate, we use the mean of the corresponding estimated posterior. The two types of RMSEs for the three sets of candidate summary statistics are listed in Table 1.
There is no guarantee that the "best" set of candidate summary statistics will dominate the other choices of summary statistics. For example, summary statistic set 3 is our choice in this example, but it has higher RMSE 2 for than the other two choices. As a reviewer has pointed out, such an outcome requires a user choice, and we suggest "majority rule, " meaning that we choose the summary statistic set that has the smallest RMSE 1 and/or RMSE 2 (depending on user needs) for the most number of parameters. So, in this case we invoke "majority rule" and choose summary statistic set 3.
Any application of ABC that does not include a simulation evaluation such as this one or similar ones in [18][19][20][21] is incomplete. Somewhat unfortunately, this means that the choice of summary statistics is not truly "automatic, " because it relies on intensive simulations in addition to those in standard ABC. However, the choice of summary statistics can be regarded as "objective, " because a similar strategy is a necessary part of a complete ABC application.
For completeness here, we also use the real data from Table 1 of [10] in place of the simulated data described in the simulation procedure above. Recall from above that the best results (lowest RMSEs) were obtained using candidate  1  1  1  1  1 1  2 2  2  2  2  2  2  2 2   3 3  3  3  3  3  3  3 3 (1) c 1 (2) c 3 summary statistics choice 3 with no adjustment of the accepted values. However, the real data is observed less frequently than each time step (day), so next we use slightly different summary statistics than described above. Rather than lag-one (one-day) changes, we use the actual times between measurements so that approximate rates of changes can be computed. The resulting posterior for the data in Table 1 of [10] is given in Figure 5 for summary statistic choice 3. Additionally, a second set of simulations was done for simulated data observed only approximately 13 times over the simulation (as in the real data in Figure 1(a)). Again, summary statistic choice 3 had the lowest RMSE 1 and RMSE 2 values (uniformly lowest in this case, even for ).
Finally, any Bayesian analysis should address the issue of whether the posterior is sensitive to the prior. For example, in our ABC context, we first used the uniform priors for each parameter as described in Section 4.1, which are the same as those used in [10]. To evaluate sensitivity to the prior, we modified the parameter ranges for 1 from 10 −6 to 10 −3 per day to 10 −7 to 10 −2 per day, for c 3 from 3 × 10 −5 to 10 −3 per day to 3 × 10 −4 to 10 −2 per day, and for from 0.50 to 1 to 0.85 to 1. Using summary statistics set 3, the posterior means for { 1 , 3 , } are 0.0007, 0.0001, and 0.90, respectively, for the original prior ranges and are 0.00007, 0.001, and 0.87, respectively, for the modified prior ranges. These posterior means were each calculated twice using 10 3 simulations and are repeatable to within the number of digits listed. Therefore, the choice of prior does significantly impact the posterior in our example. Reference [10] discusses the physical consequences of various parameter values, particularly for . However, there are no widely accepted values for any of the three parameters, so we cannot use accepted parameter values as another check to compare ABC summary statistic choices. Instead, we assess the quality of the ABC-based approximation to the posterior using auxiliary simulation as illustrated in Figures 2-4 comparing predicted to actual coverage probabilities and using RMSEs as in Table 1.

Summary
ABC is becoming the "go-to" option when the data and/or parameter dimension is large because it relies on user-chosen summary statistics rather than the full data and is therefore computationally feasible. Although ABC is compelling, when the data and/or parameter dimension is large, and is beginning to be used in many application areas, as of 2013, there is no cohesive theory or a consistent strategy for ABC, yet there are many applications in bioinformatics, astronomy, epidemiology, and elsewhere for which a stochastic CM provides an alternative to the likelihood. In addition software to implement ABC is becoming widely available; see [30] for a partial list of currently available ABC software.
One technical challenge with ABC is that the quality of the approximation to the posterior distribution of model parameters depends on the user-chosen summary statistics. In this paper, the user requirement to choose effective summary statistics in order to accurately estimate the posterior distribution of model parameters is illustrated by example, using a model and corresponding lab data of mitochondrial DNA population dynamics. The example shows that for some choices of summary statistics, the posterior distribution of model parameters is closely approximated and for other choices of summary statistics, the posterior distribution is not closely approximated.
A strategy to choose effective summary statistics is suggested in cases where the stochastic computer model can be run at many trial parameter settings, as in the example. The strategy is to choose the best results from several candidate sets of summary statistics, such as shown in the Results in Figures 2-4. As in [19,20], auxiliary simulations that produce data having similar summary statistics as the observed data are needed. Then, the best results are defined on the basis of two criteria. First, those summary statistics that lead to the best-calibrated estimated posterior probabilities are identified. Second, among those summary statistics that perform well on the first criterion, those summary statistics that lead to the smallest estimation errors for the parameters are preferred. The disadvantage of this approach is that reliance on auxiliary simulations to choose summary statistics adds to the computational burden. However, the ABC algorithm is easily parallelized so modern desktop computers are fully adequate for many problems, such as our example. The user might consider using criteria other than those used in Figures  2-4 and in Table 1 to evaluate the posterior distribution. However, we regard those criteria as necessary for adequate approximation to the posterior in this context. Future work will consider the acceptance threshold and variations of ABC such as [17] that mimic standard MCMC sampling rather than using the rejection method with adjustments to the accepted trial values as in [4,13]. Also, because real data almost never obey all the assumptions of any model, even the most elaborate stochastic model, some allowance for model bias should be made as that done with deterministic models [6][7][8]. Finally, a comparison of this ABC approach with the stochastic model approximator approach in [10] would be valuable.