Improving the Eﬃciency of Fully Bayesian Optimal Design of Experiments using Randomised Quasi-Monte Carlo

Optimal experimental design is an important methodology for most eﬃciently allocating re-sources in an experiment to best achieve some goal. Bayesian experimental design considers the potential impact that various choices of the controllable variables has on the posterior distribution of the unknowns. Optimal Bayesian design involves maximising an expected utility function, which is an analytically intractable integral over the prior predictive distribution. These integrals are typically estimated via standard Monte Carlo methods. In this paper, we demonstrate that the use of randomised quasi-Monte Carlo can bring signiﬁcant reductions to the variance of the estimated expected utility. This variance reduction can then lead to a more eﬃcient optimisation algorithm for maximising the expected utility.


Background
A data collection process often involves variables that can be controlled. The field of optimal experimental design is devoted to developing methods to optimally choose values for these controllable variables so that the resulting data is likely to have a sufficient amount of information to address the aims of the analysis.
Fully Bayesian experimental design (see Bernardo and Smith (2000)) proceeds by specifying a utility function U (d, y, θ), which defines the worth of conducting an experiment that generated the observed data y, assuming that the dataset was drawn from a model with parameter value θ and the design d was applied. Of course, the experimental design d must be specified before data collection. A common approach in Bayesian design is to consider the expected utility over the prior predictive distribution of the data where p(y|θ, d) is the likelihood function and p(θ) is the prior distribution, which incorporates information from previous similar experiments and/or expert opinion. The optimal Bayesian design is the one that maximises the expected utility where D is the set of allowable designs. An alternative approach is when the utility is some functional of the posterior, in which θ is integrated out producing the utility U (d, y). The expected utility is then where p(y|d) is the prior predictive distribution of the data y. An example is where U (d, y, θ) = log p(θ|y, d) − log p(θ), which is the Shannon information gain (SIG). In this case, U (d, y) is the Kullback-Leibler divergence (KLD) between the prior and the posterior, U (d, y) = KLD(p(θ)||p(θ|y, d)). It should be noted that under the SIG utility, the expected utility represents the mutual information (MI) between the parameter θ and the potential future dataset y when the design d is applied. That is, the expected reduction in entropy about the parameter θ brought about by the data y. In general, the KLD is difficult to estimate precisely in a computationally efficient manner. A pragmatic alternative Bayesian parameter estimation utility is U (d, y) = − log(|Σ y|d |) where | · | denotes the determinant of a matrix and Σ y|d is the posterior covariance matrix obtained when observing data y under design d.  show that this utility is one component of the KLD when the prior and posterior are multivariate normal. In general, the posterior covariance matrix is easier to estimate than the KLD. It is important to note that other utility functions can be defined to reflect different goals such as prediction or model discrimination (see Section 3.4 for an example of the latter).
Whatever formulation is adopted for U (d) (integration over (y, θ) or just y), the integral is generally analytically intractable. Our focus is on the formulation in (2) however the ideas in this paper can be adapted to the formulation in (1). A common approach in the Bayesian design literature is to estimate the integral via Monte Carlo (MC) integration parameter value and the design d. Each U (d, y j ) calculation involves the computation or the approximation of the posterior distribution conditional on y j . Thus approximating a single expected utility calculation can be expensive, which subsequently needs to be maximised via some optimisation algorithm. The efficiency of the optimisation process will depend on the level of noise associated with expected utility estimate. The computational challenges associated with the Bayesian optimal design problem are well summarised in Ryan et al. (2016).

Contribution and Outline
In this paper we propose the use of randomised quasi-Monte Carlo (RQMC) methods to improve the precision with which the expected utility is estimated for the same value of J. Any optimisation algorithm can then benefit from the reduction in the noise of expected utility approximations. It is well known that MC methods estimate expectations that converge at a rate of O(1/ √ J) whereas RQMC can generally achieve a faster convergence rate (Owen, 1997).
The idea of QMC has been considered in pseudo-Bayesian design (see, for example, Bliemer et al. (2008)). Here the utility is some scalar function of the Fisher information matrix. In this case the utility is independent of the future data y and the expected utility is given by This integral can be estimated by MC or RQMC by taking J samples of θ from the prior in the relevant way. Our paper focuses on RQMC in the context of fully Bayesian design, which is much more difficult as it involves also the integral over the data space. We suspect that variance reduction is more critical in fully Bayesian design as it is much more computationally intensive than pseudo-Bayesian design. This is due to the fact that fully Bayesian design typically requires the repeated approximation of posterior quantities whereas pseudo-Bayesian design relies on an expression for the observed or Fisher information matrix only.
RQMC has recently received increasing attention in the statistics community. Gerber and Chopin (2015) show the efficiency of RQMC in sequential Monte Carlo, Tran et al. (2015) document a faster convergence in their Variational Bayes updating procedure when the noisy gradient is computed using RQMC. In Bayesian decision problems, one selects the best decision by minimising the Bayes risk which is an integral of the loss function over the posterior (Robert, 2007, ch.2) and it is important to estimate such integrals accurately. The RQMC method that we describe in this article can be applied to any Bayesian decision problem.
Some experimental design problems involve sequential decision making, where the loss function takes into account both the decision loss and the cost of conducting the experiment, and where the data are collected sequentially (Berger, 1985, ch.7). The optimal stopping time and the best decision are selected to optimise an expected loss function (Berger, 1985, equation 7.2), which can be estimated accurately with RQMC. To reduce the complexity of the sequential design problem, a myopic approach may be adopted that only makes an optimal decision for the next observation only (e.g. ). In this setting, the current 'prior' distribution may not be available in closed form and may be represented by a Monte Carlo sample. The application of RQMC is less clear in this setting. However, it would be possible to use importance sampling through a parametric approximation of the posterior, and the same RQMC idea in this paper can be used to estimate the expected utility of the design for the next observation conditional on the data obtained so far. In these respects, we conjecture that RQMC can have a beneficial effect on sequential decision making problems, but a thorough investigation of this is beyond the scope of the current paper. In this article, we focus only on using RQMC for Bayesian experimental design problems where we know in advance how many observations we wish to design for and that this data will be collected in a single batch after the optimal design has been determined. This is often referred to in the literature as a static design.
In Section 2 we describe RQMC and how we implement it in the context of Bayesian design. Four examples that illustrate the benefits of RQMC are provided in Section 3. We conclude in Section 4.

Quasi-Monte Carlo
This section presents a short tutorial to QMC and in particular RQMC. A thorough treatment of the subject can be found in the monographs by Niederreiter (1992) and Dick and Pillichshammer (2010). Glasserman (2004, ch.5) provides a more accessible introduction to the subject.
Consider the problem of estimating the following integral over the s-dimensional unit hypercube for some function f . Almost all Monte Carlo problems can be written in this form. It is convenient in QMC to take intervals to be closed on the left and open on the right.
Plain MC methods estimate this integral by an average of f (u j ) over J iid samples u j ∼ U(0, 1) s . It is well known that the convergence rate of such MC estimators is of order J −1/2 . QMC methods are a deterministic alternative that chooses the points u j more evenly in [0, 1) s than random in the sense that they minimize the so-called star-discrepancy of the point set. Consider a set of J points P J = {u 0 , . . . , u J−1 } with u j ∈ [0, 1) s . Let A be a collection of subsets in [0, 1) s of the form The star-discrepancy of the point set P J = {u 0 , . . . , u J−1 } is defined as where #{u j ∈ A} is the number of u j belonging to A and Vol(A) is the volume of set A. Clearly, D * (P J ) measures the non-uniformity of the point set P J . The Koksma-Hlawka inequality states that where σ 2 (f ) is the variation of the function f . This inequality provides an upper bound on the approximation error and a criterion on searching for efficient QMC point sets P J .
Several construction rules of low-discrepancy point sets P J have been proposed, for which D * (P J ) is of order J −1 (log J) s−1 . Before describing such construction rules, let us introduce the so-called van der Corput sequences (van der Corput, 1935). Given an integer b ≥ 2, any integer k ≥ 0 can be written as with all, but finitely many, zero coefficients a i (k). Then the function maps each integer k ≥ 0 to a point in [0, 1). The sequence {ψ b (k), k = 0, 1, ...} is called the base-b van der Corput sequence. The table below gives the first 8 values of the base-2 van der Corput sequence. As k increases, the van der Corput sequence fills up the interval [0, 1) k 0 1 2 3 4 5 6 7 ψ 2 (k) 0 1/2 1/4 3/4 1/8 5/8 3/8 7/8 in a very balanced way: they appear alternately on both sides of 1/2, first 1/4 and 3/4, then 1/8 and 5/8, then 3/8 and 7/8 and so on. Because of this uniformity feature, van der Corput sequences are extensively used in the QMC literature for constructing low-discrepancy point sets.
The simplest construction of low-discrepancy point sets is the Halton and Hammersley sequence (Halton, 1960;Hammersley, 1960). Let b 1 , . . . , b s be s positive integers such that their greatest common divisor is one. Then is a sequence of low-discrepancy points, which satisfies D * (u 0 , . . . , u J−1 ) = O(J −1 (log J) s ).
A more sophisticated rule is (t, m, s)-nets (Niederreiter, 1992). The set P J of J = b m points, where b ≥ 2 is an integer, is said to be a (t, m, s)-net in base b if for all d 1 , . . . , d s ≥ 0 with d 1 + · · · + d s = m − t, every elementary box of the form contains exactly b t points of P J . The volume of each elementary box is 1/b d 1 +···+ds = 1/b m−t . A (t, m, s)-net has a high uniformity in the sense that the portion of points contained in each elementary box b t /J = 1/b m−t is exactly the volume of the box. The reader is referred to Niederreiter (1992) and Dick and Pillichshammer (2010) for construction of (t, m, s)-nets.

Randomised Quasi-Monte Carlo
Unlike MC estimators which are stochastic, QMC estimators are deterministic. Applications in statistics often require randomness so that probability theory can be applied, for instance, for estimating approximation errors and constructing confidence intervals. Furthermore, we sometimes require an unbiased estimator of I(f ). To get the best of both MC and QMC, randomised versions of QMC have been proposed. The idea is that, by injecting a random element into a QMC sequence P J = {u 0 , ..., u J−1 }, the resulting randomised QMC sequencẽ P J = {ũ 0 , . . . ,ũ J−1 } preserves the low-discrepancy property and, at the same time,ũ j ∼ U[0, 1) s .
The simplest randomisation method is to shift the point set P J by a random vector r uniformly distributed over [0, 1) sũ j = (u j + r) mod 1, j = 0, 1, ..., J − 1, where α mod 1 = α − α with α the largest integer number that is smaller than α. Here the mod operator applies separately to each coordinate. It is easy to see thatũ j is uniformly distributed over [0, 1) s . Then I(f ) = (1/J) f (ũ j ) is an unbiased estimator of I(f ). It is important to note that theũ j are not independent of each other.
A more sophisticated RQMC method (Owen, 1997;Matousek, 1998) is to apply random permutations to the coefficients in the base-b expansion (3) of each number in a (t, m, s)-net. The resulting set is called a scrambled net. Owen (1997) shows that, given that f is smooth enough, the variance of I(f ) is of order J −3 (log J) s−1 , compared to the order of J −1 for plain MC estimates. So given that the dimension s is not too large, RQMC estimates achieve a better convergence rate than plain MC estimates.
Here we describe the simplified version of Matousek (1998). Write P J as a matrix of size J × s and consider a single element u ji in this matrix with the base-b expansion u ji = a 0 /b+a i.e., we apply π (i) ν to the coefficients a ν , ν = 0, 1, . . .. The same set of random permutations Π (i) is used for all elements in the ith column of P J , different and independent sets Π (i) are used for different columns. The scrambled setP J = {ũ 0 , . . . ,ũ J−1 } is a (t, m, s)-net with probability one and eachũ j ∼ U [0, 1) s . As the f (ũ j ) are dependent, it is in general harder to obtain central limit theorems (CLT) for the RQMC estimator I(f ) = (1/J) f (ũ j ). Loh (2003) obtains a CLT for RQMC estimators based on scrambled (0, m, s)-nets.

Estimating the Expected Utility
In the context of Bayesian design, simulation from the prior predictive, y ∼ p(y|θ, d)p(θ), can be written as a function (transformation) of uniform random variates. This involves a two stage process. In an abuse of notation we write the simulation of θ from the prior as the function θ = θ(u θ ) where u θ is uniformly distributed over the hypercube (0, 1) n θ where n θ is the dimension of u θ . Then to produce a simulation of y we require an additional set of n y uniform random numbers and we set y = y(u y , θ) where u y ∈ (0, 1) ny . We can write this more succinctly as u = (u θ , u y ) and set y = y(u) where u is uniform over the hypercube (0, 1) n θ +ny . Then, we can re-write the integral in (2) as If we simulate u uniformly using pseudo random numbers then we recover the standard MC estimator in (2). However, if we generate u using the procedure outlined in the previous section then we obtain an estimator based on RQMC that should have lower variance.
We note that the choice of transformation function is not unique. For example, to simulate from a normal distribution, we could use the quantile function of the normal distribution (the inversion method) or the Box-Muller method (Box and Muller, 1958). In our examples in Section 3, we use the inversion method exclusively. Note that it is not necessarily guaranteed that the space filling properties of the RQMC numbers are preserved on the transformed space. Nonetheless, in Section 3, we obtain significant improvements.
We show later that increased precision afforded by the RQMC approach can improve the efficiency of optimisation algorithms for determining the design d * that leads to the largest expected utility. There have been several methods developed for this optimisation task. We do not intend to give a complete overview here but give a few examples to provide a flavour. Müller et al. (2004) convert the optimisation into a Markov chain Monte Carlo (MCMC) simulation problem that simulates over the joint space of (d, y, θ) and admits a probability density proportional to the expected utility surface as the marginal in d. Such an approach has been commonly used in the Bayesian design literature (e.g. Cook et al. (2008); Drovandi and Pettitt (2013)).  note some issues with this approach. In particular, the mixing of the MCMC can be poor and furthermore the optimal design is obtained by estimating the mode of the density proportional to the expected utility surface in a non-parametric fashion based on only MCMC samples from this density. This approach does not scale with an increase in the length of d.
Other approaches attempt to maximise U (d) directly but their performance may be affected by the noise associated with U J (d). Huan and Marzouk (2014) implement various stochastic approximation algorithms to perform the optimisation. Gotwalt et al. (2009) apply the coordinate exchange (CE) algorithm (Meyer and Nachtsheim, 1995) in a pseudo-Bayesian design setting. The CE algorithm involves cycling through each of the design variables one-at-a-time, trialling a set of candidate replacements and choosing the best replacement if it improves the expected utility. Overstall and Woods (2016) extend this idea to an approximate CE (ACE) algorithm that uses a Gaussian process emulator so that only a relatively small number of candidates are required. Weaver et al. (2016) consider a Bayesian optimisation algorithm that uses a Gaussian process emulator over the entire design space. Our main message is that all of these approaches can benefit from a reduction in the variance of U J (d). To illustrate the capability of RQMC in determining designs with higher expected utility than MC, we use the CE algorithm (shown in Algorithm 1). This algorithm requires the specification of a vector e containing potential values for each component of the design vector d and the number of sweeps, S, to perform over the design vector.
Algorithm 1 Co-ordinate exchange design optimisation algorithm.
Input: Number of sweeps S, J, and a vector, e, containing potential values for each component of the design vector d. Output: An approximation to the optimal design d * and its corresponding estimated expected utility value. 1: Set initial design d 0 2: Estimate expected utility U 0 = u J (d 0 ) 3: Set iteration counter I ← 0 4: for s = 1 to S do 5: for j = 1 to |e| do 6: Set d + ← d I

Examples
Here we consider four examples that illustrate the variance reduction in the estimate of the expected utility that can be achieved through RQMC. For a particular design d we obtain U J (d) under both MC and RQMC for J ∈ {5, 10, 20, . . . , 100}. Note that for both MC and RQMC we use the same approach to compute/approximate U (d, y). We consider two different designs, one that is sub-optimal (e.g. a randomly selected design) and one that is closer to optimal. To obtain the closer-to-optimal design we use one run of the CE algorithm in conjunction with RQMC and J = 100. Note that we do not expend a great deal of effort into determining the most optimal design, we are simply interested in comparing MC and RQMC at two markedly different designs with different expected utilities. To estimate the precision of the estimated expected utility we obtain 50 (unless otherwise specified) independent values of U J (d) under each scenario, then compute the sample standard deviation to obtain sd(U J (d)). We prefer the integration method that leads to lower values of the standard deviation. In this paper, we use the scrambled Sobol's net, i.e. the scrambled (t, m, s)-net in base b = 2.
We also investigate how much influence the reduction in variance provided by RQMC has in determining efficient designs. As mentioned above, we use the CE optimisation algorithm for this purpose. We run this for both RQMC and MC with different values of J. When a design is updated by the CE algorithm, we compute a gold standard value of its expected utility by using RQMC and J = 1000 and keep track of this gold standard value throughout the optimisation process. Since the CE algorithm is not a global optimisation method and that it may be impacted by the noise in the expected utility estimate, we run it several times from different random starting designs. For each individual run of the CE method, we use the same starting design for both MC and RQMC. We compare the MC and RQMC methods via the median value (over the repeated runs) of the gold standard estimate of the expected utility throughout the CE iterations.
For a fixed value of J, the only difference between run times of MC and RQMC for estimating U J (d) can be attributed to the time to generate random numbers for producing different values of y. We find that there is only a very small additional cost for generating randomised QMC numbers over pseudo random numbers. The additional time required to estimate the expected utility via RQMC becomes negligible as the value of J and/or the time to approximate U (d, y) is increased.

Pharmacokinetics
We take this example from Ryan et al. (2014), which involves determining optimal blood sampling times for a pharmacokinetics (PK) model. PK studies involve the administration of a specified quantity of a drug to subjects and investigate the absorption, distribution and elimination of the drug and its metabolites. Let y t be the observed concentration of the drug at time t. We assume that where θ 1 is the elimination rate, θ 2 is the absorption rate and θ 3 is the 'the volume of distribution'. D is the administered dose, which is set fixed at 400 units. Additive and proportional error variances are set fixed at σ 2 = 0.1 and τ 2 = 0.01, respectively. The goal of the experiment is the precise estimation of the parameter log(θ) = (log(θ 1 ), log(θ 2 ), log(θ 3 )) . The prior for θ i is log-normal with a mean (on the log scale) of log(0.1), log(1) and log(20) for i = 1, 2 and 3, respectively. Each parameter has a variance (on the log scale) of 0.05 and the parameters are assumed independent a priori.
The design problem is to determine the optimal set of 15 sampling times d = (t 1 , . . . , t 15 ) where t 1 < t 2 < · · · < t 15 and the sampling times are constrained to be at least 0.25 time units apart.
To estimate the posterior distribution we use a Laplace approximation (LA) of log(θ). The LA is useful in optimal Bayesian design since the estimate of U (d, y) does not have noise associated with it (see Ryan et al. (2015) and ). Since the prior and posterior of log(θ) are both normally distributed, there is an analytical expression for the KLD.
To apply RQMC we are required to write the simulation from the prior predictive distribution as a function of uniform random variates. The prior on each θ i is log-normal so we can simulate it via θ i = exp(µ i + σ i Φ −1 (u i θ )) where Φ −1 (·) is the quantile function of a standard normal random variate, u i θ ∼ U(0, 1), and µ i and σ i are the prior mean and standard deviation of log θ i . After simulation from the prior to obtain θ, we can easily simulate the response as it is normally distributed, y t k = c(θ)µ t k (θ) + σ v t k (θ)Φ −1 (u k y ) where u k y ∼ U(0, 1) for k = 1, . . . , 15. Thus we can write the expected utility with respect to the prior predictive distribution as an integral over the 3 + 15 = 18 dimensional unit hypercube.
The sub-optimal design we consider is the evenly spaced design d = (1, 2, . . . , 15) . The closer-to-optimal design is d = (0.75, 1.0, 1.25, 1.75, 3.75, 4.0, 4.75, 5.0, 5.25, 10.5, 11.0, 11.25, 13.0, 15.0, 15.75) . Based on J = 1000 and RQMC the expected utilities of these designs are 7.02 and 7.37, respectively. The results are shown in Figure 2. It is evident that RQMC clearly outperforms MC for any value of J. Figure 3 shows the median of the gold standard expected utility when performing 50 independent runs of the CE algorithm for both MC and RQMC for a variety of J values. Here we set e = (0.25, 0.5, . . . , 24) and S = 5. It is clear that the RQMC method is beneficial in obtaining designs with higher expected utility.

Logistic Regression
We consider the logistic regression example of Overstall and Woods (2016). The response is binary, where n is the number of observations. The model parameter is θ = (β 0 , β 1 , β 2 , β 3 , β 4 ) . The observed vector of responses is denoted as y = (y 1 , . . . , y n ) and the design vector is the concatenation of the controllable elements of the design matrix, d = {x i,j ; i = 1, . . . , n, j = 1, . . . , 4} and is of length n × 4. The design space has x ij ∈ (−1, 1) for all i = 1, . . . , n and j = 1, . . . , 4. We design for n = 24 independent observations. The prior distribution allocated by  is β 0 ∼ N (0, 3), β 1 ∼ N (7, 3), β 2 ∼ N (8, 3), β 3 ∼ N (−3, 3), β 4 ∼ N (0.5, 3), and that all parameters are independent a priori. We consider the MI utility for U (d). In this case U (d, y) is the KLD between the prior and the posterior, and may be written as We consider an MC approximation of the above utility based on importance sampling (IS) where the importance distribution is simply the prior. Denote a large collection of N samples taken from the prior as Θ = {θ i } N i=1 iid ∼ p(θ). Here we set N = 100, 000. A weight is assigned to each particle, forms a particle approximation of p(θ|y, d). Then the approximation of U (d, y) is given by The advantage of using IS in the context of Bayesian design is that the same collection Θ can be used for each dataset y drawn during the design optimisation process. However, the optimal design obtained will be dependent on Θ, which we denote as d * Θ . To eliminate the dependence on Θ it would be necessary to re-generate a new collection of Θ for each y. However, this leads to an increase in the computation time. We consider both cases here. We refer to the former as 'fixed Θ' and the latter as 'new Θ'. Here we perform 100 independent repeats to estimate sd(U J (d)).
Using J = 1000, RQMC and new Θ, the expected utilities for the random and closer-tooptimal designs are 2.94 and 4.29, respectively. Intuitively, the utility, U (d, y), may be more difficult to estimate at efficient designs as they provide more information about the parameters thus leading to a lower effective sample size in the IS approximation. This will naturally lead to more variability associated with U N (d, y).
Since the prior on each parameter is normal we can simulate from it using the quantile function of the normal distribution (similar to the previous example). The response is binary so we may simulate it as y i = I(u i y < p i ) for i = 1, . . . , n where p i is a function of the model parameter, u i y ∼ U(0, 1) and I(·) is the indicator function. The comparison of the standard deviation of the estimated expected utility is shown in Figure  4. The top row has fixed Θ while the bottom row has new Θ. The left column is based on the sub-optimal design whereas the right column uses the closer-to-optimal design. In all four scenarios and for the majority of J values the RQMC approach brings a considerable variance reduction. Figure 5 shows the median of the gold standard expected utility when performing 30 independent runs of the CE algorithm for both MC and RQMC for a variety of J values. Here we set e = (−1, −0.5, 0, 0.25, 0.5, 1) and S = 2. Again it is evident that RQMC provides a better chance of finding designs with higher expected utility than MC.

Susceptible-Infected Example
We investigate an example involving a stochastic infectious disease model considered by Cook et al. (2008); . Such models are important for gaining an understanding of how a disease is transmitted and for assessing the impact that various intervention strategies may have on the spread of the disease. Scientists may be interested in the optimal times to observe an epidemic in order to increase the chance of learning about the parameter of an infectious disease model or to learn the structure of the model itself, or both. In this section we consider the goal of parameter estimation while in the next section the aim is model discrimination.
The model of interest in this section is the susceptible-infected (SI) model, which is a continuous time Markov chain (or Markov process). We denote the number of susceptibles at time t as S(t). We have that P (S(t where ∆ t is an infinitesimal time such that at most one event can occur during that time. The parameter b 1 represents the natural death rate of susceptibles while b 2 is the transmission rate of the disease. We are interested in the times to observe the process (here four times) to gain the most information about the parameter, θ = (b 1 , b 2 ) . The parameters are independent a priori and log-normally distributed: b 1 ∼ LN (−3.6, 0.1024) and b 2 ∼ LN (−4.5, 0.16).
We assume a closed population of size 50 and all individuals are susceptible at time 0. Further, we assume that it is only possible to observe the state of the system at various discrete times, so that the process is only partially observed. The likelihood function for Markov processes involves the computation of the transition probability matrix for moving between states of the Markov chain (here the number of infected individuals) during some time interval t a −t b where b > a. For continuous time Markov chains, the transition probability matrix is calculated via the matrix exponential (see Moler and van Loan (2003)), which is given by exp(G(t j − t i )), where G is the so-called generator matrix which contains as its (i, j)th element the transition intensity of moving from state i to state j if i = j. Each diagonal element contains the negative sum of its corresponding row. The matrix exponential is only computationally feasible for very small populations. Further, the number of likelihood calculations required during the optimal design process may render this approach infeasible for many Markov processes of interest. This motivated  to develop a likelihood-free approach to experimental design using approximate Bayesian computation (ABC, see also Hainy et al. (2014)).  use ABC rejection to generate samples from an approximate posterior, which is used to estimate the utility U (d, y) = − log(|Σ y|d |) using the sample covariance matrix. ABC rejection involves generating and storing a set of N simulations from the prior predictive distribution, which we denote as (Θ, Y). Then, a small proportion of these simulations, α, are kept that are 'closest' to the dataset y. The ABC approach of  for estimating the utility U (d, y) is shown in Algorithm 2. We use the same discrepancy function as in : where n is the number of observations to collect. The value std(x i |d i ) is the empirical prior predictive standard deviation of the data at time point d i .
Algorithm 2 ABC rejection algorithm used in . Input: Design, d, a potential future dataset, y, prior distribution, p(θ), number of prior predictive simulations, N , discrepancy function, ρ(·|d) and proportion of parameter samples to keep, α, to estimate the ABC posterior distribution Output: An ABC estimate of U (d, y) = − log(|Σ y|d |) 1: Generate θ i ∼ p(θ) for i = 1, . . . , N 2: Simulate x i ∼ p(y|θ i , d) for i = 1, . . . , N (Note that steps 1 and 2 can be performed prior to the optimal design process and the simulated data can be stored at a discretised grid of the design space) 3: Compute discrepancies ρ i = ρ(y, x i |d) for i = 1, . . . , N , creating particles {θ i , ρ i } N i=1 4: Sort the particle set via the discrepancy ρ such that ρ 1 ≤ ρ 2 ≤ · · · ≤ ρ N 5: Calculate = ρ αN (where · denotes the floor function). The ABC posterior samples consist of the set The advantage of ABC rejection, as with IS in the previous example, is that the same simulations, (Θ, Y), can be re-used for each y generated during the optimal design process. We refer to this as 'fixed (Θ, Y)'. We also implement the more computationally demanding process of generating a new (Θ, Y) for each new y, which we refer to as 'new (Θ, Y)'. The ABC rejection approach requires a discretisation of the design space, which here is the sampling times between 0 and 10 days. As in , we use a design space with a time increment of 0.25 days. For the sub-optimal design we set d = (1, 2, 3, 4) . This design is chosen arbitrarily to be less efficient than the closer-to-optimal design. The closer-to-optimal design is d = (1. 5, 3.75, 5.25, 9.75) . Using J = 1000, RQMC and new (Θ, Y), we estimate the expected utilities of these designs to be 20.89 and 21.76, respectively. The prior value of the utility is roughly 19.72.
The priors are log-normal so that prior simulation can proceed in the same way as the PK example. To simulate from the model given a value of θ we can use the algorithm of Gillespie (1977), where the time between a loss of a susceptible has an exponential distribution with a rate parameter that depends on θ and the current number of susceptibles in the simulation, which we write as η(θ, S(t)) = (b 1 + b 2 (50 − S(t)))S(t) where t is the current time of the simulation. To produce the simulation, the value of S(t) is recorded at sampling times d.
Since there are 50 susceptibles at time 0 the maximum number of random numbers required to simulate the data is 50. We can simulate the time between each loss of susceptible using the quantile function of the exponential distribution, t j −t j−1 = − log(1−u j θ )/η(θ, S(t j−1 )) where u j θ ∼ U(0, 1) for j = 1, . . . , 50. Most simulations will not require the use of all 50 uniform random numbers, but we store all of the potential randomised Sobol's numbers nonetheless.
The results for the four different scenarios for different values of J are shown in Figure 6. Again, RQMC is clearly preferred to MC. Figure 7 shows the median of the gold standard expected utility when performing 50 independent runs of the CE algorithm for both MC and RQMC for a variety of J values. Here we set e = (0.25, 0.5, . . . , 10) and S = 5. In this example it appears that RQMC is only beneficial relative to MC for small J. In this case, the design vector is only of length four and thus it is not difficult to find a design with relatively high expected utility.

Model Discrimination Example
In many scenarios there may be uncertainty about the true underlying model. We consider the M-closed perspective of Bernardo and Smith (2000), which assumes that the true data generating process is unknown but is one of K candidate models described by the random variable M ∈ {1, 2, . . . , K}. For model M = m, we denote its prior probability as p(m), the prior for its model parameter as p(θ m |m) and the likelihood function under that model as p(y|m, θ m , d). The expected utility to accommodate this model uncertainty can be written as which can be estimated as where y j ∼ p(y|m, d) for j = 1, . . . , J and m = 1, . . . , K. If determining the design to best discriminate between models is of interest, then one possibility is to set U (d, y, m) = log p(m|y, d) where p(m|y, d) is the posterior probability of model m (see, for example, Box and Hill (1967); Drovandi et al. (2014)). This choice amounts to maximising the mutual information between the model indicator M and the potential future dataset y.
Here we consider the example specified in Dehideniya et al. (2016), which involves discriminating between the SI model (above) and also the 'death' model, which has P (S(t We use a parameter prior of b 1 ∼ LN (−0.48, 0.15 2 ) for the death model and b 1 ∼ LN (−1.1, 0.2 2 ), b 2 ∼ LN (−4.5, 0.6 2 ) for the SI model. The two models are assumed equally likely a priori. To estimate the utility for some generated dataset y, the ABC rejection algorithm is used. Out of all the prior predictive simulations that are within some distance of the dataset y, the proportion of those belonging to model m is used to estimate the posterior model probability of model m (see Dehideniya et al. (2016) and the references therein for more details).
To utilise RQMC here a very similar procedure is required as in the previous example. The only difference is that two sets of J prior predictive simulations are required, one for each of the two models. Here we perform 100 independent repeats to estimate sd(U J (d)). The sub-optimal design is d = (5, 10) days (chosen arbitrarily to be less efficient than the closerto-optimal design) and the closer-to-optimal design is d = (0.75, 4.75) days. The expected utilities of these designs based on RQMC and J = 1000 are -0.64 and -0.44, respectively. The results for both of these designs and also whether or not a new set of prior predictive simulations is used for the ABC rejection for each value of j is shown in Figure 8. Despite the MC error associated with the U (d, y, m) calculation, a significant gain is achieved with RQMC without any additional computational cost. However, owing to the low dimensional design space, RQMC does not speed up the rate of convergence to designs with high expected utility.

Conclusion
This paper has demonstrated that RQMC can lead to substantial variance reduction in expected utility estimates for fully Bayesian experimental design. Our focus was on applying RQMC to increase the precision of U J (d) for fixed J. Our results suggest that precision can be gained without compromising at all on computational cost. In high dimensional design optimisation problems, we demonstrated that RQMC can more rapidly find designs with high expected utility relative to MC.
Monte Carlo variability of the U (d, y) estimation will be present regardless of whether MC or RQMC is used to estimate the expected utility, U (d), and this could dampen the impact of RQMC. We did not consider the possibility of implementing RQMC to reduce also the variability of U N (d, y).