Bayesian prediction for physical models with application to the optimization of the synthesis of pharmaceutical products using chemical kinetics

Quality control in industrial processes is increasingly making use of prior scientific knowledge, often encoded in physical models that require numerical approximation. Statistical prediction, and subsequent optimization, is key to ensuring the process output meets a specification target. However, the numerical expense of approximating the models poses computational challenges to the identification of combinations of the process factors where there is confidence in the quality of the response. Recent work in Bayesian computation and statistical approximation (emulation) of expensive computational models is exploited to develop a novel strategy for optimizing the posterior probability of a process meeting specification. The ensuing methodology is motivated by, and demonstrated on, a chemical synthesis process to manufacture a pharmaceutical product, within which an initial set of substances evolve according to chemical reactions, under certain process conditions, into a series of new substances. One of these substances is a target pharmaceutical product and two are unwanted by-products. The aim is to determine the combinations of process conditions and amounts of initial substances that maximize the probability of obtaining sufficient target pharmaceutical product whilst ensuring unwanted by-products do not exceed a given level. The relationship between the factors and amounts of substances of interest is theoretically described by the solution to a system of ordinary differential equations incorporating temperature dependence. Using data from a small experiment, it is shown how the methodology can approximate the multivariate posterior predictive distribution of the pharmaceutical target and by-products, and therefore identify suitable operating values. Materials to replicate the analysis can be found at www.github.com/amo105/chemicalkinetics.


Introduction
Pharmaceutical products are often manufactured by chemical synthesis, with an initial set of chemical substances evolving over time via a series of chemical reactions into a substance or substances of interest. One (or more) of these substances will be the target pharmaceutical product while the others will be unwanted or harmful by-product substances. In addition to a time dependence, the chemical synthesis process may be hypothesised to depend on various controllable factors. The aim of the chemical engineers will be to manipulate the controllable factors such that specification limits on the final substances are satisfied. For example, the quantity of pharmaceutical product may be required to exceed some specified level whilst the amounts of unwanted by-products are correspondingly kept below a certain level.
It is common for a physical model, for example derived from scientific theory as the solution to a set of ordinary differential equations (ODEs), to be postulated to approximately describe the dependence of the synthesis process on the controllable factors (and time); see chapters in am Ende (2011). In addition, often an experiment can be performed within which the amounts of the substances of interest are measured for several different specifications of the controllable factors. These observed responses can then be used to inform the relationship between controllable factors and the synthesis process, e.g. through the estimation of unknown tuning parameters.
In this paper, we develop and apply methodology for the statistical modelling of such experiments, motivated by a pharmaceutical exemplar which measured three substances of interest: the pharmaceutical product and two unwanted by-products. The dynamics behind the synthesis process are approximately described by a system of non-linear ODEs, the solution to which forms a physical model for the amounts of the three substances at a given time. This model depends on a set of controllable factors (including time) and unknown parameters. A small experiment has been conducted where the amounts of the substances of interest have been observed for different controllable factors at certain time points. The aim is to use these observed data to estimate the unknown parameters and then to use these estimates to maximize the probability of the predicted amounts satisfying known specification limits, as a function of the controllable factors and time. The approach used here is Bayesian, giving a coherent method of propagating uncertainty about the unknown parameters (given the observed data) through to the predicted amounts. This approach is closely aligned with the concept of "pharmaceutical design space", a set of combinations of values of the controllable factors that have been demonstrated to provide assurance of quality. See, for example, Peterson (2008), Peterson and Yahyah (2009) and Lebrun et al. (2013) for related Bayesian approaches to the definition of a design space using linear statistical modelling methods. Beyond pharmaceutical design space, in the wider field of process optimization, see, for example, Chiao and Hamada (2001), Peterson (2004) and Del Castillo (2007).
Despite the apparent simplicity and appealing nature of the Bayesian modelling approach, several computational challenges remain. Firstly, the solution to the system of ODEs is analytically intractable and so a computationally expensive numerical solution is required (e.g. Iserles, 2009). The computational expense of this approximation affects the computing time required both for the estimation of the unknown parameters, through the evaluation of their posterior distribution, and for predictions at new combinations of the controllable factors. To reduce computational expense, we make use of (multivariate) Gaussian process emulators to accelerate both the Markov chain Monte Carlo (MCMC) algorithm employed and the prediction of future responses. Secondly, some of the observations fall below a minimum threshold, χ, under which the actual amount cannot be measured reliably, resulting in left-censored observations. Instead of assuming that the true observations are equal to zero or χ with certainty, we assume that the true observations are unknown, subject to lying in the interval (0, χ), within a natural Bayesian formulation. Thirdly, we know that the physical model derived from the solution to the ODEs only provides an approximation to the true process. We account for any potential mismatch by introducing an explicit model discrepancy term (cf. Kennedy and O'Hagan, 2001). Fourthly, the function giving the probability of satisfying the specification limits for given controllable factors is noisy and computationally expensive to evaluate, making optimization non-trivial.
The contribution of this paper is the novel application of a combination of modern computational and statistical methodology to address these challenges, and the demonstration of this approach on an important practical problem.
The paper is organised as follows. In Section 2 we describe the motivating experiment, and the physical and statistical models for the observations in more detail. In Section 3 we describe the methodology including construction of Gaussian process emulators. In Section 4 we present the results of applying the methodology to the motivating experiment, before concluding with a discussion in Section 5.

Chemical Reactions
The motivating chemical synthesis process involves nine substances, labelled A to I. The substances of interest are E, F and H, where F constitutes the pharmaceutical product, with E and H being unwanted by-product substances. Let A(t) denote the amount (in mols) of substance A at time t (in seconds), with similar notation for the remaining substances. At time t = 0, initial amounts of A, D and E, denoted by A_0 = A(0), D_0 = D(0) and E_0 = E(0), respectively, are introduced into the laboratory apparatus at a specified temperature (λ, in Kelvin) and volume (V, in litres). Let x = (x_1, …, x_5) = (A_0, D_0, E_0, λ, V) be an F × 1 vector (F = 5) and assume that x ∈ X, with X defined by the ranges x_f ∈ [x_f,min, x_f,max], for f = 1, …, F. The upper and lower limits, x_f,min and x_f,max, for each factor are given in Table 1.

The chemical synthesis process is governed by the series of chemical reactions (1), where k_1, k_2 and k_3 denote the chemical reaction rates. The second line of (1) denotes a reversible reaction and thus k_2 is split into k_2^− and k_2^+, for the forward and backward reactions, respectively. Let [A](t) = A(t)/V be the concentration (in mol/litre) of A at time t, let [Ȧ](t) = d[A](t)/dt be the corresponding time derivative, and assume similar definitions for the remaining substances B to I. Chemical reactions (1) lead to the system of non-linear ODEs (2) (suppressing the dependence on t). The initial concentrations of all nine substances are zero, except for those of A, D and E, which are A_0/V, D_0/V and E_0/V, respectively. For given chemical reaction rates, the solution ([A](t), …, [I](t)) to (2) provides the concentrations of the nine substances for a given combination of the controllable factors x. The amounts of the nine substances are then given by (A(t), …, I(t)) = V × ([A](t), …, [I](t)). However, the solution is analytically intractable and hence we employ a computationally expensive numerical method to find an approximate solution.
In particular, we use the variable coefficient ordinary differential equation (VODE) solver (Brown et al., 1989) as implemented in the R (R Core Team, 2018) package deSolve (Soetaert et al., 2010). However, the methodology described in Section 3 can be used in conjunction with any numerical method.
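To illustrate this kind of stiff numerical integration in a self-contained way, the sketch below integrates a hypothetical single reversible reaction D ⇌ E (a stand-in for the full nine-substance system (2), whose rate equations are not reproduced here) using SciPy's `solve_ivp` with the stiff BDF method, which plays the role of the VODE solver; the rate constants are arbitrary illustrative values.

```python
import numpy as np
from scipy.integrate import solve_ivp

def reversible_rhs(t, c, k_fwd, k_bwd):
    # c = ([D], [E]) for a single hypothetical reversible reaction D <-> E
    d, e = c
    flux = k_fwd * d - k_bwd * e   # net forward reaction flux
    return [-flux, flux]

# integrate from t = 0 to 100 s with a stiff (BDF) method, as VODE would
sol = solve_ivp(reversible_rhs, (0.0, 100.0), [1.0, 0.0],
                args=(0.05, 0.01), method="BDF", rtol=1e-8, atol=1e-10)
d_end, e_end = sol.y[:, -1]
# near equilibrium, the ratio [E]/[D] approaches k_fwd / k_bwd = 5
```

Mass is conserved by construction (the two rates sum to zero), which provides a simple sanity check on the numerical solution.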
The first reaction is assumed to be instantaneous, which the chemical engineers model by setting k_1 = 10000. The remaining reaction rates, k_2^−, k_2^+ and k_3, are unknown, and their temperature dependence is incorporated into the system of ODEs via the Arrhenius equation (see, for example, Laidler, 1984). Hence, the general form for a reaction rate is

k_i = k_i^(r) exp{ −(E_i/G)(1/λ − 1/λ^(r)) },

where λ^(r) is the reference temperature (here 298.15 K), k_i^(r) > 0 is the reaction rate at the reference temperature, E_i > 0 is the activation energy and G = 8.31445 J mol^−1 K^−1 is the gas constant. By definition, the activation energies for the two directions of the reversible reaction are equal, i.e. E_2^+ = E_2^− = E_2. This means there are p = 5 unknown physical parameters, γ = (γ_1, …, γ_5) = (k_2^(r)−, k_2^(r)+, k_3^(r), E_2, E_3). To increase the computational efficiency of the methodology described in Section 3, we operate on the logarithmic scale, i.e. consider θ_k = log γ_k, for k = 1, …, p, and let θ = (θ_1, …, θ_p) ∈ Θ = R^p.
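This temperature adjustment can be written as a one-line function. The sketch below implements the Arrhenius form stated above; the values of k_ref and E passed in are illustrative, not the estimated parameters.

```python
import math

G = 8.31445           # gas constant, J mol^-1 K^-1 (value used in the text)
LAMBDA_REF = 298.15   # reference temperature lambda^(r), in K

def arrhenius_rate(k_ref, E, lam):
    """Reaction rate at temperature lam, given the rate k_ref at the
    reference temperature and activation energy E (J/mol)."""
    return k_ref * math.exp(-(E / G) * (1.0 / lam - 1.0 / LAMBDA_REF))

# illustrative values only: at the reference temperature the rate is k_ref,
# and with E > 0 the rate increases with temperature
k_310 = arrhenius_rate(1e-6, 5e4, 310.0)
```

At λ = λ^(r) the exponent vanishes and the function returns k_ref exactly, which is a convenient unit check.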
The quality control question of interest is to choose initial amounts of A, D and E, temperature, volume and time such that the amounts of the three substances of interest satisfy the specification limits (3) (in mols). Recall that F is the pharmaceutical product, and its amount needs to be sufficiently large to make the production process economically viable, whereas E and H represent unwanted by-products with upper limits imposed for safety reasons.

Experiment and Physical Model
We estimate the values of the unknown physical parameters θ using observations from a small experiment. The chemical reactions (1) were observed for I′ = 6 combinations of the controllable factors x, given in Table 2. For run i = 1, …, I′, we observe the amounts of E, F and H at a series of n_i times t_i1, …, t_in_i. The times range from 0.5 to 2902 seconds, with n_i ranging from 17 to 20 time points per run. In total, there are n = Σ_{i=1}^{I′} n_i = 109 observations of the amounts over all I′ runs. Note that run 6 is a repetition of run 5, so there are I = I′ − 1 = 5 unique treatments. Runs 5 and 6 each have observations taken at 18 time points; however, the last two time points for run 5 are at 1620 and 1825 seconds, whereas the last two time points for run 6 are taken four seconds later. Therefore there are m = 93 unique combinations of controllable factors and time points. Let (x′_j, t_j) denote the controllable factors and time point for the jth unique combination (j = 1, …, m). The experiment was not designed to be statistically optimal for either the estimation of the parameters θ or prediction for new treatments.
Let K = 3 be the number of substances of interest. We denote by µ (θ; x, t) = log (E(t), F (t), H(t)) the K ×1 vector giving the log of the theoretical amount of the three substances of interest (E, F and H) obtained from the numerical solution to the system of ODEs at x, time t and parameters θ.
Let Y_i be the n_i × K matrix containing the log observed amounts of E, F and H for the ith run, for i = 1, …, I′. The jth row of Y_i, denoted by y_ij, is a K × 1 vector containing the observed amounts at the jth time point of the ith run, for i = 1, …, I′ and j = 1, …, n_i. Let Y be the n × K matrix formed by stacking the matrices Y_1, …, Y_I′. The experimental apparatus cannot detect substance amounts lower than χ = 0.01 mols, so we have censored values whenever the observed amount is less than χ. In total, there are no censored observations in the first column of Y, and 5 and 45 in the second and third columns, respectively. We take account of the censoring in the statistical modelling strategy described in Section 2.3.

Statistical Model
We assume that

Y = G {M(θ) + D} + E, (4)

where M(θ) is the m × K matrix of unique model solutions, formed by stacking M_1(θ), …, M_I(θ). Here M_i(θ), for i = 1, …, I − 1, is the n_i × K matrix with jth row given by µ(θ; x_i, t_ij), and M_I(θ) is the (n_I + 2) × K matrix whose jth row is given by µ(θ; x_I, t_Ij), for j = 1, …, n_I, with the (n_I + 1)th and (n_I + 2)th rows given by µ(θ; x_I, 1624) and µ(θ; x_I, 1829), respectively (corresponding to the two time points that differ between runs 5 and 6).
Furthermore, D is the m × K matrix of model discrepancy errors, which represents the discrepancy between the physical model and the true mean value of the process (Kennedy and O'Hagan, 2001). The n × m binary incidence matrix G identifies the rows of M(θ) + D corresponding to the rows of Y. Finally, E is the n × K matrix of observational errors, with assumed distribution

E ∼ MN(0, T, Ω), (5)

where MN denotes the matrix normal distribution (e.g. Dawid, 1981), Ω is an unknown K × K column covariance matrix, and T is an n × n row correlation matrix. The matrix T = diag{T_1, …, T_I′} is block-diagonal, with the jlth element of T_i given by exp(−ρ|t_ij − t_il|), where the time correlation parameter ρ > 0 is assumed unknown. This covariance structure for E assumes that the correlation between observations from the same run depends on the difference in time, but that observations from different runs are independent.
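The block-diagonal row correlation structure can be sketched directly. The exponential decay form exp(−ρ|t_ij − t_il|) used below is an assumption standing in for the elided element formula, and the run time points are illustrative.

```python
import numpy as np
from scipy.linalg import block_diag

def run_correlation(times, rho):
    # within-run correlation, assumed to decay as exp(-rho * |t_j - t_l|)
    t = np.asarray(times, dtype=float)
    return np.exp(-rho * np.abs(t[:, None] - t[None, :]))

def build_T(run_times, rho):
    # runs are independent, so T is block-diagonal over the runs
    return block_diag(*[run_correlation(t, rho) for t in run_times])

# two illustrative runs with 2 and 3 observation times
T = build_T([[0.0, 1.0], [0.0, 2.0, 4.0]], rho=0.5)
```

Entries linking observations from different runs are exactly zero, reflecting the assumed between-run independence.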
We complete the model specification by specifying prior distributions for the discrepancy D and the model parameters. The discrepancy is assumed to follow a matrix normal distribution,

D ∼ MN(0, S, Σ), (6)

where Σ is an unknown K × K covariance matrix and S is an m × m correlation matrix. The jlth element of S is given by

exp{ −ψ_1 ‖x′_j − x′_l‖² − ψ_2 (t_j − t_l)² },

for j, l = 1, …, m, where the model discrepancy correlation parameters ψ_1 > 0 and ψ_2 > 0 are unknown. This correlation structure allows the differences in model discrepancy between different treatments to depend on the "distance" between the controllable factors and time points for each treatment.
The chemical engineers have some limited prior knowledge of the location of θ. It is believed that the reaction rates are likely to lie between 10^−8 and 10^−4 and the activation energies between 10^2 and 10^6. We encode this information by assuming that the reaction rates and activation energies have independent lognormal prior distributions with hyperparameters chosen so that there is probability 0.95 that the value of interest lies between the two limits above. This is achieved by setting the 0.025 and 0.975 quantiles to be the lower and upper limits, respectively. In summary, we assume θ ∼ N(µ, ∆), where µ = (−13.8, −13.8, −13.8, 9.21, 9.21)^T and ∆ = 5.52 I_p. There is no prior information available for the remaining parameters, so we specify vague prior distributions that contribute negligible information. The variance matrices Ω and Σ are both assumed to have inverse-Wishart prior distributions with ν = 4 degrees of freedom and an identity scale matrix (Gelman et al., 2014, Chapter 3). The correlation parameters ρ, ψ_1 and ψ_2 are given independent exponential prior distributions with mean equal to one, following the arguments of, for example, Overstall and Woods (2016). In Section 4 we discuss the results of a sensitivity analysis assessing robustness to the choice of these vague prior distributions.

Let y = vec(Y), m(θ) = vec(M(θ)), d = vec(D) and e = vec(E), with the vec operator stacking the columns of a matrix. Then a model specification equivalent to (4) is given by

y = (I_K ⊗ G){m(θ) + d} + e, (7)

where

e ∼ N(0, Ω ⊗ T), (8)

and ⊗ denotes the Kronecker product. Similarly, the vectorised model discrepancy satisfies

d ∼ N(0, Σ ⊗ S). (9)

Let y_S and y_C denote the elements of y which are fully observed (i.e. greater than log χ) and censored (i.e. less than log χ), respectively. We therefore need to evaluate the posterior distribution (conditional on y_S) of the model parameters and censored observations.
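The quantile-matching step for the lognormal priors can be checked numerically. The sketch below recovers the stated hyperparameters (means −13.8 and 9.21, variance 5.52) from the elicited 95% ranges.

```python
import math
from scipy.stats import norm

def lognormal_hyper(lower, upper, prob=0.95):
    """Normal hyperparameters on the log scale such that the lognormal
    variate lies in (lower, upper) with central probability prob."""
    z = norm.ppf(0.5 + prob / 2.0)                    # = 1.96 for prob = 0.95
    mu = 0.5 * (math.log(lower) + math.log(upper))    # midpoint of log range
    sigma = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return mu, sigma

mu_k, sd_k = lognormal_hyper(1e-8, 1e-4)   # reaction rates
mu_E, sd_E = lognormal_hyper(1e2, 1e6)     # activation energies
# recovers mu_k ~ -13.8, mu_E ~ 9.21 and variance ~ 5.52, as stated
```

Both ranges span four orders of magnitude, which is why the two prior variances coincide.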
This distribution is given by

π(θ, d, Ω, ρ, Σ, ψ, y_C | y_S) ∝ π(y | θ, d, Ω, ρ) π(d | Σ, ψ) π(θ) π(Ω) π(ρ) π(Σ) π(ψ), (10)

where π(y | θ, d, Ω, ρ) is the complete-data likelihood given by (7) and (8), π(d | Σ, ψ) is the density of the vectorised model discrepancy (9), and π(θ), π(Ω), π(ρ), π(Σ) and π(ψ) are prior densities for the model parameters.

Prediction
Let y_0 be a K × 1 vector denoting the predicted log amounts of E, F and H for arbitrary controllable factors x_0 ∈ X and time t_0 ∈ T. The posterior predictive distribution of y_0 is given by (11); its evaluation involves an m × 1 vector s, whose jth element gives the model discrepancy correlation between (x_0, t_0) and the jth unique design point (x′_j, t_j). The probability of E, F and H satisfying the specification limits (3) at the point (x_0, t_0) is then P(y_0 ∈ Y | y_S).

Methodology
The posterior and predictive distributions given by (10) and (11), respectively, will be analytically intractable.
Our general approach to numerically approximating these distributions uses two phases. The Sampling Phase involves generating a sample from the posterior distribution using MCMC methods. The Prediction Phase uses this MCMC sample to generate a sample from the posterior predictive distribution, and thus estimates the probability P(y_0 ∈ Y | y_S) by the proportion of sampled values satisfying the specification limits. These phases make use of Gibbs sampling and parallel tempering algorithms, and of multivariate Gaussian process emulators for the numerical solution to the ODEs.
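The Prediction Phase estimate is a simple Monte Carlo proportion, which can be sketched as follows; the specification limits passed in are placeholders rather than the paper's actual values, and the draws are assumed to be posterior predictive samples of (log E, log F, log H).

```python
import numpy as np

def estimate_spec_prob(log_samples, f_min, e_max, h_max):
    """Proportion of posterior predictive draws meeting the limits.
    log_samples: B x 3 array with columns (log E, log F, log H);
    f_min, e_max, h_max: placeholder specification limits in mols."""
    e, f, h = np.exp(log_samples).T
    return np.mean((f > f_min) & (e < e_max) & (h < h_max))
```

The Monte Carlo standard error of such a proportion is of order B^{-1/2}, which motivates the large posterior sample sizes used later.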

Gibbs sampling and parallel tempering
We use Gibbs sampling in conjunction with parallel tempering to generate a sample from the joint posterior distribution (10) of model parameters and censored observations. To improve the efficiency of the Gibbs sampler, we employ hierarchical centring (Papaspiliopoulos et al., 2003) and reparameterise model (7) as

y = (I_K ⊗ G) d* + e, (12)

where

d* = m(θ) + d ∼ N(m(θ), Σ ⊗ S). (13)

Sampled values of d* can easily be transformed to values of d using d = d* − m(θ). The posterior distribution of model parameters and censored observations is now given by

π(θ, d*, Ω, ρ, Σ, ψ, y_C | y_S) ∝ π(y | d*, Ω, ρ) π(d* | θ, Σ, ψ) π(θ) π(Ω) π(ρ) π(Σ) π(ψ),

where π(y | d*, Ω, ρ) is the complete-data likelihood defined by (12), and π(d* | θ, Σ, ψ) is defined by (13). The full conditional posterior distributions of d*, Ω and Σ are of known form and are given in Appendix A. The full conditional posterior distribution of y_C is a truncated multivariate normal distribution (see Appendix A), and there exist efficient methods for generating values from such distributions (Geweke, 1991).
For the remaining parameters θ, ρ and ψ, the full conditional distributions are not of known form, so a Metropolis-within-Gibbs step is employed. Specifically, we exploit the Riemann manifold Langevin Metropolis-Hastings (RMLMH) algorithm (Girolami and Calderhead, 2011) . This method uses derivative information on the posterior surface to construct an efficient proposal distribution. The algorithmic details are briefly described in Appendix B and the components necessary for its application to the full conditional posterior distributions of θ, ρ and ψ are given in Appendix C.
Parallel tempering (see, for example, Gelman et al., 2014, pp. 299-300) is used to improve the efficiency of sampling from a potentially multi-modal posterior distribution. It involves defining R distributions by a series of increasing "temperatures", where the lowest-temperature distribution corresponds to the posterior distribution. An MCMC chain is initialized under each distribution, and the parallel tempering scheme allows for swaps between the states of chains at different temperatures. The lowest-temperature chain then provides a sample from the posterior distribution. Parallel tempering allows larger moves (e.g. between different areas of high posterior density) made by the higher-temperature chains to be passed down to the lower-temperature chains, hence improving mixing.
Let δ = (θ, Ω, ρ, d*, Σ, ψ, y_C) be the collection of unknown parameters and censored observations. For r = 1, …, R, define the distribution for the rth chain to have density proportional to π(δ | y_S)^{1/τ_r}, where 1 = τ_1 < ⋯ < τ_R are a series of temperatures. A parallel tempering iteration involves applying either (i) a sampling step (with probability ω); or (ii) a swap step (with probability 1 − ω). In the sampling step, the current values δ_c^{(r)} in the rth chain are updated to δ_{c+1}^{(r)}, for r = 1, …, R, using the Gibbs sampling algorithm. In the swap step, two neighbouring chains (r and r + 1) are chosen at random, and their states are exchanged with probability

min[ 1, { π(δ^{(r+1)} | y_S)^{1/τ_r} π(δ^{(r)} | y_S)^{1/τ_{r+1}} } / { π(δ^{(r)} | y_S)^{1/τ_r} π(δ^{(r+1)} | y_S)^{1/τ_{r+1}} } ].

To complete sampling step (i) for each distribution at temperature τ_r, the full conditional and RMLMH components need to be adjusted (see Appendices A and B for details).
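A minimal sketch of the swap step is shown below, assuming chain r targets the posterior density raised to the power 1/τ_r; here `log_post` is a hypothetical stand-in for the log posterior density of δ.

```python
import numpy as np

def swap_step(log_post, states, temps, rng):
    """One parallel tempering swap between a random neighbouring pair.
    Chain r targets log_post / temps[r] on the log scale."""
    r = rng.integers(len(states) - 1)            # pick a neighbouring pair
    lp_r, lp_r1 = log_post(states[r]), log_post(states[r + 1])
    # log acceptance ratio for exchanging the two chain states
    log_accept = (lp_r1 - lp_r) / temps[r] + (lp_r - lp_r1) / temps[r + 1]
    if np.log(rng.uniform()) < log_accept:       # Metropolis accept/reject
        states[r], states[r + 1] = states[r + 1], states[r]
    return states
```

When the higher-temperature chain holds a higher-density state, the log acceptance ratio is positive and the swap is always accepted, which is how good moves propagate down the temperature ladder.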
The system of ODEs can be augmented with additional equations to numerically solve for the first- and second-order sensitivities (Valko and Vajda, 1984; Girolami and Calderhead, 2011). However, this will substantially increase the computational burden of the numerical solution. Note that to generate a sample of size B using parallel tempering with R chains will require BR evaluations of these three functions (the solution and its first- and second-order sensitivities) for each x_i and t_ij, for i = 1, …, I, j = 1, …, n_i and k = 1, …, p. Once we have generated a sample of size B from the posterior distribution, to estimate the probability P(y_0 ∈ Y | y_S) we require B further evaluations of µ(θ; x, t) for each x ∈ X and t ∈ T at which we wish to evaluate this probability.
Making such a large number of evaluations of the numerical solution is clearly computationally infeasible. Instead, we approximate the numerical solution by constructing a statistical emulator through a computer experiment (Sacks et al., 1989). The numerical solution to the ODEs is evaluated at a selection of N combinations of parameters θ, controllable variables x and times t. These evaluations are treated as data to which a statistical model, termed an emulator, is fitted. The emulator is essentially a predictive distribution (denoted by Q(θ, x, t)) from which fast predictions of the numerical solution can be obtained. For example, the predictive mean, μ̂(θ; x, t), will be used as a point prediction. For more details on computer experiments and statistical emulators see, for example, Santner et al. (2003), Fang et al. (2006) and Dean et al. (2015, Section V).
These modifications result in the sample generated in chain r = 0 being from the true posterior distribution of δ, while the overall sampling scheme requires only B evaluations of µ(θ; x_i, t_ij), for i = 1, …, I and j = 1, …, n_i. In the Prediction Phase, to estimate the probability of satisfying the specification limits, all evaluations of µ(θ; x, t) are replaced by a value generated from Q(θ, x, t). Sampling from this predictive distribution means that the estimate of the probability takes account of the additional uncertainty introduced by using an emulator approximation.

Multivariate Gaussian process emulator
In this paper, we employ the multivariate Gaussian process (MGP; Conti and O'Hagan, 2010) emulator. The numerical solution to the ODEs, µ(θ; x, t), is evaluated at each of the elements of a set ζ of N input combinations (the meta-design); the superscript * notation is used to distinguish the meta-design from the design of the physical experiment. Let Z be the N × K matrix of resulting outputs. The central assumption of the MGP is that Z follows a matrix normal distribution,

Z ∼ MN(1_N β^T, P, Φ), (14)

where β is a K × 1 vector of common column means of Z, Φ is a K × K unstructured column covariance matrix, P is an N × N row correlation matrix and 1_N is the N × 1 vector of ones. We model the ijth element of P as

exp{ −Σ_{u=1}^U α_u (w_iu − w_ju)² }, (15)

where w_iu is the uth element of the ith input combination w_i and α_u > 0 are correlation parameters, for u = 1, …, U. Suppose we wish to predict z_0 = µ(θ_0; x_0, t_0) at input w_0 = (θ_0, x_0, t_0). It can be shown that the predictive distribution of z_0, i.e. Q(θ, x, t), conditional on the outputs z_1, …, z_N, is given by

N( μ̂(θ; x, t), Π̂(θ; x, t) ), (16)

in which the N × 1 vector p_0 of correlations between the prediction point and the meta-design has ith element exp{ −Σ_{u=1}^U α_u (w_0u − w_iu)² }, with w_0u the uth element of w_0. The predictive distribution given by (16) is conditional on the correlation parameters α = (α_1, …, α_U). We replace these by their marginal posterior mode (e.g. Overstall and Woods, 2016), although a fully Bayesian approach could be taken with these parameters integrated out with respect to their posterior distribution, conditional on Z.
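The predictive mean of such an emulator can be sketched in a few lines. This simplified version uses a single constant mean per output column, the Gaussian correlation above, and a small jitter term for numerical stability; it is a sketch, not the full MGP implementation.

```python
import numpy as np

def gp_predict_mean(W, Z, w0, alpha, jitter=1e-10):
    """Predictive mean of a GP emulator with correlation
    exp(-sum_u alpha_u (w_iu - w_ju)^2) and constant column means."""
    def corr(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2 * alpha).sum(axis=-1)
        return np.exp(-d2)
    N = len(W)
    P_inv = np.linalg.inv(corr(W, W) + jitter * np.eye(N))
    ones = np.ones((N, 1))
    beta = (ones.T @ P_inv @ Z) / (ones.T @ P_inv @ ones)  # GLS mean estimate
    p0 = corr(w0[None, :], W)                              # 1 x N correlations
    return beta + p0 @ P_inv @ (Z - ones @ beta)

# tiny 1-D illustration: the emulator interpolates the training outputs
W = np.array([[0.0], [1.0], [2.0]])
Z = np.array([[0.0], [1.0], [4.0]])
pred = gp_predict_mean(W, Z, np.array([1.0]), np.array([1.0]))
```

Up to the jitter, the predictive mean interpolates the training outputs exactly, which is the property that makes the emulator a faithful stand-in for the numerical solver at the meta-design points.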

Meta-design for the Sampling and Prediction Phases
The two emulators formed prior to the Sampling and Prediction Phases depend on the choice of the meta-design ζ. The numerical solution, µ(θ; x, t), has three different types of input: θ, x and t. We construct the meta-design as a Cartesian product of designs for each of these three input types. Using separate designs for θ, x and t allows us to exploit computational efficiencies in the numerical solution of the ODEs: computing both µ(θ; x, t_1) and µ(θ; x, t_2), for t_1 ≠ t_2, requires a negligible increase in computational effort over computing µ(θ; x, t_1) alone.
This structure of meta-design also allows the emulator row correlation matrix to be decomposed as the Kronecker product

P = P_1 ⊗ P_2 ⊗ P_3,

where P_1, P_2 and P_3 are the correlation matrices for the θ, x and t designs, respectively. This significantly reduces the computational burden of constructing the emulator and improves numerical stability. Similarly, for a meta-design with this structure, (14), (15), (17) and (18) also simplify, with the cross-correlation vector factorising as p_0 = p_1 ⊗ p_2 ⊗ p_3 for vectors p_1, p_2 and p_3 defined on the individual designs. When we choose ζ to construct the emulator for the Sampling Phase, we choose the values of x and t to coincide with the design of the physical experiment, i.e. to be equal to x_1, …, x_I and t_1, …, t_m. Hence uncertainty in the prediction results only from the different choices of parameter θ. For our experiment, this means fixing ζ_2 = {x_1, …, x_5} and ζ_3 = {t_1, …, t_m}, i.e. N_2 = I = 5 and N_3 = m.
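The computational saving from the Kronecker structure can be illustrated directly: inverting the full row correlation matrix reduces to inverting the three small per-input matrices. The matrices below are random symmetric positive-definite stand-ins, not fitted correlation matrices.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_corr(n):
    # a random symmetric positive-definite matrix scaled to unit diagonal
    A = rng.standard_normal((n, n))
    M = A @ A.T + n * np.eye(n)
    d = np.sqrt(np.diag(M))
    return M / np.outer(d, d)

P1, P2, P3 = random_corr(3), random_corr(4), random_corr(2)   # theta, x, t
P = np.kron(np.kron(P1, P2), P3)                              # 24 x 24

# invert the large matrix via three small per-input inverses
P_inv = np.kron(np.kron(np.linalg.inv(P1), np.linalg.inv(P2)),
                np.linalg.inv(P3))
```

Since matrix inversion costs O(n³), inverting three small factors instead of one N₁N₂N₃ × N₁N₂N₃ matrix is the source of the efficiency gain.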
We use the following exploratory algorithm to adaptively choose ζ_1. It is a hybrid of the methodologies proposed by Rasmussen (2003), Fielding et al. (2011) and Overstall and Woods (2013). Rasmussen (2003) first proposed using MCMC methods to adaptively improve an emulator of an unnormalised posterior density that depends on a computationally expensive function, in light of observed data. Fielding et al. (2011) extended this methodology by using parallel tempering, and Overstall and Woods (2013) proposed emulating the computationally expensive function itself, rather than the posterior density, to make model criticism more efficient.
The following exploratory algorithm iteratively improves the accuracy of the emulator in the region of the parameter space Θ corresponding to high posterior density. Starting from a meta-design ζ = ζ^0 built on an initial set ζ_1^0 of N_1^0 parameter values, the following steps are repeated until µ(θ; x, t) has been evaluated at a total of N_1 parameter values:

(a) Perform L iterations of the Gibbs sampling and parallel tempering scheme (for chains r = 1, …, R), where every evaluation of µ(θ; x, t) is replaced by an evaluation of the emulator predictive mean μ̂(θ; x, t).

(b) Augment ζ_1 with the current value of θ from each chain, evaluate µ(θ; x, t) at these new points, and update the emulator.
The meta-design for the Prediction Phase is constructed as the Cartesian product of three space-filling designs. Firstly, ζ_1 is constructed as a space-filling (e.g. Johnson et al., 1990) sub-sample of size N_1, selected using the cover.design function in the R package fields (Nychka et al., 2015), with the sample generated from the posterior distribution of θ in the Sampling Phase as a candidate list. Secondly, ζ_2 and ζ_3 are chosen as Latin hypercube designs in X and T of sizes N_2 and N_3, respectively (Santner et al., 2003, Ch. 5).
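The Latin hypercube components ζ_2 and ζ_3 can be generated with standard tools. The sketch below uses scipy.stats.qmc; the dimension groups the five controllable factors with time, and the bounds are placeholders rather than the real ranges from Table 1.

```python
import numpy as np
from scipy.stats import qmc

# 50-point Latin hypercube over the 5 controllable factors plus time;
# bounds below are placeholders, not the actual ranges from Table 1
sampler = qmc.LatinHypercube(d=6, seed=0)
unit = sampler.random(n=50)                       # points in [0, 1)^6
lower = [0.0, 0.0, 0.0, 290.0, 1.0, 0.5]
upper = [50.0, 50.0, 50.0, 330.0, 50.0, 3000.0]
design = qmc.scale(unit, lower, upper)            # rescale to factor ranges
```

Each one-dimensional projection of a Latin hypercube contains exactly one point per stratum, which gives good marginal coverage even with modest sample sizes.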

Sampling Phase
To create the meta-design for the Sampling Phase, we set N_1^0 = 50 and N_1 = 100, and set up R = 5 parallel chains. This results in ten iterations of the exploratory algorithm of Section 3.3.1 with L = 50. We adapt the temperatures of the parallel chains using the methodology of Miasojedow et al. (2013). Figure 1 shows the resulting values of θ* in meta-design ζ_1. It is clear that the exploratory algorithm selects points from a concentrated region of Θ.
We then obtain a sample of size B = 50000 from the posterior distribution of δ using the amended Gibbs sampling and parallel tempering algorithm presented in Section 3.2. Convergence was assessed informally via trace plots (not shown) which showed that the chains had mixed adequately and each posterior was unimodal. Figure 2 shows plots of the prior and estimated posterior densities for each element of θ as well as the elements of ψ and ρ. Clearly the data has led to an increase in information from the prior to posterior distributions.
Diagnostics for assessing the adequacy of the fitted model (see, for example, Gelman et al., 2014, Chapter 6) indicated that there were no reasons to believe that the fit was inadequate.

Prediction Phase
We now use the sample generated from the posterior distribution of δ, and, in particular, θ, to construct an emulator Q(θ, x, t), as defined in (16), for the numerical solution to the ODEs. We let N_1 = 20, N_2 = 50 and N_3 = 50. This value of N_1 is small compared to the rule of thumb of Loeppky et al. (2009) that the sample size be approximately ten times the number of input dimensions (here p = 5). However, we found that, due to the concentration of the posterior, this value was sufficiently large to produce an adequate emulator and to avoid numerical instability in the inversion of P_1. To assess the adequacy of the emulator, we created a test meta-design of the same size (as above, a space-filling sub-sample, with a candidate list excluding the points of the original meta-design) and implemented the diagnostic methods of Overstall and Woods (2016). In particular, the root mean squared errors between the numerical solution and the predictive mean were 3.4 × 10^−2, 0.19 and 0.40 for the K = 3 dimensions of µ(θ; x, t), respectively. The overall coverage of the 95% predictive intervals, as assessed on the test design, was 89.0%.

Table 3: Optimum values of the controllable factors, x and t, as found by ACE.

We can now use the emulator Q(θ, x, t) to approximate the probability, P(y_0 ∈ Y | y_S), of satisfying the specification limits for given x and t, and we need to maximize this probability over the space S = X × T. To do this we use the approximate coordinate exchange (ACE; Overstall and Woods, 2017) algorithm. ACE was originally developed for finding Bayesian optimal experimental designs, where an objective function, usually analytically intractable, is maximised over a design space. ACE uses a cyclic ascent algorithm (e.g. Lange, 2013, Chapter 7) to maximise the objective function via a univariate Gaussian process emulator based on evaluations of a Monte Carlo approximation to the objective function. We use the implementation of ACE given by the R package acebayes, with 100 random starts where the initial value of (x, t) is uniformly generated in S. The optimum values of x and t, as found by ACE, are shown in Table 3. The approximate maximum value of P(y_0 ∈ Y | y_S) attained was 0.77 (subject to Monte Carlo error). We investigated the sensitivity of this approach to the choice of prior distributions for ρ, ψ, Ω and Σ by repeating the analysis to find the optimal settings of x and t under different prior specifications. We found the results to be broadly robust to the choice of prior distribution; details are provided in Appendix D.
To further explore how P(y_0 ∈ Y_S) depends on x and t, we fix the initial amount of E, the volume and the temperature at 26.47 mols, 31.28 litres and 313.5 K, respectively, i.e. the optimum values found by ACE. We then vary the initial amounts of A and D, and the time, over the ranges identified in Table 1. The last row of Figure 3 shows P(y_0 ∈ Y_S) for time against the initial amount of A, with different columns for different initial amounts of D. The first three rows show corresponding plots for the marginal probability of satisfying each of the specification limits in (3) on E, F and H, respectively. The trade-off between the objectives given by the specification limits can be clearly seen. To satisfy the limit E(t) < 3 mols, t has to be large. This is intuitively obvious: the initial amount of E is 26.47 mols, so the process needs to run for some time before the amount of E has decreased enough to satisfy the limit. The opposite is true for H (initial amount of 0 mols), where t must be small to satisfy H(t) < 3 mols. This means that there is a very narrow window of values of t that satisfies all three constraints, as shown in the last row of Figure 3. Note that the optimum values of x and t (as shown in Table 3) lie within this narrow window.

Figure 3: The probability of satisfying each of the three marginal constraints, and the joint constraints, for time (in s; x-axis) against initial amount of A (in mols; y-axis), for five different initial amounts of D. The colours indicate the magnitude of the probability, with black being 0 and white being 1.
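The probabilities plotted in Figure 3 are Monte Carlo estimates of this form. A minimal sketch, assuming posterior predictive draws of (E, F, H) at a given (x, t) are available as rows of an array; the lower limit on the target product F is a placeholder value, since it is not restated in this section:

```python
import numpy as np

def prob_in_spec(samples, e_max=3.0, f_min=10.0, h_max=3.0):
    """Monte Carlo estimate of P(y0 in Y_S): the proportion of posterior
    predictive draws (columns: E, F, H) that satisfy all specification
    limits jointly. e_max and h_max follow the E(t) < 3 and H(t) < 3 mols
    limits in the text; f_min = 10 mols is an illustrative placeholder
    for the lower limit on the target product F."""
    e, f, h = samples[:, 0], samples[:, 1], samples[:, 2]
    return np.mean((e < e_max) & (f > f_min) & (h < h_max))
```

Dropping one of the three conditions gives the corresponding marginal probability plotted in the first three rows of Figure 3.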

Discussion
This paper has developed and applied methodology for choosing combinations of values of the controllable factors that produce a high probability of meeting a process specification when the process is, at least approximately, determined by a physical model. Our application was a chemical synthesis process used to produce a pharmaceutical product, where the specification required a) the amount of the substance constituting the pharmaceutical product to be above some level; and b) the amounts of two unwanted by-product substances to be below some levels. The relationship between the controllable factors and the amounts of substances of interest is hypothesised to be governed by the analytically intractable solution to a system of ODEs. Responses from a physical experiment are used to refine knowledge of this relationship, and the probability of satisfying the constraints is maximized. Statistical emulators for the numerical solution are constructed to accelerate the model fitting and the estimation of the probability of satisfying the constraints. The methodology produces computationally feasible results that take account of the different uncertainties involved in the experimental and modelling processes (measurement error, model inadequacy, emulator error). The methods have the potential to be applied to a variety of quality control and pharmaceutical design space problems that involve numerically expensive approximation of physical models.

The modelling and computational approach taken in this paper can be modified in a variety of ways to suit the application of interest. For example, we have explicitly taken account of the model discrepancy. If the experimenters believed the model given by the solution to the ODEs to be a very close approximation to the true physical process, then the model discrepancy could be discarded by setting D = 0 and omitting the steps for sampling from the full conditional distributions of Σ and ψ.
Another possible modification is the introduction of noise into the settings of the controllable factors. In this paper we assumed that the chemical engineers have complete control over the controllable factors when we maximize the probability of satisfying the specification limits given by (3). In some cases the chemical engineer would not be able to control these factors exactly. Following the approach in this paper, it would be straightforward to introduce variability as an intermediate step between specifying the controllable variables and evaluating the Gaussian process emulator. Lastly, we employed the RMLMH algorithm to sample from the full conditional distributions of θ, ρ and ψ. It would also be possible, within the methodology of this paper, to employ Riemann manifold Hamiltonian Monte Carlo (Girolami and Calderhead, 2011).
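The intermediate noise step just described can be sketched as follows; the Gaussian noise model and the names are illustrative assumptions, with `prob_fn` standing in for the emulator-based estimate of P(y_0 ∈ Y_S) at exact factor settings:

```python
import numpy as np

def prob_with_input_noise(prob_fn, x, sd, n_draws=1000, rng=None):
    """Probability of meeting specification when the controllable factors
    cannot be set exactly: perturb the nominal settings x with Gaussian
    noise (an illustrative choice) and average prob_fn over the draws."""
    rng = np.random.default_rng(rng)
    draws = rng.normal(loc=x, scale=sd, size=(n_draws, len(x)))
    return float(np.mean([prob_fn(xi) for xi in draws]))
```

Maximizing this averaged probability over the nominal settings then accounts for imperfect control of the factors.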

A.2 Full conditional distributions for covariance matrices
For the rth chain with temperature τ_r, the full conditional distributions of Ω and Σ are both inverse-Wishart (IW), obtained from the standard conjugate update in which the prior degrees of freedom and scale matrix are incremented by the τ_r-tempered likelihood contributions.
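As a hedged illustration of this conjugate update (the exact scale-matrix and degrees-of-freedom increments in the paper depend on the full model; the sketch below takes τ_r = 1 and the simplest Gaussian residual structure), a numpy-only sampler:

```python
import numpy as np

def sample_inverse_wishart(df, scale, rng):
    """Draw from IW(df, scale) by inverting a Wishart(df, scale^{-1})
    draw; this simple sampler requires an integer df >= dimension."""
    chol = np.linalg.cholesky(np.linalg.inv(scale))
    x = chol @ rng.normal(size=(scale.shape[0], df))  # columns ~ N(0, scale^{-1})
    return np.linalg.inv(x @ x.T)

def sample_cov_full_conditional(resid, nu0, S0, rng):
    """Conjugate update for a covariance matrix given n Gaussian residual
    vectors (rows of resid): Sigma | rest ~ IW(nu0 + n, S0 + resid' resid)."""
    n = resid.shape[0]
    return sample_inverse_wishart(nu0 + n, S0 + resid.T @ resid, rng)
```

For a tempered chain, the likelihood terms (n and the residual cross-product) would be scaled by τ_r before forming the update.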

A.3 Full conditional distribution for censored observations
Let A be the n × n permutation matrix which re-orders the elements of y such that Ay = (y_S^T, y_C^T)^T, where y_S and y_C are the n_S × 1 and n_C × 1 sub-vectors corresponding to the observed and censored responses, respectively. Similarly, partition L conformably into blocks L_SS, L_SC, L_CS and L_CC. The full conditional distribution of y_C is then multivariate normal, with mean and variance given by the usual conditional-normal formulae applied to these blocks, truncated to the censoring region.
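The conditional-normal calculation underlying this full conditional can be sketched as follows (index sets and names are illustrative; truncation to the censoring region, e.g. by rejection sampling, is omitted):

```python
import numpy as np

def censored_full_conditional(mu, V, y_obs, obs_idx, cen_idx):
    """Mean and covariance of the Gaussian full conditional of the
    censored responses y_C given the observed y_S, before truncation:
    standard conditional-normal formulae applied to the blocks of the
    (permuted) mean mu and covariance V."""
    mu = np.asarray(mu, float)
    V = np.asarray(V, float)
    V_ss = V[np.ix_(obs_idx, obs_idx)]
    V_cs = V[np.ix_(cen_idx, obs_idx)]
    V_cc = V[np.ix_(cen_idx, cen_idx)]
    w = np.linalg.solve(V_ss, y_obs - mu[obs_idx])
    cond_mean = mu[cen_idx] + V_cs @ w
    cond_cov = V_cc - V_cs @ np.linalg.solve(V_ss, V_cs.T)
    return cond_mean, cond_cov
```

A draw from the full conditional is then a draw from this Gaussian, kept only if it falls inside the censoring region.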

C.1 Physical parameters
The log density of the full conditional distribution of θ is denoted h(θ). Its derivative ∂h(θ)/∂θ and the metric tensor for θ, whose elements involve derivatives with respect to θ_k for k = 1, . . . , p, follow by direct differentiation of h(θ).
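These quantities feed a Metropolis-Hastings proposal preconditioned by the metric tensor. A minimal sketch of one such step, simplified to a constant metric (full RMLMH lets the metric vary with θ, which introduces additional curvature terms omitted here):

```python
import numpy as np

def manifold_mala_step(theta, log_post, grad, G_inv, eps, rng):
    """One Metropolis-adjusted Langevin step with a constant metric:
    drift along G^{-1} times the gradient of the log posterior, propose
    with covariance eps^2 G^{-1}, then accept or reject."""
    L = np.linalg.cholesky(G_inv)

    def prop_mean(t):
        return t + 0.5 * eps ** 2 * (G_inv @ grad(t))

    def log_q(to, frm):  # proposal log density, up to an additive constant
        d = to - prop_mean(frm)
        return -0.5 * (d @ np.linalg.solve(G_inv, d)) / eps ** 2

    prop = prop_mean(theta) + eps * (L @ rng.normal(size=theta.size))
    log_alpha = (log_post(prop) + log_q(theta, prop)
                 - log_post(theta) - log_q(prop, theta))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return theta, False
```

For a toy standard normal target, `log_post = lambda t: -0.5 * (t @ t)`, `grad = lambda t: -t` and an identity `G_inv` recover ordinary MALA.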

C.2 Correlation parameter for time dependency
Consider a log transformation of ρ, i.e. a = log ρ, and denote the log density of the full conditional distribution of a by h(a). The derivative ∂h(a)/∂a involves T_ia = ∂T_i/∂a, with (j,l)th element −exp(a)(t_ij − t_il)^2 T_i,jl. The metric tensor for a involves T_iaa = ∂^2 T_i/∂a^2, with (j,l)th element exp(a)(t_ij − t_il)^2 (exp(a)(t_ij − t_il)^2 − 1) T_i,jl.
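These derivative elements can be verified numerically by finite differences. The sketch below assumes the squared-exponential form T_jl = exp(−exp(a)(t_j − t_l)^2), chosen because it matches the stated derivative elements; the exact form of T_i in the paper may carry additional structure:

```python
import numpy as np

def corr_matrix(a, t):
    """Squared-exponential time correlation with rho = exp(a):
    T_jl = exp(-exp(a) * (t_j - t_l)^2). Assumed form."""
    d2 = (t[:, None] - t[None, :]) ** 2
    return np.exp(-np.exp(a) * d2)

def corr_matrix_da(a, t):
    """(j,l)th element of dT/da: -exp(a) (t_j - t_l)^2 T_jl."""
    d2 = (t[:, None] - t[None, :]) ** 2
    return -np.exp(a) * d2 * corr_matrix(a, t)

def corr_matrix_daa(a, t):
    """(j,l)th element of d^2T/da^2:
    exp(a)(t_j - t_l)^2 (exp(a)(t_j - t_l)^2 - 1) T_jl."""
    e = np.exp(a) * (t[:, None] - t[None, :]) ** 2
    return e * (e - 1.0) * corr_matrix(a, t)

# finite-difference check of the first derivative
t = np.array([0.0, 0.5, 1.2])
a, h = -0.3, 1e-6
fd = (corr_matrix(a + h, t) - corr_matrix(a - h, t)) / (2 * h)
```

The central difference `fd` agrees with `corr_matrix_da` to numerical precision, and an analogous second difference checks `corr_matrix_daa`.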

C.3 Correlation parameters for model discrepancy
Consider a log transformation of each element of ψ, i.e. b_i = log ψ_i for i = 1, 2, and denote the log density of the full conditional distribution of b = (b_1, b_2) by h(b). The derivative ∂h(b)/∂b_i involves S_i = ∂S/∂b_i, and the metric tensor G(b) has (i,j)th element built from these first derivatives. The derivatives of G(b) involve S_ik = ∂^2 S/∂b_i ∂b_k, with (j,l)th element given by: S_jl, if i = 1 and k = 1; exp(b_2)(t_j − t_l)^2 (exp(b_2)(t_j − t_l)^2 − 1) S_jl, if i = 2 and k = 2; and −exp(b_2)(t_j − t_l)^2 S_jl, if i = 1 and k = 2.

D Prior sensitivity analysis
Suppose ρ ∼ Gamma(α, β), ψ_i ∼ Gamma(α, β) for i = 1, 2, Σ ∼ IW(ν, I_3) and Ω ∼ IW(ν, I_3), where the Gamma distribution is parameterised such that the expectation and variance are α/β and α/β^2, respectively, and the inverse-Wishart such that the mode is I_3/ν. Let the prior distribution for ρ, ψ, Σ and Ω outlined in Section 2.3 be denoted Prior 0, where α = β = 1 and ν = 4. We consider three further prior specifications (denoted Priors 1, 2 and 3) given by different combinations of α, β and ν, corresponding to a sequence of increasingly diffuse prior distributions. For each prior, the analysis described in Section 4 was undertaken and the optimum values of x and t were found by ACE. Table 4 shows the values of α, β and ν for Priors 1, 2 and 3, as well as the optimum values of x and t. Also included is the probability of satisfying the specification limits given by (3) under each prior distribution, evaluated at the optimum values found under that prior. There is some variability with respect to the initial amounts of A and D and, particularly, with respect to time. However, the probability of satisfying the specification limits remains broadly insensitive to these changes. We conclude that the results are robust to the choice of prior distribution.

Table 4: For Priors 0, 1, 2 and 3: a) the values of α, β and ν specifying the prior distributions for ρ, ψ, Σ and Ω; b) the optimum values of x and t maximizing the probability of satisfying the specification limits; and c) the probability of satisfying the specification limits given by (3) under each prior distribution, evaluated at the optimum values found under that prior.
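Under this Gamma parameterisation, hyperparameters achieving a given prior mean and variance follow directly; a small sketch (the helper name is ours, for illustration):

```python
def gamma_hyper(mean, var):
    """Map a target prior mean and variance to (alpha, beta) under the
    parameterisation used here, with E = alpha/beta and Var = alpha/beta^2."""
    beta = mean / var
    return mean * beta, beta

# Prior 0 has alpha = beta = 1, i.e. prior mean 1 and variance 1;
# keeping the mean at 1 while inflating the variance to 10 gives a more
# diffuse specification with alpha = beta = 0.1.
```

Increasingly diffuse priors with a fixed mean are therefore obtained by shrinking α and β together, consistent with the sequence Priors 1 to 3.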