Bayesian Solution Uncertainty Quantification for Differential Equations

Abstract. We explore probability modelling of discretization uncertainty for system states defined implicitly by ordinary or partial differential equations. Accounting for this uncertainty can avoid posterior under-coverage when likelihoods are constructed from a coarsely discretized approximation to system equations. A formalism is proposed for inferring a fixed but a priori unknown model trajectory through Bayesian updating of a prior process conditional on model information. A one-step-ahead sampling scheme for interrogating the model is described, its consistency and first order convergence properties are proved, and its computational complexity is shown to be proportional to that of numerical explicit one-step solvers. Examples illustrate the flexibility of this framework to deal with a wide variety of complex and large-scale systems. Within the calibration problem, discretization uncertainty defines a layer in the Bayesian hierarchy, and a Markov chain Monte Carlo algorithm that targets this posterior distribution is presented. This formalism is used for inference on the JAK-STAT delay differential equation model of protein dynamics from indirectly observed measurements. The discussion outlines implications for the new field of probabilistic numerics.


Quantifying uncertainty for differential equation models
Many scientific, economic, and engineering disciplines represent the spatio-temporal evolution of complex systems implicitly as differential equations, using few but readily interpretable parameters. Differential equation models describe the natural dependence between system states and their rates of change on an open spatio-temporal domain, $D \subset \mathbb{R}^d$. Their relationship, represented by the function $F$, is fully specified up to some physical constants $\theta$ belonging to a parameter space $\Theta$. Mathematically, the system states, $u : D \times \Theta \to \mathbb{R}^p$, satisfy the partial differential equation (PDE),
$$F\left(x, t, \tfrac{\partial}{\partial t}u, \tfrac{\partial}{\partial x}u, \ldots, u, \theta\right) = 0, \qquad (x, t) \in D \cup \partial D,$$
and constraints at the boundary $\partial D$ of $D$. When states evolve with respect to a single variable, such as time, the model simplifies to an ordinary differential equation (ODE).
The explicit solution, denoted $u^* : D \times \Theta \to \mathbb{R}^p$, of a differential equation problem over spatio-temporal locations $(x, t) \in D$ is used to explore system dynamics, design experiments, or extrapolate to expensive or dangerous circumstances. Therefore, increasing attention is being paid to the challenges associated with quantifying uncertainty for systems defined by such models, and in particular those based on mathematical descriptions of complex phenomena such as the weather, ocean currents, ice sheet flow, and cellular protein transport (Ghanem and Spanos, 2003; Kaipio et al., 2004; Huttunen and Kaipio, 2007; Marzouk et al., 2007; Marzouk and Najm, 2009; Stuart, 2010).
The main challenge of working with differential equation models, from both mathematical and statistical perspectives, is that solutions are generally not available in closed form. When this occurs, prior exploration of the mathematical model, $u^*(x, t, \theta)$, cannot be performed directly. This issue is dealt with throughout the literature by replacing the exact solution with an $N$-dimensional approximate solution, $\hat{u}_N(x, t, \theta)$, obtained using numerical techniques over a size $N$ grid partitioning of the domain $D$. Inference and prediction based on $\hat{u}_N(x, t, \theta)$ then inform the reasoning process when, for example, assessing financial risk in deciding on oil field bore configurations, or forming government policy in response to extreme weather events.
Limited computation and coarse mesh size are contributors to numerical error which, if assumed negligible, can lead to serious misspecification for these highly nonlinear models. The study of how uncertainty propagates through a mathematical model is known as the forward problem. Numerical error analysis provides local and global discretization error bounds that are characterized point-wise and relate to the asymptotic behaviour of the deterministic approximation of the model. However, accounting for this type of verification error for the purpose of model inference has proven to be a difficult open problem (Oberkampf and Roy, 2010), so that discretization uncertainty is often ignored in practice. For complex, large-scale models, maintaining a mesh size that is fine enough to ignore discretization uncertainty demands prohibitive computational cost (Arridge et al., 2006), so that a coarse mesh is the only feasible choice despite the associated model uncertainty (Kim et al., 2014). Furthermore, some ODE and PDE models exhibit highly structured but seemingly unpredictable behaviour. Although the long term behaviour of such systems is entrenched in the initial states, the presence of numerical discretization error introduces small perturbations which eventually lead to exponential divergence from the exact solution. Consequently, information about initial states rapidly decays as the system solution evolves in time (Berliner, 1991).
The calibration or statistical inverse problem concerns the uncertainty in unknown model inputs, $\theta$, given measurements of the model states (Bock, 1983; Ionides et al., 2006; Dowd, 2007; Ramsay et al., 2007; Xue et al., 2010; Brunel, 2008; Liang and Wu, 2008; Calderhead and Girolami, 2011; Campbell and Steele, 2011; Gugushvili and Klaassen, 2012; Campbell and Chkrebtii, 2013; Xun et al., 2013). However, inference for differential equation models lacking a closed form solution has mainly proceeded under the unverified assumption that numerical error is negligible over the entire parameter space. From the perspective of Kennedy and O'Hagan (2001), the discrepancy between the numerical approximation, $\hat{u}_N$, and the unobservable mathematical model, $u^*$, at the observation locations $(x, t) \in D_T$ would be modelled by the function $\delta(x, t) = u^*(x, t, \theta) - \hat{u}_N(x, t, \theta)$. Estimation of $\delta(x, t)$ requires observations $y(x, t)$ from $u^*(x, t, \theta)$. However, the magnitude and structure of the discretization error change with $\theta$, while $\delta(x, t)$ is agnostic to the specific form of the underlying mathematical model, treating it as a "black box". Consequently, exploring the model uncertainty for given values of $\theta$ requires either observations for each parameter setting or the assumption that $\delta(x, t)$ is independent of $\theta$. Alternatively, in the discussion of Kennedy and O'Hagan (2001), H. Wynn argues that the sensitivity equations and other underlying mathematical structures that govern the numerical approximation could be included in the overall uncertainty analysis.

Contributions and organization of the paper
This paper develops a Bayesian formalism for modelling discretization uncertainty within the forward problem, and explores its contribution to posterior uncertainty in the model parameters. We substantially expand and formalize ideas first described by Skilling (1991) and develop a general class of probability models for solution uncertainty for differential equations in the so-called explicit form,
$$A = f(x, t, B, \theta), \qquad (1)$$
where $A$ and $B$ are subsets of the deterministic solution $u = u(x, t, \theta)$ and its partial derivatives, and the function $f$ is Lipschitz continuous in $B$. Although (1) is very general, Sections 2 and 3 consider ordinary differential equations for expositional simplicity, where $A = \mathrm{d}u(t, \theta)/\mathrm{d}t := u_t$ and $B = u(t, \theta) := u$ for $t \in [0, L]$, $L > 0$, i.e. $u_t = f(t, u, \theta)$, with a given initial condition. Sections 4 and 5 extend this to the generality of (1), showcasing a wide class of ODEs and PDEs, such as mixed boundary value problems with multiple solutions, delay initial function problems, and chaotic differential equations.
Section 2 considers fixed model parameters $\theta$, hyperparameters $\Psi$, and a discretization grid of size $N$. Prior uncertainty about the solution $u$ and its derivative $u_t$ is encoded in a probability model $[u, u_t \mid \theta, \Psi]$ on a space of functions satisfying the initial condition. Section 3 describes a sequential model interrogation method that consists of (i) drawing a sample $u^{n-1}$ from the marginal $[u \mid f_1, \ldots, f_{n-1}, \theta, \Psi]$, and (ii) using this to compute the model interrogation $f_n = f\{s_n, u^{n-1}(s_n), \theta\}$. Conditioning on these incrementally obtained interrogations defines the sequence of models $[u, u_t \mid f_1, \ldots, f_n, \theta, \Psi]$, $1 \le n \le N$, to obtain the probability model, or probabilistic solution,
$$[u \mid f_1, \ldots, f_N, \theta, \Psi], \qquad (2)$$
which contracts to the true deterministic solution $u^*$ as $N \to \infty$. Computational complexity is shown in Section 3.2 to be of the same order as comparably sampled explicit numerical solvers, and scalability is illustrated in Section 5 for the chaotic Navier-Stokes system involving the solution of over 16,000 coupled stiff ordinary differential equations. The probabilistic solution defines a trade-off between discretization uncertainty and computational cost. To illustrate this point, Figure 1 compares the exact, but a priori unknown, solution with sample trajectories from the proposed probability model (2) and a first order numerical scheme obtained over progressively refined discretization grids. The exact solution lies within a high posterior density region of (2), and as the grid is refined, yielding more information about the model, knowledge of the solution increases, concentrating to the exact solution. Strong posterior non-stationarity of the error structure (2) is illustrated in Figure 2 for the three states of a system whose trajectory is restricted around a chaotic attractor (details are provided in Section 5.1).
In Section 4 we embed discretization uncertainty as an additional layer in the inferential hierarchy (3), comprising the observation model $[y \mid u, \theta, \Sigma]$, the probabilistic solution $[u \mid \theta, \Psi]$, and the prior $[\theta, \Psi, \Sigma]$, where $\Sigma$ includes parameters of the observation process, such as the covariance structure of the error terms. We adopt a forward simulation approach for the middle layer within a Markov chain Monte Carlo (MCMC) algorithm to sample from this posterior distribution. The importance of explicitly modelling discretization uncertainty within the inverse problem is illustrated in Section 4.1 by the nonlinear posterior dependence between solver hyperparameters $\Psi$ and differential equation model parameters $\theta$. This approach provides a useful exploratory tool to diagnose the presence of discretization effects. The problem of posterior under-coverage resulting from ignoring discretization induced solution uncertainty in (3) is illustrated in Section 5.4.
Discussion of various extensions and implications for the emerging field of probabilistic numerics is provided in Section 6. Algorithmic and theoretical details are described in the Supplement (Chkrebtii et al., 2016). MATLAB implementations of all examples are provided at github.com/ochkrebtii/uqdes.

Notation
We make use of the mathematical framework for defining numerical problems on function spaces developed through the work of Stuart (2010), Kaipio and Somersalo (2007), O'Hagan (1992), and Skilling (1991). We denote partial derivatives using subscripts to indicate the variable with respect to which a partial derivative is taken. For example, $\partial^2_x u$ refers to the second partial derivative of $u$ with respect to $x$. Superscripts in parentheses denote elements of a vector; for example, $u^{(k)}$ is the $k$th element of the $P$-dimensional vector $u$. The methodology proposed in this paper is based on a sequentially updated probabilistic solution; consequently, we use superscripts without parentheses, e.g., $m^k$, to refer to $k$-times updated functions. Superscripts on random variables, e.g., $u^k$, represent a sample from the $k$-times updated distribution. For notational simplicity, we omit the dependence of the differential equation solution $u$ and derivative $u_t$ on $\theta$, $\Psi$, $N$, and $x$ and/or $t$ when the context makes this dependence clear.

Prior model for a fixed but unknown solution
This section is concerned with probability modelling of initial beliefs about the unknown function, $u^* : D \times \Theta \to \mathbb{R}^p$, satisfying an ODE initial value problem of the form,
$$u_t(t) = f\{t, u(t), \theta\}, \qquad t \in [0, L], \qquad (4)$$
with fixed initial value $u(0) = u^*(0)$. We describe how to incorporate boundary and regularity assumptions into a flexible Gaussian process (GP) prior model (Rasmussen and Williams, 2006; Stuart, 2010) for the unknown solution. Prior beliefs are updated in Section 3 by conditioning on information about the differential equation model obtained over a size $N$ partition of the domain $D$ according to a chosen sampling scheme.

GP prior on the solution of an ODE initial value problem
GPs are a flexible class of stochastic processes that can be defined jointly with their derivatives in a consistent way through the specification of an appropriate covariance operator (Solak et al., 2003). We model uncertainty jointly for the time derivative in (4) and its solution, $(u_t, u)$, using a GP prior with covariance structure $(C^0_t, C^0)$ and mean function $(m^0_t, m^0)$. The marginal means must satisfy the constraint $\int_0^t m^0_t(z)\,dz = m^0(t)$. The covariance structure is defined as a flexible convolution of kernels (see, e.g., Higdon, 1998, 2002). A covariance operator for the derivative $u_t$ evaluated between locations $t_j$ and $t_k$ is obtained by convolving a square integrable kernel $R_\lambda : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ with itself, parameterized by the length-scale $\lambda > 0$ and prior precision $\alpha > 0$, to be discussed in Section 3.4. An example is the infinitely differentiable kernel $R_\lambda(t_j, t_k) = \exp\{-(t_j - t_k)^2/2\lambda^2\}$, which decreases exponentially with the squared distance between $t_j$ and $t_k$ weighted by the length-scale $\lambda$. Its convolution yields the squared exponential covariance
$$C^0_t(t_j, t_k) = \alpha^{-1}\lambda\sqrt{\pi}\,\exp\{-(t_j - t_k)^2/4\lambda^2\}.$$
An example of a kernel with bounded support is $R_\lambda(t_j, t_k) = \mathbb{1}_{(t_j-\lambda,\,t_j+\lambda)}(t_k)$, which yields the piecewise linear uniform covariance $C^0_t(t_j, t_k) = \{\min(t_j, t_k) - \max(t_j, t_k) + 2\lambda\}\,\mathbb{1}_{(0,\infty)}\{\min(t_j, t_k) - \max(t_j, t_k) + 2\lambda\}$. Details are provided in Supplement D.4. A marginal covariance operator, $C^0$, on the state is obtained by integrating $C^0_t$ with respect to the evaluation points $t_j$ and $t_k$ or, equivalently, by convolving the integrated kernel $Q_\lambda(t_j, t_k) = \int_0^{t_j} R_\lambda(z, t_k)\,dz$ with itself. Defining the covariance over the state by integration ensures that $C^0(0, 0) = 0$, which enforces the boundary condition $u(0) = m^0(0)$ (see Supplement D.1; an alternative way to enforce the boundary constraint is to obtain the derivative process $u_t$ by differentiation of $u$ and condition the joint prior for $(u_t, u)$ on the exact initial state, $u^*(0)$). This introduces anisotropy over the state that is consistent with exact knowledge of the solution at the initial boundary and increasing uncertainty thereafter. The cross-covariance terms are defined analogously, by integrating $C^0_t$ with respect to a single evaluation point, where each is the adjoint of the other. The initial joint GP prior probability distribution for the solution at a vector of evaluation times $\mathbf{t}_k$ and its time derivative at a possibly different vector of evaluation times $\mathbf{t}_j$, given fixed hyperparameters $\Psi = (m^0_t, m^0, \alpha, \lambda)$, is
$$\begin{bmatrix} u_t(\mathbf{t}_j) \\ u(\mathbf{t}_k) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m^0_t(\mathbf{t}_j) \\ m^0(\mathbf{t}_k) \end{bmatrix},\; \begin{bmatrix} C^0_t(\mathbf{t}_j, \mathbf{t}_j) & \int_0^{\mathbf{t}_k} C^0_t(\mathbf{t}_j, w)\,dw \\ \int_0^{\mathbf{t}_j} C^0_t(z, \mathbf{t}_k)\,dz & C^0(\mathbf{t}_k, \mathbf{t}_k) \end{bmatrix} \right). \qquad (5)$$
This prior modelling strategy is straightforwardly generalized to explicit initial value problems of the form (1) by defining the prior jointly on the state and the required higher order derivatives.
As a simple example, consider the second order initial value ODE problem,
$$u_{tt}(t) = \sin(2t) - u(t), \qquad u(0) = -1, \qquad u_t(0) = 0, \qquad t \in [0, 10]. \qquad (6)$$
Its exact solution, $u^*(t) = \{2\sin(t) - \sin(2t) - 3\cos(t)\}/3$, is assumed unknown a priori. The first column of Figure 3 shows five draws from the marginal densities of the prior $[u_t(t), u(t)]$ defined over a fine grid on $[0, 10]$ using a squared exponential covariance with $(\lambda, \alpha) = (0.8, 5)$. Section 3 describes a sequential updating procedure of the prior model to obtain the probabilistic solution for this example.
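To make the construction concrete, the following MATLAB sketch draws from a discretized version of the prior (5) for this example, building $C^0_t$ from the squared exponential covariance above and obtaining $C^0$ by integration. The crude Riemann quadrature operator, the constant prior mean, and the jitter term are illustrative assumptions rather than the authors' implementation; closed forms are given in Supplement D.1.

```matlab
% Draws from the GP prior on the state for example (6); a sketch, assuming
% crude Riemann quadrature in place of the closed-form integrals of
% Supplement D.1.
lambda = 0.8;  alpha = 5;                       % hyperparameters from Section 2.1
t  = linspace(0, 10, 200)';  dt = t(2) - t(1);  % fine evaluation grid
C0t = (sqrt(pi)*lambda/alpha) * exp(-(t - t').^2 / (4*lambda^2));
W   = dt * tril(ones(numel(t)), -1);            % (W*v)(i) ~ int_0^{t_i} v(z) dz
C0  = W * C0t * W';                             % state covariance, C0(0,0) = 0
m0  = -ones(numel(t), 1);                       % constant prior mean, m0(0) = u*(0)
Lc  = chol(C0 + 1e-10*eye(numel(t)), 'lower');  % jitter for numerical stability
u_draws = m0 + Lc * randn(numel(t), 5);         % five prior draws through u(0) = -1
plot(t, u_draws);
```

Because the state covariance is built by integrating the derivative covariance, every draw passes exactly through the initial value, with uncertainty growing away from the boundary, as described above.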

Updating prior beliefs for fixed but unknown ODE solution
The prior model (5) on the unknown solution and its derivative will be updated conditionally on model information collected over the ordered partition $s = (s_1, \ldots, s_N)$ of the domain $[0, L]$. Define a model interrogation at time $s_n$ as the output $f_n \equiv f\{s_n, u^{n-1}(s_n), \theta\}$ from the right hand side of equation (4) evaluated at a sample $u^{n-1}(s_n)$ drawn from the current state model. Model interrogations are generated sequentially based on state samples drawn from progressively refined prior models in order to accommodate fast changing dynamics. Because the ODE model (4) defines the solution implicitly, we are unable to measure the exact state or derivative, and therefore both the model interrogations and their error structure will take into account increasing uncertainty with time.
This section begins by expanding on these function estimation steps with full details of the Bayesian updating algorithm. We illustrate the forward solution algorithm on the simple example (6), which has a known closed form solution, before considering more realistic systems. Section 3.2 examines modelling choices for which this procedure can attain computational complexity of $O(N)$, while Section 3.3 establishes the consistency and first order convergence properties of this approach. The choice of hyperparameters is addressed in Section 3.4. Specific extensions to the full generality of (1) are presented in Sections 4 and 5.

Model updating based on interrogations from sampled states
This section describes a sequential sampling scheme to interrogate the model after each update. Model interrogations occur at pre-determined discretization grid locations $s = (s_1, \ldots, s_N)$, while the resulting sequence of posterior solution models can be computed at a possibly different vector of time points, $\mathbf{t}$. There are many ways to generalize this basic one-step-ahead sampling scheme, as discussed in Section 6.

First update
The derivative of the exact solution on the boundary, $s_1 = 0$, can be obtained exactly by evaluating the right hand side of (4) at the known initial condition $u^*(s_1)$ as,
$$f_1 = f\{s_1, u^*(s_1), \theta\}.$$
The first update of the prior probability model (5) is performed by conditioning on the exact derivative $f_1$. This yields the joint conditional predictive probability distribution at evaluation times $\mathbf{t}_j$ and $\mathbf{t}_k$,
$$[u_t(\mathbf{t}_j), u(\mathbf{t}_k) \mid f_1, \theta, \Psi], \qquad (7)$$
where the marginal means and covariances at the vectors of times $\mathbf{t}_j$ and $\mathbf{t}_k$ are given by the Gaussian conditioning formulas:
$$m^1_t(\mathbf{t}_j) = m^0_t(\mathbf{t}_j) + C^0_t(\mathbf{t}_j, s_1)\,C^0_t(s_1, s_1)^{-1}\{f_1 - m^0_t(s_1)\},$$
$$C^1_t(\mathbf{t}_j, \mathbf{t}_k) = C^0_t(\mathbf{t}_j, \mathbf{t}_k) - C^0_t(\mathbf{t}_j, s_1)\,C^0_t(s_1, s_1)^{-1}\,C^0_t(s_1, \mathbf{t}_k),$$
with the state mean $m^1$ and covariance $C^1$ obtained analogously via the prior cross-covariances between the state and the derivative. Closed forms for the means and covariances are obtained by integration (along with cross-covariances, Supplement D.1). We refer to this predictive probability measure as the first update of the algorithm and use it as the prior for the next iteration of the algorithm.

Second update
The derivative of the exact solution is unknown beyond $s_1 = 0$. We therefore introduce the random variable $f_2$, representing a model interrogation at the subsequent time step, $s_2 > 0$. A realization of $f_2$ is obtained by drawing a sample $u^1(s_2)$ at time $s_2$ from the marginal predictive distribution of (7), and then applying the transformation from the right hand side of equation (4),
$$f_2 = f\{s_2, u^1(s_2), \theta\}.$$
The uncertainty in $u(s_2) \mid f_1$ means that $f_2$ is not guaranteed to equal the exact derivative at time $s_2$ (the function $f$ is a differential operator only for the input $u = u^*$). Indeed, without the exact solution, the probability distribution of $f_2$ is unavailable. As with any model for a data generating process, the likelihood (error model) should be elicited on a case-by-case basis. Here we provide general guidance on constructing this likelihood in the absence of additional expert information about the error model for $f_2$, and all subsequent interrogations $f_n$. In general, we expect that (i) the discrepancy between $u_t(t)$ and $f\{t, u(t), \theta\}$, $t \in [0, L]$, is zero when $u$ is the exact solution; and (ii) the function $f$ is smooth in the second argument, therefore the discrepancy decreases smoothly with decreasing uncertainty in the derivative of the sampled state realization $u^1(s_2)$. These facts suggest a spherically symmetric error model for $f_2$ with a single mode at $f_2 = u_t(s_2)$ and dispersed as an increasing function of distance from $s_1 = 0$. For example, we may consider the Gaussian error model,
$$f_2 \mid u_t(s_2) \sim \mathcal{N}\{u_t(s_2),\, r^1(s_2)\}, \qquad (8)$$
with mean $u_t(s_2)$ and variance $r^1(s_2) = C^1_t(s_2, s_2)$. A degenerate special case of this error model is discussed at the end of this section. We may now update (7) by conditioning on the model interrogation $f_2$ under the error model (8) to obtain the predictive probability distribution at the vectors of evaluation times $\mathbf{t}_j$ and $\mathbf{t}_k$,
$$[u_t(\mathbf{t}_j), u(\mathbf{t}_k) \mid f_1, f_2, \theta, \Psi]. \qquad (9)$$
Defining $g_2 = C^1_t(s_2, s_2) + r^1(s_2)$, the marginal means and covariances at the vectors of evaluation times $\mathbf{t}_j$ and $\mathbf{t}_k$ are given by,
$$m^2_t(\mathbf{t}_j) = m^1_t(\mathbf{t}_j) + C^1_t(\mathbf{t}_j, s_2)\,g_2^{-1}\{f_2 - m^1_t(s_2)\},$$
$$C^2_t(\mathbf{t}_j, \mathbf{t}_k) = C^1_t(\mathbf{t}_j, \mathbf{t}_k) - C^1_t(\mathbf{t}_j, s_2)\,g_2^{-1}\,C^1_t(s_2, \mathbf{t}_k),$$
where the state mean, covariance, and derivative cross-covariances can be obtained analytically via integration (see Supplement D.1).

Subsequent updates
Subsequent updates, $2 < n \le N$, begin by drawing a sample $u^{n-1}(s_n)$ at time $s_n$ (illustrated by circles in the top row of Figure 3) from the marginal predictive posterior $[u(s_n) \mid f_1, \ldots, f_{n-1}, \theta, \Psi]$, and conditioning on the model interrogation $f_n = f\{s_n, u^{n-1}(s_n), \theta\}$ under the error model, e.g.,
$$f_n \mid u_t(s_n) \sim \mathcal{N}\{u_t(s_n),\, r^{n-1}(s_n)\}, \qquad (10)$$
to obtain the predictive probability distribution at the vectors of evaluation times $\mathbf{t}_j$ and $\mathbf{t}_k$, $[u_t(\mathbf{t}_j), u(\mathbf{t}_k) \mid f_1, \ldots, f_n, \theta, \Psi]$. Defining $g_n = C^{n-1}_t(s_n, s_n) + r^{n-1}(s_n)$, the resulting joint predictive probability distribution is Gaussian with marginal means and covariances:
$$m^n_t(\mathbf{t}_j) = m^{n-1}_t(\mathbf{t}_j) + C^{n-1}_t(\mathbf{t}_j, s_n)\,g_n^{-1}\{f_n - m^{n-1}_t(s_n)\},$$
$$C^n_t(\mathbf{t}_j, \mathbf{t}_k) = C^{n-1}_t(\mathbf{t}_j, \mathbf{t}_k) - C^{n-1}_t(\mathbf{t}_j, s_n)\,g_n^{-1}\,C^{n-1}_t(s_n, \mathbf{t}_k),$$
and with the state mean, covariances, and derivative cross-covariances again obtained via integration (see Algorithm 1 and Supplement D.1). Here the state and derivative, $u$ and $u_t$ respectively, may be of arbitrary dimension. Under a bounded support covariance, $C^n(\mathbf{t}_j, \mathbf{t}_k)$ and $C^n_t(\mathbf{t}_j, \mathbf{t}_k)$ become sparse, band diagonal matrices. After $N$ updates have been performed, we obtain a joint posterior probability distribution $[u_t, u \mid \mathbf{f}]$ for the unknown solution and its derivative, conditional on realizations of one trajectory of the model interrogation vector $\mathbf{f}$. Draws from the marginal model (2) can be obtained via Monte Carlo, as described in Algorithm 1, yielding realized trajectories from the forward model (2) at desired time locations $\mathbf{t}$. Five draws from the joint distribution of (2) and the interrogations $f_n$ at iterations $n = 12, 24, 36, 48$ are illustrated in the second through fifth columns of Figure 3 for the second-order ODE initial value problem (6). For notational simplicity, we will hereafter assume that the temporal locations of interest, $\mathbf{t}$, form a subset of the discretization grid locations, $s$.
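The recursion above is compact enough to sketch in a few lines of MATLAB. The following is a minimal, self-contained reading of the scheme for a scalar ODE: it maintains the joint mean and covariance of $(u_t, u)$ on the grid and conditions on one interrogation per step. The test problem, the crude quadrature operator W, and the hyperparameter values are illustrative assumptions, not the authors' implementation (compare Algorithm 1 and Supplement D.1).

```matlab
% Minimal sketch of the one-step-ahead probabilistic solver for a scalar
% ODE u_t = f(t,u), u(0) = u0, using a squared exponential derivative
% covariance and the error model g_n = 2*C_t(s_n,s_n) of Section 3.1.
f = @(t,u) -u;  u0 = 1;                     % illustrative problem, u* = exp(-t)
N = 100;  s = linspace(0, 5, N)';  h = s(2) - s(1);
lambda = 2*h;  alpha = N;                   % hyperparameter scaling (Section 3.4)
Ct = (sqrt(pi)*lambda/alpha) * exp(-(s - s').^2 / (4*lambda^2));
W  = h * tril(ones(N), -1);                 % crude quadrature: int_0^{s_i} (.) dz
Cz = [Ct, Ct*W'; W*Ct, W*Ct*W'];            % joint prior cov of (u_t(s), u(s))
mz = [zeros(N,1); u0*ones(N,1)];            % prior means with m0(0) = u0
for n = 1:N
    if n == 1
        fn = f(s(1), u0);                   % exact interrogation at the boundary
        g  = Cz(1,1);
    else
        us = mz(N+n) + sqrt(max(Cz(N+n,N+n),0)) * randn;  % one-step-ahead sample
        fn = f(s(n), us);                   % model interrogation f_n
        g  = 2 * Cz(n,n);                   % predictive + interrogation variance
    end
    gain = Cz(:,n) / g;                     % joint Kalman-type gain
    mz   = mz + gain * (fn - mz(n));        % condition (u_t, u) on f_n
    Cz   = Cz - gain * Cz(n,:);
end
m_state  = mz(N+1:end);                     % posterior mean given one trajectory f
sd_state = sqrt(max(diag(Cz(N+1:end, N+1:end)), 0));
plot(s, m_state, s, exp(-s), '--');         % compare with the exact solution
```

Repeating this loop with independent random seeds yields Monte Carlo draws from the marginal forward model (2).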

Point mass likelihood on model interrogations
A special case of this procedure models each interrogation, $f_n$, as the exact unknown derivative, $u^*_t(s_n)$, regardless of grid size, $N$. Thus, model interrogations are interpolated in the derivative space by replacing (8) and (10) with the point mass density obtained by setting $r^{n-1}(s_n) = 0$, i.e. $[f_n \mid u_t(s_n)] = \delta\{f_n - u_t(s_n)\}$. Because sampled process realizations can still be inconsistent with the solution, computational implementation of this alternative scheme often requires using an arbitrary covariance nugget increasing with time, or the use of a rougher covariance function than the smoothness implied by the differential equation model. For these reasons we will restrict our attention to explicitly modelling the interrogation error, e.g. (8) and (10).

Computational complexity
A reasonable modelling assumption for the state is that temporal correlation decays to zero with distance. Indeed, most deterministic numerical methods make the analogous assumption by defining a lag of $M \ge 1$ time steps beyond which the solution has no direct effect. This Markovian assumption both adds flexibility for modelling very fast-changing dynamics and drastically reduces computational cost. Under a compactly supported covariance structure, Algorithm 1 requires only $O(N)$ operations by allowing truncation of the weight matrices employed in the mean and covariance updates. In contrast, when employing a covariance structure with unbounded support, Algorithm 1 attains computational complexity proportional to the cube of the number of time steps, since the number of operations can be written as a finite sum over $1 \le n \le N$ with each iteration requiring order $O(n^2)$ operations.
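The band structure behind the $O(N)$ claim is easy to visualize. The sketch below builds the uniform-kernel covariance from Section 2.1 (the $\alpha^{-1}$ scaling is an assumption) and shows that its bandwidth is fixed by $\lambda$, so each update touches only $O(1)$ entries.

```matlab
% Band structure of the compactly supported (uniform kernel) covariance;
% a sketch, assuming the alpha^{-1} scaling of Section 2.1.
N = 1000;  s = linspace(0, 1, N)';  h = s(2) - s(1);
lambda = 2*h;  alpha = N;
C0t = max(2*lambda - abs(s - s'), 0) / alpha;   % zero beyond |s_j - s_k| > 2*lambda
fprintf('nonzero fraction: %.4f\n', nnz(C0t)/numel(C0t));
spy(sparse(C0t));                               % band diagonal structure
```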
A computational comparison of Algorithm 1 with a similarly sampled explicit first order solver is provided in Supplement C. Although both attain the same computational scaling, the probabilistic method is more expensive. However, the model of solution uncertainty makes up for its added computational cost in practical applications where discretization grids cannot be refined enough to ensure negligible discretization effects. Within the inverse problem, the adaptive estimation of the length-scale can substantially increase solver accuracy (see Figure 10 in Supplement C).

Convergence
We now consider the rate at which the stochastic process with probability distribution (2) generated via Algorithm 1 concentrates around the exact solution of (4). The rate of convergence is as fast as the reduction of local truncation error for the analogously sampled one-step explicit Euler method. The proof is provided in Supplement D.2.

Algorithm 1 For the initial value problem (4), draw one sample from the forward model (2).
  At time $s_1 = 0$, initialize the derivative $f_1 = f\{s_1, u^*(0), \theta\}$, and define $m^0, m^0_t, C^0, C^0_t$ as above;
  for $n = 1 : N$ do
      update the means and covariances $(m^n_t, m^n, C^n_t, C^n)$ by conditioning on $f_n$;
      if $n < N$ then
          sample the one-step-ahead realization $u^n(s_{n+1})$ from the predictive distribution on the state for each modelled system component, and interrogate the model by computing $f_{n+1} = f\{s_{n+1}, u^n(s_{n+1}), \theta\}$;
      end if
  end for
  Sample and return $u = (u(t_1), \ldots, u(t_T))$ from the posterior at the desired locations $\mathbf{t}$.

Theorem 1. Consider a function $f(t, u(t)) : [0, L] \times \mathbb{R} \to \mathbb{R}$ that is continuous in the first argument and Lipschitz continuous in the second argument. The stochastic process with probability distribution (2) obtained using Algorithm 1 with $r^n(s) = 0$, $s \in [0, L]$, and a covariance kernel satisfying conditions (D.6) and (D.7) in Supplement D.3, converges in $L^1$ to the unique solution satisfying (4) as the maximum discretization grid step size, $h$, the length-scale, $\lambda$, and the prior variance, $\alpha^{-1}$, tend to zero. The rate of convergence is $O(h)$, proportional to the decrease in the maximum step size, $h$.

The choice of hyperparameters
As with the choice of a numerical solver, the choice of probabilistic solver hyperparameters must be based on prior knowledge about the solution, such as its smoothness.
Moreover, as with many nonparametric function estimation problems, asymptotic results on convergence may be used to guide our choice of covariance hyperparameters within the forward problem. For example, for the explicit first order sampling scheme, the result of Theorem 1 suggests setting the prior variance, $1/\alpha$, and the length-scale, $\lambda$, proportionally to the maximum distance, $h$, between subsequent discretization grid locations. When experimental data are available, it also becomes possible to obtain marginal posterior densities for the hyperparameters as part of the calibration problem (see Section 4), allowing data driven adaptation over the space of model parameters. In contrast, deterministic numerical differential equation solvers effectively fix the analogues of these hyperparameters, such as the quadrature rule or the step number, both within the forward and inverse problems.

Exact Bayesian posterior model inference
The discretization uncertainty model in Section 3 may now be incorporated into the calibration problem. We wish to infer model parameters $\theta$ defining states $u = (u(t_1), \ldots, u(t_T))$ from measurement data $y = (y(t_1), \ldots, y(t_T))$, via the hierarchical model,
$$y \mid u, \theta, \Sigma \sim \rho\{y \mid H(u), \theta, \Sigma\}, \qquad u \mid \theta, \Psi \sim p(u \mid \theta, \Psi), \qquad \theta, \Psi, \Sigma \sim \pi(\theta, \Psi, \Sigma), \qquad (3)$$
where the transformation $H$ maps the differential equation solution to the observation space, $\rho$ is a probability density, $\pi$ is a prior density, $\Sigma$ defines the observation process, such as the error structure, and $\theta$ may include unknown initial conditions or other model components. We describe a Markov chain Monte Carlo procedure targeting the joint posterior of the state and unknown parameters conditional on a vector of noisy observations from $[y \mid u, \theta, \Sigma]$. Section 4.1 describes a case where states are indirectly observed through a nonlinear transformation $H$.
Algorithm 2 targets the posterior distribution (3) by generating forward model proposals via conditional simulation from Algorithm 1 (Supplement B provides a parallel tempering implementation). Within Algorithm 2, proposed sample paths are conditionally simulated from $p(u \mid \theta, \Psi) = \int p(u, f \mid \theta, \Psi)\,df$, removing the dependence of the acceptance probability on this intractable marginal density. Such partially likelihood-free Markov chain Monte Carlo methods (Marjoram et al., 2003) are widely used for inference on stochastic differential equations (e.g., Golightly and Wilkinson, 2011).
As with numerically based approaches, computation of the forward model is the rate-limiting step in Algorithm 2. Computational savings can be obtained by exploiting the structure of the probabilistic solver, for example by targeting progressively refined posterior models in a Sequential Monte Carlo framework, or by using ensemble Markov chain Monte Carlo methods (Neal, 2011) when mixture components of (2) can be quickly computed. We now have the components for accounting for discretization uncertainty in differential equation models of natural and physical systems. The remainder of this section presents a calibration problem illustrating posterior sensitivity to discretization uncertainty.
Algorithm 2 Draw $K$ samples from the posterior $p(\theta, u \mid y, \alpha, \lambda, \Sigma)$ given observations of the transformed solution of a differential equation with unknown parameters $\theta$.
  Initialize $\theta, \alpha, \lambda \sim \pi(\cdot)$, where $\pi$ is the prior density;
  Using a probabilistic solver (e.g., Algorithm 1), simulate a vector of realizations, $u$, conditional on $(\theta, \alpha, \lambda)$, over the size $N$ discretization grid;
  for $k = 1 : K$ do
      Propose $(\theta', \alpha', \lambda') \sim q(\theta', \alpha', \lambda' \mid \theta, \alpha, \lambda)$, where $q$ is a proposal density;
      Using a probabilistic solver (e.g., Algorithm 1), simulate a vector of realizations, $u'$, conditional on $(\theta', \alpha', \lambda')$, over the size $N$ discretization grid;
      Compute the rejection ratio,
      $$\rho = \frac{p(y \mid u', \theta', \Sigma)\,\pi(\theta', \alpha', \lambda')\,q(\theta, \alpha, \lambda \mid \theta', \alpha', \lambda')}{p(y \mid u, \theta, \Sigma)\,\pi(\theta, \alpha, \lambda)\,q(\theta', \alpha', \lambda' \mid \theta, \alpha, \lambda)},$$
      where the ratio of the marginal densities, $p(u \mid f, \theta, \alpha, \lambda)$ to $p(u' \mid f', \theta', \alpha', \lambda')$, is omitted because the realizations $u$ and $u'$ are simulated directly from these densities;
      Accept $(\theta', \alpha', \lambda', u')$ with probability $\min(1, \rho)$;
  end for
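As a concrete illustration, the following MATLAB sketch runs this partially likelihood-free sampler on a toy problem. The stand-in stochastic forward model (which perturbs the exact solution of $u_t = -\theta u$ to mimic draws from Algorithm 1), the prior, and all tuning constants are assumptions for illustration only; note that the intractable marginal $p(u \mid \theta, \Psi)$ never appears in the acceptance step.

```matlab
% Sketch of Algorithm 2: Metropolis-Hastings with forward simulation in
% place of the intractable marginal density. The toy "probabilistic
% solver" below is a hypothetical stand-in for Algorithm 1.
rng(1);
s = linspace(0, 5, 20)';  sigma = 0.05;        % observation grid and noise sd
theta_true = 1;
y = exp(-theta_true*s) + sigma*randn(size(s)); % synthetic data
forward = @(th) exp(-th*s) + 0.01*randn(size(s)).*s;  % toy stochastic solver
loglik  = @(u) -0.5*sum((y - u).^2)/sigma^2;   % Gaussian observation model
logpri  = @(th) -0.5*th^2/100;                 % vague Gaussian prior on theta
K = 5000;  theta = 0.5;  u = forward(theta);  lp = loglik(u) + logpri(theta);
samples = zeros(K, 1);
for k = 1:K
    thp = theta + 0.1*randn;                   % symmetric random walk proposal q
    up  = forward(thp);                        % conditional forward simulation;
    lpp = loglik(up) + logpri(thp);            % p(u | theta, Psi) cancels from
    if log(rand) < lpp - lp                    % the rejection ratio
        theta = thp;  u = up;  lp = lpp;
    end
    samples(k) = theta;
end
hist(samples(1001:end), 40);                   % posterior over theta
```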

Inference on a delay initial function model
The JAK-STAT mechanism describes a series of reversible biochemical reactions of STAT-5 transcription factors initiated by binding of the Erythropoietin (Epo) hormone to cell surface receptors (Pellegrini and Dusanter-Fourt, 1997). After gene activation occurs within the nucleus, the transcription factors revert to their initial state, returning to the cytoplasm to be used in the next activation cycle. This last stage is not well understood and is replaced by the unknown time delay, $\tau > 0$. One model describes changes in the 4 reaction states of STAT-5 via the nonlinear delay initial function problem (12). The initial function components $\varphi^{(2)}(t) = \varphi^{(3)}(t) = \varphi^{(4)}(t)$ are identically zero, and the constant initial function $\varphi^{(1)}(t)$ is unknown. The states of this system cannot be measured directly, but are observed through a nonlinear transformation $H : \mathbb{R}^4 \to \mathbb{R}^4$ (13), parameterized by the unknown scaling factors $k_5$ and $k_6$. The indirect measurements of the states are assumed to be contaminated with additive zero mean Gaussian noise, $\{\varepsilon^{(j)}(t_{j,i})\}_{1 \le i \le T_j,\, 1 \le j \le 4}$, with experimentally determined standard deviations. Analysis is based on experimental data (Swameye et al., 2003; Raue et al., 2009). As per Raue et al. (2009), the forcing function $EpoR_A$ shares the same smoothness as the solution state (piecewise linear first derivative), as modelled by a GP interpolation.

Table 1: Priors on the states and parameters, assumed to be a priori independent.
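For orientation, the sketch below gives a common form of the JAK-STAT delay system after Swameye et al. (2003) and Raue et al. (2009), since the display of model (12) is not reproduced here; the authors' exact parameterization may differ, and the rate constants, delay, and initial function used are the inferred quantities of this section.

```matlab
% A common form of the JAK-STAT delay right-hand side (after Swameye et
% al., 2003; Raue et al., 2009); the exact form of model (12) may differ.
% u: the four states at time t; ulag: states at time t - tau;
% k = (k1,...,k4): rate parameters; EpoRA: interpolated forcing function.
function du = jakstat_rhs(t, u, ulag, k, EpoRA)
du = [ -k(1)*u(1)*EpoRA(t) + 2*k(4)*ulag(4);   % STAT-5 recycled after delay tau
        k(1)*u(1)*EpoRA(t) - k(2)*u(2)^2;      % phosphorylation
       -k(3)*u(3) + 0.5*k(2)*u(2)^2;           % dimerization
        k(3)*u(3) - k(4)*ulag(4) ];            % nuclear import, delayed export
end
% Example use with MATLAB's delay solver, for hypothetical tau and constant
% initial function phi:
%   sol = dde23(@(t,u,Z) jakstat_rhs(t,u,Z,k,EpoRA), tau, phi, [0 60]);
```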
These data have been used for calibrating the JAK-STAT pathway mechanism by a number of authors (e.g., Schmidl et al., 2013; Swameye et al., 2003; Raue et al., 2009; Campbell and Chkrebtii, 2013) under a variety of modelling assumptions and inferential approaches. The inaccuracy and computational constraints of the numerical techniques required to solve the system equations have led some authors to resort to coarse ODE approximations, or to employ semiparametric Generalized Smoothing methods. This motivates analysis of the structure and propagation of discretization error through the inverse problem. A further issue is that the model (12) and its variants may be misspecified, which must first be uncoupled from discretization effects for further study. Indeed, the analyses cited above conclude that the model is not flexible enough to fit the available data; however, it is not clear how much of this misfit is due to model discrepancy, and how much may be attributed to discretization error.
The probabilistic approach for forward simulation of delay initial function problems is described in Supplement B, extending the updating strategy to accommodate current and lagged state process realization information. The probabilistic solution is generated using an equally spaced discretization grid of size N = 500, and a covariance structure based on the uniform covariance kernel. Prior distributions are listed in Table 1. Markov chain Monte Carlo is used to obtain samples from the posterior distribution (3). A parallel tempering sampler (Geyer, 1991), described in Supplement B, efficiently traverses the parameter space of this multimodal posterior distribution.
A sample from the marginal posterior of the state trajectories and the corresponding observation process are shown in Figure 4. As expected, the posterior trajectories, which incorporate discretization uncertainty in the forward problem, appear more diffuse than those obtained in, e.g., Schmidl et al. (2013), while more flexibly describing the observed data, with the exception of a small number of outliers. Correlation plots and kernel density estimates with priors for the marginal posteriors of the rate parameters $k_1, k_2, k_3, k_4$, observation scaling parameters $k_5, k_6$, time lag $\tau$, initial function value $u^{(1)}(0)$, and hyperparameters for the solution uncertainty $\Psi = (\alpha, \lambda)$ are shown in Figure 5. All parameters, with the exception of the prior precision $\alpha$, are identified by the data, while the rate parameter $k_2$ appears to be only weakly identified. We observe strong correlations between parameters, consistent with previous studies of this system. For example, there is strong correlation among the scaling parameters $k_5, k_6$ and the initial first state $u^{(1)}(0)$. Importantly, there is a relationship between the length-scale $\lambda$ of the probabilistic solver and the first, third and fourth reaction rates. Furthermore, this relationship has a nonlinear structure, where the highest parameter density region seems to change with the length-scale, implying strong sensitivity to the solver specifications. The length-scale is the probabilistic analogue, under a bounded covariance, of the step number in a numerical method. However, in analyses based on numerical integration, the choice of a numerical technique effectively fixes this parameter at an arbitrary value that is chosen a priori. The result here suggests that for this problem, the inferred parameter values are highly and nonlinearly dependent on the choice of the numerical method used. Because we have explicitly quantified discretization uncertainty, the observed lack of fit suggests that additional model refinement may be required.
Figure 5: 50,000 posterior samples of the model parameters using a probabilistic solver with a grid of size 500, generated using the parallel tempering Algorithm 7 with ten chains. Prior probability densities are shown in black. Note the complex posterior correlation structure between model parameters and solver hyperparameters across the lower rows.

Applications
We extend probability modelling of solution uncertainty to a variety of forward problems: chaotic dynamical systems, an ill-conditioned mixed boundary value problem, a stiff, high-dimensional initial value problem, and extensions to PDEs.

Discretization uncertainty under chaotic dynamics
The classical "Lorenz63" initial value problem (Lorenz, 1963) is a three-state ODE model of convective fluid motion induced by a temperature difference between an upper and lower surface. The states $u^{(1)}$, $u^{(2)}$ and $u^{(3)}$ are proportional to the intensity of the fluid's convective motion, the temperature difference between (hot) rising and (cold) descending currents, and the deviation of the vertical temperature profile from linearity, respectively. The model describing the time-evolution of their dynamics,
$$u^{(1)}_t = \sigma\,(u^{(2)} - u^{(1)}), \qquad u^{(2)}_t = r\,u^{(1)} - u^{(2)} - u^{(1)}u^{(3)}, \qquad u^{(3)}_t = u^{(1)}u^{(2)} - b\,u^{(3)}, \qquad (14)$$
depends on the dimensionless parameters $\sigma$, $r$ and $b$. The standard choice of $\theta = (\sigma, r, b) = (10, 28, 8/3)$ with initial state $u^*(0) = (-12, -5, 38)$ falls within the chaotic regime. This system has a unique solution whose trajectory lies on a bounded region of the phase space (e.g., Robinson, 2001, pp. 271-272), and is unknown in closed form. A sample of 1000 trajectories generated from the probabilistic solution for this system is shown in Figure 2, given hyperparameters $N = 5001$, $\alpha = N$, $\lambda = 2h$, $h = 20/N$, a constant prior mean function, and a squared exponential covariance structure. There is a short time window within which there is negligible uncertainty in the solution, but the accumulation of discretization uncertainty results in rapidly diverging, yet highly structured errors. Figure 2 shows posterior samples in the 3 state dimensions (top row) and density estimates of the samples for $u^{(1)}(t)$ (bottom row) at 4 distinct evaluation times. The structured uncertainty in the forward model (2) is not simply defined by a widening of the high posterior density region, but instead exhibits behaviour that is consistent with the dynamics of the underlying stable attractor.
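The right-hand side of (14) is simple to encode; the sketch below traces a reference numerical trajectory in the chaotic regime, and the same handle can be passed to a probabilistic solver such as the Algorithm 1 sketch of Section 3.1 (the use of ode45 here is purely illustrative).

```matlab
% Lorenz63 right-hand side (14) in the chaotic regime; a sketch for
% plugging into a probabilistic (or deterministic) solver.
sigma = 10;  r = 28;  b = 8/3;
f = @(t, u) [ sigma*(u(2) - u(1));              % convective motion intensity
              r*u(1) - u(2) - u(1)*u(3);        % temperature difference
              u(1)*u(2) - b*u(3) ];             % vertical profile deviation
u0 = [-12; -5; 38];
[tt, U] = ode45(f, [0 20], u0);                 % reference numerical trajectory
plot3(U(:,1), U(:,2), U(:,3));
```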

Solution multiplicity in a mixed boundary value problem
Consider a special case of the Lane-Emden ODE mixed boundary value problem, which is used to model the density, $u$, of gaseous spherical objects, such as stars, as a function of radius, $t$ (Shampine, 2003). The second order system can be rewritten as a system of first order equations on the interval $[.5, 1]$ with mixed boundary values (15), where the notation $u(t = 1)$ and $v(t = .5)$ is used to emphasize that the boundary conditions are different for each state. The two leftmost panels of Figure 6 illustrate multimodality in the posterior solution, when multiple functions satisfy the mixed boundary value problem. These high posterior density regions concentrate as the number of grid points and the prior precision grow, suggesting the potential for defining intermediate target densities for sequential and particle Markov chain Monte Carlo schemes to sample realizations from the posterior distribution over unknown model parameters within the calibration problem. This example highlights the need for accurately modelling functional, rather than point-wise, solution uncertainty.
Mixed boundary value problems introduce challenges for many numerical solvers, which employ optimization and the theory of ODE initial value problems to estimate the unknown initial state, here $u(t = .5)$, subject to the state constraints at either boundary of the domain of integration, here $[.5, 1]$. The optimization over $u(t = .5)$ is performed until the corresponding initial value problem solution satisfies the known boundary condition, $u^*(t = 1)$, to within a user specified tolerance. Considering the mixed boundary value problem (15), the probabilistic solution consists of an inference problem over the unspecified initial value, $u(t = .5)$, given its known final value $u^*(t = 1)$. The likelihood naturally defines the mismatch between the known boundary value $u^*(t = 1)$ and the boundary value obtained from the realized probabilistic solution, as follows,
$$p\{u^*(t = 1) \mid u(t = .5), \theta, \Psi\} = \mathcal{N}\!\left[u^*(t = 1) \mid m^N_{(u)}(t = 1),\; C^N_{(u)}(t = 1, t = 1)\right], \qquad (16)$$
where $m^N_{(u)}(t = 1)$ and $C^N_{(u)}(t = 1, t = 1)$ are the posterior mean and covariance for the state $u$ at time $t = 1$, obtained via Algorithm 1. The posterior over the states is then proportional to the product of this boundary likelihood and the probabilistic solution given the proposed initial value (17). The possibility of multimodality suggests the use of an appropriate Markov chain Monte Carlo scheme, such as parallel tempering (Geyer, 1991), implemented in Supplement B, which can quickly identify and explore disjoint regions of high posterior probability.

Discretization uncertainty for a model of fluid dynamics
The following example illustrates that the probabilistic framework developed here may be reliably and straightforwardly applied to a high-dimensional dynamical system. The Navier-Stokes system is a fundamental model of fluid dynamics, incorporating the laws of conservation of mass, energy, and linear momentum for an incompressible fluid over a domain, given constraints imposed along the boundaries. It is an important component of complex models in oceanography, weather, atmospheric pollution, and glacier movement. Despite its extensive use, the dynamics of Navier-Stokes models are poorly understood even at small time scales, where they can exhibit turbulence.
The Navier-Stokes PDE model for the time evolution of the 2 components of the velocity, $u : D \to \mathbb{R}^2$, of an incompressible fluid on a torus, $D = [0, 2\pi) \times [0, 2\pi]$, can be expressed in spherical coordinates as,
$$u_t = \theta\,\Delta u - (u \cdot \nabla)\,u - \nabla p + f, \qquad \nabla \cdot u = 0, \qquad (18)$$
where $\Delta$ is the Laplacian operator and $\nabla$ is the gradient operator. The model is parameterized by the viscosity of the fluid, $\theta \in \mathbb{R}^+$, the pressure, $p$, and the external forcing, $f$. We will visualize the probabilistic solution of the Navier-Stokes system by reducing the two components of velocity to a one dimensional function: the local spinning motion of the incompressible fluid, called vorticity, which we define as $w = -\nabla \times u$, where $\nabla \times u$ represents the rotational curl (the cross product of $\nabla$ and $u$), with positive vorticity corresponding to clockwise rotation. We discretize the Navier-Stokes model (18) over a grid of size 128 in each spatial dimension. A pseudo-spectral projection in Fourier space yields 16,384 coupled, stiff ODEs with associated constraints (details are provided on the accompanying website). Figure 7 shows four forward simulated vorticity trajectories (along rows), obtained from the two components of velocity at four distinct time points (along columns). Differences in the state dynamics arise by the final time shown, where the four trajectories visibly diverge from one another. These differences describe the uncertainty resulting from discretizing the exact but unknown infinite dimensional solution.

Discretization uncertainty for partial differential equations
Similarly to the ODE case, we model prior beliefs about the fixed but unknown solution and partial derivatives of a PDE boundary value problem by defining a probability measure over multivariate trajectories, as well as first or higher order derivatives with respect to the spatial inputs. The prior structure will depend on the form of the PDE model under consideration. We illustrate discretization uncertainty modelling for PDEs by considering the parabolic heat equation, describing heat diffusion over time along a single spatial dimension implicitly via,
$$u_t(x, t) = \kappa\, u_{xx}(x, t), \qquad (19)$$
subject to initial and boundary constraints, with conductivity parameter $\kappa = 1$. Note that this PDE has the explicit form (1) with $A = \{u, u_t\}$ and $B = u_{xx}$.
One modelling choice for extending the prior covariance to a spatio-temporal domain is to assume separability, by defining covariances over time and space independently. Accordingly, we choose the temporal kernel $R_\lambda$ as in Section 2.1 and let $R_\nu : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a possibly different spatial kernel function with length-scale $\nu$, which we integrate once to obtain $Q_\nu(x_j, x_k) = \int_0^{x_j} R_\nu(z, x_k)\,dz$. The covariance structures between temporal and spatial evaluation points $[x_j, t_j]$ and $[x_k, t_k]$ are obtained from the process convolution formulation, with temporal and spatial prior precision parameters $\alpha$ and $\beta$ respectively, as products of the corresponding convolved temporal and spatial kernels, with cross-covariances defined analogously. The mean function and covariance specification uniquely determine the joint GP prior on the state and its partial derivatives, given hyperparameters $\Psi$, which include $\alpha, \beta, \lambda, \nu, m^0_{xx}, m^0_t, m^0$, at vectors of temporal and spatio-temporal locations $(x_j, t_j), (x_k, t_k), (x_l, t_l)$. When integrated covariances are available in closed form (e.g., Supplement D.4), these matrices can be computed analytically.
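A minimal sketch of the separability idea follows, assuming squared exponential kernels in both dimensions and a Kronecker-product construction over a regular space-time grid; the grid sizes and scalings are illustrative, not the authors' exact construction.

```matlab
% Separable spatio-temporal prior covariance as a Kronecker product of a
% temporal and a spatial covariance; a sketch assuming squared exponential
% kernels in both dimensions.
t = linspace(0, 0.25, 30)';   x = linspace(0, 1, 20)';
lambda = 2*(t(2) - t(1));     nu = 1.5*(x(2) - x(1));   % length-scales
alpha  = 1e4;                 beta = 1e4;               % prior precisions
Ct = (sqrt(pi)*lambda/alpha) * exp(-(t - t').^2 / (4*lambda^2));
Cx = (sqrt(pi)*nu/beta)      * exp(-(x - x').^2 / (4*nu^2));
C  = kron(Ct, Cx);            % 600 x 600 covariance over the space-time grid
```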
A generalization of the sequential updating scheme presented in Algorithm 1 is outlined in Supplement B. The forward in time, continuous in space (FTCS) sampling scheme takes steps forward in time, while smoothing over spatial model interrogations at each time step. The characterization of spatial uncertainty is critical for more complex models in cases where there are computational constraints limiting the number of system evaluations that may be performed. We illustrate its effect by performing posterior inference for the parameter $\kappa$ from data simulated from the exact solution with $\kappa = 1$ and zero mean Gaussian error with a standard deviation of 0.005. Figure 9 shows the posterior distribution over $\kappa$ obtained by using both a deterministic "forward in time, centred in space" (FTCS) finite difference solver and a probabilistic solver under a variety of discretization grids. For the probabilistic solution, a squared exponential covariance structure was chosen over the spatial domain, with length-scale 1.5 times the spatial step size. For computational efficiency, a uniform covariance structure was chosen over the temporal domain, with length-scale 2 times the temporal step size. The prior precision was set to 10,000. Note the change in the scale as the mesh is refined and the posterior variance decreases. The use of a deterministic solver illustrates the problem of posterior under-coverage that may occur if discretization uncertainty is not taken into account and too coarse a grid is employed relative to the complexity of the model dynamics. If the discretization is not fine enough, the approximate posterior assigns negligible probability mass to the true value of $\kappa$, in contrast to the unbiased posterior obtained when employing the exact solution for inference. In this illustrative setting, the use of a probabilistic solver propagates discretization uncertainty in the solution through to the posterior distribution over the parameters. By using a probabilistic solver with the same coarsely discretized grid in this example, the posterior density is more diffuse but approximately centred on the true value of $\kappa$.
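For reference, the deterministic comparator is the classical explicit finite difference scheme; the sketch below uses an assumed initial condition and grid rather than the exact setup behind Figure 9, and checks the usual stability constraint $\kappa\,\Delta t/\Delta x^2 \le 1/2$ explicitly.

```matlab
% Deterministic "forward in time, centred in space" scheme for the heat
% equation u_t = kappa*u_xx with zero boundary values; a sketch with an
% assumed initial condition.
kappa = 1;  nx = 29;  x = linspace(0, 1, nx)';  dx = x(2) - x(1);
T = 0.25;  nt = 800;  dt = T/nt;
assert(kappa*dt/dx^2 <= 0.5);                % explicit-scheme stability condition
u = sin(pi*x);                               % assumed initial condition
for n = 1:nt
    u(2:end-1) = u(2:end-1) + (kappa*dt/dx^2) * diff(u, 2);  % centred Laplacian
end
plot(x, u, x, exp(-pi^2*kappa*T)*sin(pi*x), '--');  % compare with exact solution
```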

Discussion
This paper presents a Bayesian formalism for characterizing the structure of discretization uncertainty in the solution of differential equations. Discretization uncertainty for dynamical systems may be naturally described in terms of degrees of belief, an idea first formulated by Skilling (1991) and O'Hagan (1992). A Bayesian function estimation approach offers a way of defining and updating our beliefs about probable model trajectories conditional on interrogations of the differential equation model. The resulting uncertainty over the solution can then be propagated through the inferential procedure by defining an additional layer in the Bayesian posterior hierarchy. As such, the proposed formulation is a radical departure from the existing practice of constructing an approximate statistical error model for the data using numerical techniques.
The more general goal of studying other sources of model uncertainty requires first understanding how much is known about the dynamics of the proposed model and how that information changes with time, space, and parameter settings. Fundamentally, the probabilistic approach we advocate defines a trade-off between solution accuracy and the size of the inverse problem, by choice of the discretization grid size. This relationship has been of deep interest in the uncertainty quantification community (see, e.g., Kaipio et al., 2004; Arridge et al., 2006) and our approach offers a way of resolving the individual contributions to overall uncertainty from discretization, model misspecification (in the sense of Kennedy and O'Hagan, 2001), and Monte Carlo error.
The area of probabilistic numerics is an emerging field at the intersection of Statistics with Applied Mathematics, Numerical Analysis, and Computer Science, and we hope that this paper will provide a starting point for further work in this area. In particular, while the proposed approach is computationally competitive with first order numerical methods, it will be important to implement more efficient sampling algorithms analogous to those of the higher order deterministic solvers developed over a century of research into numerical methods. For example, a sampling scheme could be developed based on multiple model interrogations located between discretization grid points. Because many traditional higher order numerical solvers (e.g. explicit fourth order Runge-Kutta) use such a carefully constructed weighted average of model interrogations, our approach also provides a way to interpret numerical methods in the context of a general probabilistic framework (e.g., Schober et al., 2014). It is not surprising that a self-consistent Bayesian modelling approach has similarities to numerical methods whose form has been chosen for their convergence properties. From the perspective presented in this paper, rather than emulating numerical solvers, quadrature is automatically provided by the Gaussian process integration that occurs at each step of Algorithm 1. If the solution smoothness is correctly specified a priori via the covariance structure, then the Gaussian process integration step will naturally yield an appropriate quadrature rule.
Modelling uncertainty about an exact but unknown solution may be viewed as providing both an estimate of the exact solution and its associated functional error analysis. Although proportionally scalable, uncertainty quantification methods are naturally more expensive than deterministic numerical approximations without associated error analyses, making efficient sampling strategies for stochastic model calibration an important goal for future research.

As may be expected, low dimensional, stable systems without any topological restrictions on the solution may not suffer from significant discretization effects at a reasonably fine discretization grid. However, it is often difficult to determine a priori whether this stability will persist across the entire parameter space. An example is illustrated by the JAK-STAT case study, where the relationships between model parameters and solver specifications (reflecting changes in the system dynamics) may introduce bias in the estimation of parameters. If model uncertainty is indeed negligible and comparable over all parameter settings, sampled trajectories obtained via the probabilistic method will have little variability around the exact solution. If, however, the chosen discretization grid is too coarse with respect to the rate of change of the system dynamics over some subset of the input space, the uncertainty associated with the model error will be accounted for and propagated. For this reason, it may be prudent to adopt a probabilistic approach if one cannot be reasonably certain of negligible and comparable discretization error across all parameter settings.

Importantly, the probabilistic framework can be used to optimize computation. For a given discretization grid size, a natural question is how to arrange the grid on the domain in such a way that the resulting approximation is as close as possible to the exact solution in terms of some specified criterion. Most commercially available numerical differential equation solvers select the step length of the discretization grid sequentially. Controlling the local error by the choice of the distance between neighbouring grid locations is called adaptive step size selection. For example, in the simplest cases, this is accomplished by evaluating the local truncation error at each step and halving the step size if it exceeds an error tolerance that is pre-specified by the user (the process may be repeated several times per step until an appropriate local truncation error is achieved). In a similar way to local truncation error, a predictive probability model of discretization uncertainty may be used to guide sequential mesh refinement for complex dynamical systems. Optimizing a chosen design criterion would yield the desired discretization grid, as suggested by Skilling (1991). Some initial work in this direction has been conducted in Chkrebtii (2013), where it is shown that the natural Kullback-Leibler divergence criterion may be successfully used to adaptively choose the discretization grid in a probabilistic manner.
Prior specification may be used to rule out physically inconsistent trajectories that may arise from a numerical approximation. In numerical analysis, specialized techniques are developed and shown to satisfy certain properties or constraints for a given class of differential equations. A probabilistic framework would permit imposing such constraints in a more flexible and general way. Smoothness constraints on the solution may be incorporated through the prior, and anisotropy in the trajectory may be incorporated at each step through the covariance and the error model. Although in such general cases it may no longer be possible to obtain closed form representations of the sequential prior updates, an additional layer of Monte Carlo sampling may be used to marginalize over the realizations required to interrogate the differential equation model.
These developments lie at the frontier of research in uncertainty quantification, dealing with massive nonlinear models, complex or even chaotic dynamics, and strong spatial-geometric effects (e.g. subsurface flow models). Even solving such systems approximately is a problem that is still at the edge of current research in numerical analysis.
Inference and prediction for computer experiments (Sacks et al., 1989) is based on a model of the response surface, or emulator, constructed from numerical solutions of prohibitively expensive large-scale system models over a small subset of parameter regimes. Currently, numerical uncertainty is largely ignored when emulating a computer model, although it is informally incorporated through a covariance nugget (see, e.g., Gramacy and Lee, 2012) whose magnitude is often chosen heuristically. Adopting a probabilistic approach on a large scale will have practical implications in this area by permitting relaxation of the error-free assumption adopted when modelling computer code output, leading to more realistic and flexible emulators. For this reason, Bayesian calibration of stochastic emulators will be an important area of future research.
Because numerical approximations of chaotic systems are inherently deterministic, the qualitative dynamics of models such as Lorenz63 are often studied statistically by introducing artificial perturbations on the numerical trajectory. The rate of exponential growth (the Lyapunov exponent) between nearby trajectories can be estimated via Monte Carlo by introducing small changes to the numerical trajectory over a grid defined along the domain. Our proposed approach explicitly models discretization uncertainty and therefore offers an alternative to such artificial perturbations, opening new avenues for exploring uncertainty in a system's qualitative features.

Figure 2: 1000 draws from (2) for the Lorenz63 model with fixed initial states and model parameters in the chaotic regime. Red vertical lines in $u^{(1)}(t)$ correspond to the four time points used in the diagrams in the last two rows.

Figure 3: Five draws (solid lines) from the marginal densities of (2) over the state (first row) and its derivative (second row) for ODE (6) after n = 0, 12, 24, 36, 48 iterations (from left to right) of Algorithm 1 with a discretization grid of size N = 50; step-ahead samples $u^n$ and $f_n$ are shown as circles; the exact solution is shown as a dotted red line.

Figure 4: Sample paths (grey lines) of the observation processes (first row) obtained by transforming a sample from the marginal posterior distribution of the states (second row) by the observation function (13); experimental data (red points) and error bars showing two standard deviations of the measurement error.

Figure 7: Time evolution of four forward simulated realizations (along rows) of fluid vorticity, governed by the forced Navier-Stokes model (18), over two spatial dimensions: the angle of the inner ring (horizontal axis) and outer ring (vertical axis) of a two dimensional torus. Angles are expressed in radians. Vorticities are evaluated at times t = (0.2, 0.4, 0.6, 0.8) units (along columns).

Figure 8: Uncertainty in the solution of the heat equation discretized using two spatio-temporal grids, in blue and red respectively: 15 × 50 and 29 × 100. Spatial posterior predictions are shown at t = (0.02, 0.12, 0.22). The exact solution is shown in green; error bars show the empirical mean and 2 standard deviations computed from 50 simulations.

Figure 9: Posterior densities for the conductivity parameter in the heat equation given data simulated over an 8 × 25 spatio-temporal grid from the exact solution with κ = 1. Inference is based on the probabilistic differential equation solver using three grid sizes (blue). The posterior density based on a deterministic FTCS numerical scheme (red) and the posterior density based on the exact solution (red) are also provided.