Bayesian estimation of discretely observed multi-dimensional diffusion processes using guided proposals

Estimation of parameters of a diffusion based on discrete-time observations poses a difficult problem due to the lack of a closed-form expression for the likelihood. From a Bayesian computational perspective it can be cast as a missing data problem in which the diffusion bridges between discrete-time observations are missing. The computational problem can then be dealt with using a Markov chain Monte Carlo method known as data-augmentation. If unknown parameters appear in the diffusion coefficient, a direct implementation of data-augmentation results in a Markov chain that is reducible. Furthermore, data-augmentation requires efficient sampling of diffusion bridges, which can be difficult, especially in the multidimensional case. We present a general framework to deal with these problems that does not rely on discretisation. The construction generalises previous approaches and sheds light on the assumptions necessary to make these approaches work. We define a random-walk type Metropolis-Hastings sampler for updating diffusion bridges. Our methods are illustrated using guided proposals for sampling diffusion bridges. These are Markov processes obtained by adding a guiding term to the drift of the diffusion. We give general guidelines on the construction of these proposals and introduce a time change and scaling of the guided proposal that reduce discretisation error. Numerical examples demonstrate the performance of our methods.


Introduction
In this article we discuss a novel approach for estimating an unknown parameter θ ∈ Θ of the drift and the diffusion coefficient of a diffusion process dX t = b θ (t, X t ) dt + σ θ (t, X t ) dW t , X 0 = u (1.1) which is observed discretely in time. Here b θ : R × R d → R d denotes the drift function, a θ = σ θ σ θ ′ is the diffusion function, where σ θ : R × R d → R d×d , and W is a d-dimensional Wiener process. The observation times will be denoted by t 0 = 0 < t 1 < · · · < t n = T and the corresponding observations by x i = X ti . Estimation of θ in this setting has attracted much attention during the past decade. Here we restrict attention to estimation within the Bayesian paradigm. From a theoretical perspective, results on posterior consistency have been proved in Van der Meulen and Van Zanten (2013) and Gugushvili and Spreij (2012). The associated computational problem is the object of study here. Two review articles that include many references on this topic are Van Zanten (2013) and Sørensen (2004).
The main difficulty in estimation for discretely observed diffusion processes is the lack of a closed form expression for transition densities, making the likelihood intractable. If the diffusion path is observed continuously, then estimation becomes easier, as for a fully observed diffusion path the likelihood is available in closed form (and parameters appearing in the diffusion coefficient can be determined from the quadratic variation of the process). This naturally suggests studying the computational problem within a missing data framework, treating the unobserved path segments between two successive observation times as missing data. This setup dates back to at least Pedersen (1995), who used it to obtain simulated maximum likelihood estimates for θ. Within the Bayesian computational problem, the resulting Markov chain Monte Carlo algorithm is known as data-augmentation and was introduced in this context by Eraker (2001), Elerian et al. (2001) and Roberts and Stramer (2001). This algorithm is a special form of the Metropolis-Hastings (MH) algorithm which iterates the following steps: 1. draw missing segments, conditional on θ and the observed discrete time data; 2. draw from the distribution of θ, conditional on the "full data".
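To fix ideas, the two-step iteration can be sketched for a hypothetical scalar model dX t = −θ X t dt + dW t with a single missing segment between two observations. Everything below is Euler-discretised, and the Brownian-bridge proposal and flat prior are illustrative choices for this sketch, not the methods developed in this paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_psi(path, dt, theta):
    """Girsanov log-likelihood ratio of a discretised path: drift b(x) = -theta*x vs zero drift."""
    x, dx = path[:-1], np.diff(path)
    b = -theta * x
    return np.sum(b * dx - 0.5 * b ** 2 * dt)

def brownian_bridge(u, v, T, n, rng):
    """Interpolating proposal for the missing segment: a Brownian bridge from u to v."""
    dt = T / n
    w = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n))])
    t = np.linspace(0.0, T, n + 1)
    return u + w - (t / T) * (w[-1] - (v - u))

u, v, T, n = 0.0, 0.5, 1.0, 100
dt = T / n
theta, path = 1.0, brownian_bridge(u, v, T, n, rng)
thetas = []
for _ in range(200):
    # 1. update the missing segment given theta (independence MH; the Brownian-bridge
    #    proposal density cancels, leaving the Girsanov ratio as the acceptance ratio)
    prop = brownian_bridge(u, v, T, n, rng)
    if np.log(rng.uniform()) < log_psi(prop, dt, theta) - log_psi(path, dt, theta):
        path = prop
    # 2. update theta given the "full data" (random-walk MH with a flat prior)
    theta_prop = theta + 0.3 * rng.normal()
    if np.log(rng.uniform()) < log_psi(path, dt, theta_prop) - log_psi(path, dt, theta):
        theta = theta_prop
    thetas.append(theta)
```

Since the diffusion coefficient here is constant and known, this naive scheme does not yet exhibit the degeneracy problem discussed below.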
Here, by "full data" we mean the path formed by the drawn segments joined at the observation times. The algorithm can be initialised by either interpolating the discrete time data or choosing an initial value for θ. Since in both steps it is usually not possible to draw directly from the distribution of interest, each step can be replaced by generating a proposal which is then accepted with the probability defined by the MH-algorithm. There are two challenges with this approach: Challenge 1: generating "good" proposals for the missing segments. The problem of simulating diffusion bridges has received a lot of attention in the past. Vastly different techniques have been proposed, including (i) single site Gibbs updating of the missing segments locally on a discrete grid (Eraker (2001)), (ii) independent Metropolis-Hastings steps using as a proposal a Laplace approximation to the conditional distribution obtained by Euler approximation (Elerian et al. (2001)), (iii) forward simulated processes derived from representations of the Brownian bridge in discrete time (Durham and Gallant (2002)), (iv) coupling arguments (Bladt and Sørensen (2012)), (v) a constrained sequential Monte Carlo algorithm with a resampling scheme guided by backward pilots (Lin et al. (2010)), and (vi) exact simulation (Beskos et al. (2006)). Delyon and Hu (2006) extended the work of Durham and Gallant to a continuous time setup and derived an innovative proposal process taking the drift of the target diffusion into account. When the diffusion coefficient is constant, this proposal appeared earlier in Clark (1990). The basic idea consists of superimposing an additional term on the drift of the unconditioned diffusion to guide the process towards the endpoint. Such proposals are termed guided proposals. Proposals of this type have the drawback of a possible mismatch between the drift and the guiding term, which prevents their use in routine practice.
Instead a variant of the proposal without the target drift, also due to Delyon and Hu (2006), is used. When discretised, the latter is closely related to the Modified Brownian Bridge proposed in Durham and Gallant (2002). In Schauer et al. (2013) a general class of proposal processes for simulating diffusion bridges was introduced. The proposals in Schauer et al. (2013) do take the drift of the target diffusion into account, but in a way different from Delyon and Hu (2006). As a result, these proposals substantially reduce the mismatch between drift and guiding term, because they allow for more flexibility in choosing an appropriate guiding term to pull the process towards the endpoint in the right manner. An example of the advantage of this approach is given in the introduction of Schauer et al. (2013). Further, this method only requires forward simulation and also applies to diffusion processes that cannot be transformed to have unit diffusion coefficient.
This comes with another challenge: in any implementation, all proposals ultimately have to be evaluated on a finite number of grid points. The pulling term added to the drift of guided proposals has a singularity near the endpoint, which renders direct Euler discretisation inaccurate. Furthermore, integrals that appear in the acceptance probability of bridges potentially suffer from this problem as well.
Challenge 2: handling unknown parameters appearing in the diffusion coefficient. As pointed out by Roberts and Stramer (2001), the data-augmentation algorithm degenerates if θ appears in the diffusion coefficient, as the quadratic variation ∫ 0 T a θ (t, X t ) dt of the full data forces the conditional distribution of the next iterate of θ to be degenerate at the current value. Hence, iterates of θ remain stuck at their initial value. From a second perspective, diffusion bridges corresponding to different θ have mutually singular measures. For illuminating figures that illustrate this problem, we refer to Roberts and Stramer (2001). This difficulty can be overcome when the laws of the bridge proposals can be understood as parametrised push-forwards of the law of an underlying random process common to all models with different parameters θ. This is naturally the case for proposals defined as solutions of stochastic differential equations, where the driving Brownian motion can be taken as the underlying random process. If X denotes a missing segment given that the parameter equals θ, the main idea consists of finding a map g and a process Z such that X = g(θ, Z ). The process Z will be called the "innovation process", as was done in related approaches (cf. Chib et al. (2004) and Golightly and Wilkinson (2010)). Defining g and Z appropriately is rather subtle and we postpone a detailed discussion to Sections 2 and 4. For this to be successful, the tight dependence between θ and X must be decoupled. In a more general set-up, decouplings of similar form are discussed under the keyword non-centred parameterisation (Papaspiliopoulos et al. (2003)). It turns out that likelihood ratios between diffusion bridges with different parameters driven by the same innovations Z can be defined and used to update the diffusion parameter. In an elementary form, where σ θ (t, x) = θ, this technique was used by Roberts and Stramer (2001). Chib et al. (2004) and Golightly and Wilkinson (2010) extended this technique to general diffusions in a discretised setting. A construction for time homogeneous diffusions based on Delyon and Hu (2006) was explored by Fuchs (2013) (in particular Section 7.4). Papaspiliopoulos et al. (2013) take an approach in which the missing data is initially considered in continuous time using Delyon and Hu (2006) bridge proposals, but the degeneracy problem is tackled after discretisation.

Contribution
In this article we show how the guided proposals introduced in Schauer et al. (2013) address the outlined challenges and can be used to define an efficient MCMC procedure for generating draws from the posterior distribution of θ given discrete time data. The procedure can be seen as an extension and unification of previous approaches within a continuous time framework. Specific features of our approach include:
• We use in each data-augmentation step "adapted" bridge proposals which take the drift and the value of θ at that particular iteration into account. Hence, at each iteration, the pulling term depends on θ, a feature which is unavailable using proposals as in Delyon and Hu (2006). Especially in the multivariate case, the additional freedom in devising good proposals is crucial for obtaining a feasible MCMC procedure. The possibility of exploiting special features of the drift function to achieve high acceptance rates makes this approach interesting for practitioners. This is illustrated with a practical example in Section 7.2. This example concerns a diffusion approximation of a chemical reaction network, and it turns out that our methods are often well suited for estimation of diffusions obtained in this way.
• The algorithm does not suffer from the degeneracy problem when unknown parameters appear in the diffusion coefficient, not even in the continuous time setup.
• The innovation process is defined using the proposal process. As a result, in our algorithm (cf. Algorithm 1), somewhat surprisingly, the innovations never actually need to be computed. This implies that our method can also cope with the case where σ is not a square matrix.
• Though we derive all our results in a continuous time setup, for implementation purposes integrals in likelihood ratios and solutions to stochastic differential equations need to be approximated on a finite grid. As the drift of our proposal bridges has a singularity near its endpoint, we introduce a time change and space scaling of the proposal process that allows for numerically accurate discretisation and evaluation of the likelihood.

Outline
In Section 2 we clarify the aforementioned difficulties in a toy example. Here, we set some general notation, introduce some key ideas used throughout and further motivate the advantage of using guided proposals. In Section 3 we recap some key results from Schauer et al. (2013) on guided proposals. The core of this paper is Section 4 where we precisely state our algorithm. In Section 5 we relate our approach to earlier work by Golightly and Wilkinson (2010), Fuchs (2013) and Papaspiliopoulos et al. (2013).
In Section 6 we introduce a time-change and space-scaling of the proposal process to deal with the aforementioned numerical discretisation issues. Next, in Section 7 we discuss two examples. In the second example we estimate the parameters in a prokaryotic auto-regulation network example of Golightly and Wilkinson (2010). The appendices contain a few postponed proofs and some implementation details.

A toy problem
In this section we consider a toy example used to illustrate some key ideas for solving the aforementioned problems with a simple data-augmentation algorithm. The type of reparameterisation introduced shortly is not new, and has appeared for example in Roberts and Stramer (2001). The goal here is to introduce the main idea and point out some of its potential shortcomings in more complex problems. Furthermore, later on we will deal with more difficult cases and this toy example allows us to sequentially build up an appropriate framework for that. We consider the diffusion process where b is a known drift function and θ ∈ Θ an unknown scaling parameter. We assume θ is equipped with a prior distribution π 0 (θ) and only one observation X T = v at time T is available. We aim to draw from the posterior π(θ | X T ). The diffusion process conditioned on X T = v is a diffusion process itself. Denote by X the conditioned diffusion path (X t , t ∈ (0, T )) (conditional on X T = v). Suppose we wish to iterate a data-augmentation algorithm and the current iterate is given by (X , θ). Updating X : For almost all choices of b, there is no direct way of simulating X . Instead, one can first generate a proposal bridge X • and accept it with the MH-acceptance probability. As an easily tractable example we choose to take where Z • is a standard Brownian bridge on [0, T ] (i.e. Z • (0) = 0 and Z • (T ) = 0). This is the bridge process corresponding to the unconditioned diffusion process dX t = θ dW t , X 0 = u. Denoting the laws of X and X • by P θ and P • θ respectively, we have Here, p and p̃ denote the transition densities of the processes X and X̃ respectively. Absolute continuity is a consequence of Girsanov's theorem applied to the unconditioned processes and the abstract Bayes formula. Now the MH-step consists of generating a proposal X • and accepting it with probability 1 ∧ Ψ θ (X • )/Ψ θ (X ) (the ratio of transition densities just acts as a proportionality constant here).
Updating θ: As explained in the introduction, taking the missing segment as missing data renders the Metropolis-Hastings algorithm reducible. To deal with this problem, note that Define the process Z by the relation This construction can be viewed as follows: first, choose a process Z • whose law does not depend on θ (this is crucial, as we will shortly explain); here, we have chosen Z • to be a standard Brownian bridge on (0, T ). Second, setting X • = g(θ, Z • ) defines the mapping g. Third, equation (2.3) defines the process Z . Now that Z is defined, rather than drawing from the distribution of θ conditional on (X 0 = u, X T = v, X ) we will sample from the distribution of θ conditional on (X 0 = u, X T = v, Z ). This means that we augment the discrete time observations with Z instead of X . Denote the laws of Z and Z • by Q θ and Q • respectively. Suppose the current iterate is (θ, Z ), where Z can be extracted from θ and X by means of equation (2.3). The following diagram summarises the notation introduced: For updating θ we propose a value θ • from some proposal distribution q(· | θ) and accept the proposal with probability min(1, A), where (2.5) Here, we have implicitly assumed that Q θ • and Q θ are equivalent, which is indeed the case as we have and thus equivalence results from the absolute continuity of P θ and P • θ . Being able to relate the likelihood ratio of Q θ • and Q θ in this way is possible because Q • does not depend on θ. By equation (2.2), we now get

Substituting this expression into equation (2.5) yields
and all terms containing the unknown transition density cancel. The efficiency of this algorithm is crucially determined by the choice of the transition kernel q and the proposal process X • . We focus on an appropriate choice of X • , though in Section 4.3 we give guidelines on an appropriate choice of q if the drift has a specific structure with respect to θ. In the toy example, X • is defined in equation (2.1). This is a convenient choice, as it easily yields absolute continuity of P θ and P • θ . Moreover, generating an exact realisation of X • at any set of points in [0, T ] is straightforward. Nevertheless, there are at least two disadvantages to this approach: 1. The proposal does not take the drift b into account. Depending on the precise form of b, this means that proposal bridges can look completely different from the bridge we wish to draw. Figures that illustrate this phenomenon in case dX t = sin(X t − θ) dt + θ dW t can be found in Lin et al. (2010) (Figure 4). This results in low MH-acceptance probabilities. 2. If the dispersion coefficient takes the more general form σ θ (t, X t ), then P θ and P • θ are singular. Taking X • to be the process defined by dX • t = σ θ (t, X • t ) dW t , conditioned on starting in u and ending in v at time T , may seem to solve this problem. Unfortunately this introduces another problem, as for almost all choices of σ θ it will be impossible to generate realisations of X • conditioned on hitting the endpoint v at time T . Moreover, the transition densities p̃ will be completely intractable in this case.
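For this toy model all ingredients are explicit, so the proposal, the map g and the innovation Z can be sketched directly. The formulas below are the natural concrete forms suggested by the surrounding description of the elided equations (2.1) and (2.3), stated here only as an illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def standard_brownian_bridge(T, n, rng):
    """Standard Brownian bridge Z• on [0, T] with Z•(0) = Z•(T) = 0, on n+1 grid points."""
    dt = T / n
    w = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n))])
    t = np.linspace(0.0, T, n + 1)
    return t, w - (t / T) * w[-1]

def toy_proposal(u, v, theta, T, n, rng):
    """Bridge proposal for dX = theta dW from u to v: X• = g(theta, Z•) = u + (t/T)(v-u) + theta Z•."""
    t, z = standard_brownian_bridge(T, n, rng)
    return t, u + (t / T) * (v - u) + theta * z, z

def innovation(t, x, u, v, theta, T):
    """Invert the map g: recover the innovation Z from a path X, in the spirit of (2.3)."""
    return (x - u - (t / T) * (v - u)) / theta

t, x, z = toy_proposal(u=1.0, v=2.0, theta=0.5, T=1.0, n=200, rng=rng)
z_rec = innovation(t, x, u=1.0, v=2.0, theta=0.5, T=1.0)
```

Note that the law of Z• (and hence of z above) is free of θ, which is exactly what makes the θ-update non-degenerate.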

Guided proposals
A flexible class of proposal processes was developed and studied in Schauer et al. (2013). We will use this framework throughout and provide a recap of the relevant results in this section. For precise statements of these results we refer the reader to Schauer et al. (2013).
Our starting point is that under weak assumptions the target diffusion bridge X from u at time t = 0 to v at time t = T is characterised as the solution to the SDE and a(t, x) = σ(t, x)σ(t, x) ′. As the true transition density p(s, x; T, v) (this is the density of the process starting in x at time s, ending in v at time T ) of the diffusion process is generally intractable, we propose to replace it by the transition density of a diffusion process X̃ for which it is known in closed form. Denote the transition density of X̃ by p̃(s, x; T, v). Define the process X • as the solution of the SDE A process X • constructed in this way is referred to as a guided proposal (a guiding term is superimposed on the drift to ensure the process hits v at time T ). Denote the laws of X • and X (viewed as Borel measures on C([0, T ], R d )) by P • and P respectively. We reduce notation by writing p(s, x) for p(s, x; T, v). Define r̃ by where ∇ and ∆ denote the gradient and Laplacian with respect to x. In Schauer et al. (2013) sufficient conditions for absolute continuity of P • and P are established together with a closed form expression for the Radon-Nikodym derivative. It turns out that dP/dP • where D is given by and hence does not depend on any intractable objects. The class of linear processes, is a flexible class with known transition densities, and its induced guided proposals satisfy the conditions for absolute continuity derived in Schauer et al. (2013) under weak conditions on B̃, β̃ and σ̃. Proposal processes X • derived by choosing a linear process as in (3.3) will be referred to as linear guided proposals. One key requirement for absolute continuity of X and X • is that σ̃ is such that ã(T ) = (σ̃σ̃ ′)(T ) = a(T, v). Another requirement is the existence of an ε > 0 such that for all s ∈ [0, T ], x ∈ R d and y ∈ R d we have min(y ′ a(s, x)y, y ′ ã(s)y) ≥ ε‖y‖ 2 (uniform ellipticity). Therefore, a simple type of guided proposal is obtained upon choosing dX̃ t = β̃ dt + σ(T, v) dW t .
For this particular choice Depending on the precise form of b and σ it can nevertheless be advantageous to use guided proposals induced by a nonzero B̃. For later reference, we write and Ψ θ (X • ) in the case where b or σ depend on a parameter θ.
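A minimal 1D sketch of a linear guided proposal of this kind, assuming a constant-coefficient auxiliary process dX̃ t = β̃ dt + σ̃ dW t (so that p̃ is Gaussian and r̃(s, x) = ∇ x log p̃(s, x; T, v) is available in closed form) and a guided SDE of the form dX • = b dt + a r̃ dt + σ dW as in Schauer et al. (2013); the drift b below is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def r_tilde(s, x, v, T, beta, a_tilde):
    # For the constant-coefficient auxiliary process, p~(s, x; T, v) is the
    # N(x + beta (T - s), a_tilde (T - s)) density evaluated in v, so
    # grad_x log p~ has the simple closed form below.
    return (v - x - beta * (T - s)) / (a_tilde * (T - s))

def guided_proposal(b, sigma, u, v, T, beta, sigma_tilde, n, rng):
    """Euler scheme for the guided-proposal SDE dX• = b dt + a r~ dt + sigma dW."""
    dt = T / n
    a, a_tilde = sigma ** 2, sigma_tilde ** 2
    x = np.empty(n + 1)
    x[0] = u
    for i in range(n):
        s = i * dt
        drift = b(x[i]) + a * r_tilde(s, x[i], v, T, beta, a_tilde)
        x[i + 1] = x[i] + drift * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

# Matching a~(T) = a(T, v) here amounts to sigma_tilde = sigma (one of the stated requirements).
x = guided_proposal(b=lambda x: np.sin(x), sigma=0.3, u=0.0, v=1.0, T=1.0,
                    beta=0.0, sigma_tilde=0.3, n=1000, rng=rng)
```

Despite the singular pulling term, the naive Euler scheme above reaches a neighbourhood of v; the time change and scaling of Section 6 address the remaining discretisation error near T.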

Proposed MCMC algorithm
4.1. Innovation process Using the notation introduced in the previous section, the dynamics of the proposal process can be described by the stochastic differential equation Here r̃ is determined by β̃, B̃ and σ̃ appearing in equation (3.3). Although we conceptually follow the same approach as in Section 2, we now take Z • to be the driving Brownian motion instead of a Brownian bridge and set Z • = W . Define the mapping g by Such a mapping exists if X • is a strong solution to equation (4.1) (cf. Chapter V.10, p. 127 of Rogers and Williams (2000)). Sufficient conditions for the existence of a strong solution are detailed in Schauer et al. (2013). Define the process Z by Now it is easily verified that X = g(θ, Z ).
We refer to the process Z as the innovation process corresponding to X • (by analogy with the terminology of Golightly and Wilkinson (2010) and Chib et al. (2004)). From equations (4.2) and (4.3) it follows that X is related to Z just as X • is related to Z • . Note however that while Z • does not depend on θ, Z does. Denote the laws of X • , X , Z • and Z by P • θ , P θ , Q • , Q θ respectively and recall the diagram in (2.4). We have Here, the final equality follows from equation (3.2), which takes over the role of (2.2) in the toy example. If we had taken a different proposal process from (4.1), then the mapping g would change according to (4.2). At this point we want to stress that while different choices for the map g formally yield valid algorithms, updating θ to θ • given Z is difficult if g(θ • , Z ) does not resemble a draw from X conditional on θ • . Ideally, one would take g = g opt , where g opt is defined by the relation X = g opt (θ, W ). As this is not feasible, using g as derived from (•) instead of ( ) is a very reasonable choice, see Figure 1.
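The relation X • = g(θ, Z • ) and its inversion can be illustrated under the Euler scheme, where g becomes a triangular map of the Brownian increments. The drift and coefficients below are hypothetical 1D stand-ins, not the paper's general setting:

```python
import numpy as np

rng = np.random.default_rng(11)

SIGMA = 0.5  # illustrative constant dispersion coefficient

def circ_drift(s, x, theta, v, T):
    """Guided drift b + a r~ for the illustrative model b(x) = theta sin(x), constant sigma."""
    return theta * np.sin(x) + SIGMA ** 2 * (v - x) / (SIGMA ** 2 * (T - s))

def g(theta, dw, u, v, T):
    """Forward map: innovation increments -> proposal path, the Euler analogue of X• = g(theta, Z•)."""
    n = len(dw)
    dt = T / n
    x = np.empty(n + 1)
    x[0] = u
    for i in range(n):
        x[i + 1] = x[i] + circ_drift(i * dt, x[i], theta, v, T) * dt + SIGMA * dw[i]
    return x

def g_inverse(theta, x, u, v, T):
    """Recover the innovation increments from a path, the Euler analogue of (4.3)."""
    n = len(x) - 1
    dt = T / n
    s = np.arange(n) * dt
    return (np.diff(x) - circ_drift(s, x[:-1], theta, v, T) * dt) / SIGMA

n, T = 500, 1.0
dw = rng.normal(0.0, np.sqrt(T / n), n)      # increments of the driving Brownian motion Z•
x = g(0.8, dw, u=0.0, v=1.0, T=T)            # proposal path for theta = 0.8
dw_rec = g_inverse(0.8, x, u=0.0, v=1.0, T=T)  # innovations recovered from (theta, path)
```

The roundtrip g⁻¹(θ, g(θ, Z•)) = Z• is exact step by step, which is what allows the innovations to be held fixed while θ is updated.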

Algorithm
In this section we present an algorithm to sample from the posterior of θ given the discrete observations D = {X 0 = u, X t1 = x 1 , . . . , X tn = x n }. Denote the prior density on θ by π 0 . The idea is to define a Metropolis-Hastings sampler on (θ, Z ) instead of (θ, X ) where Z is the innovation process from the previous section. More precisely, we construct a Markov chain for (θ, (Z i ) 1≤i≤n ), where each Z i is an innovation process corresponding to the bridge X i connecting observation x i−1 to x i .
Step 2 constitutes a step of an MH-sampler with independent proposals. The expression for A 1 follows directly from equation (3.2). The expression for A 2 in step 3 follows in exactly the same way as equation (2.6) was established in the toy example (cf. Section 2).

Partially conjugate series prior for the drift
In this subsection we study the specific case for which where ϑ = (ϑ 1 , . . . , ϑ N ) is an unknown parameter and ϕ 1 , . . . , ϕ N are known functions on R d . We assume the diffusion coefficient is parametrised by the parameter γ.
We denote the vector of all unknown parameters by θ = (ϑ, γ) and assume these are assigned independent priors. With slight abuse of notation we use π 0 (ϑ) and π 0 (γ) to denote the priors on ϑ and γ respectively (the argument in parentheses will clarify which prior is meant). In this case it is convenient to choose a conjugate Gaussian prior for the coefficients, ϑ i ∼ N (0, ξ 2 i ) for positive scaling constants ξ i . Priors for the drift obtained by specifying a prior distribution on ϑ were previously considered in Van der Meulen et al. (2014). Upon completing the square, it follows that the distribution of ϑ conditional on γ and the full path Y of the diffusion is multivariate normal with mean vector W −1 γ µ γ and covariance matrix W −1 γ . We define for k, ℓ ∈ {1, . . . , d}, (For a vector x ∈ R n we denote the i-th element by x[i]. To emphasise the dependence on Y we sometimes also write µ γ (Y ), W γ (Y ) etc.) This leads to a natural adaptation of Algorithm 1 from Section 4.2.
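A discretised sketch of this conjugate update, with the integrals in µ γ and W γ replaced by Riemann/Itô sums on a grid; the 1D model, basis functions and prior scales below are illustrative choices, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(5)

def conjugate_posterior(y, dt, gamma, phis, xi):
    """Return (mean, cov) of theta | gamma, Y for drift sum_k theta_k phi_k(x),
    diffusion coefficient gamma, and independent priors theta_k ~ N(0, xi_k^2)."""
    x, dy = y[:-1], np.diff(y)
    Phi = np.stack([phi(x) for phi in phis])                      # N x (grid size - 1)
    W = (Phi * dt) @ Phi.T / gamma ** 2 + np.diag(1.0 / xi ** 2)  # posterior precision
    mu = Phi @ dy / gamma ** 2                                    # sum_j phi_k(x_j) dY_j / gamma^2
    cov = np.linalg.inv(W)
    return cov @ mu, cov

# Simulate a path with drift theta1 sin(x) + theta2 x and gamma = 0.5 (Euler scheme).
dt, n, gamma = 0.001, 5000, 0.5
theta_true = np.array([1.0, -0.5])
y = np.zeros(n + 1)
for i in range(n):
    drift = theta_true[0] * np.sin(y[i]) + theta_true[1] * y[i]
    y[i + 1] = y[i] + drift * dt + gamma * np.sqrt(dt) * rng.normal()

phis = [np.sin, lambda x: x]
mean, cov = conjugate_posterior(y, dt, gamma, phis, xi=np.array([10.0, 10.0]))
theta_draw = rng.multivariate_normal(mean, cov)   # one Gibbs draw of theta | gamma, Y
```

The precision matrix is the sum of a Gram matrix and the prior precision, so it is always positive definite and the Gibbs draw is well defined.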
Note that computation of Z • in step 3.2(c) requires invertibility of σ.
Proof. Suppose (ϑ, γ, Z ) ∼ π, where π denotes the posterior distribution. Consider the map f : (ϑ, γ, Z ) → (ϑ, γ, X ), where X = g((ϑ, γ), Z ). Here X solves ( ) by (4.3). We show that step 3.2 preserves π. The distribution of (ϑ, γ, X ) is the image measure of the posterior distribution π of the tuple (ϑ, γ, Z ) under f and coincides with the posterior distribution of (ϑ, γ, X ). Denote the image measure of π under f by π ◦ f −1 . In steps 3.2(a) and 3.2(b) we apply the mapping f , followed by a Gibbs step in which we draw ϑ • conditional on (γ, X ). The latter preserves π ◦ f −1 . Hence (ϑ • , γ, X ) ∼ π ◦ f −1 . In step 3.2(c) we compute (ϑ • , γ, Z • ) as the preimage of (ϑ • , γ, X ) under f (this is possible as we assume σ to be invertible). Hence A variation of this algorithm is obtained in case the drift is of the form specified in equation (4.5) and the diffusion coefficient depends on both ϑ and γ. In this case we can update γ just as in Algorithm 2. Updating ϑ can be done using a random walk type proposal of the form with α a positive tuning parameter. Motivated by the covariance matrix of the prior exploited in the case of partial conjugacy, we propose to replace V by W −1 ϑ . By this choice, if two components ϑ i and ϑ j are strongly correlated, the proposed local random walk proposals have the same correlation structure, which can improve mixing of the chain.
The following argument gives some guidance in the choice of α. If the target distribution is a d-dimensional Gaussian distribution N d (µ, Σ) and the proposal is of the form ϑ • ∼ N d (ϑ, α 2 Σ q ), then optimal choices for α and Σ q are given by Σ q = Σ and α = 2.38/ √ d (cf. Rosenthal (2011)). Hence, we will choose α = 2.38/ √ dim(ϑ), which corresponds to an average acceptance probability equal to 0.234.
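The resulting proposal can be written in a couple of lines; the matrix W_inv below is an arbitrary symmetric positive-definite stand-in for W ϑ −1:

```python
import numpy as np

def rw_proposal(theta, W_inv, rng):
    """Random-walk proposal theta• ~ N(theta, alpha^2 W_inv) with alpha = 2.38/sqrt(dim)."""
    alpha = 2.38 / np.sqrt(len(theta))
    return rng.multivariate_normal(theta, alpha ** 2 * W_inv)

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0, 2.0, 0.0])
W_inv = np.diag([1.0, 0.5, 2.0, 1.0])     # illustrative SPD covariance structure
alpha = 2.38 / np.sqrt(len(theta))        # = 1.19 for dim(theta) = 4
theta_new = rw_proposal(theta, W_inv, rng)
```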

Discussion on related work
In this section we point out similarities and differences between this paper and related approaches in Golightly and Wilkinson (2006), Golightly and Wilkinson (2010), Delyon and Hu (2006), Fuchs (2013) and Papaspiliopoulos et al. (2013). Delyon and Hu (2006) introduced proposals with dynamics governed by the SDE where λ ∈ {0, 1}. Sufficient conditions for absolute continuity in both cases are given in Delyon and Hu (2006). The pulling term (v − X • t )/(T − t) in the SDE forces X • to hit v at time T , and the case λ = 1 takes the drift of the process into account when proposing bridges. However, for both λ = 0 and λ = 1 the resulting bridges may not resemble true bridges.
Without loss of generality we may assume we have one observation X T = v. If λ = 0 in (• ), a slight adjustment of the Euler discretisation of the bridge (introduced by Durham and Gallant (2002)) sets where 0 = t 0 < t 1 < · · · < t N = T and δ i = t i − t i−1 (the adjustment being the addition of the factor (T − t i )/(T − t i−1 ) in front of σ). If δ := max 1≤i≤N δ i is small, the right hand side defines a discrete time Markov chain approximation X • i to X • ti , termed the Modified Brownian Bridge. We can define a mapping g by the relation This implies that parameters in the diffusion coefficient can be updated conditional on (∆W 1 , . . . , ∆W N ) instead of (∆X • 1 , . . . , ∆X • N ), thereby preventing degeneracy of the data-augmentation algorithm. This is the approach of Golightly and Wilkinson (2010). Chib et al. (2004) use the same algorithm with the Euler discretisation of X • in the case λ = 0 (so no correction of the diffusion coefficient appears in the discretisation). Note that there is no formal proof that these methods work as δ ↓ 0.
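A sketch of this discretisation (1D, constant σ; the square root on the damping factor is the usual Durham-Gallant form of the variance correction). Because the damping factor vanishes at the final step, the skeleton hits v exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

def modified_brownian_bridge(u, v, T, sigma, N, rng):
    """Modified Brownian Bridge skeleton: Euler step towards v with the diffusion
    coefficient damped by sqrt((T - t_i)/(T - t_{i-1}))."""
    t = np.linspace(0.0, T, N + 1)
    x = np.empty(N + 1)
    x[0] = u
    for i in range(1, N + 1):
        delta = t[i] - t[i - 1]
        drift = (v - x[i - 1]) / (T - t[i - 1])          # pulling term of the lambda = 0 SDE
        damp = np.sqrt((T - t[i]) / (T - t[i - 1]))      # the Durham-Gallant correction
        x[i] = x[i - 1] + drift * delta + damp * sigma * np.sqrt(delta) * rng.normal()
    return t, x

t, x = modified_brownian_bridge(u=0.0, v=1.0, T=1.0, sigma=0.4, N=100, rng=rng)
```

Holding the Gaussian increments fixed while varying σ gives exactly the map g used by Golightly and Wilkinson (2010) to avoid degeneracy.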
Fuchs (2013) explores how to use the driving Brownian motion in (• ) as innovation process within a continuous time setup. MH-acceptance probabilities are obtained using results in Delyon and Hu (2006), who provide expressions for the likelihood ratio of the laws of X and X • . However, the absolute continuity results of Delyon and Hu (2006) do not provide the proportionality constants in the derived likelihood ratio. For generating diffusion bridges using an MH-sampler these constants are irrelevant. However, as these constants do depend on θ, they cannot be neglected when updating the parameter.
We briefly review the derivation of this constant in the one-dimensional case for λ = 0 (the argument for λ = 1 is similar). Let t < T and denote the restrictions of the measures P and P • by P t and P • t respectively. By equation (4.1) in Schauer et al. (2013) and Girsanov's theorem, A careful argument by Delyon and Hu (2006) (also outlined in Papaspiliopoulos et al. (2013)) shows that this likelihood ratio can be rewritten as .
Here ϕ(x; µ, a) denotes the value of the normal density with mean µ and variance a, evaluated at x, and α is a constant given by This integral is obtained as the limit of sums in which the integrand is evaluated at the right endpoint of each time interval, as opposed to the left endpoint used in the definition of the Itô integral. Delyon and Hu (2006) provide sufficient conditions such that for solutions to (• ) there exists an almost surely finite random variable C such that This ensures that the second term in equation (5.1) is well defined in the limit t ↑ T . Intuitively, when t is close to T these terms cancel each other. This holds under the expectation; for completeness we give the result in Appendix A.2, as to the best of our knowledge a rigorous derivation of the constant appearing in the likelihood ratio for the Delyon and Hu (2006) type proposals is missing in the literature. In this derivation, we naturally assume the conditions stated in Section 4 of Delyon and Hu (2006). Papaspiliopoulos et al. (2013) extend the reparametrisation of the diffusion bridges in Roberts and Stramer (2001) (cf. Section 2) to reducible diffusions and otherwise decouple parameter and bridge by first discretising and then using the transformation formula in Euclidean coordinates for the discretised bridges. The use of the driving Brownian motion of the continuous time bridge proposal (• ) as innovation process was proposed by Fuchs (2013).

Time changing and scaling of linear guided proposals
Simulation of X • and numerical evaluation of Ψ as defined in (3.4) are numerically cumbersome, since the drift of X • and the integrand D explode for s near the endpoint T . As an example, suppose σ is constant and we take dX̃ t = σ dW t . Then we have In this section we explain how these numerical problems can be dealt with using a time change and scaling of the proposal process. For the particular example just given, Clark (1990) proposed to perform a time change and scaling of the proposal process to remove the singularities. Define τ C : . Then U C satisfies the stochastic differential equation which behaves as a mean-reverting Ornstein-Uhlenbeck process as s → ∞. Furthermore, (note that there are some minor typographical errors in Clark (1990)). Clearly, if b is bounded, this removes the singularity near T , but at the cost of having to deal with an infinite integration interval. For this reason, we propose a different time change and scaling.
The intuition behind our choice of the time change is as follows: as shown in Schauer et al. (2013) T . To make this intuitive idea more precise, we need a few more results from Schauer et al. (2013). If X̃ is a linear process (satisfying equation (3.3)), then and we define the process U by In the following, we denote the derivatives of v and τ by v̇ and τ̇ respectively.
Proposition 6.1. The time changed process U = (U s , s ∈ [0, T )) satisfies the stochastic differential equation If we simulate U on an equidistant grid, we can recover X • on a nonequidistant grid from equation (6.4). This implies X • is evaluated on an increasingly fine grid as s increases to T . In our implementation, all computations are done in the time-changed/scaled domain, and the mapping g is in fact defined by setting At first, it may seem that the SDE given by equation (6.5) has not resolved the singularity problem at time T . However, note that if b and σ are smooth, then for s ≈ T , In the following subsection we study discretisation of this SDE and show that no discretisation error is made at the final discretisation step near T . Finally, we give the time changed version of formula (3.2), which follows from the same elementary, if tedious, calculations as in the proof of Proposition 6.1:
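The time change itself is given in the display above. Purely to illustrate the grid-refinement effect, the sketch below uses the hypothetical choice τ(s) = s(2 − s/T), which shares the stated properties (τ(0) = 0, τ(T) = T, and an equidistant grid in the time-changed domain maps to a grid for X • that becomes increasingly fine near T):

```python
import numpy as np

def tau(s, T):
    """Illustrative time change: concave, increasing, tau(0) = 0, tau(T) = T."""
    return s * (2.0 - s / T)

T, m = 1.0, 10
s = np.linspace(0.0, T, m + 1)   # computations are done on this equidistant grid
t = tau(s, T)                    # X• is recovered on this nonequidistant grid
spacings = np.diff(t)            # grid spacing shrinks towards T
```

Since τ is a concave quadratic, the spacings of the t-grid decrease linearly, concentrating grid points exactly where the pulling term of the proposal is largest.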

Discretisation error
In this section we show in a simple setting the benefits of discretising U instead of X • .
To this end, we consider the diffusion dX_s = σ dW_s. In this case it is natural to take dX̃_s = σ dW_s, which yields proposals satisfying the stochastic differential equation dX•_s = ((v − X•_s)/(T − s)) ds + σ dW_s (note the resemblance to equation (6.7)). Of course, for this particular example it is trivial to obtain an exact realisation at a finite set of points in [0, T], but suppose for a moment we are not aware of this fact. There are at least two approaches for obtaining an approximate discrete skeleton of the process: 1. Apply Euler discretisation directly to the SDE for X •. 2. First transform X • to the process U defined in (6.3), apply Euler discretisation to that process, and subsequently transform the discretisation back to X •.
In this section we show that the second approach yields improved accuracy. In both cases we wish to obtain a skeleton at m − 1 (with m ≥ 2) time points in (0, T). Let h = T/m.

Approach 1. Direct application of Euler discretisation to X • gives that the law of X • at time s + h, conditional on X•_s = x, is approximated by a normal distribution whose mean coincides with that of the exact conditional law, but whose covariance σσ'h exceeds the exact conditional covariance σσ'h(T − s − h)/(T − s) by σσ'h²/(T − s). Note that the largest difference is attained for s = T − h.

Approach 2. Straightforward calculus gives the stochastic differential equation satisfied by U. If we denote the Euler discretisation applied to this equation by U, some tedious calculations reveal the law of the resulting approximation to X • at time τ(s + h), conditional on its value at τ(s). The true conditional law has exactly the same mean, while the covariances differ; this difference is zero for s = T − h and attains its maximum at s = 0.
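The one-step error of approach 1 can be checked directly for this constant-σ example: conditional on X•_s = x, both the Euler step and the exact bridge transition have the same mean, while the Euler variance σ²h exceeds the exact variance σ²h(T − s − h)/(T − s) by σ²h²/(T − s), which is maximal at the last step s = T − h. A short sketch (scalar case):

```python
import numpy as np

# One-step variance error of Euler applied directly to the bridge SDE
# dX = (v - X)/(T - s) ds + sigma dW (approach 1), evaluated on the grid
# s = 0, h, ..., T - h. Both one-step laws share the same conditional mean.
T, m, sigma = 1.0, 10, 1.0
h = T / m
s_grid = np.arange(m) * h

euler_var = sigma**2 * h * np.ones_like(s_grid)
exact_var = sigma**2 * h * (T - s_grid - h) / (T - s_grid)
gap = euler_var - exact_var            # equals sigma^2 h^2 / (T - s)

assert np.allclose(gap, sigma**2 * h**2 / (T - s_grid))
# The error grows towards T and peaks at s = T - h, where the exact
# conditional variance is zero but Euler still injects sigma^2 h of noise.
assert np.argmax(gap) == m - 1
assert np.isclose(gap[-1], sigma**2 * h)
```

This makes explicit why the final Euler step of approach 1 is the worst one, which is precisely the step that the time-changed scheme handles without error.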
Comparing both approaches, let R(i) denote the ratio of the errors of approach 2 and approach 1 at the i-th grid point. This ratio is always smaller than 1, and i → R(i) decreases roughly linearly to zero. This implies that in this example it is always advantageous to use the second approach: the closer we come to the endpoint T, the more we gain. As we apply the discretisation iteratively over all points, it is important that the ratio is uniformly smaller than 1. Further, from figure 2 it is apparent that the relative gain of approach 2 is larger for small values of m. Simulation results indicate that this advantage also holds more generally, though a proof is not yet available.

Example for one-dimensional diffusion
In this section we consider an example with a twofold goal: (i) to show that the proposed algorithm does not deteriorate when the number of imputed points is increased, and (ii) to show that the time change and scaling proposed in the previous section reduce the discretisation error. We consider the diffusion process driven by the SDE. Assume that we observe X at time points t = 0, 0.3, 0.6, . . . , T = 30 and wish to estimate (α, β, σ). As true values we took α = −2, β = 0 and σ = 0.75. To generate the discrete-time data we simulated the process on [0, T] at 400 001 equidistant time points using the Euler scheme and took a subsample; see figure 3. For α and β we chose a priori independent N(0, ξ²)-distributions with variance ξ² = 5. For log σ we used an uninformative flat prior. We applied algorithm 2 with random-walk proposals for q(σ• | σ). We initialised the sampler with α = −0.1, β = −0.1 and σ = 2, and varied the number of imputed points over m = 10, 100 and 1000.
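A random-walk update for σ on the log scale can be sketched as follows; the exact form of q(σ• | σ) from the text is not reproduced here, so the multiplicative Gaussian step and the toy likelihood below are illustrative assumptions. With a flat prior on log σ and a proposal that is symmetric in log σ, the acceptance ratio reduces to the likelihood ratio:

```python
import numpy as np

rng = np.random.default_rng(1)

def update_sigma(sigma, loglik, delta=0.1):
    """One Metropolis-Hastings step for sigma, random walk on log(sigma).
    `loglik` is a hypothetical stand-in for the data log-likelihood."""
    prop = sigma * np.exp(delta * rng.standard_normal())
    # Flat prior on log(sigma) + symmetric walk in log(sigma): no Jacobian
    # correction is needed beyond the log-likelihood difference.
    if np.log(rng.uniform()) < loglik(prop) - loglik(sigma):
        return prop, True
    return sigma, False

# Toy log-likelihood peaked at sigma = 0.75, the true value in the example.
toy = lambda s: -100.0 * (np.log(s) - np.log(0.75))**2
sigma = 2.0                            # same starting value as the sampler
for _ in range(2000):
    sigma, _ = update_sigma(sigma, toy)
assert 0.4 < sigma < 1.4               # chain moved from 2.0 towards 0.75
```

In the actual algorithm the log-likelihood difference is computed from the guided-proposal weights rather than from a closed-form likelihood.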
Figures 4 and 5 illustrate the results of running the MCMC chain for 10,000 iterations using m = 10, 100 and 1000 imputed points for each bridge (including endpoints), both with and without the time change. Two things stand out: firstly, increasing the number of imputed points m does not worsen the mixing of the chain; secondly, the bias is vastly reduced when discretising U (especially when m is small, as anticipated in section 6.1).

Model and data
In this section we apply our methods to an example presented in section 4 of Golightly and Wilkinson (2010) on prokaryotic auto-regulation. We briefly introduce the example and focus on estimation; the interested reader can consult Golightly and Wilkinson (2010) for more details and background. A gene is expressed into a protein P in several steps, including transcription of the DNA encoding the protein into mRNA by RNA-polymerase and synthesis of the protein in the ribosome as specified by the mRNA. In the auto-regulation model, pairs of the protein may reversibly form dimers P2, which bind to certain regions of the DNA, resulting in bound DNA·P2 which is not transcribed into RNA for P, thus providing an auto-regulation mechanism for the synthesis of this particular protein P. The total amount of bound and free DNA is constant, DNA + DNA·P2 = K, so the quantities of RNA, P, P2 and DNA describe the state of the system. These reactions can be modelled as a Markov jump process. While it is straightforward to simulate sample paths of the jump process with values (RNA, P, P2, DNA) ∈ N^4 exactly, estimation can be carried out by exploiting a diffusion approximation to this process. This approximating diffusion process with values in R^4 solves the Chemical Langevin Equation driven by an R^8-valued Brownian motion, where S is the stoichiometry matrix of the system describing the chemical reactions and h_θ is a function describing the hazard for a particular reaction to happen. Here • denotes the Hadamard (or entrywise) product of two vectors, and θ is the vector of the unknown rates of the 8 distinct reactions, each of which happens at a rate proportional to the amounts of the reactants involved. Note that this is a natural situation in which both drift and diffusion coefficient depend on the same parameters. Golightly and Wilkinson (2010) proceed from (7.1) to an SDE driven by a 4-dimensional Brownian motion with σ(x) = (S h_θ(x) S')^{1/2}.
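The drift S h_θ(x) and diffusion matrix a(x) = S diag(h_θ(x)) S' of the Chemical Langevin Equation can be sketched concretely. The stoichiometry matrix and hazards below follow the standard Golightly-Wilkinson formulation of the prokaryotic auto-regulation model; writing the reaction rates as θ_i² is an assumption made here to match the expression a_11(x) = θ_3² x_4 + θ_7² x_1 quoted in the text:

```python
import numpy as np

# State x = (RNA, P, P2, DNA); one column of S per reaction (8 reactions).
S = np.array([
    # R1  R2  R3  R4  R5  R6  R7  R8
    [ 0,  0,  1,  0,  0,  0, -1,  0],   # RNA
    [ 0,  0,  0,  1, -2,  2,  0, -1],   # P
    [-1,  1,  0,  0,  1, -1,  0,  0],   # P2
    [-1,  1,  0,  0,  0,  0,  0,  0],   # DNA
], dtype=float)

K = 10.0                                 # total amount of (bound + free) DNA

def hazard(theta, x):
    """Reaction hazards; rates theta_i^2 are an assumed parameterisation."""
    rna, p, p2, dna = x
    c = theta**2
    return np.array([
        c[0] * dna * p2,                 # DNA + P2 -> DNA.P2
        c[1] * (K - dna),                # DNA.P2 -> DNA + P2
        c[2] * dna,                      # transcription
        c[3] * rna,                      # translation
        c[4] * 0.5 * p * (p - 1),        # dimerisation 2P -> P2
        c[5] * p2,                       # dissociation P2 -> 2P
        c[6] * rna,                      # RNA degradation
        c[7] * p,                        # P degradation
    ])

theta = np.full(8, 0.3)
x = np.array([5.0, 8.0, 4.0, 6.0])
b = S @ hazard(theta, x)                 # drift of the CLE
a = S @ np.diag(hazard(theta, x)) @ S.T  # diffusion matrix a = sigma sigma'

# a_11 collects the hazards of the two reactions changing RNA (R3 and R7):
assert np.isclose(a[0, 0], theta[2]**2 * x[3] + theta[6]**2 * x[0])
```

Both b and a depend on the same θ, which is the situation described in the text where drift and diffusion coefficient share parameters.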
Prokaryotic auto-regulation: Quantities of (RNA, P, P 2 , DNA) at integer times (simulated from discrete model).
For inference, Golightly and Wilkinson (2010) use independent U[−5, 5] priors on log θ i and a Gaussian random walk proposal updating the components of θ jointly. They find that the few observations taken allow for remarkably good recovery of the true parameters.

Choice of guided proposals
We perform an analysis of the same data, this time with the guided innovation scheme using the time change from section 6, with (7.1) as model. We choose B̃ and β̃ to depend on θ (but not on time) so that B̃x + β̃ approximates b_θ(x). While it is possible to take different approximations specifically tailored to each bridge segment, it is computationally advantageous to work with a global approximation to b_θ. To this end, we replace h by a linear approximation h̃, which allows us to obtain B̃_θ and β̃_θ from the equation B̃_θ x + β̃_θ = S h̃_θ(x). Except for its first and fifth entries, x → h(x) is linear. Therefore, for i ∈ {2, 3, 4, 6, 7, 8} we set h̃_i = h_i. For i = 1, we take h̃_1 = c_1 + u_{1,3} x_3 + u_{1,4} x_4. Values for c_1, u_{1,3} and u_{1,4} are obtained from a weighted linear regression of x_3 x_4 on x_3 and x_4, with weights proportional to x_3 x_4. Similarly, for i = 5, we take h̃_5 = c_5 + u_{5,2} x_2. Values for c_5 and u_{5,2} are obtained from a weighted linear regression of ½ x_2 (x_2 − 1) on x_2, with weights proportional to x_2 (x_2 − 1).
We use a weighted regression of this form because, for a good proposal, the approximation error matters more when the corresponding dispersion component is small. For the data at hand, this automatic procedure gives u_{5,2} = 18, c_5 = −82, u_{1,3} = 5, u_{1,4} = 6 and c_1 = −32.
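The weighted linearisation of h_1 can be sketched as a weighted least-squares fit of x_3 x_4 on (1, x_3, x_4) with weights proportional to x_3 x_4. The grid of state values below is illustrative and not taken from the data, so the resulting coefficients differ from those reported above:

```python
import numpy as np

rng = np.random.default_rng(0)
x3 = rng.uniform(1.0, 10.0, size=500)    # P2 levels (illustrative range)
x4 = rng.uniform(1.0, 10.0, size=500)    # DNA levels (illustrative range)

y = x3 * x4                              # the nonlinear hazard component h_1
w = y                                    # weights proportional to x3 * x4
X = np.column_stack([np.ones_like(x3), x3, x4])

# Weighted least squares: minimise sum_i w_i * (y_i - X_i beta)^2 by
# rescaling rows with sqrt(w_i) and solving an ordinary LS problem.
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
c1, u13, u14 = beta
# h-tilde_1(x) = c1 + u13 * x3 + u14 * x4 then replaces h_1(x) = x3 * x4;
# as any local linearisation of a product of positive variables, the fit
# has positive slopes and a negative intercept.
```

The same recipe, with regressor x_2 and weights x_2(x_2 − 1), gives the linearisation of h_5.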

Rejection of nonpositive bridges
Contrary to the underlying jump process, the chemical Langevin approximation may leave the positive cone with positive probability, even before Euler discretisation (Szpruch and Higham, 2009/10). This is also the case in this example. To see this intuitively, note for example that a_{11}(x) = θ_3² x_4 + θ_7² x_1. Even if the first coordinate of X approaches 0 at time t, the diffusivity of X in that direction does not vanish, so that X_{t+h} < 0 with positive probability. As we are naturally not interested in values of θ which explain the data via bridges X violating the condition X ≥ 0, we resample proposals with X_s < 0 for some s. In this way we effectively draw from the conditional distribution of θ, given the discrete-time data and the event X_s ≥ 0 for all values of s that correspond to imputed points.
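The resampling step amounts to rejection sampling of the innovations: draw Z until the implied bridge skeleton is nonnegative at every imputed point. In the sketch below, `toy_bridge` is a hypothetical stand-in for the actual guided-proposal map g(θ, Z):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_nonnegative_bridge(bridge, max_tries=10_000):
    """Resample innovations until the bridge stays in the positive cone."""
    for _ in range(max_tries):
        z = rng.standard_normal(50)            # innovations for one segment
        x = bridge(z)
        if np.all(x >= 0.0):                   # accept nonnegative skeletons
            return x
    raise RuntimeError("acceptance probability too low")

def toy_bridge(z):
    """Toy stand-in for g: a Brownian bridge pinned at 2 on both endpoints."""
    w = np.concatenate([[0.0], np.cumsum(z)]) / np.sqrt(len(z))
    s = np.linspace(0.0, 1.0, len(w))
    return 2.0 + w - s * w[-1]

x = sample_nonnegative_bridge(toy_bridge)
assert np.all(x >= 0.0)
```

For segments whose acceptance probability is very low, this rejection step dominates the computational cost, which is one reason acceptance rates are reported below.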

Results from applying algorithm 1
As priors for log θ_i we use independent U[−7, 7] distributions. We used algorithm 1 with step 2(a) replaced by
• Sample a Wiener process Z•_i, conditional on g(θ, Z•_i) ≥ 0 (using rejection sampling),
and with step 3(a) adapted analogously. We took random-walk type proposals for θ, where, conditional on the current value θ, a new value θ• is sampled by a symmetric random-walk step on the log scale. We chose θ = (0.05, . . . , 0.05) as the first iterate. A run of 100,000 iterations using m = 20 and m = 50 imputed points (including endpoints) for each bridge yields the estimates for θ_1, . . . , θ_8 given in the upper panel of table 1.
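The θ-update can be sketched as follows; `loglik` is a hypothetical placeholder for the log-likelihood contribution computed from the bridges. With the flat U[−7, 7] prior on each log θ_i and a symmetric walk on the log scale, proposals leaving the prior support are rejected outright, and otherwise only the log-likelihood difference enters the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(3)

def update_theta(theta, loglik, delta=0.05):
    """Joint random-walk Metropolis-Hastings step for theta on the log scale."""
    log_prop = np.log(theta) + delta * rng.standard_normal(theta.shape)
    if np.any(np.abs(log_prop) > 7.0):     # outside the U[-7, 7] prior support
        return theta, False
    prop = np.exp(log_prop)
    if np.log(rng.uniform()) < loglik(prop) - loglik(theta):
        return prop, True
    return theta, False

theta = np.full(8, 0.05)                   # first iterate, as in the text
toy = lambda th: -np.sum((np.log(th) + 2.0)**2)   # toy target, mode at e^-2
for _ in range(500):
    theta, _ = update_theta(theta, toy)
# Every iterate stays positive and inside the prior support by construction.
assert np.all(theta > 0.0)
assert np.all(np.abs(np.log(theta)) <= 7.0)
```

Tuning `delta` controls the acceptance rate; the runs reported below target rates in the range suggested by standard random-walk scaling results.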
For m = 50, on average 76% of the sampled bridges for a segment were accepted in step 2(b), and the average acceptance probability for θ• in step 3 was 24%.

Results from applying algorithm 3
In this example one can clearly see from figure 8 that both the pair θ_1, θ_2 and the pair θ_4, θ_5 are highly correlated. This worsens the mixing of the Markov chain for these parameters.

Prokaryotic auto-regulation: Estimated parameters of the posterior distribution using algorithms 1 and 3.
The estimated autocorrelation time was computed using the R package coda.
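For readers not using R: coda's effectiveSize is based on a spectral estimate of the autocorrelation time. A simple Python analogue (an assumption, not coda's exact algorithm) truncates the empirical autocorrelation sum at the first nonpositive lag:

```python
import numpy as np

def iact(chain, max_lag=200):
    """Integrated autocorrelation time tau = 1 + 2 * sum_k rho_k, with the
    sum truncated at the first nonpositive estimated autocorrelation."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    v = np.dot(x, x) / len(x)
    tau = 1.0
    for k in range(1, max_lag):
        rho_k = np.dot(x[:-k], x[k:]) / (len(x) * v)
        if rho_k <= 0.0:
            break
        tau += 2.0 * rho_k
    return tau

# Sanity check on an AR(1) chain: for coefficient rho, the theoretical
# autocorrelation time is (1 + rho) / (1 - rho) = 9 when rho = 0.8.
rng = np.random.default_rng(4)
rho, n = 0.8, 20_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.standard_normal()
tau = iact(x)
assert 6.0 < tau < 12.0
```

Dividing the chain length by τ gives the effective sample size used to compare algorithms 1 and 3 in table 1.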

Since
we see that b is of the form (4.5). Therefore, we can apply algorithm 3. We make a slight adjustment to step 3.2'(c), analogous to the modification of algorithm 1 described above. We took the same priors as in subsection 7.2.4, and the scaling parameter of the random-walk proposal was chosen according to the remark after algorithm 3. A run of 100,000 iterations using m = 20 and m = 50 imputed points (including endpoints) for each bridge yields the estimates for θ_1, . . . , θ_8 given in the lower panel of table 1. The corresponding acceptance probability for the proposals for θ was 26% (for m = 50), which compares well with the anticipated value of 0.234.
From table 1 we see that application of algorithm 3 indeed greatly reduces the autocorrelation times relative to algorithm 1. Note that both J and v̇ do not depend on x. This implies that these functions need to be computed only once, on the grid of a single bridge, in each iteration. In case B̃ does not depend on θ, these functions can even be precomputed on a grid in advance of the MCMC algorithm. As computing matrix exponentials is expensive, this reduces computing time substantially.
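The precomputation argument can be sketched as a simple caching pattern: when B̃ is fixed, quantities built from matrix exponentials exp(B̃ s) can be tabulated once per grid point and reused across all MCMC iterations. For the sketch we evaluate exp(B̃ s) through an eigendecomposition of an illustrative symmetric matrix (scipy.linalg.expm would be the general-purpose choice):

```python
import numpy as np
from functools import lru_cache

B = np.array([[-1.0, 0.3],
              [0.3, -2.0]])             # illustrative theta-independent B-tilde
lam, Q = np.linalg.eigh(B)              # B = Q diag(lam) Q', B symmetric here

@lru_cache(maxsize=None)
def Phi(s):
    """exp(B s), cached so each grid time is computed only once."""
    return Q @ np.diag(np.exp(lam * s)) @ Q.T

# One-off tabulation over the bridge grid; later iterations reuse the cache.
grid = np.linspace(0.0, 1.0, 101)
table = {float(s): Phi(float(s)) for s in grid}

assert np.allclose(table[0.0], np.eye(2))        # exp(B * 0) = I
assert np.allclose(Phi(0.5) @ Phi(0.5), Phi(1.0))  # semigroup property
```

When B̃ does depend on θ, the cache must instead be rebuilt once per accepted θ-update, which still avoids recomputation across the imputed bridges within an iteration.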