Particle-based likelihood inference in partially observed diffusion processes using generalised Poisson estimators

This paper concerns the use of the expectation-maximisation (EM) algorithm for inference in partially observed diffusion processes. In this context, a well-known problem is that all but a few diffusion processes lack closed-form expressions for their transition densities. Thus, in order to estimate the EM intermediate quantity efficiently, we construct, using novel techniques for unbiased estimation of diffusion transition densities, a random weight fixed-lag auxiliary particle smoother, which avoids the well-known problem of particle trajectory degeneracy in the smoothing mode. The estimator is justified theoretically and demonstrated on a simulated example.


Introduction
In this paper we discuss the use of sequential Monte Carlo (SMC) methods (alternatively termed particle methods) for likelihood-based inference in partially observed diffusions (PODs). The proposed method relies on a novel approach for estimating transition densities of diffusion processes via so-called generalised Poisson estimators (GPEs). For the models under consideration, the likelihood function of the observed data cannot be expressed in closed form; however, since partially observed diffusion models are, like more general latent variable models, specified through conditional dependence relations, this inference problem can be cast efficiently into the framework of the expectation-maximisation (EM) algorithm proposed by Dempster et al. (1977). When applying the EM algorithm in the POD context there are two main difficulties: firstly, in all but a few cases, the transition density of the diffusion process, and thus the complete data log-likelihood function, lacks an analytic expression; secondly, computing the intermediate quantity of the expectation step involves taking expectations under the smoothing distribution, i.e. the conditional distribution of the hidden states at the observation time points given the observed data record, which is not available in closed form even when the transition density is known. These two issues make, as documented by several authors, maximum likelihood inference in PODs very challenging. In this paper we address these problems by applying the GPE suggested by Fearnhead et al. (2008) (as a refinement of results obtained in Beskos et al., 2006) in conjunction with SMC smoothing algorithms. Unfortunately, it has been observed by several authors that using standard SMC methods in the smoothing mode may be unreliable for larger observation sample sizes n, since systematic resampling of the particles leads to degeneracy of the particle paths. As a solution, we adapt the fixed-lag smoother proposed by  to the framework of PODs.
This technique relies, in the spirit of Kitagawa (1998), on forgetting properties of the conditional hidden chain; by this is meant that the hidden chain forgets its past when evolving, backwards as well as forwards, conditionally on the given observation sequence. The constructed algorithm efficiently avoids particle trajectory degeneracy at the cost of a bias, which can however be controlled by a suitable choice of the introduced lag parameter.
In order to obtain a high performance of the particle smoother it is in general necessary to propose (mutate) the particles according to a kernel that takes into account the information provided by the current observation; indeed, mutating the particles "blindly" according to the dynamics of the hidden Markov chain, as in the bootstrap particle filter, will often lead to severe degeneracy of the particle importance weights. However, such an improved proposal strategy is not straightforwardly adapted to PODs, since computing the resulting importance weights involves evaluating a ratio between the transition density of the hidden diffusion process (for which a closed-form expression is missing in general) and that of the chosen proposal kernel. To cope with this, we follow Fearnhead et al. (2008) and replace each evaluation of the hidden process transition density by a draw from the GPE. Thus, the GPE serves two purposes in our algorithm, as it is used, firstly, for computing unbiased estimates of particle importance weights for a particle filter based on a proposal kernel different from the transition kernel of the hidden diffusion process and, secondly, for estimating the EM intermediate quantity itself.
The contribution of our study is fourfold, since the proposed intermediate quantity estimator
1. approximates efficiently the expectation step in a single sweep of the data record, yielding an algorithm with a computational complexity of order O(nN);
2. copes efficiently with model nonlinearities, as it is not based on any Euler discretisation or linearisation technique;
3. has only limited computer data storage requirements, which is essential in, e.g., high-frequency applications where very long measurement sequences are sometimes considered;
4. is provided with a rigorous convergence result describing its convergence to the true intermediate quantity; this result is derived via a convergence result, obtained under minimal assumptions, for the GPE-based particle smoother.
For models exhibiting poor mixing properties, in which case we cannot expect a high performance of the fixed-lag smoother, we propose an alternative algorithm where the GPE is used in conjunction with the particle-based forward-filtering backward-smoothing procedure proposed by Godsill et al. (2004). This scheme, which relies on a decomposition of the smoothing measure that incorporates the so-called backward kernels (i.e. the transition kernels of the hidden Markov chain when evolving backwards in time and conditionally on the observations) of the model, avoids particle path degeneracy completely through an additional simulation pass in the time-reversed direction. Moreover, it does not suffer from the additional, model dependent bias of the fixed-lag smoother. However, these appealing properties are obtained at the cost of a significant increase of computational work, since the complexity of the scheme in question is quadratic in the number of particles.
The paper is organised as follows: In Section 2 we recall the concept of PODs and discuss likelihood-based inference in such models via data augmentation and the EM algorithm. GPEs are described in Section 2.1 and Appendix B, and Section 2.2 is devoted to SMC smoothing in general. In Sections 2.3 and 2.4 we introduce the fixed-lag smoother and the forward-filtering backward-simulation smoother, respectively; moreover, we discuss how these techniques can be adjusted to PODs using GPEs. A theoretical result describing the convergence of the fixed-lag-based estimator is found in Section 2.3.1, and in Section 3 we illustrate the method on partially observed log-growth and genetics diffusion models. Section 4 concludes the paper with some final remarks. Proofs are found in Appendix A.

Preliminaries
In the following we assume that all random variables are defined on a common probability space (Ω, F , P) and let E denote expectations associated with P.
Denoting by 𝟙 the indicator function and letting X be any random variable on (Ω, F), we will often make use of the short-hand notation E[X; A] def= E[X 𝟙_A] for A ∈ F. Let X def= (X_t)_{t≥0} be a continuous-time diffusion process taking values in some space (X, 𝒳), with X ⊆ R^{d_X}. More specifically, the dynamics of the process is governed by the stochastic differential equation

dX_t = µ(X_t, θ) dt + σ(X_t, θ) dW_t, (2.1)

where W is Brownian motion. We denote by W(x) the law of W given that W_0 = x and let (F_t)_{t≥0} be the filtration generated by W. The functions µ(·, θ) and σ(·, θ) are assumed to satisfy regularity conditions (locally Lipschitz with a linear growth bound) that guarantee a weakly unique, global solution of (2.1). We will consider a framework where the process X is only partially observed at discrete time points (t_k)_{k≥0} through the process Y def= (Y_k)_{k≥0} taking values in some measurable space (Y, 𝒴). The observations of Y are assumed to be, conditionally on the latent process X, independent and such that the conditional distribution G_θ of Y_k given X depends on X_{t_k} only. In the following we write, in order to simplify the notation, X_k instead of X_{t_k}. The dynamics of the diffusion as well as the measurement process depend on some unknown model parameter θ which is assumed to belong to some compact parameter space Θ ⊆ R^{d_θ}. Our main target is to estimate θ using the maximum likelihood method. For simplicity we assume that the observation time points are equally spaced and denote by Q_θ and χ the transition kernel and initial distribution, respectively, of the time homogeneous Markov chain (X_k)_{k≥0}. The family (Q_θ(x, ·); x ∈ X, θ ∈ Θ) is dominated by the Lebesgue measure λ with corresponding Radon-Nikodym derivatives (q_θ(x, ·); x ∈ X, θ ∈ Θ). Moreover, suppose that G_θ has a density function g_θ with respect to some measure µ on (Y, 𝒴) such that, for k ≥ 0, G_θ(X_k, A) = ∫_A g_θ(X_k, y) µ(dy), A ∈ 𝒴.

Given a record Y_{0:n} = (Y_0, Y_1, …, Y_n) of observations (similar vector notation will be used also for other quantities), a consistent estimate of the parameter θ is ideally formed by maximising the observed data log-likelihood function ℓ_n(θ; Y_{0:n}) def= log L_n(θ; Y_{0:n}). A problem with this approach is that we in general cannot compute L_n in closed form, since this involves the evaluation of a high-dimensional integral over a complicated integrand. Since the partially observed diffusion model above is, like more general latent variable models, specified through conditional dependence relations, the maximisation problem is facilitated significantly by working instead with the complete data log-likelihood function by means of the EM algorithm: Assume that we have at hand an initial estimate θ′ of the parameter vector. In the EM algorithm an improved estimate is obtained by computing and maximising the intermediate quantity Q(θ; ·) defined by

Q_n(θ; θ′) def= E_{θ′}[Σ_{k=0}^{n−1} log q_θ(X_k, X_{k+1}) | Y_{0:n}] + E_{θ′}[Σ_{k=0}^{n} log g_θ(X_k, Y_k) | Y_{0:n}]. (2.2)

Here we have written E_{θ′} to stress that the expectations are taken under the dynamics determined by the initial parameter θ′. Under weak assumptions, repeating recursively this procedure yields a sequence of parameter estimates that converges to a stationary point θ* of the observed data log-likelihood (Wu, 1983). As is clear from (2.2), computing Q_n requires the computation of expected values under the smoothing distribution, i.e. the distribution of the state sequence X_{0:n} conditionally on the observations Y_{0:n}, given by φ_n(A; θ) def= P_θ(X_{0:n} ∈ A | Y_{0:n}) for A ∈ 𝒳^{⊗(n+1)}. Of special interest is the filter distribution, i.e. the distribution of X_n conditionally on Y_{0:n}, given by the restriction φ_{n|n}(A) def= φ_n(X^n × A), A ∈ 𝒳, of the smoothing distribution to the last component. It is easily shown that the flow (φ_k)_{k=0}^∞ satisfies the well-known forward smoothing recursion (2.4), where A ∈ 𝒳^{⊗(k+2)}.
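To fix ideas, the EM iteration just described can be carried out entirely in closed form for a toy model in which both the E-step and the M-step are tractable. The following sketch, assuming a hypothetical scalar Gaussian model (not one of the diffusion models considered in this paper), alternates the two steps; since the marginal likelihood here has its maximum at the sample mean of the observations, the iterates can be seen to converge to it.

```python
import numpy as np

def em_gaussian(y, theta0, n_iter=50):
    """EM for the toy model X_k ~ N(theta, 1), Y_k = X_k + eps_k, eps_k ~ N(0, 1),
    with only Y observed.
    E-step: E[X_k | Y_k, theta'] = (Y_k + theta') / 2 (Gaussian posterior mean).
    M-step: maximising Q(theta; theta') = E[-sum_k (X_k - theta)^2 / 2 | Y, theta']
    over theta gives the mean of the posterior means."""
    theta = theta0
    for _ in range(n_iter):
        e_x = (y + theta) / 2.0      # E-step: posterior means under theta'
        theta = e_x.mean()           # M-step: argmax of the intermediate quantity
    return theta
```

Each iteration halves the distance to the maximum likelihood estimate mean(Y), illustrating the monotone ascent of EM in a case where no Monte Carlo approximation is needed.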
By introducing the (non-Markovian) transition kernel L_k, defined for x_k ∈ X and A ∈ 𝒳, we may rewrite the recursion (2.4) in a more compact form. Here the normalised (Markovian) kernel L_k(x_k, A; θ)/L_k(x_k, X; θ) is the so-called optimal kernel, describing the distribution of X_{k+1} given X_k = x_k and the new observation Y_{k+1}. In general, a closed-form solution of the recursion (2.4) is not available. A standard approach is thus to apply some SMC smoothing algorithm (described in Section 2.2) to approximate the expectations in (2.2). Unfortunately, both the SMC smoother itself and the intermediate quantity (2.2) call for the transition density q_θ, which is usually unknown except in a few special cases. Nevertheless, results obtained by Beskos et al. (2006) and Fearnhead et al. (2008) offer a method for estimating this density without bias. A full treatment of this technique, which is a key ingredient of the estimation technique proposed here, is beyond the scope of this paper; nevertheless, the main framework and assumptions are described briefly in the next section, and some further details can be found in Appendix B.
Under these conditions, the GPE approach developed by Fearnhead et al. (2008) makes it possible to generate random variables Ṽ_θ(x, x′) with E[Ṽ_θ(x, x′)] = q_θ(x, x′) for any (x, x′) ∈ X², i.e. Ṽ_θ(x, x′) estimates the transition density q_θ without any bias, for a large class of diffusions of type (2.6). A full description of GPEs is beyond the scope of this paper; however, their main features are discussed in Appendix B. In this paper we represent the GPE by a kernel P_θ in the sense that V_θ(x, x′) ∼ P_θ(x, x′, ·). Similarly, using the related exact algorithm developed by Beskos et al. (2006), it is possible to construct a kernel P̄_θ such that E[V̄_θ(x, x′)] = log q_θ(x, x′) for draws V̄_θ(x, x′) ∼ P̄_θ(x, x′, ·). Appealingly, it is in many cases (see Section 3 for examples) possible to construct P_θ and P̄_θ such that the functions θ ↦ V_θ(x, x′)(ω) and θ ↦ V̄_θ(x, x′)(ω) are continuous for any fixed outcome ω ∈ Ω, yielding unbiased estimates of q_θ and log q_θ for all θ ∈ Θ simultaneously. This useful property makes, as we will see, the GPE approach well suited to numerical (log-)likelihood function optimisation.

GPE-based particle smoothing
Since in this part we deal with the problem of sampling from φ_k(·; θ) for a given fixed parameter value, we will throughout this section suppress θ from the notation. To begin with, we assume that the transition kernel density q is known.
In order to describe precisely how SMC methods may be used for producing approximate solutions to the smoothing recursion (2.4), we suppose that we are given a weighted sample (ξ^i_{0:k|k}, ω^i_k)_{i=1}^N of particles and associated weights, each particle ξ^i_{0:k|k} = (ξ^i_{0|k}, …, ξ^i_{k|k}) being a random variable in X^{k+1}, approximating φ_k in the sense that

Σ_{i=1}^N (ω^i_k / Ω^N_k) f(ξ^i_{0:k|k}) ≈ φ_k(f), where Ω^N_k def= Σ_{ℓ=1}^N ω^ℓ_k,

for a large class of estimand functions f on X^{k+1}. Now, in order to form an updated particle sample approximating φ_{k+1} as a new observation Y_{k+1} becomes available, a natural approach is to replace φ_k in (2.5) by its particle approximation. This yields a mixture φ^N_{k+1} on 𝒳^{⊗(k+2)} (recall the notation δ_a for a Dirac mass located at a). Now, the aim is to simulate a new set of particles from φ^N_{k+1} and to repeat this recursively, obtaining particle samples approximating the smoothing distributions at all time steps. However, since we in general can neither simulate draws from the optimal kernel nor compute the mixture weights L_k(ξ^i_{k|k}, X), we apply importance sampling and draw new particles from an instrumental mixture distribution π^N_{k+1} built from proposal kernels R_k and positive numbers ψ^i_k referred to as adjustment multiplier weights. We will from now on assume that ψ^i_k = Ψ_k(ξ^i_{0:k|k}) for some nonnegative function Ψ_k: X^{k+1} → R_+ and that each kernel R_k has a density r_k with respect to λ. Simulating a particle ξ^i_{0:k+1|k+1} from π^N_{k+1} is easily done by, firstly, drawing an ancestor index J_i according to the probability distribution on {1, …, N} proportional to (ω^j_k ψ^j_k)_{j=1}^N and, secondly, extending the selected ancestor with a draw from the proposal kernel, i.e. letting ξ^i_{k+1|k+1} ∼ R_k(ξ^{J_i}_{k|k}, ·). After this, the drawn particle is assigned its importance weight ω^i_{k+1}. Finally, the weighted particle sample formed by the updated particles and weights is returned as an approximation of φ_{k+1}.
Moreover, since the filter distribution is the marginal of the smoothing distribution with respect to the last component, an estimate of φ_{k+1|k+1} is formed by the marginal sample (ξ^i_{k+1|k+1}, ω^i_{k+1})_{i=1}^N. Proposing and selecting the particles according to the dynamics of the latent process, i.e. letting R_k ≡ Q and Ψ_k ≡ 1 for all k so that no use is made of the information about the current state provided by the current observation, corresponds to the bootstrap particle filter proposed by Gordon et al. (1993).
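For concreteness, the bootstrap variant just described can be sketched in a few lines. The linear-Gaussian model below is purely illustrative (it is not one of the PODs considered in this paper); blind mutation through the state dynamics followed by local-likelihood weighting is exactly the R_k ≡ Q, Ψ_k ≡ 1 choice.

```python
import numpy as np

def bootstrap_filter(y, n_particles, rng):
    """Bootstrap particle filter (Gordon et al., 1993) for the toy model
    X_{k+1} = 0.9 X_k + V_k,  Y_k = X_k + W_k,  V_k, W_k ~ N(0, 1):
    particles are mutated blindly through the state dynamics and weighted
    by the local likelihood g(x, y) only.  Returns the filter mean estimates."""
    n = len(y)
    particles = rng.normal(size=n_particles)          # draw from the initial law
    means = np.empty(n)
    for k in range(n):
        # weight by the local likelihood (Gaussian observation density)
        logw = -0.5 * (y[k] - particles) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means[k] = np.sum(w * particles)              # self-normalised filter mean
        # multinomial selection followed by blind mutation
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = 0.9 * particles[idx] + rng.normal(size=n_particles)
    return means
```

With informative observations, replacing the blind mutation by an observation-aware proposal (the auxiliary construction above) reduces weight degeneracy; for PODs this is exactly where the GPE enters.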
The algorithm, which was developed gradually by, mainly, Handschin and Mayne (1969), Gordon et al. (1993), and Pitt and Shephard (1999), will be referred to as the auxiliary particle smoother (APS). In the setting of a partially observed diffusion process we do not have access to a closed-form expression of the transition density q, which is needed when evaluating the importance weight function Φ_{k+1}. However, the GPE makes it possible to estimate this density without bias via the kernel P. This yields the following algorithm, in the following referred to as the GPE-based particle smoother (GPEPS), in which each evaluation of q in the weighting operation (2.9) is replaced by the Monte Carlo estimate obtained by averaging α draws from P, yielding an estimated importance weight function. One iteration of the GPEPS is described in detail in the following scheme.
Algorithm 1 ( * One iteration of GPEPS * ) Algorithm 1 extends the random weight auxiliary particle filter proposed by Fearnhead et al. (2008) to the smoothing mode. Note that we have, in the scheme above, suppressed the dependence of the particles and the particle weights on α from the notation for clarity.
In the selection operation of Step (2), each particle index is drawn from the probability distribution formed by the adjusted weights (ω^j_k ψ^j_k)_{j=1}^N. Letting N_i denote the number of times that index i was drawn, the selection operation may alternatively be expressed as in (2.12). There are however many alternative ways of performing selection; e.g., one may use the schedule (2.13), where ⌊x⌋ denotes the integer part of a real number x and ⟨x⟩ def= x − ⌊x⌋ its fractional part. In this selection schedule, which was proposed by Liu and Chen (1995) under the name deterministic plus residual multinomial resampling, index i is first copied a deterministic number of times given by the integer part of its expected count, after which the remaining indices are drawn multinomially with probabilities proportional to the fractional parts. All theoretical results obtained in the following hold for both the selection schedules (2.12) and (2.13). In addition, our results are easily extended to selection schemes based on Poisson, binomial, and Bernoulli branching (see Douc and Moulines, 2008, for a theoretical analysis of these algorithms); however, since the number of drawn indices is random in this case, we omit these results for brevity.
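A direct implementation of the deterministic plus residual multinomial schedule may be sketched as follows (a generic sketch, not code from the paper):

```python
import numpy as np

def residual_multinomial_resample(weights, rng):
    """Deterministic plus residual multinomial resampling (Liu and Chen, 1995):
    index i is first copied floor(N * w_i) times deterministically; the
    remaining slots are then filled by multinomial draws with probabilities
    proportional to the fractional parts N * w_i - floor(N * w_i)."""
    n = len(weights)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    counts = np.floor(n * w).astype(int)               # deterministic copies
    residual = n * w - counts                          # fractional parts
    n_rest = n - counts.sum()                          # slots left to fill
    if n_rest > 0:
        extra = rng.choice(n, size=n_rest, p=residual / residual.sum())
        counts += np.bincount(extra, minlength=n)
    return np.repeat(np.arange(n), counts)             # resampled ancestor indices
```

Compared with plain multinomial selection, the deterministic copies remove part of the resampling noise while leaving the expected counts N w_i unchanged.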

Convergence of the GPEPS
We will describe the convergence, as N tends to infinity, of the self-normalised Monte Carlo approximations formed by the weighted particle samples returned by Algorithm 1 using the concept of consistency (adopted from Douc and Moulines, 2008) defined in the following. Let (Ξ, B(Ξ)) denote some given state space; a weighted sample (ξ^i, ω^i)_{i=1}^N on Ξ is then called consistent for (ν, C), where ν is a probability measure on (Ξ, B(Ξ)) and C a class of real-valued measurable functions on Ξ, if, for all f ∈ C,

Σ_{i=1}^N (ω^i / Ω^N) f(ξ^i) → ν(f) in probability as N → ∞, (2.14)

and, additionally, max_{1≤i≤N} ω^i / Ω^N → 0 in probability. The following assumption is mild (in fact, minimal) but essential when establishing consistency of the GPEPS scheme.
Proposition 2.1. Assume (A1-4) and that the initial sample is consistent for (φ_0, L^1(X, φ_0)). Then, for each k ≥ 0, the weighted sample produced by Algorithm 1 is consistent for (φ_k, L^1(X^{k+1}, φ_k)). The same is true when the multinomial selection schedule (2.12) is replaced by deterministic plus residual multinomial selection (2.13).
The proof of Proposition 2.1 is postponed to Appendix A.1.

Fixed-lag smoothing
Unfortunately, it has been observed by several authors that using standard SMC methods in the smoothing mode may be unreliable for larger observation sample sizes n, since systematic resampling of the particles degenerates the particle paths. Indeed, when k ≪ n, most (or possibly all) marginal particles (ξ^i_{k|n})_{i=1}^N will coincide, resulting in a significant Monte Carlo error when estimating any expectation of X_k given Y_{0:n} using the produced particles. In particular, returning to the problem of estimating the intermediate quantity Q_n in (2.2), for any additive functional of type t(x_{0:n}) def= Σ_{k=0}^{n−1} s_k(x_{k:k+1}), with (s_k)_{k=0}^{n−1} being a set of functions (cf. the two terms of (2.2)), we may expect the estimator of E[t(X_{0:n}) | Y_{0:n}] to be poor when n is large. To compensate for this degeneracy, the particle sample size N has to be increased drastically, yielding a computationally inefficient algorithm.
On the other hand, since we may expect remote observations to be only weakly dependent, it should hold that, for a large enough integer ∆_n,

E[s_k(X_{k:k+1}) | Y_{0:n}] ≈ E[s_k(X_{k:k+1}) | Y_{0:k(∆_n)}], (2.17)

where k(∆_n) def= (k + ∆_n) ∧ n. Thus, as long as the approximation (2.17) is relatively precise for a ∆_n which is smaller than the average particle trajectory collapsing time, i.e. most marginal particles (ξ^i_{k|k(∆_n)})_{i=1}^N are different for all k, we may replace (2.16) by the fixed-lag estimator (2.18). The lag-based approximation (2.18) may be computed recursively in a single sweep of the data with only limited computer data storage demands, and computing (2.18) is clearly not more computationally demanding than computing (2.16) (having O(nN) complexity); see  for details. Finally, using (2.18) in conjunction with the kernel P̄_θ for estimating log q_θ gives us the approximation (2.19) of the intermediate quantity Q_n(θ; θ′). In (2.19) we have added θ′ as an index to the particles as well as to the associated weights to indicate that the particle system of the fixed-lag smoother is evolved under the dynamics determined by the initial parameter value.
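The trajectory degeneracy that the fixed-lag construction exploits is easy to verify empirically: trace the genealogy of a simple bootstrap-type filter backwards and count how many distinct ancestors survive at increasing lags. The toy experiment below uses an illustrative model and illustrative sample sizes (not taken from the paper); the distinct ancestor count decays rapidly with the lag, which is why freezing statistics older than ∆_n loses little information.

```python
import numpy as np

def genealogy_degeneracy(n_steps, n_particles, rng):
    """Run a toy bootstrap filter (random-walk state, Gaussian observations)
    and record the ancestor indices chosen at each resampling step."""
    x_true = np.cumsum(rng.normal(size=n_steps))       # latent path (toy random walk)
    y = x_true + rng.normal(size=n_steps)              # noisy observations
    particles = rng.normal(size=n_particles)
    ancestors = np.empty((n_steps, n_particles), dtype=int)
    for k in range(n_steps):
        logw = -0.5 * (y[k] - particles) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)
        ancestors[k] = idx                             # resampled ancestor indices
        particles = particles[idx] + rng.normal(size=n_particles)
    return ancestors

def distinct_ancestors(ancestors, lag):
    """Number of distinct ancestors of the final particles `lag` steps back."""
    n_steps, n_particles = ancestors.shape
    lineage = np.arange(n_particles)
    for k in range(n_steps - 1, n_steps - 1 - lag, -1):
        lineage = ancestors[k][lineage]                # compose the ancestor maps
    return len(np.unique(lineage))
```

Since the lineage map at a deeper lag is a composition of the maps at shallower lags, the distinct count is non-increasing in the lag; in practice it collapses towards a handful of ancestors well before the lag reaches n.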

Convergence of the intermediate quantity
Under weak assumptions on the functions Ψ_k, the kernels L_k and P̄, and the local likelihood functions log g_θ(·, Y_k), one may establish the convergence of the approximate intermediate quantity (2.19). Thus, define, for a given lag ∆_n and parameters (θ, θ′), the bias b_n(θ, θ′) imposed by the fixed lag as in (2.20). We then have the following result (Theorem 2.1), which is the main result of this section and in which the bias b_n is as defined in (2.20).
The proof is given in Appendix A.2. The bias term b_n, which was treated by , is controlled by the speed at which the hidden chain (X_k)_{k≥0} forgets its initial distribution when evolving conditionally on the observations. Indeed, when the state space X is compact it can be shown (see  for details) that b_n is O(n ρ^{∆_n}), where 0 < ρ < 1 is the uniform (with respect to observation records Y_{0:n} as well as initial distributions χ) mixing coefficient of the conditional chain. From this we deduce that the lag ∆_n should be increased with n at the minimum rate c log n, with c > −1/log ρ, in order to keep the bias suppressed. Increasing ∆_n faster eliminates the bias but increases the variance of the approximation; see again  for a detailed study of these issues. Since a similar forgetting property holds also in the case of a non-compact state space X (Douc et al., 2009a), the same arguments can be applied to very general models; however, the analysis of the general case is significantly more involved, since the mixing coefficient is then uniform neither with respect to observation records nor with respect to initial distributions χ.
Remarkably, the convergence result in Theorem 2.1 holds for any fixed sample sizes (α, ᾱ). In particular, nothing prevents us from letting α = ᾱ = 1, yielding a computationally very efficient algorithm; this is the choice made in Section 3.

Forward-filtering backward-smoothing
Even though naive SMC implementations generally fail to estimate joint smoothing distributions efficiently, they can, as discussed above, be used successfully for estimating the marginal filter distributions (corresponding to k = n in the discussion of Section 2.3). Nevertheless, any joint smoothing distribution may be expressed in terms of marginal filter distributions via the so-called forward-filtering backward-smoothing decomposition. Indeed, for any probability measure η on (X, 𝒳), one may define a reverse kernel ←Q_η describing the time-reversed dynamics, yielding a backward decomposition of φ_n(·; θ) on 𝒳^{⊗(n+1)}. Using the Markovian structure of this decomposition, a trajectory X_{0:n} can be simulated from φ_n(·; θ) by, firstly, computing recursively (via (2.4)) the filter distributions (φ_{k|k}(·; θ))_{k=0}^n and, secondly, simulating X_n from φ_{n|n}(·; θ) and thereafter, recursively for k = n − 1, n − 2, …, 0, X_k from ←Q_{φ_{k|k}}(X_{k+1}, ·; θ). This scheme will in the following be referred to as forward-filtering backward-simulation (FFBS), and we refer again to Cappé et al. (2005) for a detailed treatment.
In general we lack closed-form expressions of the filter distributions, but we may estimate these efficiently using Algorithm 1. Hence, following Doucet et al. (2000), a non-degenerate particle estimate of φ_{0:n}(·; θ) can be obtained by replacing, in the decomposition (2.22), φ_{n|n} by the empirical measure φ^N_{n|n} and the reverse kernels by their particle approximations. Note that a draw according to such an approximated reverse kernel requires the evaluation of the transition density q_θ(ξ^i_{k|k}, x_{k+1}) for every particle. In the case of PODs, a closed-form expression of q_θ is in general missing, and we thus replace each number q_θ(ξ^i_{k|k}, x_{k+1}) by a draw V_θ(ξ^i_{k|k}, x_{k+1}) from the GPE P_θ(ξ^i_{k|k}, x_{k+1}, ·). This gives us the following algorithm for simulating a trajectory X_{0:n} that is approximately distributed according to φ_n.
Algorithm 2 avoids the problem of degeneracy of the genealogical tree without any implicit assumption on geometric ergodicity of the conditional hidden chain. On the other hand, simulating a single trajectory according to Algorithm 2 involves O(N) operations, implying an overall computational cost of order O(N²) for producing a sample of size N. Recently, Douc et al. (2009b) showed how the overall computational cost of the particle-based FFBS can be reduced to O(N) by means of accept-reject methods; however, it is not straightforward to adapt this approach to our framework, since for general PODs one cannot find an upper bound on the transition density of the hidden chain. For models with forgetting properties, Algorithm 2 should be outperformed by the fixed-lag smoother because of the quadratic complexity of the former scheme (see the coming section for examples); the FFBS should thus be seen as a generic alternative solution in cases of poor mixing.
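A minimal backward-simulation pass in the spirit of Godsill et al. (2004) can be sketched as follows, assuming the filter particles and log-weights from the forward pass have been stored. The Gaussian `trans_logpdf` used in the usage example is a stand-in for a known transition density; in the POD setting each such evaluation would be replaced by a GPE draw.

```python
import numpy as np

def backward_simulate(particles, logw, trans_logpdf, rng):
    """Draw one trajectory from the particle FFBS approximation: pick the
    final state from the filter weights, then move backwards, reweighting
    the filter particles at each step by the transition density into the
    state already selected (Godsill et al., 2004).  `particles` and `logw`
    are (n+1, N) arrays of stored filter particles and log-weights."""
    n_plus_1, n_particles = particles.shape
    path = np.empty(n_plus_1)
    # sample X_n from the final filter approximation
    w = np.exp(logw[-1] - logw[-1].max())
    w /= w.sum()
    j = rng.choice(n_particles, p=w)
    path[-1] = particles[-1, j]
    for k in range(n_plus_1 - 2, -1, -1):
        # backward-kernel weights: filter weight times transition density
        logb = logw[k] + trans_logpdf(particles[k], path[k + 1])
        b = np.exp(logb - logb.max())
        b /= b.sum()
        j = rng.choice(n_particles, p=b)
        path[k] = particles[k, j]
    return path

# usage sketch with an illustrative AR(1) Gaussian transition density
rng = np.random.default_rng(0)
parts = rng.normal(size=(5, 50))          # placeholder filter particles
lw = np.zeros((5, 50))                    # placeholder (uniform) log-weights
traj = backward_simulate(parts, lw, lambda x, xn: -0.5 * (xn - 0.9 * x) ** 2, rng)
```

Each trajectory costs O(N) reweighting operations per time step, which is the source of the O(N²) overall complexity discussed above.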

Simulation study
In this section, the proposed methods are illustrated on two simulated examples, consisting of noisy observations of the models treated by Beskos et al. (2006) and Beskos et al. (2008). In both examples we let, for simplicity, the measurement noise variance σ_ε be known and set to 0.1, and we assume equidistant measurements with t_{k+1} − t_k = 1 for all k ≥ 0. We consequently use α = ᾱ = 1. The approximate intermediate quantity Q^N_n is maximised using the Nelder-Mead simplex algorithm as implemented in MATLAB's fminsearch command. In order to obtain convergence of the parameter sequence returned by the Monte Carlo EM algorithm, it is necessary to decrease, at each iteration, the bias of the particle approximation by increasing the number of particles with the iteration index. We thus follow the recommendations of Fort and Moulines (2003) and increase the particle sample size as the square root of the iteration number, with an initial size of 100 particles. A detailed discussion of the effect of the lag size on the quality of the final parameter estimates is given in ; thus, we do not repeat this discussion here and stick to the recommendation of increasing the lag logarithmically with the size of the observation record.

Log-growth model
In the first example we estimate, from simulated data, the parameters of a partially observed version of the log-growth model discussed by Beskos et al. (2006). The model is specified by the stochastic differential equation dX_t = κ X_t (1 − X_t/Λ) dt + σ X_t dW_t together with the observation equation Y_k = X_k + σ_ε ε_k, where (ε_k)_{k≥0} are mutually independent, standard normal-distributed random variables. The noise sequence is supposed to be independent also of W. Applying Itô's formula to the transformation X̃_t = η(X_t, σ), with η(x, σ) = log(x)/σ, yields a process with unit diffusion coefficient to which the GPE applies. Since the drift function α of the transformed process is bounded from above, we are only required to simulate the minimum of the Brownian path and let W̄⁻_α be α evaluated at this minimum; see Appendix B for the meaning of W̄⁻_α. The minimum of the Brownian bridge has a known law, and given the minimum, the bridge can be constructed retrospectively using Bessel bridges (see Beskos et al., 2006). Our aim is to estimate the unknown parameters θ def= (κ, Λ, σ) given a record Y_{0:1000} of observations. The observation set was obtained through simulation under the parameters θ* = (0.1, 1000, 0.1). When computing the approximate intermediate quantity Q^N_n, the random weight fixed-lag smoother used the lag ∆_n = 40 and the proposal (3.3), where t(·; n) denotes the density of Student's t-distribution with n degrees of freedom. Further, the adjustment multiplier weights are set to 1. The proposal (3.3) is obtained by discretising the hidden dynamics using the Euler scheme. The EM output is presented in Figure 3.1. For comparison, the estimation problem of the log-growth model was also solved using the GPE-based particle FFBS of Section 2.4. The setup was the same as for the fixed-lag smoother, but due to the significantly higher computational cost of the FFBS scheme (recall Section 2.4) the number of observations was reduced to 100. For the FFBS-based procedure, the GPE needs to be evaluated N + 1 times per particle and time step, i.e., once in the forward filtering pass and N times in the backward simulation sweep, compared to only once for the fixed-lag smoother.
The EM learning curves obtained using the GPE-based particle FFBS are presented in Figure 3.1, which shows the convergence of Λ (solid, left y-axis), κ (dashed, right y-axis), and σ (dotted, right y-axis) on 100 observations.

Genetics diffusion model
In a second example we estimate, again from simulated data, the parameters of a partially observed version of the genetics diffusion model presented in Kloeden and Platen (1992) and discussed by Beskos et al. (2008). The diffusion is observed in Gaussian noise, the sequence (ε_k)_{k≥0} being as in the previous example. Applying Itô's formula to a transformation of type η(·, σ) again allows for using the GPE for estimating the transition density of the latent process. In this case, the drift function α of the transformed process becomes more involved than in the previous example, and it is bounded neither from above nor from below. Thus, we have to draw both W̄⁻_α and W̄⁺_α and a Brownian bridge (W_s)_{s=0}^t such that W̄⁻_α ≤ α(W_s) ≤ W̄⁺_α for all 0 ≤ s ≤ t; see Appendix B for a justification of this. For this purpose we apply the method proposed in Beskos et al. (2008), which involves sampling first a maximum W̄⁺_id and a minimum W̄⁻_id, and then a Brownian bridge such that W̄⁻_id ≤ W_s ≤ W̄⁺_id for all 0 ≤ s ≤ t. Since a linear transformation of a Brownian bridge is still a Brownian bridge, it suffices to consider the case where the path (W_s)_{s=0}^t is conditioned to start and end in zero. Sampling a lower and an upper bound can then be done by rejection sampling in the following way: let (a_i)_{i≥0} with a_0 = 0 be an increasing sequence and consider the intervals (−a_i, a_i]. Since the probability that a Brownian bridge stays in a given interval [−K, K] has a known expression (having the form of an infinite series), it is possible to calculate the probability that the bridge is contained in (−a_i, a_i] but not in (−a_{i−1}, a_{i−1}]; this means that either its maximum is contained in (a_{i−1}, a_i], or its minimum is contained in (−a_i, −a_{i−1}], or both. Thus, we first propose an interval (a_{i−1}, a_i]; given this interval, we then propose, with probability 1/2, a maximum conditioned to belong to (a_{i−1}, a_i], and otherwise a minimum in (−a_i, −a_{i−1}].
Since the distributions of the maximum and the minimum are known in closed form, this is easily done. Next, we propose a Brownian bridge by decomposing around the proposed maximum (minimum) as in the previous example. The resulting path (W_s)_{s=0}^t is accepted, with a probability depending on the path in question, only if it remains in the interval; see Beskos et al. (2008) for details. Finally, we set W̄±_α def= α(W̄±_id). Again we attempt to estimate the unknown parameters θ def= (µ, ν, σ) given a record Y_{0:1000} of observations obtained through simulation under the parameters θ* = (0.05, 0.1, 1). When computing the approximate intermediate quantity Q^N_n, the random weight fixed-lag smoother used the lag ∆_n = 20. Since the state space (0, 1) is bounded, we propose the particles by simply drawing uniforms over (0, 1). The EM output is presented in Figure 3

Conclusion
Parameter inference in general discretely and partially observed diffusion processes is an inherently difficult problem due to the lack of closed-form transition densities of the hidden Markov chain. Assuming the possibility of simulating transitions of the latent diffusion process exactly, it is possible to produce pointwise and consistent estimates of the likelihood function using the standard bootstrap particle filter, in which the particles are assigned importance weights determined completely by the known local likelihood function. In such a framework, the likelihood surface can be explored using, e.g., grid-based methods (Olsson and Rydén, 2008). Ionides et al. (2009) use the bootstrap particle filter for computing pointwise approximations of the score function and locate the maximum likelihood estimate by means of stochastic approximation. However, simulating transitions of a diffusion process exactly is in general infeasible, and we are most often confined to discretisation-based methods such as the Euler scheme, imposing a bias on the final parameter estimates that is nontrivial to control. Moreover, blindly mutating the particles, as in the bootstrap particle filter, without incorporating in the proposal kernel the information provided by the observations will in general lead to serious degeneracy of the particle weights, especially for models where the observations are informative. Thus, in the present paper we proposed an alternative, EM-based method for estimating unknown parameters of PODs. The method combines recent approaches for efficient estimation of the joint smoothing distribution in hidden Markov models with recently proposed techniques for estimating, without bias, transition densities of a large class of diffusion processes via GPEs (Beskos et al., 2008).
Interestingly, the GPE provides a way of producing unbiased estimates of the transition densities simultaneously for all parameter values; this is critical when carrying through the maximisation-step of the EM algorithm. For models having forgetting properties, the degeneracy of the particle trajectories can be efficiently avoided by means of fixed-lag smoothing (Kitagawa, 1998). The decrease of variance gained by the fixed-lag approximation is obtained at the cost of a bias; the bias is however easily controlled by letting the size of the lag grow logarithmically with the size of the observation record, yielding an algorithm of O(N) computational complexity. We provide a detailed study of the convergence of the GPE-based particle smoother as well as of the full intermediate quantity of EM. The results are obtained under, what we believe, minimal assumptions and may, since we analyse separately the GPE-based mutation step (Lemma A.1), be extended to any selection schedule for which consistency has been established in the literature. In this way, our GPEPS convergence results differ significantly from those presented in Fearnhead et al. (2008). In the nonergodic case, we proposed a method for sampling the joint smoothing distribution which is based on the forward-filtering backward-smoothing decomposition of the same. The method, which relies on an algorithm proposed by Godsill et al. (2004) and analysed further by Douc et al. (2009b), consists of a forward-filtering pass followed by a backward-simulation pass in which trajectories are drawn according to approximations of the backward kernels constructed from the particle filter estimates obtained in the forward pass. During the two passes we replace, when needed, any evaluation of the diffusion process transition density by a draw from the GPE.
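The fixed-lag idea described above amounts to approximating the marginal smoothing expectation E[f(X_k) | Y_{0:n}] by E[f(X_k) | Y_{0:k+∆}], which is read off the particle genealogy once the filter has advanced ∆ steps past time k. A minimal sketch, assuming the filter output (particle positions, ancestor indices, and normalised weights per time step) has been stored:

```python
import numpy as np

def fixed_lag_estimates(paths_by_time, anc_by_time, weights_by_time, lag):
    """Fixed-lag smoothed means from stored particle filter output.

    paths_by_time[k]   : particle positions at time k (array of size N)
    anc_by_time[k]     : index of each time-k particle's ancestor at k - 1
    weights_by_time[k] : normalised filter weights at time k

    E[X_k | Y_{0:n}] is approximated by E[X_k | Y_{0:min(k+lag, n)}]:
    each particle at time k + lag is traced back through the genealogy
    to its ancestor at time k, which is then averaged under the current
    weights.  (Illustrative sketch.)"""
    n = len(paths_by_time) - 1
    estimates = []
    for k in range(n + 1):
        kl = min(k + lag, n)
        idx = np.arange(len(paths_by_time[kl]))
        for j in range(kl, k, -1):      # trace the genealogy back to time k
            idx = anc_by_time[j][idx]
        estimates.append(float(np.dot(weights_by_time[kl], paths_by_time[k][idx])))
    return estimates
```

Freezing the genealogy after ∆ steps is what keeps the variance bounded: the ancestors at time k are no longer resampled once time k + ∆ has been reached, at the cost of the bias b_n discussed in the text.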
In the end, we obtain an O(N²) algorithm that is significantly more costly than the fixed-lag smoother, but which elegantly avoids the problem of degeneracy of the genealogical tree of the particles. The methods were successfully demonstrated on two examples.
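The backward-simulation pass of the O(N²) alternative (in the spirit of Godsill et al., 2004) can be sketched as follows; here `log_q` is an ordinary log transition density, whereas in the GPE setting of this paper each such evaluation would be replaced by an unbiased random draw.

```python
import numpy as np

def backward_simulate(particles, weights, log_q, rng):
    """Draw one trajectory from the particle approximation of the joint
    smoothing distribution (forward-filtering backward-simulation).

    particles[k], weights[k] : filter particles / normalised weights, time k
    log_q(x, x_next)         : log transition density, vectorised in x
    """
    n = len(particles) - 1
    traj = np.empty(n + 1)
    # Initialise at the final time from the filter weights.
    j = rng.choice(len(particles[n]), p=weights[n])
    traj[n] = particles[n][j]
    # Backward pass: reweight each time-k particle by how well it
    # explains the already drawn state at time k + 1.
    for k in range(n - 1, -1, -1):
        logw = np.log(weights[k]) + log_q(particles[k], traj[k + 1])
        w = np.exp(logw - logw.max())
        j = rng.choice(len(particles[k]), p=w / w.sum())
        traj[k] = particles[k][j]
    return traj
```

The O(N²) cost arises because drawing N trajectories requires, at each time step, N backward-kernel weights per trajectory.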
There exist alternative techniques, either Monte Carlo-based (see e.g. Pedersen, 1995) or based on basis expansions (Aït-Sahalia, 2008), for approximating the transition density. Nevertheless, none of these approaches produces unbiased estimates. The former is, while quite general, computationally very demanding, and the latter is only valid for very short time intervals (recall that the performance of the GPE is independent of the size of the time grid). Sometimes more direct numerical approaches, such as solving the Fokker-Planck equations or taking the Fourier inverse of the characteristic function of the SDE, are possible; however, these methods often tend to be computationally expensive. In any case, our theoretical results presume only unbiasedness of the transition density estimator, and thus other approximation schemes may be applicable within our framework.

Appendix A: Proofs
The proofs of Proposition 2.1 and Theorem 2.1 rely on recent limit theorems for weighted samples obtained by Douc and Moulines (2008). Since this section deals exclusively with asymptotic properties as the sample size tends to infinity, we let, unless otherwise specified, the limit notation → refer to an increasing number N of particles only. For clarity, we also index the particles and the associated weights by N. The following kernel notation will be used throughout: let µ be a measure on (Ξ, B(Ξ)), f a measurable function on (Ξ̃, B(Ξ̃)), and K a kernel from (Ξ, B(Ξ)) to (Ξ̃, B(Ξ̃)); then we set µK(A) def= ∫_Ξ K(ξ, A) µ(dξ) for A ∈ B(Ξ̃) and Kf(ξ) def= ∫_Ξ̃ f(ξ̃) K(ξ, dξ̃). The following definition specifies the structure that we require of any class of estimand functions.
Definition A.1. A set C of measurable functions on Ξ is proper if the following holds.
(i) C is a linear space; that is, if f and g belong to C and (α, β) ∈ R², then αf + βg ∈ C;
(ii) if g ∈ C and f is measurable with |f| ≤ |g|, then f ∈ C;
(iii) for all c ∈ R, the constant function ξ → c belongs to C.
We will frequently make use of the following lemma obtained by Douc and Moulines (2008). Let (Ω, F, P) be a probability space and (F_{N,i})_{i=0}^N, N ≥ 1, a triangular array of sub-σ-fields of F such that F_{N,i−1} ⊆ F_{N,i} for all 1 ≤ i ≤ N and N ≥ 1. In addition, let (U_{N,i})_{i=1}^N, N ≥ 1, be a triangular array of random variables such that each U_{N,i} is F_{N,i}-measurable.
Algorithm 3 (random weight mutation). The sample (ξ̃_{N,i}, ω̃_{N,i})_{i=1}^N returned by the algorithm is taken as an approximation of µ̃. In order to evaluate the quality of this sample, define the set C̃ in (A.3); the following result, stating consistency for weighted samples produced by Algorithm 3, is then instrumental when establishing Proposition 2.1.
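A random-weight mutation step of the kind analysed here can be sketched as follows. This is an illustrative sketch, not the paper's Algorithm 3 verbatim: `propose` plays the role of the proposal kernel R, and `weight_estimator` returns one unbiased draw (e.g. a GPE draw replacing the unavailable transition density) of the intractable incremental importance weight; α such draws are averaged.

```python
import numpy as np

def random_weight_mutation(xi, omega, propose, weight_estimator, alpha, rng):
    """One random-weight mutation step.

    Each ancestor xi[i] is moved through the proposal kernel, and its new
    importance weight is omega[i] times the average of alpha independent
    unbiased estimates of the incremental weight."""
    n = len(xi)
    xi_new = np.array([propose(xi[i], rng) for i in range(n)])
    omega_new = np.empty(n)
    for i in range(n):
        v = [weight_estimator(xi[i], xi_new[i], rng) for _ in range(alpha)]
        omega_new[i] = omega[i] * np.mean(v)
    return xi_new, omega_new
```

Because each v is unbiased, the randomly weighted sample targets the same measure as an exactly weighted one, which is the content of the consistency lemma below.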
Lemma A.1. Assume that the ancestor sample (ξ_{N,i}, ω_{N,i})_{i=1}^N is consistent for (ν, C) and that the function L(·, Ξ̃) belongs to C. Then the set C̃ defined in (A.3) is proper, and the weighted particle sample (ξ̃_{N,i}, ω̃_{N,i})_{i=1}^N produced by Algorithm 3 is (µ̃, C̃)-consistent, for any fixed α ∈ N.
To establish Condition (2.14) in Definition 2.1 it is enough to show that (A.4) holds for all f ∈ C̃; indeed, since C̃ contains the unit mapping ξ̃ → 1 (as C̃ is proper), (A.4) implies (A.5), from which Condition (2.14) in Definition 2.1 follows by Slutsky's lemma. Thus, we define the triangular array (U_{N,i}) accordingly. We then get, by applying the tower property of conditional expectations and the consistency of the ancestor sample, convergence in probability of the conditional expectations; to deduce (A.4), we apply Theorem A.1. In order to establish the first condition of that theorem we reuse the arguments above together with the fact that L(·, |f|) ∈ C, yielding the required limit. Now, since convergence in probability implies tightness, we conclude that Condition (i) in Theorem A.1 is fulfilled.
To verify (ii), define, for some ǫ > 0, the quantity A_N. Since, as the ancestor sample is assumed to be consistent, max_{1≤i≤N} ω_{N,i}/Ω_N vanishes in probability as N tends to infinity, the same holds for the product in question. Moreover, since L(·, |f|) ∈ C, we conclude, using Property (ii) of Definition A.1, that the corresponding mapping on Ξ belongs to C as well. Thus, consistency of the ancestor sample implies the limit (A.6). In addition, since the constant C may be chosen arbitrarily large, the limit (A.6) can be made arbitrarily small by the dominated convergence theorem. We hence conclude that A_N tends to zero in probability as N tends to infinity. This establishes (A.4).
In order to establish (2.15) it is, by Slutsky's theorem and (A.5), enough to prove the corresponding limit. Thus, take again a constant C > 0 and write the decomposition (A.8). To prove that the right-hand side of (A.8) converges, we introduce the triangular array (Ū_{N,i}) and let the sub-σ-fields F_N, N ≥ 1, be defined as above. Next, we use again Theorem A.1. To verify the first condition, take the conditional expectation with respect to F_N and reuse (A.6) with f being the unit function; this implies (i). To verify (ii), take an ǫ > 0, define A_N accordingly, and bound it, for an arbitrary constant C′ > 0, along the lines of (A.6). On the other hand, since the limit (A.9) can be made arbitrarily small by increasing C′, we conclude that A_N tends to zero as N tends to infinity. This in turn implies that the upper bound in (A.8) tends to

∫∫∫ 1{Σ_{ℓ=1}^α v_ℓ ≥ αC} v_1 S^α(ξ, ξ̃, dv_{1:α}) R(ξ, dξ̃) ν(dξ) . (A.10)

Finally, we complete the proof by noting that (A.10) can be made arbitrarily small by increasing C.
We now use Lemma A.1 to prove consistency of Monte Carlo estimates produced by the GPEPS. For this purpose, let ξ̃^i_{0:k|k} def= ξ^{I^i_k}_{0:k|k}, 1 ≤ i ≤ N, denote the selected particles obtained in Step (2) of Algorithm 1. Consequently, the sample (ξ̃^i_{0:k|k})_{i=1}^N is obtained by resampling the ancestor particles (ξ^i_{0:k|k})_{i=1}^N multinomially with respect to the normalised adjusted weights (ω^j_k ψ^j_k / Σ_{ℓ=1}^N ω^ℓ_k ψ^ℓ_k)_{j=1}^N. This operation will in the following be referred to as selection. Using this notation and terminology, one iteration of the GPEPS can be described by three successive transformations: weighting, selection, and mutation. Here the third operation refers to the random weight mutation procedure described in Algorithm 3.
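The selection transformation just defined is a plain multinomial resampling under the adjusted weights; a minimal sketch (illustrative, with hypothetical argument names):

```python
import numpy as np

def select(particles, omega, psi, rng):
    """Auxiliary selection step of the GPEPS: resample the ancestor
    particles multinomially with respect to the normalised adjusted
    weights omega_j * psi_j / sum_l omega_l * psi_l."""
    p = omega * psi
    idx = rng.choice(len(particles), size=len(particles), p=p / p.sum())
    return particles[idx], idx
```

Returning the indices `idx` as well makes it possible to record the genealogy needed by the fixed-lag smoother.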
To prove Proposition 2.1 we proceed by induction and assume that (ξ^i_{0:k|k}, ω^i_k)_{i=1}^N is consistent for (φ_k, L^1(X^{k+1}, φ_k)). Next, we show how consistency is preserved through one iteration of the algorithm by analysing separately Steps (I)-(III).
Step I. Define the modulated smoothing measure φ_k Ψ_k; then the weighting operation in Step I can be viewed as a transformation according to Algorithm 3 with Ξ = X^{k+1}, Ξ̃ = X^{k+1}, and the kernels specified accordingly. Thus, by applying Lemma A.1 we conclude that (ξ^i_{0:k|k}, ψ^i_k ω^i_k)_{i=1}^N is consistent for φ_k Ψ_k and the associated (proper) set. Step II. Applying Theorem 3 in Douc and Moulines (2008) gives immediately that (ξ̃^i_{0:k|k}, 1)_{i=1}^N is consistent for [φ_k Ψ_k, L^1(φ_k Ψ_k, X^{k+1})] for both of the selection schedules (2.12) and (2.13).
Step III. Also the third step is handled using Lemma A.1. In this case, we set Ξ = X^{k+1}, Ξ̃ = X^{k+2}, and take the weight distribution to be given by the GPE P described in Section 2.1 (and in more detail in Appendix B). Thus, applying Lemma A.1 yields that (ξ^i_{0:k+1|k+1}, ω^i_{k+1})_{i=1}^N is consistent for φ_{k+1} and the corresponding set. Finally, we complete the proof by noting that the induction hypothesis is fulfilled for k = 0 by assumption.

A.2. Proof of Theorem 2.1
Decompose the error into the bracketed terms, which originate from the GPEPS, and the second term b_n, defined in (2.20), which is the cost of introducing the fixed lag. By combining Proposition 2.1 with Slutsky's theorem we conclude that (A.12) holds, as x_{0:k(∆_n)} → log g_θ(x_k, Y_k) belongs to L^1(φ_{k(∆_n)}(·; θ′), X^{k(∆_n)+1}) by assumption. Thus, the second term of the intermediate quantity estimator (2.19) is consistent. In order to establish consistency of the complete estimator it remains to prove (A.13). To do this, we define the array (Ū_{N,i}) accordingly and appeal to Theorem A.1 and Proposition 2.1. Since log q̃_θ(x_k, x_{k+1}) ≤ ∫ |v| P_θ(x_k, x_{k+1}, dv) for all x_{k:k+1} ∈ X², the mapping in question belongs to the relevant class, from which we conclude that (A.13) may be established by verifying the two assumptions of Theorem A.1. Following (A.14) and using again that x_{0:k(∆_n)} → ∫ |v| P_θ(x_k, x_{k+1}, dv) belongs to L^1(φ_{k(∆_n)}(·; θ′), X^{k(∆_n)+1}) by assumption, we conclude that

Σ_{i=1}^N E[Ū_{N,i} | F_N] →_P ∫∫ |v| P_θ(x_k, x_{k+1}, dv) φ_{k(∆_n)}(dx_{k:k+1}; θ′),

which verifies Assumption (i) (by tightness of sequences converging in probability). To verify (ii), let ǫ > 0 and set Ā_N def= Σ_{i=1}^N E[|Ū_{N,i}|; |Ū_{N,i}| ≥ ǫ | F_N]. Then, for any constant C > 0, the sum can be bounded by consistency of the particle sample. On the other hand, since, for all x_{k:k+1} ∈ X²,

∫ 1{|Σ_{ℓ=1}^ᾱ v_ℓ| ≥ Cᾱ} |v_1| P_θ^ᾱ(x_k, x_{k+1}, dv_{1:ᾱ}) ≤ ∫ |v| P_θ(x_k, x_{k+1}, dv),

we get, using Proposition 2.1, the limit (A.16). We now note that the limit in (A.16) can be made arbitrarily small by increasing C. This verifies Condition (ii) in Theorem A.1, which completes the proof of (A.13). Finally, combining (A.13) with (A.12) completes the proof of Theorem 2.1.

Appendix B: More on the GPE
The outline of this section follows Beskos et al. (2006) and Fearnhead et al. (2008), and we limit our scope to the one-dimensional case; multivariate extensions are treated by Beskos et al. (2008). Let (C[0, t], C[0, t]) be the measurable space of continuous functions on [0, t] and denote by Q^x_θ the law of X̃ on (C[0, t], C[0, t]) for the initial condition X̃_0 = W_0 = x. Also, let W^{(t,x,x′)} be the law, on the same space, of the Brownian bridge process W = (W_s)_{0≤s≤t} starting in x at time zero and ending in x′ at time t. Similarly, denote by Q^{(t,x,x′)}_θ the law of the diffusion bridge obtained when X̃ is conditioned to start at X̃_0 = W_0 = x and to finish at X̃_t = x′. Recall the definition (2.1) of α(·, θ) and let A(u, θ) def= ∫^u α(v, θ) dv be any antiderivative of α(·, θ). The role of Assumptions (A1)-(A3) is to guarantee that Q^{(t,x,x′)}_θ is absolutely continuous with respect to W^{(t,x,x′)}, with Radon-Nikodym derivative dQ^{(t,x,x′)}_θ/dW^{(t,x,x′)}(w) given in (B.1), where w ∈ C[0, t] and N_t denotes the density function of the zero-mean normal distribution with variance t. Now, define, for u ∈ R, the drift functional φ(u; θ), where l(θ) is the lower bound given in Assumption (A3). The transition density q̃_θ can, using (B.1), be expressed as an expectation under the Brownian bridge law; accordingly, we wish to calculate expectations of the form (B.2). Now assume that it is possible to simulate simultaneously a pair (W̄⁻_f, W̄⁺_f) of random variables and a trajectory (W_s)_{s=0}^t such that W̄⁻_f ≤ f(W_s) ≤ W̄⁺_f for all s ∈ [0, t]; in practice this will most often be carried through by first simulating a maximum and a minimum of the Brownian bridge process W and thereafter interpolating, using Bessel bridges, the rest of the bridge conditionally on these. Let κ be a discrete random variable having, conditionally on W̄^±_f, probability distribution p_t(· | W̄^±_f). Then it is easily established that the GPE (associated with p_t) is an unbiased estimator of (B.2).
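For the simplest member of this class (constant intensity λ, with κ Poisson(λt)-distributed and the ψ_ℓ uniform on [0, t]), the unbiasedness claim can be verified directly by conditioning on the path; this is a worked special case, not the general GPE with path-dependent p_t:

```latex
\begin{align*}
\mathbb{E}\Biggl[\,\prod_{\ell=1}^{\kappa}
   \frac{\lambda - \phi(W_{\psi_\ell};\theta)}{\lambda}
   \,\Big|\, (W_s)_{s=0}^{t} \Biggr]
&= \sum_{m=0}^{\infty} e^{-\lambda t}\,\frac{(\lambda t)^m}{m!}
   \left( \frac{1}{t}\int_0^t
   \frac{\lambda - \phi(W_s;\theta)}{\lambda}\, \mathrm{d}s \right)^{\!m} \\
&= \exp\!\left( \lambda t \left[ \frac{1}{t}\int_0^t
   \frac{\lambda - \phi(W_s;\theta)}{\lambda}\, \mathrm{d}s - 1 \right] \right) \\
&= \exp\!\left( -\int_0^t \phi(W_s;\theta)\, \mathrm{d}s \right) ,
\end{align*}
```

so that averaging over the Brownian bridge law recovers the desired expectation; the general p_t only changes the per-point factors accordingly.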
Here (ψ_ℓ)_{ℓ≥1} are mutually independent variables that are uniformly distributed over [0, t] and independent of F_t. Note that the distribution p_t can be chosen freely, yielding a whole class of GPEs; an optimal choice is discussed by Fearnhead et al. (2008). In all applications considered in this paper we will let κ be Poisson-distributed. Using the Girsanov theorem, it can be shown that the right-hand side of (B.1) can be bounded from above and below, so that a rejection sampler producing samples from the diffusion bridge can be constructed. This is possible as the right-hand side of (B.1) is proportional to the probability that a marked Poisson process on [0, t] × [0, 1] with intensity r def= sup{φ(x; θ) : W̄⁻_φ < x < W̄⁺_φ} lies below the graph s → φ(W_s; θ)/r. However, while observing the path for all s is impossible, a finite construction can be devised by sampling the Brownian bridge at the points specified by the marked Poisson process; we refer to Beskos et al. (2006) for details. The algorithm is summarised below.
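The Poisson special case can be sketched in a few lines of code; this is an illustrative sketch with a fixed (path-independent) intensity `lam`, whereas the GPE of the text chooses the intensity from the path-dependent bounds W̄^±_φ. The functions `phi` and `path` are user-supplied.

```python
import math
import random

def poisson_estimator(phi, path, lam, t, rng=random):
    """One draw of the simple Poisson estimator: an unbiased estimate of
    exp(-int_0^t phi(path(s)) ds) along the trajectory s -> path(s).

    kappa ~ Poisson(lam * t), psi_1, ..., psi_kappa ~ U[0, t]."""
    # Draw kappa ~ Poisson(lam * t) by counting exponential arrivals in [0, 1).
    kappa, s = 0, rng.expovariate(lam * t)
    while s < 1.0:
        kappa += 1
        s += rng.expovariate(lam * t)
    est = 1.0
    for _ in range(kappa):
        psi = t * rng.random()
        est *= (lam - phi(path(psi))) / lam
    return est
```

Averaging many such draws recovers exp(-∫φ) without any time discretisation, which is the property exploited by the random-weight particle smoother.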