Geometric ergodicity of Rao and Teh's algorithm for Markov jump processes and CTBNs

Rao and Teh (2012, 2013) introduced an efficient MCMC algorithm for sampling from the posterior distribution of a hidden Markov jump process. The algorithm is based on the idea of sampling virtual jumps. In the present paper we show that the Markov chain generated by Rao and Teh's algorithm is geometrically ergodic. To this end we establish a geometric drift condition towards a small set. A similar result is also proved for a special version of the algorithm, used for probabilistic inference in Continuous Time Bayesian Networks.


Introduction
Markov jump processes (MJP) are natural extension of Markov chains to continuous time. They are widely applied in modelling of the phenomena of chemical, biological, economic and other sciences. An important class of MJP are continuous time Bayesian networks (CTBN) introduced by Schweder (1970) under the name of composable Markov chains and then reinvented by Nodelman, Shelton and Koller (2002a) under the current name. Roughly, a CTBN is a multivariate MJP in which the dependence structure between coordinates can be described by a graph. Such a graphical representation allows for decomposing a large intensity matrix into smaller conditional intensity matrices.
In many applications it is necessary to consider a situation where the trajectory of a Markov jump process is not observed directly, only partial and noisy observations are available. Typically, the posterior distribution over trajectories is then analytically intractable. In the literature there exist several approaches to the above mentioned problem: based on sampling (Boys, Wilkinson and Kirkwood, 2008;El-Hay, Friedman and Kupferman, 2008;Fan and Shelton, 2008;Golightly and Wilkinson, 2011;Golightly and Wilkinson, 2014;Golightly, Henderson and Sherlock, 2015;Nodelman, Shelton and Koller, 2002b;Teh, 2013, 2012), and also based on numerical approximations. To the best of our knowledge the most general efficient method for a finite state space is that proposed by Rao and Teh (2013), and extended to a more general class of continuous time discrete systems in Rao and Teh (2012). Their algorithm is based on introducing so-called virtual jumps and a thinning procedure for Poisson processes.
Recently Miasojedow and Niemiro (2016) proved geometric ergodicity of Rao and Teh's algorithm in a special case of the homogeneous MJPs observed at discrete moments and when the virtual jumps are introduced by unifromization procedure. In the present paper we generalise results from Miasojedow and Niemiro (2016) to a larger class of MJPs, more general observation models and also for more general class of state dependent thinning procedures. We also establish geometric ergodicity of a Gibbs sampler for CTBN's. Geometric ergodicity is a key property of Markov chains which implies Central Limit Theorem for sample averages.
Note that in practice the parameters of the hidden MJP may be unknown and have to be estimated. Then for both Bayesian and frequentist statistical inference the Rao and Teh's algorithm can be applied as a part of more complex algorithms. In the Bayesian approach, the Rao and Teh's algorithm can be used within a Gibbs sampler or Metropolis-Hastings algorithm which updates unknown parameters, according to some posterior distribution. In the frequentist approach, the Rao and Teh's algorithm can be applied to perform E-step of Monte Carlo or stochastic approximation version of EM algorithm. Such extended versions of the Rao and Teh's algorithm are not considered in our paper. We assume that the probability law of a hidden MJP is known. However we strongly believe that geometric ergodicity of the Rao and Teh's algorithm for a given parameters of hidden process will be crucial in the theoretical analysis of such complex methods.
The rest of the paper is organised as follows. In Section 2 we briefly introduce hidden Markov jump processes, next in Section 3 we recall the dependent thinning procedure and the Rao and Teh's algorithm. The main result is proved in Section 4 and extensions for CTBN's are given in Section 5.
Throughout this paper we use p(X) as the generic notation for a probability density of a random object X, so it may denote different functions. Set {n, n + 1, . . . , m} is denoted by [n : m] (for integer n ≤ m).
Suppose that process X cannot be directly observed but we can observe some random quantity Y with probability distribution L(Y |X). Let us say Y is the evidence and L is the likelihood. The problem is to restore the hidden trajectory of X given Y . From the Bayesian perspective, the goal is to compute/approximate/sample from the posterior Function L, transition probabilities Q and initial distribution ν are assumed to be known. We consider two typical forms of noisy observation. In the first part of our paper we assume that the trajectory X([t min , t max ]) is observed independently at k deterministic time points with some random errors. Formally, we observe for some fixed known points t min ≤ t obs 1 < · · · < t obs k ≤ t max . Another type of evidence is considered later in Section 5, in the context of CTBNs. In Remarks 8 and 14 we mention some alternative assumptions about the form of evidence.
The obvious standing assumption in our paper is that L(Y |X) > 0 happens with nonzero probability if X is given by ν and Q. It means that the hidden MJP under consideration is "possible", i.e. the data do not contradict the probabilistic model.

Dependent thinning and Rao and Teh's algorithm
The so-called "dependent thinning" is a useful representation of a Markov jump process in terms of potential times of jumps and the corresponding states (Rao and Teh, 2012). The intensities are assumed to be uniformly bounded, so the process X has a finite number of jumps in the bounded interval [t min , t max ]. Every trajectory X([t min , t max ]) is right continuous and piecewise constant: X(t) = S i−1 for T i−1 ≤ t < T i , where random variables T i are such that t min < T 1 < · · · < T N < t max < T N +1 . By convention, T 0 = t min . Random sequence of states S = (S 0 , S 1 , . . . , S N ) such that S i = X(T i ) is called a skeleton. We do not assume that S i−1 = S i , and therefore the two sequences represent the process X in a redundant way: many pairs (T, S) correspond to the same are moments of true jumps and T −J = T \ T J = (T i : i ∈ J) are virtual jumps. By a harmless abuse of notation, we identify increasing sequences of points in [t min , t max ] with finite sets. Note that the trajectory of X is uniquely defined by (T J , S J ). Let us write X ≡ (T J , S J ) and also use the notation J(X) = T J for the set of true jumps. The state-dependent thinning procedure taken from Rao and Teh (2012) is the following. We choose a function R(t; s) ≥ Q(t; s), interpreted as intensity of an inhomogeneous Poisson process depending on state s ∈ S. The first point of this Poisson process after time u, say W , has the probability density Let Sampling of (T, S) then proceeds as described in Algorithm 1.
Algorithm 1 Sampling by state-dependent thinning.
By (2) and (3), the joint probability distribution of (T, S) is the following.
The last part of the above expression is equal to P(T N +1 > t max |T N , S N ). The pair (T, S) produced by Algorithm 1 is a redundant representation of MJP X defined by ν and Q (probability distribution of X obtains if we "integrate out" virtual jumps). Rao and Teh (2012) exploit dependent thinning to construct a special version of a Gibbs sampler which converges to the posterior p(X|Y ). The key facts behind their algorithm are the following. First, given the trajectory X ≡ (T J , S J ) the conditional distribution of virtual jumps T −J is that of the inhomogeneous Poisson process with intensity R(t; X(t)) − Q(t; X(t)) ≥ 0. Second, this distribution does not change if we introduce the likelihood. Indeed, L(Y |X) = L(Y |T J , S J ), so Y and T −J are conditionally independent and thus p(T −J |T J , S J , Y ) = p(T −J |T J , S J ). Third, the conditional distribution p(S|T, Y ) is that of a hidden discrete time Markov chain and can be efficiently sampled from using the algorithm FFBS (Forward Filtering-Backward Sampling, Carter and Kohn (1994);Frühwirth-Schnatter (1994)). Indeed, from (4) and (1) it follows that where P i (s, s ′ ) = P (T i , s, s ′ ) is the stochastic matrix defined by (3) and Note that functions g i include not only the likelihood but also a part due to the prior distribution p(T, S). The Rao and Teh's algorithm generates a Markov chain X 0 , X 1 , . . . , X m , . . . (where X m = X m ([t min , t max ]) is a trajectory of a MJP), convergent to p(X|Y ). A single step, that is the rule of transition from X m−1 = X to X m = X ′ is described in Algorithm 2.
Algorithm 2 Single step of Rao and Teh's algorithm.
Convergence of the algorithm has been shown by its authors in Rao and Teh (2012). It follows from the fact that the chain has the stationary distribution p(X|Y ) and is irreducible and periodic, provided that R(t; s) > Q(t; s).

Main result
Let A be the transition kernel of the Markov chain X m generated by the Rao and Teh's algorithm. Let Π be the target distribution. (It is the posterior distribution of X given Y . In this paper we consider only Monte Carlo randomness, so Y is fixed and can be omitted in notation.) Theorem 1. Consider a hidden MJP in which the evidence is of the form (1). Assume that Then the chain X m produced by the Rao and Teh's Algorithm 2 is geometrically ergodic, i.e. there exist constant γ < 1 and function M such that for every initial trajectory X, We begin with some auxiliary results. In Lemmas 2 and 5 we consider an inhomogeneous Markov chain S 0 , S 1 , . . . , S n on a finite state space S with the joint probability distribution given by where P i are stochastic matrices and g i are non-negative functions (formula (5) shows that the conditional distribution of skeleton is of this form). Assuming that 0 < n 0 ≤ i < n, we define Lemma 2. Assume that 1. for some ξ > 0 inequality P i−n0:i (s i−n0 , s) ≥ ξ holds for all s i−n0 , s ∈ S, 2. for some η > 0 inequality P i+1 (s, s) ≥ η holds for all s ∈ S, 3. for some g min l and g max l we have g min Then Proof. We condition additionally on S i−n0 = s i−n0 and use two-sided Markov property imsart-bj ver. 2011/11/15 file: GenErgodicity_RaoTeh_BJ.tex date: March 1, 2018 to obtain Finally let us remark that conclusion of the the lemma remains trivially true if g min l = 0 for some l ∈ [i − n 0 + 1 : i] and thus δ i = 0.
Remark 3. In Lemma 2 we bound from below the backward transition probability used by the FFBS algorithm. However, the identical inequality is true also for the forward transition probability P(S i = s|S i−1 = s).
Remark 4. In the time-homogeneous case, when P i = P , the first assumption of Lemma 2 is essentially equivalent to irreducibility and aperiodicity of matrix P . Note also that the two constants ξ and η play different roles in Rao and Teh's algorithm.

Lemma 5. Let the assumptions of Lemma 2 hold for all
Proof. Note that |J| = 1 + n−1 i=0 I(S i = S i+1 ). We apply Lemma 2 to each i ∈ [n 0 : n − 1] to obtain P( In the next proposition we establish a geometric drift condition for the Markov chain X 0 , X 1 , . . . , X m , . . .. Consider a single step, that is transition from X m−1 = X to X m = X ′ . The dependence on the input trajectory X (and also on Y ) is implicitly assumed but indicated only when necessary. Recall that |J(X)| is the number of true jumps of the trajectory X([t min , t max ]).
Proposition 6 (Drift Condition). Under the assumptions of Theorem 1, there exist δ > 0 and c < ∞ such that in a single step of the Rao and Teh's algorithm Proof. Let us analyse what happens in both two stages (V) and (S) of Algorithm 2. The initial X is fixed. In stage (V) we add a new set V of potential jumps. Since |V | has the Poisson distribution with intensity (5) shows that conditionally, for fixed T ′ , sampling of the new skeleton S ′ fulfils the conditions of Lemma 5, with n+1 = |T ′ |. Indeed, in view of (3), Assumption 1 of Theorem 1 entails Condition 1 of Lemma 2, at least for sufficiently large n 0 and all i ≥ n 0 . Indeed, we can choose ξ and n 0 are such that (P min ) n0 (s, s ′ ) ≥ ξ, where P min is the stochastic matrix with off-diagonal elements P min (s, s ′ ) = Q min (s, s ′ )/r max . Assumption 2 of Theorem 1 entails directly Condition 2 of Lemma 2. Moreover, the formula (6) for g i (s) includes the "likelihood factor" L j (Y j |s) for at most k indices i, simply because there are k points t obs j . For the remaining n − k indices, by (6), we have two-sided bounds g min ≤ g i (s) ≤ g max , where g min and g max do not depend on i and g min /g max > 0. Indeed, we can choose g min = q min exp[−r max (t max − t min )] and g max = r max , where q min = min s Q min (s) = min s s ′ =s Q min (s, s ′ ). Assumption 1 of Theorem 1 entails R(t; s) ≥ q min > 0. Consequently, in the conclusion of Lemma 5 we have for at at least n − (k + 1)n 0 indices, with n + 1 = |T ′ | and fixed (k + 1)n 0 . For the remaining indices we put δ i = 0 and thus obtain Consequently, The conclusion of the proposition follows.
Recall that A denotes the transition kernel of the Markov chain defined via Algorithm 2. Φ is a called a regeneration measure.
Proof. The scheme of our proof is the following. We will define a sequence of states s * = (s * 0 , ..., s * n ) and a sequence of times t * = (t * 0 , ..., t * n ). Both these sequences are deterministic and fixed. The regeneration measure Φ(dX ′ ) is described in terms of s * and t * as follows: Trajectory X ′ is determined by (T ′ , S ′ ) as described in Section 1. Note that the skeleton S ′ is deterministic and random vector (T ′ 1 , . . . , T ′ n ) has the uniform distribution on the . . , n} . We will show that Algorithm 2 can be equivalently executed in such a way that the resulting X ′ is distributed according to Φ with probability at least β > 0, provided that |J(X)| ≤ h (β must not depend on X; it will be defined in the course of our proof). Now we proceed to details of our construction. To define s * and t * , let us first choose a sequence s † Now we are going to use Assumptions 1 and 3 of Theorem 1. By irreducibility of matrix Q min we can embed s † in a skeleton s * , which has probability bounded below for the chain with transition matrices P i (whatever the choice of the times of jumps, on which these matrices depend). Put differently, we define a sequence s * = (s * 0 , ..., s * n ) for some n ≥ k such that s † is a subsequence of s * , s * i−1 = s * i and, uniformly in t, we have To get a sequence of times "compatible with" the skeleton s * , we embed the sequence t obs = (t obs 1 , . . . , t obs k ) in a longer sequence t * = (t * 0 , t * 1 , . . . , t * n ). More precisely, we choose a sequence t min = t * 0 < t * 1 < · · · < t * n < t max such that s * ij = s † j implies t * ij = t obs j for j = 1, . . . , k.
Fix X with |J(X)| ≤ h. We are going to describe a special way in which Algorithm 2 can be executed. Note that Assumptions 2 and 1 of Theorem 1 ensure that R(t; s) − Q(t; s) ≥ ηq min =: ǫ > 0.
In stage (V) we can independently sample two Poisson processes on the interval [t min , t max ], say V 0 and V rest , with intensities ǫ and R(t; s) − Q(t; s) − ǫ, respectively. Next let Moreover, since V rest is a Poisson process with intensity bounded by r max we have In stage (S) of the Algorithm 2 we construct skeleton S ′ . Although the actual sampling from p(S ′ |T ′ , Y ) is by FFBS, an equivalent result can be obtained via rejection sampling. We have The rejection sampling proceeds as follows.
(Of course the rejection method is highly inefficient and is considered only to clarify presentation.) We consider the following random events E i : • E 1 : in stage (S1) all points belonging to J(X) are changed to virtual jumps, while jumps at V 0 form the skeleton s * . • E 2 : in stage (S2) we accept the skeleton obtained in stage (S1).
We can see that • E 0 happens with probability at least β 0 β rest .
(Of course, all the probabilities in the above statements are conditional on X.) Putting everything together, E := E 0 ∩E 1 ∩E 2 happens with probability at least β := β 0 β rest β 1 β 2 > 0. If E happens then the output X ′ of Algorithm 2 is independent of the input X and has the probability distribution Φ(dX ′ ) described in the beginning of this proof.
Remark 8. Note that Theorem 1 holds also for other observation models than given by (1). For example, we could consider the observed object Y of quite general nature but assume that L(Y |X) ≥ ε > 0. The proofs of the drift condition and especially of the small set condition would be then much simpler. However, assumption (1) does not imply L(Y |X) ≥ ε > 0, and we think that this latter condition is less realistic in applications.

Continuous Time Bayesian Networks
Let (N , K) denote a directed graph with possible cycles. We write w → u instead of (w, u) ∈ K. For every node w ∈ N consider a corresponding space S w of possible states.
Assume that each space S w is finite. We consider a continuous time homogeneous Markov process on the product space S = w∈N S w . Thus a state s ∈ S is a configuration s = (s w ) = (s w ) w∈N , where s w ∈ S w . If W ⊆ N then we write s W = (s w ) w∈W for configuration s restricted to nodes in W. We also use notation S W = w∈W S w , so that we can write s W ∈ S W . The set W \ {w} will be denoted by W − w and N \ {w} simply by −w. We define the set of parents of node w by pa(w) = {u ∈ N : u → w} , and we define the set of children of node w by ch(w) = {u ∈ N : w → u} .

Suppose we have a family of functions
For fixed c ∈ S pa(w) , we consider Q w (c; ·, · ) as a conditional intensity matrix (CIM) at node w (only off-diagonal elements of this matrix have to be specified, the diagonal ones are irrelevant). The state of a CTBN at time t is a random element X(t) of the space S of configurations. Let X w (t) denote its wth coordinate. The process {X w (t) w∈N , t ≥ 0} is assumed to be Markov and its evolution can be described informally as follows. Transitions at node w depend on the current configuration of the parent nodes. If the state of some parent changes, then node w switches to other transition probabilities. Formally, CTBN is a time-homogeneous MJP with transition intensities given by for s = s ′ . Define also Q w (c; s) = s ′ =s Q w (c; s, s ′ ) for c ∈ S pa(w) , s ∈ S w . For a CTBN, the density of sample path X = X([t min , t max ]) in a bounded time interval [t min , t max ] decomposes as follows: where ν is the initial distribution on X and p(X w X pa(w) ) is the density of piecewise homogeneous MJP with intensity matrix equal to Q w (c; ·, · ) in every time sub-interval such that X pa(w) = c. Formulas for the density of CTBN appear e.g. in Nodelman, Shelton and Koller (2002b, Sec. 3.1), Fan, Xu and Shelton (2010, Eq. 2), Fan and Shelton (2008, Eq. 1) and Miasojedow et al. (2014). Our notation p(X w X pa(w) ) is consistent with the notion of "conditioning by intervention", see e.g. Lauritzen (2001). Indeed, p(X w X pa(w) ) is the density of the process X w at note w under the assumption that the sample paths at the parent nodes X pa(w) are fixed and X w (t min ) is given, see e.g. Miasojedow et al. (2014), for details. Below we explicitly write an expression for p(X w X pa(w) ). We need the following notations: Let ι X w (c; s, s ′ ) denote the number of jumps from s ∈ S w to s ′ ∈ S w at node w, which occurred when the parent configuration was c ∈ S pa(w) .
Let τ X w (c; s) be the length of time that node w spent in state s ∈ S w when the parent configuration was c ∈ S pa(w) .
With these notations we can write Let us also write p jump (X w X pa(w) ) and p stay (X w X pa(w) ) for the first and second expression in (10), respectively, to facilitate future references. The problem of probabilistic reasoning for a CTBN can be formulated as follows. Assume that the available evidence is the complete observation of some nodes, say X O ([t min , t max ]) for some set O ⊂ N . We are to compute the posterior distribution over unobserved nodes, i.e. on the trajectories of X W ([t min , t max ]), where W = N \ O. The basic idea, proposed in (Rao and Teh, 2013) is to use "Algorithm 2 within Gibbs sampler". Let us fix a node w ∈ W. By (9), the full conditional distribution is the following.
The density ν(X w (t min )|X −w (t min ))p(X w X pa(w) ) corresponds to a piecewise homogeneous Markov process and can be treated as the prior distribution in Algorithm 2. The expression u∈ch(w) p(X u X pa(u) ) can be treated as likelihood. In Algorithm 2 we can use ,,instrumental" intensities R w (t; s) different for different nodes w, and possibly time-inhomogeneous (in practical implementations however, R w will usually be timehomogeneous).
Algorithm 3 Random scan Gibbs sampler for CTBN.
input: previous trajectory X W and evidence X O .
Choose at random w ∈ W (according to some probability distribution on W).
Update Xw to X ′ w . Apply Algorithm 2 with the target distribution p(Xw|X −w ) given by (11) { X −w includes all nodes u ∈ O as well as u ∈ W, u = w }.
Theorem 9. Consider a CTBN in which we observe trajectories X O ([t min , t max ]). Assume that 1. for every w ∈ W = N \ O there exists an irreducible matrix Q min w such that Q min w (s, s ′ ) ≤ Q w (c; s, s ′ ) for all s, s ′ ∈ S w , s = s ′ , c ∈ S pa(w) , 2. there exists η > 0 such that Q w (c; s)/R w (t; s) ≤ 1 − η for all w ∈ W, s ∈ S w , c ∈ S pa(w) and t ∈ [t min , t max ], 3. there exists r max < ∞ such that R w (t; s) ≤ r max for all w ∈ W, s ∈ S w , t ∈ [t min , t max ], 4. for every w ∈ N , the supports of Q w (c; ·, ·) do not depend on the parent configuration c ∈ S pa(w) , i.e. Q w (c; s, s ′ ) > 0 implies Q w (c ′ ; s, s ′ ) > 0.
Then the chain (X W ) m produced by Algorithm 3 is geometrically ergodic.
As in Section 4, we will show a drift condition towards a small set. The Lyapunov function in the CTBN setting will be the global number of jumps |J(X W )| = w∈W |J(X w )|. Let us begin with an elementary fact, needed in the proof of the drift condition.
Lemma 10. Let 0 ≤ ̺ < 1. For any x 1 , . . . , x n such that Proof. Without loss of generality we can assume that n i=1 x i = nb. Consider the following constrained optimisation problem: and its partial derivatives are Therefore the minimizer is x 1 = x 2 = · · · = x n = b and the conclusion follows.
Proof. For a given X W , there always exists a node w ∈ W such that |J(X w )| ≥ |J(X W )|/|W|, e.g. a node with maximum number of jumps. From now on, node w is fixed. We will prove that for some c 0 and ε 0 > 0, This is a "local version" of the drift condition. The conclusion of the proposition will easily follow from (12). Indeed, for every node u we have E(|J(X ′ u )||X, node u is updated) ≤ |J(X u )| + µ, with µ := r max (t max − t min ), as in the proof of Proposition 6. Let p min denote the minimum probability of selecting a node for update in Algorithm 3. The singled out node w is chosen with probability at least p min . Therefore which is the desired conclusion. It remains to show (12). Since node w is fixed, we will omit subscript w in notation whenever the context permits. The reasoning leading to (12) is similar to the proof of Proposition 6, but more delicate due to a different form of the likelihood. The role of observed Y is now played by X −w . We can write the conditional distribution of the skeleton S (at node w) given times of possible jumps T (at node w) in the same form as (5), namely where N + 1 = |T |,ν(X w (t min )) := ν(X w (t min )|X −w (t min )) and P i s are defined by P i (s, s ′ ) = P (T i , s, s ′ ), as in (5), with (The "prior distribution" at node w is that of a piecewise-homogeneous MJP). However, the expressions for the g i s are different than (6). We now have where X (i) = X([T i−1 , T i )) is the process restricted to an inter-jump-at-w interval and X (i,w,s) pa(u) denotes the trajectories of X parts p(· ·) in equation (15) can be decomposed into p(· ·) = p jump (· ·) × p stay (· ·), c.f. (10). If we write then h i is easy to bound (at least qualitatively), because by (10) we have The part of h i which corresponds to the prior can be bounded analogously as in the proof of Proposition 6 and thus we obtain with q min = min w min s Q min w (s), c.f. Assumption 1 of Theorem 9. We are now left with a task of bounding the expression with p jump . This is more difficult, because this part of the likelihood depends on |J(X W )|. By Assumptions 2 and 3 of Theorem 9, we have Q u (c; s, s ′ ) ≤ r max for all s = s ′ , s, s ′ ∈ S u , c ∈ S pa(u) , for all u.
To obtain a lower bound, we define q min + as the minimum of nonzero values of Q u (c; s, s ′ ), s = s ′ , s, s ′ ∈ S u , c ∈ S pa(u) , all u. From (10) it follows that Note that Assumption 4 of Theorem 9 is needed to justify the lower bound above. Indeed, under the obvious assumption that X is possible, i.e. p(X) > 0, the jumps of X (i) u are possible under the configuration X (i) pa(u) . By Assumption 4 of Theorem 9, they must be possible also under the configuration X (i,w,s) pa(u) . Combining (16) and (17) we obtain for some constants a 1 , a 2 and ̺ 1 , ̺ 2 which depend only on the parameters of the network and on the instrumental intensity R (they depend neither on w nor on i). Of course, |J(X Assume that stage (V) of Algorithm 2 has been completed, resulting in a new set T ′ of potential times of jumps at node w. Just as in the proof of Proposition 6, we infer that E(|T ′ ||X, w is updated) ≤ |J(X w )| + µ, where µ = r max (t max − t min ). Now consider stage (S) of Algorithm 2. After sampling a new skeleton S ′ at node w, the set T ′ is "thinned" of to the set of true jumps J(X ′ w ). We are to bound |J(X ′ w )| from above. We are going to apply Lemmas 2 and 5 to the skeleton chain S ′ at node w which has the probability distribution p(S ′ |T ′ , X −w ) given by (13). We now decompose the process X into inter-jump parts according to T ′ and use the notation X (i) := X([T ′ i−1 , T ′ i )). From (18) we infer that the conclusion of Lemma 2 holds with for some constants a, ̺, n 0 and for i ≥ n 0 . Indeed, we can choose n 0 and ξ such that (P min w ) n0 (s, s ′ ) ≥ ξ, where P min w is the stochastic matrix with off-diagonal elements P min w (s, s ′ ) = Q min w (s, s ′ )/r max , c.f. Assumption 1 of Theorem 9. Then put ̺ = (̺ 2 /̺ 1 ) n0 and a = (ã 2 /ã 1 )ξη/|S w |, where η appears in Assumption 2 of Theorem 9.
We are now prepared to use Lemma 10. To verify its assumption, it is necessary to bound the sum of the exponents in (19). Recall that J(X (i) u ) is the set of jumps at u in the time interval between consecutive potential jumps at w, i.e. [T ′ i−1 , T ′ i ). Therefore where the last inequality holds for |J(X w )| ≥ 2(n 0 + 1), Lemma 10 implies that Lemma 5 implies that It is now enough to use E(|T ′ ||X, w is updated) ≤ |J(X w )|+ µ to complete the proof.
Note that the evidence, i.e. the trajectory X O ([t min , t max ]) is considered as fixed. In particular the constants in Proposition 11 may depend on X O .
The following proposition is an analogue of Proposition 7 in the CTBN setting. Now A denotes the transition kernel of the Markov chain defined via Algorithm 3.
Proposition 12. The set {X W : |J(X W |)| ≤ h} is |W|-small for every h, i.e. there exists a probability measure Φ and a constant β > 0 such that Proof. In contrast to the drift condition, the proof of the small set condition is easier in the present setting. Under the assumptions of Theorem 9, for the regeneration measure Φ we can take the measure concentrated at a deterministic, constant trajectory s * W = (s * u , u ∈ W) ∈ S W . The only requirement is thatν(s * W ) > 0, whereν is the posterior initial distribution of X W (t min ), given X O (t min ). We are to bound from below the probability that X ′ W (t) = s * W , for t ∈ [t min , t max ], where X ′ W is the result of |W| steps of Algorithm 3, starting from an arbitrary X W such that |J(X W ) ≤ h.
With probability at least (p min ) |W| , Algorithm 3 in |W| steps will visit and update all nodes belonging to W. Let us now consider a single step, in which w ∈ W is updated via Algorithm 2. In the rest of the proof w is arbitrary but fixed. (T ′ , S ′ ) denote the times of potential jumps and the skeleton of X ′ w . Since Assumptions 1, 2 and 3 of Theorem 9 are, for a fixed w, essentially the same as Assumptions 1, 2 and 3 of Theorem 1, the reasoning is similar as in the proof of Proposition 7. We assume that stage (S) of Algorithm 2 is executed in a way described in that proof, via rejection sampling. We consider the following random events E i : • E 0 : in stage (V) we obtain V = ∅ so T ′ = J(X w ).
• E 1 : in stage (S1) all points belonging to J(X w ) are changed to virtual jumps. • E 2 : in stage (S2) we accept the skeleton obtained in stage (S1).
It is easy to obtain the following lower bounds.
Of course, if E 0 ∩ E 1 ∩ E 2 happens then X ′ w (t) = s * w for t ∈ [t min , t max ]. Putting everything together, we get X ′ w ([t min , t max ]) = s * w with probability at least β := p min β 0 β 1 β 2 > 0. In |W| steps we get X ′ W ([t min , t max ]) = s * W with probability at leastν(s * W )β |W| . The proof is complete.
Remark 13. In this paper the focus is on qualitative results. The constants in our bounds are chosen in a way which makes presentation clearer, and we did not attempt to optimize them.
Remark 14. For clarity of presentation we have proved Theorem 9 under the assumption that some nodes of CTBN are fully observed. However, by minor modification of the proofs we can establish geometric ergodicity of the Rao and Teh's algorithm in a more general case. Our results remain true if we assume that some nodes are only partially observed with random noise at discrete moments, just as in (1). Clearly, for a drift condition, the likelihood part can be treated in the same way as in proof of Proposition 6. For a small set condition, we can repeat the construction from the proof of Proposition 7 for every node w ∈ N , and then define the regeneration measure for whole network as a product of regeneration measures for single nodes. The proofs of the key propositions in the more general case are not essentially different but become notationally complicated and awkward. For this reason they are omitted.