Relative entropy and waiting times for continuous-time Markov processes

For discrete-time stochastic processes, there is a close connection between return (resp. waiting) times and entropy (resp. relative entropy). Such a connection cannot be straightforwardly extended to the continuous-time setting. Contrary to the discrete-time case, one needs a reference measure on path space, and so the natural object is relative entropy rather than entropy. In this paper we elaborate on this in the case of continuous-time Markov processes with finite state space. A reference measure of special interest is the one associated to the time-reversed process; in that case relative entropy is interpreted as the entropy production rate. The main results of this paper are: almost-sure convergence to relative entropy of suitably normalized logarithms of waiting-time ratios, and their fluctuation properties (central limit theorem and large deviation principle).


Introduction
Many limit theorems in the theory of stochastic processes have a version for discrete-time as well as for continuous-time processes. The ergodic theory of Markov chains e.g. is more or less identical in discrete and in continuous time. The same holds for the ergodic theorem, martingale convergence theorems, central limit theorems and large deviations for additive functionals, etc. Usually, one obtains the same results with some additional effort in the continuous-time setting, where e.g. extra measurability issues can pop up.
For discrete-time ergodic processes, there is a famous theorem by Ornstein and Weiss connecting return times and entropy (see (9), (14), (11)). In words, it states that the logarithm of the first time the process repeats its first n symbols typically behaves like n times the entropy of the process. This provides a way to estimate the entropy of a process by observing a single trajectory. This result seems a natural candidate to transport to a continuous-time setting. The relation between entropy and return times is sufficiently intuitive so that one would not expect major obstacles on the road toward such a result for continuous-time ergodic processes. There is however one serious problem. On the path space of continuous-time processes (on a finite state space, say), there is no natural flat measure. In the discrete-time and finite state space setting, one cannot distinguish between entropy of a process and relative entropy between the process and the uniform measure on trajectories. These only differ by a constant (i.e., a quantity not depending on the process but only on the cardinality of the state space of the process). As we shall see below, the difference between relative entropy and entropy starts to play an important role in turning to continuous-time processes. In fact, it turns out that there is no naive continuous-time analogue of the relation between return times and entropy; the logarithm of return times turns out to have no suitable way of being normalized, even for very simple processes in continuous time such as Markov chains. To circumvent this drawback, we propose here to consider the logarithm of waiting time ratios and relate them to relative entropy.
Of course, defining a return time is already a problem if we are working in continuous time: a process never exactly reproduces a piece of trajectory. Our approach with respect to this problem is to first discretize time, and show that the relation between waiting times and relative entropy persists in the limit of vanishing discrete time-step. From the physical point of view, this is a natural procedure, and the time-step of the discretization can be associated to the acquisition frequency of the device one uses to sample the process. We can also think of numerical simulations for which the discretization of time is unavoidable. Of course, the natural issue is to verify if the results obtained with the discretized process give the correct ones for the original process, after a suitable rescaling and by letting the time-step go to zero. This will be done in the present context.
In this paper, we will restrict ourselves to continuous-time Markov chains with finite state space, for the sake of simplicity and also because the aforementioned problem already appears in this special, yet fundamental, setting. The main body of this paper consists of: a law of large numbers for the logarithm of waiting-time ratios and its connection to relative entropy, a large deviation result, and a central limit theorem for the same quantities. One possible application is the estimation of the relative entropy density between the forward and the backward process, which is physically interpreted as the mean entropy production, and which vanishes if and only if the process is reversible (i.e., in "detailed balance", or in "equilibrium").
Our paper is organized as follows. In Section 2, we show why the naive generalization of the Ornstein-Weiss theorem fails. Section 3 contains the main results about law of large numbers, large deviations and central limit theorem for the logarithm of ratios of waiting times. In the final section we consider the problem of "shadowing" a given continuous-time trajectory drawn from an ergodic distribution on path space.

Naive approach
In this section we introduce some basic notation and start with an informal discussion motivating the quantities which we will consider in the sequel. Let {X_t : t ≥ 0} be a continuous-time Markov chain on a finite state space A, with stationary measure μ, and with generator

Lf(x) = Σ_{y∈A} c(x) p(x, y) (f(y) − f(x)),

where p(x, y) are the transition probabilities of a discrete-time irreducible Markov chain on A, with p(x, x) = 0, and where the escape rates c(x) are strictly positive.
Given a time-step δ > 0, we introduce the "δ-discretization" {X_{iδ} : i = 0, 1, 2, . . .}, which is then a discrete-time irreducible Markov chain. Next we define the first time the δ-discretized process repeats its first n symbols via the random variable

R_n^δ(X) = inf{ k ≥ 1 : (X_{(k+1)δ}, . . . , X_{(k+n)δ}) = (X_δ, . . . , X_{nδ}) }.

For x_1, . . . , x_n ∈ A we denote by x_1^n the set of those discrete-time trajectories ω ∈ A^N such that ω_1 = x_1, . . . , ω_n = x_n. We denote by P^δ the probability measure on A^N given by the joint distribution of {X_{nδ} : n ∈ N}, starting from the stationary measure μ, i.e., with X_0 distributed according to μ. By P^δ(X_1^n) we denote the probability of the random n-cylinder given by the first n symbols of the discretized Markov chain. The analogue of the Ornstein-Weiss theorem (see (9), (11)) for this continuous-time process would be a limit theorem for a suitably normalized version of log R_n^δ for n → ∞, δ → 0. However, for δ > 0 fixed, the δ-discretization {X_0, X_δ, X_{2δ}, . . .} is a discrete-time ergodic Markov chain for which we can apply the Ornstein-Weiss theorem. Moreover, in combination with the Shannon-McMillan-Breiman theorem we can write

lim_{n→∞} (1/n) log ( R_n^δ(X) P^δ(X_1^n) ) = 0   a.s.
Using the fact that X^δ is an ergodic Markov chain we obtain

(1/n) log R_n^δ(X) = −(1/n) log P^δ(X_1^n) + o(1) = −E[ log p_X^δ(X_0, X_δ) ] + o(1),

where by o(1) we mean a random variable converging to zero almost surely as n → ∞, where E denotes expectation in the Markov chain started from its stationary distribution, and where p_X^δ denotes the transition probability of the δ-discretized Markov chain {X_{iδ} : i = 0, 1, 2, . . .}, i.e., p_X^δ(x, y) = e^{δL}(x, y). Expanding for small δ,

−E[ log p_X^δ(X_0, X_δ) ] = −Σ_x μ(x)(1 − δc(x)) log(1 − δc(x)) − Σ_x Σ_{y≠x} μ(x) δ c(x) p(x, y) log( δ c(x) p(x, y) ) + O(δ²).

In the rhs, the first sum is of order δ whereas the second one is of order δ log δ. Therefore, this expression does not seem to have a natural way to be normalized. This is a typical phenomenon for continuous-time processes: we will need a suitable reference process in order to define "entropy" as "relative entropy" with respect to this reference process. In the discrete-time context this reference measure is obviously the uniform measure on trajectories. Instead of considering return times, as we will see in the next sections, by considering differences of waiting times one is able to cancel the δ log δ term and obtain expressions that do converge, in the limit δ ↓ 0, to relative entropy.
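To see the δ versus δ log δ orders concretely, one can compute the entropy rate of the δ-discretized chain for a small example and split it into its diagonal and off-diagonal parts. The rates and jump matrix below are illustrative assumptions, not taken from the text:

```python
import numpy as np
from scipy.linalg import expm

# Illustrative 3-state chain (assumed rates): generator L(x,y) = c(x)p(x,y)
# for y != x, L(x,x) = -c(x); the delta-discretization has matrix e^{delta L}.
c = np.array([1.0, 2.0, 0.5])
p = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])
L = c[:, None] * p - np.diag(c)

# Stationary distribution mu: left null vector of the generator.
vals, vecs = np.linalg.eig(L.T)
mu = np.real(vecs[:, np.argmin(np.abs(vals))])
mu /= mu.sum()

def entropy_parts(delta):
    """Split -E[log p^delta(X_0, X_delta)] into diagonal and off-diagonal sums."""
    P = expm(delta * L)                          # rows sum to 1
    diag = -np.sum(mu * np.diag(P) * np.log(np.diag(P)))
    off = P.copy()
    np.fill_diagonal(off, 1.0)                   # log 1 = 0 removes the diagonal
    offdiag = -np.sum(mu[:, None] * off * np.log(off))
    return diag, offdiag

d1, o1 = entropy_parts(0.01)
d2, o2 = entropy_parts(0.001)

assert o1 > d1          # the delta*log(delta) part dominates for small delta
assert 8 < d1 / d2 < 12  # diagonal part is O(delta): shrinks ~10x with delta
assert 5 < o1 / o2 < 10  # off-diagonal part shrinks faster than delta alone
```

The diagonal part scales like δ while the off-diagonal part carries the extra log δ factor, which is why no single normalization of log R_n^δ works as δ ↓ 0.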

Main results: waiting times and relative entropy
We consider continuous-time Markov chains with a finite state-space A. We will always work with irreducible Markov chains with a unique stationary distribution. The process is denoted by {X t : t ≥ 0}. The associated measure on path space starting from X 0 = x is denoted by P x and by P we denote the path space measure of the process started from its unique stationary distribution. For t ≥ 0, F t denotes the sigma-field generated by X s , s ≤ t, and P [0,t] denotes the measure P restricted to F t .

Relative entropy: comparing two Markov chains
Consider two continuous-time Markov chains, one denoted by {X_t : t ≥ 0} with generator

Lf(x) = Σ_{y∈A} c(x) p(x, y) (f(y) − f(x)),

and the other denoted by {Y_t : t ≥ 0} with generator

L̂f(x) = Σ_{y∈A} ĉ(x) p(x, y) (f(y) − f(x)),

where p(x, y) is the Markov transition function of an irreducible discrete-time Markov chain. We further assume that p(x, x) = 0, and c(x), ĉ(x) > 0 for all x ∈ A. We suppose that X_0, resp. Y_0, is distributed according to the unique stationary measure μ, resp. μ̂, so that both processes are stationary and ergodic.
Remark 1. The fact that the Markov transition function p(x, y) is the same for both processes is only for the sake of simplicity. All our results can be reformulated in the case where the Markov transition functions differ.
We recall Girsanov's formula (see Proposition 2.6 in Appendix 1 of (6) for a straightforward approach, or Section 19 of (7) for more background on exponential martingales in the context of point processes):

log ( dP_[0,t] / dP̂_[0,t] )(ω) = ∫_0^t log( c(ω_{s−}) / ĉ(ω_{s−}) ) dN_s(ω) − ∫_0^t ( c(ω_s) − ĉ(ω_s) ) ds,   (3.3)

where N_s(ω) is the number of jumps of the path ω up to time s. The relative entropy of P w.r.t. P̂ up to time t is defined as

H_t(P|P̂) := E[ log ( dP_[0,t] / dP̂_[0,t] ) ],

where E denotes expectation with respect to P. Using (3.3), stationarity, and the fact that N_t − ∫_0^t c(X_s) ds is a (mean zero) martingale, we have that for every function ϕ on A,

E[ ∫_0^t ϕ(X_{s−}) dN_s ] = E[ ∫_0^t ϕ(X_s) c(X_s) ds ] = t Σ_x μ(x) c(x) ϕ(x).

Therefore, returning to (3.3) we obtain

H_t(P|P̂) = t s(P|P̂),

where

s(P|P̂) = Σ_x μ(x) [ c(x) log( c(x)/ĉ(x) ) − c(x) + ĉ(x) ]   (3.6)

is called the relative entropy per unit time (or relative entropy density) of P with respect to P̂. Notice also that, by the assumed ergodicity of the Markov process X,

lim_{t→∞} (1/t) log ( dP_[0,t] / dP̂_[0,t] ) = s(P|P̂),   P-a.s.

A case of special interest is the one where {Y_t : t ≥ 0} is the time reversal of {X_t : t ≥ 0}, i.e., the Markov chain with generator

L̂f(x) = Σ_{y∈A} ( μ(y) c(y) p(y, x) / μ(x) ) (f(y) − f(x)).   (3.9)

This is a Markov chain with transition rates ĉ(x)p̂(x, y) = μ(y)c(y)p(y, x)/μ(x), which coincide with c(x)p(x, y) in the case where μ is a reversible measure for the Markov process X, i.e., when μ(x)c(x)p(x, y) = μ(y)c(y)p(y, x) for all x, y. For the choice (3.9), the random variable log(dP_[0,t]/dP̂_[0,t]) has the interpretation of "entropy production", and the relative entropy density s(P|P̂) has the interpretation of "mean entropy production per unit time"; see e.g. (5; 8).
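The relative entropy density of two such chains is a finite sum and can be evaluated directly. The sketch below uses the standard formula for two jump processes sharing the jump matrix p(x, y), matching the simplification of Remark 1; all numbers are illustrative. It checks nonnegativity and that the density vanishes when the two processes coincide:

```python
import numpy as np

def stationary(L):
    """Stationary distribution: left null vector of the generator matrix."""
    vals, vecs = np.linalg.eig(L.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals))])
    return mu / mu.sum()

def rel_entropy_density(c, c_hat, p):
    """s = sum_x mu(x) [ c(x) log(c(x)/c_hat(x)) - c(x) + c_hat(x) ],
    the standard relative entropy rate when both chains share p(x, y)."""
    L = c[:, None] * p - np.diag(c)
    mu = stationary(L)
    return float(np.dot(mu, c * np.log(c / c_hat) - c + c_hat))

# Illustrative 3-state example (assumed rates, not from the paper).
p = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
c = np.array([1.0, 2.0, 3.0])
c_hat = np.array([1.5, 1.0, 2.0])

s = rel_entropy_density(c, c_hat, p)
assert s >= 0.0                                    # relative entropy is nonnegative
assert abs(rel_entropy_density(c, c, p)) < 1e-12   # zero when the processes coincide
```

Each summand c log(c/ĉ) − c + ĉ is a Bregman-type divergence and is individually nonnegative, which is the pointwise reason for s(P|P̂) ≥ 0.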

Law of large numbers
For δ > 0, we define the discrete-time Markov chain X^δ := {X_0, X_δ, X_{2δ}, . . .}. This Markov chain has transition probabilities

p_X^δ(x, y) = e^{δL}(x, y) = (1I + δL)(x, y) + O(δ²),

where 1I is the identity matrix. Similarly, we define the Markov chain Y^δ with transition probabilities

p_Y^δ(x, y) = e^{δL̂}(x, y) = (1I + δL̂)(x, y) + O(δ²).

The path-space measure (on A^N) of X^δ, resp. Y^δ, starting from the stationary measure, is denoted by P^δ, resp. P̂^δ. As before (see (2.2)), we will use the notation P^δ(X_1^n) for the probability of the random n-cylinder X_1^n in the discretized process. We define waiting times, which are random variables defined on A^N × A^N, by setting

W_n^δ(X|Y) = inf{ k ≥ 1 : (Y_{kδ}, . . . , Y_{(k+n−1)δ}) = (X_δ, . . . , X_{nδ}) },

where we make the convention inf ∅ = ∞. In words, this is the first time that, in the process Y^δ, the first n symbols of the process X^δ are observed. Similarly, if X′^δ is an independent copy of the process X^δ, we define W_n^δ(X|X′). In what follows, we will always choose the processes {Y_t : t ≥ 0} and {X_t : t ≥ 0}, as well as {X′_t : t ≥ 0} and {X_t : t ≥ 0}, to be independent. However, no independence of {Y_t : t ≥ 0} and {X′_t : t ≥ 0} will be required. The joint distribution on path space of these three processes (X, Y, X′) will be denoted by P_123. Correspondingly, the joint law of the δ-discretization of (X, Y, X′) will be denoted by P_123^δ. Our first result is a law of large numbers for the logarithm of the ratio of waiting times, in the limit n → ∞, δ ↓ 0.

Theorem 1. P_123-almost surely,

lim_{δ↓0} lim_{n→∞} (1/(nδ)) ( log W_n^δ(X|Y) − log W_n^δ(X|X′) ) = s(P|P̂).
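These waiting times are straightforward to sample. The following minimal sketch (an illustrative two-state chain; the Gillespie-style sampler and all parameters are our own assumptions, not part of the text) simulates two independent copies, discretizes them on the grid {0, δ, 2δ, . . .}, and computes W_n^δ:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_discretized(c, p, delta, steps, rng):
    """Sample a continuous-time chain (escape rates c, jump matrix p) at
    times 0, delta, 2*delta, ... via a Gillespie-type scheme."""
    x = 0
    jump_at = rng.exponential(1.0 / c[x])  # time of the next jump
    path = []
    for i in range(steps):
        grid = i * delta
        while jump_at <= grid:             # resolve all jumps before grid time
            x = int(rng.choice(len(c), p=p[x]))
            jump_at += rng.exponential(1.0 / c[x])
        path.append(x)
    return path

def waiting_time(pattern, path):
    """First k >= 1 such that pattern appears in path starting at slot k."""
    n = len(pattern)
    for k in range(len(path) - n + 1):
        if path[k:k + n] == pattern:
            return k + 1
    return None                            # not seen within the simulated horizon

c = np.array([1.0, 1.5])
p = np.array([[0.0, 1.0],
              [1.0, 0.0]])
delta = 0.5

X = simulate_discretized(c, p, delta, 2000, rng)
Y = simulate_discretized(c, p, delta, 2000, rng)

# Both chains start in state 0 in this sketch, so a 1-symbol pattern is found at once.
assert waiting_time(X[:1], Y) == 1
w3 = waiting_time(X[:3], Y)
assert w3 is None or w3 >= 1
```

In an actual estimation one would form (1/(nδ))(log W_n^δ(X|Y) − log W_n^δ(X|X′)) from such samples, with n large and δ small together as discussed below.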
Before proving this theorem, we state a theorem from (1) about the exponential approximation for the hitting-time law, which will be the crucial ingredient in the whole of this paper. For an n-block x_1^n := x_1, . . . , x_n ∈ A^n and a continuous-time trajectory ω ∈ A^[0,∞), we define the hitting time of the δ-discretization by

T_{x_1^n}^δ(ω) = inf{ k ≥ 1 : (ω_{kδ}, . . . , ω_{(k+n−1)δ}) = (x_1, . . . , x_n) }.   (3.18)

We then have the following result, see (1).
The constants appearing in Theorem 2 (except C) depend on δ; we will come back to this dependence later on.
With these ingredients we can now give the proof of Theorem 1.
Proof of Theorem 1. From Proposition 1 it follows that, for all δ > 0, P_123-almost surely,

(1/n) ( log W_n^δ(X|Y) − log W_n^δ(X|X′) ) = (1/n) log ( P^δ(X_1^n) / P̂^δ(X_1^n) ) + o(1).   (3.22)

By ergodicity of the continuous-time Markov chain {X_t : t ≥ 0}, the discretized Markov chains X^δ, Y^δ are also ergodic, and therefore we obtain

lim_{n→∞} (1/n) log ( P^δ(X_1^n) / P̂^δ(X_1^n) ) = E[ log ( p_X^δ(X_0, X_δ) / p_Y^δ(X_0, X_δ) ) ]   a.s.

Using (3.12), (3.13) and p(x, x) = 0, this gives

lim_{δ↓0} (1/δ) E[ log ( p_X^δ(X_0, X_δ) / p_Y^δ(X_0, X_δ) ) ] = Σ_x μ(x) [ c(x) log( c(x)/ĉ(x) ) − c(x) + ĉ(x) ] = s(P|P̂),

where in the last line we used the expression (3.6) for the relative entropy.
Let us now specify the dependence on δ of the various constants appearing in Theorem 2. For the lower bound on the parameter we have (see (1), Section 5) a bound in which C is a positive number independent of δ.
Here α(l) denotes the classical α-mixing coefficient where, for 0 ≤ m ≤ n < ∞, F_m^n denotes the sigma-field of subsets of A^N generated by X_i (m ≤ i ≤ n). By the assumed ergodicity of the continuous-time Markov chain, the generator L (resp. L̂) has the eigenvalue 0, while the maximum of the real parts of the other eigenvalues is strictly negative and denoted by −λ_1 < 0. One then has

α(l) ≤ exp(−λ_1 δ l).   (3.26)

Using (3.12), there exists λ_2 > 0 such that for k = 1, . . . , n/2. Therefore, there exists ĉ > 0 such that. Similarly, from the proof of Theorem 2.1 in (2) one easily obtains the dependence on δ of the constants appearing in the error term of (3.20).
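The spectral gap λ_1 entering the bound (3.26) can be read off from the eigenvalues of the generator; a small numerical sketch with illustrative rates:

```python
import numpy as np

# Illustrative 3-state generator L(x,y) = c(x)p(x,y), L(x,x) = -c(x).
c = np.array([1.0, 2.0, 0.5])
p = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])
L = c[:, None] * p - np.diag(c)

ev = np.linalg.eigvals(L)
nonzero = ev[np.abs(ev) > 1e-10]
lambda_1 = -np.max(nonzero.real)   # spectral gap: distance of the nonzero
                                   # eigenvalues' real parts from 0

assert np.min(np.abs(ev)) < 1e-10  # 0 is always an eigenvalue (rows of L sum to 0)
assert lambda_1 > 0                # irreducibility gives a strictly positive gap

# The mixing coefficient of the delta-discretization then decays as in (3.26):
delta, l = 0.1, 50
alpha_bound = np.exp(-lambda_1 * delta * l)
assert 0 < alpha_bound < 1
```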
In applications, e.g., in the estimation of the relative entropy from two given sample paths, one would like to choose the word-length n and the discretization δ = δ n together. This possibility is precisely provided by the estimates (3.27) and (3.28), as the following analogue of Proposition 1 shows.
Proof. The proof is analogous to the proof of Theorem 2.4 in (2). For the sake of completeness, we prove the upper bound of (3.30). We can assume that δ_n ≤ 1. By the exponential approximation (3.20) we have, for all t > 0, n ≥ 1, the estimates in (3.31). Choosing t = t_n = κ_2 (log n)/δ_n, with κ_2 > 0 large enough, makes the rhs of (3.31) summable, and hence a Borel-Cantelli argument gives the upper bound.
The use of Propositions 1 and 2 lies in the fact that the logarithm of the waiting time can be well approximated by the logarithm of the probability of an n-block under the measure P^δ or P̂^δ. By the Markov property, the logarithm of this probability is an ergodic sum, which is much easier to deal with from the point of view of obtaining a law of large numbers, a large deviation theorem, and central limit behavior.
Of course, whether Proposition 2 is still useful, i.e., whether it still gives the law of large numbers with δ = δ_n, depends on the behavior of the corresponding ergodic sums under the measure P^{δ_n}, i.e., on their behavior under P. This is made precise in the following theorem.

Proof. By Proposition 2 we can write the decomposition (3.34). Both sums on the right-hand side of (3.34) are of a common form in which C > 0 is some constant. Now, using ergodicity of the continuous-time Markov chain {X_t : t ≥ 0}, we have the estimate

Large deviations
In this subsection, we study the large deviations of the logarithm of the ratio of waiting times under the measure P_123. More precisely, we compute the scaled-cumulant generating function F_δ(p) in the limit δ → 0 and show that it coincides with the scaled-cumulant generating function of the logarithm of the Radon-Nikodym derivative dP_[0,t]/dP̂_[0,t] in the range p ∈ (−1, 1). As in the case of waiting times for discrete-time processes, see e.g. (3), the scaled-cumulant generating function is only finite in the interval (−1, 1). We introduce the function

E(p) := lim_{t→∞} (1/t) log E[ exp( p log( dP_[0,t] / dP̂_[0,t] ) ) ],

i.e., the large deviations of (1/t) log(dP_[0,t]/dP̂_[0,t]) as t → ∞ are governed by the entropy function obtained as the Legendre transform of E. We can now formulate our large deviation theorem.

Theorem 4.
a) For all p ∈ R and δ > 0 the limit

F_δ(p) := lim_{n→∞} (1/n) log E[ exp( p ( log W_n^δ(X|Y) − log W_n^δ(X|X′) ) ) ]

exists and is finite for p ∈ (−1, 1), whereas F_δ(p) = +∞ for |p| > 1.
b) Moreover, as δ → 0, we have, for all p ∈ (−1, 1):

lim_{δ↓0} (1/δ) F_δ(p) = E(p).

The following notion of logarithmic equivalence will be convenient later on.
Definition 1. Two non-negative sequences a_n, b_n are called logarithmically equivalent (notation a_n ≈ b_n) if

lim_{n→∞} (1/n) ( log a_n − log b_n ) = 0.
Proof. To prove Theorem 4, we start with the following lemma.
Lemma 1. For all δ > 0 and for |p| < 1, the moment generating function in a) of Theorem 4, computed with the waiting times, is logarithmically equivalent to the one computed with the corresponding cylinder probabilities.

Proof. The proof is similar to that of Theorem 3 in (3).
The random variables ξ_n, ζ_n have approximately an exponential distribution (in the sense of Theorem 2). Using this fact, we can repeat the arguments of the proof of Theorem 3 in (3) (which uses the exponential law with the error bound given by Theorem 2) to prove that for p ∈ (−1, 1) the bounds hold with constants C_1, C_2 that do not depend on n, whereas for |p| > 1 the corresponding expectations diverge. (Note that, up to corrections to the exponential law (details are spelled out in the proof of Theorem 3 in (3)), this simply comes from the fact that, for γ > 0, the integral ∫_0^∞ e^{γx} e^{−x} dx is convergent if and only if γ < 1.) Therefore, with the notation of Definition 1, we obtain the claimed equivalences for |p| < 1, and for |p| > 1 we obtain (3.46) from (3.47).
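The boundary exponents ±1 can be traced to moments of (approximately) exponential waiting times: for a standard exponential ξ, E[ξ^p] = Γ(1 + p), finite precisely for p > −1, and the reciprocal moment E[ξ^{−p}] = Γ(1 − p) is finite precisely for p < 1. A quick numerical sketch of this elementary fact (our gloss, not the paper's computation):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
xi = rng.exponential(1.0, size=200_000)  # standard exponential samples

# Monte Carlo check of E[xi^p] = Gamma(1 + p) for a p inside (-1, 1).
p_ = 0.5
mc = float(np.mean(xi ** p_))
assert abs(mc - math.gamma(1.0 + p_)) < 0.05

# At p = -1 the moment diverges: Gamma(0) is a pole of the Gamma function.
try:
    math.gamma(0.0)
    diverged = False
except ValueError:
    diverged = True
assert diverged
```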
This proves the existence of F δ (p). Indeed, the limit exists by standard large deviation theory of (discrete-time, finite state space) Markov chains (since δ > 0 is fixed), see e.g. (4), section 3.1.1.
In order to deal with the limit δ → 0 of F_δ(p), we expand the expression in the rhs of (3.45) up to order δ². This will give the result of the theorem by an application of Hölder's inequality, as is shown in Lemma 2 below. We first consider the difference A(n, δ). If there does not exist an interval [iδ, (i + 1)δ[, i ∈ {0, . . . , n − 1}, in which at least two jumps of the process {N_t : t ≥ 0} occur, then A(n, δ) = 0. Indeed, if there is no jump in [iδ, (i + 1)δ[, then the contributions of that interval to the integrals ∫ log( c(X_{s−}) / ĉ(X_{s−}) ) dN_s are zero, and if there is precisely one jump, then they are equal. Therefore, using the independent-increment property of the Poisson process and the strict positivity of the rates, we have the bound where the χ_i, i = 1, . . . , n, form a collection of independent Poisson random variables with parameter δ, and C is some positive constant. This gives (3.51). Next, we tackle the remaining term. If there is no jump in any of the intervals [iδ, (i + 1)δ[, this term is zero. Otherwise, the contribution of the interval [iδ, (i + 1)δ[ is bounded by Cδ. Therefore it is bounded by a sum in which the χ_i, i = 1, . . . , n, once more form a collection of independent Poisson random variables with parameter δ, and C is some positive constant. This gives (3.52). Hence, (3.50) follows by combining (3.51) and (3.52) and using the Cauchy-Schwarz inequality.
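The δ → 0 bounds above rest on the fact that a Poisson(δ) variable takes a value ≥ 2 with probability of order δ²; this is elementary to check:

```python
import math

# P(chi >= 2) for chi ~ Poisson(delta):
#   1 - P(chi = 0) - P(chi = 1) = 1 - e^{-delta}(1 + delta)
#                               = delta^2/2 + O(delta^3).
def p_two_or_more(delta):
    return 1.0 - math.exp(-delta) * (1.0 + delta)

for delta in (0.1, 0.01, 0.001):
    val = p_two_or_more(delta)
    # sandwiched between delta^2/4 and delta^2/2: genuinely O(delta^2)
    assert delta ** 2 / 4 < val < delta ** 2 / 2
```

Intervals [iδ, (i + 1)δ[ containing two or more jumps are thus rare enough that the corrections they generate vanish after division by δ.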
The following lemma referred to before is easy and standard, but we state and prove it here for the sake of completeness.
Lemma 2. Let X_n and Y_n be two sequences of (real-valued) random variables such that

lim_{n→∞} (1/a_n) log E e^{t(X_n − Y_n)} = 0   (3.54)

for all t ∈ R, and for some sequence of positive numbers a_n ↑ ∞. Suppose that for all t ∈ R,

F_X(t) = limsup_{n→∞} (1/a_n) log E e^{t X_n}   (3.55)

and

F_Y(t) = limsup_{n→∞} (1/a_n) log E e^{t Y_n}   (3.56)

are finite. Then for all t ∈ R,

F_X(t) = F_Y(t).   (3.57)

Proof. Put ε_n := X_n − Y_n. Then, by Hölder's inequality with conjugate exponents p, p′ > 1 (1/p + 1/p′ = 1),

E e^{t X_n} = E e^{t Y_n} e^{t ε_n} ≤ ( E e^{p t Y_n} )^{1/p} ( E e^{p′ t ε_n} )^{1/p′},

and the assumption (3.54) yields F_X(t) ≤ F_Y(pt)/p; interchanging the roles of X_n and Y_n gives F_Y(t) ≤ F_X(pt)/p. So we obtain these inequalities for all p, p′ > 1. Both functions F_X and F_Y are convex and hence continuous. Therefore the result of the lemma follows by taking the limits p, p′ → 1.
The following large deviation result for fixed δ > 0 is an application of Theorem 4 and (10).
where I δ is the Legendre transform of F δ .
Remark 3. In the case where {Y_t : t ≥ 0} is the time-reversed process of {X_t : t ≥ 0}, the scaled-cumulant generating function E(p) satisfies the so-called fluctuation-theorem symmetry. The large deviation result of Theorem 4 then gives that the entropy production estimated via waiting times of a discretized version of the process has the same symmetry in its scaled-cumulant generating function for p ∈ [0, 1].

Central limit theorem
Theorem 5. For all δ > 0, the suitably centered and √n-normalized logarithm of the ratio of waiting times converges in distribution to a normal law N(0, σ_δ²).

Proof. First we claim that, for all δ > 0, the logarithm of each waiting time may be replaced, up to an error that is negligible on the CLT scale, by the logarithm of the corresponding cylinder probability. This follows from the exponential law, as is shown in (3), proof of Theorem 2.

Equation (3.61) implies that a CLT for the logarithm of the waiting times is equivalent to a CLT for the logarithm of the corresponding cylinder probabilities, and that the variances of the asymptotic normals are equal. For δ > 0 fixed, the latter satisfies the CLT (since X^δ is a discrete-time ergodic Markov chain), so the only thing left is the claimed limiting behavior of the variance as δ → 0.
As in the proof of the large deviation theorem, we first expand up to order δ. It is then sufficient to verify that the error terms are negligible, which is a computation with Poisson random variables analogous to the one used in the proof of Theorem 4.

Shadowing a given trajectory
In the context of discrete-time processes, besides the study of return and waiting times, one is also interested in the hitting time of a given pattern. It turns out that if the process satisfies certain mixing conditions, then the exponential law for the hitting time of a pattern holds for all patterns, with a parameter that depends on self-repetitive properties of the pattern, see e.g. (1). This exponential law for all patterns can then be used in the context of waiting times, where one chooses a pattern "typical" for the measure P and looks for the hitting time of it in a random sequence with distribution P̂, or for return times, where the pattern consists of the first n symbols of the process. In the same spirit, one can consider a fixed trajectory and ask for the limiting behavior of the logarithm of the time that one has to wait before seeing this trajectory in a continuous-time process. As before, this question can only be asked properly by first discretizing, and then taking the limit of zero discretization step δ. Moreover, in the spirit of what preceded, one has to consider differences of logarithms of hitting times in order to find a properly normalizable quantity in the limit δ ↓ 0.
In the previous sections we required the processes X, Y, X′ to be Markov. To compare with the other sections, in this section we require Y and X to be Markov, while the pattern is supplied by a fixed trajectory γ. We recover in particular the analogue (of Theorem 1) law of large numbers (Theorem 6 below) if we let γ be distributed according to a stationary ergodic (but now not necessarily Markovian) process. Let γ ∈ D([0, ∞), A) be a given trajectory, with its associated jump process and, for a given δ > 0, the "jump times" of the δ-discretization of γ. For the Markov process {X_t : t ≥ 0} with generator

Lf(x) = Σ_{y∈A} c(x) p(x, y) (f(y) − f(x)),

define the hitting time

T_n^δ(γ|X) = inf{ k ≥ 0 : (X_{kδ}, . . . , X_{(k+n)δ}) = (γ_0, γ_δ, . . . , γ_{nδ}) }.
The presence of the log(δ) term in the rhs of (4.3) causes the same problem as we have encountered in Section 2. Therefore, we have to subtract another quantity such that the log(δ) term is canceled. In the spirit of what we did with the waiting times, we subtract log T_n^δ(γ|Y), where {Y_t : t ≥ 0} is another independent Markov process with generator

L̂f(x) = Σ_{y∈A} ĉ(x) p̂(x, y) (f(y) − f(x)).   (4.4)
We then have the following law of large numbers.

Theorem 6. Let P (resp. P̂) denote the stationary path-space measure of {X_t : t ≥ 0} (resp. {Y_t : t ≥ 0}), and let γ ∈ D([0, ∞), A) be a fixed trajectory. We then have, P ⊗ P̂-almost surely:

Proof. Using Proposition 4, we use the same proof as that of Theorem 1, and use that the sums in the rhs of (4.4) are, up to order δ², equal to the integrals appearing in the lhs of (4.6). The other assertions of the theorem follow from the ergodic theorem.
Remark 4. If we choose γ according to the path-space measure P, i.e., γ is a "typical" trajectory of the process {X_t : t ≥ 0}, and choose p(x, y) = p̂(x, y), then we recover the limit of the law of large numbers for waiting times (Theorem 1):

lim_{δ↓0} lim_{n→∞} (1/(nδ)) ( log T_n^δ(γ|Y) − log T_n^δ(γ|X) ) = s(P|P̂).